Andrew Dalke: Software mentioned in ChEMBL SD records

When I work with real-world data sets, I usually start with PubChem before going on to ChEMBL. Why? The PubChem data is all generated by one chemistry toolkit - I'm pretty sure it's OEChem - while ChEMBL data comes from many sources.

To get a sense of the diversity, I processed the ChEBML 21 release to get the second line of each record. The ctfile format specification says that if non-blank then the 8 characters starting in the third position should contain the program name. The documentation includes a description of the fields:

IIPPPPPPPPMMDDYYHHmmddSS…

where those fields are:

II: two characters for the user's initials
PPPPPPPP: eight characters for the program name
MMDDYY: two characters for the month, two for the day, two for the year
HHmm: two characters for the hour, two for the minutes
… additional fields omitted …

Here are a few examples:

  SciTegic12231509382D
  Mrv0541 05211314572D          
          08180812482D 1   1.00000     0.00000     0
  Symyx   11300911332D 1   1.00000     0.00000     0

The third program wishes to remain anonymous.

Extract program names from the ChEMBL 21 SDF

I'll write a program to extract those program names and count how many times each one occurs. I don't need a general-purpose SD file reader because ChEMBL uses a subset of the SD format. For example, there are only two lines in each record which start with "CHEMBL", the title line (the first line of the record) and the data line after the "chembl_id" tag.

My code can read line through the file. The first time it sees a "CHEMBL" line is the title line, so the following line (the second line) contains the data. Then when it sees "> <chembl_id>" it knows to skip the following line, which will have a CHEMBL on it.

There are two oddities here. First, the gzip reader returns byte strings. I decided to do the pattern matching on the byte strings to ignore the overhead of converting everything to Unicode when I only need a few characters from the file. Second, a Python file object is its own iterator, so I can use "infile" in both the for-loop and in the body itself.

import sys, gzip
from collections import Counter
counter = Counter()
with gzip.open("/Users/dalke/databases/chembl_21.sdf.gz", "rb") as infile:
  for line in infile:
      # Get the line after the title line (which starts with 'CHEMBL')
      if line[:6] == b"CHEMBL":
          next_line = next(infile)
          # Print unique program names
          program = next_line[2:10]
          if program not in counter:
              print("New:", repr(str(program, "ascii")))
          counter[program] += 1
      
      if line[:13] == b">":
          ignore = next(infile) # skip the CHEMBL
  
  print("Done. Here are the counts for each program seen.")
  for name, count in counter.most_common():
    print(repr(str(name, "ascii")), count)

Program counts

Here is the table of counts: #count_table { border-collapse: collapse; margin-left: 5em; } #count_table th { border-bottom: 1px solid black; } #count_table th:nth-child(2) { border-left: 1px solid grey; } #count_table tr td:nth-child(2) { border-left: 1px solid grey; text-align: right; } pre.code img { border: 1px dotted #084; }

Program name	count
'SciTegic'	1105617
' '	326445
'CDK 9'	&nbsp69145
'Mrv0541 '	30962
''	24531
'CDK 6'	10805
'CDK 1'	8642
'CDK 5'	4771
'CDK 8'	2209
'-ISIS- '	281
'Symyx '	171
'CDK 7'	144
'-OEChem-'	96
'Marvin '	61
'Mrv0540 '	13
' RDKit'	3
'CDK 3'	1

The number of different CDK lines is not because of a version number but because CDK doesn't format the line correctly. The specification states that the first few fields of a non-blank line are supposed to be:

IIPPPPPPPPMMDDYYHHmmddSS…

while the CDK lines use:

  CDK    7/28/10,10:58
  CDK    8/10/10,12:22
  CDK    9/16/09,9:40
  CDK    10/7/09,10:42
  CDK    11/9/10,11:20

were the date starts one byte too early. The date also isn't in the specified format.

Further examination shows these were only generated in 2009 and 2010. The current CDK implementation is correct, and a 'git annotate' suggests it was fixed in 2010.

I don't think anyone uses that line for anything, and don't see the point of changing anything, so I don't think it's worthwhile to even ask ChEBML to change these old records.

Andrew Dalke: Software mentioned in ChEMBL SD records

Extract program names from the ChEMBL 21 SDF

Program counts

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112