Retrieving PubChem Compound Information

By Daniel Himmelstein

January 30, 2015

Here, we take the PubChem compound identifiers for the stereo drugs in SIDER 2 and find the corresponding parent compounds and canonical SMILES. This is the third notebook in a project to parse and analyze SIDER 2 data.

import os
import csv
import collections

We use the pubchempy package to query PubChem's API from within Python. Information on pubchempy is available in the package's online documentation.

import pubchempy

def get_pubchem_parent(cid, orphans_as_self=True):
    """
    From a pubchem_cid, retrieve the parent compound's cid.
    If the function is unsuccessful in retrieving a single parent,
    `orphans_as_self=True` returns `cid` rather than None.

    According to PubChem:
    
    > A parent is conceptually the "important" part of the molecule
    > when the molecule has more than one covalent component.
    > Specifically, a parent component must have at least one carbon
    > and contain at least 70% of the heavy (non-hydrogen) atoms of
    > all the unique covalent units (ignoring stoichiometry).
    > Note that this is a very empirical definition and is subject to change.

    An equivalent query can be executed using the PUG REST API:
    http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/11477084/cids/XML?cids_type=parent
    """
    assert cid
    
    try:
        parent_cids = pubchempy.get_cids(identifier=cid, namespace='cid', domain='compound', cids_type='parent')
    except pubchempy.BadRequestError as e:
        print('Error getting parent of {}. {}'.format(cid, e))
        return cid if orphans_as_self else None
    try:
        # Unpacking succeeds only when exactly one parent was returned.
        parent_cid, = parent_cids
        return parent_cid
    except ValueError:
        print('Error getting parent of {}. Parents retreived: {}'.format(cid, parent_cids))
    return cid if orphans_as_self else None
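
The single-element unpacking `parent_cid, = parent_cids` is what separates the success case from the failure cases (no parent, or more than one). The following is a minimal offline sketch of that logic, with the network call replaced by a plain list. `pick_single_parent` is a hypothetical helper name, not part of pubchempy, and the parent identifier `42` is made up for illustration.

```python
def pick_single_parent(cid, parent_cids, orphans_as_self=True):
    """Return the only element of parent_cids, else fall back to cid or None."""
    try:
        # Unpacking into a one-element target raises ValueError unless
        # parent_cids contains exactly one item.
        parent_cid, = parent_cids
        return parent_cid
    except ValueError:
        return cid if orphans_as_self else None


# An empty result falls back to the query cid by default.
print(pick_single_parent(271, []))
# With orphans_as_self=False the caller gets None instead.
print(pick_single_parent(271, [], orphans_as_self=False))
# Exactly one (hypothetical) parent is returned unchanged.
print(pick_single_parent(11477084, [42]))
```

Returning the query identifier for orphans keeps every row usable downstream, at the cost of not distinguishing "is its own parent" from "no parent found".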

path = os.path.join('..', 'data', 'sider_compounds_pubchem.txt')
with open(path) as read_file:
    reader = csv.DictReader(read_file, fieldnames=['pubchem_cid'])
    rows = list(reader)
rows[:3]

[{'pubchem_cid': '119'}, {'pubchem_cid': '137'}, {'pubchem_cid': '143'}]

for row in rows:
    cid = row['pubchem_cid']
    parent_cid = get_pubchem_parent(cid)
    cid_props, cid_parent_props = pubchempy.get_properties(
        properties=['canonical_smiles'], identifier=[cid, parent_cid], namespace='cid')
    row['canonical_smiles'] = cid_props['CanonicalSMILES']
    row['pubchem_cid_parent'] = parent_cid
    row['canonical_smiles_parent'] = cid_parent_props['CanonicalSMILES']
rows[:3]

Error getting parent of 271. Parents retreived: []
Error getting parent of 312. Parents retreived: []
Error getting parent of 402. Parents retreived: []
Error getting parent of 784. Parents retreived: []
Error getting parent of 807. Parents retreived: []
Error getting parent of 888. Parents retreived: []
Error getting parent of 947. Parents retreived: []
Error getting parent of 948. Parents retreived: []
Error getting parent of 977. Parents retreived: []
Error getting parent of 2770. Parents retreived: []
Error getting parent of 3161. Parents retreived: []
Error getting parent of 5238. Parents retreived: []
Error getting parent of 5785. Parents retreived: []
Error getting parent of 5825. Parents retreived: []
Error getting parent of 6691. Parents retreived: []
Error getting parent of 9052. Parents retreived: []
Error getting parent of 14791. Parents retreived: []
Error getting parent of 14888. Parents retreived: []
Error getting parent of 20585. Parents retreived: []
Error getting parent of 23954. Parents retreived: []
Error getting parent of 23987. Parents retreived: []
Error getting parent of 24393. Parents retreived: []
Error getting parent of 24843. Parents retreived: []
Error getting parent of 25959. Parents retreived: []
Error getting parent of 26924. Parents retreived: []
Error getting parent of 28486. Parents retreived: []
Error getting parent of 43805. Parents retreived: []
Error getting parent of 65027. Parents retreived: []
Error getting parent of 66376. Parents retreived: []
Error getting parent of 71368. Parents retreived: []
Error getting parent of 145068. Parents retreived: []
Error getting parent of 160051. Parents retreived: []
Error getting parent of 4517618. Parents retreived: []
Error getting parent of 5280452. Parents retreived: []
Error getting parent of 5280962. Parents retreived: []
Error getting parent of 5280972. Parents retreived: []
Error getting parent of 5281008. Parents retreived: []
Error getting parent of 5281011. Parents retreived: []
Error getting parent of 5281021. Parents retreived: []
Error getting parent of 5281106. Parents retreived: []
Error getting parent of 5282044. Parents retreived: []
Error getting parent of 5360126. Parents retreived: []
Error getting parent of 6326970. Parents retreived: []
Error getting parent of 6474909. Parents retreived: []
Error getting parent of 11598201. Parents retreived: []
Error getting parent of 11979316. Parents retreived: []
Error getting parent of 16132418. Parents retreived: []
Error getting parent of 16132438. Parents retreived: []
Error getting parent of 25077648. Parents retreived: []
Error getting parent of 44387541. Parents retreived: []
Error getting parent of 45267081. Parents retreived: []

[{'canonical_smiles': u'C(CC(=O)O)CN',
  'canonical_smiles_parent': u'C(CC(=O)O)CN',
  'pubchem_cid': '119',
  'pubchem_cid_parent': 119},
 {'canonical_smiles': u'C(CC(=O)O)C(=O)CN',
  'canonical_smiles_parent': u'C(CC(=O)O)C(=O)CN',
  'pubchem_cid': '137',
  'pubchem_cid_parent': 137},
 {'canonical_smiles': u'C1C(N(C2=C(N1)NC(=NC2=O)N)C=O)CNC3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O',
  'canonical_smiles_parent': u'C1C(N(C2=C(N1)NC(=NC2=O)N)C=O)CNC3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O',
  'pubchem_cid': '143',
  'pubchem_cid_parent': 143}]

collections.Counter(str(row['pubchem_cid']) == str(row['pubchem_cid_parent']) for row in rows)

Counter({True: 990, False: 140})
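
The counter shows how many compounds map to themselves. To see which identifiers were remapped, the same comparison can drive a filter. This is a sketch using made-up rows in the same shape as `rows`; the second pair of identifiers is hypothetical and `is_own_parent` is an illustrative helper name.

```python
import collections

# Hypothetical rows: pubchem_cid is a string (read from the file),
# while a retrieved parent cid is an integer.
example_rows = [
    {'pubchem_cid': '119', 'pubchem_cid_parent': 119},
    {'pubchem_cid': '1000', 'pubchem_cid_parent': 999},  # made-up remapping
]


def is_own_parent(row):
    # Compare as strings because the two fields can differ in type.
    return str(row['pubchem_cid']) == str(row['pubchem_cid_parent'])


counts = collections.Counter(is_own_parent(row) for row in example_rows)
remapped = [row['pubchem_cid'] for row in example_rows if not is_own_parent(row)]
print(counts[True], counts[False])
print(remapped)
```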

path = os.path.join('..', 'data', 'compounds.txt')
with open(path, 'w') as write_file:
    fieldnames = ['pubchem_cid', 'pubchem_cid_parent', 'canonical_smiles', 'canonical_smiles_parent']
    writer = csv.DictWriter(write_file, fieldnames=fieldnames, delimiter='\t')
    writer.writeheader()
    writer.writerows(rows)

Download

For constructing compound networks, compounds.txt can be used as a node attributes table and similarities.txt can be used like a .sif file for edges. To exclude compound pairs for which fewer than all three methods produced a score, use similarities-complete.txt.
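
As a sketch of the node-attribute use, the tab-delimited compounds.txt written above can be loaded into a dictionary keyed by compound identifier. The example reads from an in-memory string holding the first two rows shown earlier rather than from disk, and `load_node_attributes` is a hypothetical helper name. The layout of similarities.txt is not shown in this notebook, so edges are left out.

```python
import csv
import io

# First two rows of the table, in the column order used by the DictWriter.
sample = (
    "pubchem_cid\tpubchem_cid_parent\tcanonical_smiles\tcanonical_smiles_parent\n"
    "119\t119\tC(CC(=O)O)CN\tC(CC(=O)O)CN\n"
    "137\t137\tC(CC(=O)O)C(=O)CN\tC(CC(=O)O)C(=O)CN\n"
)


def load_node_attributes(handle):
    """Map each pubchem_cid to a dict of its remaining columns."""
    reader = csv.DictReader(handle, delimiter='\t')
    nodes = {}
    for row in reader:
        row = dict(row)
        cid = row.pop('pubchem_cid')  # the key; the other columns stay as attributes
        nodes[cid] = row
    return nodes


nodes = load_node_attributes(io.StringIO(sample))
print(nodes['119']['canonical_smiles'])
```

To read the real file, the same function could be given `open(path)` in place of the `io.StringIO` object.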