Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22462

Andrew Dalke: Use FragmentOnBonds to fragment a molecule in RDKit

$
0
0

This is part of a continuing series on how to fragment a molecular graph. So far I have covered:

Those were mostly pedagogical. They describe the low-level details of how to cut a bond to fragment a molecule into two parts, and still maintain correct chirality.

If you are writing production code, you likely shouldn't use any of that code, because the built-in RDKit function FragmentOnBonds() does the same thing.

FragmentOnBonds

FragmentOnBonds() will fragment a molecule given a list of bond identifiers. In default use it attaches a new "dummy" atom (what I've been calling a wildcard atom) to the atoms at the end of each broken bond. It will also set the isotope label for each newly created dummy atom to the atom index of the atom to which it is attached.

>>> from rdkit import Chem
>>> bond_idx = vanillin.GetBondBetweenAtoms(4, 5)
>>> from rdkit import Chem
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> bond = vanillin.GetBondBetweenAtoms(4, 5)
>>> bond
<rdkit.Chem.rdchem.Bond object at 0x102bec8a0<
>>> bond.GetIdx()
4
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[4*]OC.[5*]c1cc(C=O)ccc1O'
I don't want a label, so I'll specify 0 for the atom labels. (An unlabeled atom has an isotope of zero.)
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()], dummyLabels=[(0, 0)])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]OC.[*]c1cc(C=O)ccc1O'

I'll turn this into function that I can use for testing:

def fragment_builtin(mol, a1, a2):
    bond = mol.GetBondBetweenAtoms(a1, a2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=True, dummyLabels=[(0, 0)])
That's certainly much easier than fragment_chiral()!

FragmentOnBonds implementation

The RDKit source code is available, so we can look at how this built-in function is implemented. A search for "FragmentOnBonds":

Code/GraphMol/Wrap/MolOps.cpp
1631:      python::def("FragmentOnBonds", fragmentOnBondsHelper,
plus some investigation shows that the actual C++ code has the name "fragmentOnBonds". The RDKit convention is to use an intial uppercase letter for Python function, and an initial lowercase letter for C++ functions.

A search this time for "fragmentOnBonds" has a few more hits, including this very suggestive one:

Code/GraphMol/ChemTransforms/MolFragmenter.cpp
320:        ROMol *nm=fragmentOnBonds(mol,fragmentHere,addDummies,dummyLabelsHere,bondTypesHere,lCutsPerAtom);
329:    ROMol *fragmentOnBonds(const ROMol &mol,const std::vector &bondIndices,
That second result is start of the FragmentOnBondsfunction definition,. Here's the key part, with commentary.

The function can fragment multiple bonds. It iterates through the bonds in order. For each one it gets the bond type and the begin and end atoms, and if requested it stores information about the number of cuts per atom.

  for (unsigned int i = 0; i  bondsToRemove.size(); ++i) {
    const Bond *bond = bondsToRemove[i];
    unsigned int bidx = bond->getBeginAtomIdx();
    unsigned int eidx = bond->getEndAtomIdx();
    Bond::BondType bT = bond->getBondType();
    res->removeBond(bidx, eidx);
    if (nCutsPerAtom) {
      (*nCutsPerAtom)[bidx] += 1;
      (*nCutsPerAtom)[eidx] += 1;
    }
FragmentOnBonds() by default will add "dummy" atoms (atoms with atomic number 0), but this can be disabled. If enabled, then it will create the two atoms with atomic number 0. By default it will set the istope to be the index of the atom it will be attached to, but that can also be specified with the dummyLabels parameter:
    if (addDummies) {
      Atom *at1, *at2;
      at1 = new Atom(0);
      at2 = new Atom(0);
      if (dummyLabels) {
        at1->setIsotope((*dummyLabels)[i].first);
        at2->setIsotope((*dummyLabels)[i].second);
      } else {
        at1->setIsotope(bidx);
        at2->setIsotope(eidx);
      }
Next, make the bond from the old terminal atoms to the new wildcard atoms. By default the bond type for the new bonds is the same as the bond which was broken (determined earlier). Otherwise, there's an option to specify an alternate bond type.
      unsigned int idx1 = res->addAtom(at1, false, true);
      if (bondTypes) bT = (*bondTypes)[i];
      res->addBond(eidx, at1->getIdx(), bT);
      unsigned int idx2 = res->addAtom(at2, false, true);
      res->addBond(bidx, at2->getIdx(), bT);

The last part does what the comment says it does; if the atom has a tetrahedral chirality then check if the permutation order has changed and invert the chirality if needed:
      // figure out if we need to change the stereo tags on the atoms:
      if (mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(bidx),
                               res->getAtomWithIdx(bidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
      }
      if (mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(eidx),
                               res->getAtomWithIdx(eidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
}
The code to check if the chirality has changed iterates through all of the bonds of the old atom ("oAt") to make a list ("newOrder") containing all of the atom identifiers on the other side of the bond. (The "OEDGE_ITER" is part of the adapter to work with Boost Graph library. It's an "out_edge_iterator".)
void checkChiralityPostMove(const ROMol &mol, const Atom *oAt, Atom *nAt,
                            const Bond *bond) {
  INT_LIST newOrder;
  ROMol::OEDGE_ITER beg, end;
  boost::tie(beg, end) = mol.getAtomBonds(oAt);
  while (beg != end) {
    const BOND_SPTR obond = mol[*beg];
    ++beg;
The code knows that the RemoveBond()/AddBond() caused the new dummy atom to be placed at the end of bond list, so when it sees the old bond it simply ignores it. Once it's gone through the old atoms, it adds the new wildcard atom id to the end of the list. Interestingly, this knowledge isn't something I can depend on, because it's an implementation detail which might change in the future. That's why I needed to substitute the new bond id at this point in my code.
    if (obond.get() == bond) {
      continue;
    }
    newOrder.push_back(obond->getIdx());
  }
  newOrder.push_back(bond->getIdx());
The last bit is to compute the permutation order, or as it says, "perturbation order". I believe that is a typo. Dig down a bit and it uses a selection sort to count the number of swaps needed to make the "oAt" atom list and the "newOrder" list match. The result, modulo two, gives the parity. If the parity is odd, invert the chirality:
  unsigned int nSwaps = oAt->getPerturbationOrder(newOrder);
  if (nSwaps % 2) nAt->invertChirality();
}
This was a bit different than my approach, which compared the parity before and after, but it gives the same results.

Jumping back to the fragmentOnBonds() function, the last bit of the function is:

  res->clearComputedProps();
  return static_cast(res);
This is needed because some of the computed properties may depend on bond patterns which are no longer present. Note that it does not use SanitizeMol().

While the details are a bit different between my fragment_chiral() and FragmentOnBonds(), the general approach is the same.

A possible optimization

I mentioned that checkChiralityPostMove() uses insider knowledge. It knows that the deleted bond is removed from the list and the new bond added at the end. It uses this information to build the "newOrder" list with atom indices in the correct order, before determining the difference in parity between that list and the list of atom indices around the new bond.

There's an easier way. Count the number of bond between where the deleted bond was and the end, and take the result modulo 2. For example, if the transforamation were:

initial neighbors = [1, 6, 5, 4]
   - remove bond to atom 5
neighbors after delete = [1, 6, 4]
   - add bond to new atom 9
final neighbors = [1, 6, 4, 9]
then the bond to atom 5 was in the third position, with one bond (to atom 4) between it and the end of the list. The parity is therefore odd. I added this suggestion to the RDKit issue tracker as #1033.

The proof is simple. Assume there are n elements afterward the bond to delete. Then there are n swaps needed to bring it to the end, where it can be replaced with the connection to the new atom. Hence the parity is n%2.

SanitizeMol is slow

In the earlier essay in this series, "Fragment chiral molecules in RDKit", I investigated some failure cases where the chirality wasn't preserved. Greg Landrum pointed out:after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps() .... I found that if I added SanitizeMol() then some of the error cases done by using ClearComputedProps() were no longer error cases. This may be because of a bug in FastFindRings. Greg, in the second link, writes: There is a workaround until this is fixed; explicitly sanitize your molecules or just cal GetSymmSSSR() before you generate SMILES.

So that's what I did. However, without SanitizeMol() there were only 232 failures out of 1.4M+ input structures. Adding SanitizeMol() didn't fix all of the remaining errors. On the other hand, the performance overhead to SanitizeMol() is quite a bit. In one timing test I made, the time to process 10,000 structures using fragment_chiral() without SanitizeMol() was 107 seconds, while the time with SanitizeMol() was 255 seconds. The overall processing time takes about twice as long. (There is an additional constant overhead to parse the SMILES and canonicalize the fragmented molecule, which I did not include.)

I don't think the performance impact is worth a slightly error rate, so I will not include SanitizeMol() in this essay. That means I can use my new fragment_on_mol() function as-is, and I'll modify fragment_chiral() so it calls ClearComputedProps() instead of SanitizeMol().

In any case, the explicit call to SanitizeMol() is a workaround. I hope ClearComputedProps() will suffice in the RDKit of the future world that you, the reader, inhabit.

Cross-validation

In the earlier essay I tested the fragment_chiral() function by attempting to re-connect the fragments. If the canonicalized result doesn't match the canonicalized input structure then there's a problem. (And there were a small number of problems, at the edge of RDKit's ability to handle chemisty.)

That test ignored what SMILES calls "directional bonds", that is the "/" and "\" in "F/C=C\F", which is how SMILES denotes the stereochemistry of double bonds. This is how SMILES deals with "E-Z notation", which would be more familiar to a chemist. The tests ignored those bonds for the simple reason that FragmentOnBonds() doesn't preserve that information if it cuts the bond. In a future essay I'll show how to implement that ability.

Instead of fully testing fragment_on_bonds(), I'll do a weaker test where I compare its results to fragment_chiral(). I say it's weaker because what it shows is that they are likely bug-compatible, which is different than being correct. It's also weaker because the two implementation methods are very similar. A stronger test would use a different way to fragment.

I think of this as a cross-validation test. As Wikipedia helpful points out, that term is overloaded. Statistics uses a different definition than chemistry; the latter term is very similar to what's used in verification and validation, so I think I'm right to use the term.

The test code is not interesting, so I'll put it at the end and not discuss it in detail. The results weren't interesting either. There were no mismatches. I just had to wait for a few hours while I let it process the 1.4M+ structures from ChEMBL 20.

Here are the last few lines of output:

Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
elapsed time: 8:30:15
The two implementation gave identical results, and time needed for fragment_chiral() is about 1.7x that needed for fragment_on_bond(). The fragment_on_bond() code is shorter, faster, and easier to understand, so you should always use it instead of fragment_chiral().

The code

The fragment_chiral() function here is a copy of the previous essay, except I replaced the SanitizeMol(new_mol) with a mol.ClearComputedProps() for 2x performance, at the expense of perhaps a few incorrect structures. I put it here so the test code was self-contained, except of course you'll need your own dataset.

from __future__ import print_function
import time
import datetime
from rdkit import Chem

# Two different methods to cut a bond and fragment an RDKit molecule.
#   - fragment_on_bond() uses RDKit's FragmentOnBonds()
#   - fragment_chiral() uses lower-level API calls

# This also includes a cross-validation function which checks that the
# two methods produce the same fragments as output.
# The final two lines when I evaluated ChEMBL20 were:
#
# Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
# elapsed time: 8:30:15
#
# The timings show that fragment_on_bond() is about 1.7x the performance of fragment_chiral().

# I also used this to evalute the performance impact of SanitizeMol()
# After 10,000 molecules, the times were:
#      with SanitizeMol(): T_chiral: 254.94 T_on_bond 210.82
#   without SanitizeMol(): T_chiral: 107.22 T_on_bond 64.99
# That call adds a lot of overhead for little gain in accuracy.


#### Fragment using a high-level RDKit API function

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
    # After breaking bonds, should re-sanitize
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # However, I didn't see much of an improvement, except for chiral
    # carbon bridgeheads, and the result has 1/3 the performance
    #Chem.SanitizeMol(new_mol)
    return new_mol


#### Fragment using low-level RDKit API functions

# See http://www.dalkescientific.com/writings/diary/archive/2016/08/14/fragment_chiral_molecules.html
# for implementation discussion

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values
    # See http://www.dalkescientific.com/writings/diary/archive/2016/08/15/fragment_parity_calculation.html
    # for faster versions for fixed-size lists.
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for the each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for the each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds, should re-sanitize
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # However, I didn't see much of an improvement, except for chiral
    # carbon bridgeheads, and the result has 1/2 the performance.
    new_mol = rwmol.GetMol()
    # I must ClearComputedProps() after editing the structure.
    new_mol.ClearComputedProps()
    #Chem.SanitizeMol(new_mol)
    return new_mol

####### Cross-validation code

def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles

def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_chiral = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches.""  T_chiral: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_chiral, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 1000 == 0:
            print_status()

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_chiral() ...
            t1 = time.time()
            mol_chiral = fragment_chiral(mol, a1, a2)
            t2 = time.time()
            time_chiral += t2-t1
            smiles_chiral = Chem.MolToSmiles(mol_chiral, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_chiral != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print(" smiles_chiral:", smiles_chiral)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # only display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)
                
if __name__ == "__main__":
    cross_validate()

Viewing all articles
Browse latest Browse all 22462

Trending Articles