pythonchemistryrdkit

Converting an Isometric SMILE into its atoms and non-hydrogen neighbours


Hope whoever is reading this is well.

I have an issue with my code. I am trying to convert an isometric SMILE, a descriptor of a molecule, into its atomic groups and neightbours.

My code is below.

import rdkit
from rdkit import Chem

def collect_bonded_atoms(smile):
    mol = Chem.MolFromSmiles(smile)
    atom_counts = {}
    for atom in mol.GetAtoms():
        neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
                     for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
        neighbors.sort()
        key = "{}-{}".format(atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#'}" for symbol, bond_order in neighbors))
        atom_counts[key] = atom_counts.get(key, 0) + 1
    return atom_counts

smile = "CC(C)(C)C(=O)O"
print(collect_bonded_atoms(smile))

And output is

{'C-C-': 3, 'C-C-C-C-C-': 1, 'C-C-O-O=': 1, 'O-C=': 1, 'O-C-': 1}

Whilst this works well for this molecule's SMILE, though preferably I would've liked it to be structured as,

{'C-C-': 3, 'C-C(-C)(-C)-C-': 1, 'C-C-O(=O)': 1, 'O=C': 1, 'O-C-': 1}

I can't figure out how to fix this. This is a side issue.

The main issue I have is when using this molecule

smile = "CCCCCCCCN1C=C[N+](=C1)C.F[P-](F)(F)(F)(F)F"

My output is very wrong. This is my output.

{'C-C-': 1, 'C-C-C-': 6, 'C-C-N-': 1, 'N-C-C#C#': 2, 'C-C#N#': 2, 'C-N#N#': 1, 'C-N-': 1, 'F-P-': 6, 'P-F-F-F-F-F-F-': 1}

First is that double bonds (bond_order == 2) are shown as a #. Second where it shows the number 1 in the molecule SMILE, that represents a ring. This means that it connects to the next 1. In the output, it is all over the place.

Can I please have some guidance on this? Thanks

Advice on how to fix it, or even better a modification. The side issue isn't as important, but if possible same for it.


Solution

  • You're getting the # because those bonds are part of an aromatic ring, making their bond type AROMATIC instead of a SINGLE or DOUBLE. In RdKit, AROMATIC bonds have a bond order of 1.5 so all those bonds are ending up in the else loop.

    To fix this, you can do two things:

    1. Change your if-else condition to acknowledge the 1.5 bond order
    2. Kekulize the mol object to remove conjugation due to the aromatic ring and make all the bonds static. You can do this by updating your code as:
    mol = Chem.MolFromSmiles(smile)
    Chem.Kekulize(mol)