pythonrdkitfingerprinting

Can't convert molecule to fingerprint with rdkit


I'm trying to convert molecular smiles into fingerprints using rdkit. I have two smiles: Nc1cccc(N)n1 and Nc1cc(CSc2ccc(O)cc2)cc(N)n1. The first one was expanded into the second one. In other words, the second molecule contains the first one in its structure.

What I did was use rdkit to remove the common part to obtain smiles of a fragment that differs (CSC1=CC=C(O)C=C1 in kekulized form). I'm trying to convert that fragment into a molecule and then to a fingerprint to calculate similarity with a reference molecule.

Desired transformation

But I get an error: 'Can't kekulize atoms' with indices of those atoms. This is strange to me because all the smiles (the two input smiles and the resulting fragment smiles) can be easily visualized using MarvinSketch or Chemdraw (software for drawing molecules). I even had Marvin kekulize the fragment smiles and tried making a molecule from that but I still get the same error. Here is my code for removing the fragment:

def remove_initial_fragment(mol_smiles, fragment_smiles):
    mol = Chem.MolFromSmiles(mol_smiles) #creates molecule from the longer smiles
    fragment = Chem.MolFromSmiles(fragment_smiles) #the molecule I want to remove

    rm = AllChem.DeleteSubstructs(mol, fragment) #creates new molecule

    return Chem.MolToSmiles(rm) #converts the mol I want back into smiles

smiles_frags = [remove_initial_fragment(x, fragment_smiles) for x in smiles]
mols_frags = [Chem.MolFromSmiles(x) for x in smiles_frags]

In my case, the 'fragment_smiles' is the same for all selected smiles. But then I get an error when trying to convert molecules from the 'mols_frags' list into fingerprints:

MFP_2 = [AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in mols_frags]

I tried looking online for answers but nothing really helped. I even tried to create kekulized smiles separately and passing them directly as input for creating the fingerprints but I still get the same error.

It's super weird to me because when I try to do the same process with the same code for one set of smiles (fragment, longer smiles, resulting smiles), it works without a problem and I can create the fingerprint without any error. But it seems to me that once I input the smiles/molecules as a list, I get the error. Any idea why this could be? Or do you see any error in my code that I'm unaware of?


Solution

  • With fragment_smiles = 'Nc1cccc(N)n1' and a list like smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1']. I have no problem getting a fingerprint.

    It looks as if, after deleting the substructure, there are some smiles_frags that are not correct SMILES.

    To prove wich SMILES in the list gives the problem you can use

    from rdkit.Chem import AllChem as Chem
    
    fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')
    
    smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']
    
    for smi in smiles:
        try:
            mol = Chem.MolFromSmiles(smi)
            f1 = Chem.DeleteSubstructs(mol, fragment)
            f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
            fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
        except:
            print('SMILES:', smi)
            f = Chem.DeleteSubstructs(mol, fragment)
            print('smiles_frag:', Chem.MolToSmiles(f1))
    

    This will give:

    SMILES: CC1=CC=Cc2c(N)nc(N)cc12
    smiles_frag: ccccC