I'm trying to convert molecular smiles into fingerprints using rdkit
. I have two smiles:
Nc1cccc(N)n1 and Nc1cc(CSc2ccc(O)cc2)cc(N)n1. The first one was expanded into the second one. In other words, the second molecule contains the first one in its structure.
What I did was use rdkit to remove the common part to obtain smiles of a fragment that differs (CSC1=CC=C(O)C=C1 in kekulized form). I'm trying to convert that fragment into a molecule and then to a fingerprint to calculate similarity with a reference molecule.
But I get an error: 'Can't kekulize atoms' with indices of those atoms. This is strange to me because all the smiles (the two input smiles and the resulting fragment smiles) can be easily visualized using MarvinSketch or Chemdraw (software for drawing molecules). I even had Marvin kekulize the fragment smiles and tried making a molecule from that but I still get the same error. Here is my code for removing the fragment:
def remove_initial_fragment(mol_smiles, fragment_smiles):
mol = Chem.MolFromSmiles(mol_smiles) #creates molecule from the longer smiles
fragment = Chem.MolFromSmiles(fragment_smiles) #the molecule I want to remove
rm = AllChem.DeleteSubstructs(mol, fragment) #creates new molecule
return Chem.MolToSmiles(rm) #converts the mol I want back into smiles
smiles_frags = [remove_initial_fragment(x, fragment_smiles) for x in smiles]
mols_frags = [Chem.MolFromSmiles(x) for x in smiles_frags]
In my case, the 'fragment_smiles' is the same for all selected smiles. But then I get an error when trying to convert molecules from the 'mols_frags' list into fingerprints:
MFP_2 = [AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in mols_frags]
I tried looking online for answers but nothing really helped. I even tried to create kekulized smiles separately and passing them directly as input for creating the fingerprints but I still get the same error.
It's super weird to me because when I try to do the same process with the same code for one set of smiles (fragment, longer smiles, resulting smiles), it works without a problem and I can create the fingerprint without any error. But it seems to me that once I input the smiles/molecules as a list, I get the error. Any idea why this could be? Or do you see any error in my code that I'm unaware of?
With fragment_smiles = 'Nc1cccc(N)n1'
and a list
like smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1']
. I have no problem getting a fingerprint.
It looks as if, after deleting the substructure, there are some smiles_frags that are not correct SMILES.
To prove wich SMILES in the list
gives the problem you can use
from rdkit.Chem import AllChem as Chem
fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')
smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']
for smi in smiles:
try:
mol = Chem.MolFromSmiles(smi)
f1 = Chem.DeleteSubstructs(mol, fragment)
f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
except:
print('SMILES:', smi)
f = Chem.DeleteSubstructs(mol, fragment)
print('smiles_frag:', Chem.MolToSmiles(f1))
This will give:
SMILES: CC1=CC=Cc2c(N)nc(N)cc12
smiles_frag: ccccC