rdkit bioservices

Chem.RDKFingerprint did not match C++ signature for some SMILES, but okay for others


I'm trying to match ligands referenced in UniProt with the same ligands in PDB entries. For many ligands (e.g. FAD), the three-letter code is the same in both UniProt and PDB entries, but for some there is a slight difference. For example, for haemoglobin 1a9w chain A, the PDB file has "HEM" but the corresponding UniProt entry (P69905) has "heme b". In the UniProt JSON, "heme b" has ChEBI ID CHEBI:60344.

I downloaded the full ChEBI SDF file from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ and found three haems that are close to what I want. So far, so good.

If I use the following code to calculate Tanimoto coefficients with CHEBI:60344 as the reference, one of the haems works but the other raises a C++ exception that I haven't been able to catch in my Python code. If the list of ChEBI IDs is in the reverse order, the code fails before I get any Tanimoto coefficient at all.

My question is: is this a bug in my use of RDKit, a bug in RDKit itself, a bug in the ChEBI module of bioservices, an incorrectly written SMILES string in the ChEBI SDF file, or something else?

This is all using conda-installed rdkit, bioservices, Python 3.9, etc. on an (old) Mac Pro running High Sierra (can't upgrade to a newer OS).

Ran this code:

from rdkit import Chem, DataStructs
from bioservices import ChEBI
heme = ChEBI()
heme_chebi_id = "CHEBI:60344"
heme_smiles = heme.getCompleteEntity(heme_chebi_id).smiles
target = Chem.MolFromSmiles(heme_smiles)
fp2 = Chem.RDKFingerprint(target)
for chebi_id in ["CHEBI:17627", "CHEBI:26355"]:
    ch = ChEBI()
    smiley = ch.getCompleteEntity(chebi_id).smiles
    print("reference:", heme_chebi_id)
    print("target:   ", chebi_id)
    print("reference:", heme_smiles)
    print("target:   ", smiley)
    ref = Chem.MolFromSmiles(smiley)
    fp1 = Chem.RDKFingerprint(ref)
    Tan = DataStructs.TanimotoSimilarity(fp1, fp2)
    print("Tanimoto coefficient:", Tan)
    print("-" * 64)
exit()

got this output:

reference: CHEBI:60344
target:    CHEBI:17627
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2

target:    CC1=C(CCC(O)=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
Tanimoto coefficient: 1.0

reference: CHEBI:60344
target:    CHEBI:26355
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target:    CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2

[12:36:26] Explicit valence for atom # 9 N, 4, is greater than permitted
Traceback (most recent call last):
File "/Volumes/Users/harry/icl/phyre2-ligand/./tanimoto_test.py", line 20, in <module>
fp1 = Chem.RDKFingerprint(ref)
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None)

Solution

  • This error means that the argument passed to Chem.RDKFingerprint is None, which means that ref is None. You can print ref to verify.

    In this case, ref is None because RDKit could not parse the given SMILES into a valid mol object; Chem.MolFromSmiles returns None on failure instead of raising an exception. RDKit even logged the reason just above the traceback:

    Explicit valence for atom # 9 N, 4, is greater than permitted
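A minimal sketch of that check, so a failed parse is reported and skipped instead of crashing the loop. The helper names `parse_or_report` and `tanimoto_or_none` are hypothetical, and `Chem.DetectChemistryProblems` assumes a reasonably recent RDKit release. The neutral four-bonded nitrogen in `CN(C)(C)C` is only a small stand-in for the same kind of valence error.

```python
from rdkit import Chem, DataStructs


def parse_or_report(smiles):
    """Return an RDKit Mol, or None after printing why parsing failed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        return mol
    # Parse again without sanitization so the problems can be inspected.
    raw = Chem.MolFromSmiles(smiles, sanitize=False)
    if raw is None:
        print("SMILES syntax error:", smiles)
        return None
    for problem in Chem.DetectChemistryProblems(raw):
        print(problem.GetType(), "-", problem.Message())
    return None


def tanimoto_or_none(reference_smiles, candidate_smiles):
    """Tanimoto similarity of two SMILES, or None if either fails to parse."""
    ref = parse_or_report(reference_smiles)
    cand = parse_or_report(candidate_smiles)
    if ref is None or cand is None:
        return None
    return DataStructs.TanimotoSimilarity(
        Chem.RDKFingerprint(ref), Chem.RDKFingerprint(cand)
    )


# Stand-in inputs: ethanol parses, the neutral four-bonded nitrogen does not.
for candidate in ["CCO", "CN(C)(C)C"]:
    print(candidate, tanimoto_or_none("CCO", candidate))
```

In the question's loop, the same `is None` test on `ref` before calling `Chem.RDKFingerprint` would let the remaining ChEBI IDs be processed regardless of their order.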

    This is because of the coordinate (dative) bonds in the molecule. The SMILES writes the N-Fe bonds as ordinary single bonds, and RDKit counts them as such, which raises the valence of both neutral nitrogen atoms to 4, so the molecule is rejected as invalid. Here's the same molecule generated from other sources:

    [image: ref molecule]

    To deal with this error, you'll have to modify the SMILES manually so that either those nitrogen atoms carry a charge or [Fe] is a separate atom rather than bonded to them. Something like this:

    [image: modified ref]

    This isn't really a problem with the SMILES itself but a limitation in how RDKit handles coordinate bonds that are written as ordinary single bonds. I have hit this issue many times and always had to modify the SMILES manually to get around it. One suggestion is to modify the SMILES programmatically, because this kind of error will most likely occur for metal-ligand catalysts, where a coordinate bond is almost always present. You can search the SMILES string for atoms like [Fe] or [Pt] and then rewrite the string accordingly.
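As a rough sketch of the programmatic route, the snippet below applies the first option (charges on the nitrogens) by plain string replacement. The helper name `charge_separate` and its `metal` parameter are hypothetical. It assumes every `[N]` in the string is a neutral nitrogen donating to a single neutral metal atom, which holds for the CHEBI:26355 SMILES in the question but not in general. For that input it reproduces the CHEBI:17627 SMILES, which RDKit parsed without error in the question's output.

```python
def charge_separate(smiles, metal="Fe"):
    """Rewrite each [N] as [N+] and put the balancing negative charge on the metal.

    Hypothetical helper: assumes all bracketed neutral N atoms are donors
    to exactly one neutral metal atom written as [Fe], [Pt], etc.
    """
    donors = smiles.count("[N]")
    neutral_metal = f"[{metal}]"
    if donors == 0 or smiles.count(neutral_metal) != 1:
        # Nothing to fix, or the assumptions do not hold: leave it untouched.
        return smiles
    fixed = smiles.replace("[N]", "[N+]")
    return fixed.replace(neutral_metal, f"[{metal}{'-' * donors}]")


# SMILES for CHEBI:26355 as printed in the question.
chebi_26355 = (
    "CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5"
    "[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2"
)
print(charge_separate(chebi_26355))
```

Because this only edits text, the result should still be passed through Chem.MolFromSmiles and checked for None before fingerprinting.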