
RDKit: how to check molecules for exact match?


I'm using RDKit and trying to check whether two molecules are an exact match. After creating them with Chem.MolFromSmiles(), the expression m == p does not give the desired result. Of course, I can check whether p is a substructure of m and whether m is a substructure of p, but this seems too complicated. I couldn't find a code example for an exact match in the RDKit documentation (or I overlooked it). How do I do this correctly? Thank you for any hints.

Code:

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'          # Carbazole
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'      # Carbazole

m = Chem.MolFromSmiles(myMolecule)
p = Chem.MolFromSmiles(myPattern)

print(m == p)                    # returns False, first (unsuccessful) attempt to check for identity

print(m.HasSubstructMatch(p))    # returns True
print(p.HasSubstructMatch(m))    # returns True
print(m.HasSubstructMatch(p) and p.HasSubstructMatch(m))    # returns True, so are the molecules identical?

Solution

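  • Comparing two Mol objects with == does not compare their structures, which is why m == p prints False. To check whether two different SMILES strings represent the same molecule, canonicalize both strings and compare the results.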

    from rdkit import Chem
    
    myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'
    myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'
    
    a = Chem.CanonSmiles(myPattern)
    b = Chem.CanonSmiles(myMolecule)
    
    print(a)       # c1ccc2c(c1)[nH]c1ccccc12
    
    print(b)       # c1ccc2c(c1)[nH]c1ccccc12
    
    print(a == b)  # True
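
The comparison can be wrapped in a small helper that also handles unparsable input. This is only a sketch: `same_molecule` is a hypothetical name, not an RDKit function, and it assumes a recent RDKit release in which `Chem.MolToSmiles` writes stereochemistry by default. The second half illustrates one reason the mutual substructure test from the question is weaker than it looks: `HasSubstructMatch` ignores chirality unless `useChirality=True` is passed, so two enantiomers match each other in both directions while their canonical SMILES differ.

```python
from rdkit import Chem


def same_molecule(smiles_a, smiles_b):
    """Hypothetical helper: compare two SMILES via their canonical SMILES."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    # MolFromSmiles returns None when a string cannot be parsed.
    if mol_a is None or mol_b is None:
        raise ValueError("could not parse one of the SMILES strings")
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)


# The two carbazole SMILES from the question.
aromatic_form = 'c1ccc2c(c1)c3ccccc3[nH]2'
kekule_form = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'
print(same_molecule(aromatic_form, kekule_form))

# The two enantiomers of alanine: same connectivity, opposite stereocenter.
enantiomer_1 = 'C[C@H](N)C(=O)O'
enantiomer_2 = 'C[C@@H](N)C(=O)O'
mol_1 = Chem.MolFromSmiles(enantiomer_1)
mol_2 = Chem.MolFromSmiles(enantiomer_2)

# Mutual substructure match with default settings (chirality is not checked).
mutual_match = mol_1.HasSubstructMatch(mol_2) and mol_2.HasSubstructMatch(mol_1)
print(mutual_match)

# Canonical SMILES comparison keeps the stereo information.
print(same_molecule(enantiomer_1, enantiomer_2))
```

Under the stated assumptions, the three print calls should give True, True and False. Note that "same molecule" here means the same canonical SMILES, so different tautomers or protonation states of a compound are still treated as different molecules.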