Tags: python, chemistry, rdkit

How to change representation of dummy atoms in SMILES format


Hi, I want to group identical molecular structures by their SMILES strings.

However, even when the structures are the same, they are hard to group because the dummy atoms carry different numeric labels.

I'm using RDKit (version 2022.3.4) and have tried changing several options, but haven't found a solution yet. I would like to ask for your help.

Example SMILES (same structure but different SMILES strings -> desired format):

  1. [1*]C(=O)OC, [13*]C(=O)OC -> *C(=O)OC
  2. [31*]C1=CC=CC2=C1C=CC=N2, [5*]C1=CC=CC2=C1C=CC=N2 -> *C1=CC=CC2=C1C=CC=N2
  3. [45*]C(N)=O, [5*]C(N)=O, [19*]C(N)=O, [16*]C(N)=O -> *C(N)=O

Solution

  • It sounds a little weird, but you can replace each dummy atom with a dummy atom: match every dummy atom, labelled or not, and substitute an unlabelled one.

    You can use ReplaceSubstructs() for this.

    from rdkit import Chem
    
    smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
              '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
              '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']
    
    search_patt = Chem.MolFromSmiles('*')  # matches dummy atoms with or without a number
    sub_patt = Chem.MolFromSmiles('*')     # replacement: dummy atom without a number
    
    for s in smiles:
        # sanitize=False skips aromaticity perception, so the bonds stay as written
        m = Chem.MolFromSmiles(s, sanitize=False)
        # ReplaceSubstructs returns a tuple of molecules; with replaceAll=True use the first
        new_m = Chem.ReplaceSubstructs(m, search_patt, sub_patt, replaceAll=True)
        print(s, '-->', Chem.MolToSmiles(new_m[0], kekuleSmiles=True))
    

    Output:

    [1*]C(=O)OC --> *C(=O)OC
    [13*]C(=O)OC --> *C(=O)OC
    [31*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
    [5*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
    [45*]C(N)=O --> *C(N)=O
    [5*]C(N)=O --> *C(N)=O
    [19*]C(N)=O --> *C(N)=O
    [16*]C(N)=O --> *C(N)=O
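
An alternative sketch that is not part of the answer above: RDKit stores the number in an atom such as [13*] as the isotope of the dummy atom, so clearing the isotope on every atom with atomic number 0 and then writing canonical SMILES should give one key per structure, which can be used directly for the grouping the question asks about. The helper name strip_dummy_labels and the groups dictionary are made up for illustration. Unlike the code above, this version sanitizes the molecules, so ring systems are written in aromatic rather than Kekulé form.

```python
from collections import defaultdict

from rdkit import Chem


def strip_dummy_labels(smiles):
    """Return canonical SMILES with isotope labels cleared on dummy atoms.

    Hypothetical helper for illustration; not an RDKit function.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0:  # atomic number 0 marks a dummy atom
            atom.SetIsotope(0)        # drop the numeric label
    return Chem.MolToSmiles(mol)


smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

# Map each normalized SMILES to the original strings that produced it.
groups = defaultdict(list)
for s in smiles:
    groups[strip_dummy_labels(s)].append(s)

for key, members in groups.items():
    print(key, '<--', members)
```

Because the keys are canonical SMILES, inputs that write the same structure with a different atom order should also land in the same group.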