[SOLVED] How to change representation of dummy atoms in SMILES format

Hi, I want to group identical molecular structures by their SMILES codes.

However, even when the structure is the same, grouping is difficult because the dummy atoms are represented differently.

I'm using RDKit (version 2022.3.4) and have tried changing several options, but I haven't found a solution yet. I would appreciate your help.

Example SMILES (same structure but different SMILES codes -> desired format):

[1*]C(=O)OC, [13*]C(=O)OC -> *C(=O)OC
[31*]C1=CC=CC2=C1C=CC=N2, [5*]C1=CC=CC2=C1C=CC=N2 -> *C1=CC=CC2=C1C=CC=N2
[45*]C(N)=O, [5*]C(N)=O, [19*]C(N)=O, [16*]C(N)=O -> *C(N)=O

Solution

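It sounds a little weird, but you can replace each dummy atom with a dummy atom. RDKit stores the number in [1*] as an isotope label on the dummy atom, and a plain * parsed from SMILES matches dummy atoms regardless of that label, so swapping every match for an unlabeled * removes the numbers.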

You can use ReplaceSubstructs() for this.

from rdkit import Chem

smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

# A '*' parsed from SMILES matches any dummy atom, with or without a numeric label
search_patt = Chem.MolFromSmiles('*')
# Replacement: a plain dummy atom without a label
sub_patt = Chem.MolFromSmiles('*')

for s in smiles:
    m = Chem.MolFromSmiles(s, sanitize=False)
    # ReplaceSubstructs returns a tuple of molecules; with replaceAll=True it has one entry
    new_m = Chem.ReplaceSubstructs(m, search_patt, sub_patt, replaceAll=True)
    print(s, '-->', Chem.MolToSmiles(new_m[0], kekuleSmiles=True))

Output:

[1*]C(=O)OC --> *C(=O)OC
[13*]C(=O)OC --> *C(=O)OC
[31*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[5*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[45*]C(N)=O --> *C(N)=O
[5*]C(N)=O --> *C(N)=O
[19*]C(N)=O --> *C(N)=O
[16*]C(N)=O --> *C(N)=O
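
Since the goal is grouping, here is a minimal sketch of an alternative that may be simpler: clear the isotope label on each dummy atom directly with SetIsotope(0), then use the canonical SMILES as the grouping key. The helper names strip_dummy_labels and group_by_structure are made up for this example and are not part of RDKit. It assumes the input SMILES can be sanitized, which holds for the examples in the question.

```python
from collections import defaultdict

from rdkit import Chem


def strip_dummy_labels(smiles):
    """Return canonical SMILES with the numeric labels removed from dummy atoms."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    for atom in mol.GetAtoms():
        # Dummy atoms have atomic number 0; the label in [1*] is stored as the isotope
        if atom.GetAtomicNum() == 0:
            atom.SetIsotope(0)
    return Chem.MolToSmiles(mol)


def group_by_structure(smiles_list):
    """Map each label-free canonical SMILES to the input SMILES that share it."""
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[strip_dummy_labels(smi)].append(smi)
    return dict(groups)


smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

groups = group_by_structure(smiles)
for key, members in groups.items():
    print(key, '<--', members)
```

Because the keys are canonical SMILES from sanitized molecules, ring systems are written in aromatic form, so a key may look different from the Kekulé form used in the question while still describing the same structure. If the Kekulé form is needed, the kekuleSmiles=True option shown above can be passed to MolToSmiles after kekulizing the molecule.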