Hi, I want to group identical molecular structures by their SMILES codes.
However, even when the structures are the same, they are difficult to group because their dummy atoms are labelled differently.
I'm using RDKit (version 2022.3.4) and have tried changing several options, but I haven't found a solution yet. I would appreciate your help.
Example SMILES: (same structure but different SMILES code -> desired code format)
It sounds a little weird, but you can replace any AnyAtom (a dummy atom, with or without a number) with a plain AnyAtom that has no number.
You can use ReplaceSubstructs() for this.
from rdkit import Chem

smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

search_patt = Chem.MolFromSmiles('*')  # finds AnyAtom with or without numbers
sub_patt = Chem.MolFromSmiles('*')     # AnyAtom without numbers

for s in smiles:
    m = Chem.MolFromSmiles(s, sanitize=False)
    # ReplaceSubstructs() returns a tuple of molecules; the first one is used here
    new_m = Chem.ReplaceSubstructs(m, search_patt, sub_patt, replaceAll=True)
    print(s, '-->', Chem.MolToSmiles(new_m[0], kekuleSmiles=True))
Output:
[1*]C(=O)OC --> *C(=O)OC
[13*]C(=O)OC --> *C(=O)OC
[31*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[5*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[45*]C(N)=O --> *C(N)=O
[5*]C(N)=O --> *C(N)=O
[19*]C(N)=O --> *C(N)=O
[16*]C(N)=O --> *C(N)=O
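
If the end goal is the grouping itself, here is a minimal sketch of an alternative you could try: instead of ReplaceSubstructs(), clear the label on each dummy atom directly and use the canonical SMILES as the group key. The helper name strip_dummy_labels is made up for illustration (it is not an RDKit function), and the sketch assumes the numbers are isotope-style labels such as [13*]. It also lets RDKit sanitize the molecules, so the keys come out in aromatic form rather than the Kekulé form printed above.

```python
from collections import defaultdict

from rdkit import Chem


def strip_dummy_labels(smiles):
    """Return canonical SMILES with the isotope label removed from dummy atoms.

    Hypothetical helper for illustration; not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0:  # dummy atom ('*')
            atom.SetIsotope(0)        # e.g. [13*] becomes *
    return Chem.MolToSmiles(mol)      # canonical SMILES by default


smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

# Map each label-free canonical SMILES to the original strings that share it
groups = defaultdict(list)
for s in smiles:
    groups[strip_dummy_labels(s)].append(s)

for key, members in groups.items():
    print(key, '<--', members)
```

Because the key is a canonical SMILES, two inputs that write the same structure in a different atom order should still land in the same group. If your dummy atoms carry atom-map numbers instead (for example [*:1]), those would need clearing as well, e.g. with atom.SetAtomMapNum(0).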