python chemistry rdkit cheminformatics

Why does ReplaceSubstructs() always replace the last C atom?


I'm trying to use the ReplaceSubstructs() function to combine two SMILES strings by replacing a carbon (C) atom in the first molecule with the second molecule. However, no matter which c_options value I pick, RDKit always seems to replace the first or last C atom, and the atom indices also seem to get shuffled.

Here is my code:

from rdkit import Chem

def combine_smile(smile1, smile2, c_options=1): 
    mol1 = Chem.MolFromSmiles(smile1)
    mol2 = Chem.MolFromSmiles(smile2)
    if mol1 is None or mol2 is None:
        raise ValueError("One or both SMILES strings are invalid.")

    index_mol1 = mol_with_atom_index(mol1)

    combined_mol = Chem.ReplaceSubstructs(index_mol1, Chem.MolFromSmarts("[CH3]"), mol2)[c_options]
    combined_smiles = Chem.MolToSmiles(combined_mol)

    print(combined_smiles)

def mol_with_atom_index( mol ): 
    atoms = mol.GetNumAtoms()
    for idx in range( atoms ):
        mol.GetAtomWithIdx( idx ).SetProp( 'molAtomMapNumber', str( mol.GetAtomWithIdx( idx ).GetIdx() ) )
    return mol

if __name__ == "__main__":
    mol1_smiles = "CC(C)(C)Cl" 
    mol2_smiles = "CN(C)C"
    
    combine_smile(mol1_smiles, mol2_smiles)

Output:

CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] # for c_options=0
CN(C)C[C:1]([CH3:0])([CH3:3])[Cl:4] # for c_options=1
CN(C)C[C:1]([CH3:0])([CH3:2])[Cl:4] # for c_options=2

Desired Output:

CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] # for c_options=0
[CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4] # for c_options=1 
[CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4] # for c_options=2

Solution

  • I'm not a chemist, but to me it looks like it replaces the correct elements: the first result replaces [CH3:0], the second replaces [CH3:2], and the last replaces [CH3:3], as you expect.

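    The problem is whether the result can be written in the form you expect. Maybe RDKit rearranges the structure when it generates the output: MolToSmiles() writes canonical SMILES by default, so the atoms are not guaranteed to appear in their original order in the string.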
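One way to check which methyl group was actually replaced is to look at which atom-map numbers survive in each output string. The sketch below uses only the three output strings from the question. The helper name `missing_map_numbers` is made up for this illustration, and it assumes every atom of the original molecule (5 atoms for CC(C)(C)Cl) carried a unique map number, as mol_with_atom_index() sets them.

```python
import re

# Output strings copied from the question, one per c_options value.
outputs = [
    "CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]",
    "CN(C)C[C:1]([CH3:0])([CH3:3])[Cl:4]",
    "CN(C)C[C:1]([CH3:0])([CH3:2])[Cl:4]",
]

def missing_map_numbers(smiles, n_atoms):
    """Return the map numbers in 0..n_atoms-1 that no longer appear in smiles."""
    present = {int(n) for n in re.findall(r":(\d+)\]", smiles)}
    return sorted(set(range(n_atoms)) - present)

# CC(C)(C)Cl has 5 atoms, so the original map numbers were 0..4.
missing = [missing_map_numbers(s, 5) for s in outputs]

for c_options, gone in enumerate(missing):
    print(c_options, "replaced atom(s) with map number:", gone)
```

A different map number is missing for each c_options value, which suggests a different methyl group was replaced each time, even though the replacement is always written at the start of the string.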


    If you only need to display it, you could take the SMILES string, use a regex to find all substrings of the form [CH3:number], replace each one in turn with the string CN(C)C, and keep the results as strings.

    import re
    
    text = '[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]'
    
    found = re.findall(r'\[CH3:\d+\]', text)
    print('found:', found)
    
    results = []
    for item in found: 
        new_item = text.replace(item, 'CN(C)C')
        results.append(new_item)
        print('new_item:', new_item)
    #results = [text.replace(item, 'CN(C)C') for item in found]  # shorter
    
    c_options = 0
    print(c_options, results[c_options])
    

    Result

    found: ['[CH3:0]', '[CH3:2]', '[CH3:3]']
    
    new_item: CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
    new_item: [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
    new_item: [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
    
    0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
    

    If you need to reuse it, you can wrap it in a function:

    import re
    
    def replace_substring(text, pattern, replacement):
        found = re.findall(pattern, text)
    
        results = []
        for item in found: 
            new_item = text.replace(item, replacement)
            results.append(new_item)
        #results = [text.replace(item, replacement) for item in found]  # shorter
    
        return results
    
    # --- main ---
    
    text = '[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]'
    pattern = r'\[CH3:\d+\]'
    replacement = 'CN(C)C'
    
    results = replace_substring(text, pattern, replacement)
    
    for c_options in range(3):
        print(c_options, results[c_options])
    
    #for c_options, item in enumerate(results):
    #    print(c_options, item)
    

    Result

    0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
    1 [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
    2 [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
    

    You may need to learn more about regex to create more complex patterns.
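As an example of a slightly more complex pattern, here is a minimal sketch that also matches methyl groups written without a map number. In that case several matches can be textually identical, and text.replace() would change all of them at once, so this version splices by match position instead. The function name `replace_each_match` and the input string are made up for illustration. It is still plain string manipulation and does not check whether the result is valid chemistry.

```python
import re

def replace_each_match(text, pattern, replacement):
    """Return one new string per match, replacing only that single match."""
    results = []
    for match in re.finditer(pattern, text):
        start, end = match.span()
        # Splice by position so identical substrings elsewhere stay untouched.
        results.append(text[:start] + replacement + text[end:])
    return results

# Illustrative input: two methyl groups have no map number, so they are
# textually identical and text.replace() would change both of them.
text = '[CH3][C:1]([CH3])([CH3:3])[Cl:4]'
pattern = r'\[CH3(?::\d+)?\]'   # [CH3] with an optional ":number" part
replacement = 'CN(C)C'

results = replace_each_match(text, pattern, replacement)

for c_options, item in enumerate(results):
    print(c_options, item)
```

Using re.finditer() gives access to each match's start and end offsets, which is what makes the per-position replacement possible.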