regressionpredictionchemistryrdkitcheminformatics

How do I get molecular structural information from SMILES


My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example if glycerol is the input the answer would be 3 x -OH , 2x -CH2 and 1x -CH.

I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.

I'm already familiar with -rdkit and the morganfingerprint but that is not what i'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?


Solution

  • For most of the structures, there's no existing option to find the fragments. However, there's a module in rdkit that can provide you the number of fragments especially when it's a function group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that

    from rdkit.Chem.Fragments import fr_Al_OH
    fr_Al_OH(mol)
    

    or the following would return the number of aromatic -OH groups:

    from rdkit.Chem.Fragments import fr_Ar_OH
    fr_Ar_OH(mol)
    

    Similarly, there are 83 more functions available. Some of them would be useful for your task. For the ones, you don't get the pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement them for your features. But as you already mentioned, the way would be to define a SMARTS string and then fragment matching. The fragment matching module can be found here.