python pandas dictionary rdkit

How can I insert several dictionaries into a DataFrame?


I generate a series of dictionaries in a for loop and need to insert them into a DataFrame. I would like the dictionary keys to become the column labels of the DataFrame, and the values of each dictionary to fill one row.

def description_all(df):
    full_descriptor = pd.DataFrame()
    for index, row in df.iterrows():
        # the column name matches the example input given under "Edit"
        mol = Chem.MolFromSmiles(df.at[index, 'SMILES'])
        vals = Descriptors.CalcMolDescriptors(mol)
        ...  # I need this part

A brief explanation of the code

I'm currently using the RDKit library (a cheminformatics library). Basically, I pass in a DataFrame that contains a column of SMILES strings (a SMILES string is like an ID for a substance). After converting each one with Chem.MolFromSmiles(...) I obtain a Mol object, and then Descriptors.CalcMolDescriptors(mol) gives me a dictionary.
So I need a way to put that dictionary into a DataFrame on every iteration of the for loop.

Edit

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# example input
data = [
        {"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
        {"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
        {"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"}
       ]
df = pd.DataFrame(data)

The output I expect comes from converting the dictionary returned by Descriptors.CalcMolDescriptors(mol) into a DataFrame. The dictionary has 210 keys, and each key has exactly one float value (computed by the function), so I would like to create a DataFrame that looks like this:

MaxAbsEStateIndex MaxEstateindex MinEstateIndex qed SPS and so on
11.038589741337306 11.038589741337306 0.05306747081548657 -0.12265326012304678 0.4358923373790937 ....
10.528754724111867 10.528754724111867 0.07197530864197499 -1.0765532879818591 0.7631338366688623 ....

So the column labels are the dictionary keys (every dictionary returned by Descriptors.CalcMolDescriptors(mol) has the same keys), and each row holds the values that correspond to those keys.


Solution

  • I have adapted a solution from the official documentation on Descriptor Calculation.

    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    
    
    # create df
    data = [
            {"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
            {"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
            {"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"}
           ]
    df = pd.DataFrame(data)
    
    # get mol objects from smiles
    df['mol'] = df['SMILES'].apply(Chem.MolFromSmiles)
    
    # get descriptors, convert to list for dataframe creation
    list_of_descriptors = df['mol'].apply(Descriptors.CalcMolDescriptors).to_list()
    full_descriptor = pd.DataFrame(list_of_descriptors)
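
The last step relies on how pandas treats a list of dictionaries: each dictionary becomes one row and each key becomes a column. Here is a minimal sketch of just that step, so it runs without RDKit. The descriptor names and numbers below are made-up placeholders, not real RDKit output, and the IDs are hypothetical.

```python
import pandas as pd

# Placeholder dictionaries standing in for what CalcMolDescriptors returns
# per molecule; keys and values are invented for illustration only.
records = [
    {"MolWt": 100.0, "qed": 0.5},
    {"MolWt": 200.0, "qed": 0.7},
]

# One row per dictionary, one column per key
full_descriptor = pd.DataFrame(records)

# Hypothetical IDs; both frames use the default 0..n-1 index,
# so concatenating along the columns lines the rows up one to one.
ids = pd.DataFrame({"ID": ["mol_a", "mol_b"]})
result = pd.concat([ids, full_descriptor], axis=1)

print(result)
```

If you want to keep the identifiers next to the descriptors, the same pd.concat(..., axis=1) call should work with df[['ID']] and full_descriptor, as long as both still share the same index.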
    

    Explanation:

    Alternatively, you can also store the mol objects in a temporary variable instead of a new column in the original DataFrame:

    mols = df['SMILES'].apply(Chem.MolFromSmiles)
    list_of_descriptors = mols.apply(Descriptors.CalcMolDescriptors).to_list()
    

    In general, if you want to apply a function to every element of a Series (or to every row/column of a DataFrame), .apply() is usually preferred because it reads better. Often you can also use the resulting DataFrame/Series directly, without any further processing.
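
If you would rather keep the explicit for loop from the question, one way is to collect the dictionaries in a list and build the DataFrame once at the end. The sketch below compares that with the .apply() version. To stay runnable without RDKit it uses toy_descriptors, a hypothetical stand-in for Descriptors.CalcMolDescriptors that just counts characters; it is not a real descriptor calculation.

```python
import pandas as pd


def toy_descriptors(smiles):
    """Hypothetical stand-in for CalcMolDescriptors: returns one dict per input."""
    return {"length": len(smiles), "n_oxygen": smiles.count("O")}


df = pd.DataFrame({"SMILES": ["CCO", "O=C=O"]})

# Loop version: gather one dict per row, then create the DataFrame once
rows = []
for smiles in df["SMILES"]:
    rows.append(toy_descriptors(smiles))
loop_result = pd.DataFrame(rows)

# .apply() version: same dictionaries, collected via Series.apply
apply_result = pd.DataFrame(df["SMILES"].apply(toy_descriptors).to_list())

# Both approaches produce the same table
print(loop_result.equals(apply_result))
```

Building the DataFrame once from a list tends to be simpler and faster than growing a DataFrame inside the loop. With real SMILES input, note that Chem.MolFromSmiles returns None for strings it cannot parse, so you may want to filter those out before computing descriptors.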


    Don't forget to accept this answer if it solves your problem ;)