I need to insert a series of dictionaries, generated inside a for loop, into a DataFrame. The dictionary keys should become the column labels, and each dictionary's values should fill one row.
def description_all(df):
    full_descriptor = pd.DataFrame()
    for index, row in df.iterrows():
        mol = Chem.MolFromSmiles(df.at[index, 'smiles'])
        vals = Descriptors.CalcMolDescriptors(mol)
        ...  # I need this part
A short explanation of the code:
I'm using the RDKit library (a cheminformatics library). Basically, I pass in a DataFrame that contains a list of SMILES strings (a SMILES is like an ID for a substance). Chem.MolFromSmiles(...) converts each one into a mol object, and Descriptors.CalcMolDescriptors(mol) then returns a dictionary.
So I need a way to add that dictionary to a DataFrame on every iteration of the loop.
Edit:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
# example input
data = [
{"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
{"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
{"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"}
]
df = pd.DataFrame(data)
The output I expect comes from converting the dictionary returned by Descriptors.CalcMolDescriptors(mol) into a DataFrame. The dictionary has 210 keys, each with a single float value computed by the function, so I would like to create a DataFrame like this:
MaxAbsEStateIndex | MaxEstateindex | MinEstateIndex | qed | SPS | and so on |
---|---|---|---|---|---|
11.038589741337306 | 11.038589741337306 | 0.05306747081548657 | -0.12265326012304678 | 0.4358923373790937 | .... |
10.528754724111867 | 10.528754724111867 | 0.07197530864197499 | -1.0765532879818591 | 0.7631338366688623 | .... |
The column labels are the dictionary keys (every dictionary returned by Descriptors.CalcMolDescriptors(mol) has the same keys), and each row holds one molecule's values under the corresponding labels.
I have adapted a solution from the official documentation on Descriptor Calculation.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
# create df
data = [
{"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
{"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
{"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"}
]
df = pd.DataFrame(data)
# get mol objects from smiles
df['mol'] = df['SMILES'].apply(Chem.MolFromSmiles)
# get descriptors, convert to list for dataframe creation
list_of_descriptors = df['mol'].apply(Descriptors.CalcMolDescriptors).to_list()
full_descriptor = pd.DataFrame(list_of_descriptors)
Explanation:
1. Apply Chem.MolFromSmiles() to each element in the 'SMILES' column and store the resulting object in a newly created 'mol' column.
2. Apply Descriptors.CalcMolDescriptors() to every mol object and store the resulting dictionaries as a list of dictionaries.
3. Create the full_descriptor DataFrame from the list of dictionaries. The keys will be used as columns.

Alternatively, you can store the mol objects in a temporary variable instead of a new column in the original DataFrame:
mols = df['SMILES'].apply(Chem.MolFromSmiles)
list_of_descriptors = mols.apply(Descriptors.CalcMolDescriptors).to_list()
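If you also want to keep the ID next to the descriptor values, the descriptor frame can be lined up with the original DataFrame by index. The following is only a sketch: the keys desc_a and desc_b and their values are placeholders standing in for the real CalcMolDescriptors output, so it runs without RDKit.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
        {"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
        {"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"},
    ]
)

# Placeholder dictionaries (hypothetical keys and values), one per row of df,
# standing in for the list produced by Descriptors.CalcMolDescriptors.
list_of_descriptors = [
    {"desc_a": 1.0, "desc_b": 2.0},
    {"desc_a": 3.0, "desc_b": 4.0},
    {"desc_a": 5.0, "desc_b": 6.0},
]

# Reuse df's index so the rows of both frames line up one to one.
full_descriptor = pd.DataFrame(list_of_descriptors, index=df.index)

# Put the ID column in front of the descriptor columns.
result = pd.concat([df[["ID"]], full_descriptor], axis=1)
print(result)
```

Passing index=df.index matters mainly when df has been filtered or re-indexed; with the default 0..n-1 index the two frames already align.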
In general, if you want to apply a function to every row/column of a DataFrame/Series, the .apply() method is typically preferred for better readability. Furthermore, you can often use the resulting DataFrame/Series directly without any further processing.
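If you would rather keep an explicit for loop, as in the question, one common pattern is to collect the dictionaries in a plain list and build the DataFrame once at the end. Below is a minimal sketch of that pattern. Note that toy_descriptors and describe_all are hypothetical names, and toy_descriptors is only a stand-in for the RDKit call so that the snippet runs without RDKit installed.

```python
import pandas as pd


def toy_descriptors(smiles: str) -> dict:
    """Hypothetical stand-in for the RDKit descriptor call: returns a flat dict of floats."""
    return {
        "n_chars": float(len(smiles)),
        "n_digits": float(sum(ch.isdigit() for ch in smiles)),
    }


def describe_all(df: pd.DataFrame, calc=toy_descriptors, smiles_col: str = "SMILES") -> pd.DataFrame:
    rows = []  # one dictionary per molecule
    for smiles in df[smiles_col]:
        rows.append(calc(smiles))
    # Build the DataFrame once; the dictionary keys become the column labels.
    return pd.DataFrame(rows, index=df.index)


df = pd.DataFrame(
    [
        {"ID": "ZINC000000793995", "SMILES": "COc1cccc(CNc2ccc(S(=O)(=O)Nc3nccs3)cc2)c1O"},
        {"ID": "ZINC000000579895", "SMILES": "COc1cc(-c2ccc(O)cc2)c(OC)c(O)c1-c1ccc(OCC=C(C)C)c(O)c1"},
        {"ID": "ZINC000000532501", "SMILES": "O=C(O)c1cc2cc(Cl)ccc2o1"},
    ]
)

full_descriptor = describe_all(df)
print(full_descriptor)
```

With RDKit available, the same loop should work by passing calc=lambda s: Descriptors.CalcMolDescriptors(Chem.MolFromSmiles(s)). Collecting into a list first avoids growing a DataFrame row by row, which is slow.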
Don't forget to accept this answer if it solves your problem ;)