How are you all? Hope you're doing well!
So, get this. I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from the Cambridge Structural Database would do wonders for me.
So far, I was able to convert them using ToposPro, but not to a format that I can load with pandas' read functions.
So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.
Also, sorry for any mistakes in my writing, English isn't my main language. And thanks for your help!
To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from biopython to first parse the .CIF file and return a dictionary. Then, use pandas.DataFrame.from_dict to create a DataFrame from that dictionary. Finally, call pandas.DataFrame.transpose to turn the rows into columns (we pass orient='index' to from_dict so that entries of different lengths are padded with "missing" values instead of raising an error).
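To see why the orient='index' plus transpose combination is needed, here is a minimal sketch on a hand-made dictionary. The dictionary `toy` and its two tags are made up for illustration; it only imitates the general shape I'd expect from MMCIF2Dict (tags mapped to lists of strings of varying length), so treat that as an assumption rather than real parser output.

```python
import pandas as pd

# Hypothetical stand-in for a parsed CIF dictionary:
# one tag holds a single value, another holds three.
toy = {
    "_cell_length_a": ["10.5"],
    "_atom_site_label": ["Zn1", "O1", "C1"],
}

# pd.DataFrame(toy) would raise a ValueError because the lists differ in length.
# With orient='index', each tag becomes a row and short rows are padded
# with missing values.
df = pd.DataFrame.from_dict(toy, orient="index")

# Transposing turns the tags back into columns.
df = df.transpose()

print(df)
```

Each tag ends up as a column, and the shorter column is filled with missing values below its last real entry.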
You need to install biopython by executing this line in your (Windows) terminal:
pip install biopython
Then, you can use the code below to read a specific .CIF file:
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
display(df)  # in a Jupyter notebook; use print(df) in a plain script
Now, if you need to read the whole MOF collection (~10k files) as a single DataFrame, you can use this:
from pathlib import Path
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from time import time
mof_collection = r"path_to_the_MOF_collection"
start = time()
list_of_cif = []
for file in Path(mof_collection).glob('*.cif'):
    dico = MMCIF2Dict(file)
    temp = pd.DataFrame.from_dict(dico, orient='index')
    temp = temp.transpose()
    temp.insert(0, 'Filename', Path(file).stem)  # to get the .CIF filename
    list_of_cif.append(temp)
df = pd.concat(list_of_cif)
end = time()
print(f'The DataFrame of the MOF Collection was created in {end-start} seconds.')
df
I'm sure you're aware that the .CIF files may have different numbers of columns, so feel free to concatenate the MOF collection or keep the files as separate DataFrames. And last but not least, if you want to get a .csv and/or an .xlsx file of your DataFrame, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:
df.to_csv('your_output_filename.csv', index=False)
df.to_excel('your_output_filename.xlsx', index=False)
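On the point about files having different columns, here is a small sketch of what pd.concat does in that case. The two frames and their values are invented for illustration and are not taken from the MOF collection.

```python
import pandas as pd

# Two hypothetical per-file frames that share only the 'Filename' column.
a = pd.DataFrame({"Filename": ["mof_a"], "_cell_length_a": ["10.5"]})
b = pd.DataFrame({"Filename": ["mof_b"], "_cell_volume": ["1200.0"]})

# concat keeps the union of the columns and fills the gaps with NaN;
# ignore_index=True gives the result a fresh 0..n-1 row index.
combined = pd.concat([a, b], ignore_index=True)

print(combined)
```

With thousands of files this can produce a very wide and mostly empty table, which is one reason you might prefer to keep only the columns you need before concatenating.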
To read the structure of a .CIF file as a DataFrame, you can use the as_dataframe() method from pymatgen:
from pymatgen.io.cif import CifParser
parser = CifParser("abavij_P1.cif")
structure = parser.get_structures()[0]
structure.as_dataframe()
In case you need to check if a .CIF file has a valid structure, you can use:
if len(structure) == 0:
    print('The .CIF file has no structure')
Or:
try:
    structure = parser.get_structures()[0]
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('The .CIF file has no structure')