
How to change variable type when working with pandas-profiling?


To reproduce the issue, the notebook, data, and output are available here: github link
My dataset has a Contract variable/column whose values all look like numbers but are actually categorical.

When the data is read with pandas, the info says the column is read as int. Since the Contract variable is a category (according to the metadata I received), I manually changed the variable type as shown below:

df['Contract'] = df['Contract'].astype('category')
df.dtypes # shows modified dtype now
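
As a minimal sketch of the cast above, assuming a small made-up DataFrame in place of the real data (the values and the Amount column are invented for illustration), the dtype change can be checked like this:

```python
import pandas as pd

# Made-up stand-in for the real data: contract codes stored as integers.
df = pd.DataFrame({"Contract": [101, 102, 101, 103],
                   "Amount": [10.5, 20.0, 7.25, 3.0]})

dtype_before = df["Contract"].dtype               # integer dtype, as read by pandas
df["Contract"] = df["Contract"].astype("category")
dtype_after = df["Contract"].dtype                # now a CategoricalDtype

print(dtype_before, "->", dtype_after)
# The distinct codes are kept as category labels rather than as quantities.
print(list(df["Contract"].cat.categories))
```

The string pandas accepts for this dtype is 'category'; other columns keep their original dtypes.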

I then tried to generate a report with pandas_profiling. The generated report shows Contract interpreted as a real number, even though I changed its type from int to str/category.

from pandas_profiling import ProfileReport

# Tried both; the result was the same.
ProfileReport(df)
df.profile_report()


Can you explain the right way to control how pandas_profiling interprets data types, i.e., how to make it treat the Contract variable as categorical?


Solution

  • A long time after posting this question, raising an issue, and creating a pull request for it on the pandas-profiling GitHub page, I had almost forgotten about it. I thank IampShadesDrifter for reminding me to close this question by answering.

    This behavior of pandas-profiling is actually expected. pandas-profiling tries to infer the data type that best suits each column, and that is how it was written originally. Since there was no way around it, this drove me to create my first ever pull request on GitHub.

    Now with the newly added parameter infer_dtypes in ProfileReport / profile_report, we can explicitly ask pandas-profiling not to infer any data type, but rather use the data type from pandas (df.dtypes).

    # for the df in the question,
    
    df['Contract'] = df['Contract'].astype('category')
    
    # `Contract` is now treated as `category`, as type-cast above.
    # To make `pandas-profiling` use the dtypes as understood by pandas
    # instead of inferring them on its own, set `infer_dtypes=False`.
    ProfileReport(df, infer_dtypes=False) # or
    df.profile_report(infer_dtypes=False)
    

    Please feel free to contribute to this answer if you find anything worth mentioning.
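
For reference, the whole workflow can be put together roughly as follows. This is only a sketch under stated assumptions: the helper names cast_to_category and profile_with_pandas_dtypes are hypothetical (they are not part of pandas or pandas-profiling), the sample data is made up, and the report call assumes an installed pandas-profiling version that supports infer_dtypes.

```python
import pandas as pd


def cast_to_category(df, columns):
    """Hypothetical helper: return a copy of df with `columns` cast to 'category'."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def profile_with_pandas_dtypes(df):
    """Hypothetical helper: build a report that uses df.dtypes instead of inferring types."""
    # Imported lazily so the casting helper works even without the package installed.
    from pandas_profiling import ProfileReport
    return ProfileReport(df, infer_dtypes=False)


# Made-up data standing in for the dataset in the question.
raw = pd.DataFrame({"Contract": [101, 102, 101], "Amount": [10.5, 20.0, 7.25]})
typed = cast_to_category(raw, ["Contract"])
print(typed.dtypes)

# Uncomment once a pandas-profiling version with `infer_dtypes` is installed:
# report = profile_with_pandas_dtypes(typed)
# report.to_file("report.html")
```

Casting on a copy keeps the original DataFrame untouched, which makes it easy to compare a report built with inferred types against one built from the pandas dtypes.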