To reproduce the issue, the notebook, data, and output are here: github link
My dataset has a Contract variable/column. Its values all look like numbers, but they are actually categorical.
When the data is read with pandas, the info output reports the column as int. Since the metadata I received says Contract is a category, I manually changed the column's type as shown below:
df['Contract'] = df['Contract'].astype('category')
df.dtypes # shows modified dtype now
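A quick way to confirm that the cast took effect is to compare the column's dtype before and after. This is a minimal sketch with made-up Contract values (not the real dataset); note that the dtype string pandas uses for categoricals is 'category'.

```python
import pandas as pd

# Made-up stand-in for the real data: numeric-looking contract codes.
df = pd.DataFrame({"Contract": [1, 2, 1, 3, 2]})

before = str(df["Contract"].dtype)  # an integer dtype straight after loading

# Cast to pandas' categorical dtype.
df["Contract"] = df["Contract"].astype("category")
after = str(df["Contract"].dtype)

print(before, "->", after)
print(list(df["Contract"].cat.categories))  # the distinct codes, now category levels
```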
I then tried to generate a report with pandas_profiling. The generated report shows that Contract is interpreted as a real number, even though I changed the type from int to str/category.
# Tried both; the result was the same.
ProfileReport(df)
df.profile_report()
Can you explain the right way to control how pandas_profiling interprets data types, i.e. how to make it treat the Contract variable as categorical?
Long after posting this question, raising an issue, and creating a pull request for it on the pandas-profiling GitHub page, I had almost forgotten about this question. Thanks to IampShadesDrifter for reminding me to close it by answering.
This behavior of pandas-profiling is actually expected. pandas-profiling tries to infer the data type that best suits each column, and that is how it was originally written. Since there was no way to override this, it drove me to create my first ever pull request on GitHub.
Now, with the newly added infer_dtypes parameter in ProfileReport / profile_report, we can explicitly ask pandas-profiling not to infer any data type, but rather to use the data types from pandas (df.dtypes).
# for the df in the question,
df['Contract'] = df['Contract'].astype('category')
# `Contract` will now be treated as `category`, as type-cast above.
# `pandas-profiling` then does not infer dtypes on its own; it uses the dtypes as understood by pandas.
# For this we have to set `infer_dtypes=False`.
ProfileReport(df, infer_dtypes=False) # or
df.profile_report(infer_dtypes=False)
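Because the report then relies entirely on df.dtypes, it can help to cast every column the metadata marks as categorical before building the report. The sketch below uses pandas only. The column names, the values, and the helper name cast_categoricals are hypothetical and not part of pandas or pandas-profiling.

```python
import pandas as pd


def cast_categoricals(df, columns):
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Made-up data: two numeric-looking code columns and one true numeric column.
raw = pd.DataFrame({
    "Contract": [1, 2, 1, 3],
    "Region": [10, 20, 10, 30],
    "Amount": [99.5, 20.0, 35.25, 12.0],
})

# Columns flagged as categorical in the (hypothetical) metadata.
typed = cast_categoricals(raw, ["Contract", "Region"])
print(typed.dtypes)
```

The resulting frame can then be passed to ProfileReport with infer_dtypes=False as shown above, so that the report uses these dtypes as they are.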
Please feel free to contribute to this answer if you find anything worth mentioning.