I have a large dataset I need to read into a pandas DataFrame.
It contains a lot of categorical data consisting of rather long strings.
Using the pandas read_sql_query method, I can't seem to specify which columns should be treated as categorical data.
This means I run into memory issues.
I have a background in R, where I can specify something like stringsAsFactors, meaning you can have long strings with a small memory footprint since they are indexed as integers. Can't I do the same in Python/pandas?
I would like to do it as I read the data from the database, not after. Converting strings to the category dtype in pandas is easy once the data is in a DataFrame, but that is not what I'm looking for.
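For reference, the after-the-fact conversion mentioned above (and the memory saving it gives, analogous to R factors) can be sketched like this; the column contents here are made up for illustration:

```python
import pandas as pd

# A column of repeated long strings, as it would come back from read_sql_query
s = pd.Series(["some_rather_long_categorical_label"] * 100_000)

# category dtype stores each unique string once plus small integer codes,
# much like an R factor
cat = s.astype("category")

print(s.memory_usage(deep=True))    # every string object counted
print(cat.memory_usage(deep=True))  # one string + integer codes
```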
I understand that I could simply encode the data in the database, but I would like to avoid that.
I'm afraid that, currently, encoding on the DB side (which can be done with a JOIN against a mapping table) is the only viable option.
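A minimal sketch of that DB-side encoding, using an in-memory SQLite database and made-up table/column names: the long strings live once in a mapping table, only integer codes cross the wire, and the categorical is rebuilt in pandas from the codes plus the small mapping table.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE labels (code INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE facts (label TEXT);
    INSERT INTO labels VALUES (0, 'very_long_string_a'), (1, 'very_long_string_b');
    INSERT INTO facts VALUES ('very_long_string_a'),
                             ('very_long_string_b'),
                             ('very_long_string_a');
""")

# Read only the integer codes for the big table...
codes = pd.read_sql_query(
    "SELECT l.code AS code FROM facts f JOIN labels l ON f.label = l.name", con)
# ...and the small mapping table once
mapping = pd.read_sql_query("SELECT code, name FROM labels ORDER BY code", con)

# Rebuild the categorical column from codes + categories
col = pd.Categorical.from_codes(codes["code"], categories=mapping["name"])
```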
There were a few similar feature requests:
Reading data in chunks and converting each chunk to category
Supporting this via dtype might be tricky, as one might need to join the categories from all chunks.