python database pandas categorical-data

Python pandas string handling categorical data from SQL database


I have a large dataset I need to read into a pandas DataFrame.

It contains a lot of categorical data consisting of some rather long strings.

Using the pandas read_sql_query method, I can't seem to specify which columns should be treated as categorical data.

This means I run into memory issues.

I have a background in R, where I can specify things like strings as factors. That means you can have long strings with a small memory footprint, since R stores them as integer indices. Can't I do the same in Python/pandas?

I would like to do this as I read the data from the database, not afterwards. Converting strings to category in pandas is easy once the data is in a DataFrame, but that is not what I'm looking for.

I understand that I could simply encode the data in the database, but I would like to avoid that.


Solution

  • I'm afraid that, currently, encoding on the DB side (this can be done using a JOIN with a mapping table) is the only viable option.

    There were a few similar feature requests:

    Reading the data in chunks and converting each chunk to category dtype might be tricky, as one might need to join the categories from all chunks.
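For what it's worth, the chunked approach can be sketched roughly as below. This is only an illustration under some assumptions, not a built-in pandas feature: the helper name `read_sql_categorical`, the `events` table and its columns are all made up for the demo, and an in-memory SQLite database stands in for the real one. It uses `pandas.api.types.union_categoricals` to join the per-chunk categories, and it assumes the query returns at least one row.

```python
import sqlite3

import pandas as pd
from pandas.api.types import union_categoricals


def read_sql_categorical(query, con, cat_cols, chunksize=50_000):
    """Hypothetical helper: read a query in chunks and return a DataFrame
    whose `cat_cols` columns have category dtype."""
    chunks = []
    for chunk in pd.read_sql_query(query, con, chunksize=chunksize):
        # Convert each chunk right away so the long strings are only held
        # once per distinct value within the chunk.
        for col in cat_cols:
            chunk[col] = chunk[col].astype("category")
        chunks.append(chunk)

    # Concatenate the non-categorical columns normally.
    result = pd.concat(
        [c.drop(columns=cat_cols) for c in chunks], ignore_index=True
    )
    # A plain concat of categoricals with different categories would fall
    # back to object dtype, so merge the categories explicitly instead.
    for col in cat_cols:
        result[col] = union_categoricals([c[col] for c in chunks])

    # Restore the original column order.
    return result[list(chunks[0].columns)]


# Demo with a made-up table in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, label TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"a rather long string number {i % 3}") for i in range(10)],
)
df = read_sql_categorical(
    "SELECT id, label FROM events ORDER BY id", con, ["label"], chunksize=4
)
print(df.dtypes)
```

Note that every chunk still carries its own copy of the categories until the final merge, and some database drivers may fetch the whole result set on the client side even when `chunksize` is given, so the actual memory savings depend on the data and the driver.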