Can I have a columns argument on pd.read_parquet() that filters columns case-insensitively? I have files with the same columns, but some are camelCase, some are all caps, and some are lowercase; it is a mess. I can't read all columns and filter afterwards, and sometimes I have to read directly into pandas.
I know read_csv has a usecols argument that can be a callable, so when the files are CSVs I can do this: pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)
But read_parquet's columns argument can't be a callable. How can I do something similar?
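
For context, here is what the working CSV version looks like on my end (the file name and the column names in cols are just placeholders):

import pandas as pd

filepath = "some_file.csv"

# cols holds my target names, already lowercased (placeholder names here)
cols = {"customer_id", "order_date", "amount"}

# usecols accepts a callable, so the match can be made case-insensitive
df = pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)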
This is only a workaround, but one option is to use dask to lazily load the parquet, inspect the column list, pick the columns of interest, and then do the actual load (or continue in the lazy fashion).
Here's a rough sketch:
from dask.dataframe import read_parquet

# cols is your set of lowercase target names, as in the question
ddf = read_parquet("some_parquet")  # lazy: only the metadata is read here

# match the actual column names case-insensitively
cols_of_interest = [c for c in ddf.columns if c.lower() in cols]

# re-read with just those columns and continue with the dask.dataframe
ddf = read_parquet("some_parquet", columns=cols_of_interest)

# or convert to pandas, if necessary
df = ddf.compute()
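
If you need this in more than one place, the same two-step read can be wrapped in a small helper. This is just a sketch of the code above; the helper name and the target names ("id", "name", "value") are made-up placeholders:

from dask.dataframe import read_parquet

def read_parquet_ci(path, lowercase_cols):
    # lazily read the metadata, match column names case-insensitively,
    # then load only the matching columns into a pandas DataFrame
    keep = [c for c in read_parquet(path).columns if c.lower() in lowercase_cols]
    return read_parquet(path, columns=keep).compute()

df = read_parquet_ci("some_parquet", {"id", "name", "value"})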