Can I have a columns argument on pd.read_parquet() that filters columns case-insensitively? I have files with the same columns, but some are camelCase, some are all caps, and some are lowercase; it is a mess. I can't read all columns and filter afterwards, and sometimes I have to read directly into pandas.
I know read_csv has a usecols argument that can be a callable, so when the files are CSVs I can do this: pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)
But read_parquet's columns argument can't be a callable. How can I do something similar?
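
For context, here is what the working CSV version looks like on my end (the file name and the column names in cols are just placeholders):

import pandas as pd

filepath = "some_file.csv"

# cols holds my target names, already lowercased (placeholder names here)
cols = {"customer_id", "order_date", "amount"}

# usecols accepts a callable, so the match can be made case-insensitive
df = pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)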
This is only a workaround, but one option is to use dask to lazily load the parquet, inspect the column list, pick the columns of interest, and then do the actual load (or continue in the lazy fashion).
Here's a rough sketch:
from dask.dataframe import read_parquet

# cols is your set of lowercase target names, as in the question
ddf = read_parquet("some_parquet")  # lazy: only the metadata is read here

# match the actual column names case-insensitively
cols_of_interest = [c for c in ddf.columns if c.lower() in cols]

# re-read with just those columns and continue with the dask.dataframe
ddf = read_parquet("some_parquet", columns=cols_of_interest)

# or convert to pandas, if necessary
df = ddf.compute()
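
If you need this in more than one place, the same two-step read can be wrapped in a small helper. This is just a sketch of the code above; the helper name and the target names ("id", "name", "value") are made-up placeholders:

from dask.dataframe import read_parquet

def read_parquet_ci(path, lowercase_cols):
    # lazily read the metadata, match column names case-insensitively,
    # then load only the matching columns into a pandas DataFrame
    keep = [c for c in read_parquet(path).columns if c.lower() in lowercase_cols]
    return read_parquet(path, columns=keep).compute()

df = read_parquet_ci("some_parquet", {"id", "name", "value"})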