I have some large Parquet files in Azure Blob Storage and I am processing them with Python polars.
Is there any way to read only some columns/rows of the file?
Currently I'm downloading the whole file into an io.BytesIO buffer and calling pl.read_parquet(...) on it.
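For reference, a minimal sketch of my current approach, assuming the azure-storage-blob SDK (the connection string, container, and file names are placeholders):

import io
import polars as pl
from azure.storage.blob import BlobClient

# Download the entire blob into memory, then parse it with polars.
blob = BlobClient.from_connection_string(
    conn_str="CONNECTION_STRING",
    container_name="CONTAINER",
    blob_name="FILE_NAME.parquet",
)
buf = io.BytesIO(blob.download_blob().readall())
df = pl.read_parquet(buf)  # reads every column and row of the file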
You can do it the following way:
import polars as pl
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem
abfs = AzureBlobFileSystem(
    account_name="ACCOUNT_NAME",
    account_key="ACCOUNT_KEY",
)

df = pl.scan_pyarrow_dataset(
    ds.dataset('az://CONTAINER/FILE_NAME.parquet', filesystem=abfs)
)

# then you can query your LazyFrame
df.select('column1', 'column2').collect()
EDIT: As of 2023-03-19, projection pushdown (reading only the requested columns) reduces processing time. However, slice pushdown (limiting the number of rows retrieved) does not seem to reduce processing time; see the sketch below.
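For illustration, a sketch of slice pushdown on the LazyFrame above (column names are placeholders; LazyFrame.head is a standard polars method):

# Slice pushdown: request only the first 100 rows of the selected columns.
# Per the note above, this does not currently appear to reduce processing time.
df.select('column1', 'column2').head(100).collect()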
Sources:
Apache Arrow documentation: https://arrow.apache.org/docs/python/parquet.html#reading-parquet-and-memory-mapping
adlfs GitHub page: https://github.com/fsspec/adlfs