I have some large Parquet files in Azure Blob Storage and I am processing them with Python polars.
Is there any way to read only some columns/rows of the file?
Currently I'm downloading the whole file into an io.BytesIO buffer and calling pl.read_parquet(...) on it.
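For reference, a minimal sketch of my current approach, assuming the azure-storage-blob SDK (the connection string, container, and file names are placeholders):

import io
import polars as pl
from azure.storage.blob import BlobClient

# Download the entire blob into memory, then parse it with polars.
blob = BlobClient.from_connection_string(
    conn_str="CONNECTION_STRING",
    container_name="CONTAINER",
    blob_name="FILE_NAME.parquet",
)
buf = io.BytesIO(blob.download_blob().readall())
df = pl.read_parquet(buf)  # reads every column and row of the file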
You can do it the following way:
import polars as pl
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem
abfs = AzureBlobFileSystem(
    account_name="ACCOUNT_NAME",
    account_key="ACCOUNT_KEY",
)

df = pl.scan_pyarrow_dataset(
    ds.dataset('az://CONTAINER/FILE_NAME.parquet', filesystem=abfs)
)

# then you can query your LazyFrame
df.select('column1', 'column2').collect()
EDIT: As of 2023-03-19, projection pushdown (reading only the requested columns) reduces processing time. However, slice pushdown (limiting the number of rows retrieved) does not seem to reduce processing time; see the sketch below.
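For illustration, a sketch of slice pushdown on the LazyFrame above (column names are placeholders; LazyFrame.head is a standard polars method):

# Slice pushdown: request only the first 100 rows of the selected columns.
# Per the note above, this does not currently appear to reduce processing time.
df.select('column1', 'column2').head(100).collect()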
Sources:
Apache Arrow documentation: https://arrow.apache.org/docs/python/parquet.html#reading-parquet-and-memory-mapping
adlfs GitHub page: https://github.com/fsspec/adlfs