Tags: python, dataframe, azure-blob-storage, python-polars

How to read only some columns of a parquet file from Azure blob storage?


I have some large parquet files in Azure blob storage and I am processing them using python polars.

Is there any way to read only some columns/rows of the file?

Currently I'm downloading the whole file into an io.BytesIO buffer and calling pl.read_parquet(...) on it.
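
Roughly, the current approach looks like this (just a sketch, assuming the azure-storage-blob client; the connection string and blob names are placeholders):

    import io

    import polars as pl
    from azure.storage.blob import BlobClient

    # placeholder credentials; replace with your own
    blob = BlobClient.from_connection_string(
        conn_str="CONNECTION_STRING",
        container_name="CONTAINER",
        blob_name="FILE_NAME.parquet",
    )

    # this downloads the entire blob into memory before parsing
    buffer = io.BytesIO(blob.download_blob().readall())
    df = pl.read_parquet(buffer)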


Solution

  • You can do it the following way:

    import polars as pl
    import pyarrow.dataset as ds
    from adlfs import AzureBlobFileSystem
    
    abfs = AzureBlobFileSystem(
        account_name="ACCOUNT_NAME", 
        account_key="ACCOUNT_KEY")
    
    df = pl.scan_pyarrow_dataset(
        ds.dataset(
            'az://CONTAINER/FILE_NAME.parquet',
            filesystem=abfs,
        )
    )
    
    # then you can query your lazyframe
    df.select('column1', 'column2').collect()

    EDIT: as of 2023-03-19, projection pushdown (reading only the selected columns) reduces processing time, but slice pushdown (limiting the number of rows retrieved) does not seem to reduce it; see the sketch below.
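
    For illustration, both kinds of pushdown on the lazyframe built above (column names are placeholders; the exact effect may vary with your polars version):

    # projection pushdown: only the listed columns are fetched from storage
    df.select('column1', 'column2').collect()

    # slice pushdown: limits the rows returned, but per the note above it
    # does not currently seem to reduce how much data is scanned
    df.select('column1', 'column2').head(10).collect()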
