python, python-polars, polars

Polars: pandas-like groupby, saving each group to its own file


Boiling down a bigger problem to its essentials, I would like to do this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 5, 1000), 'b': np.random.random(1000)})

for aval, subdf in df.groupby('a'):
    subdf.to_parquet(f'/tmp/{aval}.parquet')

in Polars using LazyFrame:

import numpy as np
import pandas as pd
import polars as pl

df = pd.DataFrame({'a': np.random.randint(0, 5, 1000), 'b': np.random.random(1000)})

lf = pl.LazyFrame(df)
# ???

I would like to be able to control the name of the output files in a similar way.

Thanks!


Solution

  • You could pass a partitioning scheme, e.g. pl.PartitionByKey(), to sink_parquet():

    lf.sink_parquet(
        pl.PartitionByKey("/tmp/output", by="a"),
        mkdir=True,
    )
    

    For your example this creates a hive-style layout, with one directory per value of a:

    /tmp/output
    /tmp/output/a=0
    /tmp/output/a=0/0.parquet
    /tmp/output/a=1
    /tmp/output/a=1/0.parquet
    /tmp/output/a=2
    /tmp/output/a=2/0.parquet
    /tmp/output/a=3
    /tmp/output/a=3/0.parquet
    /tmp/output/a=4
    /tmp/output/a=4/0.parquet
    

    The docs show an example of passing a callback as file_path= to customize the file names further, if required.