Boiling down a bigger problem to its essentials, I would like to do this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.randint(0, 5, 1000), 'b': np.random.random(1000)})
for aval, subdf in df.groupby('a'):
    subdf.to_parquet(f'/tmp/{aval}.parquet')
in Polars using LazyFrame:
import numpy as np
import pandas as pd
import polars as pl
df = pd.DataFrame({'a': np.random.randint(0, 5, 1000), 'b': np.random.random(1000)})
lf = pl.LazyFrame(df)
# ???
I would like to be able to control the name of the output files in a similar way.
Thanks!
You could use a partitioning scheme, e.g. PartitionByKey():
lf.sink_parquet(
    pl.PartitionByKey("/tmp/output", by="a"),
    mkdir=True,
)
For your example this creates:
/tmp/output
/tmp/output/a=0
/tmp/output/a=0/0.parquet
/tmp/output/a=1
/tmp/output/a=1/0.parquet
/tmp/output/a=2
/tmp/output/a=2/0.parquet
/tmp/output/a=3
/tmp/output/a=3/0.parquet
/tmp/output/a=4
/tmp/output/a=4/0.parquet
The docs show an example of the file_path= argument being used with a callback, which lets you customize the file names further if required.