I have a DataFrame in Python Polars with dimensions (8442x7): basically, 1206 rows for each day of the week. The day of the week appears as a plain string. I thought I would exploit pl.Enum to encode the ISO_WEEKDAY column and save space on disk. See the following code.
#!/usr/bin/env python3
# encoding: utf-8
import calendar, polars as pl
df: pl.DataFrame = pl.read_parquet(source='some_data.parquet') # About 268K
# df has a column called ISO_WEEKDAY whose values are Monday, Tuesday...Sunday
weekdays: pl.Enum = pl.Enum(categories=list(calendar.day_name))  # Encode all weekdays in the enum
df = df.with_columns(pl.col('ISO_WEEKDAY').cast(dtype=weekdays))
df.write_parquet('new_file.parquet') # ~ Same 268K
But it seems this is not happening: the file on disk stays at roughly the same 268K, and the in-memory size (as reported by the pl.DataFrame.estimated_size() method) actually gets bigger. So is pl.Enum advisable at all, even though each column value is repeated 1206 times?
When Polars (and most writers) writes a Parquet file, it does so with both compression and dictionary encoding, so you shouldn't expect much, if any, size-on-disk benefit from using Categoricals or Enums.
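You can check this by inspecting the Parquet metadata. Here is a minimal sketch using pyarrow (assuming pyarrow is installed and 'new_file.parquet' is the file from your snippet; the exact encodings reported depend on the writer version):

import pyarrow.parquet as pq

meta = pq.ParquetFile('new_file.parquet').metadata
for rg in range(meta.num_row_groups):
    for i in range(meta.num_columns):
        col = meta.row_group(rg).column(i)
        # String columns typically report a dictionary encoding (e.g. RLE_DICTIONARY)
        # alongside the compression codec, which is why plain Strings already
        # compress well on disk
        print(col.path_in_schema, col.encodings, col.compression)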
In memory, though, things are different: Polars doesn't dictionary-encode String columns, so repeated strings actually take up additional space.
For instance, if I do

df = pl.DataFrame({
    'a': pl.date_range(pl.date(2020, 1, 1), pl.date(2024, 12, 31), '1d', eager=True)
}).with_columns(weekday=pl.col('a').dt.strftime('%A'))

then df.estimated_size() is 20358, and

df.with_columns(pl.col('weekday').cast(pl.Enum(df['weekday'].unique()))).estimated_size()

is 14666, so just 72% of the original footprint.
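The saving comes from the physical representation: in current Polars versions an Enum column (like a Categorical) is backed by UInt32 codes plus a single copy of the categories, so each repeated weekday costs 4 bytes rather than the full string. A quick way to see that, reusing the df built above:

df_enum = df.with_columns(pl.col('weekday').cast(pl.Enum(df['weekday'].unique())))
print(df_enum['weekday'].to_physical().dtype)  # UInt32: one code per row into the shared categories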
This is a much smaller df than yours, so I don't know why you're not seeing a smaller df with Enums instead of Strings.