python, string, memory-management, enums, python-polars

Should Polars Enum Datatype Result in More Efficient Storage and Memory Footprint of DataFrame?


I have a DataFrame in Python Polars with dimensions (8442 x 7): essentially 1206 rows for each day of the week. The day of the week appears as a plain string.

I thought I would exploit pl.Enum to encode the ISO_WEEKDAY column, saving space on disk. See the following code.

#!/usr/bin/env python3
# encoding: utf-8

import calendar
import polars as pl


df: pl.DataFrame = pl.read_parquet(source='some_data.parquet')  # About 268K on disk
# df has a column called ISO_WEEKDAY whose values are Monday, Tuesday, ..., Sunday

weekdays: pl.Enum = pl.Enum(categories=list(calendar.day_name))  # Encode all weekday names in the enum
df = df.with_columns(pl.col('ISO_WEEKDAY').cast(dtype=weekdays))
df.write_parquet('new_file.parquet')  # ~ Same 268K

But that doesn't seem to happen: the file comes out the same size even after casting the column to pl.Enum.

So is pl.Enum advisable at all, even though the column values are repeated 1206 times each?


Solution

  • When Polars (and most Parquet writers) write a file, they do so with both compression and dictionary encoding, so you shouldn't expect much, if any, on-disk size benefit from using Categoricals or Enums. (A quick metadata check, sketched below, shows how to confirm what the writer actually did.)

    In memory, though, things are different: Polars doesn't dictionary-encode String columns, so repeated strings actually take up additional space for every occurrence.

    For instance if I do

    df = pl.DataFrame({
        'a': pl.date_range(pl.date(2020, 1, 1), pl.date(2024, 12, 31), '1d', eager=True)
    }).with_columns(weekday=pl.col('a').dt.strftime('%A'))
    

    then df.estimated_size() is 20358 bytes, while df.with_columns(pl.col('weekday').cast(pl.Enum(df['weekday'].unique()))).estimated_size() is 14666 bytes, so just 72% of the original footprint. (The physical-representation sketch below shows where that saving comes from.)

    This is a much smaller DataFrame than yours, so I don't know why you're not seeing a smaller in-memory footprint with Enums instead of Strings.
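
    To verify the on-disk point, you can inspect the Parquet metadata and see which encodings and codec the writer actually used. This is a minimal sketch, assuming pyarrow is installed and that 'new_file.parquet' is the file written in the question:

    import pyarrow.parquet as pq

    # Read only the file metadata; no row data is loaded
    meta = pq.ParquetFile('new_file.parquet').metadata

    # Each column chunk reports its encodings (e.g. RLE_DICTIONARY)
    # and its compression codec (e.g. ZSTD or SNAPPY)
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.encodings, col.compression)

    If the String column already shows a dictionary encoding, the writer has effectively done on disk what the Enum cast does in memory, which is why the file size barely moves.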
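
    As for where the in-memory saving comes from: an Enum column stores a small integer code per row (its physical representation is UInt32, i.e. 4 bytes) plus a single copy of each category string, whereas a String column pays for every repeated occurrence. A sketch, reusing the df built above:

    # Cast the weekday column to an Enum and compare sizes
    weekday_enum = df['weekday'].cast(pl.Enum(df['weekday'].unique()))

    print(df['weekday'].estimated_size())    # every string occurrence is stored
    print(weekday_enum.estimated_size())     # 4-byte code per row + one copy of each category
    print(weekday_enum.to_physical().dtype)  # UInt32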