python-polars

How to replicate df.groupby('some_column').resample('Q').agg('total':'count') in polars with group_by_dynamic


Given i want to group quarterly, then in pandas i would write (given that the index contains the 'date' column)

df = df.groupby('some_column').resample('Q').agg({"total": "count"})

I tried to replicate that with polars:

df = df.group_by_dynamic('date', every='3mo', group_by='some_column', include_boundaries=True).agg(pl.len())

In pandas i get the end of the quarter back. In polars the lower boundary because truncate is default set to True. How would i get the _upper_boundary back as my column 'date' ?

Obviously i could just run a selecte after setting include_boundaries=True but i actually want to set it to false because the docs say it will impact performance if set to true

Is every='3mo' equal to pandas resample('Q') ?

Examples:

df = pd.DataFrame({'date': pd.date_range("2018-01-01", periods=365*24, freq='H')})
# Create second column 'pokemon_type':
df['pokemon_type'] = 'fire'
df.loc[400:, 'pokemon_type'] = 'water'

                    date pokemon_type
0    2018-01-01 00:00:00         fire
1    2018-01-01 01:00:00         fire
2    2018-01-01 02:00:00         fire
3    2018-01-01 03:00:00         fire
4    2018-01-01 04:00:00         fire
                  ...          ...
8755 2018-12-31 19:00:00        water
8756 2018-12-31 20:00:00        water
8757 2018-12-31 21:00:00        water
8758 2018-12-31 22:00:00        water
8759 2018-12-31 23:00:00        water
[8760 rows x 2 columns]

Now groupby:
df['total'] = 0
df = df.set_index('date') 
df = df.groupby('pokemon_type').resample('Q').agg({'total':'count'})

df
pokemon_type date             
fire         2018-03-31    400
water        2018-03-31   1760
             2018-06-30   2184
             2018-09-30   2208
             2018-12-31   2208
df.index
MultiIndex([( 'fire', '2018-03-31'),
            ('water', '2018-03-31'),
            ('water', '2018-06-30'),
            ('water', '2018-09-30'),
            ('water', '2018-12-31')],
           names=['pokemon_type', 'date'])

For Polars: Sadly i don't know how to slice a column and assign sub values. Tried to figure it out but i am still on it.

The polars df will look like this

┌──────────────┬─────────────────────┬─────────────────────┬────────────┬───────┐
│ pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date       ┆ count │
│ ---          ┆ ---                 ┆ ---                 ┆ ---        ┆ ---   │
│ str          ┆ datetime[μs]        ┆ datetime[μs]        ┆ date       ┆ i64   │
╞══════════════╪═════════════════════╪═════════════════════╪════════════╪═══════╡
│ fire         ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 58    │
│ water        ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 38    │
└──────────────┴─────────────────────┴─────────────────────┴────────────┴───────┘

What i need is the _upper_boundary directly without having performance issues.


Solution

  • First, let's recreate your data (including the slice) using Polars. We'll take this in steps.

    Calculating Dates

    First, we'll use the datetime_range function in Polars to create our dates.

    import polars as pl
    from datetime import datetime
    
    df = pl.DataFrame({
        'date': pl.datetime_range(datetime(2018, 11, 28), datetime(2019, 11, 28), '1h', closed='left', eager=True),
    })
    df
    
    shape: (8_760, 1)
    ┌─────────────────────┐
    │ date                │
    │ ---                 │
    │ datetime[μs]        │
    ╞═════════════════════╡
    │ 2018-11-28 00:00:00 │
    │ 2018-11-28 01:00:00 │
    │ 2018-11-28 02:00:00 │
    │ 2018-11-28 03:00:00 │
    │ 2018-11-28 04:00:00 │
    │ …                   │
    │ 2019-11-27 19:00:00 │
    │ 2019-11-27 20:00:00 │
    │ 2019-11-27 21:00:00 │
    │ 2019-11-27 22:00:00 │
    │ 2019-11-27 23:00:00 │
    └─────────────────────┘
    

    Slicing

    Next, we'll use the int_range expression, along with a when/then/otherwise expression to slice our data and assign the pokemon_type.

    df = df.with_columns(
        pl.when(pl.int_range(pl.len()) < 400)
        .then(pl.lit('fire'))
        .otherwise(pl.lit('water'))
        .alias('pokemon_type')
    )
    df
    
    shape: (8_760, 2)
    ┌─────────────────────┬──────────────┐
    │ date                ┆ pokemon_type │
    │ ---                 ┆ ---          │
    │ datetime[μs]        ┆ str          │
    ╞═════════════════════╪══════════════╡
    │ 2018-11-28 00:00:00 ┆ fire         │
    │ 2018-11-28 01:00:00 ┆ fire         │
    │ 2018-11-28 02:00:00 ┆ fire         │
    │ 2018-11-28 03:00:00 ┆ fire         │
    │ 2018-11-28 04:00:00 ┆ fire         │
    │ …                   ┆ …            │
    │ 2019-11-27 19:00:00 ┆ water        │
    │ 2019-11-27 20:00:00 ┆ water        │
    │ 2019-11-27 21:00:00 ┆ water        │
    │ 2019-11-27 22:00:00 ┆ water        │
    │ 2019-11-27 23:00:00 ┆ water        │
    └─────────────────────┴──────────────┘
    

    Calculating quarter-end dates

    To obtain our end-quarter dates, we can use the when/then/otherwise and the datetime expression to create them from the date field.

    Note: I'll set include_bondaries=True, just to compare our calculated end-quarter date to the _upper_boundary date.

    (
        df
        .group_by_dynamic("date", every="3mo", group_by="pokemon_type", include_boundaries=True)
        .agg(pl.len())
        .with_columns(
            pl.when(pl.col("date").dt.month() < 4)
            .then(pl.datetime(pl.col('date').dt.year(), 4, 1))
            .when(pl.col("date").dt.month() < 7)
            .then(pl.datetime(pl.col("date").dt.year(), 7, 1))
            .when(pl.col("date").dt.month() < 10)
            .then(pl.datetime(pl.col("date").dt.year(), 10, 1))
            .otherwise(pl.datetime(pl.col("date").dt.year() + 1, 1, 1))
            .alias("quarter")
        )
    )
    
    shape: (6, 6)
    ┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬─────────────────────┐
    │ pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ len  ┆ quarter             │
    │ ---          ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---  ┆ ---                 │
    │ str          ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32  ┆ datetime[μs]        │
    ╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪═════════════════════╡
    │ fire         ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400  ┆ 2019-01-01 00:00:00 │
    │ water        ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 416  ┆ 2019-01-01 00:00:00 │
    │ water        ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160 ┆ 2019-04-01 00:00:00 │
    │ water        ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184 ┆ 2019-07-01 00:00:00 │
    │ water        ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208 ┆ 2019-10-01 00:00:00 │
    │ water        ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1392 ┆ 2020-01-01 00:00:00 │
    └──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴─────────────────────┘
    

    Likewise, if you wanted to use closed="left", you could also express your quarter-ending dates like this:

    (
        df
        .group_by_dynamic("date", every="3mo", group_by="pokemon_type", closed='left', include_boundaries=True)
        .agg(pl.len())
        .with_columns(
            pl.when(pl.col("date").dt.month() < 4)
            .then(pl.datetime(pl.col('date').dt.year(), 3, 31))
            .when(pl.col("date").dt.month() < 7)
            .then(pl.datetime(pl.col("date").dt.year(), 6, 30))
            .when(pl.col("date").dt.month() < 10)
            .then(pl.datetime(pl.col("date").dt.year(), 9, 30))
            .otherwise(pl.datetime(pl.col("date").dt.year(), 12, 31))
            .alias("quarter")
        )
    )
    
    shape: (6, 6)
    ┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬─────────────────────┐
    │ pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ len  ┆ quarter             │
    │ ---          ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---  ┆ ---                 │
    │ str          ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32  ┆ datetime[μs]        │
    ╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪═════════════════════╡
    │ fire         ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400  ┆ 2018-12-31 00:00:00 │
    │ water        ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 416  ┆ 2018-12-31 00:00:00 │
    │ water        ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160 ┆ 2019-03-31 00:00:00 │
    │ water        ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184 ┆ 2019-06-30 00:00:00 │
    │ water        ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208 ┆ 2019-09-30 00:00:00 │
    │ water        ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1392 ┆ 2019-12-31 00:00:00 │
    └──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴─────────────────────┘