Given i want to group quarterly, then in pandas i would write (given that the index contains the 'date' column)
df = df.groupby('some_column').resample('Q').agg({"total": "count"})
I tried to replicate that with polars:
df = df.group_by_dynamic('date', every='3mo', group_by='some_column', include_boundaries=True).agg(pl.len())
In pandas i get the end of the quarter back. In polars the lower boundary because truncate is default set to True. How would i get the _upper_boundary back as my column 'date' ?
Obviously i could just run a selecte after setting include_boundaries=True but i actually want to set it to false because the docs say it will impact performance if set to true
Is every='3mo' equal to pandas resample('Q') ?
Examples:
df = pd.DataFrame({'date': pd.date_range("2018-01-01", periods=365*24, freq='H')})
# Create second column 'pokemon_type':
df['pokemon_type'] = 'fire'
df.loc[400:, 'pokemon_type'] = 'water'
date pokemon_type
0 2018-01-01 00:00:00 fire
1 2018-01-01 01:00:00 fire
2 2018-01-01 02:00:00 fire
3 2018-01-01 03:00:00 fire
4 2018-01-01 04:00:00 fire
... ...
8755 2018-12-31 19:00:00 water
8756 2018-12-31 20:00:00 water
8757 2018-12-31 21:00:00 water
8758 2018-12-31 22:00:00 water
8759 2018-12-31 23:00:00 water
[8760 rows x 2 columns]
Now groupby:
df['total'] = 0
df = df.set_index('date')
df = df.groupby('pokemon_type').resample('Q').agg({'total':'count'})
df
pokemon_type date
fire 2018-03-31 400
water 2018-03-31 1760
2018-06-30 2184
2018-09-30 2208
2018-12-31 2208
df.index
MultiIndex([( 'fire', '2018-03-31'),
('water', '2018-03-31'),
('water', '2018-06-30'),
('water', '2018-09-30'),
('water', '2018-12-31')],
names=['pokemon_type', 'date'])
For Polars: Sadly i don't know how to slice a column and assign sub values. Tried to figure it out but i am still on it.
The polars df will look like this
┌──────────────┬─────────────────────┬─────────────────────┬────────────┬───────┐
│ pokemon_type ┆ _lower_boundary ┆ _upper_boundary ┆ date ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] ┆ date ┆ i64 │
╞══════════════╪═════════════════════╪═════════════════════╪════════════╪═══════╡
│ fire ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 58 │
│ water ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 38 │
└──────────────┴─────────────────────┴─────────────────────┴────────────┴───────┘
What i need is the _upper_boundary
directly without having performance issues.
First, let's recreate your data (including the slice) using Polars. We'll take this in steps.
First, we'll use the datetime_range
function in Polars to create our dates.
import polars as pl
from datetime import datetime
df = pl.DataFrame({
'date': pl.datetime_range(datetime(2018, 11, 28), datetime(2019, 11, 28), '1h', closed='left', eager=True),
})
df
shape: (8_760, 1)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2018-11-28 00:00:00 │
│ 2018-11-28 01:00:00 │
│ 2018-11-28 02:00:00 │
│ 2018-11-28 03:00:00 │
│ 2018-11-28 04:00:00 │
│ … │
│ 2019-11-27 19:00:00 │
│ 2019-11-27 20:00:00 │
│ 2019-11-27 21:00:00 │
│ 2019-11-27 22:00:00 │
│ 2019-11-27 23:00:00 │
└─────────────────────┘
Next, we'll use the int_range
expression, along with a when/then/otherwise
expression to slice our data and assign the pokemon_type
.
df = df.with_columns(
pl.when(pl.int_range(pl.len()) < 400)
.then(pl.lit('fire'))
.otherwise(pl.lit('water'))
.alias('pokemon_type')
)
df
shape: (8_760, 2)
┌─────────────────────┬──────────────┐
│ date ┆ pokemon_type │
│ --- ┆ --- │
│ datetime[μs] ┆ str │
╞═════════════════════╪══════════════╡
│ 2018-11-28 00:00:00 ┆ fire │
│ 2018-11-28 01:00:00 ┆ fire │
│ 2018-11-28 02:00:00 ┆ fire │
│ 2018-11-28 03:00:00 ┆ fire │
│ 2018-11-28 04:00:00 ┆ fire │
│ … ┆ … │
│ 2019-11-27 19:00:00 ┆ water │
│ 2019-11-27 20:00:00 ┆ water │
│ 2019-11-27 21:00:00 ┆ water │
│ 2019-11-27 22:00:00 ┆ water │
│ 2019-11-27 23:00:00 ┆ water │
└─────────────────────┴──────────────┘
To obtain our end-quarter dates, we can use the when/then/otherwise
and the datetime
expression to create them from the date
field.
Note: I'll set include_bondaries=True
, just to compare our calculated end-quarter date to the _upper_boundary
date.
(
df
.group_by_dynamic("date", every="3mo", group_by="pokemon_type", include_boundaries=True)
.agg(pl.len())
.with_columns(
pl.when(pl.col("date").dt.month() < 4)
.then(pl.datetime(pl.col('date').dt.year(), 4, 1))
.when(pl.col("date").dt.month() < 7)
.then(pl.datetime(pl.col("date").dt.year(), 7, 1))
.when(pl.col("date").dt.month() < 10)
.then(pl.datetime(pl.col("date").dt.year(), 10, 1))
.otherwise(pl.datetime(pl.col("date").dt.year() + 1, 1, 1))
.alias("quarter")
)
)
shape: (6, 6)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬─────────────────────┐
│ pokemon_type ┆ _lower_boundary ┆ _upper_boundary ┆ date ┆ len ┆ quarter │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ u32 ┆ datetime[μs] │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪═════════════════════╡
│ fire ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400 ┆ 2019-01-01 00:00:00 │
│ water ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 416 ┆ 2019-01-01 00:00:00 │
│ water ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160 ┆ 2019-04-01 00:00:00 │
│ water ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184 ┆ 2019-07-01 00:00:00 │
│ water ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208 ┆ 2019-10-01 00:00:00 │
│ water ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1392 ┆ 2020-01-01 00:00:00 │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴─────────────────────┘
Likewise, if you wanted to use closed="left"
, you could also express your quarter-ending dates like this:
(
df
.group_by_dynamic("date", every="3mo", group_by="pokemon_type", closed='left', include_boundaries=True)
.agg(pl.len())
.with_columns(
pl.when(pl.col("date").dt.month() < 4)
.then(pl.datetime(pl.col('date').dt.year(), 3, 31))
.when(pl.col("date").dt.month() < 7)
.then(pl.datetime(pl.col("date").dt.year(), 6, 30))
.when(pl.col("date").dt.month() < 10)
.then(pl.datetime(pl.col("date").dt.year(), 9, 30))
.otherwise(pl.datetime(pl.col("date").dt.year(), 12, 31))
.alias("quarter")
)
)
shape: (6, 6)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬─────────────────────┐
│ pokemon_type ┆ _lower_boundary ┆ _upper_boundary ┆ date ┆ len ┆ quarter │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ u32 ┆ datetime[μs] │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪═════════════════════╡
│ fire ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400 ┆ 2018-12-31 00:00:00 │
│ water ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 416 ┆ 2018-12-31 00:00:00 │
│ water ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160 ┆ 2019-03-31 00:00:00 │
│ water ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184 ┆ 2019-06-30 00:00:00 │
│ water ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208 ┆ 2019-09-30 00:00:00 │
│ water ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1392 ┆ 2019-12-31 00:00:00 │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴─────────────────────┘