I have a Polars DataFrame, and I want to create a summarized view where aggregated values (e.g., unique IDs, total sends) are displayed in a format that makes comparison across months easier. Here's an example of my dataset:
My example dataframe:
import polars as pl
df = pl.DataFrame({
"Channel": ["X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X", "Y", "Y"],
"ID": ["a", "b", "b", "a", "e", "b", "g", "h", "a", "a", "k", "a", "b", "n", "o", "p"],
"Month": ["1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2"]
})
Currently, I use the following group_by()
approach to calculate the number of unique IDs and the total number of sends for each Month and Channel:
(
df
.group_by(
pl.col("Month"),
pl.col("Channel")
)
.agg(
pl.col("ID").n_unique().alias("Uniques ID"),
pl.col("ID").len().alias("Total sends")
)
)
shape: (4, 4)
┌───────┬─────────┬────────────┬─────────────┐
│ Month ┆ Channel ┆ Uniques ID ┆ Total sends │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 ┆ u32 │
╞═══════╪═════════╪════════════╪═════════════╡
│ 1 ┆ X ┆ 3 ┆ 4 │
│ 1 ┆ Y ┆ 4 ┆ 4 │
│ 2 ┆ X ┆ 3 ┆ 4 │
│ 2 ┆ Y ┆ 3 ┆ 4 │
└───────┴─────────┴────────────┴─────────────┘
However, my actual dataset is much larger, and have more agg_functions, so I want a format that better highlights comparisons across months. Ideally, I want the output to look like this:
| Channels | agg_func | months | months |
|----------|--------------|--------|--------|
| | | 1 | 2 |
| X | Uniques ID | 3 | 3 |
| X | Total sends | 4 | 4 |
| Y | Uniques ID | 4 | 3 |
| Y | Total sends | 4 | 4 |
I believe I could use .pivot()
and pass the aggregation functions as part of the index. But, I'm not sure how to implement this directly without creating an auxiliary DataFrame. Any suggestions?
You can aggregate multiple aggregates while pivoting and then explode the lists:
(
df.pivot(
on="Month",
values="ID",
aggregate_function=
pl.concat_list(
pl.element().n_unique().alias("value"),
pl.element().len().alias("value")
)
)
.with_columns(agg_func=["Uniques ID","Total sends"])
.explode(pl.exclude("Channel"))
)
shape: (4, 4)
┌─────────┬─────┬─────┬─────────────┐
│ Channel ┆ 1 ┆ 2 ┆ agg_func │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 ┆ str │
╞═════════╪═════╪═════╪═════════════╡
│ X ┆ 3 ┆ 3 ┆ Uniques ID │
│ X ┆ 4 ┆ 4 ┆ Total sends │
│ Y ┆ 4 ┆ 3 ┆ Uniques ID │
│ Y ┆ 4 ┆ 4 ┆ Total sends │
└─────────┴─────┴─────┴─────────────┘
Or, you can do it with multiple pivots (one per aggregate function):
pl.concat([
df.pivot(
on="Month",
values="ID",
aggregate_function=agg_func
).with_columns(
pl.lit(agg_func_name).alias("agg_func")
)
for agg_func, agg_func_name in [
(pl.element().n_unique(), "Uniques ID"),
(pl.element().len(), "Total sends")
]
])
# alternatively group_by first and then pivot
# pl.concat([
# df.group_by("Month","Channel")
# .agg(agg_func)
# .with_columns(agg_func=pl.lit(agg_func_name))
# for agg_func, agg_func_name in [
# (pl.col.ID.n_unique(), "Uniques ID"),
# (pl.col.ID.len(), "Total sends")
# ]
# ]).pivot(on="Month", values="ID")
shape: (4, 4)
┌─────────┬─────┬─────┬─────────────┐
│ Channel ┆ 1 ┆ 2 ┆ agg_func │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 ┆ str │
╞═════════╪═════╪═════╪═════════════╡
│ X ┆ 3 ┆ 3 ┆ Uniques ID │
│ Y ┆ 4 ┆ 3 ┆ Uniques ID │
│ X ┆ 4 ┆ 4 ┆ Total sends │
│ Y ┆ 4 ┆ 4 ┆ Total sends │
└─────────┴─────┴─────┴─────────────┘
Of course, you can also extend your solution with unpivot
and pivot
(
df
.group_by("Month","Channel")
.agg(
pl.col("ID").n_unique().alias("Uniques ID"),
pl.col("ID").len().alias("Total sends")
)
.unpivot(index=["Month","Channel"], variable_name="agg_func")
.pivot(on="Month", values="value")
)
shape: (4, 4)
┌─────────┬─────────────┬─────┬─────┐
│ Channel ┆ agg_func ┆ 2 ┆ 1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 ┆ u32 │
╞═════════╪═════════════╪═════╪═════╡
│ Y ┆ Uniques ID ┆ 3 ┆ 4 │
│ X ┆ Uniques ID ┆ 3 ┆ 3 │
│ Y ┆ Total sends ┆ 4 ┆ 4 │
│ X ┆ Total sends ┆ 4 ┆ 4 │
└─────────┴─────────────┴─────┴─────┘