I have a Polars DataFrame with months, categories, and IDs. I want to rank categories by their frequency within each month, then pivot the results to show which category held each rank position in each month.
My dataframe:
import polars as pl
df = pl.DataFrame(
{
"ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
"MONTH": [1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2],
"CATEGORY": [
"C",
"B",
"A",
"C",
"B",
"A",
"C",
"C",
"A",
"B",
"A",
"A",
"C",
"C",
"A",
"B",
"B",
],
}
)
Desired Output:
shape: (3, 3)
┌───────┬─────┬─────┐
│ PLACE ┆ 1 ┆ 2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═══════╪═════╪═════╡
│ 1 ┆ A ┆ C │
│ 2 ┆ B ┆ A │
│ 3 ┆ C ┆ B │
└───────┴─────┴─────┘
Current Working Solution:
(
df.group_by(pl.col("MONTH"), pl.col("CATEGORY"))
.agg(pl.col("ID").len().alias("COUNT"))
.sort(by=["MONTH", "COUNT"], descending=True)
.with_columns(pl.lit(1).alias("aux"))
.with_columns(pl.col("aux").cum_sum().over(["MONTH"]).alias("PLACE"))
.pivot(index="PLACE", values="CATEGORY", on="MONTH", sort_columns=True)
)
This works correctly, but I feel like there might be a more elegant or direct way to achieve this ranking and pivoting operation in Polars. Is there a simpler approach?
(
df.group_by("MONTH", "CATEGORY")
.len()
.with_columns(
pl.col("len").rank("ordinal", descending=True).over("MONTH").alias("PLACE")
)
.sort("PLACE")
.pivot("MONTH", index="PLACE", values="CATEGORY", sort_columns=True)
)
shape: (3, 3)
┌───────┬─────┬─────┐
│ PLACE ┆ 1 ┆ 2 │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═══════╪═════╪═════╡
│ 1 ┆ A ┆ C │
│ 2 ┆ B ┆ A │
│ 3 ┆ C ┆ B │
└───────┴─────┴─────┘
(Note that PLACE is a u32, which differs from the i64 in your output; if that matters, add a cast at the end.)
Explanation:
ordinal
breaks ties arbitrarily, so it would've been possible to get CBA
instead of CAB
in column 2.