pythonpython-polars

Ranking categories by count within groups in Polars


I have a Polars DataFrame with months, categories, and IDs. I want to rank categories by their frequency within each month, then pivot the results to show which category held each rank position in each month.

My dataframe:

import polars as pl

df = pl.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
        "MONTH": [1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2],
        "CATEGORY": [
            "C",
            "B",
            "A",
            "C",
            "B",
            "A",
            "C",
            "C",
            "A",
            "B",
            "A",
            "A",
            "C",
            "C",
            "A",
            "B",
            "B",
        ],
    }
)

Desired Output:

shape: (3, 3)
┌───────┬─────┬─────┐
│ PLACE ┆ 1   ┆ 2   │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ str ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ A   ┆ C   │
│ 2     ┆ B   ┆ A   │
│ 3     ┆ C   ┆ B   │
└───────┴─────┴─────┘

Current Working Solution:

(
    df.group_by(pl.col("MONTH"), pl.col("CATEGORY"))
    .agg(pl.col("ID").len().alias("COUNT"))
    .sort(by=["MONTH", "COUNT"], descending=True)
    .with_columns(pl.lit(1).alias("aux"))
    .with_columns(pl.col("aux").cum_sum().over(["MONTH"]).alias("PLACE"))
    .pivot(index="PLACE", values="CATEGORY", on="MONTH", sort_columns=True)
)

This works correctly, but I feel like there might be a more elegant or direct way to achieve this ranking and pivoting operation in Polars. Is there a simpler approach?


Solution

  • (
        df.group_by("MONTH", "CATEGORY")
        .len()
        .with_columns(
            pl.col("len").rank("ordinal", descending=True).over("MONTH").alias("PLACE")
        )
        .sort("PLACE")
        .pivot("MONTH", index="PLACE", values="CATEGORY", sort_columns=True)
    )
    
    shape: (3, 3)
    ┌───────┬─────┬─────┐
    │ PLACE ┆ 1   ┆ 2   │
    │ ---   ┆ --- ┆ --- │
    │ u32   ┆ str ┆ str │
    ╞═══════╪═════╪═════╡
    │ 1     ┆ A   ┆ C   │
    │ 2     ┆ B   ┆ A   │
    │ 3     ┆ C   ┆ B   │
    └───────┴─────┴─────┘
    

    (Note that PLACE is a u32, which differs from the i64 in your output; if that matters, add a cast at the end.)

    Explanation:

    1. Get count per month+category
    2. Get rank of counts (descending, so highest = 1) partitioned by month — “over” is the missing piece in your solution. Note that ordinal breaks ties arbitrarily, so it would've been possible to get CBA instead of CAB in column 2.
    3. Sort by rank (highest count at top, lowest at bottom)
    4. Pivot