pythondataframevectorizationpython-polars

How to get the most frequent value over partitions in Polars?


I'm trying to use Polars to get the most frequent value in a column of a DataFrame, for each category, and this category is given by multiple other columns (i.e., using a composite primary key).

For example, I have a DataFrame that looks something like this:

Input

df = pl.from_repr("""
┌───────┬─────┬─────┬───────┐
│ index ┆ pk1 ┆ pk2 ┆ value │
│ ---   ┆ --- ┆ --- ┆ ---   │
│ i64   ┆ str ┆ str ┆ i64   │
╞═══════╪═════╪═════╪═══════╡
│ 0     ┆ a   ┆ x   ┆ 42    │
│ 1     ┆ b   ┆ y   ┆ 69    │
│ 2     ┆ a   ┆ x   ┆ 36    │
│ 3     ┆ b   ┆ x   ┆ 12    │
│ 4     ┆ a   ┆ x   ┆ 36    │
└───────┴─────┴─────┴───────┘
""")

And I would like the output to be:

Output

shape: (5, 5)
┌───────┬─────┬─────┬───────┬────────────────┐
│ index ┆ pk1 ┆ pk2 ┆ value ┆ majority_value │
│ ---   ┆ --- ┆ --- ┆ ---   ┆ ---            │
│ i64   ┆ str ┆ str ┆ i64   ┆ i64            │
╞═══════╪═════╪═════╪═══════╪════════════════╡
│ 0     ┆ a   ┆ x   ┆ 42    ┆ 36             │
│ 1     ┆ b   ┆ y   ┆ 69    ┆ 69             │
│ 2     ┆ a   ┆ x   ┆ 36    ┆ 36             │
│ 3     ┆ b   ┆ x   ┆ 12    ┆ 12             │
│ 4     ┆ a   ┆ x   ┆ 36    ┆ 36             │
└───────┴─────┴─────┴───────┴────────────────┘

I have tried using .mode() with .over() but it returns an exception:

ComputeError: the length of the window expression did not match that of the group

Code I've tried using:

def get_majority_value(primary_keys: list):
    return (
        pl.col("value")
        .mode()
        .over(primary_keys)
        .alias('majority_value')
    )

df.with_columns(
    get_majority_value(primary_keys=['pk1', 'pk2'])
)

Solution

  • Turns out the Polars' .mode() expression can return multiple values if there is a tie (more than one mode).

    So we have to add .first()

    def get_majority_value(primary_keys: list):
        return (
            pl.col("value")
            .drop_nulls() # Use this if you don't want to consider "None" as a possible mode
            .mode()
            .first()
            .over(primary_keys)
            .alias('majority_value')
        )
    
    df.with_columns(
        get_majority_value(primary_keys=['pk1', 'pk2'])
    )