I'm trying to use Polars to get the most frequent value in a column of a DataFrame, for each category, and this category is given by multiple other columns (i.e., using a composite primary key).
For example, I have a DataFrame that looks something like this:
Input
df = pl.from_repr("""
┌───────┬─────┬─────┬───────┐
│ index ┆ pk1 ┆ pk2 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═══════╪═════╪═════╪═══════╡
│ 0 ┆ a ┆ x ┆ 42 │
│ 1 ┆ b ┆ y ┆ 69 │
│ 2 ┆ a ┆ x ┆ 36 │
│ 3 ┆ b ┆ x ┆ 12 │
│ 4 ┆ a ┆ x ┆ 36 │
└───────┴─────┴─────┴───────┘
""")
And I would like the output to be:
Output
shape: (5, 5)
┌───────┬─────┬─────┬───────┬────────────────┐
│ index ┆ pk1 ┆ pk2 ┆ value ┆ majority_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╪═══════╪════════════════╡
│ 0 ┆ a ┆ x ┆ 42 ┆ 36 │
│ 1 ┆ b ┆ y ┆ 69 ┆ 69 │
│ 2 ┆ a ┆ x ┆ 36 ┆ 36 │
│ 3 ┆ b ┆ x ┆ 12 ┆ 12 │
│ 4 ┆ a ┆ x ┆ 36 ┆ 36 │
└───────┴─────┴─────┴───────┴────────────────┘
I have tried using .mode()
with .over()
but it returns an exception:
ComputeError: the length of the window expression did not match that of the group
Code I've tried using:
def get_majority_value(primary_keys: list):
return (
pl.col("value")
.mode()
.over(primary_keys)
.alias('majority_value')
)
df.with_columns(
get_majority_value(primary_keys=['pk1', 'pk2'])
)
Turns out the Polars' .mode()
expression can return multiple values if there is a tie (more than one mode).
So we have to add .first()
def get_majority_value(primary_keys: list):
return (
pl.col("value")
.drop_nulls() # Use this if you don't want to consider "None" as a possible mode
.mode()
.first()
.over(primary_keys)
.alias('majority_value')
)
df.with_columns(
get_majority_value(primary_keys=['pk1', 'pk2'])
)