I am trying to use polars to do a window aggregate over one value, but map it back to another.
For example, if i wanted to get the name of the max value in a group, instead of (or in combination to) just the max value.
assuming an input of something like this.
df = pl.from_repr("""
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ foo ┆ 1.0 │
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
│ b. ┆ boo ┆ -1.0 │
└───────┴──────┴───────┘
""")
# 'max_by' is not a real method, just using it to express what i'm trying to achieve.
df.select(pl.col('label'), pl.col('name').max_by('value').over('label'))
i want an output like this
shape: (2, 2)
┌───────┬──────┐
│ label ┆ name │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪══════╡
│ a. ┆ bar │
│ b. ┆ baz │
└───────┴──────┘
ideally with the value
too. But i know i can easily add that in via pl.col('value').max().over('label')
.
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
You were close. There is a sort_by
expression that can be used.
df.group_by('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns(
pl.col('name','value').sort_by('value').last().over('label').name.suffix("_max")
)
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘