I have a dataset with an ID column and multiple value columns. The values for each ID can differ in magnitude and range across these columns, so I want to normalize the columns separately for each ID.
import polars as pl
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pl.DataFrame(
    {"ID": [1, 1, 2, 2, 3, 3],
     "Values": [1, 2, 3, 4, 5, 6]}
)
If I do this, it fits the scaler on the entire DataFrame, but I would like to apply the scaler separately for each ID.
I tried this:
(
    df
    .with_columns(
        Value_scaled = scaler.fit_transform(df.select(pl.col("Values"))).over("ID"),
    )
)
But : AttributeError: 'numpy.ndarray' object has no attribute 'over'
I also tried using group_by():
(
    df
    .group_by(
        pl.col("ID")
    ).agg(
        scaler.fit_transform(pl.col("Values")).alias("Value_scaled")
    )
)
And I get:
TypeError: float() argument must be a string or a real number, not 'Expr'
Following the definition outlined in the scikit-learn documentation, the functionality of MinMaxScaler can be implemented easily using polars' native expression API.
def min_max_scaler(x: str | pl.Expr) -> pl.Expr:
    if isinstance(x, str):
        x = pl.col(x)
    return (x - x.min()) / (x.max() - x.min())
It is then compatible with polars' window functions, such as pl.Expr.over, to apply the scaling separately for each ID.
df.with_columns(min_max_scaler("Values").over("ID"))
shape: (6, 2)
┌─────┬────────┐
│ ID ┆ Values │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪════════╡
│ 1 ┆ 0.0 │
│ 1 ┆ 1.0 │
│ 2 ┆ 0.0 │
│ 2 ┆ 1.0 │
│ 3 ┆ 0.0 │
│ 3 ┆ 1.0 │
└─────┴────────┘