import polars as pl

df = pl.DataFrame(
    {
        "era": ["01", "01", "02", "02", "03", "03"],
        "pred1": [1, 2, 3, 4, 5, 6],
        "pred2": [2, 4, 5, 6, 7, 8],
        "pred3": [3, 5, 6, 8, 9, 1],
        "something_else": [5, 4, 3, 67, 5, 4],
    }
)
pred_cols = ["pred1", "pred2", "pred3"]
ERA_COL = "era"
I'm trying to do the equivalent of Pandas' percentile rank in Polars. Polars' rank function lacks the pct flag that Pandas has.
I looked at another question here: how to replace pandas df.rank(axis=1) with polars
But the result from that question, applied to my code, is off. Calculating the percentile rank in Pandas gives me a single float per value, while the Polars example there gives me an array rather than a float, so it must be calculating something different.
For reference, this is the Pandas code:
df[list(pred_cols)] = df.groupby(ERA_COL, group_keys=False).apply(
    lambda d: d[list(pred_cols)].rank(pct=True)
)
You can use the mentioned .rank() / .count() approach with .over():
df.select(
    (pl.col(pred_cols).rank() / pl.col(pred_cols).count())
    .over(ERA_COL)
)
shape: (6, 3)
┌───────┬───────┬───────┐
│ pred1 ┆ pred2 ┆ pred3 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════╪═══════╪═══════╡
│ 0.5 ┆ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 ┆ 1.0 │
│ 0.5 ┆ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 ┆ 1.0 │
│ 0.5 ┆ 0.5 ┆ 1.0 │
│ 1.0 ┆ 1.0 ┆ 0.5 │
└───────┴───────┴───────┘
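To see why dividing the rank by the count gives a percentile rank, here is a minimal pure-Python sketch of the arithmetic for a single group. The helper name `pct_rank_plain` is made up for illustration, and it assumes the "average" tie-breaking method, which is the default for both Pandas' and Polars' rank.

```python
def pct_rank_plain(values):
    """Average rank of each value divided by the number of values.

    Hypothetical helper for illustration only; it is not part of
    Pandas or Polars.
    """
    n = len(values)
    result = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        # Tied values share ranks smaller+1 .. smaller+equal; use their mean.
        average_rank = smaller + (equal + 1) / 2
        result.append(average_rank / n)
    return result


# pred3 within era "03" from the example data: values 9 and 1
print(pct_rank_plain([9, 1]))  # [1.0, 0.5]
```

This matches the last two rows of the pred3 column above: 9 is the larger of the two values in its era, so it gets rank 2 of 2.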
Use .with_columns() instead of .select() if you want to "replace" the original values and keep the other columns:
df.with_columns(
    (pl.col(pred_cols).rank() / pl.col(pred_cols).count())
    .over(ERA_COL)
)
shape: (6, 5)
┌─────┬───────┬───────┬───────┬────────────────┐
│ era ┆ pred1 ┆ pred2 ┆ pred3 ┆ something_else │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╪════════════════╡
│ 01 ┆ 0.5 ┆ 0.5 ┆ 0.5 ┆ 5 │
│ 01 ┆ 1.0 ┆ 1.0 ┆ 1.0 ┆ 4 │
│ 02 ┆ 0.5 ┆ 0.5 ┆ 0.5 ┆ 3 │
│ 02 ┆ 1.0 ┆ 1.0 ┆ 1.0 ┆ 67 │
│ 03 ┆ 0.5 ┆ 0.5 ┆ 1.0 ┆ 5 │
│ 03 ┆ 1.0 ┆ 1.0 ┆ 0.5 ┆ 4 │
└─────┴───────┴───────┴───────┴────────────────┘