I frequently need to calculate the percentage counts of a variable, i.e. the share of rows that each value accounts for. For example, for the dataframe below:
import polars as pl

df = pl.DataFrame({"person": ["a", "a", "b"],
                   "value": [1, 2, 3]})
I want to return a dataframe like this:
shape: (2, 2)
┌────────┬──────────┐
│ person ┆ percent  │
│ ---    ┆ ---      │
│ str    ┆ f64      │
╞════════╪══════════╡
│ a      ┆ 0.666667 │
│ b      ┆ 0.333333 │
└────────┴──────────┘
What I have been doing is the following, but I can't help thinking there must be a more efficient or more idiomatic Polars way to do this:
n_rows = len(df)
(
    df
    .with_columns(pl.lit(1).alias('percent'))
    .group_by('person')
    .agg(pl.sum('percent') / n_rows)
)
GroupBy.len() will help here (it is shorthand for .agg(pl.len())):
(
    df
    .group_by("person")
    .len()
    .with_columns((pl.col("len") / pl.sum("len")).alias("percent"))
)
shape: (2, 3)
┌────────┬─────┬──────────┐
│ person ┆ len ┆ percent  │
│ ---    ┆ --- ┆ ---      │
│ str    ┆ u32 ┆ f64      │
╞════════╪═════╪══════════╡
│ a      ┆ 2   ┆ 0.666667 │
│ b      ┆ 1   ┆ 0.333333 │
└────────┴─────┴──────────┘