
Best way to get percentage counts in Polars


I frequently need to calculate the percentage counts of a variable. For example, given the dataframe below:

import polars as pl

df = pl.DataFrame({"person": ["a", "a", "b"],
                   "value": [1, 2, 3]})

I want to return a dataframe like this:

shape: (2, 2)
┌────────┬──────────┐
│ person ┆ percent  │
│ ---    ┆ ---      │
│ str    ┆ f64      │
╞════════╪══════════╡
│ a      ┆ 0.666667 │
│ b      ┆ 0.333333 │
└────────┴──────────┘

This is what I have been doing, but I can't help thinking there must be a more efficient, more idiomatic Polars way to do it:

n_rows = len(df)

(
    df
    .with_columns(pl.lit(1).alias('percent'))
    .group_by('person')
    .agg(pl.sum('percent') / n_rows)
)

Solution

  • GroupBy.len() helps here; it is shorthand for .agg(pl.len()).

    (
        df
        .group_by("person")
        .len()
        .with_columns((pl.col("len") / pl.sum("len")).alias("percent"))
    )
    
    shape: (2, 3)
    ┌────────┬─────┬──────────┐
    │ person ┆ len ┆ percent  │
    │ ---    ┆ --- ┆ ---      │
    │ str    ┆ u32 ┆ f64      │
    ╞════════╪═════╪══════════╡
    │ a      ┆ 2   ┆ 0.666667 │
    │ b      ┆ 1   ┆ 0.333333 │
    └────────┴─────┴──────────┘