I would like to compute the per-group variance of my polars DataFrame. Maybe the reason is obvious, but I don't know why var() does not exist in the GroupBy object namespace. Is there a workaround?
df.group_by("group_id", maintain_order=True).var()
You can always use pl.all() inside agg to obtain your desired statistics for each group. For example:
import polars as pl
import numpy as np
nbr_rows_per_group = 1_000
nbr_groups = 3
rng = np.random.default_rng(1)
df = pl.DataFrame(
    {
        "group": list(range(0, nbr_groups)) * nbr_rows_per_group,
        "col1": rng.normal(0, 1, nbr_groups * nbr_rows_per_group),
        "col2": rng.normal(0, 1, nbr_groups * nbr_rows_per_group),
    }
)
(
    df
    .group_by('group')
    .agg(
        pl.all().var().name.suffix('_var'),
        pl.all().mean().name.suffix('_mean'),
        pl.all().skew().name.suffix('_skew'),
    )
)
shape: (3, 7)
┌───────┬──────────┬──────────┬───────────┬───────────┬───────────┬───────────┐
│ group ┆ col1_var ┆ col2_var ┆ col1_mean ┆ col2_mean ┆ col1_skew ┆ col2_skew │
│ ---   ┆ ---      ┆ ---      ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i64   ┆ f64      ┆ f64      ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════╪══════════╪══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0     ┆ 0.999802 ┆ 0.99401  ┆ 0.017574  ┆ 0.021156  ┆ -0.042408 ┆ 0.0102    │
│ 2     ┆ 1.031637 ┆ 1.029593 ┆ -0.053874 ┆ -0.037097 ┆ 0.004183  ┆ 0.080086  │
│ 1     ┆ 0.941347 ┆ 1.006852 ┆ 0.029232  ┆ -0.023855 ┆ 0.049269  ┆ 0.074515  │
└───────┴──────────┴──────────┴───────────┴───────────┴───────────┴───────────┘