pythonpandasdataframepython-polars

Polars group_by + describe: return all columns as single dataframe


I'm slowly migrating to polars from pandas and I have found that in some cases the polars syntax is tricky.

I'm seeking help to do a group_by followed by a describe using less (or more readable) code.

See this example:

from io import BytesIO
import pandas as pd
import polars as pl

S = b'''group,value\n3,245\n3,28\n3,48\n1,113\n1,288\n1,165\n2,90\n2,21\n2,109'''

pl_df = pl.read_csv(BytesIO(S))
pd_df = pd.read_csv(BytesIO(S))

# Polars' way
pl_df.group_by('group').map_groups(
    lambda df: (
        df['value']
        .describe()
        .with_columns(
            group=pl.lit(df['group'][0])
        )
    )
).pivot(index='group', on='statistic')

Something similar in pandas would be:

# pandas' 
pd_df.groupby('group').value.describe()

Solution

  • You can write a quick function that returns a mapping of expressions that you can unpack right into your DataFrame.group_by(...).agg. This avoids any slow-ness of using map_groups and enables Polars to easily scan the query for any optimizations (provided you are working with a LazyFrame).

    from io import BytesIO
    import pandas as pd
    import polars as pl
    
    def describe(column, percentiles=[.25, .5, .75]):
        return {
            'count': column.count(),
            'null_count': column.null_count(),
            'mean': column.mean(),
            'std': column.std(),
            'min': column.min(),
            **{
                f'{pct*100:g}%': column.quantile(pct)
                for pct in percentiles
            },
            'max': column.max(),
        }
    
    S = b'''group,value\n3,245\n3,28\n3,48\n1,113\n1,288\n1,165\n2,90\n2,21\n2,109'''
    
    pl_df = pl.read_csv(BytesIO(S))
    
    print(
        pl_df.group_by('group').agg(**describe(pl.col('value')))
    )
    

    Produces

    shape: (3, 10)
    ┌───────┬───────┬────────────┬────────────┬───┬───────┬───────┬───────┬─────┐
    │ group ┆ count ┆ null_count ┆ mean       ┆ … ┆ 25%   ┆ 50%   ┆ 75%   ┆ max │
    │ ---   ┆ ---   ┆ ---        ┆ ---        ┆   ┆ ---   ┆ ---   ┆ ---   ┆ --- │
    │ i64   ┆ u32   ┆ u32        ┆ f64        ┆   ┆ f64   ┆ f64   ┆ f64   ┆ i64 │
    ╞═══════╪═══════╪════════════╪════════════╪═══╪═══════╪═══════╪═══════╪═════╡
    │ 1     ┆ 3     ┆ 0          ┆ 188.666667 ┆ … ┆ 165.0 ┆ 165.0 ┆ 288.0 ┆ 288 │
    │ 3     ┆ 3     ┆ 0          ┆ 107.0      ┆ … ┆ 48.0  ┆ 48.0  ┆ 245.0 ┆ 245 │
    │ 2     ┆ 3     ┆ 0          ┆ 73.333333  ┆ … ┆ 90.0  ┆ 90.0  ┆ 109.0 ┆ 109 │
    └───────┴───────┴────────────┴────────────┴───┴───────┴───────┴───────┴─────┘