sumaggregatepython-polars

Sum across columns of the same name (or "key") in polars dataframe


I have this code

import polars as pl

pl.DataFrame({
    'id': ['CHECK.US1', 'CHECK.US2', 'CHECK.CA9'],
    'libor.M2': [99, 332, 934],
    'libor.Y5': [11, -10, 904],
    'estr.M2':  [99, 271, 741],
    'estr.Y3':  [-8, -24, 183],
    'estr.Y5':  [88, 771, 455]
})

which gives

┌───────────┬──────────┬──────────┬─────────┬─────────┬─────────┐
│ id        ┆ libor.M2 ┆ libor.Y5 ┆ estr.M2 ┆ estr.Y3 ┆ estr.Y5 │
│ ---       ┆ ---      ┆ ---      ┆ ---     ┆ ---     ┆ ---     │
│ str       ┆ i64      ┆ i64      ┆ i64     ┆ i64     ┆ i64     │
╞═══════════╪══════════╪══════════╪═════════╪═════════╪═════════╡
│ CHECK.US1 ┆ 99       ┆ 11       ┆ 99      ┆ -8      ┆ 88      │
│ CHECK.US2 ┆ 332      ┆ -10      ┆ 271     ┆ -24     ┆ 771     │
│ CHECK.CA9 ┆ 934      ┆ 904      ┆ 741     ┆ 183     ┆ 455     │
└───────────┴──────────┴──────────┴─────────┴─────────┴─────────┘

Now what I am trying to do is rename the columns to shorter names such as

┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id        ┆ M2   ┆ Y5   ┆ M2  ┆ Y3  ┆ Y5  │
│ ---       ┆ ---  ┆ ---  ┆ --- ┆ --- ┆ --- │
│ str       ┆ i64  ┆ i64  ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡
or
┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id        ┆ libor┆ libor┆ estr┆ estr┆ estr│
│ ---       ┆ ---  ┆ ---  ┆ --- ┆ --- ┆ --- │
│ str       ┆ i64  ┆ i64  ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡

and then collapse (= sum) across the columns with the same name, so that I get e.g.

┌───────────┬──────┬──────┬──────┐
│ id        ┆ M2   ┆ Y5   ┆ Y3   │
│ ---       ┆ ---  ┆ ---  ┆ ---  │
│ str       ┆ i64  ┆ i64  ┆ i64  │
╞═══════════╪══════╪══════╪══════╡
│ CHECK.US1 ┆ 198  ┆ 99   ┆ -8   │
│ CHECK.US2 ┆ 603  ┆ 761  ┆ -24  │
│ CHECK.CA9 ┆ 1675 ┆ 1359 ┆ 183  │
└───────────┴──────┴──────┴──────┘

I tried renaming them first, but got polars.exceptions.DuplicateError: the name 'M2' is duplicate.

Is there a way to achieve what I am trying to do?


edit: I also tried something like

rename_func = lambda col: col.split('.')[-1]
new_cols = set([rename_func(c) for c in df.columns])

df.with_columns([
  pl.sum_horizontal(pl.all().map(rename_func) == c).alias(c) for c in new_cols
])

but it doesn't quite work.


Solution

  • You could build a dictionary of the columns to aggregate, then use pl.sum_horizontal:

    cols = {}
    for col in df.columns:
        suffix = col.split('.')[-1]
        cols.setdefault(suffix, []).append(col)
    # {'id': ['id'], 'M2': ['libor.M2', 'estr.M2'],
    #  'Y5': ['libor.Y5', 'estr.Y5'], 'Y3': ['estr.Y3']}
    
    df.select([pl.sum_horizontal(pl.col(c) for c in l).alias(k)
               for k, l in cols.items()
              ])
    

    Output:

    ┌───────────┬──────┬──────┬─────┐
    │ id        ┆ M2   ┆ Y5   ┆ Y3  │
    │ ---       ┆ ---  ┆ ---  ┆ --- │
    │ str       ┆ i64  ┆ i64  ┆ i64 │
    ╞═══════════╪══════╪══════╪═════╡
    │ CHECK.US1 ┆ 198  ┆ 99   ┆ -8  │
    │ CHECK.US2 ┆ 603  ┆ 761  ┆ -24 │
    │ CHECK.CA9 ┆ 1675 ┆ 1359 ┆ 183 │
    └───────────┴──────┴──────┴─────┘