I have this code
import polars as pl
pl.DataFrame({
'id': ['CHECK.US1', 'CHECK.US2', 'CHECK.CA9'],
'libor.M2': [99, 332, 934],
'libor.Y5': [11, -10, 904],
'estr.M2': [99, 271, 741],
'estr.Y3': [-8, -24, 183],
'estr.Y5': [88, 771, 455]
})
which gives
┌───────────┬──────────┬──────────┬─────────┬─────────┬─────────┐
│ id ┆ libor.M2 ┆ libor.Y5 ┆ estr.M2 ┆ estr.Y3 ┆ estr.Y5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════════╪══════════╪═════════╪═════════╪═════════╡
│ CHECK.US1 ┆ 99 ┆ 11 ┆ 99 ┆ -8 ┆ 88 │
│ CHECK.US2 ┆ 332 ┆ -10 ┆ 271 ┆ -24 ┆ 771 │
│ CHECK.CA9 ┆ 934 ┆ 904 ┆ 741 ┆ 183 ┆ 455 │
└───────────┴──────────┴──────────┴─────────┴─────────┴─────────┘
Now what I am trying to do is rename the columns to shorter names such as
┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id ┆ M2 ┆ Y5 ┆ M2 ┆ Y3 ┆ Y5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡
or
┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id ┆ libor┆ libor┆ estr┆ estr┆ estr│
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡
and then collapse (= sum) across the columns with the same name, so that I get e.g.
┌───────────┬──────┬──────┬──────┐
│ id ┆ M2 ┆ Y5 ┆ Y3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪══════╡
│ CHECK.US1 ┆ 198 ┆ 99 ┆ -8 │
│ CHECK.US2 ┆ 603 ┆ 761 ┆ -24 │
│ CHECK.CA9 ┆ 1675 ┆ 1359 ┆ 183 │
└───────────┴──────┴──────┴──────┘
I tried renaming them first, but got polars.exceptions.DuplicateError: the name 'M2' is duplicate
.
Is there a way to achieve what I am trying to do?
edit: I also tried something like
rename_func = lambda col: col.split('.')[-1]
new_cols = set([rename_func(c) for c in df.columns])
df.with_columns([
pl.sum_horizontal(pl.all().map(rename_func) == c).alias(c) for c in new_cols
])
but it doesn't quite work.
You could build a dictionary of the columns to aggregate, then use pl.sum_horizontal
:
cols = {}
for col in df.columns:
suffix = col.split('.')[-1]
cols.setdefault(suffix, []).append(col)
# {'id': ['id'], 'M2': ['libor.M2', 'estr.M2'],
# 'Y5': ['libor.Y5', 'estr.Y5'], 'Y3': ['estr.Y3']}
df.select([pl.sum_horizontal(pl.col(c) for c in l).alias(k)
for k, l in cols.items()
])
Output:
┌───────────┬──────┬──────┬─────┐
│ id ┆ M2 ┆ Y5 ┆ Y3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╡
│ CHECK.US1 ┆ 198 ┆ 99 ┆ -8 │
│ CHECK.US2 ┆ 603 ┆ 761 ┆ -24 │
│ CHECK.CA9 ┆ 1675 ┆ 1359 ┆ 183 │
└───────────┴──────┴──────┴─────┘