I just got started with Polars (Python), so this may be an ignorant question. I have a DF like the image shows, where one of the columns (series) contains a numpy array of length 18. I would like to do a group_by on the group column and a pairwise sum aggregation on the series column, but I can't figure out a good way to do that in Polars. I can, of course, just do a map_elements and np.sum the arrays (as in the example below), but I'm hoping there is a way to optimize it.
Here is my current implementation, which achieves the desired effect, but I don't think it is optimal because it uses map_elements. Is there a Polars expression that achieves the same thing, or is this the best I can do (without learning Rust, which I will someday)?
import polars as pl
import numpy as np
data = [
{'group': 1,
'series': np.array([ 2398, 2590, 3000, 3731, 3986, 4603, 4146, 4325, 6068,
6028, 7486, 7759, 8323, 8961, 9598, 10236, 10873, 11511])},
{'group': 1,
'series': np.array([ 2398, 2590, 3000, 3731, 3986, 4603, 4146, 4325, 6068,
6028, 7486, 7759, 8323, 8961, 9598, 10236, 10873, 11511])},
{'group': 2,
'series': np.array([1132, 1269, 1452, 1687, 1389, 1655, 1532, 1661, 1711, 1528, 1582,
1638, 1603, 1600, 1597, 1594, 1591, 1588])},
{'group': 3,
'series': np.array([ 2802, 3065, 3811, 4823, 4571, 4817, 4668, 5110, 6920,
7131, 10154, 11138, 11699, 12840, 13981, 15123, 16264, 17405])},
]
df = pl.DataFrame(data)
# this performs the desired aggregation (pairwise sum of 'series' arrays)
# sums first two rows together (group 1), leaves others unchanged
df.group_by('group').agg(
pl.col('series').map_elements(lambda x: np.sum(x.to_list(), axis=0))
).to_dicts()
'''
desired output
group series
i64 object
2 [1132 1269 1452 1687 1389 1655 1532 1661 1711 1528 1582 1638 1603 1600
1597 1594 1591 1588]
1 [ 4796 5180 6000 7462 7972 9206 8292 8650 12136 12056 14972 15518
16646 17922 19196 20472 21746 23022]
3 [ 2802 3065 3811 4823 4571 4817 4668 5110 6920 7131 10154 11138
11699 12840 13981 15123 16264 17405]
'''
Thank you in advance for any help.
Firstly, any time you construct a Polars DF and see that one of your columns is an Object, stop and figure out how to make it a real dtype. Otherwise, everything else you do will either not work or will fall back on Python iteration, making it very slow.
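For instance, with the df from the question, you can spot the problem in the schema (a minimal check; the exact repr varies by Polars version):
print(df.schema)
# {'group': Int64, 'series': Object}  <- Object is the warning sign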
Say you start with your data dict; then you can do something like:
df = pl.DataFrame(
    [
        # wrap each numpy array in a pl.Series so the column gets a real List dtype
        {a: pl.Series(b) if isinstance(b, np.ndarray) else b for a, b in x.items()}
        for x in data
    ]
)
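A quick sanity check that the conversion worked (again, the exact repr varies by version):
print(df.schema)
# {'group': Int64, 'series': List(Int64)}
# now list expressions run natively, e.g. df.select(pl.col('series').list.len())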
After that, there's not really an obvious way to do it, but one rule of thumb is that it's almost always easier to explode lists. In this case, since the order of the list is important to the later aggregations, you need to make an index column. Following that, you explode by the series column and the index column. Then you group_by the group and index and sum the values. Since you want to end up with lists, you then group_by the group to put the values back into a list. All together, this is it:
(
    df
    # index each element so its position survives the explode
    .with_columns(series_idx=pl.int_ranges(0, pl.col('series').list.len()))
    .explode('series', 'series_idx')
    # sum elements that share a group and a position
    .group_by('group', 'series_idx', maintain_order=True)
    .agg(pl.col('series').sum())
    # re-collect the per-position sums into one list per group
    .group_by('group', maintain_order=True)
    .agg('series')
)
shape: (3, 2)
┌───────┬───────────────────────┐
│ group ┆ series │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═══════╪═══════════════════════╡
│ 1 ┆ [4796, 5180, … 23022] │
│ 2 ┆ [1132, 1269, … 1588] │
│ 3 ┆ [2802, 3065, … 17405] │
└───────┴───────────────────────┘
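If you want numpy arrays back at the end (your desired output holds arrays, not Polars lists), you can convert per row. A minimal sketch, where out is assumed to be the result frame above:
# `out` = the result frame from the pipeline above (an assumption, not shown there)
arrays = [np.asarray(lst) for lst in out['series'].to_list()]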
If all of your lists are equal width, you could do this:
cols_to_add = ['series']
(
    df
    # turn each list into a struct with one field per element...
    .with_columns(
        pl.col(col).list.to_struct(fields=lambda x, col=col: f"_{col}_{x}")
        for col in cols_to_add
    )
    # ...and then into one real column per element
    .unnest(cols_to_add)
    .group_by('group', maintain_order=True)
    .agg(pl.col(f'^_{col}_.+$').sum() for col in cols_to_add)
    # stitch the per-element sums back into a single list column
    .select(
        'group',
        *[pl.concat_list(pl.col(f'^_{col}_.+$')).alias(col)
          for col in cols_to_add]
    )
)
This should also work for multiple series columns, although I didn't test it with more than one.
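For example, a minimal sketch with a second, hypothetical list column (series2 is just the reversed series, made up here to exercise the loop):
# hypothetical second list column so cols_to_add has two entries
df2 = df.with_columns(series2=pl.col('series').list.reverse())
cols_to_add = ['series', 'series2']
(
    df2
    .with_columns(
        pl.col(col).list.to_struct(fields=lambda x, col=col: f"_{col}_{x}")
        for col in cols_to_add
    )
    .unnest(cols_to_add)
    .group_by('group', maintain_order=True)
    .agg(pl.col(f'^_{col}_.+$').sum() for col in cols_to_add)
    # each column keeps its own _<col>_<i> fields, so the sums stay separate
    .select(
        'group',
        *[pl.concat_list(pl.col(f'^_{col}_.+$')).alias(col)
          for col in cols_to_add]
    )
)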