Tags: python-3.x, dataframe, python-polars

Polars pairwise sum of array column


I just got started with Polars (Python), so this may be an ignorant question. I have a DataFrame like the one built below, where one of the columns (series) contains a numpy array of length 18. I would like to group by the group column and do a pairwise sum aggregation on the series column, but I can't figure out a good way to do that in Polars. I can, of course, just use map_elements and np.sum the arrays (as in the example below), but I'm hoping there is a way to optimize it.

Here is my current implementation. It achieves the desired effect, but I don't think it is optimal because it uses map_elements. Is there a Polars expression that achieves the same thing, or is this the best I can do (without learning Rust, which I will someday)?

import polars as pl
import numpy as np
data = [
    {'group': 1,
     'series': np.array([2398, 2590, 3000, 3731, 3986, 4603, 4146, 4325, 6068,
                         6028, 7486, 7759, 8323, 8961, 9598, 10236, 10873, 11511])},
    {'group': 1,
     'series': np.array([2398, 2590, 3000, 3731, 3986, 4603, 4146, 4325, 6068,
                         6028, 7486, 7759, 8323, 8961, 9598, 10236, 10873, 11511])},
    {'group': 2,
     'series': np.array([1132, 1269, 1452, 1687, 1389, 1655, 1532, 1661, 1711,
                         1528, 1582, 1638, 1603, 1600, 1597, 1594, 1591, 1588])},
    {'group': 3,
     'series': np.array([2802, 3065, 3811, 4823, 4571, 4817, 4668, 5110, 6920,
                         7131, 10154, 11138, 11699, 12840, 13981, 15123, 16264, 17405])},
]
df = pl.DataFrame(data)
# this performs the desired aggregation (pairwise sum of 'series' arrays):
# it sums the first two rows together (group 1) and leaves the others unchanged
df.group_by('group').agg(
    pl.col('series').map_elements(lambda x: np.sum(x.to_list(), axis=0))
).to_dicts()
'''
desired output

group    series
i64    object
2    [1132 1269 1452 1687 1389 1655 1532 1661 1711 1528 1582 1638 1603 1600
 1597 1594 1591 1588]
1    [ 4796  5180  6000  7462  7972  9206  8292  8650 12136 12056 14972 15518
 16646 17922 19196 20472 21746 23022]
3    [ 2802  3065  3811  4823  4571  4817  4668  5110  6920  7131 10154 11138
 11699 12840 13981 15123 16264 17405]

'''

Thank you in advance for any help.


Solution

  • Firstly, any time you construct a Polars DataFrame and see that one of your columns is an Object, stop and go back and figure out how to make it a real dtype. Otherwise, everything else you do will either not work or will rely on Python iteration, making it very slow.

    Say you start with your data list of dicts; then you can do something like this:

    df = pl.DataFrame(
        [
            {a: pl.Series(b) if isinstance(b, np.ndarray) else b
             for a, b in x.items()}
            for x in data
        ]
    )
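
    As a quick check, the rebuilt frame should now report a real nested dtype rather than Object. A minimal sketch, assuming the sample data above (the exact repr varies by Polars version):

    print(df.schema)
    # expected something like: {'group': Int64, 'series': List(Int64)}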
    

    After that, there's not really an obvious way to do it, but one rule of thumb is that it's almost always easier to explode lists. In this case, since the order within each list matters to the later aggregation, you need to make an index column first. Then you explode by the series column and the index column, and group_by the group and index and sum the values. Since you want to end up with lists, you then group_by the group again to put the values back into a list. Putting it all together:

    (
        df
        .with_columns(series_idx=pl.int_ranges(0, pl.col('series').list.len()))
        .explode('series', 'series_idx')
        .group_by('group', 'series_idx', maintain_order=True)
        .agg(pl.col('series').sum())
        .group_by('group', maintain_order=True)
        .agg('series')
    )
    shape: (3, 2)
    ┌───────┬───────────────────────┐
    │ group ┆ series                │
    │ ---   ┆ ---                   │
    │ i64   ┆ list[i64]             │
    ╞═══════╪═══════════════════════╡
    │ 1     ┆ [4796, 5180, … 23022] │
    │ 2     ┆ [1132, 1269, … 1588]  │
    │ 3     ┆ [2802, 3065, … 17405] │
    └───────┴───────────────────────┘
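
    If you want numpy arrays back at the end, to match your desired output exactly, you can convert the list column afterwards. A minimal sketch, assuming the result above is bound to a variable named out (a name introduced here for illustration):

    # `out` holds the result of the explode/group_by pipeline above
    arrays = {g: np.array(s)
              for g, s in zip(out['group'].to_list(), out['series'].to_list())}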
    

    Struct-based method

    If all of your lists have equal length, you could do this:

    cols_to_add = ['series']
    (
        df
        .with_columns(
            pl.col(col).list.to_struct(
                fields=lambda x, col=col: f"_{col}_{x}"
            )
            for col in cols_to_add
        )
        .unnest(cols_to_add)
        .group_by('group', maintain_order=True)
        .agg(pl.col(f'^_{col}_.+$').sum() for col in cols_to_add)
        .select(
            'group',
            *[pl.concat_list(pl.col(f'^_{col}_.+$')).alias(col)
              for col in cols_to_add]
        )
    )
    

    This should also work for multiple series columns, although I haven't tested it with more than one; a sketch of the two-column case follows.
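
    For illustration, here is a minimal sketch of the two-column case with a small made-up frame; the df2 and out2 names and the data are hypothetical, just to show the shape of the pipeline:

    df2 = pl.DataFrame({
        'group': [1, 1, 2],
        'series': [[1, 2], [3, 4], [5, 6]],
        'series2': [[10, 20], [30, 40], [50, 60]],  # hypothetical second list column
    })
    cols_to_add = ['series', 'series2']
    out2 = (
        df2
        .with_columns(
            pl.col(col).list.to_struct(fields=lambda x, col=col: f"_{col}_{x}")
            for col in cols_to_add
        )
        .unnest(cols_to_add)
        .group_by('group', maintain_order=True)
        .agg(pl.col(f'^_{col}_.+$').sum() for col in cols_to_add)
        .select(
            'group',
            *[pl.concat_list(pl.col(f'^_{col}_.+$')).alias(col)
              for col in cols_to_add]
        )
    )
    # expected: group 1 -> series [4, 6] and series2 [40, 60]; group 2 unchanged

    One caveat: the ^_{col}_.+$ pattern can match across columns if one name plus an underscore is a prefix of another (e.g. 'series' and 'series_2'), so distinct, non-prefix names are safest.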