Tags: python, python-polars

Sum columns of one dataframe based on another dataframe


I have two dataframes that look like these:

import polars as pl

df1 = pl.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Year": [2001, 2003, 2003, 2004]
    }
)

df2 = pl.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "2001": [111, 112, 113, 114],
        "2002": [221, 222, 223, 224],
        "2003": [331, 332, 333, 334],
        "2004": [441, 442, 443, 444]
    }
)

I'd like to sum each year column of the second df (df2), taking into account only names whose corresponding year in df1 is the same year or later. Desired output:

┌──────┬──────┐
│ Year ┆  Sum │
╞══════╪══════╡
│ 2001 ┆  111 │  
│ 2002 ┆  221 │
│ 2003 ┆  996 │  (= 331 + 332 + 333)
│ 2004 ┆ 1770 │  (= 441 + 442 + 443 + 444)
└──────┴──────┘

I'm new to Polars (coming from Pandas), and I'm not sure how to do this. Any help will be appreciated.


Solution

  • You can unpivot df2 to bring it into a form more "compatible" with df1: unpivot turns the specified column headers into rows. At this point Year should also be cast from str to int:

    df2.unpivot(index='Name', variable_name='Year').with_columns(
        pl.col('Year').cast(pl.Int64)
    )
    
    shape: (16, 3)
    ┌──────┬──────┬───────┐
    │ Name ┆ Year ┆ value │
    │ ---  ┆ ---  ┆ ---   │
    │ str  ┆ i64  ┆ i64   │
    ╞══════╪══════╪═══════╡
    │ A    ┆ 2001 ┆ 111   │
    │ B    ┆ 2001 ┆ 112   │
    │ C    ┆ 2001 ┆ 113   │
    │ D    ┆ 2001 ┆ 114   │
    │ A    ┆ 2002 ┆ 221   │
    │ B    ┆ 2002 ┆ 222   │
    │ C    ┆ 2002 ┆ 223   │
    │ D    ┆ 2002 ┆ 224   │
    │ A    ┆ 2003 ┆ 331   │
    │ B    ┆ 2003 ┆ 332   │
    │ C    ┆ 2003 ┆ 333   │
    │ D    ┆ 2003 ┆ 334   │
    │ A    ┆ 2004 ┆ 441   │
    │ B    ┆ 2004 ┆ 442   │
    │ C    ┆ 2004 ┆ 443   │
    │ D    ┆ 2004 ┆ 444   │
    └──────┴──────┴───────┘
    

    You can then join this with df1 on Name and filter the resulting rows by the year condition you specified:

    (...).join(df1, on='Name').filter(pl.col('Year') >= pl.col('Year_right'))
    
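    As a sanity check of the filter logic: A (2001 in df1) keeps all four years, B and C (2003) keep only 2003 and 2004, and D (2004) keeps only 2004, so nine of the sixteen unpivoted rows survive. A self-contained sketch of this intermediate step (df1's Year column arrives from the join with the default "_right" suffix):

    ```python
    import polars as pl

    df1 = pl.DataFrame(
        {"Name": ["A", "B", "C", "D"], "Year": [2001, 2003, 2003, 2004]}
    )
    df2 = pl.DataFrame(
        {
            "Name": ["A", "B", "C", "D"],
            "2001": [111, 112, 113, 114],
            "2002": [221, 222, 223, 224],
            "2003": [331, 332, 333, 334],
            "2004": [441, 442, 443, 444],
        }
    )

    # Unpivot, cast Year to int, join on Name, then keep only rows whose
    # unpivoted Year is at or after that name's Year from df1.
    filtered = (
        df2.unpivot(index="Name", variable_name="Year")
        .with_columns(pl.col("Year").cast(pl.Int64))
        .join(df1, on="Name")
        .filter(pl.col("Year") >= pl.col("Year_right"))
    )
    print(filtered)  # 9 rows: A×4, B×2, C×2, D×1
    ```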

    From there it's a simple group_by and aggregation on Year. All together:

    df2.unpivot(index='Name', variable_name='Year').with_columns(
        pl.col('Year').cast(pl.Int64)
    ).join(df1, on='Name').filter(pl.col('Year') >= pl.col('Year_right')).group_by(
        'Year', maintain_order=True
    ).agg(
        Sum=pl.col('value').sum()
    )
    
    shape: (4, 2)
    ┌──────┬──────┐
    │ Year ┆ Sum  │
    │ ---  ┆ ---  │
    │ i64  ┆ i64  │
    ╞══════╪══════╡
    │ 2001 ┆ 111  │
    │ 2002 ┆ 221  │
    │ 2003 ┆ 996  │
    │ 2004 ┆ 1770 │
    └──────┴──────┘
    

    (maintain_order is optional but slower; you can achieve the same effect with a .sort('Year') after the agg if ordered output is actually needed.)
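    That variant looks like this (same df1/df2 as above; the unordered group_by is followed by an explicit sort):

    ```python
    import polars as pl

    df1 = pl.DataFrame(
        {"Name": ["A", "B", "C", "D"], "Year": [2001, 2003, 2003, 2004]}
    )
    df2 = pl.DataFrame(
        {
            "Name": ["A", "B", "C", "D"],
            "2001": [111, 112, 113, 114],
            "2002": [221, 222, 223, 224],
            "2003": [331, 332, 333, 334],
            "2004": [441, 442, 443, 444],
        }
    )

    out = (
        df2.unpivot(index="Name", variable_name="Year")
        .with_columns(pl.col("Year").cast(pl.Int64))
        .join(df1, on="Name")
        .filter(pl.col("Year") >= pl.col("Year_right"))
        .group_by("Year")  # no maintain_order: group order is not guaranteed
        .agg(Sum=pl.col("value").sum())
        .sort("Year")  # restore chronological order after the aggregation
    )
    print(out)
    ```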