I have two dataframes that look like those:
df1 = pl.DataFrame(
{
"Name": ["A", "B", "C", "D"],
"Year": [2001, 2003, 2003, 2004]
}
)
df2 = pl.DataFrame(
{
"Name": ["A", "B", "C", "D"],
"2001": [111, 112, 113, 114],
"2002": [221, 222, 223, 224],
"2003": [331, 332, 333, 334],
"2004": [441, 442, 443, 444]
}
)
I'd like to sum each year column of the second df (df2), taking in account only names whose corresponding year in df1 is the same year or later. Desired output:
┌──────┬──────┐
│ Year ┆ Sum │
╞══════╪══════╡
│ 2001 ┆ 111 │
│ 2002 ┆ 221 │
│ 2003 ┆ 996 │ (= 331 + 332 + 333)
│ 2004 ┆ 1770 │ (= 441 + 442 + 443 + 444)
└──────┴──────┘
I'm new to Polars (coming from Pandas), and I'm not sure how to do this. Any help will be appreciated.
You can unpivot
df2
to put it in a form more "compatible" with df1
. It "unpivots" the specified column headers into rows. Year
should be cast into an int
column at this point from str
:
df2.unpivot(index='Name', variable_name='Year').with_columns(
pl.col('Year').cast(pl.Int64)
)
shape: (16, 3)
┌──────┬──────┬───────┐
│ Name ┆ Year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╡
│ A ┆ 2001 ┆ 111 │
│ B ┆ 2001 ┆ 112 │
│ C ┆ 2001 ┆ 113 │
│ D ┆ 2001 ┆ 114 │
│ A ┆ 2002 ┆ 221 │
│ B ┆ 2002 ┆ 222 │
│ C ┆ 2002 ┆ 223 │
│ D ┆ 2002 ┆ 224 │
│ A ┆ 2003 ┆ 331 │
│ B ┆ 2003 ┆ 332 │
│ C ┆ 2003 ┆ 333 │
│ D ┆ 2003 ┆ 334 │
│ A ┆ 2004 ┆ 441 │
│ B ┆ 2004 ┆ 442 │
│ C ┆ 2004 ┆ 443 │
│ D ┆ 2004 ┆ 444 │
└──────┴──────┴───────┘
You can then join
this with df1
on the Name
, and filter the resulting rows by the year condition you specified:
(...).join(df1, on='Name').filter(pl.col('Year') >= pl.col('Year_right'))
From there it's a simple group_by
and agg
regation on the Year
. All together:
df2.unpivot(index='Name', variable_name='Year').with_columns(
pl.col('Year').cast(pl.Int64)
).join(df1, on='Name').filter(pl.col('Year') >= pl.col('Year_right')).group_by(
'Year', maintain_order=True
).agg(
Sum=pl.col('value').sum()
)
shape: (4, 2)
┌──────┬──────┐
│ Year ┆ Sum │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 2001 ┆ 111 │
│ 2002 ┆ 221 │
│ 2003 ┆ 996 │
│ 2004 ┆ 1770 │
└──────┴──────┘
(maintain_order
is optional but slower, can achieve the same effect with a .sort('Year')
after the agg
if truly desired.)