Previous to Polars version 0.20.7, the pivot() method, if given multiple values for the columns argument, would apply the aggregation logic against each column in columns individually based on the index column, rather than against a collective set of columns.
Before:
df = pl.DataFrame(
{
"foo": ["one", "one", "two", "two", "one", "two"],
"bar": ["y", "y", "y", "x", "x", "x"],
"biz": ['m', 'f', 'm', 'f', 'm', 'f'],
"baz": [1, 2, 3, 4, 5, 6],
}
)
df.pivot(index='foo', values='baz', columns=('bar', 'biz'), aggregate_function='sum')
returns:
shape: (2, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ foo ┆ y ┆ x ┆ m ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ one ┆ 3 ┆ 5 ┆ 6 ┆ 2 │
│ two ┆ 3 ┆ 10 ┆ 3 ┆ 10 │
└─────┴─────┴─────┴─────┴─────┘
After (in 0.20.7):
shape: (2, 5)
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ foo ┆ {"y","m"} ┆ {"y","f"} ┆ {"x","f"} ┆ {"x","m"} │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ one ┆ 1 ┆ 2 ┆ null ┆ 5 │
│ two ┆ 3 ┆ null ┆ 10 ┆ null │
└─────┴───────────┴───────────┴───────────┴───────────┘
I like the previous functionality much better; it's very awkward to deal with the new pivoted table, especially given its column names. Polars devs put this change under "Bug fixes" but it actually broke my code.
I see a couple of ways you could do it:
First, you can use melt() for your DataFrame first and then pivot:
df.melt(["foo", "baz"])
─────┬─────┬──────────┬───────┐
│ foo ┆ baz ┆ variable ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str │
╞═════╪═════╪══════════╪═══════╡
│ one ┆ 1 ┆ bar ┆ y │
│ one ┆ 2 ┆ bar ┆ y │
│ two ┆ 3 ┆ bar ┆ y │
│ two ┆ 4 ┆ bar ┆ x │
│ one ┆ 5 ┆ bar ┆ x │
│ … ┆ … ┆ … ┆ … │
│ one ┆ 2 ┆ biz ┆ f │
│ two ┆ 3 ┆ biz ┆ m │
│ two ┆ 4 ┆ biz ┆ f │
│ one ┆ 5 ┆ biz ┆ m │
│ two ┆ 6 ┆ biz ┆ f │
└─────┴─────┴──────────┴───────┘
and then pivot the same way:
(
df
.melt(["foo", "baz"])
.pivot(index='foo', values='baz', columns='value', aggregate_function='sum')
)
┌─────┬─────┬─────┬─────┬─────┐
│ foo ┆ y ┆ x ┆ m ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ one ┆ 3 ┆ 5 ┆ 6 ┆ 2 │
│ two ┆ 3 ┆ 10 ┆ 3 ┆ 10 │
└─────┴─────┴─────┴─────┴─────┘
Alternatively, if lists of possible values are not intersecting between columns you need to aggregate, you could use concat() with how = align:
pl.concat(
[df.pivot(index='foo', values='baz', columns=col, aggregate_function="sum") for col in ['bar','biz']],
how='align'
)
┌─────┬─────┬─────┬─────┬─────┐
│ foo ┆ y ┆ x ┆ m ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ one ┆ 3 ┆ 5 ┆ 6 ┆ 2 │
│ two ┆ 3 ┆ 10 ┆ 3 ┆ 10 │
└─────┴─────┴─────┴─────┴─────┘