I am trying to pivot a table on 2 value columns, apply some user defined function on the pivotted dataframe and then unpivot (stack in pandas). In pandas, this would look like this:
import pandas as pd
import polars as pl
from polars import col, lit, selectors as cs
df = pl.DataFrame(
{
"foo": [1, 1, 2, 2, 3, 3],
"bar": ["y", "x", "y", "x", "y", "x"],
"baz1": [1, 2, None, 4, 5, None],
"baz2": [1, None, 3, 4, None, 6]
}
)
df
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 │
╞═════╪═════╪══════╪══════╡
│ 1 ┆ y ┆ 1 ┆ 1 │
│ 1 ┆ x ┆ 2 ┆ null │
│ 2 ┆ y ┆ null ┆ 3 │
│ 2 ┆ x ┆ 4 ┆ 4 │
│ 3 ┆ y ┆ 5 ┆ null │
│ 3 ┆ x ┆ null ┆ 6 │
└─────┴─────┴──────┴──────┘
'''
pd_df = df.to_pandas()
index_col = ['foo']
columns_col = ['bar']
values_col = ['baz1', 'baz2']
def pd_udf(df): # for example purposes, let's assume the function can be more complex
return (
df.ffill() * 3
)
pd_res = (
pd_df.groupby(index_col + columns_col).first() # or .set_index(index_col + column) for same result as no duplicates
.unstack()
.pipe(pd_udf)
.stack()
.reset_index()
.sort_values(index_col + columns_col)
.pipe(pl.from_pandas)
)
pd_res
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 │
╞═════╪═════╪══════╪══════╡
│ 1 ┆ x ┆ 6.0 ┆ null │
│ 1 ┆ y ┆ 3.0 ┆ 3.0 │
│ 2 ┆ x ┆ 12.0 ┆ 12.0 │
│ 2 ┆ y ┆ 3.0 ┆ 9.0 │
│ 3 ┆ x ┆ 12.0 ┆ 18.0 │
│ 3 ┆ y ┆ 15.0 ┆ 9.0 │
└─────┴─────┴──────┴──────┘
'''
I found 2 ways to achieve the same result, please let me know if there is a better way to achive the results I want to get.
It is not the prettiest solution, the stack bit is achieved with quite a lot of manipulations due to the f'{value_name}_{column_name}_{column_value}
format of the columns after the pivoting.
def pl_udf(df):
return (
df.with_columns(
pl.exclude(index_col).forward_fill() * lit(3)
)
)
lazy_df_1 = (
df
.pivot(on = columns_col,
index = index_col,
values = values_col)
.lazy()
.pipe(pl_udf)
# pd stack bit
.unpivot(index = index_col)
.select(
col(index_col+['value']),
col('variable').str.split('_').list.get(0).alias('temp'),
col('variable').str.split('_').list.get(1).alias(columns_col[0]),
)
.collect()
.pivot(on = 'temp',
index = index_col+columns_col,
values = 'value')
.lazy()
.sort(index_col+columns_col)
)
lazy_df_1.collect()
The udf applied to series needs to convert the Series to a Dataframe to be able to leverage Dataframe methods. Again, I'm not sure that is the best solution.
def pl_udf_series(s):
'''
to apply to series directly
'''
return (
s.to_frame() # to use with dataframe functions (my udf will use dataframe functions)
.select(col(s.name).forward_fill() * lit(3))
.to_series().to_list()
)
lazy_df_2 = (
df.lazy()
.group_by(columns_col)
.agg(
col(index_col),
col(values_col).map_elements(pl_udf_series)
)
.explode(columns=index_col + values_col)
.sort(index_col+columns_col)
.select(col(index_col+columns_col+values_col)) # reordering
)
lazy_df_2.collect()
Both implementations give the expected result:
pd_res.equals(lazy_df_1.collect())
# True
pd_res.equals(lazy_df_2.collect())
# True
In terms of performance:
Do you really need to pivot / unpivot your dataframe? You can apply functions within the group with over()
method:
def func1(col): # for example purposes, let's assume the function can be more complex
return col.forward_fill().over("bar") * lit(3)
res = df.with_columns(func1(pl.all().exclude(['foo', 'bar'])))
print(res.sort(['foo','bar']))
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 │
╞═════╪═════╪══════╪══════╡
│ 1 ┆ x ┆ 6 ┆ null │
│ 1 ┆ y ┆ 3 ┆ 3 │
│ 2 ┆ x ┆ 12 ┆ 12 │
│ 2 ┆ y ┆ 3 ┆ 9 │
│ 3 ┆ x ┆ 12 ┆ 18 │
│ 3 ┆ y ┆ 15 ┆ 9 │
└─────┴─────┴──────┴──────┘