python-3.x, dataframe, pivot, python-polars, unpivot

Polars pivot + unpivot operation with multiple values (pandas stack / unstack alternative / UDF over)


I am trying to pivot a table on 2 value columns, apply a user-defined function to the pivoted dataframe, and then unpivot it (stack in pandas). In pandas, this would look like this:

import pandas as pd
import polars as pl
from polars import col, lit, selectors as cs

df = pl.DataFrame(
    {
        "foo": [1, 1, 2, 2, 3, 3],
        "bar": ["y", "x", "y", "x", "y", "x"],
        "baz1": [1, 2, None, 4, 5, None],
        "baz2": [1, None, 3, 4, None, 6]
    }
)
df
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ str ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ y   ┆ 1    ┆ 1    │
│ 1   ┆ x   ┆ 2    ┆ null │
│ 2   ┆ y   ┆ null ┆ 3    │
│ 2   ┆ x   ┆ 4    ┆ 4    │
│ 3   ┆ y   ┆ 5    ┆ null │
│ 3   ┆ x   ┆ null ┆ 6    │
└─────┴─────┴──────┴──────┘
'''

pd_df = df.to_pandas()

index_col = ['foo']
columns_col = ['bar']
values_col = ['baz1', 'baz2']

def pd_udf(df): # for example purposes, let's assume the function can be more complex
    return (
        df.ffill() * 3
    )

pd_res = (
    pd_df.groupby(index_col + columns_col).first() # or .set_index(index_col + columns_col); same result since there are no duplicate keys
    .unstack()
    .pipe(pd_udf)
    .stack()
    .reset_index()
    .sort_values(index_col + columns_col)
    .pipe(pl.from_pandas)
)
pd_res
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ str ┆ f64  ┆ f64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ x   ┆ 6.0  ┆ null │
│ 1   ┆ y   ┆ 3.0  ┆ 3.0  │
│ 2   ┆ x   ┆ 12.0 ┆ 12.0 │
│ 2   ┆ y   ┆ 3.0  ┆ 9.0  │
│ 3   ┆ x   ┆ 12.0 ┆ 18.0 │
│ 3   ┆ y   ┆ 15.0 ┆ 9.0  │
└─────┴─────┴──────┴──────┘
'''

I found 2 ways to achieve the same result in Polars; please let me know if there is a better way to get the result I want.

  1. With pivot, then unpivot, some column-name manipulation, and another pivot

It is not the prettiest solution: the stack bit requires quite a lot of manipulation due to the '{value_name}_{column_name}_{column_value}' format of the columns after pivoting.
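
For reference, the column names produced by the intermediate pivot can be inspected directly before deciding how to split them back apart (a quick check outside the pipeline; the exact naming scheme may differ across Polars versions):

wide = df.pivot(on=columns_col, index=index_col, values=values_col)
print(wide.columns)  # the wide column names that the string-splitting below has to undo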

def pl_udf(df):
    return (
        df.with_columns(
            pl.exclude(index_col).forward_fill() * lit(3)
        )
    )

lazy_df_1 = (
    df
    .pivot(on = columns_col,
           index = index_col,
           values = values_col)
    .lazy()
    .pipe(pl_udf)
    
    # pd stack bit
    .unpivot(index = index_col)
    .select(
        col(index_col+['value']),
        col('variable').str.split('_').list.get(0).alias('temp'),
        col('variable').str.split('_').list.get(1).alias(columns_col[0]),
    )
    .collect()
    .pivot(on = 'temp',
          index = index_col+columns_col,
          values = 'value')
    .lazy()
    .sort(index_col+columns_col)
)
lazy_df_1.collect()

  2. With group_by and explode, with a user-defined function applied to a Series

The UDF applied to the Series needs to convert the Series to a DataFrame to be able to leverage DataFrame methods. Again, I'm not sure this is the best solution.

def pl_udf_series(s):
    '''
    to apply to series directly
    '''
    return (
        s.to_frame() # to use with dataframe functions (my udf will use dataframe functions)
        .select(col(s.name).forward_fill() * lit(3))
        .to_series().to_list()
    )

lazy_df_2 = (
    df.lazy()
    .group_by(columns_col)
    .agg(
        col(index_col),
        col(values_col).map_elements(pl_udf_series)  
    )
    .explode(columns=index_col + values_col)
    .sort(index_col+columns_col)
    .select(col(index_col+columns_col+values_col)) # reordering
)
lazy_df_2.collect()

Both implementations give the expected result:

pd_res.equals(lazy_df_1.collect())
# True
pd_res.equals(lazy_df_2.collect())
# True

In terms of performance (a rough timing sketch follows the list):

  1. pandas solution: ~1000 microseconds
  2. pivot, unpivot, pivot: ~380 microseconds
  3. group_by, explode: ~450 microseconds (I have seen cases where this approach is faster than the pivot/unpivot/pivot one)
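
For reference, here is a minimal sketch of how such timings could be taken with timeit (the helper name time_us is mine, and the figures will vary with machine and library versions; each approach should be wrapped whole in a callable rather than just calling .collect(), since the first Polars pipeline above does part of its work eagerly while being built):

import timeit

def time_us(fn, n=1_000):
    """Mean runtime per call of fn, in microseconds."""
    return timeit.timeit(fn, number=n) / n * 1e6

pandas_run = lambda: (
    pd_df.groupby(index_col + columns_col).first()
    .unstack().pipe(pd_udf).stack()
    .reset_index().sort_values(index_col + columns_col)
    .pipe(pl.from_pandas)
)
print(f"pandas stack/unstack: {time_us(pandas_run):.0f} µs")
# the two Polars pipelines are timed the same way, each wrapped in its own callable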

Solution

  • Do you really need to pivot / unpivot your dataframe? You can apply functions within each group with the over() method:

    def func1(col): # for example purposes, let's assume the function can be more complex
        return col.forward_fill().over("bar") * lit(3)
    
    res = df.with_columns(func1(pl.all().exclude(['foo', 'bar'])))
    print(res.sort(['foo','bar']))
    
    ┌─────┬─────┬──────┬──────┐
    │ foo ┆ bar ┆ baz1 ┆ baz2 │
    │ --- ┆ --- ┆ ---  ┆ ---  │
    │ i64 ┆ str ┆ i64  ┆ i64  │
    ╞═════╪═════╪══════╪══════╡
    │ 1   ┆ x   ┆ 6    ┆ null │
    │ 1   ┆ y   ┆ 3    ┆ 3    │
    │ 2   ┆ x   ┆ 12   ┆ 12   │
    │ 2   ┆ y   ┆ 3    ┆ 9    │
    │ 3   ┆ x   ┆ 12   ┆ 18   │
    │ 3   ┆ y   ┆ 15   ┆ 9    │
    └─────┴─────┴──────┴──────┘
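
    The over() route keeps the original integer dtype, while the pandas pipeline in the question produced floats, so a cast is needed before comparing against pd_res. A quick sanity check, assuming the objects defined in the question are still in scope:

    check = res.sort(["foo", "bar"]).with_columns(
        pl.col("baz1", "baz2").cast(pl.Float64)  # match the f64 columns of pd_res
    )
    print(pd_res.equals(check))  # expected: True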