Say we have this dataframe:
import polars as pl
df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
'US': {'size': 100, 'GDP': 800},
'AS': {'size': 80, 'GDP': 500}})
shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU ┆ US ┆ AS │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80} ┆ {100,800} ┆ {80,500} │
└───────────┴───────────┴───────────┘
I am looking for a function like df.expand_structs(column_name='metric')
that gives
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU ┆ US ┆ AS │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size ┆ 10 ┆ 100 ┆ 80 │
│ GBP ┆ 80 ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
I've tried other functions like unnest
, explode
but no luck. Any help appreciated!
TL;DR
Performance comparison at the end.
Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.lazyframe with small Structs (e.g., struct[2]
) and columns N <= 15
(not collected). Other methods fail lazily.
With bigger Structs and/or columns N > 15
, both unpivot
options below start to outperform. Other suggested methods thus far slower in general.
Option 1
out = (df.unpivot()
.unnest('value')
.select(pl.exclude('variable'))
.transpose(include_header=True)
.pipe(
lambda x: x.rename(
dict(zip(x.columns, ['metric'] + df.columns))
)
)
)
Output:
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU ┆ US ┆ AS │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size ┆ 10 ┆ 100 ┆ 80 │
│ GDP ┆ 80 ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
Explanation / Intermediates
df.unpivot
:shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ struct[2] │
╞══════════╪═══════════╡
│ EU ┆ {10,80} │
│ US ┆ {100,800} │
│ AS ┆ {80,500} │
└──────────┴───────────┘
df.unnest
on new 'value' column:shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪══════╪═════╡
│ EU ┆ 10 ┆ 80 │
│ US ┆ 100 ┆ 800 │
│ AS ┆ 80 ┆ 500 │
└──────────┴──────┴─────┘
df.select
to exclude 'variable' column (pl.exclude
) and df.transpose
with include_header=True
:shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════╡
│ size ┆ 10 ┆ 100 ┆ 80 │
│ GDP ┆ 80 ┆ 800 ┆ 500 │
└────────┴──────────┴──────────┴──────────┘
df.pipe
+ df.rename
. Without the chained operation, that can also be:out.columns = ['metric'] + df.columns
Option 2
out2 = (df.unpivot()
.unnest('value')
.unpivot(index='variable', variable_name='metric')
.pivot(on='variable', index='metric')
)
Equality check:
out.equals(out2)
# True
Explanation / Intermediates
df.unpivot
to get:shape: (6, 3)
┌────────┬────────┬───────┐
│ column ┆ metric ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞════════╪════════╪═══════╡
│ EU ┆ size ┆ 10 │
│ US ┆ size ┆ 100 │
│ AS ┆ size ┆ 80 │
│ EU ┆ GDP ┆ 80 │
│ US ┆ GDP ┆ 800 │
│ AS ┆ GDP ┆ 500 │
└────────┴────────┴───────┘
df.pivot
on 'column' with 'metric' as the index to get desired shape.Performance comparison (gist)
Columns: n_range=[2**k for k in range(12)]
Struct: 2, 20, 100
Methods compared:
Results: