pythondataframepython-polars

Expand Columns of Structs to Rows in polars


Say we have this dataframe:

import polars as pl
df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})

shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘

I am looking for a function like df.expand_structs(column_name='metric') that gives

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GBP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

I've tried other functions like unnest, explode but no luck. Any help appreciated!


Solution

  • TL;DR

    Performance comparison at the end.

    Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.lazyframe with small Structs (e.g., struct[2]) and columns N <= 15 (not collected). Other methods fail lazily.

    With bigger Structs and/or columns N > 15, both unpivot options below start to outperform. Other suggested methods thus far slower in general.


    Option 1

    out = (df.unpivot()
           .unnest('value')
           .select(pl.exclude('variable'))
           .transpose(include_header=True)
           .pipe(
               lambda x: x.rename(
                   dict(zip(x.columns, ['metric'] + df.columns))
                   )
               )
           )
    

    Output:

    shape: (2, 4)
    ┌────────┬─────┬─────┬─────┐
    │ metric ┆ EU  ┆ US  ┆ AS  │
    │ ---    ┆ --- ┆ --- ┆ --- │
    │ str    ┆ i64 ┆ i64 ┆ i64 │
    ╞════════╪═════╪═════╪═════╡
    │ size   ┆ 10  ┆ 100 ┆ 80  │
    │ GDP    ┆ 80  ┆ 800 ┆ 500 │
    └────────┴─────┴─────┴─────┘
    

    Explanation / Intermediates

    shape: (3, 2)
    ┌──────────┬───────────┐
    │ variable ┆ value     │
    │ ---      ┆ ---       │
    │ str      ┆ struct[2] │
    ╞══════════╪═══════════╡
    │ EU       ┆ {10,80}   │
    │ US       ┆ {100,800} │
    │ AS       ┆ {80,500}  │
    └──────────┴───────────┘
    
    shape: (3, 3)
    ┌──────────┬──────┬─────┐
    │ variable ┆ size ┆ GDP │
    │ ---      ┆ ---  ┆ --- │
    │ str      ┆ i64  ┆ i64 │
    ╞══════════╪══════╪═════╡
    │ EU       ┆ 10   ┆ 80  │
    │ US       ┆ 100  ┆ 800 │
    │ AS       ┆ 80   ┆ 500 │
    └──────────┴──────┴─────┘
    
    shape: (2, 4)
    ┌────────┬──────────┬──────────┬──────────┐
    │ column ┆ column_0 ┆ column_1 ┆ column_2 │
    │ ---    ┆ ---      ┆ ---      ┆ ---      │
    │ str    ┆ i64      ┆ i64      ┆ i64      │
    ╞════════╪══════════╪══════════╪══════════╡
    │ size   ┆ 10       ┆ 100      ┆ 80       │
    │ GDP    ┆ 80       ┆ 800      ┆ 500      │
    └────────┴──────────┴──────────┴──────────┘
    
    out.columns = ['metric'] + df.columns
    

    Option 2

    out2 = (df.unpivot()
            .unnest('value')
            .unpivot(index='variable', variable_name='metric')
            .pivot(on='variable', index='metric')
            )
    

    Equality check:

    out.equals(out2)
    # True
    

    Explanation / Intermediates

    shape: (6, 3)
    ┌────────┬────────┬───────┐
    │ column ┆ metric ┆ value │
    │ ---    ┆ ---    ┆ ---   │
    │ str    ┆ str    ┆ i64   │
    ╞════════╪════════╪═══════╡
    │ EU     ┆ size   ┆ 10    │
    │ US     ┆ size   ┆ 100   │
    │ AS     ┆ size   ┆ 80    │
    │ EU     ┆ GDP    ┆ 80    │
    │ US     ┆ GDP    ┆ 800   │
    │ AS     ┆ GDP    ┆ 500   │
    └────────┴────────┴───────┘
    

    Performance comparison (gist)

    Columns: n_range=[2**k for k in range(12)]

    Struct: 2, 20, 100

    Methods compared:

    Results:

    struct[2]

    struct[20]

    struct[100]