Expand list of struct column in `polars`


I have a pl.DataFrame with a column that is a list of struct entries. The lengths of the lists might differ:

import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "s": [
            [
                {"a": 1, "b": 1},
                {"a": 2, "b": 2},
                {"a": 3, "b": 3},
            ],
            [
                {"a": 10, "b": 10},
                {"a": 20, "b": 20},
                {"a": 30, "b": 30},
                {"a": 40, "b": 40},
            ],
            [
                {"a": 100, "b": 100},
                {"a": 200, "b": 200},
                {"a": 300, "b": 300},
                {"a": 400, "b": 400},
                {"a": 500, "b": 500},
            ],
        ],
    }
)

Printed, it looks like this:

shape: (3, 2)
┌─────┬─────────────────────────────────┐
│ id  ┆ s                               │
│ --- ┆ ---                             │
│ i64 ┆ list[struct[2]]                 │
╞═════╪═════════════════════════════════╡
│ 1   ┆ [{1,1}, {2,2}, {3,3}]           │
│ 2   ┆ [{10,10}, {20,20}, … {40,40}]   │
│ 3   ┆ [{100,100}, {200,200}, … {500,… │
└─────┴─────────────────────────────────┘

I've tried various combinations of unnest and explode, but I can't turn this into a long pl.DataFrame in which each list element becomes a row and each struct field becomes a column. This is the result I want:

pl.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        "a": [1, 2, 3, 10, 20, 30, 40, 100, 200, 300, 400, 500],
        "b": [1, 2, 3, 10, 20, 30, 40, 100, 200, 300, 400, 500],
    }
)

Which looks like this:

shape: (12, 3)
┌─────┬─────┬─────┐
│ id  ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 2   ┆ 2   │
│ 1   ┆ 3   ┆ 3   │
│ 2   ┆ 10  ┆ 10  │
│ 2   ┆ 20  ┆ 20  │
│ …   ┆ …   ┆ …   │
│ 3   ┆ 100 ┆ 100 │
│ 3   ┆ 200 ┆ 200 │
│ 3   ┆ 300 ┆ 300 │
│ 3   ┆ 400 ┆ 400 │
│ 3   ┆ 500 ┆ 500 │
└─────┴─────┴─────┘

Is there a way to manipulate the first pl.DataFrame into the second pl.DataFrame?


Solution

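  • First explode the list column so that each list element gets its own row, then unnest the resulting struct column so that each struct field becomes its own column: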

    df.explode('s').unnest('s')
    

    Output:

    ┌─────┬─────┬─────┐
    │ id  ┆ a   ┆ b   │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╡
    │ 1   ┆ 1   ┆ 1   │
    │ 1   ┆ 2   ┆ 2   │
    │ 1   ┆ 3   ┆ 3   │
    │ 2   ┆ 10  ┆ 10  │
    │ 2   ┆ 20  ┆ 20  │
    │ …   ┆ …   ┆ …   │
    │ 3   ┆ 100 ┆ 100 │
    │ 3   ┆ 200 ┆ 200 │
    │ 3   ┆ 300 ┆ 300 │
    │ 3   ┆ 400 ┆ 400 │
    │ 3   ┆ 500 ┆ 500 │
    └─────┴─────┴─────┘