I have a dataframe with columns A, B and C where B and C are list columns.
import polars as pl

df = pl.DataFrame({
    'A': ['t', 'u', 'v'],
    'B': [['a', 'v', 'x'], ['f', 'g', 'h'], ['p', 'o', 'i']],
    'C': [[11, 12, 14], [41, 42, 43], [66, 77, 88]]
})
I need to combine them as follows:
Original:
┌─────┬─────────────────┬──────────────┐
│ A   ┆ B               ┆ C            │
│ --- ┆ ---             ┆ ---          │
│ str ┆ list[str]       ┆ list[i64]    │
╞═════╪═════════════════╪══════════════╡
│ t   ┆ ["a", "v", "x"] ┆ [11, 12, 14] │
│ u   ┆ ["f", "g", "h"] ┆ [41, 42, 43] │
│ v   ┆ ["p", "o", "i"] ┆ [66, 77, 88] │
└─────┴─────────────────┴──────────────┘
Final:
┌─────┬───────────────────────────────────┐
│ A   ┆ zip(B,C)                          │
│ --- ┆ ---                               │
│ str ┆ object(?)                         │
╞═════╪═══════════════════════════════════╡
│ t   ┆ [('a', 11), ('v', 12), ('x', 14)] │
│ u   ┆ [('f', 41), ('g', 42), ('h', 43)] │
│ v   ┆ [('p', 66), ('o', 77), ('i', 88)] │
└─────┴───────────────────────────────────┘
In plain Python I would use zip(), but this approach does not scale.
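For reference, the plain-Python version would look roughly like the sketch below. It uses the same sample data; the names `rows` and `zipped` are only illustrative.

```python
# Same sample data as the DataFrame above, held as plain Python tuples.
rows = [
    ("t", ["a", "v", "x"], [11, 12, 14]),
    ("u", ["f", "g", "h"], [41, 42, 43]),
    ("v", ["p", "o", "i"], [66, 77, 88]),
]

# Pair the i-th element of B with the i-th element of C, row by row.
zipped = [(a, list(zip(b, c))) for a, b, c in rows]

for a, pairs in zipped:
    print(a, pairs)
```

This loops over every row in Python, which is why it gets slow on a large frame.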
I thought about using explode() on the lists, casting them to strings and joining the results with a separator, but that does not feel right, and I would have trouble keeping the data in column A correctly related to the exploded result.
Is there another way to achieve this result?
In Polars, you can use a struct for this.
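(
    # Explode both list columns together so B and C stay aligned row by row.
    df.explode("B", "C")
    # Pack each (B, C) pair into a single struct value.
    .select("A", pl.struct("B", "C").alias("struct"))
    # Collect the structs back into one list per value of A;
    # maintain_order=True keeps the groups in their original order.
    .group_by("A", maintain_order=True)
    .agg("struct")
)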
shape: (3, 2)
┌─────┬────────────────────────────────┐
│ A   ┆ struct                         │
│ --- ┆ ---                            │
│ str ┆ list[struct[2]]                │
╞═════╪════════════════════════════════╡
│ t   ┆ [{"a",11}, {"v",12}, {"x",14}] │
│ u   ┆ [{"f",41}, {"g",42}, {"h",43}] │
│ v   ┆ [{"p",66}, {"o",77}, {"i",88}] │
└─────┴────────────────────────────────┘
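The result is a column of type list[struct[2]]: each row holds a list of structs, and each struct pairs one element of B with the corresponding element of C.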