I have this data:
┌────────────┬──────────────────────────────────────┐
│ col1       ┆ col2                                 │
│ ---        ┆ ---                                  │
│ list[str]  ┆ list[list[str]]                      │
╞════════════╪══════════════════════════════════════╡
│ ["a"]      ┆ [["a"]]                              │
│ ["b", "c"] ┆ [["b", "c"], ["b", "c"], ["b", "c"]] │
│ ["d"]      ┆ [["d"]]                              │
└────────────┴──────────────────────────────────────┘
I want to have all b's and all c's in the same list in row 2, but as you can see, the associations of b to b and c to c are not maintained there. With pandas I used:
import pandas as pd
pddf = pd.DataFrame({"col1": [["a"], ["b", "c"], ["d"]],
                     "col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]]})
pddf["col2"] = pddf["col2"].apply(lambda listed: pd.DataFrame(listed).transpose().values.tolist())
print(pddf)
#      col1                    col2
# 0     [a]                   [[a]]
# 1  [b, c]  [[b, b, b], [c, c, c]]
# 2     [d]                   [[d]]
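(To make the transpose step concrete, here is what the inner pd.DataFrame(listed).transpose() call does for the row-2 value on its own:)

import pandas as pd

# the inner lambda applied to the second row's value: the three [b, c] pairs
# become one list of all b's and one list of all c's
pd.DataFrame([["b", "c"], ["b", "c"], ["b", "c"]]).transpose().values.tolist()
# [['b', 'b', 'b'], ['c', 'c', 'c']]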
This is the desired result. I am trying to do the same with polars by replacing pddf.transpose().values.tolist() with pldf.transpose().to_numpy().tolist(), but I always get a TypeError: not yet implemented: Nested object types. Are there any workarounds? Here is the complete polars code:
import polars as pl
pldf = pl.DataFrame({"col1": [["a"], ["b", "c"], ["d"]],
"col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]]})
pldf = pldf.with_columns(pl.col("col2").map_elements(lambda listed: pl.DataFrame(listed).transpose().to_numpy().tolist()))
print(pldf)
# TypeError: not yet implemented: Nested object types
# Hint: Try setting `strict=False` to allow passing data with mixed types.
Where would I need to apply the mentioned strict=False?
On a simpler df, pddf.transpose().values.tolist() and pldf.transpose().to_numpy().tolist() produce the same result:
import pandas as pd
import polars as pl
pd.DataFrame(
    {"col1": ["a", "b", "c"],
     "col2": ["d", "e", "f"]}
).transpose().values.tolist() == pl.DataFrame(
    {"col1": ["a", "b", "c"],
     "col2": ["d", "e", "f"]}
).transpose().to_numpy().tolist()
# True
Please keep as close as possible to the code above; I know using .apply() or .map_elements() is not ideal, but this is part of a far larger project and I don't want to break anything else :).
(EDIT: I simplified the code a little since the second lambda wasn't really necessary for the question.)
As you've mentioned performance in the comments, this may also be of interest.
Building upon @Herick's answer: when you add an index and explode, you are guaranteed to have sorted data. With sorted data, you can add a "row number per group" using rle(), which can be significantly faster.
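To make the rle() idea concrete, here is a minimal sketch on a toy, pre-sorted column (the demo frame and names below are purely illustrative, not from the question): rle() collapses consecutive duplicates into run lengths, and int_ranges(...).flatten() expands each length into a 0, 1, ..., n-1 counter per run, i.e. a row number per group without a group_by.

import polars as pl

# hypothetical demo frame with an already-sorted "index" column
demo = pl.DataFrame({"index": [0, 1, 1, 2, 2, 2]})
print(
    demo.with_columns(
        pl.int_ranges(pl.col("index").rle().struct.field("len"))
          .flatten()
          .alias("list_idx")
    )
)
# list_idx: 0, 0, 1, 0, 1, 2

Applying the same pattern to the df from the question: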
(df.with_row_index("row")
   .select("row", "col2")
   .explode("col2")
   .with_row_index()
   .explode("col2")
   .with_columns(
       pl.int_ranges(pl.col("index").rle().struct.field("len"))
         .flatten()
         .alias("list_idx")
   )
)
shape: (8, 4)
┌───────┬─────┬──────┬──────────┐
│ index ┆ row ┆ col2 ┆ list_idx │
│ ---   ┆ --- ┆ ---  ┆ ---      │
│ u32   ┆ u32 ┆ str  ┆ i64      │
╞═══════╪═════╪══════╪══════════╡
│ 0     ┆ 0   ┆ a    ┆ 0        │
│ 1     ┆ 1   ┆ b    ┆ 0        │
│ 1     ┆ 1   ┆ c    ┆ 1        │
│ 2     ┆ 1   ┆ b    ┆ 0        │
│ 2     ┆ 1   ┆ c    ┆ 1        │
│ 3     ┆ 1   ┆ b    ┆ 0        │
│ 3     ┆ 1   ┆ c    ┆ 1        │
│ 4     ┆ 2   ┆ d    ┆ 0        │
└───────┴─────┴──────┴──────────┘
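From here, the transposed lists can be rebuilt by aggregating twice: once per (list_idx, row) to collect all b's together and all c's together, and once per row to collect those inner lists back into a single cell. A minimal sketch, assuming the frame shown above is bound to a variable named exploded (the name is just for illustration):

# assumes the frame shown above is bound to `exploded`
(exploded
 .group_by("list_idx", "row", maintain_order=True)  # all b's together, all c's together
 .agg("col2")
 .group_by("row", maintain_order=True)              # back to one list-of-lists per row
 .agg("col2")
)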
.map_batches() can be used to wrap the frame exploding/group_by logic into an expression context.
df = pl.DataFrame({
    "col1": [["a"], ["b", "c"], ["d"]],
    "col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]],
    "col3": [[[1]], [[2, 3], [4, 5], [6, 7]], [[8]]]
})
df.with_columns(
    pl.col("col2", "col3").map_batches(lambda s:
        s.to_frame()
         .with_row_index("row")
         .explode(pl.last())
         .with_row_index()
         .explode(pl.last())
         .with_columns(
             pl.int_ranges(pl.col("index").rle().struct.field("len"))
               .flatten()
               .alias("index")
         )
         .group_by("index", "row", maintain_order=True) # order required for "horizontal concat"
         .agg(pl.last())
         .group_by("row", maintain_order=True)          # order required for "horizontal concat"
         .agg(pl.last())
         .select(pl.last())
         .to_series() # must return a series
    )
)
shape: (3, 3)
┌────────────┬─────────────────────────────────┬────────────────────────┐
│ col1       ┆ col2                            ┆ col3                   │
│ ---        ┆ ---                             ┆ ---                    │
│ list[str]  ┆ list[list[str]]                 ┆ list[list[i64]]        │
╞════════════╪═════════════════════════════════╪════════════════════════╡
│ ["a"]      ┆ [["a"]]                         ┆ [[1]]                  │
│ ["b", "c"] ┆ [["b", "b", "b"], ["c", "c", "… ┆ [[2, 4, 6], [3, 5, 7]] │
│ ["d"]      ┆ [["d"]]                         ┆ [[8]]                  │
└────────────┴─────────────────────────────────┴────────────────────────┘
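As a quick sanity check (assuming the frame produced above is bound to a variable, say out), the reshaped col2 matches the pandas result from the question:

# assumes the result of the with_columns(...) call above is bound to `out`
assert out["col2"].to_list() == [
    [["a"]],
    [["b", "b", "b"], ["c", "c", "c"]],
    [["d"]],
]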
As a basic comparison, if I sample 5_000_000 rows, I get:
| name         | time   |
|--------------|--------|
| map_elements | 61.19s |
| map_batches  | 7.14s  |