I have the following issue with Polars LazyFrames when using structs (pl.struct) and "apply" (a.k.a. map_elements) inside with_columns.
The idea is to apply custom logic to a group of values that belong to more than one column.
I have been able to achieve this using DataFrames; however, when switching to LazyFrames, a KeyError is raised whenever I try to access a column in the dictionary that the struct passes to the function. I'm looping through the columns one by one in order to apply different functions (mapped elsewhere to their names, but in the examples below I'll just use the same one for simplicity).
import polars as pl

my_df = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d"],
        "bar": ["w", "x", "y", "z"],
        "notes": ["1", "2", "3", "4"]
    }
)
print(my_df)
cols_to_validate = ("foo", "bar")
def validate_stuff(value, notes):
    # Any custom logic
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    else:
        return notes
for col in cols_to_validate:
    my_df = my_df.with_columns(
        pl.struct([col, "notes"]).map_elements(
            lambda row: validate_stuff(row[col], row["notes"])
        ).alias("notes")
    )
print(my_df)
my_lf = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d"],
        "bar": ["w", "x", "y", "z"],
        "notes": ["1", "2", "3", "4"]
    }
).lazy()
def validate_stuff(value, notes):
    # Any custom logic
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    else:
        return notes
cols_to_validate = ("foo", "bar")
for col in cols_to_validate:
    my_lf = my_lf.with_columns(
        pl.struct([col, "notes"]).map_elements(
            lambda row: validate_stuff(row[col], row["notes"])
        ).alias("notes")
    )
print(my_lf.collect())
(Note that executing each iteration individually does work, so it doesn't make sense to me why the for loop breaks.)
my_lf = my_lf.with_columns(
    pl.struct(["foo", "notes"]).map_elements(
        lambda row: validate_stuff(row["foo"], row["notes"])
    ).alias("notes")
)
my_lf = my_lf.with_columns(
    pl.struct(["bar", "notes"]).map_elements(
        lambda row: validate_stuff(row["bar"], row["notes"])
    ).alias("notes")
)
I have found a workaround using pl.col instead to achieve my desired result, but I would like to know whether structs can be used with LazyFrames the same way I used them with DataFrames, or whether this is actually a bug in this Polars version.
I'm using Polars 0.19.13, BTW. Thank you for your attention.
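It's more of a general "gotcha" with Python itself than a Polars bug; see the official Python FAQ entry "Why do lambdas defined in a loop with different values all return the same result?"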
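It breaks because col is looked up when the lambda runs, not when it is defined, so it ends up with the same value for every lambda. With a LazyFrame nothing runs until collect() is called; by then the loop has finished and col is "bar" in both lambdas, so the first struct (which only contains foo and notes) raises the KeyError. The DataFrame version works because each with_columns call executes immediately, while col still holds the value for that iteration.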
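To see the effect in isolation, here is a minimal plain-Python sketch with no Polars involved. The dictionaries stand in for the rows a struct would pass to the function, and the names rows, callbacks and results are made up for this illustration:

```python
# Plain-Python sketch of late binding in closures; no Polars required.
# Each dict mimics the row a two-field struct would hand to the function.
rows = {
    "foo": {"foo": "a", "notes": "1"},
    "bar": {"bar": "w", "notes": "1"},
}

callbacks = []
for col in ("foo", "bar"):
    # The lambda keeps a reference to the variable `col`, not its current value.
    callbacks.append(lambda row: row[col])

# Deferred execution, similar to collect(): the loop is over and col == "bar".
results = []
for name, fn in zip(("foo", "bar"), callbacks):
    try:
        results.append(fn(rows[name]))
    except KeyError as exc:
        results.append(f"KeyError: {exc}")

print(results)  # ["KeyError: 'bar'", 'w']
```

The first callback fails because it looks up "bar" in a row that only has "foo" and "notes", which matches the KeyError seen when the lazy query is collected.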
One approach is to pass col as a named/keyword arg with a default value, which is evaluated when the lambda is defined:
lambda row, col=col: validate_stuff(row[col], row["notes"])
shape: (4, 3)
┌─────┬─────┬───────────────────────────────────┐
│ foo ┆ bar ┆ notes                             │
│ --- ┆ --- ┆ ---                               │
│ str ┆ str ┆ str                               │
╞═════╪═════╪═══════════════════════════════════╡
│ a   ┆ w   ┆ FAILED w - PREVIOUS (1)           │
│ b   ┆ x   ┆ 2                                 │
│ c   ┆ y   ┆ FAILED y - PREVIOUS (FAILED c - … │
│ d   ┆ z   ┆ FAILED z - PREVIOUS (FAILED d - … │
└─────┴─────┴───────────────────────────────────┘
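Another way to bind the column name at definition time is functools.partial. The sketch below is plain Python so it runs without Polars; validate_row is a hypothetical helper introduced only for this example, and the row dictionary imitates what a struct would pass in:

```python
from functools import partial


def validate_stuff(value, notes):
    # Same custom logic as in the question.
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    return notes


def validate_row(row, col):
    # Hypothetical helper: `col` is an explicit parameter instead of a closure variable.
    return validate_stuff(row[col], row["notes"])


cols_to_validate = ("foo", "bar")

# partial() stores the current value of col when it is called, so each
# validator remembers its own column even if it is executed much later.
validators = [partial(validate_row, col=col) for col in cols_to_validate]

# Simulate the third row of the example data passing through both steps.
row = {"foo": "c", "bar": "y", "notes": "3"}
notes = row["notes"]
for col, validator in zip(cols_to_validate, validators):
    notes = validator({col: row[col], "notes": notes})

print(notes)  # FAILED y - PREVIOUS (FAILED c - PREVIOUS (3))
```

In the Polars loop, the same idea would presumably mean passing partial(validate_row, col=col) to map_elements in place of the lambda; treat that as an untested suggestion rather than a verified recipe for 0.19.13.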