This code not fill null values in column. I want to some fields to forward and backward fill nulls.
import polars as pl
df1 = pl.LazyFrame({
"dt": [
"2024-08-30",
"2024-08-02",
"2024-09-03",
"2024-09-04"
],
"df1": {
"a": [0.1, 0.2, 0.3, 0.1],
"b": [0, -1, 2, 1]
},
}).with_columns(
pl.col("dt").str.to_datetime("%Y-%m-%d")
)
df2 = pl.LazyFrame({
"dt": [
"2024-08-29",
"2024-08-30",
"2024-09-02",
"2024-09-03"
],
"df2":{
"a": [100, 120, -80, 20],
"b": [1, -2, 0, 0]
},
}).with_columns(
pl.col("dt").str.to_datetime("%Y-%m-%d")
)
df = pl.concat([df1, df2], how="align")
df = df.with_columns(
*[
pl.col(c).struct.with_fields(
pl.field("a").forward_fill().backward_fill(),
pl.field("b").forward_fill().backward_fill(),
)
for c in ["df1", "df2"]
]
)
print(df.collect())
Null values appear in the output. I would expect them to be forward and backward filled, but they aren't.
┌─────────────────────┬───────────┬───────────┐
│ dt ┆ df1 ┆ df2 │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ struct[2] ┆ struct[2] │
╞═════════════════════╪═══════════╪═══════════╡
│ 2024-08-02 00:00:00 ┆ {0.2,-1} ┆ null │
│ 2024-08-29 00:00:00 ┆ null ┆ {100,1} │
│ 2024-08-30 00:00:00 ┆ {0.1,0} ┆ {120,-2} │
│ 2024-09-02 00:00:00 ┆ null ┆ {-80,0} │
│ 2024-09-03 00:00:00 ┆ {0.3,2} ┆ {20,0} │
│ 2024-09-04 00:00:00 ┆ {0.1,1} ┆ null │
└─────────────────────┴───────────┴───────────┘
I would expected this output:
┌─────────────────────┬───────────┬───────────┐
│ dt ┆ df1 ┆ df2 │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ struct[2] ┆ struct[2] │
╞═════════════════════╪═══════════╪═══════════╡
│ 2024-08-02 00:00:00 ┆ {0.2,-1} ┆ {100,1} │
│ 2024-08-29 00:00:00 ┆ {0.2,-1} ┆ {100,1} │
│ 2024-08-30 00:00:00 ┆ {0.1,0} ┆ {120,-2} │
│ 2024-09-02 00:00:00 ┆ {0.1,0} ┆ {-80,0} │
│ 2024-09-03 00:00:00 ┆ {0.3,2} ┆ {20,0} │
│ 2024-09-04 00:00:00 ┆ {0.1,1} ┆ {20,0} │
└─────────────────────┴───────────┴───────────┘
How to do that and why null values in column aren't filled? Probably it could be by pl.col(c).forward_fill().backward_fill()
, but what if I want only one field to be filled?
The reason .struct.with_fields
doesn't do what you want is because structs still have outer nullability, and with_fields
does not have a special case where the outer nullability is ignored if all fields are replaced. So instead of using with_fields
to update fields, completely replace the struct column with a new one which resets the outer nullability:
out = df.with_columns(
*[
pl.struct(
pl.col(c).struct.field("a").forward_fill().backward_fill(),
pl.col(c).struct.field("b").forward_fill().backward_fill(),
).alias(c)
for c in ["df1", "df2"]
]
)