pythonpython-polars

Polars how to field.fill_null for whole column?


This code not fill null values in column. I want to some fields to forward and backward fill nulls.

import polars as pl

df1 = pl.LazyFrame({
            "dt": [
                    "2024-08-30",
                    "2024-08-02",
                    "2024-09-03",
                    "2024-09-04"
            ],
            "df1": {
                "a": [0.1, 0.2, 0.3, 0.1],
                "b": [0, -1, 2, 1]
            },
        }).with_columns(
            pl.col("dt").str.to_datetime("%Y-%m-%d")
        )
df2 = pl.LazyFrame({
            "dt": [
                    "2024-08-29",
                    "2024-08-30",
                    "2024-09-02",
                    "2024-09-03"
            ],
            "df2":{
                "a": [100, 120, -80, 20],
                "b": [1, -2, 0, 0]
            },
        }).with_columns(
            pl.col("dt").str.to_datetime("%Y-%m-%d")
        )

df = pl.concat([df1, df2], how="align")

df = df.with_columns(
    *[
        pl.col(c).struct.with_fields(
            pl.field("a").forward_fill().backward_fill(),
            pl.field("b").forward_fill().backward_fill(),
        )
        for c in ["df1", "df2"]
    ]
)

print(df.collect())

Null values ​​appear in the output. I would expect them to be forward and backward filled, but they aren't.

┌─────────────────────┬───────────┬───────────┐
│ dt                  ┆ df1       ┆ df2       │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[μs]        ┆ struct[2] ┆ struct[2] │
╞═════════════════════╪═══════════╪═══════════╡
│ 2024-08-02 00:00:00 ┆ {0.2,-1}  ┆ null      │
│ 2024-08-29 00:00:00 ┆ null      ┆ {100,1}   │
│ 2024-08-30 00:00:00 ┆ {0.1,0}   ┆ {120,-2}  │
│ 2024-09-02 00:00:00 ┆ null      ┆ {-80,0}   │
│ 2024-09-03 00:00:00 ┆ {0.3,2}   ┆ {20,0}    │
│ 2024-09-04 00:00:00 ┆ {0.1,1}   ┆ null      │
└─────────────────────┴───────────┴───────────┘

I would expected this output:

┌─────────────────────┬───────────┬───────────┐
│ dt                  ┆ df1       ┆ df2       │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[μs]        ┆ struct[2] ┆ struct[2] │
╞═════════════════════╪═══════════╪═══════════╡
│ 2024-08-02 00:00:00 ┆ {0.2,-1}  ┆ {100,1}   │
│ 2024-08-29 00:00:00 ┆ {0.2,-1}  ┆ {100,1}   │
│ 2024-08-30 00:00:00 ┆ {0.1,0}   ┆ {120,-2}  │
│ 2024-09-02 00:00:00 ┆ {0.1,0}   ┆ {-80,0}   │
│ 2024-09-03 00:00:00 ┆ {0.3,2}   ┆ {20,0}    │
│ 2024-09-04 00:00:00 ┆ {0.1,1}   ┆ {20,0}    │
└─────────────────────┴───────────┴───────────┘

How to do that and why null values in column aren't filled? Probably it could be by pl.col(c).forward_fill().backward_fill(), but what if I want only one field to be filled?


Solution

  • The reason .struct.with_fields doesn't do what you want is because structs still have outer nullability, and with_fields does not have a special case where the outer nullability is ignored if all fields are replaced. So instead of using with_fields to update fields, completely replace the struct column with a new one which resets the outer nullability:

    out = df.with_columns(
        *[
            pl.struct(
                pl.col(c).struct.field("a").forward_fill().backward_fill(),
                pl.col(c).struct.field("b").forward_fill().backward_fill(),
            ).alias(c)
            for c in ["df1", "df2"]
        ]
    )