pythondataframemultiple-columnsfillnapython-polars

fill_null() values with other columns data


i have a question regarding fill null values, is it possible to backfill data from other columns as in pandas?

Working pandas example on how to backfill data :

df.loc[:, ['A', 'B', 'C']] = df[['A', 'B', 'C']].fillna(
    value={
        'A': df['D'],
        'B': df['D'],
        'C': df['D'],
    }
) 

Polars example as i tried to backfill data from column D to column A if the value is null, but it's not working:

df = pl.DataFrame(
        {"date": ["2020-01-01 00:00:00", "2020-01-07 00:00:00", "2020-01-14 00:00:00"],
        "A": [3, 4, 7],
        "B": [3, 4, 5],
        "C": [0, 1, 2],
        "D": [1, 2, 5]})
df = df.with_column(pl.col("date").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
date_range = df.select(pl.arange(df["date"][0], df["date"]
                        [-1] + 1, step=1000*60*60*24).cast(pl.Datetime).alias("date"))
df = (date_range.join(df, on="date", how="left"))
df['D'] = df['D'].fill_null("forward")
print(df)
df[:, ['A']] = df[['A']].fill_null({
    'A': df['D']
    }
)
print(df)

Kind regards, Tom


Solution

  • In the example you show and the accomponied pandas code. A fillna doesn't fill any null values, because the other columns are also NaN. So I am going to assume that you want to fill missing values by values of another column that doesn't have missing values, but correct me if I am wrong.

    import polars as pl
    from polars import col
    
    df = pl.DataFrame({
        "a": [0, 1, 2, 3, None, 5, 6, None, 8, None],
        "b": range(10),
    })
    
    out = df.with_columns([
        pl.when(col("a").is_null()).then(col("b")).otherwise(col("a")).alias("a"),
        pl.when(col("a").is_null()).then(col("b").shift(1)).otherwise(col("a")).alias("a_filled_lag"),
        pl.when(col("a").is_null()).then(col("b").mean()).otherwise(col("a")).alias("a_filled_mean")
    
    ])
    
    print(out)
    

    In the example above, we use a when -> then -> othwerwise expression to fill missing values by another columns values. Think about if else expressions but then on whole columns.

    I gave 3 examples, one where we fill on that value, one where we fill with the lagged value, and one where we fill with the mean value of the other column.

    The snippet above produces:

    shape: (10, 4)
    ┌─────┬─────┬──────────────┬───────────────┐
    │ a   ┆ b   ┆ a_filled_lag ┆ a_filled_mean │
    │ --- ┆ --- ┆ ---          ┆ ---           │
    │ i64 ┆ i64 ┆ i64          ┆ f64           │
    ╞═════╪═════╪══════════════╪═══════════════╡
    │ 0   ┆ 0   ┆ 0            ┆ 0.0           │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 1   ┆ 1   ┆ 1            ┆ 1             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 2   ┆ 2   ┆ 2            ┆ 2             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 3   ┆ 3   ┆ 3            ┆ 3             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ ... ┆ ... ┆ ...          ┆ ...           │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 5   ┆ 5   ┆ 5            ┆ 5             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 6   ┆ 6   ┆ 6            ┆ 6             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 7   ┆ 7   ┆ 6            ┆ 4.5           │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 8   ┆ 8   ┆ 8            ┆ 8             │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ 9   ┆ 9   ┆ 8            ┆ 4.5           │
    └─────┴─────┴──────────────┴───────────────┘