
ewm_mean ignore nan


Given a series with possible NaN values, how does one tell Polars to ignore them? That is, treat each NaN as if it weren't in the DataFrame, and repeat the mean from the cell before the NaN.

To be specific, this is the default behavior of pandas' ewm and rolling families of functions.


Solution

  • I think what you want is the fill_nan method. There is currently a request to expand fill_nan so that it has the same handy options as fill_null (i.e., 'forward', 'backward', 'mean', etc.).

    But we can get the same results with a slight workaround. Let's suppose our data looks like this:

    import numpy as np
    import polars as pl

    # Two groups; NaN marks the missing observations
    df = pl.DataFrame(
        {
            "group": (["a"] * 3) + (["b"] * 4),
            "obs": [1, 2, 3, 1, 2, 3, 4],
            "val": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0],
        }
    )
    df
    
    shape: (7, 3)
    ┌───────┬─────┬─────┐
    │ group ┆ obs ┆ val │
    │ ---   ┆ --- ┆ --- │
    │ str   ┆ i64 ┆ f64 │
    ╞═══════╪═════╪═════╡
    │ a     ┆ 1   ┆ 1.0 │
    │ a     ┆ 2   ┆ NaN │
    │ a     ┆ 3   ┆ 3.0 │
    │ b     ┆ 1   ┆ 4.0 │
    │ b     ┆ 2   ┆ NaN │
    │ b     ┆ 3   ┆ NaN │
    │ b     ┆ 4   ┆ 7.0 │
    └───────┴─────┴─────┘
    

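    As long as your column does not already contain null values, we can convert the NaN values to null and then use the fill_null expression. (If it does contain nulls, they become indistinguishable from the converted NaN values and get filled as well.)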

    (
        df
        .with_columns(pl.col("val").fill_nan(None))
        .with_columns(
            pl.col("val").fill_null(strategy="forward").over("group").alias("fill_forward"),
            pl.col("val").fill_null(strategy="backward").over("group").alias("fill_back"),
            pl.col("val").fill_null(strategy="mean").over("group").alias("fill_mean"),
            pl.col("val").fill_null(strategy="zero").over("group").alias("fill_zero"),
            pl.col("val").fill_null(strategy="one").over("group").alias("fill_one"),
        )
    )
    
    shape: (7, 8)
    ┌───────┬─────┬──────┬──────────────┬───────────┬───────────┬───────────┬──────────┐
    │ group ┆ obs ┆ val  ┆ fill_forward ┆ fill_back ┆ fill_mean ┆ fill_zero ┆ fill_one │
    │ ---   ┆ --- ┆ ---  ┆ ---          ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
    │ str   ┆ i64 ┆ f64  ┆ f64          ┆ f64       ┆ f64       ┆ f64       ┆ f64      │
    ╞═══════╪═════╪══════╪══════════════╪═══════════╪═══════════╪═══════════╪══════════╡
    │ a     ┆ 1   ┆ 1.0  ┆ 1.0          ┆ 1.0       ┆ 1.0       ┆ 1.0       ┆ 1.0      │
    │ a     ┆ 2   ┆ null ┆ 1.0          ┆ 3.0       ┆ 2.0       ┆ 0.0       ┆ 1.0      │
    │ a     ┆ 3   ┆ 3.0  ┆ 3.0          ┆ 3.0       ┆ 3.0       ┆ 3.0       ┆ 3.0      │
    │ b     ┆ 1   ┆ 4.0  ┆ 4.0          ┆ 4.0       ┆ 4.0       ┆ 4.0       ┆ 4.0      │
    │ b     ┆ 2   ┆ null ┆ 4.0          ┆ 7.0       ┆ 5.5       ┆ 0.0       ┆ 1.0      │
    │ b     ┆ 3   ┆ null ┆ 4.0          ┆ 7.0       ┆ 5.5       ┆ 0.0       ┆ 1.0      │
    │ b     ┆ 4   ┆ 7.0  ┆ 7.0          ┆ 7.0       ┆ 7.0       ┆ 7.0       ┆ 7.0      │
    └───────┴─────┴──────┴──────────────┴───────────┴───────────┴───────────┴──────────┘
    

    Other methods that operate on null values, such as interpolate, work the same way.

    (df
     .with_columns(pl.col("val").fill_nan(None))
     .with_columns(pl.col("val").interpolate().over("group").alias("inter"))
    )
    
    shape: (7, 4)
    ┌───────┬─────┬──────┬───────┐
    │ group ┆ obs ┆ val  ┆ inter │
    │ ---   ┆ --- ┆ ---  ┆ ---   │
    │ str   ┆ i64 ┆ f64  ┆ f64   │
    ╞═══════╪═════╪══════╪═══════╡
    │ a     ┆ 1   ┆ 1.0  ┆ 1.0   │
    │ a     ┆ 2   ┆ null ┆ 2.0   │
    │ a     ┆ 3   ┆ 3.0  ┆ 3.0   │
    │ b     ┆ 1   ┆ 4.0  ┆ 4.0   │
    │ b     ┆ 2   ┆ null ┆ 5.0   │
    │ b     ┆ 3   ┆ null ┆ 6.0   │
    │ b     ┆ 4   ┆ 7.0  ┆ 7.0   │
    └───────┴─────┴──────┴───────┘