Tags: python, dataframe, group-by, aggregate, python-polars

Python-Polars: How to fill missing values with the average of the two surrounding values?


I have a Polars DataFrame with data from several weather stations. The end goal is a time series analysis, but some of the temperature values are missing. So that the gaps don't distort the model, I want to fill each missing value with the average of the values from the day before and the day after. This should be done per weather station (via group_by() or similar), so that one station's readings are never used to fill another station's missing day. If only one neighboring value exists (at the beginning or end of the time frame), I want to use that neighbor's value.

For example,

Date Weather_Station Temp
2024-01-24 A 74
2024-01-25 A None
2024-01-26 A 78
2024-01-24 B None
2024-01-25 B 65
2024-01-26 B 78

The desired solution is:

Date Weather_Station Temp
2024-01-24 A 74
2024-01-25 A 76
2024-01-26 A 78
2024-01-24 B 65
2024-01-25 B 65
2024-01-26 B 78

Here 76 is the average of 74 and 78, the values on the two surrounding days. For station B on 2024-01-24 the value is 65 because there is no prior date, so the next day's value is used. The same applies in reverse when there is no later date: if station B had None on 2024-01-27, I would want 78.
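
To make the rule concrete, here is a minimal plain-Python sketch for a single station's values, already in date order. `fill_gaps` is a hypothetical helper name used only for illustration; it is not part of Polars, and it assumes a run of several missing days should use the nearest known value on each side.

```python
def fill_gaps(values):
    """Fill None entries from the nearest known neighbours (hypothetical helper)."""
    filled = []
    for i, value in enumerate(values):
        if value is not None:
            filled.append(float(value))
            continue
        # Nearest known value before and after position i, if any.
        before = next((v for v in reversed(values[:i]) if v is not None), None)
        after = next((v for v in values[i + 1:] if v is not None), None)
        known = [v for v in (before, after) if v is not None]
        # Two neighbours: their mean. One neighbour: copy it. None at all: keep the gap.
        filled.append(sum(known) / len(known) if known else None)
    return filled


print(fill_gaps([74, None, 78]))  # station A -> [74.0, 76.0, 78.0]
print(fill_gaps([None, 65, 78]))  # station B -> [65.0, 65.0, 78.0]
```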

Here is the data for an example dataframe:

data = {
    'Date': ['2024-01-24', '2024-01-25', '2024-01-26', '2024-01-24', '2024-01-25', '2024-01-26'],
    'Weather_Station': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Temp': [74, None, 78, None, 65, 78]
}

Solution

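  • You could interpolate, then backward_fill/forward_fill to cover leading and trailing nulls, with everything evaluated per station via over:

    import polars as pl

    (pl.DataFrame(data)
       # interpolate fills interior nulls; the two fills handle the edges
       .with_columns(pl.col('Temp').interpolate()
                       .backward_fill().forward_fill()
                       .over('Weather_Station')
                    )
    )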
    

    Output:

    shape: (6, 3)
    ┌────────────┬─────────────────┬──────┐
    │ Date       ┆ Weather_Station ┆ Temp │
    │ ---        ┆ ---             ┆ ---  │
    │ str        ┆ str             ┆ f64  │
    ╞════════════╪═════════════════╪══════╡
    │ 2024-01-24 ┆ A               ┆ 74.0 │
    │ 2024-01-25 ┆ A               ┆ 76.0 │
    │ 2024-01-26 ┆ A               ┆ 78.0 │
    │ 2024-01-24 ┆ B               ┆ 65.0 │
    │ 2024-01-25 ┆ B               ┆ 65.0 │
    │ 2024-01-26 ┆ B               ┆ 78.0 │
    └────────────┴─────────────────┴──────┘
    

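    An alternative, if you have stretches of consecutive nulls and want each of them filled with the average of the nearest known values on either side (rather than linearly interpolated), is to average the forward and backward fills: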

    (pl.DataFrame(data)
       .with_columns((pl.col('Temp').backward_fill()
                     +pl.col('Temp').forward_fill()
                     )
                     .backward_fill().forward_fill()
                     .over('Weather_Station')/2.
                    )
    )
    

    Output (using the alternative input listed below):

    shape: (7, 3)
    ┌────────────┬─────────────────┬──────┐
    │ Date       ┆ Weather_Station ┆ Temp │
    │ ---        ┆ ---             ┆ ---  │
    │ str        ┆ str             ┆ f64  │
    ╞════════════╪═════════════════╪══════╡
    │ 2024-01-24 ┆ A               ┆ 74.0 │
    │ 2024-01-25 ┆ A               ┆ 76.0 │
    │ 2024-01-26 ┆ A               ┆ 76.0 │
    │ 2024-01-27 ┆ A               ┆ 78.0 │
    │ 2024-01-24 ┆ B               ┆ 65.0 │
    │ 2024-01-25 ┆ B               ┆ 65.0 │
    │ 2024-01-26 ┆ B               ┆ 78.0 │
    └────────────┴─────────────────┴──────┘
    

    Alternative input used for the output above:

    data = {
        'Date': ['2024-01-24', '2024-01-25', '2024-01-26', '2024-01-27', '2024-01-24', '2024-01-25', '2024-01-26'],
        'Weather_Station': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
        'Temp': [74, None, None, 78, None, 65, 78]
    }