Tags: python, interpolation, python-polars

Interpolate based on datetimes


In pandas, I can interpolate based on a datetime index like this:

import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, np.nan, np.nan, 3],
    }
)
df1.set_index('ts').interpolate(method='index')

Outputs:

                        value
ts
2020-01-01 00:00:00  1.000000
2020-01-03 00:00:12  2.333426
2020-01-03 00:01:35  2.334066
2020-01-04 00:00:00  3.000000

Is there a similar method in polars? Say, starting with

import polars as pl
from datetime import datetime

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)
shape: (4, 2)
┌─────────────────────┬───────┐
│ ts                  ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2020-01-01 00:00:00 ┆ 1     │
│ 2020-01-03 00:00:12 ┆ null  │
│ 2020-01-03 00:01:35 ┆ null  │
│ 2020-01-04 00:00:00 ┆ 3     │
└─────────────────────┴───────┘

EDIT: I've updated the example to make it a bit more "irregular", so that upsample can't be used as a solution, and to make it clear that something more generic is needed.


Solution

  • Update: Expr.interpolate_by was added in Polars 0.20.28

    df1.with_columns(pl.col("value").interpolate_by("ts"))
    
    shape: (4, 2)
    ┌─────────────────────┬──────────┐
    │ ts                  ┆ value    │
    │ ---                 ┆ ---      │
    │ datetime[μs]        ┆ f64      │
    ╞═════════════════════╪══════════╡
    │ 2020-01-01 00:00:00 ┆ 1.0      │
    │ 2020-01-03 00:00:12 ┆ 2.333426 │
    │ 2020-01-03 00:01:35 ┆ 2.334066 │
    │ 2020-01-04 00:00:00 ┆ 3.0      │
    └─────────────────────┴──────────┘
    

    Not sure how useful this is, but it looks like pandas calls np.interp() to do this:

    import numpy as np
    import polars as pl
    
    invalid = pl.when(pl.col('value').is_null()).then(pl.col('ts')).alias('invalid')
    valid   = pl.when(pl.col('value').is_not_null()).then(pl.col('ts')).alias('valid')
    values  = pl.when(pl.col('value').is_not_null()).then(pl.col('value')).alias('values')
    
    df1.select(
       pl.struct(invalid, valid, values)
         .map_batches(lambda args:  # Expr.map in older Polars versions
             pl.Series(np.interp(
                args.struct.field('invalid').drop_nulls().dt.timestamp().to_numpy(),
                args.struct.field('valid').drop_nulls().dt.timestamp().to_numpy(),
                args.struct.field('values').drop_nulls().to_numpy()
             ))
         )
    )
    
    shape: (2, 1)
    ┌──────────┐
    │ invalid  │
    │ ---      │
    │ f64      │
    ╞══════════╡
    │ 2.333426 │
    │ 2.334066 │
    └──────────┘
    

    Although there does seem to be a lot of other stuff going on.