pythontime-seriespython-polarspolars

Fill gaps in time series data in a Polars Lazy- / Dataframe


I am in a situation where I have some time series data, potentially looking like this:

{
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
}

As you can see, the time stamp jumps from 2 to 5. For my analysis, I would like to fill in zeros for the time stamps 3, and 4.

In reality, I might have multiple gaps with varying lengths. I'd like to fill this gap for all other columns.

I'd also really like to keep my data in a LazyFrame since this is only one step in my pipeline. I don't think that .interpolate is really addressing my issue, nor is fill_null helpful here.

I managed to achieve what I want, but it looks too complex:


# Dummy, lazy data.
lf = pl.LazyFrame(
    {
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
    }
)


lf_filled = lf.join(
    pl.Series(
        name="t",
        values=pl.int_range(
            start=lf.select("t").first().collect().item(0, 0),
            end=lf.select("t").last().collect().item(0, 0) + 1,
            eager=True,
        ),
    )
    .to_frame()
    .lazy(),
    on="t",
    how="right",
).fill_null(0)

The output is correct and I am never collecting any more data than the two values needed for start and end.

This looks like there should be a better way to do this. Happy to hear other suggestions :)


Solution

  • I think your approach is sensible, there's just no need for an intermediate collect:

    lf.join(
        lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1)),
        on="t",
        how="right"
    )
    .fill_null(0)
    

    An alternate approach that might be a bit more efficient is to use an asof-join with no tolerance:

    lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1))
      .join_asof(lf, on="t", tolerance=0)
      .fill_null(0)