I am in a situation where I have some time series data, potentially looking like this:
{
"t": [1, 2, 5, 6, 7],
"y": [1, 1, 1, 1, 1],
}
As you can see, the time stamp jumps from 2
to 5
. For my analysis, I would like to fill in zeros for the time stamps 3
, and 4
.
In reality, I might have multiple gaps with varying lengths. I'd like to fill this gap for all other columns.
I'd also really like to keep my data in a LazyFrame
since this is only one step in my pipeline. I don't think that .interpolate
is really addressing my issue, nor is fill_null
helpful here.
I managed to achieve what I want, but it looks too complex:
# Dummy, lazy data.
lf = pl.LazyFrame(
{
"t": [1, 2, 5, 6, 7],
"y": [1, 1, 1, 1, 1],
}
)
lf_filled = lf.join(
pl.Series(
name="t",
values=pl.int_range(
start=lf.select("t").first().collect().item(0, 0),
end=lf.select("t").last().collect().item(0, 0) + 1,
eager=True,
),
)
.to_frame()
.lazy(),
on="t",
how="right",
).fill_null(0)
The output is correct and I am never collect
ing any more data than the two values needed for start
and end
.
This looks like there should be a better way to do this. Happy to hear other suggestions :)
I think your approach is sensible, there's just no need for an intermediate collect:
lf.join(
lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1)),
on="t",
how="right"
)
.fill_null(0)
An alternate approach that might be a bit more efficient is to use an asof-join with no tolerance:
lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1))
.join_asof(lf, on="t", tolerance=0)
.fill_null(0)