I have a very large Polars LazyFrame (if collected it would be tens of millions records). I have information recorded for a specific piece of equipment taken every second and some location flag that is either 1 or 0.
When I have sequences where the location flag is equal to 1, I need to filter out and only leave the latest one but this must be done per equipment id.
I cannot use UDFs since this is a performance-critical piece of code and should ideally stay withing Polars expression syntax.
For a simple case where I have only a single equipment id, I can do it relatively easily by shifting the time data 1 row and filter out the records where there's a big gap:
df_test = pl.DataFrame(
{
'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
}
)
df_test.filter(pl.col('loc') == 1).with_columns((pl.col('time') - pl.col('time').shift(1)).alias('time_diff')).filter(pl.col('time_diff') > 1)
This gives me sort of a correct result, but the problem is that out of 3 sequences of 1s, I only keep 2, the first one gets lost. I can probably live with that, but ideally want to not lose any data.
In a standard case, there will be multiple equipment types and once again, the same approach works but again, for both types, I only keep 2 out of 3 sequences.
df_test = pl.DataFrame(
{
'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,],
'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
}
)
Is there a better way to do this?
If I've interpreted correctly, for each equipment you want to keep only the first row of each continuous sequence of loc = 1
.
In that case, the only changes you need to make to your solution are:
Add the fill_value
to pl.col(“time”).shift(1)
to ensure that the first row with loc = 1
is always selected. The choice of fill_value
must ensure that the first time_diff > 1
, e.g. fill_value = negative number
.
fill_value
, the first row of the shift
is always null, resulting in a null time_diff
, so it is not selected by the time_diff > 1
filter.pl.col(“time_diff”) > 1 | pl.col(“time_diff”).is_null()
Apply the logic to each equipment by making it a window expression with .over("equipment")
.
import polars as pl
df_test = pl.DataFrame(
{
"time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
"equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
}
)
res = (
df_test.filter(pl.col("loc") == 1)
#.sort("time") # uncomment if we can't assume that the df is sorted by time.
.with_columns(
(pl.col("time") - pl.col("time").shift(1, fill_value=-1))
.over("equipment")
.alias("time_diff")
)
.filter(pl.col("time_diff") > 1)
)
Output:
>>> res
shape: (3, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═══════════╪═════╪═══════════╡
│ 3 ┆ 1 ┆ 1 ┆ 4 │
│ 9 ┆ 1 ┆ 1 ┆ 4 │
│ 12 ┆ 1 ┆ 1 ┆ 2 │
└──────┴───────────┴─────┴───────────┘
That said, here is another similar solution which I think is clearer:
res = (
df_test
#.sort("time") # uncomment if we can't assume that the df is sorted by time.
.filter(
((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
.over("equipment")
)
)
Note that in this case the fill_value
has to be any value other than 1.