pythonpython-polars

Filter sequences of same values in a particular column of Polars df and leave only the first occurence


I have a very large Polars LazyFrame (if collected it would be tens of millions records). I have information recorded for a specific piece of equipment taken every second and some location flag that is either 1 or 0.

When I have sequences where the location flag is equal to 1, I need to filter out and only leave the latest one but this must be done per equipment id.

I cannot use UDFs since this is a performance-critical piece of code and should ideally stay withing Polars expression syntax.

For a simple case where I have only a single equipment id, I can do it relatively easily by shifting the time data 1 row and filter out the records where there's a big gap:

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
    }
)

df_test.filter(pl.col('loc') == 1).with_columns((pl.col('time') - pl.col('time').shift(1)).alias('time_diff')).filter(pl.col('time_diff') > 1)

This gives me sort of a correct result, but the problem is that out of 3 sequences of 1s, I only keep 2, the first one gets lost. I can probably live with that, but ideally want to not lose any data.

In a standard case, there will be multiple equipment types and once again, the same approach works but again, for both types, I only keep 2 out of 3 sequences.

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    }
)

Is there a better way to do this?


Solution

  • If I've interpreted correctly, for each equipment you want to keep only the first row of each continuous sequence of loc = 1.

    Fixing your solution

    In that case, the only changes you need to make to your solution are:

    import polars as pl
    
    df_test = pl.DataFrame(
        {
            "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
            "equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            "loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
        }
    )
    
    
    res = (
        df_test.filter(pl.col("loc") == 1)
        #.sort("time") # uncomment if we can't assume that the df is sorted by time.
        .with_columns(
            (pl.col("time") - pl.col("time").shift(1, fill_value=-1))
            .over("equipment")
            .alias("time_diff")
        )
        .filter(pl.col("time_diff") > 1)
    )
    

    Output:

    >>> res
    
    shape: (3, 4)
    ┌──────┬───────────┬─────┬───────────┐
    │ time ┆ equipment ┆ loc ┆ time_diff │
    │ ---  ┆ ---       ┆ --- ┆ ---       │
    │ i64  ┆ i64       ┆ i64 ┆ i64       │
    ╞══════╪═══════════╪═════╪═══════════╡
    │ 3    ┆ 1         ┆ 1   ┆ 4         │
    │ 9    ┆ 1         ┆ 1   ┆ 4         │
    │ 12   ┆ 1         ┆ 1   ┆ 2         │
    └──────┴───────────┴─────┴───────────┘
    

    Alternative solution

    That said, here is another similar solution which I think is clearer:

    res = (
        df_test
        #.sort("time") # uncomment if we can't assume that the df is sorted by time.
        .filter(
            ((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
            .over("equipment")
        )
    )
    

    Note that in this case the fill_value has to be any value other than 1.