pythongroup-bypython-polars

Python Polars Group By and Indexing, for Time Series Data


I have been developing some codes in pandas, yet I found the executions in pandas are a bit too slow. I then stumbled upon polars, which claims to be blazingly fast and much faster than pandas. I have thus been trying to transfer my existing codes to polars.

I am currently working on some stock data, where I have to find out the minimum price of the previous n transaction days. Note that here the transaction dates are discontinuous, since there are Saturdays, Sundays and public holidays. The problem is that polars do not know how to find the previous transaction days, instead polars find the actual previous days.

Let's say now we have a DataFrame, and n = 3:

df = pl.from_repr("""
┌────────────┬───────┐
│ Date       ┆ Price │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2023-01-01 ┆ 1     │
│ 2023-01-02 ┆ 2     │
│ 2023-01-03 ┆ 3     │
│ 2023-01-05 ┆ 4     │
│ 2023-01-10 ┆ 5     │
│ 2023-01-11 ┆ 6     │
│ 2023-01-12 ┆ 7     │
└────────────┴───────┘
""")

The expected output I get from the following codes using pandas:

df.to_pandas().Price.rolling(3).min()
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
Name: Price, dtype: float64

However, when using polars, since there are no indices, the group by function works differently. It sets the window based on the actual physical dates(continuous) instead of the transaction dates(discontinuous, pandas can handle that with indices), which yields the following results:

┌────────────┬───────┐
│ Date       ┆ Price │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2023-01-01 ┆ 1     │
│ 2023-01-02 ┆ 1     │
│ 2023-01-03 ┆ 1     │
│ 2023-01-05 ┆ 3     │
│ 2023-01-10 ┆ 5     │
│ 2023-01-11 ┆ 5     │
│ 2023-01-12 ┆ 5     │
└────────────┴───────┘

with the following code:

df.rolling("Date", period="3d").agg(pl.col("Price").min())

Is there a way to get the desired result, i.e. same as the one in pandas, with the polars library? (Should I manually create an index column???) Or is switching over to polars a bad idea for data like this, since polars advocates for the idea of no indices?


Solution

  • Polars has dedicated .rolling_* expressions, e.g. .rolling_min()

    df.with_columns(roll = pl.col("Price").rolling_min(3))
    
    shape: (7, 3)
    ┌────────────┬───────┬──────┐
    │ Date       ┆ Price ┆ roll │
    │ ---        ┆ ---   ┆ ---  │
    │ date       ┆ i64   ┆ i64  │
    ╞════════════╪═══════╪══════╡
    │ 2023-01-01 ┆ 1     ┆ null │
    │ 2023-01-02 ┆ 2     ┆ null │
    │ 2023-01-03 ┆ 3     ┆ 1    │
    │ 2023-01-05 ┆ 4     ┆ 2    │
    │ 2023-01-10 ┆ 5     ┆ 3    │
    │ 2023-01-11 ┆ 6     ┆ 4    │
    │ 2023-01-12 ┆ 7     ┆ 5    │
    └────────────┴───────┴──────┘
    

    For the frame level rolling - you can add an index column as you suggested.

    It's a bit different as we have to use the "format language" to specify the period - instead of a numeric window size.

    (df.with_row_index()
       .rolling("index", period="3i")
       .agg(pl.col("Price").min())
    )
    
    shape: (7, 2)
    ┌───────┬───────┐
    │ index ┆ Price │
    │ ---   ┆ ---   │
    │ u32   ┆ i64   │
    ╞═══════╪═══════╡
    │ 0     ┆ 1     │
    │ 1     ┆ 1     │
    │ 2     ┆ 1     │
    │ 3     ┆ 2     │
    │ 4     ┆ 3     │
    │ 5     ┆ 4     │
    │ 6     ┆ 5     │
    └───────┴───────┘
    

    You would also need to combine the result with the original frame, generally with a .join()