python pandas dataframe rolling-computation windowing

Rolling window on timestamped DataFrame with a custom step?


I have been fiddling about with pandas.DataFrame.rolling for some time now and I haven't been able to achieve the result that I am looking for, so before I write a custom windowing function I figured I would ask if I'm missing something.

I have PostgreSQL data with a composite index of (time, node) that has been read into a pandas.DataFrame, where time is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at 2022-12-26 00:00:00 and ending at 2022-12-27 23:00:00, and then perform operations on each window to return a new, resultant DataFrame. The window should then move forward by an entire calendar date, which is where I am failing.

| time                  | node  | value  |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00   | 123   | low    |
| 2022-12-26 01:00:00   | 123   | med    |
| 2022-12-26 02:00:00   | 123   | low    |
| 2022-12-26 03:00:00   | 123   | high   |
| ...                   | ...   | ...    |
| 2022-12-26 00:00:00   | 999   | low    |
| 2022-12-26 01:00:00   | 999   | low    |
| 2022-12-26 02:00:00   | 999   | low    |
| 2022-12-26 03:00:00   | 999   | med    |
| ...                   | ...   | ...    |
| 2022-12-27 00:00:00   | 123   | low    |
| 2022-12-27 01:00:00   | 123   | med    |
| 2022-12-27 02:00:00   | 123   | low    |
| 2022-12-27 03:00:00   | 123   | high   |

When I use something akin to df.rolling(window=pd.Timedelta('2days')), the windows move forward hour-by-hour, as opposed to beginning on the next calendar date.

I've played around with min_periods, but it doesn't seem to work with my data, nor would it be acceptable in the long run, because the number of expected observations per window is not fixed. The step parameter also appears to be useless in this case, because I am using an offset rather than an integer for the window.
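
To make the symptom concrete, here is a minimal sketch on made-up hourly data (the `toy` frame and its values are assumptions for illustration, not my real table). An offset-based rolling window produces one result per input row, so it advances row by row rather than day by day:

```python
import pandas as pd

# Hypothetical data: two full days of hourly rows.
index = pd.date_range("2022-12-26", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": range(48)}, index=index)

# A "2D" offset window ends at each row's timestamp and looks back two days,
# so the output has exactly as many rows as the input.
rolled = toy.rolling("2D").count()
print(len(toy), len(rolled))
print(rolled.iloc[[0, -1]])
```

The first row's window holds only itself, while the last row's window holds all 48 rows. There is no single row that naturally represents "one window per calendar date".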

Is the behaviour I am looking for doable with pandas.DataFrame.rolling or must I look elsewhere/write my own windowing function?

Any guidance would be appreciated. Thanks!


Solution

  • So from what I understand, you want to create windows of length ndays and the next window should start with the next day.

    Given a DataFrame with an hourly index that spans five calendar days (the last one only partly covered):

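    import pandas as pd
    import numpy as np

    periods = 23 * 5  # 115 hourly rows, so the fifth day is only partly covered
    index = pd.date_range('2022-12-16', periods=periods, freq='h')
    df = pd.DataFrame({'value': list(range(periods))}, index=index)
    # Drop up to 25% of the rows at random to simulate missing data
    # (np.random.choice samples with replacement, so duplicates are possible).
    d = np.random.choice(index, int(periods * 0.25))
    df = df.drop(index=d)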
    df.head(5)          
    >>>                  value
    2022-12-16 00:00:00      0
    2022-12-16 01:00:00      1
    2022-12-16 02:00:00      2
    2022-12-16 04:00:00      4
    2022-12-16 05:00:00      5
    

    I randomly dropped some indices to simulate missing data. We can use df.resample to group the data by day (regardless of missing data):

    days = df.resample('1d')
    print(days.get_group('2022-12-16'))
    >>>                  value
    2022-12-16 00:00:00      0
    2022-12-16 01:00:00      1
    2022-12-16 02:00:00      2
    2022-12-16 04:00:00      4
    2022-12-16 05:00:00      5
    2022-12-16 06:00:00      6
    2022-12-16 07:00:00      7
    2022-12-16 08:00:00      8
    2022-12-16 09:00:00      9
    2022-12-16 11:00:00     11
    2022-12-16 12:00:00     12
    2022-12-16 13:00:00     13
    2022-12-16 14:00:00     14
    2022-12-16 15:00:00     15
    2022-12-16 17:00:00     17
    2022-12-16 18:00:00     18
    2022-12-16 19:00:00     19
    2022-12-16 21:00:00     21
    2022-12-16 22:00:00     22
    2022-12-16 23:00:00     23
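
The resampler can also be iterated directly, which is what the sliding step relies on. As a minimal sketch on a made-up frame (`toy` is an assumption for illustration, not the frame above), each iteration yields a (label, frame) pair for one daily bin:

```python
import pandas as pd

# Hypothetical data: three full days of hourly rows.
index = pd.date_range("2022-12-16", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": range(72)}, index=index)

# Iterating a resampler yields one (label, frame) pair per daily bin.
pairs = list(toy.resample("1D"))
labels = [label for label, _ in pairs]
sizes = [len(frame) for _, frame in pairs]
print(labels[0], sizes)
```

The second element of each pair is the sub-frame for that day, so indexing a pair with `[1]` picks out the day's rows.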
    

    Now we only need to iterate over the days in a "sliding" manner. The more-itertools package has some helpful functions, such as windowed, which lets us easily control the size of the window (here with ndays):

    from more_itertools import windowed

    ndays = 2
    # Each element of `days` is a (label, frame) pair, so w[1] is that day's frame.
    windows = [
        pd.concat([w[1] for w in window])
        for window in windowed(days, ndays)
    ]
    

    Printing the first and last row of each window returns:

    for window in windows:
        print(window.iloc[[0, -1]])
    
    >>>                  value
    2022-12-16 00:00:00      0
    2022-12-17 23:00:00     47
                         value
    2022-12-17 00:00:00     24
    2022-12-18 23:00:00     71
                         value
    2022-12-18 00:00:00     48
    2022-12-19 23:00:00     95
                         value
    2022-12-19 01:00:00     73
    2022-12-20 18:00:00    114
    

    Furthermore, you can set step in windowed to control the step size between windows.
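
One caveat when changing the step: if the number of days does not divide evenly, windowed pads the final window with its fillvalue (None by default), and indexing that padding with `[1]` would fail. A dependency-free sketch that avoids the padding and also collects one result row per window could look like the following. The helper `day_windows`, the `toy` frame, and the summary columns are hypothetical names chosen for illustration:

```python
import pandas as pd


def day_windows(frame, ndays, step=1):
    """Yield (first_day, sub_frame) for each complete window of `ndays` days."""
    days = list(frame.resample("1D"))  # (label, frame) pairs, one per day
    # Stop early enough that every window contains exactly `ndays` days.
    for start in range(0, len(days) - ndays + 1, step):
        chunk = days[start:start + ndays]
        yield chunk[0][0], pd.concat([part for _, part in chunk])


# Hypothetical data: five full days of hourly rows.
index = pd.date_range("2022-12-16", periods=24 * 5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": range(24 * 5)}, index=index)

# One summary row per window, keyed by the window's first day.
records = [
    {"start": start, "rows": len(win), "mean": win["value"].mean()}
    for start, win in day_windows(toy, ndays=2, step=2)
]
summary = pd.DataFrame(records).set_index("start")
print(summary)
```

With five days, ndays=2 and step=2, this yields two complete windows and silently drops the leftover fifth day. Whether to drop or keep a short trailing window is a design choice that depends on the operation being applied.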