pythonpandasloopsdatesliding

Looping through pandas dataframe


Dataframe:

       date      A    B    C    D
index
0      2015-01   ..   ..   ..   ..
1      2015-01   ..   ..   ..   ..
2      2015-02   ..   ..   ..   ..
3      2015-02   ..   ..   ..   ..
4      2015-03   ..   ..   ..   ..
5      2015-03   ..   ..   ..   ..
6      2015-04   ..   ..   ..   ..
7      2015-04   ..   ..   ..   ..
8      2015-05   ..   ..   ..   ..
9      2015-05   ..   ..   ..   ..
...
1000   ...       ..   ..   ..   ..

I want to iterate using the date (pd.to_datetime) starting with a fixed window (e.g., first three months [2015-01, 2015-01, 2015-02, 2015-02, 2015-03, 2015-03]) and returning a dataframe:

       date      A    B    C    D
index
0      2015-01   ..   ..   ..   ..
1      2015-01   ..   ..   ..   ..
2      2015-02   ..   ..   ..   ..
3      2015-02   ..   ..   ..   ..
4      2015-03   ..   ..   ..   ..
5      2015-03   ..   ..   ..   ..

Then adding the next month [2015-04, 2015-04] and dropping the oldest one [2015-01, 2015-01], returning the next dataframe:

       date      A    B    C    D
index
2      2015-02   ..   ..   ..   ..
3      2015-02   ..   ..   ..   ..
4      2015-03   ..   ..   ..   ..
5      2015-03   ..   ..   ..   ..
6      2015-04   ..   ..   ..   ..
7      2015-04   ..   ..   ..   ..

Continuing this until end of data.

I have figured out the following code:

periods = len(dataframe)
fxw = 3

for i in range(0, periods):
    start = i                       
    if i + fxw > periods:              
       break
    else: end_df = i + fxw
 
# output:
 
# df1
       date      A    B    C    D
index
0      2015-01   ..   ..   ..   ..
1      2015-01   ..   ..   ..   ..
2      2015-02   ..   ..   ..   ..

# df2
       date      A    B    C    D
index
3      2015-02   ..   ..   ..   ..
4      2015-03   ..   ..   ..   ..
5      2015-03   ..   ..   ..   ..
 
...
# dfend 

Where I can set a fixed window (fxw = 3) to iterate over the length of the dataframe 3 rows at time until end of data. (e.g., if dataframe has 12 rows, it will return 4 dataframes of 3 rows each). In this way, however, neither I am selecting the window by date nor I am dropping the last datapoint and adding the next one. I have not figured out how to do so yet. If anyone has a possible solution/suggestion, it would be very appreciated! Thank you!


Solution

  • Here is one way. Start by building a data frame:

    import numpy as np
    import pandas as pd
    
    date = [ f'2015-{i:02d}' for i in range(1, 7) ]
    date = np.repeat(date, 2)
    date = [pd.Period(d) for d in date]
    
    n = len(date)
    amt = [10 * i for i in range(n)]
    
    df = pd.DataFrame({'date': date, 'amt': amt})
    
    print(df.head())
    

    Next, construct a period range, using the min and max dates from the data frame:

    period_range = pd.period_range(start=df['date'].min(), 
                                   end=df['date'].max(), 
                                   freq='M')
    

    Now, iterate over the data frame:

    months_in_window = 2
    
    for start, end in zip(period_range, period_range[months_in_window - 1: ]):
        mask = (start <= df['date']) & (df['date'] <= end)
        print(df[mask], end='\n\n')
    
          date  amt
    0  2015-01    0
    1  2015-01   10
    2  2015-02   20
    3  2015-02   30
    
          date  amt
    2  2015-02   20
    3  2015-02   30
    4  2015-03   40
    5  2015-03   50
    
    <rest of output omitted to save space>
    

    You can often use pandas tools (including groupby and rolling) to avoid iterating over a data frame.

    UPDATE:

    We can control both length of a window, and time from start of one window to start of next:

    # create list of periods
    periods = pd.period_range(start='2020-01-01', periods=24, freq='M')
    
    # create parameters
    months_in_window = 3   # start of window i to end of window i
    step = 5               # start of window i to start of window i+1
    
    # create start and end points for each window
    windows = [
        (start, end)
        for start, end 
            in zip(periods[::step], periods[window_size-1::step])
    ]
    
    for w in windows: print(w)
    
    (Period('2020-01', 'M'), Period('2020-03', 'M'))
    (Period('2020-06', 'M'), Period('2020-08', 'M'))
    (Period('2020-11', 'M'), Period('2021-01', 'M'))
    (Period('2021-04', 'M'), Period('2021-06', 'M'))
    (Period('2021-09', 'M'), Period('2021-11', 'M'))
    

    Finally, iterating one the data frame looks like this (no change to mask):

    for start, end in windows:
        mask = (start <= df['date']) & (df['date'] <= end)
        print(df[mask], end='\n\n')