Dataframe:
date A B C D
index
0 2015-01 .. .. .. ..
1 2015-01 .. .. .. ..
2 2015-02 .. .. .. ..
3 2015-02 .. .. .. ..
4 2015-03 .. .. .. ..
5 2015-03 .. .. .. ..
6 2015-04 .. .. .. ..
7 2015-04 .. .. .. ..
8 2015-05 .. .. .. ..
9 2015-05 .. .. .. ..
...
1000 ... .. .. .. ..
I want to iterate using the date (pd.to_datetime) starting with a fixed window (e.g., first three months [2015-01, 2015-01, 2015-02, 2015-02, 2015-03, 2015-03]) and returning a dataframe:
date A B C D
index
0 2015-01 .. .. .. ..
1 2015-01 .. .. .. ..
2 2015-02 .. .. .. ..
3 2015-02 .. .. .. ..
4 2015-03 .. .. .. ..
5 2015-03 .. .. .. ..
Then adding the next month [2015-04, 2015-04] and dropping the oldest one [2015-01, 2015-01], returning the next dataframe:
date A B C D
index
2 2015-02 .. .. .. ..
3 2015-02 .. .. .. ..
4 2015-03 .. .. .. ..
5 2015-03 .. .. .. ..
6 2015-04 .. .. .. ..
7 2015-04 .. .. .. ..
Continuing this until end of data.
I have figured out the following code:
periods = len(dataframe)
fxw = 3
for i in range(0, periods):
start = i
if i + fxw > periods:
break
else: end_df = i + fxw
# output:
# df1
date A B C D
index
0 2015-01 .. .. .. ..
1 2015-01 .. .. .. ..
2 2015-02 .. .. .. ..
# df2
date A B C D
index
3 2015-02 .. .. .. ..
4 2015-03 .. .. .. ..
5 2015-03 .. .. .. ..
...
# dfend
Where I can set a fixed window (fxw = 3) to iterate over the length of the dataframe 3 rows at time until end of data. (e.g., if dataframe has 12 rows, it will return 4 dataframes of 3 rows each). In this way, however, neither I am selecting the window by date nor I am dropping the last datapoint and adding the next one. I have not figured out how to do so yet. If anyone has a possible solution/suggestion, it would be very appreciated! Thank you!
Here is one way. Start by building a data frame:
import numpy as np
import pandas as pd
date = [ f'2015-{i:02d}' for i in range(1, 7) ]
date = np.repeat(date, 2)
date = [pd.Period(d) for d in date]
n = len(date)
amt = [10 * i for i in range(n)]
df = pd.DataFrame({'date': date, 'amt': amt})
print(df.head())
Next, construct a period range, using the min and max dates from the data frame:
period_range = pd.period_range(start=df['date'].min(),
end=df['date'].max(),
freq='M')
Now, iterate over the data frame:
months_in_window = 2
for start, end in zip(period_range, period_range[months_in_window - 1: ]):
mask = (start <= df['date']) & (df['date'] <= end)
print(df[mask], end='\n\n')
date amt
0 2015-01 0
1 2015-01 10
2 2015-02 20
3 2015-02 30
date amt
2 2015-02 20
3 2015-02 30
4 2015-03 40
5 2015-03 50
<rest of output omitted to save space>
You can often use pandas tools (including groupby
and rolling
) to avoid iterating over a data frame.
UPDATE:
We can control both length of a window, and time from start of one window to start of next:
# create list of periods
periods = pd.period_range(start='2020-01-01', periods=24, freq='M')
# create parameters
months_in_window = 3 # start of window i to end of window i
step = 5 # start of window i to start of window i+1
# create start and end points for each window
windows = [
(start, end)
for start, end
in zip(periods[::step], periods[window_size-1::step])
]
for w in windows: print(w)
(Period('2020-01', 'M'), Period('2020-03', 'M'))
(Period('2020-06', 'M'), Period('2020-08', 'M'))
(Period('2020-11', 'M'), Period('2021-01', 'M'))
(Period('2021-04', 'M'), Period('2021-06', 'M'))
(Period('2021-09', 'M'), Period('2021-11', 'M'))
Finally, iterating one the data frame looks like this (no change to mask):
for start, end in windows:
mask = (start <= df['date']) & (df['date'] <= end)
print(df[mask], end='\n\n')