I am completing a Udemy course on time series forecasting. I am very new to coding.
When it comes to splitting the data, the course has me manually identify the row at which to split. Is there a way to do this automatically, e.g. so that the test set is always the last 12 months of data?
Currently I have this:
# Set one year for testing
train = df1.iloc[:115]
test = df1.iloc[115:]
The data set is very simple: only a time index and one column with totals.
From what I understand, you don't want to type in the exact row position of the split by hand; you want the last 12 months to be picked out automatically. If we sort the data by date (as shown below), we can identify the last 12 months of data dynamically.
import pandas as pd
# Make sure the dates are real datetimes, then sort oldest to newest
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.sort_values(by='Date')
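(It seems like your data is already sorted. Also, the code here assumes the dates are in a column called 'Date'. If they are the index instead, as your question suggests, use df1.index wherever df1['Date'] appears and df1.sort_index() for the sorting.)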
split_date = df1['Date'].max() - pd.DateOffset(months=12)
We subtract 12 months from the latest date using pd.DateOffset(months=12) to calculate the date where the split should happen.
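# Strict > keeps exactly the last 12 months in test;
# >= would also include the row dated exactly 12 months before the latest date
train = df1[df1['Date'] <= split_date]
test = df1[df1['Date'] > split_date]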
Now you can print both sets to check the split:
print("Train Data:")
print(train)
print("Test Data:")
print(test)
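Since the question says the dates are the index rather than a column, here is a minimal sketch of the same idea working on the index directly. The function name split_last_months, the column name "Total", and the synthetic 127-month data set are all made up for illustration; swap in your own df1.

```python
import pandas as pd

def split_last_months(df, months=12):
    """Split a frame with a DatetimeIndex into (train, test).

    test holds the final `months` months; train holds everything before.
    (Hypothetical helper, not part of pandas.)
    """
    df = df.sort_index()
    # Date that lies `months` months before the latest date in the data
    cutoff = df.index.max() - pd.DateOffset(months=months)
    train = df[df.index <= cutoff]
    test = df[df.index > cutoff]
    return train, test

# Synthetic monthly data standing in for df1 (assumed shape: date index, one column)
idx = pd.date_range("2010-01-01", periods=127, freq="MS")
df1 = pd.DataFrame({"Total": range(127)}, index=idx)

train, test = split_last_months(df1, months=12)
print(len(train), len(test))
```

If the data has exactly one row per month with no gaps, df1.iloc[:-12] and df1.iloc[-12:] should give the same split without any date arithmetic; the date-based version is mainly useful when the number of rows per month can vary.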