python time-series

Python Time Series Forecasting - Splitting Test and Train


I am completing a Udemy course on time series forecasting. I am very new to coding.

When it comes to splitting the data, the course points me toward manually identifying where to split. Is there a way to do this automatically, e.g. so that the test set is always the last 12 months of data?

Currently I have this:

# Set one year for testing
train = df1.iloc[:115]
test = df1.iloc[115:]

The data set is very simple, only a time as an index and one column with totals.

Example of data


Solution

  • As I understand your question, you don't want to type in the exact index position of the split by hand; you want the last 12 months selected automatically. Once the data is sorted by date (as shown below), the last 12 months can be identified dynamically.

    1. Ensure 'Date' is in datetime format and sorted

    import pandas as pd

    df1['Date'] = pd.to_datetime(df1['Date'])
    df1 = df1.sort_values(by='Date')
    

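    (It seems like your data is already sorted. If the dates are the index rather than a 'Date' column, use df1.index in place of df1['Date'] and df1.sort_index() in place of sort_values.)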

    2. Calculate the split date dynamically

    split_date = df1['Date'].max() - pd.DateOffset(months=12)
    

    We subtract 12 months using pd.DateOffset(months=12) to calculate the date where the split should happen.

    3. Split the data

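    # Rows dated after split_date form the test set; using > rather than >=
    # keeps monthly data to exactly 12 test rows instead of 13
    train = df1[df1['Date'] <= split_date]
    test = df1[df1['Date'] > split_date]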
    

    Now you can print both sets to check the split:

    print("Train Data:")
    print(train)
    
    print("Test Data:")
    print(test)
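
Since the question says the dates are the index rather than a column, here is a minimal sketch of the same idea written against a DatetimeIndex. The helper name split_last_months and the sample data are made up for illustration, and the sketch assumes one row per month; adapt it to your own frame.

```python
import pandas as pd


def split_last_months(df, months=12):
    """Return (train, test), where test covers the final `months` months of df.

    Assumes df has a DatetimeIndex; the function name is hypothetical.
    """
    df = df.sort_index()
    # Everything dated after this point goes into the test set
    split_date = df.index.max() - pd.DateOffset(months=months)
    train = df.loc[df.index <= split_date]
    test = df.loc[df.index > split_date]
    return train, test


# Made-up monthly data standing in for df1 (127 month-start rows)
idx = pd.date_range("2015-01-01", periods=127, freq="MS")
df1 = pd.DataFrame({"Total": range(127)}, index=idx)

train, test = split_last_months(df1)
print(len(train), len(test))  # 115 12
```

If the series is evenly spaced with no missing months, df1.iloc[:-12] and df1.iloc[-12:] give the same split by position. The date-based version is safer when rows may be missing or the frequency is not monthly.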