Imagine we have multiple time-series observations for multiple entities, and we want to perform hyper-parameter tuning on a single model, splitting the data in a time-series cross-validation fashion.
To my knowledge, there isn't a straightforward way to perform this hyper-parameter tuning within the scikit-learn framework. scikit-learn does provide TimeSeriesSplit for a single time series, but it doesn't work for multiple entities.
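To see why, here is a minimal sketch (toy data made up for illustration): TimeSeriesSplit splits by row position, so on stacked panel data it cuts across entities rather than across periods.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# two entities stacked in long format
toy = pd.DataFrame({'country': ['ESP'] * 3 + ['FRA'] * 3,
                    'period': [0, 1, 2, 0, 1, 2]})
for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(toy):
    print('train:', toy.iloc[train_idx].values.tolist(),
          'test:', toy.iloc[test_idx].values.tolist())
# the first fold trains on ESP periods 0-1 but tests on ESP period 2
# and FRA period 0, so the model is trained on data from periods *after*
# part of its test set: row-based splits leak across entities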
As a simple example imagine we have a dataframe:
import pandas as pd
import numpy as np
from itertools import product

# create a toy panel dataframe: two countries observed over ten periods
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)
# this produces the following dataframe (the random feature values will differ per run):
   country  period  target  a_feature
0      ESP       0       1       0.08
1      ESP       1       1      -2.00
2      ESP       2       1       0.10
3      ESP       3       1      -0.59
4      ESP       4       1      -0.83
5      ESP       5       1       0.05
6      ESP       6       1       0.05
7      ESP       7       1       0.42
8      ESP       8       1       0.04
9      ESP       9       1       2.17
10     FRA       0       0      -0.44
11     FRA       1       0      -0.48
12     FRA       2       0       0.82
13     FRA       3       0      -1.64
14     FRA       4       0       0.19
15     FRA       5       0       0.60
16     FRA       6       0      -0.73
17     FRA       7       0      -0.50
18     FRA       8       0       1.11
19     FRA       9       0      -0.75
We want to train a single model across Spain and France: take all the data up to a certain period, train on it, and then use that model to predict the next period for both countries. We then want to assess which set of hyper-parameters performs best.
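Concretely, the desired folds form an expanding window over periods, applied to both countries at once (hypothetical fold boundaries, just to fix ideas):

fold 1: train on periods 0-4 for ESP and FRA, test on period 5 for ESP and FRA
fold 2: train on periods 0-5 for ESP and FRA, test on period 6 for ESP and FRA
fold 3: train on periods 0-6 for ESP and FRA, test on period 7 for ESP and FRA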
How can we do hyper-parameter tuning with panel data in a time-series cross-validation framework?
I propose PanelSplit, a custom cross-validator for panel data. It is essentially a wrapper around TimeSeriesSplit, taking the same arguments but adding panel-data functionality.
PanelSplit works essentially as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

class PanelSplit:
    def __init__(self, unique_periods, train_periods, n_splits=5, gap=0, test_size=None, max_train_size=None):
        """
        A cross-validator for panel data: time-series cross-validation where
        the train/test splits are computed over unique periods and then mapped
        back to every entity observed in those periods.

        Parameters:
        - unique_periods: pandas DataFrame or Series containing the unique periods
        - train_periods: the period of every observation in the panel
        - n_splits: number of splits for TimeSeriesSplit
        - gap: gap between train and test sets in TimeSeriesSplit
        - test_size: size of the test set in TimeSeriesSplit
        - max_train_size: maximum size for a single training set
        """
        self.tss = TimeSeriesSplit(n_splits=n_splits, gap=gap, test_size=test_size, max_train_size=max_train_size)
        indices = self.tss.split(unique_periods)
        self.u_periods_cv = self.split_unique_periods(indices, unique_periods)
        self.all_periods = train_periods
        self.n_splits = n_splits

    def split_unique_periods(self, indices, unique_periods):
        """
        Split the unique periods into train/test sets based on TimeSeriesSplit indices.

        Parameters:
        - indices: TimeSeriesSplit indices
        - unique_periods: pandas DataFrame or Series containing the unique periods

        Returns: list of tuples of (train periods, test periods)
        """
        u_periods_cv = []
        for train_index, test_index in indices:
            unique_train_periods = unique_periods.iloc[train_index].values
            unique_test_periods = unique_periods.iloc[test_index].values
            u_periods_cv.append((unique_train_periods, unique_test_periods))
        return u_periods_cv

    def split(self, X=None, y=None, groups=None):
        """
        Generate train/test indices for each fold: every observation whose
        period is in the fold's train (test) periods goes into the fold's
        train (test) set, regardless of entity.
        """
        self.all_indices = []
        for train_periods, test_periods in self.u_periods_cv:
            train_indices = self.all_periods.loc[self.all_periods.isin(train_periods)].index
            test_indices = self.all_periods.loc[self.all_periods.isin(test_periods)].index
            self.all_indices.append((train_indices, test_indices))
        return self.all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        """
        Returns: the number of splits
        """
        return self.n_splits
Here is a demo of how it can be used as a cross-validator for hyperparameter tuning.
Before doing hyperparameter tuning in a real setting, I reset indices and drop NaN values with respect to both feature variables and the target. This usually saves me from indexing errors.
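As a minimal sketch of that clean-up (feature_cols is a hypothetical name for whatever feature columns you use):

feature_cols = ['a_feature']  # hypothetical: list your feature columns here
df = df.dropna(subset=feature_cols + ['target']).reset_index(drop=True)

The fresh RangeIndex matters because PanelSplit returns the index labels of train_periods, and scikit-learn treats those labels as row positions; if the two disagree, the wrong rows end up in each fold.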
import pandas as pd
import numpy as np
from itertools import product

# create the same toy panel dataframe as above
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)

# split on the unique periods; PanelSplit maps the splits back to every
# row of the panel via df.period
unique_periods = pd.Series(df.period.unique())
panel_split = PanelSplit(unique_periods=unique_periods, train_periods=df.period, n_splits=3)
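To sanity-check the folds (with ten periods and n_splits=3, TimeSeriesSplit defaults to a two-period test window, so the expected output is deterministic):

for fold, (train_idx, test_idx) in enumerate(panel_split.split()):
    print(f"fold {fold}: train periods {sorted(df.loc[train_idx, 'period'].unique().tolist())}, "
          f"test periods {sorted(df.loc[test_idx, 'period'].unique().tolist())}")
# fold 0: train periods [0, 1, 2, 3], test periods [4, 5]
# fold 1: train periods [0, 1, 2, 3, 4, 5], test periods [6, 7]
# fold 2: train periods [0, 1, 2, 3, 4, 5, 6, 7], test periods [8, 9]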
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'max_depth': [2, 3]}
param_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=panel_split)
param_search.fit(df[['a_feature']], df['target'])
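After fitting, the usual GridSearchCV attributes are available:

print(param_search.best_params_)  # e.g. {'max_depth': 2}; the winner depends on the random data
best_model = param_search.best_estimator_  # refit on the full dataset by default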