Imagine we have multiple time-series observations for multiple entities, and we want to perform hyper-parameter tuning on a single model, splitting the data in a time-series cross-validation fashion.
To my knowledge, there isn't a straightforward way to perform this hyper-parameter tuning within the scikit-learn framework. scikit-learn does provide TimeSeriesSplit for a single time series, but it doesn't work for multiple entities.
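To see why, here is a minimal sketch (toy data made up for illustration): TimeSeriesSplit splits by row position, so on stacked panel data it cuts across entities rather than across periods.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# two entities stacked in long format
toy = pd.DataFrame({'country': ['ESP'] * 3 + ['FRA'] * 3,
                    'period': [0, 1, 2, 0, 1, 2]})
for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(toy):
    print('train:', toy.iloc[train_idx].values.tolist(),
          'test:', toy.iloc[test_idx].values.tolist())
# the first fold trains on ESP periods 0-1 but tests on ESP period 2
# and FRA period 0, so the model is trained on data from periods *after*
# part of its test set: row-based splits leak across entities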
As a simple example imagine we have a dataframe:
import pandas as pd
import numpy as np
from itertools import product

# create a toy panel dataframe: two countries observed over ten periods
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)
# this produces the following dataframe (the random feature values will differ per run):
   country  period  target  a_feature
0      ESP       0       1       0.08
1      ESP       1       1      -2.00
2      ESP       2       1       0.10
3      ESP       3       1      -0.59
4      ESP       4       1      -0.83
5      ESP       5       1       0.05
6      ESP       6       1       0.05
7      ESP       7       1       0.42
8      ESP       8       1       0.04
9      ESP       9       1       2.17
10     FRA       0       0      -0.44
11     FRA       1       0      -0.48
12     FRA       2       0       0.82
13     FRA       3       0      -1.64
14     FRA       4       0       0.19
15     FRA       5       0       0.60
16     FRA       6       0      -0.73
17     FRA       7       0      -0.50
18     FRA       8       0       1.11
19     FRA       9       0      -0.75
We want to train a single model across Spain and France: take all the data up to a certain period, train on it, and then use that model to predict the next period for both countries. We then want to assess which set of hyper-parameters performs best.
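Concretely, the desired folds form an expanding window over periods, applied to both countries at once (hypothetical fold boundaries, just to fix ideas):

fold 1: train on periods 0-4 for ESP and FRA, test on period 5 for ESP and FRA
fold 2: train on periods 0-5 for ESP and FRA, test on period 6 for ESP and FRA
fold 3: train on periods 0-6 for ESP and FRA, test on period 7 for ESP and FRA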
How can we do hyper-parameter tuning with panel data in a time-series cross-validation framework?
I propose PanelSplit, a custom cross-validator for panel data. It is essentially a wrapper around TimeSeriesSplit, taking the same arguments but adding panel-data functionality.
PanelSplit works essentially as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

class PanelSplit:
    def __init__(self, unique_periods, train_periods, n_splits=5, gap=0, test_size=None, max_train_size=None):
        """
        A cross-validator for panel data: time-series cross-validation where
        the train/test splits are computed over unique periods and then mapped
        back to every entity observed in those periods.

        Parameters:
        - unique_periods: pandas DataFrame or Series containing the unique periods
        - train_periods: the period of every observation in the panel
        - n_splits: number of splits for TimeSeriesSplit
        - gap: gap between train and test sets in TimeSeriesSplit
        - test_size: size of the test set in TimeSeriesSplit
        - max_train_size: maximum size for a single training set
        """
        self.tss = TimeSeriesSplit(n_splits=n_splits, gap=gap, test_size=test_size, max_train_size=max_train_size)
        indices = self.tss.split(unique_periods)
        self.u_periods_cv = self.split_unique_periods(indices, unique_periods)
        self.all_periods = train_periods
        self.n_splits = n_splits

    def split_unique_periods(self, indices, unique_periods):
        """
        Split the unique periods into train/test sets based on TimeSeriesSplit indices.

        Parameters:
        - indices: TimeSeriesSplit indices
        - unique_periods: pandas DataFrame or Series containing the unique periods

        Returns: list of tuples of (train periods, test periods)
        """
        u_periods_cv = []
        for train_index, test_index in indices:
            unique_train_periods = unique_periods.iloc[train_index].values
            unique_test_periods = unique_periods.iloc[test_index].values
            u_periods_cv.append((unique_train_periods, unique_test_periods))
        return u_periods_cv

    def split(self, X=None, y=None, groups=None):
        """
        Generate train/test indices for each fold: every observation whose
        period is in the fold's train (test) periods goes into the fold's
        train (test) set, regardless of entity.
        """
        self.all_indices = []
        for train_periods, test_periods in self.u_periods_cv:
            train_indices = self.all_periods.loc[self.all_periods.isin(train_periods)].index
            test_indices = self.all_periods.loc[self.all_periods.isin(test_periods)].index
            self.all_indices.append((train_indices, test_indices))
        return self.all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        """
        Returns: the number of splits
        """
        return self.n_splits
Here is a demo of how it can be used as a cross-validator for hyperparameter tuning.
Before doing hyperparameter tuning in a real setting, I reset indices and drop NaN values with respect to both feature variables and the target. This usually saves me from indexing errors.
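As a minimal sketch of that clean-up (feature_cols is a hypothetical name for whatever feature columns you use):

feature_cols = ['a_feature']  # hypothetical: list your feature columns here
df = df.dropna(subset=feature_cols + ['target']).reset_index(drop=True)

The fresh RangeIndex matters because PanelSplit returns the index labels of train_periods, and scikit-learn treats those labels as row positions; if the two disagree, the wrong rows end up in each fold.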
import pandas as pd
import numpy as np
from itertools import product

# create the same toy panel dataframe as above
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)

# split on the unique periods; PanelSplit maps the splits back to every
# row of the panel via df.period
unique_periods = pd.Series(df.period.unique())
panel_split = PanelSplit(unique_periods=unique_periods, train_periods=df.period, n_splits=3)
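To sanity-check the folds (with ten periods and n_splits=3, TimeSeriesSplit defaults to a two-period test window, so the expected output is deterministic):

for fold, (train_idx, test_idx) in enumerate(panel_split.split()):
    print(f"fold {fold}: train periods {sorted(df.loc[train_idx, 'period'].unique().tolist())}, "
          f"test periods {sorted(df.loc[test_idx, 'period'].unique().tolist())}")
# fold 0: train periods [0, 1, 2, 3], test periods [4, 5]
# fold 1: train periods [0, 1, 2, 3, 4, 5], test periods [6, 7]
# fold 2: train periods [0, 1, 2, 3, 4, 5, 6, 7], test periods [8, 9]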
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'max_depth': [2, 3]}
param_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=panel_split)
param_search.fit(df[['a_feature']], df['target'])
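After fitting, the usual GridSearchCV attributes are available:

print(param_search.best_params_)  # e.g. {'max_depth': 2}; the winner depends on the random data
best_model = param_search.best_estimator_  # refit on the full dataset by default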