
GridSearchCV with data indexed by time


I am trying to use GridSearchCV from sklearn.model_selection. My data is a classification dataset indexed by time. As a result, when doing cross-validation, I want every training fold to contain only data that comes strictly before the data in the corresponding test fold.

So my training set X_train, y_train looks like

Time        feature_1 feature_2 result
2020-01-30  3         6         1
2020-02-01  4         2         0
2021-03-02  7         1         0

and the test set X_test, y_test looks like

Time        feature_1 feature_2 result
2023-01-30  3         6         1
2023-02-01  4         2         0
2024-03-02  7         1         0

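Suppose I am using a model such as XGBoost. To optimise the hyperparameters, I used GridSearchCV, and the code looks like

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Hyperparameter values to search over
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_child_weight': [0, 1, 2, 3, 4, 5],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# verbosity=0 and n_jobs=1 replace the deprecated silent=True and nthread=1
clf = XGBClassifier(learning_rate=0.02,
                    n_estimators=600,
                    objective='binary:logistic',
                    verbosity=0,
                    n_jobs=1)

# cv is not set here, so GridSearchCV falls back to its default splitter
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=-1)

grid_search.fit(X_train, y_train)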

However, how should I set the cv parameter of GridSearchCV? Thank you so much in advance.

Edit: I tried setting cv=0, since I want my training data to be strictly "earlier" than the test data, and got the following error: InvalidParameterError: The 'cv' parameter of GridSearchCV must be an int in the range [2, inf), an object implementing 'split' and 'get_n_splits', an iterable or None. Got 0 instead.


Solution

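  • The default cross-validation in GridSearchCV (5-fold, stratified for classifiers) ignores temporal order when splitting, so a training fold can contain rows that are later than the rows it is validated on. Instead, pass a TimeSeriesSplit instance from sklearn.model_selection as the cv argument. TimeSeriesSplit is built for exactly this use case: in every split the training indices all come before the test indices, and the training window grows from one split to the next. It splits by row position, so the data must already be sorted by time.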
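A minimal sketch of the suggestion above, under a few assumptions: the data is synthetic and stands in for X_train and y_train, the rows are already in chronological order, and scikit-learn's GradientBoostingClassifier is swapped in for XGBClassifier so that the example needs only scikit-learn. The small param_grid and n_splits=5 are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data).
# Rows are assumed to be sorted by time, oldest first; with a pandas
# DataFrame indexed by time, sort it (e.g. df.sort_index()) before splitting.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 2))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# Expanding-window splitter: each fold trains on a prefix of the rows
# and validates on the block of rows that immediately follows it.
tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X_train):
    print(f"train rows 0..{train_idx[-1]}, "
          f"validate rows {test_idx[0]}..{test_idx[-1]}")

# Stand-in estimator; an XGBClassifier could be passed here instead.
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    scoring="accuracy",
    cv=tscv,  # time-ordered folds instead of the default splitter
    n_jobs=1,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

The loop only prints the fold boundaries so the ordering can be checked by eye. GridSearchCV calls tscv.split itself during fit, so passing cv=tscv is the only change needed relative to the code in the question.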