I am trying to use GridSearchCV from sklearn.model_selection. My data is a classification dataset indexed by time. As a result, when doing cross-validation, I want each training fold to contain only data that comes before the data in the corresponding test fold.
So my training set X_train, y_train looks like
Time feature_1 feature_2 result
2020-01-30 3 6 1
2020-02-01 4 2 0
2021-03-02 7 1 0
and the test set X_test, y_test looks like
Time feature_1 feature_2 result
2023-01-30 3 6 1
2023-02-01 4 2 0
2024-03-02 7 1 0
Suppose I am using a model such as xgboost. To optimise the hyperparameters, I used GridSearchCV, and the code looks like
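from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Hyperparameter values to search over
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_child_weight': [0, 1, 2, 3, 4, 5],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

clf = XGBClassifier(learning_rate=0.02,
                    n_estimators=600,
                    objective='binary:logistic',
                    verbosity=0,  # replaces the deprecated silent=True
                    n_jobs=1)     # replaces the deprecated nthread=1

grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=-1)
grid_search.fit(X_train, y_train)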
However, how should I set the cv parameter in grid_search? Thank you so much in advance.
Edit: I tried setting cv=0, since I want my training data to be strictly "earlier" than the test data, and I got the following error: InvalidParameterError: The 'cv' parameter of GridSearchCV must be an int in the range [2, inf), an object implementing 'split' and 'get_n_splits', an iterable or None. Got 0 instead.
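The default cross-validation in GridSearchCV does not consider temporal dependency when splitting. You can pass a TimeSeriesSplit instance from sklearn.model_selection as the cv argument instead of relying on the default. TimeSeriesSplit is built for exactly this use case: in every split, the training indices come before the test indices. It splits by row position, so the data must already be sorted by time.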
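A minimal sketch of how this might look, assuming the rows are already sorted chronologically. The synthetic data and the DecisionTreeClassifier are stand-ins I introduced so the snippet runs without xgboost; n_splits=4 and the small param_grid are arbitrary illustrative choices. In your case you would pass your own clf, param_grid, X_train and y_train.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 60 rows, already sorted by time.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each split trains on an initial block of rows and tests on the block after it.
tscv = TimeSeriesSplit(n_splits=4)

# Inspect the folds: every training index is smaller than every test index.
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f"train rows 0..{train_idx.max()}, "
          f"test rows {test_idx.min()}..{test_idx.max()}")

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={"max_depth": [1, 2, 3]},               # illustrative grid only
    scoring="accuracy",
    cv=tscv,  # time-ordered splits instead of the default K-fold
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Note that this only controls the folds inside the grid search. Your held-out X_test, y_test stays separate and is evaluated after fitting, for example with grid_search.score(X_test, y_test).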