ValueError for CatBoostRegressor with StratifiedKFold


I just started learning CatBoost and tried to use CatBoostRegressor with StratifiedKFold, but ran into an error.

Here is the edited post with the full code and the full error for clarification. I also tried for i, (train_index, test_index) in enumerate(fold.split(X,y)): but that did not work either.

from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import LabelEncoder
from catboost import Pool, CatBoostRegressor
fold=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

err = []
y_pred = []
for train_index, test_index in fold.split(X,y):
#for i, (train_index, test_index) in enumerate(fold.split(X,y)):
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
    _train = Pool(X_train, label = y_train)
    _valid = Pool(X_val, label = y_val)

    cb = CatBoostRegressor(n_estimators = 20000, 
                     reg_lambda = 1.0,
                     eval_metric = 'RMSE',
                     random_seed = 42,
                     learning_rate = 0.01,
                     od_type = "Iter",
                     early_stopping_rounds = 2000,
                     depth = 7,
                     cat_features = cate,
                     bagging_temperature = 1.0)
    cb.fit(_train,cat_features=cate,eval_set = _valid, early_stopping_rounds = 2000, use_best_model = True, verbose_eval = 100) 

    p = cb.predict(X_val)
    print("err: ",rmsle(y_val,p))
    err.append(rmsle(y_val,p))
    pred = cb.predict(test_df)
    y_pred.append(pred)
predictions = np.mean(y_pred,0)

ValueError                                Traceback (most recent call last)
<ipython-input-21-3a0df0c7b8d6> in <module>()
      7 err = []
      8 y_pred = []
----> 9 for train_index, test_index in fold.split(X,y):
     10 #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
     11     X_train, X_val = X.iloc[train_index], X.iloc[test_index]

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    333                 .format(self.n_splits, n_samples))
    334 
--> 335         for train, test in super().split(X, y, groups):
    336             yield train, test
    337 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     87         X, y, groups = indexable(X, y, groups)
     88         indices = np.arange(_num_samples(X))
---> 89         for test_index in self._iter_test_masks(X, y, groups):
     90             train_index = indices[np.logical_not(test_index)]
     91             test_index = indices[test_index]

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
    684 
    685     def _iter_test_masks(self, X, y=None, groups=None):
--> 686         test_folds = self._make_test_folds(X, y)
    687         for i in range(self.n_splits):
    688             yield test_folds == i

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y)
    639             raise ValueError(
    640                 'Supported target types are: {}. Got {!r} instead.'.format(
--> 641                     allowed_target_types, type_of_target_y))
    642 
    643         y = column_or_1d(y)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

Solution

  • The error has a fundamental cause in basic ML theory: stratification is defined only for classification, where it ensures that every class is represented in each fold in roughly the same proportion as in the full dataset. With continuous targets there are no classes whose proportions could be preserved, so it is meaningless in regression. The error message says exactly this: 'continuous' targets (i.e. regression) are not supported, only 'binary' or 'multiclass' ones (i.e. classification). This is not a peculiarity of scikit-learn; it follows from what stratification means.

    A relevant hint is also included in the documentation; note the last sentence, which refers to classes:

    Stratified K-Folds cross-validator

    Provides train/test indices to split data in train/test sets.

    This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

    Here is a short demonstration, adapting the example from the documentation, but changing the targets y to be continuous (regression) instead of discrete (classification):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([0.1, 0.5, -1.1, 1.2]) # continuous targets, i.e. regression problem
    skf = StratifiedKFold(n_splits=2)
    
    for train_index, test_index in skf.split(X,y):
        print("something")
    [...]
    ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
    

    So, simply put, you cannot use StratifiedKFold in your (regression) setting; replace it with plain KFold and continue from there.
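
The swap to KFold can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: X and y here are synthetic stand-ins generated with NumPy, and the model fitting step is left as a comment so that the example depends only on NumPy and scikit-learn. The call to type_of_target shows the label that scikit-learn assigns to a continuous target, which is the value reported in the error message.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils.multiclass import type_of_target

# Synthetic regression data (hypothetical stand-ins for the question's X and y).
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous targets, i.e. a regression problem

# scikit-learn labels this target as 'continuous', which is the type
# named in the StratifiedKFold error message.
target_type = type_of_target(y)
print(target_type)

# Plain KFold does not look at the target values, so it works for regression.
fold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_index, test_index in fold.split(X, y):
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = y[train_index], y[test_index]
    # Fit and evaluate the regressor here, e.g. the CatBoostRegressor
    # from the question, using (X_train, y_train) and (X_val, y_val).
    fold_sizes.append((len(train_index), len(test_index)))

print(fold_sizes)
```

If X and y are pandas objects, as X.iloc in the question suggests, the indices returned by split are positional, so y.iloc[train_index] is the safer way to select rows; y[train_index] gives the same rows only when the Series has a default integer index.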