Tags: python, scikit-learn, xgboost, cross-validation, k-fold

Why can the output of sklearn's KFold.split only be enumerated once (and how do I use KFold in xgboost.cv)?


I am trying to create a KFold object to pass to xgboost.cv, and I have:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])

KF = KFold(n_splits=2)
kf = KF.split(df)

But it seems I can only enumerate once:

for i, (train_index, test_index) in enumerate(kf):
    print(f"Fold {i}")

for i, (train_index, test_index) in enumerate(kf):
    print(f"Again_Fold {i}")

This gives the output:

Fold 0
Fold 1

The second loop seems to iterate over an empty object.

I am probably misunderstanding something fundamental, or have completely messed up somewhere, but could someone explain this behavior?

[Edit: follow-up question] This behavior seems to be why passing the result of KF.split(df) to xgboost.cv, as in xgboost.cv(..., folds = KF.split(df)), raises an "index out of range" error. My fix is to rebuild the list of tuples with:

kf = []
for i, (train_index, test_index) in enumerate(KF.split(df)):
    this_split = (list(train_index), list(test_index))
    kf.append(this_split)

xgboost.cv(..., folds = kf)

I am looking for a smarter solution.


Solution

  • Using an example:

    from sklearn.model_selection import KFold
    import xgboost as xgb
    import numpy as np
    
    data = np.random.rand(5, 10)  # 5 samples, 10 features each
    label = np.random.randint(2, size=5)  # binary target
    dtrain = xgb.DMatrix(data, label=label)
    
    param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
    

    If we run your code:

    KF = KFold(n_splits=2)
    xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))
    

    we get the error:

    IndexError                                Traceback (most recent call last)
    Cell In[51], line 2
          1 KF = KFold(n_splits=2)
    ----> 2 xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))
    [..]
    
    IndexError: list index out of range
    

    The documentation asks for a KFold instance, so you only need to pass the object itself:

    KF = KFold(n_splits=2)
    xgb.cv(params= param,dtrain=dtrain, folds = KF)
    

    You can check the source code to see that xgb.cv calls the split method itself, so you don't need to pass KF.split(..).
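
As for why the second loop in the question prints nothing: KFold.split returns a generator, and a generator can only be consumed once. The sketch below illustrates this with a small made-up array (X and the variable names are illustrative, not from the question), and shows that wrapping the result in list() gives something reusable.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # illustrative data: 5 samples, 2 features
KF = KFold(n_splits=2)

splits = KF.split(X)             # a generator, not a list
first_pass = len(list(splits))   # consumes the generator
second_pass = len(list(splits))  # nothing left to yield
print(first_pass, second_pass)   # prints "2 0"

# Each call to split() returns a fresh generator
fresh_pass = len(list(KF.split(X)))

# Materialize once if the folds must be iterated several times
folds = list(KF.split(X))        # list of (train_index, test_index) tuples
reuse_counts = [len(folds), len(folds)]
```

If you do want to build the folds yourself, folds = list(KF.split(df)) should be a shorter equivalent of the loop in the question, assuming xgb.cv accepts a list of (train, test) index pairs as its documentation suggests. Passing the KFold object directly remains the simpler option.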