I am trying to create a KFold object for my xgboost.cv call, and I have
import pandas as pd
from sklearn.model_selection import KFold
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])
KF = KFold(n_splits=2)
kf = KF.split(df)
But it seems I can only enumerate once:
for i, (train_index, test_index) in enumerate(kf):
    print(f"Fold {i}")
for i, (train_index, test_index) in enumerate(kf):
    print(f"Again_Fold {i}")
gives the output:
Fold 0
Fold 1
The second enumerate seems to be on an empty object.
I am probably fundamentally misunderstanding something, or have completely messed up somewhere, but could someone explain this behavior?
[Edit, adding a follow-up question] This behavior seems to be why passing the splits to xgboost.cv, i.e. setting xgboost.cv(..., folds = KF.split(df)), raises an index out of range error. My fix is to recreate the list of tuples with
kf = []
for i, (train_index, test_index) in enumerate(KF.split(df)):
    this_split = (list(train_index), list(test_index))
    kf.append(this_split)
xgboost.cv(..., folds = kf)
I am looking for a smarter solution.
Using an example:
from sklearn.model_selection import KFold
import xgboost as xgb
import numpy as np
data = np.random.rand(5, 10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix(data, label=label)
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
If we run your code (assuming xgboost is also imported under its full name and df is the DataFrame from the question):
KF = KFold(n_splits=2)
xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))
I get the error:
IndexError Traceback (most recent call last)
Cell In[51], line 2
1 KF = KFold(n_splits=2)
----> 2 xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))
[..]
IndexError: list index out of range
The documentation asks for a KFold instance, so you just need to do:
KF = KFold(n_splits=2)
xgb.cv(params= param,dtrain=dtrain, folds = KF)
You can check the source code and see that it calls the split method itself, so you don't need to pass KF.split(..).
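As for why the second loop in the question prints nothing: KFold.split is a generator, so the object it returns can only be iterated once. The sketch below illustrates this with a small made-up array (X is illustrative, not the data from the question). If xgboost.cv walks over folds more than once, an exhausted generator would plausibly explain the index out of range error, though that is an assumption about its internals rather than something verified here.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # illustrative data: 5 samples, 2 features
KF = KFold(n_splits=2)

# split() returns a generator, which is used up after one full pass.
splits = KF.split(X)
first_pass = [i for i, _ in enumerate(splits)]
second_pass = [i for i, _ in enumerate(splits)]  # nothing left to yield

print(first_pass)   # fold numbers seen on the first pass
print(second_pass)  # empty: the generator is exhausted

# Materializing the generator gives a list that can be reused freely.
folds = list(KF.split(X))
for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

If you do need explicit index pairs, list(KF.split(df)) is a shorter way to build the same list of tuples as the loop in the question. Calling KF.split(df) again also returns a fresh generator each time.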