As the title says, I want to know the difference between sklearn's GroupKFold and GroupShuffleSplit.
Both make train/test splits for data that has a group ID, so that no group is separated by the split. I checked one train/test set for each function, and both look like they split the data cleanly by group, but it would be great if someone could confirm that all splits do that.
I made a test with both, for 10 splits:
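from sklearn.model_selection import GroupShuffleSplit

# X, y and groups are my data: 21 samples, each with a group ID
gss = GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)
for train_idx, test_idx in gss.split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)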
train: [ 1 2 3 4 5 11 12 13 14 15 16 17 19 20] test: [ 0 6 7 8 9 10 18]
train: [ 1 2 3 4 5 6 7 8 9 10 12 13 14 18 19 20] test: [ 0 11 15 16 17]
train: [ 0 1 3 4 5 6 7 8 9 10 12 13 14 18 19 20] test: [ 2 11 15 16 17]
train: [ 0 2 3 4 11 12 13 14 15 16 17 18 19 20] test: [ 1 5 6 7 8 9 10]
train: [ 0 1 3 4 5 6 7 8 9 10 11 15 16 17 19 20] test: [ 2 12 13 14 18]
train: [ 1 2 3 4 5 6 7 8 9 10 11 15 16 17 18] test: [ 0 12 13 14 19 20]
train: [ 0 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17] test: [ 5 18 19 20]
train: [ 0 1 3 4 6 7 8 9 10 11 15 16 17 18 19 20] test: [ 2 5 12 13 14]
train: [ 0 1 3 4 5 12 13 14 15 16 17 18 19 20] test: [ 2 6 7 8 9 10 11]
train: [ 0 2 3 4 5 11 12 13 14 15 16 17 19 20] test: [ 1 6 7 8 9 10 18]
from sklearn.model_selection import GroupKFold

# Same X, y and groups as above
group_kfold = GroupKFold(n_splits=10)
for train_idx, test_idx in group_kfold.split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)
train: [ 0 1 2 3 4 5 11 12 13 14 15 16 17 18 19 20] test: [ 6 7 8 9 10]
train: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 18 19 20] test: [15 16 17]
train: [ 0 1 2 3 4 5 6 7 8 9 10 11 15 16 17 18 19 20] test: [12 13 14]
train: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] test: [19 20]
train: [ 0 1 2 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] test: [3 4]
train: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19 20] test: [ 0 18]
train: [ 0 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20] test: [11]
train: [ 0 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] test: [5]
train: [ 0 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] test: [2]
train: [ 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] test: [1]
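The documentation for the un-Group versions makes this clearer. KFold splits the data into k folds and then combines those into different train/test splits, with each fold serving as the test set once, whereas ShuffleSplit repeatedly draws random train/test splits directly. In particular, each sample is tested on exactly once in KFold, but can be tested on zero or multiple times in ShuffleSplit. The output in the question shows the same behavior for the Group versions: with GroupShuffleSplit, samples 3 and 4 never land in a test set while sample 2 lands in four of them, whereas with GroupKFold every sample appears in exactly one test set.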
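To check this for every split rather than by eye, here is a minimal sketch that counts how often each sample ends up in a test set. The `groups` array is an assumption: it is a hypothetical labelling of 21 samples into 11 groups chosen to be consistent with the printed output, not the asker's actual data, and `count_test_appearances` is a helper name made up for this example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical group IDs (assumption): 21 samples in 11 groups.
groups = np.array([0, 1, 2, 3, 3, 4, 5, 5, 5, 5, 5,
                   6, 7, 7, 7, 8, 8, 8, 9, 10, 10])
X = np.zeros((len(groups), 1))  # dummy features, only the shape matters here
y = np.zeros(len(groups))       # dummy targets


def count_test_appearances(splitter):
    """Return, per sample, how many splits put it in the test set."""
    counts = np.zeros(len(groups), dtype=int)
    for train_idx, test_idx in splitter.split(X, y, groups):
        # No group may appear on both sides of a split.
        assert not set(groups[train_idx]) & set(groups[test_idx])
        counts[test_idx] += 1
    return counts


gkf_counts = count_test_appearances(GroupKFold(n_splits=10))
gss_counts = count_test_appearances(
    GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)
)

print("GroupKFold test counts:       ", gkf_counts)
print("GroupShuffleSplit test counts:", gss_counts)
```

With GroupKFold every count should be exactly 1. With GroupShuffleSplit the counts vary, and because the 10 splits together have more test slots than there are groups, at least one group has to be tested more than once. Both splitters keep each group entirely on one side of every split, which the assertion inside the loop checks.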