I use this code to test KFold and StratifiedKFold.
import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold
X = np.array([
[1,2,3,4],
[11,12,13,14],
[21,22,23,24],
[31,32,33,34],
[41,42,43,44],
[51,52,53,54],
[61,62,63,64],
[71,72,73,74]
])
y = np.array([0,0,0,0,1,1,1,1])
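# shuffle=False, so random_state is omitted: it has no effect without shuffling,
# and recent scikit-learn versions raise a ValueError if it is set anyway.
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
folder = KFold(n_splits=4, shuffle=False)
# StratifiedKFold uses y to balance the classes in each fold.
for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("StratifiedKFold done")
# KFold ignores y and splits by position only.
for train, test in folder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("KFold done")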
I found that StratifiedKFold can keep the proportion of labels, but KFold can't.
Train: [1 2 3 5 6 7] | test: [0 4]
Train: [0 2 3 4 6 7] | test: [1 5]
Train: [0 1 3 4 5 7] | test: [2 6]
Train: [0 1 2 4 5 6] | test: [3 7]
StratifiedKFold done
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
KFold done
It seems that StratifiedKFold is better, so should KFold not be used?
When to use KFold instead of StratifiedKFold?
I think the better question is "When should I use StratifiedKFold instead of KFold?"
First, you need to know what "KFold" and "Stratified" mean.
KFold is a cross-validator that divides the dataset into k folds.
Stratified means that each fold has approximately the same proportion of observations from each class as the full dataset.
So StratifiedKFold is a variation of KFold that additionally preserves the class proportions in every fold.
Therefore, the answer is that we should prefer StratifiedKFold over KFold for classification tasks with imbalanced class distributions.
For example:
Suppose there is a dataset with 16 data points and an imbalanced class distribution: 12 of the data points belong to class A and the remaining 4 belong to class B, so the ratio of class B to class A is 1/3. If we use StratifiedKFold with k = 4, then in each iteration the training set contains 9 data points from class A and 3 from class B, while the test set contains 3 data points from class A and 1 from class B.
As we can see, the class distribution of the dataset is preserved in the splits by StratifiedKFold while KFold does not take this into consideration.
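The 16-point example can be checked with a short sketch. The data here is made up for illustration: class A is encoded as 0, class B as 1, the labels are sorted by class, and X is just placeholder features. The helper class_counts is a hypothetical name, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed toy data: 12 points of class A (label 0) followed by 4 of class B (label 1).
y = np.array([0] * 12 + [1] * 4)
X = np.arange(16).reshape(-1, 1)  # placeholder features; only the labels matter here


def class_counts(splitter, X, y):
    """Return one (train_counts, test_counts) pair per fold, each as [n_A, n_B]."""
    counts = []
    for train_idx, test_idx in splitter.split(X, y):
        train_counts = np.bincount(y[train_idx], minlength=2).tolist()
        test_counts = np.bincount(y[test_idx], minlength=2).tolist()
        counts.append((train_counts, test_counts))
    return counts


stratified = class_counts(StratifiedKFold(n_splits=4), X, y)
plain = class_counts(KFold(n_splits=4), X, y)

for name, result in (("StratifiedKFold", stratified), ("KFold", plain)):
    print(name)
    for train_counts, test_counts in result:
        print("  train [A, B]: %s | test [A, B]: %s" % (train_counts, test_counts))
```

With this label ordering and no shuffling, every StratifiedKFold fold should show 9/3 in training and 3/1 in testing, while each KFold test fold contains a single class, and the last KFold training set contains no class B points at all.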