StratifiedKFold vs KFold in scikit-learn

3

27

I use this code to test KFold and StratifiedKFold.

import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold

X = np.array([
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74]
])

y = np.array([0,0,0,0,1,1,1,1])

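# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set while shuffle=False.
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
kfolder = KFold(n_splits=4, shuffle=False)

for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("StratifiedKFold done")

# KFold ignores y when building the folds; it is passed only for symmetry.
for train, test in kfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("KFold done")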

I found that StratifiedKFold can keep the proportion of labels, but KFold can't.

Train: [1 2 3 5 6 7] | test: [0 4]
Train: [0 2 3 4 6 7] | test: [1 5]
Train: [0 1 3 4 5 7] | test: [2 6]
Train: [0 1 2 4 5 6] | test: [3 7]
StratifiedKFold done
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
KFold done

It seems that StratifiedKFold is better, so should KFold not be used?

When to use KFold instead of StratifiedKFold?

Croydon answered 16/12, 2020 at 7:30 Comment(1)
There are great answers elsewhere, too (if you also want to dive into StratifiedShuffleSplit besides StratifiedKFold and KFold). - Conjugation
51

I think the question should rather be "When to use StratifiedKFold instead of KFold?".

First, you need to know what "KFold" and "Stratified" mean.

KFold is a cross-validator that divides the dataset into k folds.

"Stratified" means that each fold keeps (approximately) the same proportion of observations per label as the whole dataset.

So StratifiedKFold is a variation of KFold that additionally preserves the class proportions in every fold.

Therefore, the answer to this question is that we should prefer StratifiedKFold over KFold when dealing with classification tasks with imbalanced class distributions.


FOR EXAMPLE

Suppose that there is a dataset with 16 data points and an imbalanced class distribution: 12 of the data points belong to class A and the remaining 4 belong to class B, so the ratio of class B to class A is 1/3. If we use StratifiedKFold with k = 4, then in each iteration the training set will include 9 data points from class A and 3 from class B, while the test set will include 3 data points from class A and 1 from class B.

As we can see, the class distribution of the dataset is preserved in the splits by StratifiedKFold while KFold does not take this into consideration.
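The 16-point example above can be checked with a few lines. This is only a minimal sketch: the labels "A"/"B", the placeholder feature matrix `X`, and the name `fold_counts` are my own assumptions for illustration, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 12 samples of class "A" followed by 4 of class "B".
y = np.array(["A"] * 12 + ["B"] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, values are irrelevant here

skf = StratifiedKFold(n_splits=4)

fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class land in the train and test parts.
    train_a = int(np.sum(y[train_idx] == "A"))
    train_b = int(np.sum(y[train_idx] == "B"))
    test_a = int(np.sum(y[test_idx] == "A"))
    test_b = int(np.sum(y[test_idx] == "B"))
    fold_counts.append((train_a, train_b, test_a, test_b))
    print(f"train: {train_a} A, {train_b} B | test: {test_a} A, {test_b} B")
```

Every iteration should report 9 A and 3 B for training and 3 A and 1 B for testing, i.e. the 3:1 ratio of the full dataset.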

Subtemperate answered 16/12, 2020 at 7:58 Comment(3)
Did you mean "4 data points from class A..."? - Mensural
The only difference between KFold and StratifiedKFold is the balance of the class distribution in each fold, can you confirm? - Bride
@Mensural Why 4 data points from class A? Please clarify. - Subtemperate
4
Assume a classification problem with 3 classes (A, B, C) to predict.

Class    Number of instances
A        50
B        50
C        50

**StratifiedKFold**

If the dataset is divided into 5 folds, each fold will contain 10 instances from each class, i.e. the number of instances per class is equal and follows the uniform class distribution of the full dataset.

**KFold**

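Each fold takes 30 instances without looking at the labels (consecutive instances by default, random ones with shuffle=True), so the number of instances per class may or may not be equal or uniform.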
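To make the difference concrete, here is a rough sketch that counts the labels in each test fold for both splitters. It assumes the 150 instances are stored sorted by class (a worst case for unshuffled KFold); the helper `count_test_labels` and the dummy feature matrix are hypothetical names introduced only for this illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: the data is ordered by class, 50 x A, then 50 x B, then 50 x C.
y = np.repeat(["A", "B", "C"], 50)
X = np.zeros((len(y), 1))  # dummy features, only the labels matter here


def count_test_labels(splitter):
    """Return one {label: count} dict per test fold."""
    return [
        dict(Counter(y[test_idx].tolist()))
        for _, test_idx in splitter.split(X, y)
    ]


stratified_counts = count_test_labels(StratifiedKFold(n_splits=5))
kfold_counts = count_test_labels(KFold(n_splits=5))

print("StratifiedKFold:", stratified_counts)
print("KFold:          ", kfold_counts)
```

With this ordering every StratifiedKFold test fold holds 10 instances of each class, while the first unshuffled KFold test fold holds 30 instances of class A only. Passing `shuffle=True` to KFold makes such extreme folds unlikely, but still does not guarantee equal counts.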

**When to use**

For classification tasks use StratifiedKFold, and for regression tasks use KFold.

But if the dataset contains a large number of instances, both StratifiedKFold and KFold can be used for classification, since large shuffled folds tend to have similar class proportions anyway.
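As a small illustration of the regression case: as far as I know, StratifiedKFold only accepts binary or multiclass targets and rejects a continuous one with a ValueError, so KFold is the natural choice there. The random data and the variable names below are assumptions made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # hypothetical features
y = rng.normal(size=20)       # hypothetical continuous (regression) target

# KFold never looks at y, so a continuous target is fine.
kfold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=5).split(X, y)]
print("KFold test fold sizes:", kfold_sizes)

# StratifiedKFold needs class labels to stratify on; split() is a generator,
# so the check only runs once we start iterating.
try:
    list(StratifiedKFold(n_splits=5).split(X, y))
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)
print("StratifiedKFold error:", stratified_error)
```

If stratification on a continuous target is really wanted, one option is to bin the target first and stratify on the bins, but that is a separate design decision.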
Heptavalent answered 16/12, 2020 at 7:51 Comment(0)
1

StratifiedKFold: This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

KFold: Split dataset into k consecutive folds.

StratifiedKFold is used when you need to preserve the percentage of each class in the train and test sets. If that is not required, KFold is used.
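scikit-learn applies the same rule of thumb itself when you pass a plain integer as `cv`. A short sketch using `check_cv` from `sklearn.model_selection` shows which splitter it picks; the toy targets `y_class` and `y_reg` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import check_cv

y_class = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # binary class labels
y_reg = np.linspace(0.0, 1.0, 8)              # continuous target

# For an integer cv, a classifier with binary/multiclass y gets StratifiedKFold;
# every other case gets KFold.
cv_class = check_cv(4, y_class, classifier=True)
cv_reg = check_cv(4, y_reg, classifier=False)

print(type(cv_class).__name__)
print(type(cv_reg).__name__)
```

So with helpers such as `cross_val_score(..., cv=5)` you usually get stratified folds for classifiers without asking for them explicitly.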

Liftoff answered 16/12, 2020 at 7:51 Comment(0)
