difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
Asked Answered
J

3

78

As from the title I am wondering what is the difference between

StratifiedKFold with the parameter shuffle=True

StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

and

StratifiedShuffleSplit

StratifiedShuffleSplit(n_splits=10, test_size=’default’, train_size=None, random_state=0)

and what is the advantage of using StratifiedShuffleSplit

Jockey answered 30/8, 2017 at 20:43 Comment(6)
mmm in StratifiedShuffleSplit you can set the number of splits... from the sklearn webpage: StratifiedShuffleSplit : This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.Jockey
Aah yes, my bad. But still its written in the StratifiedShuffleSplit documentation you linked that "This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class."Casares
Difference is between folds (data don't overlap in folds). Whereas in StratifiedShuffleSplit it can and will overlap. See the examples given on the documentation pages to understand it better. Specifically test data. In StratifiedKFold it will always be different in each fold. And in StratifiedShuffleSplit it can be repeatative.Casares
So if I have to choose between the two I should go for StratifiedKFold isn't it? I do not see the advantage of using the StratifiedShiffleSplit...but there should be because is a more recent function of sklearn... that's way I am wonderingJockey
Sounds like StratifiedKFold samples without replacement while StratifiedShiffleSplit shuffles with. So, one advantage of StratifiedShiffleSplit is you can sample as many times as you want. Sure, individual samples will have overlap -- so any fitted models on the samples will be correlated -- but you can fit many more models, and with more data per model.Cacka
Let's think about it this way. StratifiedKFold is the true cross-validation. However, StratifiedShiffleSplit is a "generator", and it randomly generate different "train-test" splits for n_splits times.Intensify
V
106

In stratKFolds, each test set should not overlap, even when shuffle is included. With stratKFolds and shuffle=True, the data is shuffled once at the start, and then divided into the number of desired splits. The test data is always one of the splits, the train data is the rest.

In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.

See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.

splits = 5

tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn import datasets

stratKfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("stratKFold")
for train_index, test_index in stratKfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

stratKFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]

As for when to use them, I tend to use stratKFolds for any cross validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.

Villalba answered 12/9, 2017 at 16:23 Comment(1)
+1 For your great answer. Can you please, answer my questionLarock
A
80

@Ken Syme already has a very good answer. I just want to add something.

  • StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data, after that splits the data into n_splits parts and Done. Now, it will use each part as a test set. Note that it only and always shuffles data one time before splitting.

With shuffle = True, the data is shuffled by your random_state. Otherwise, the data is shuffled by np.random (as default). For example, with n_splits = 4, and your data has 3 classes (label) for y (dependent variable). 4 test sets cover all the data without any overlap.

enter image description here

  • On the other hand, StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles your data, and then it also splits the data into n_splits parts. However, it's not done yet. After this step, StratifiedShuffleSplit picks one part to use as a test set. Then it repeats the same process n_splits - 1 other times, to get n_splits - 1 other test sets. Look at the picture below, with the same data, but this time, the 4 test sets do not cover all the data, i.e there are overlaps among test sets.

enter image description here

So, the difference here is that StratifiedKFold just shuffles and splits once, therefore the test sets do not overlap, while StratifiedShuffleSplit shuffles each time before splitting, and it splits n_splits times, the test sets can overlap.

  • Note: the two methods uses "stratified fold" (that why "stratified" appears in both names). It means each part preserves the same percentage of samples of each class (label) as the original data. You can read more at cross_validation documents
Appoint answered 6/10, 2018 at 9:32 Comment(1)
Perfect explanation !!Mclellan
C
15

Output examples of KFold, StratifiedKFold, StratifiedShuffleSplit: Output examples of KFold, StratifiedKFold, StratifiedShuffleSplit

The above pictorial output is an extension of @Ken Syme's code:

from sklearn.model_selection import KFold, StratifiedKFold, StratifiedShuffleSplit
SEED = 43
SPLIT = 3

X_train = [0,1,2,3,4,5,6,7,8]
y_train = [0,0,0,0,0,0,1,1,1]   # note 6,7,8 are labelled class '1'

print("KFold, shuffle=False (default)")
kf = KFold(n_splits=SPLIT, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("KFold, shuffle=True")
kf = KFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedKFold, shuffle=False (default)")
skf = StratifiedKFold(n_splits=SPLIT, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)
    
print("StratifiedKFold, shuffle=True")
skf = StratifiedKFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)
    
print("\nStratifiedShuffleSplit")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=3)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit (can customise test_size)")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=2)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)
Chirlin answered 19/11, 2019 at 12:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.