Stratified Train/Validation/Test-split in scikit-learn

There is already a description here of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split?

The closest approximation that comes to mind for doing a stratified (on the class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or more accurately:

Let's say we want to do a 60/20/20 train/validation/test split. My current approach is to first do a 60/40 stratified split, then do a 50/50 stratified split on that remaining 40% so as to ultimately get a 60/20/20 stratified split.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
SEED = 2000
# 60/40 split, stratified on the class label y
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(
    x, y, test_size=0.4, stratify=y, random_state=SEED)
# split the held-out 40% in half to get 20% validation and 20% test
x_validation, x_test, y_validation, y_test = train_test_split(
    x_validation_and_test, y_validation_and_test, test_size=0.5, stratify=y_validation_and_test, random_state=SEED)

Please let me know whether my approach is correct and/or whether you have a better approach.

Thank you

Enneagon answered 27/11, 2016 at 12:49 Comment(4)
Same problem here, have you confirmed if this is the correct way to do it?Honeyed
@Honeyed I haven't gotten confirmation from anyone, but it seems to work. However, I ultimately went with a different approach: I just do a stratified train/test split, then for validation I rely on stratified k-fold cross-validation within the training set (see the sketch after these comments). Check out: scikit-learn.org/stable/modules/generated/…Enneagon
Ok, thank you very much!!!Honeyed
That's exactly what I do as well! It's too bad there isn't a built-in way to do this with sklearn.Greenback
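For reference, here is a minimal sketch of the approach described in that comment, assuming x and y are NumPy arrays holding the features and class labels (the variable names and the 5-fold count are placeholders, not part of the question):

from sklearn.model_selection import train_test_split, StratifiedKFold

SEED = 2000
# Hold out a stratified 20% test set; validation comes from cross-validation folds
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=SEED)

# Each fold serves as a validation set while preserving the class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
for fold_train_idx, fold_valid_idx in skf.split(x_train, y_train):
    x_fold_train, x_fold_valid = x_train[fold_train_idx], x_train[fold_valid_idx]
    y_fold_train, y_fold_valid = y_train[fold_train_idx], y_train[fold_valid_idx]
    # fit and evaluate a model on this fold here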

The solution is to just use StratifiedShuffleSplit twice, like below:

from sklearn.model_selection import StratifiedShuffleSplit

# First split: 60% train, 40% held out, stratified on df.target
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

# Second split: divide the held-out 40% evenly into 20% test and 20% validation
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
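To sanity-check the result, the class proportions of each split can be compared against the full frame (this assumes df.target holds the class label, as in the code above):

# Class proportions should be roughly identical across the full data and all three splits
for name, subset in [("full", df), ("train", train_set), ("valid", valid_set), ("test", test_set)]:
    print(name, subset.target.value_counts(normalize=True).to_dict())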
Aeschines answered 20/12, 2018 at 23:59 Comment(1)
But this doesn't answer the question, which is: is the approach explained in the question a good way to do it?Charla

Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first call as splitting off your training set; that training set may then get divided into different folds or holdouts down the line.

In fact, if you end up testing your model using a scikit-learn estimator that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. The same goes if you use the (very handy!) model_selection.cross_val_score function.
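As an illustration (not from the original answer), here is a minimal sketch of that pattern, with LogisticRegression standing in for whatever model you actually use:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# One stratified split to hold out a test set; validation is handled by cross-validation
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
# With an integer cv and a classifier, cross_val_score uses StratifiedKFold by default,
# so each fold preserves the class proportions of y_train
scores = cross_val_score(model, x_train, y_train, cv=5)
print(scores.mean())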

Christy answered 4/9, 2018 at 20:53 Comment(1)
There is a problem with splitting things twice. E.g. imagine the worst case: there are 11 samples in the dataset, all 11 of the same class. You want to split such that the train set has 10 samples and the test set has 1 sample. The first split will work, as it produces a 10:1 split. But the second split can't happen because there is only one sample left. It's best to stick to k-fold cross-validation.Apus
