10*10 fold cross validation in scikit-learn?
Is

class sklearn.cross_validation.ShuffleSplit(
    n, 
    n_iterations=10, 
    test_fraction=0.1, 
    indices=True, 
    random_state=None
)

the right way for 10*10 fold CV in scikit-learn? (By changing the random_state to 10 different numbers.)

Because I didn't find any random_state parameter in Stratified K-Fold or K-Fold, and the splits produced by K-Fold are always identical for the same data.

If ShuffleSplit is the right one, my concern is that the documentation mentions:

Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets

Is this always the case for 10*10 fold CV?
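
For concreteness, here is a sketch of the loop I have in mind (clf, X and y are placeholders for my estimator and data; old sklearn.cross_validation API):

>>> from sklearn.cross_validation import ShuffleSplit
>>> for i in range(10):
...     # one 10-iteration ShuffleSplit per seed => 10 * 10 = 100 splits overall
...     ss = ShuffleSplit(len(y), n_iterations=10, test_fraction=0.1,
...                       random_state=i)
...     for train_idx, test_idx in ss:
...         clf.fit(X[train_idx], y[train_idx])
...         clf.score(X[test_idx], y[test_idx])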

Director answered 26/11, 2011 at 19:36

I am not sure what you mean by 10*10 cross validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. You can either call it 10 times yourself in an explicit outer loop, or directly call it 100 times, with 10% of the data reserved for testing, in a single loop if you use instead:

>>> from sklearn.cross_validation import ShuffleSplit
>>> # 100 random 90% train / 10% test splits in a single generator
>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
...                   random_state=42)
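
This generator can then be fed straight to cross_val_score (clf, X and y assumed to be your estimator and data), yielding one test score per split:

>>> from sklearn.cross_validation import cross_val_score
>>> scores = cross_val_score(clf, X, y, cv=ss)  # 100 fit/score rounds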

If you want to do 10 runs of StratifiedKFold with k=10 you can shuffle the dataset between the runs (that would lead to a total of 100 calls to the fit method with a 90% train / 10% test split for each call to fit):

>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
...     # reshuffle the data with a different seed for each run
...     X, y = shuffle(X_orig, y_orig, random_state=i)
...     skf = StratifiedKFold(y, 10)  # 10 stratified folds on the shuffled labels
...     print(cross_val_score(clf, X, y, cv=skf))
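
As a side note, in recent scikit-learn versions the cross_validation module has been replaced by model_selection, and RepeatedStratifiedKFold expresses this same 10x10 scheme in a single object. A minimal sketch, assuming the same clf, X and y:

>>> from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
>>> rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=rskf)  # 100 fit calls in total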
Blindstory answered 26/11, 2011 at 20:05
Thanks, it's exactly what I was looking for. BTW, I saw 42 many times in examples on the web page, any story for that? – Director
You are asking the wrong question :) en.wikipedia.org/wiki/… – Blindstory
More seriously, in the examples and tests we want to have reproducible outcomes, hence we fix the PRNG seed to an arbitrary value. Feel free to tweak the value; the outcome should still "look good" but may be slightly different (some algorithms have non-convex objective functions with several good local optima). – Blindstory
@Blindstory Hi. If I use a StratifiedShuffleSplit, do I still need the outer loop? I want to do a 10x10 SSS inside a Pipeline. – Hefner
