GridSearchCV: "TypeError: 'StratifiedKFold' object is not iterable"

4

11

I want to run GridSearchCV on a RandomForestClassifier, but the data is not balanced, so I use StratifiedKFold:

from sklearn.model_selection import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators':[10, 30, 100, 300], "max_depth": [3, None],
          "max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50], "criterion": ["gini", "entropy"]}

rfc = RandomForestClassifier()

clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)

But I get an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-597-b08e92c33165> in <module>()
     9 rfc = RandomForestClassifier()
     10 
---> 11 clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)

c:\python34\lib\site-packages\sklearn\grid_search.py in fit(self, X, y)
    811 
    812         """
--> 813         return self._fit(X, y, ParameterGrid(self.param_grid))

c:\python34\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable)
    559                                     self.fit_params, return_parameters=True,
    560                                     error_score=self.error_score)
--> 561                 for parameters in parameter_iterable
    562                 for train, test in cv)

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    601 
    602         with self._lock:
--> 603             tasks = BatchedCalls(itertools.islice(iterator, batch_size))
    604             if len(tasks) == 0:
    605                 # No more tasks available in the iterator: tell caller to stop.

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice)
    125 
    126     def __init__(self, iterator_slice):
--> 127         self.items = list(iterator_slice)
    128         self._size = len(self.items)

c:\python34\lib\site-packages\sklearn\grid_search.py in <genexpr>(.0)
    560                                     error_score=self.error_score)
    561                 for parameters in parameter_iterable
--> 562                 for train, test in cv)
    563 
    564         # Out is a list of triplet: score, estimator, n_test_samples

TypeError: 'StratifiedKFold' object is not iterable

When I write cv=StratifiedKFold(y_train) I get ValueError: The number of folds must be of Integral type. But when I write cv=5, it works.

I don't understand what is wrong with StratifiedKFold.

Microelectronics answered 26/10, 2016 at 8:38 Comment(0)
10

I had exactly the same problem. The solution that worked for me is to replace:

from sklearn.grid_search import GridSearchCV

with

from sklearn.model_selection import GridSearchCV

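Then it should work fine. The GridSearchCV in sklearn.grid_search is the legacy implementation and tries to iterate over the cv object directly, whereas the StratifiedKFold from sklearn.model_selection is a new-style splitter that only produces folds through its split method.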
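
As a minimal sketch of that fix, here is the question's code with both classes imported from sklearn.model_selection. It assumes synthetic data in place of the question's X_train and y_train (which are not shown) and a reduced parameter grid to keep the run short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for X_train / y_train (an assumption; the
# question's data is not shown).
X_train, y_train = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Deliberately small grid so the sketch runs quickly.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

rfc = RandomForestClassifier(random_state=0)

# Both classes come from sklearn.model_selection, so the splitter object
# can be passed directly as cv.
cv = StratifiedKFold(n_splits=5)
clf = GridSearchCV(rfc, param_grid=param_grid, cv=cv).fit(X_train, y_train)

print(clf.best_params_)
```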

Boltrope answered 1/6, 2017 at 15:0 Comment(0)
6

The problem here is an API change, as mentioned in other answers, but those answers could be more explicit.

The cv parameter documentation states:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,

  • integer, to specify the number of folds.

  • An object to be used as a cross-validation generator.

  • An iterable yielding train/test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
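
One way to check how an integer cv is resolved is the check_cv helper in sklearn.model_selection. This is only a sketch; the two label arrays are made-up examples, not data from the question:

```python
import numpy as np
from sklearn.model_selection import check_cv

# Made-up targets for illustration: class labels and a continuous target.
y_multiclass = np.array([0, 1, 2] * 10)
y_continuous = np.linspace(0.0, 1.0, 30)

# Classifier with class labels: an integer cv resolves to StratifiedKFold.
cv_clf = check_cv(5, y_multiclass, classifier=True)

# Non-classifier with a continuous target: the same integer gives KFold.
cv_reg = check_cv(5, y_continuous, classifier=False)

print(type(cv_clf).__name__, type(cv_reg).__name__)
```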

So, whatever cross-validation strategy is used, all that is needed is to pass the train/test splits produced by the splitter's split method, as suggested:

kfolds = StratifiedKFold(5)
clf = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds.split(xtrain,ytrain))
clf.fit(xtrain, ytrain)
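
One caveat: split returns a one-shot generator, so it can help to store the folds in a list if they need to be iterated more than once. The sketch below makes several assumptions: placeholder data, a LogisticRegression stand-in for estimator, and the default scorer instead of qwk, whose definition is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data and estimator (assumptions; the real xtrain, ytrain,
# estimator and qwk scorer are not shown).
xtrain, ytrain = make_classification(n_samples=120, n_features=8, random_state=0)
estimator = LogisticRegression(max_iter=1000)
parameters = {"C": [0.1, 1.0, 10.0]}

kfolds = StratifiedKFold(n_splits=5)

# split() yields (train_indices, test_indices) pairs lazily; list() stores
# them so the same folds can be iterated more than once.
splits = list(kfolds.split(xtrain, ytrain))

clf = GridSearchCV(estimator, parameters, cv=splits)
clf.fit(xtrain, ytrain)

print(len(splits), clf.best_params_)
```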
Happenstance answered 1/6, 2017 at 14:34 Comment(0)
2

It seems that cv=StratifiedKFold()).fit(X_train, y_train) should be changed to cv=StratifiedKFold()).split(X_train, y_train).

Joyless answered 14/1, 2017 at 19:19 Comment(2)
This has nothing to do with the error. This line: clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) just defines the object clf and then calls the fit method to train it. - Boltrope
@Happenstance also mentioned that fit should be replaced with split. - Joyless
0

The API changed in the latest version (0.18). You used to pass y when creating the StratifiedKFold object; now you pass only the number of folds, and y is supplied later, when the folds are actually generated.
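
A small sketch of the new-style usage, with toy arrays made up for illustration: the constructor receives only the fold count, and the labels go to split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: eight samples of class 0, four of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# New API: the constructor takes only the number of folds...
skf = StratifiedKFold(n_splits=4)

# ...and the labels are passed later, to split(), which yields
# (train_indices, test_indices) pairs.
for train_idx, test_idx in skf.split(X, y):
    # Print the per-class counts of each test fold.
    print(test_idx, np.bincount(y[test_idx]))
```

The splitter object itself has no iteration protocol, which is why code that loops over it directly fails with "object is not iterable".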

Fellers answered 26/10, 2016 at 8:46 Comment(4)
I write cv=StratifiedKFold(10) and get TypeError: 'StratifiedKFold' object is not iterable. When should I pass y? - Microelectronics
In the current version you import sklearn.model_selection.StratifiedKFold, and then you can do cv=StratifiedKFold(10) and there should be no error. However, maybe you are importing from the previous module, which still exists for compatibility purposes until version 0.20. - Fellers
Could I ask one more question? I downloaded the file scikit_learn-0.18-cp34-cp34m-win32.whl from lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn and installed it, but now I get ImportError: DLL load failed: %1 is not a valid Win32 application. What is wrong? - Microelectronics
Probably a missing dependency somewhere. The easy way is to download Anaconda. Then it just works. - Fellers
