I'm trying to parallelize scikit-learn's GridSearchCV. It's running in a Jupyter (Hub) notebook environment. After some research I found this code:
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from sklearn.model_selection import GridSearchCV
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

# Connect to the ipyparallel cluster and expose its load-balanced
# view to joblib as a custom backend named 'ipyparallel'
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))

grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
with parallel_backend('ipyparallel'):
    grid.fit(X_train, Y_train)
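(This assumes an ipyparallel cluster is already running under that profile — e.g. started beforehand with ipcluster start -n 4 --profile=myprofile — otherwise print(c.ids) lists no engines.)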
Note that I've set the n_jobs parameter to 4, which is the number of the machine's CPU cores (it's what nproc returns).
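As an aside, that count can also be checked from Python itself; a minimal sketch using only the standard library:

import os

# Total logical cores visible to the OS; on Linux,
# len(os.sched_getaffinity(0)) tracks nproc more closely
# because it respects CPU affinity masks.
print(os.cpu_count())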
But it doesn't seem to work: ImportError: cannot import name 'register_parallel_backend', although I installed joblib with conda install joblib and also tried pip install -U joblib.
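For what it's worth, register_parallel_backend is also public API in the standalone joblib package, so one thing to try is importing it from there instead of from the copy vendored under sklearn.externals. Note the caveat in the comment: this only steers GridSearchCV if scikit-learn itself dispatches through the standalone joblib (newer releases do; older ones use their bundled copy).

# Recent standalone joblib exposes these directly; whether
# GridSearchCV honors them depends on whether this scikit-learn
# build uses standalone joblib or its own vendored copy.
from joblib import Parallel, parallel_backend, register_parallel_backend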
So, what's the best way to parallelize GridSearchCV in this environment?
UPDATE:
Without ipyparallel, and just setting the n_jobs parameter:
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
grid.fit(X_train, Y_train)
Result is the following warning message:
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning:
Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
It seems like it ends up in sequential execution rather than parallel execution.
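To spell out what the warning means: joblib refuses to start a multiprocessing pool from inside a region it already treats as parallel, so it quietly falls back to n_jobs=1. A sketch of two common workarounds — not from the original post, and with assumptions flagged: the clf__n_jobs step name is hypothetical (not every estimator takes n_jobs), and the threading backend only pays off when fit releases the GIL (e.g. NumPy-heavy code) and the installed scikit-learn honors the standalone joblib:

# Workaround 1: keep parallelism at the GridSearchCV level only,
# forcing the nested estimator to run serially
# ('clf' is a hypothetical pipeline step name)
pipeline.set_params(clf__n_jobs=1)
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
grid.fit(X_train, Y_train)

# Workaround 2: thread-based workers can be nested where
# process-based ones cannot
from joblib import parallel_backend
with parallel_backend('threading'):
    grid.fit(X_train, Y_train)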
n_jobs=-1 would launch all the CPU cores in parallel – Paradis
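In code, that is just the question's own call with the flag swapped in:

grid = GridSearchCV(pipeline, cv=3, n_jobs=-1, param_grid=param_grid)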