Many scikit-learn functions come with user-friendly parallelization built in. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs via the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as I know (from this, for example: Python multiprocessing within MPI), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems, like dask.distributed or IPython Parallel. See this issue on the sklearn GitHub page for details.
Example using Joblib with Dask.distributed
Code taken from the issue page linked above.
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

# model, param_space and digits are assumed to be defined as usual
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)
with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)
This requires that you set up a dask.distributed
scheduler and workers on your cluster. General instructions are available here: http://dask.readthedocs.io/en/latest/setup.html
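As a rough sketch of that manual setup (the hostname and port here are placeholders): start a scheduler process on one node, then point a worker process at it from every other node:

```shell
# On the scheduler node (it prints the address workers should connect to):
dask-scheduler

# On each worker node, connect back to the scheduler:
dask-worker tcp://your_scheduler_host:8786
```

The address the scheduler prints is what you would pass as scheduler_host in the joblib example above.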
Example using Joblib with ipyparallel
Code taken from the same issue page.
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from sklearn.datasets import load_digits
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

digits = load_digits()
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))
...
with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
Note: in both of the above examples, the n_jobs parameter no longer seems to matter.
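To see why, note that it is the backend context, not n_jobs, that routes the work. A minimal local illustration using plain joblib (the threading backend stands in here for a distributed one):

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    return x * x

# Inside the context, Parallel dispatches tasks to the named backend.
# With a distributed backend, the cluster size, not n_jobs, typically
# determines how much parallelism you actually get.
with parallel_backend('threading', n_jobs=4):
    results = Parallel()(delayed(square)(i) for i in range(8))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```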
Set up dask.distributed with SLURM
For SLURM the easiest way to do this is probably to use the dask-jobqueue project:
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)
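Filling in that elided call a little (a sketch only: the account and partition names are placeholders, and this assumes dask-jobqueue is installed on a machine that can submit SLURM jobs). Once the cluster object exists, registering a Client makes the 'dask' joblib backend use it:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholders: substitute your own SLURM account and partition.
cluster = SLURMCluster(project='myaccount', queue='normal',
                       cores=4, memory='8GB')
cluster.scale(20)          # ask SLURM for 20 worker jobs

client = Client(cluster)   # the active client is what the 'dask'
                           # joblib backend dispatches work through
```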
You could also use dask-mpi or any of several other methods mentioned in Dask's setup documentation.
Use dask.distributed directly
Alternatively you can set up a dask.distributed or IPyParallel cluster and then use these interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel, doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
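For a taste of the direct interface, here is a minimal sketch using a LocalCluster as a stand-in for a real scheduler (on an actual cluster you would instead do Client('your_scheduler_host:your_port')):

```python
from dask.distributed import Client, LocalCluster

def square(x):
    return x * x

# Threads in this process stand in for remote workers; the Client API
# is identical when pointed at a real distributed scheduler.
cluster = LocalCluster(processes=False, n_workers=2)
client = Client(cluster)

futures = client.map(square, range(8))   # submit work to the workers
results = client.gather(futures)         # collect results in order
print(results)                           # [0, 1, 4, 9, 16, 25, 36, 49]

client.close()
cluster.close()
```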
Try Dask-ML
You could also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but computationally implemented on top of Dask:
https://github.com/dask/dask-ml
pip install dask-ml
I used dask-ssh to set up my scheduler and workers. That works fine: if I print the scheduler object I get the right number of cores (240). Next, I wrapped the call to the randomized search's fit in the with statement. If I look in the console window where I executed dask-ssh, I see a connection from the node I run the Python script on. However, there is no distributed work going on. It doesn't scale, and it doesn't even see the GPUs that the workers have. – Firelock
I also tried varying the n_jobs parameter, setting it to -1, 1, 100, 240. Each value above 20 leads to about the same performance, which makes me think that nothing is actually running on the distributed workers, but on the node I run the Python script on (gensim also prints a message that there is no GPU; there is a GPU on the worker nodes, but there isn't one on the node I run the script from). – Firelock
With ipyparallel, same thing I described with dask: the workers (engines in ipyparallel) are successfully created, the client sees them, but my grid searches do not run on them. – Firelock
I added the sklearn examples, as I figured them out with the help of sklearn developers. Please let me know if you're happy with it, in which case I'll award the bounty. – Firelock
… register_parallel_backend('distributed', DistributedBackend). This should already be handled in distributed.joblib. Perhaps sklearn is packaging its own version of the joblib library now? – Bly