Parallelizing a Numpy vector operation

Let's use, for example, numpy.sin()

The following code will return the value of the sine for each value of the array a:

import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )

But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)

Is this the best (read: smartest or fastest) method:

from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool()
    result = pool.map( numpy.sin, a )

or is there a better way to do this?

Miceli answered 11/7, 2012 at 22:03 Comment(3)
If you're going to use pool.map(), you should use math.sin because it is faster than numpy.sin. Reference: #3650694.Yapon
For numpy.sin, the official numpy/scipy wiki says it should work in parallel if you compile numpy with openmp turned on.Limnology
You could also use Bohrium: It should be as simple as replacing your first line with import bohrium as numpy...Housewifery

There is a better way: numexpr

Slightly reworded from their main page:

It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that achieves near-optimal parallel performance for both memory-bound and CPU-bound operations.

For example, on my 4-core machine, evaluating a sine is just slightly less than 4 times faster than numpy.

In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop    
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop

Documentation, including the list of supported functions, is here. You'll have to check, or give us more information, to see whether your more complicated function can be evaluated by numexpr.
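For instance, a sketch of what a more involved expression might look like (the expression and arrays here are made up purely for illustration):

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# numexpr parses the whole string, compiles it on the fly, and evaluates
# it across all available threads without allocating large temporaries.
result = ne.evaluate('sin(a)**2 + cos(b)**2 + 2*a*b')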

Bogosian answered 12/7, 2012 at 20:28 Comment(4)
I wrote my code making use of numexpr, and it performs about 6 times faster than the same code using numpy. Thanks a lot for the suggestion! Now I'm wondering why numexpr isn't more widespread; in all my searching for numerical packages in Python, I hadn't come across it until now. There was also the minor annoyance of numexpr not supporting array indexing, but that was hardly a setback.Miceli
Maybe you should also check Theano and Cython then. Theano can use GPUs, but I haven't really used it, so I can't provide you with an example.Bogosian
One reason why numexpr is not more widespread is, I guess, the fact that it is more cumbersome to use than pure NumPy (as in the example above). It is indeed great for easily speeding up NumPy calculations that need to run faster, though.Yapon
I got a 30x speed up with this! Awesome :)Legroom

Well, here's an interesting note: if you run the following commands:

import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)    
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)

UnpicklingError: NEWOBJ class argument has NULL tp_new

I wasn't expecting that. So what's going on? Well:

>>> help(numpy.sin)
Help on ufunc object:

sin = class ufunc(__builtin__.object)
 |  Functions that operate element by element on whole arrays.
 |  
 |  To see the documentation for a specific ufunc, use np.info().  For
 |  example, np.info(np.sin).  Because ufuncs are written in C
 |  (for speed) and linked into Python with NumPy's ufunc facility,
 |  Python's help() function finds this page whenever help() is called
 |  on a ufunc.

Yep, numpy.sin is implemented in C; as such, it doesn't survive the pickling round trip that multiprocessing relies on (hence the UnpicklingError above), so you can't really use it directly with multiprocessing.
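As a quick, hedged check (whether this fails is version-dependent: it failed on the 2012-era versions used here, while newer NumPy/Python releases register ufuncs with pickle), you can test picklability directly:

import pickle
import numpy

# Round-trip numpy.sin through pickle, the same serialization that
# multiprocessing uses to ship work to its worker processes.
try:
    pickle.loads(pickle.dumps(numpy.sin))
    print('numpy.sin survives pickling on this version')
except Exception as exc:
    print('pickling numpy.sin failed: %r' % exc)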

So we have to wrap it in another function:

perf.py:

import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    # Picklable Python-level wrapper around the C ufunc.
    return numpy.sin(value)

a = numpy.arange(1000000)
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Single threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multiprocessing %f' % (end - start)


$ python perf.py 
Single threaded 0.032201
Multiprocessing 10.550432

Wow, I wasn't expecting that either. Well, there are a couple of issues. For starters, we are calling a Python function, even if it's just a wrapper, instead of a pure C function. There's also the overhead of copying values: multiprocessing doesn't share data by default, so each value has to be copied back and forth. Worse, pool.map iterates over the array element by element, so each of the million scalars makes that round trip individually.

Do note that if we properly segment our data:

import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    # Picklable Python-level wrapper around the C ufunc.
    return numpy.sin(value)

a = [numpy.arange(100000) for _ in xrange(10)]  # ten 100k-element chunks, not a million scalars
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Single threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multiprocessing %f' % (end - start)

$ python perf.py 
Single threaded 0.150192
Multiprocessing 0.055083

So what can we take from this? multiprocessing is great, but we should always test and compare; sometimes it's faster and sometimes it's slower, depending on how it's used ...

Granted, you are not using numpy.sin but another function; I would still recommend that you first verify that multiprocessing actually speeds up the computation, since the overhead of copying values back and forth may hurt you.

Either way, I also believe that using pool.map is the best, safest method of parallelizing code ...
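As a hedged sketch of the segmenting idea above, numpy.array_split can do the chunking for you (the chunk count of 5 simply matches the pool size here; tune it for your workload):

import numpy
from multiprocessing import Pool

def numpy_sin(chunk):
    # Picklable wrapper; each call handles a whole chunk at once.
    return numpy.sin(chunk)

if __name__ == '__main__':
    a = numpy.arange(1000000)
    pool = Pool(processes = 5)
    # One chunk per worker: each chunk crosses the process boundary
    # once, instead of a million scalar copies.
    chunks = numpy.array_split(a, 5)
    result = numpy.concatenate(pool.map(numpy_sin, chunks))
    pool.close()
    pool.join()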

I hope this helps.

Massive answered 11/7, 2012 at 23:17 Comment(3)
Thanks a lot! This is very informative. I had assumed, based on what I read, that Pool's map() function would work somewhat intelligently on the data, but I guess segmenting it first makes a huge difference. Is there any other way to avoid the overhead of the processes copying over the data? Do you expect any performance difference if I use math.sin() instead?Miceli
I actually tried math.sin, and it's a lot slower, even multiprocessed, than single-threaded numpy.sin, though it was faster than the multiprocessed numpy.sin (6.435199s vs 10.5s), probably because numpy.sin can handle whole arrays; the numpy guys are really good at math ;) Yes, there is a way, using shared memory: docs.python.org/library/multiprocessing.html. But please don't use it; it's quite dangerous and has limited support. Or at the very least, tread carefully.Massive
If you're only doing reads, then it may be safe; the subprocesses only need to keep track of their corresponding index or subset of indices ...Massive
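For completeness, a minimal read-only sketch of the shared-memory approach mentioned in these comments (the helper names and the 4-way split are illustrative, and the caveats above still apply):

import numpy
from multiprocessing import Pool, Array

def init_worker(shared_arr):
    # Runs once in each worker; stash the shared buffer in a global.
    global shared
    shared = shared_arr

def sin_of_slice(bounds):
    lo, hi = bounds
    # View the shared buffer as a numpy array without copying; reads only.
    a = numpy.frombuffer(shared, dtype=numpy.float64)
    return numpy.sin(a[lo:hi])

if __name__ == '__main__':
    n = 1000000
    # lock=False is only safe because the workers never write.
    shared_arr = Array('d', n, lock=False)
    numpy.frombuffer(shared_arr, dtype=numpy.float64)[:] = numpy.arange(n)

    bounds = [(i * n // 4, (i + 1) * n // 4) for i in range(4)]
    pool = Pool(4, initializer=init_worker, initargs=(shared_arr,))
    result = numpy.concatenate(pool.map(sin_of_slice, bounds))
    pool.close()
    pool.join()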

SciPy actually has a pretty good writeup on this subject here.

Brahmani answered 11/7, 2012 at 22:13 Comment(1)
The link is dead. Is this what you were referring to? scipy-cookbook.readthedocs.io/items/ParallelProgramming.htmlShowily
