Suppose I have a large in memory numpy array, I have a function func
that takes in this giant array as input (together with some other parameters). func
with different parameters can be run in parallel. For example:
def func(arr, param):
# do stuff to arr, param
# build array arr
pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]
If I use multiprocessing library, then that giant array will be copied for multiple times into different processes.
Is there a way to let different processes share the same array? This array object is read-only and will never be modified.
What's more complicated, if arr is not an array, but an arbitrary python object, is there a way to share it?
[EDITED]
I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not invoke any additional cost when spawning new processes in python multiprocessing library. But the following code suggests there is a huge overhead:
from multiprocessing import Pool, Manager
import numpy as np;
import time
def f(arr):
return len(arr)
t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t;
pool = Pool(processes = 6)
t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t;
output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):
construct array = 0.0178790092468
multiprocessing overhead = 0.252444982529
Why is there such huge overhead, if we didn't copy the array? And what part does the shared memory save me?