Alas, this isn't well-documented/defined. Here's a test case I'm running under Python 3.6.1:
    import multiprocessing as mp

    def e(i):
        if i % 1000000 == 0:
            print(i)

    if __name__ == '__main__':
        p = mp.Pool()

        def g():
            for i in range(100000000):
                yield i
            print("generator done")

        r = p.map(e, g())
        p.close()
        p.join()
The first thing you see is the "generator done" message, and peak memory use is unreasonably high (precisely because, as you suspect, the generator is run to exhaustion before any work is passed out).
However, replace the map() call like so:
    r = list(p.imap(e, g()))
Now memory use remains small, and "generator done" appears at the output end.
However, you won't wait long enough to see that, because it's horridly slow :-( imap() not only treats the iterable as an iterable, it also effectively passes only one item at a time across process boundaries. To get the speed back too, this works:
    r = list(p.imap(e, g(), chunksize=10000))
In real life, I'm much more likely to iterate over an imap() (or imap_unordered()) result than to force it into a list, in which case memory use remains small while looping over the results too.
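For example, here's a minimal sketch of that pattern; the worker square() and the generator gen() are hypothetical names chosen for illustration, not from the original:

    import multiprocessing as mp

    def square(x):
        # Hypothetical worker; any picklable module-level function works.
        return x * x

    def gen(n):
        # Arguments are produced lazily; imap_unordered never materializes
        # the whole sequence as a list the way map() does.
        for i in range(n):
            yield i

    if __name__ == '__main__':
        with mp.Pool() as p:
            # chunksize batches items across the process boundary for speed;
            # with imap_unordered, results may arrive in any order, which is
            # fine here because we only accumulate a sum.
            total = sum(p.imap_unordered(square, gen(1000), chunksize=100))
        print(total)

Because the consuming loop (here, sum()) pulls results as they arrive, neither the arguments nor the results are ever held in memory all at once.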
Comments:
- multiprocessing.Pool.map makes no guarantees about how soon func will be called on each argument extracted from the iterable. – Barnyard
- x for x in range(10) => range(10) :) – Overlap
- xrange to get a generator. – Decembrist