Difference between parallel map and parallel for-loop

When I read the Julia documentation on multi-core parallel computing, I noticed that there are both a parallel map, pmap, and a parallel for-loop, @distributed for.

From the documentation, "Julia's pmap is designed for the case where each function call does a large amount of work. In contrast, @distributed for can handle situations where each iteration is tiny".

What is the difference between pmap and @distributed for? Why is @distributed for slow for a large amount of work?

Thanks

Spasmodic answered 15/4, 2019 at 22:12

The issue is that pmap does load balancing, while @distributed for splits the jobs into equal chunks up front. You can confirm this by running these two code examples:

julia> @time res = pmap(x -> (sleep(x/10); println(x)), [10;ones(Int, 19)]);
      From worker 2:    1
      From worker 3:    1
      From worker 4:    1
      From worker 2:    1
      From worker 3:    1
      From worker 4:    1
      From worker 3:    1
      From worker 2:    1
      From worker 4:    1
      From worker 4:    1
      From worker 2:    1
      From worker 3:    1
      From worker 2:    1
      From worker 3:    1
      From worker 4:    1
      From worker 4:    1
      From worker 3:    1
      From worker 2:    1
      From worker 4:    1
      From worker 5:    10
  1.106504 seconds (173.34 k allocations: 8.711 MiB, 0.66% gc time)

julia> @time @sync @distributed for x in [10;ones(Int, 19)]
       sleep(x/10); println(x)
       end
      From worker 4:    1
      From worker 3:    1
      From worker 5:    1
      From worker 4:    1
      From worker 5:    1
      From worker 3:    1
      From worker 5:    1
      From worker 3:    1
      From worker 4:    1
      From worker 3:    1
      From worker 4:    1
      From worker 5:    1
      From worker 4:    1
      From worker 5:    1
      From worker 3:    1
      From worker 2:    10
      From worker 2:    1
      From worker 2:    1
      From worker 2:    1
      From worker 2:    1
  1.543574 seconds (184.19 k allocations: 9.013 MiB)
Task (done) @0x0000000005c5c8b0

And you can see that the large job (value 10) makes pmap execute all the small jobs on workers other than the one that got the large job (in my example worker 5 did only job 10, while workers 2 to 4 did all the other jobs). On the other hand, @distributed for assigned the same number of jobs to each worker. Thus the worker that got job 10 (worker 2 in the second example) still had to do four short jobs, since each worker on average has to do 5 jobs (my example has 20 jobs in total and 4 workers).

Now the advantage of @distributed for is that if each job is inexpensive, then splitting the jobs equally among workers avoids the dynamic scheduling, which is not free either.
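
To illustrate (a minimal sketch, assuming worker processes have already been added with addprocs), with a million trivial iterations @distributed sends each worker a single chunk description, whereas pmap would exchange one message pair per element:

using Distributed  # assumes workers were added, e.g. addprocs(4)

# Static chunking: one chunk description per worker, no per-iteration messages.
@time @sync @distributed for i in 1:10^6
    i^2
end

# The dynamically scheduled equivalent pays one message round trip per
# element, so it is expected to be dramatically slower here:
# @time pmap(i -> i^2, 1:10^6);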

In summary, as the documentation states, if the jobs are expensive (and especially if their run times can vary widely), it is better to use pmap, as it does load balancing.

Jodi answered 16/4, 2019 at 8:59

pmap has a batch_size argument which is, by default, 1. This means that each element of the collection is sent one by one to available workers or tasks, to be transformed by the function you provide. If each function call does a large amount of work, and perhaps the calls vary in how long they take, pmap has the advantage of not letting workers go idle while others are still busy, because when a worker completes one transformation it asks for the next element. Therefore, pmap effectively balances the load among workers/tasks.
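
For example (a sketch; the uneven sleep calls below just simulate jobs of different costs):

using Distributed
addprocs(4)

# With the default batch_size = 1 each element is handed out individually,
# so the three cheap jobs are not stuck behind the expensive one.
results = pmap([4, 1, 1, 1]) do x
    sleep(x)   # simulated work of varying cost
    x^2
end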

A @distributed for-loop, however, partitions the given range among the workers once at the beginning, without knowing how long each part of the range will take. Consider, for example, a collection of matrices where the first hundred elements are 2-by-2 matrices and the next hundred are 1000-by-1000 matrices, and suppose we want to invert each matrix using a @distributed for-loop with 2 worker processes.

using Distributed

@sync @distributed for i = 1:200
    # note: with a plain Array each worker mutates its own copy of B;
    # use a DArray or pmap to collect the results on the master
    B[i] = inv(A[i])
end

The first worker will get all the 2-by-2 matrices and the second one will get the 1000-by-1000 matrices. The first worker will complete all of its transformations very quickly and then go idle, while the other will keep working for a very long time. Although you are using 2 workers, the major part of the whole work is effectively executed serially on the second worker, and you get almost no benefit from using more than one worker. This problem is known as load imbalance in the context of parallel computing. The problem may also arise, for example, when one processor is slow and another is fast, even if the work to be completed is homogeneous.
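
With pmap the same workload balances itself, because the matrices are handed out one at a time (a sketch, reusing the hypothetical A from above):

# Each matrix goes to whichever worker is free next, so the cheap 2-by-2
# inversions do not pile up behind the 1000-by-1000 ones.
B = pmap(inv, A)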

For very small transformations, however, using pmap with a small batch size creates communication overhead that can be significant: after each batch the worker needs to request the next batch from the calling process, whereas with @distributed for-loops each worker knows from the beginning which part of the range it is responsible for.
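
A middle ground, when the per-element work is small, is raising batch_size so the scheduling cost is amortized over many elements per message (50 below is an arbitrary illustrative value):

# Batches of 50 keep dynamic load balancing between batches while cutting
# the number of master/worker round trips by a factor of 50.
B = pmap(inv, A; batch_size = 50)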

The choice between pmap and a @distributed for-loop depends on what you want to achieve. If you are going to transform a collection as in map, and each transformation requires a large and possibly varying amount of work, then you are likely better off choosing pmap. If each transformation is very tiny, then you are likely better off choosing a @distributed for-loop.

Note that if you need a reduction operation after the transformation, a @distributed for-loop already provides one: most of the reduction is applied locally on each worker, while the final reduction takes place on the calling process. With pmap, however, you need to handle the reduction yourself.
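
For example (a sketch; counting multiples of 7 is just a stand-in workload):

using Distributed  # assumes workers were added, e.g. addprocs(4)

# @distributed with a (+) reducer: each worker sums its chunk locally and
# only the per-worker partial sums travel back to the calling process.
total = @distributed (+) for i in 1:1_000_000
    i % 7 == 0 ? 1 : 0
end

# With pmap the reduction is a separate, manual step on the master (and
# per-element pmap over such tiny jobs would be slow, hence the batching):
total = sum(pmap(i -> i % 7 == 0 ? 1 : 0, 1:1_000_000; batch_size = 10_000))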

You can also implement your own pmap-like function with more complex load balancing and reduction schemes if you really need one.
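
Here is a minimal sketch of the idea (my_pmap is a hypothetical name; the Julia manual contains a similar example). One async feeder task per worker repeatedly pulls the next unclaimed job, which is the same dynamic scheduling that pmap performs:

using Distributed  # assumes workers were added, e.g. addprocs(4)

function my_pmap(f, xs)
    results = Vector{Any}(undef, length(xs))
    jobs = collect(enumerate(xs))      # shared work queue on the master
    @sync for p in workers()
        @async while !isempty(jobs)
            i, x = popfirst!(jobs)     # safe: async tasks only switch at yield points
            results[i] = remotecall_fetch(f, p, x)
        end
    end
    return results
end

Note that a named function passed as f must be defined on every worker (e.g. with @everywhere) for remotecall_fetch to run it there; anonymous functions are serialized automatically.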

https://docs.julialang.org/en/v1/manual/parallel-computing/

Continual answered 16/4, 2019 at 9:7
