pmap has a batch_size argument which is, by default, 1. This means that each element of the collection is sent one by one to available workers or tasks to be transformed by the function you provided. If each function call does a large amount of work, and perhaps the calls differ in how long they take, pmap has the advantage of not letting workers go idle while other workers are still busy: as soon as a worker completes one transformation, it asks for the next element to transform. pmap therefore effectively balances the load among the workers/tasks.
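For example, a minimal sketch of such a pmap call, where the two-worker setup and the function slow_step are assumptions made purely for illustration:

using Distributed
addprocs(2)                                     # assumed: two worker processes

@everywhere slow_step(x) = (sleep(0.1x); x^2)   # hypothetical function whose cost varies with x

ys = pmap(slow_step, 1:10)                      # batch_size defaults to 1: each element goes to
                                                # the next worker that asks for more work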
An @distributed for-loop, however, partitions the given range among the workers once at the beginning, without knowing how much time each partition of the range will take. Consider, for example, a collection of matrices in which the first hundred elements are 2-by-2 matrices and the next hundred elements are 1000-by-1000 matrices, and suppose we would like to take the inverse of each matrix using an @distributed for-loop with 2 worker processes.
@sync @distributed for i = 1:200
    B[i] = inv(A[i])    # A holds the input matrices, B collects their inverses
end
The first worker will get all the 2-by-2 matrices and the second one will get the 1000-by-1000 matrices. The first worker will complete all of its transformations very quickly and then go idle, while the second will continue to work for a very long time. Although you are using 2 workers, the major part of the work is effectively executed in serial on the second worker, and you will get almost no benefit from using more than one worker. This problem is known as load balancing in the context of parallel computing. It can also arise, for example, when one processor is slow and another is fast, even if the work to be completed is homogeneous.
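By contrast, a sketch of the same computation with pmap, where the construction of A is only illustrative:

A = [i <= 100 ? rand(2, 2) : rand(1000, 1000) for i = 1:200]   # mixed workload as described above
B = pmap(inv, A)    # elements are handed out one at a time, so neither worker sits idle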
For very small work units, however, using pmap with a small batch size creates a communication overhead that can be significant, since after each batch the worker needs to fetch the next batch from the calling process, whereas with an @distributed for-loop each worker process knows, from the beginning, which part of the range it is responsible for.
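If you still want to use pmap on tiny work units, one way to reduce this overhead is a larger batch_size, so that elements are shipped in groups; a minimal sketch, where the batch size of 100 is an arbitrary choice:

xs = rand(10_000)
ys = pmap(x -> 2x, xs; batch_size = 100)   # each request transfers 100 elements,
                                           # amortizing the per-batch communication cost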
The choice between pmap and an @distributed for-loop depends on what you want to achieve. If you are going to transform a collection as in map, and each transformation requires a large and possibly varying amount of work, then you are likely to be better off choosing pmap. If each transformation is very tiny, then you are likely to be better off choosing an @distributed for-loop.
Note that if you need a reduction operation after the transformation, the @distributed for-loop already provides one: most of the reduction is applied locally on the workers, while the final reduction takes place on the calling process. With pmap, however, you will need to handle the reduction yourself.
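For example, a minimal sketch of both forms of reduction, reusing the collection A of matrices from above and summing the entries of each inverse as an arbitrary reduction:

# @distributed with a reduction operator: partial (+) reductions run on the workers,
# and the final combination happens on the calling process
total = @distributed (+) for i = 1:200
    sum(inv(A[i]))
end

# with pmap, the reduction is written explicitly on the calling process
total = sum(pmap(M -> sum(inv(M)), A))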
You can also implement your own pmap function with very complex load balancing and reduction schemes if you really need one.
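As a rough sketch of what such a function can look like (my_pmap is a hypothetical name), each worker is driven by its own task that repeatedly fetches the next unprocessed index, so no worker sits idle while elements remain:

using Distributed

function my_pmap(f, lst)
    np = nprocs()                       # number of available processes
    n = length(lst)
    results = Vector{Any}(undef, n)
    i = 1
    nextidx() = (idx = i; i += 1; idx)  # tasks here share one thread, so no lock is needed
    @sync for p = 1:np
        if p != myid() || np == 1
            @async while true
                idx = nextidx()
                idx > n && break
                results[idx] = remotecall_fetch(f, p, lst[idx])
            end
        end
    end
    return results
end

A call such as my_pmap(inv, A) then behaves much like pmap(inv, A), and the index hand-out or the result handling inside the loop can be replaced with whatever load-balancing or reduction policy your problem needs.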
https://docs.julialang.org/en/v1/manual/parallel-computing/