Given a cluster of several nodes, each of which hosts a multi-core processor, is there any advantage to using MPI between nodes and OpenMP/pthreads within nodes over using pure MPI everywhere? If I understand correctly, if I run an MPI program on a single node and set the number of processes equal to the number of cores, then I will have an honest parallel MPI job of several processes running on separate cores. So why bother with hybrid parallelization using threads within nodes and MPI only between nodes? I have no question in the case of an MPI+CUDA hybrid, as MPI cannot employ GPUs, but it can employ CPU cores, so why use threads?
Using a combination of OpenMP/pthread threads and MPI processes is known as Hybrid Programming. It is tougher to program than pure MPI, but with the recent reductions in OpenMP overheads, hybrid MPI makes a lot of sense. Some advantages are:
- Avoiding data replication: Threads within a node share an address space, so data that would otherwise have to be replicated in every MPI process on the node needs to be stored only once.
- Light-weight: Threads are lighter than processes, so you reduce the metadata the system keeps per process.
- Reduction in the number of messages: A single process per node can communicate on behalf of all the threads on that node, reducing the number of messages between nodes (and thus the pressure on the network interface card). This is especially noticeable for collective communication, where the message count grows with the number of processes.
- Faster communication: As pointed out by @user3528438, since threads communicate using shared memory, you can avoid point-to-point MPI communication within a node. A more recent approach (2012) recommends using RMA shared memory instead of threads within a node; this model is called MPI+MPI (search Google Scholar for "MPI plus MPI"). See the sketch after this list.
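For concreteness, here is a minimal sketch of the MPI+MPI idea (the sizes and names are mine, not from any particular paper): ranks on the same node allocate a window with MPI_Win_allocate_shared and read each other's data through plain loads instead of messages.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that live on the same node (i.e. that can
       actually share memory) into node_comm. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank contributes one double to a node-local shared window. */
    double *my_slot;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_slot, &win);
    *my_slot = (double)node_rank;

    MPI_Win_fence(0, win); /* make every rank's write visible */

    if (node_rank == 0 && node_size > 1) {
        /* Read rank 1's slot through a plain load: no send/recv. */
        MPI_Aint size;
        int disp_unit;
        double *their_slot;
        MPI_Win_shared_query(win, 1, &size, &disp_unit,
                             (void **)&their_slot);
        printf("node rank 0 reads rank 1's value: %f\n", *their_slot);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with at least two ranks per node; with fewer, the neighbour query is skipped.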
Hybrid MPI has its disadvantages as well, but you asked only about the advantages.
This is in fact a much more complex question than it looks.
It depends on a lot of factors. From experience I would say: you are always happy to avoid hybrid OpenMP-MPI, which is a mess to optimise. But there are moments when you cannot avoid it, mainly depending on the problem you are solving and the cluster you have access to.
Let's say you are solving a highly parallelizable problem and you have a small cluster: then hybrid will probably be useless.
But if you have a problem which, say, scales well up to N processes but starts to have very bad efficiency at 4N, and you have access to a cluster with 10N cores, then hybridization will be a solution. You will use a small number of threads per MPI process, something like 4 (it is commonly observed that more than 8 is not efficient). (It is amusing that on KNL most people I know use 4 to 8 threads per MPI process, even though one chip has 68 cores.)
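As a minimal sketch of that configuration (the loop body and problem size are placeholders): each rank asks for MPI_THREAD_FUNNELED, computes with its OpenMP threads, and then lets the master thread send one message instead of one per core.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Ask for FUNNELED support: only the master thread makes MPI
       calls; the OpenMP threads do pure computation. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;

    /* Each rank fans out into threads for the compute phase,
       e.g. 4 threads per process as suggested above. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i); /* stand-in for real work */

    /* Only one message per rank crosses the network here, instead of
       one per core as in a pure-MPI run. */
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Build with `mpicc -fopenmp hybrid.c` and run with e.g. `OMP_NUM_THREADS=4 mpirun -np <ranks> ./a.out` to get the 4-threads-per-process layout described above.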
Then what about hybrid accelerator/OpenMP/MPI?
You are wrong about accelerator + MPI. As soon as you start to use a cluster which has accelerators, you will need to use something like OpenMP/MPI, CUDA/MPI or OpenACC/MPI, as you will need to communicate between devices. Nowadays you can bypass the CPU using GPUDirect (at least for NVIDIA; no clue about other vendors, but I expect it would be the case there too). Then usually you will use 1 MPI process per GPU. Most clusters with GPUs will have one socket and several accelerators per node.
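As a hedged sketch, assuming an MPI library built with CUDA support (builds of Open MPI or MVAPICH2 offer this, for example) and ranks mapped 1:1 to the GPUs of a node, the device pointer goes straight into the MPI call and the library (possibly via GPUDirect) avoids staging through host memory:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One MPI process per GPU; this simple mapping assumes ranks
       0..N-1 sit on one node with N visible devices. */
    cudaSetDevice(rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    /* With CUDA-aware MPI the device pointer is passed directly;
       no explicit copy to a host buffer is needed. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

This is plain C using the CUDA runtime API, so it can be compiled with mpicc and linked against -lcudart. Without a CUDA-aware build you would instead cudaMemcpy to a host buffer before MPI_Send, which is exactly the extra hop GPUDirect removes.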