MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own specific (and sophisticated) job, produces a single double, and then I use MPI_Reduce to multiply the results from every CPU together.
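
A minimal sketch of that reduction step (compute_my_value is only a placeholder for the per-CPU job):

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder for the "sophisticated job" each rank performs. */
    static double compute_my_value(int rank)
    {
        return 1.0 + 0.001 * rank;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local   = compute_my_value(rank);  /* one double per rank */
        double product = 1.0;

        /* Multiply the contributions from every rank; the result lands on rank 0. */
        MPI_Reduce(&local, &product, 1, MPI_DOUBLE, MPI_PROD, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("product over all ranks = %g\n", product);

        MPI_Finalize();
        return 0;
    }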

But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.

I have googled around but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all the others are CPUs"? Is there a recommended tutorial or something?

Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.

Here is a schematic example of what I'm talking about:

Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,000 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.
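
A rough pure-MPI sketch of that schematic (NUM_STATES, PER_RANK and produce_value are illustrative placeholders): each rank collapses its own 50 doubles into one partial product per state, and a single MPI_Reduce with MPI_PROD then multiplies the partial products element-wise across all ranks.

    #include <mpi.h>
    #include <stdlib.h>

    #define NUM_STATES 1000000   /* illustrative; 10,000 to ~1 million in the question */
    #define PER_RANK   50        /* doubles each rank produces per state */

    /* Hypothetical stand-in for whatever produces one of the 50 doubles. */
    static double produce_value(int rank, long state, int k)
    {
        return 1.0 + 1e-9 * (rank + state + k);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *partial = malloc(NUM_STATES * sizeof(double));
        double *result  = malloc(NUM_STATES * sizeof(double));

        /* Each rank collapses its 50 doubles into one partial product per state. */
        for (long s = 0; s < NUM_STATES; ++s) {
            double p = 1.0;
            for (int k = 0; k < PER_RANK; ++k)
                p *= produce_value(rank, s, k);
            partial[s] = p;
        }

        /* One collective multiplies the partial products across all ranks,
           element by element, giving one double per state on rank 0.        */
        MPI_Reduce(partial, result, NUM_STATES, MPI_DOUBLE, MPI_PROD,
                   0, MPI_COMM_WORLD);

        free(partial);
        free(result);
        MPI_Finalize();
        return 0;
    }

As the answers below point out, the multiplications inside that reduction are cheap; the expensive parts are the per-rank computation and the data movement.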

Zetana answered 9/4, 2012 at 13:37 Comment(3)
This doesn't sound like a very good fit for GPU computing. Your proposed GPU component contains only a few hundred double-precision MFLOPs. That's orders of magnitude smaller than is profitable for a GPU, and it would be swamped by the network overhead of transmitting the data over the wire to the node hosting the GPU and across the PCI-e bus into GPU memory.Courland
@Courland Sorry for the misleading schematic example; I will update my question. In reality, it is slightly more complicated. I need to multiply *O*(10k) doubles together. Each CPU will produce a bunch of these doubles (not just one). The number of states will be between ~10,000 and several million (not the simple 100,000). This entire process will be repeated often.Zetana
As I wrote, that is still only a few hundred MFlops. That is a tiny amount of computation, even for a CPU.Courland

This isn't the way to think about these things.

I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (think of tasks as nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within the task. There is no MPI rank on a GPU.
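
To make that concrete, here is a sketch of the usual pattern (CUDA runtime API; the local-rank trick assumes MPI-3's MPI_Comm_split_type): each task checks for a GPU on its own node and, if one is present, selects it to accelerate that task's work. The GPU itself is never an MPI rank.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ranks sharing a node get consecutive local ranks 0, 1, 2, ... */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int local_rank;
        MPI_Comm_rank(node_comm, &local_rank);

        int num_gpus = 0;
        if (cudaGetDeviceCount(&num_gpus) != cudaSuccess)
            num_gpus = 0;

        if (num_gpus > 0) {
            /* This task offloads its own computation to a GPU on its node. */
            cudaSetDevice(local_rank % num_gpus);
        }
        /* Tasks without a GPU simply compute on the CPU; MPI sees no difference. */

        /* ... per-task computation here, then MPI collectives as usual ... */

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }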

Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.

(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.

Exanimate answered 9/4, 2012 at 14:48 Comment(5)
I suppose I am underestimating the speed of float multiplication for a standard CPU? I was thinking: multiplying 10,000 doubles together, and doing this ~1 million times sounds like a helluva lot of computations (10 billion). Is it not?Zetana
@CycoMatto: your 10,000 doubles multiplied 1 million times has the same flop count as multiplying a pair of 1800x1800 dense matrices. Once. That is a couple of CPU seconds using even a modest x86 processor with a reasonably tuned BLAS.....Courland
@Courland OK. And what if there is yet another level of repetition/looping? i.e., I have ~1 million trials. Each trial must sum over 1 million states. Each state requires the multiplication of ~10,000 doubles. For these reasons I was fixated on GPU + MPIZetana
@CycoMatto: Increasing the amount of work doesn't change the basic problem with your idea. Your computation requires N 64 bit words to pass over the wire and into the GPU in order to do N Flops. No matter how large N is, you can never "win" -- the communication will be vastly slower than the computation at all sizes. This is why it doesn't make sense to use the GPU. Compare this with the matrix multiply example I mentioned. There you require 2N^2 words of data transfer to get 2N^3 Flops. That is profitable to do on the GPU.Courland
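
To put rough numbers on the two comments above (my arithmetic, using their figures):

    flop count:      10,000 flops/state x 1,000,000 states     ~ 1.0e10 flops
                     1800 x 1800 matrix product: 2 x 1800^3    ~ 1.2e10 flops
    flops per word:  element-wise product:  N flops / N words  ~ 1, for any N
                     dense matrix multiply: 2N^3 / 2N^2 words  ~ N

In the element-wise case the data transfer grows exactly as fast as the compute, so a remote GPU can never amortize the transfer; in the matrix-multiply case the ratio improves with N.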
@Talonmies Thanks for the point of wisdom that I never found in my last 2 weeks of research on CUDA fundamentals. A GPU may be the fastest beast out there for computation, by virtue of parallel processing -- like a swarm of honey bees. But feeding those bees what they need, and taking from them what you want, takes effort that can sometimes cancel out the gains, as in the OP's case. (Gist: I/O overhead may exceed the processing gains.)Tool

Here is some material I found on the topic:

"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."

Adherent answered 26/4, 2017 at 8:42 Comment(0)
