My program is well-suited for MPI. Each CPU does its own specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the results from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have googled around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all others are CPUs"? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 25,000 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 25,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The computation is indeed very large. I'm just trying to convey the general problem.
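To make the arithmetic concrete, here is a minimal serial sketch in plain C of the result wanted for every state. It contains no MPI or GPU code; it only shows the intended computation. The function name `product_per_state`, the parameter names, and the flat `values[state][rank][k]` layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MPI or any library): for each state,
 * multiply together every double produced by every rank.
 * Assumed layout: values[state][rank][k], flattened row-major, so the
 * element for (s, r, k) sits at (s * n_ranks + r) * per_rank + k. */
void product_per_state(const double *values, size_t n_states,
                       size_t n_ranks, size_t per_rank, double *out)
{
    assert(values != NULL && out != NULL);

    for (size_t s = 0; s < n_states; ++s) {
        double state_product = 1.0;
        for (size_t r = 0; r < n_ranks; ++r) {
            /* What one rank would contribute: the product of its own doubles. */
            double partial = 1.0;
            for (size_t k = 0; k < per_rank; ++k)
                partial *= values[(s * n_ranks + r) * per_rank + k];

            /* What the reduction step does: combine the per-rank partials. */
            state_product *= partial;
        }
        out[s] = state_product; /* one double per state */
    }
}
```

With the schematic numbers above, this would be called with `n_ranks = 500`, `per_rank = 50`, and `n_states` somewhere between 10,000 and about 1 million; the inner loop is the per-CPU work and the middle loop is the part I would like to offload.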
doubles together. Each CPU will produce a bunch of these doubles (not just one). The number of states will be between ~10,000 and several million (not the simple 100,000). This entire process will be repeated often. – Zetana