Actually, I'd expect your first example to work. Setting OMP_PROC_BIND=true here is important, so that OpenMP stays within the CPU binding of the MPI process when pinning its threads.
Depending on the batch system and MPI implementation, the exact way to set these things up can differ quite a bit.
Hyperthreading (or, more generally, multiple hardware threads per core), which all show up as "cores" in Linux, may also be part of the problem: you will never see 200% when two processes run on the two hyperthreads of a single core.
Here is a generic approach I use when figuring these things out for a given MPI and OpenMP implementation on a given system.
There is documentation from Cray that contains a very helpful program for figuring this out quickly; it is called xthi.c
. Search for the filename to find it (I'm not sure whether it would be legal to paste it here...). Compile it with:
mpicc xthi.c -fopenmp -o xthi
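In case you can't find it, a minimal sketch of such a program (not the original Cray xthi.c; it assumes Linux and sched_getaffinity(), and prints a plain CPU list rather than the compressed ranges shown in the output below) could look like this:

// xthi-like sketch (not the original Cray code): print each MPI rank's and
// OpenMP thread's CPU affinity mask, assuming Linux and sched_getaffinity().
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char hostname[256];
    gethostname(hostname, sizeof(hostname));

    #pragma omp parallel
    {
        // Affinity mask of the calling thread (pid 0 = current thread).
        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);

        // Plain comma-separated CPU list; the real xthi compresses ranges like 0-7,16-23.
        char buf[8192];
        int pos = 0;
        buf[0] = '\0';
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            if (CPU_ISSET(cpu, &mask))
                pos += snprintf(buf + pos, sizeof(buf) - pos,
                                pos ? ",%d" : "%d", cpu);

        #pragma omp critical
        printf("Hello from rank %d, thread %d, on %s. (core affinity = %s)\n",
               rank, omp_get_thread_num(), hostname, buf);
    }

    MPI_Finalize();
    return 0;
}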
Now we can see exactly what is going on. For instance, on a 2x 8-core Xeon with Hyperthreading and Intel MPI (MPICH-based), we get:
$ OMP_PROC_BIND=true OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi
Hello from rank 0, thread 0, on localhost. (core affinity = 0,16)
Hello from rank 0, thread 1, on localhost. (core affinity = 1,17)
Hello from rank 1, thread 0, on localhost. (core affinity = 8,24)
Hello from rank 1, thread 1, on localhost. (core affinity = 9,25)
As you can see, cores here means all the hyperthreads of a core. Note how mpirun
also pins the ranks to different sockets by default. And with OMP_PLACES=threads
each OpenMP thread is pinned to a single hardware thread:
$ OMP_PROC_BIND=true OMP_PLACES=threads OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi
Hello from rank 0, thread 0, on localhost. (core affinity = 0)
Hello from rank 0, thread 1, on localhost. (core affinity = 1)
Hello from rank 1, thread 0, on localhost. (core affinity = 8)
Hello from rank 1, thread 1, on localhost. (core affinity = 9)
With OMP_PROC_BIND=false (your second example), I get:
$ OMP_PROC_BIND=false OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi
Hello from rank 0, thread 0, on localhost. (core affinity = 0-7,16-23)
Hello from rank 0, thread 1, on localhost. (core affinity = 0-7,16-23)
Hello from rank 1, thread 0, on localhost. (core affinity = 8-15,24-31)
Hello from rank 1, thread 1, on localhost. (core affinity = 8-15,24-31)
Here, each OpenMP thread gets a full socket, so the MPI ranks still operate on distinct resources. However, the OpenMP threads within one process can be scheduled freely by the OS across all the cores of that socket. On my test system this is equivalent to just setting OMP_NUM_THREADS=2.
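If you want to double-check from inside your own code which binding the OpenMP runtime actually applied, the runtime can report it directly. A small sketch using the OpenMP 4.5 places API (file name and output format are just examples):

// Sketch: ask the OpenMP runtime (4.5 places API) what binding it applied.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Policy in effect: 0=false, 1=true, 2=master, 3=close, 4=spread.
    printf("proc_bind policy = %d, num_places = %d\n",
           (int)omp_get_proc_bind(), omp_get_num_places());

    #pragma omp parallel
    {
        // omp_get_place_num() returns -1 when the calling thread is not
        // bound to a place (e.g. with OMP_PROC_BIND=false).
        #pragma omp critical
        printf("thread %d runs in place %d\n",
               omp_get_thread_num(), omp_get_place_num());
    }
    return 0;
}

Compile with cc -fopenmp; with OMP_PROC_BIND=true and OMP_PLACES set you should see proper place numbers, while with OMP_PROC_BIND=false you will typically get -1.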
Again, this may depend on the specific OpenMP and MPI implementations and versions, but with the description above I think you'll easily figure out what is going on.
Hope that helps.
mpich, does man mpiexec mention anything about binding? – Bravura