Processor/socket affinity in Open MPI?

I know there are some basic options in the Open MPI implementation for mapping the different processes to different cores of different sockets (if the system has more than one socket):

  1. --bind-to-socket (first come, first served)
  2. --bysocket (round-robin, based on load balancing)
  3. --npersocket N (assign N processes to each socket)
  4. --npersocket N --bysocket (assign N processes to each socket, but on a round-robin basis)
  5. --bind-to-core (bind one process to each core in a sequential fashion)
  6. --bind-to-core --bysocket (assign one process to each core, but never leave any socket less utilized)
  7. --cpus-per-proc N (bind each process to more than one core)
  8. --rankfile (can give a complete description of the placement preferences of each process)

I am running my Open MPI program on a server with 8 sockets (10 cores each); since multithreading is enabled, there are 160 hardware threads available. I need to analyze the program by running it with different combinations of sockets/cores and processes. I expect the case where all the sockets are used, and the code does some data transfer, to be the slowest, since memory transfer is fastest when both processes execute on cores of the same socket.

So my questions are as follows:

  1. What are the worst/best-case mappings between processes and sockets (each process has a sleep duration and a data transfer to the root process)? A minimal sketch of such a program is shown after this list.

  2. Is there any way to print the name of the socket and the core on which a process is being executed? (I will make use of it to know whether the processes are really distributing themselves among the sockets.)
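
For reference, here is a minimal sketch (illustration only, not my full code) of the kind of program I mean: every rank sleeps for a fixed duration and then transfers a buffer to the root, which reports the elapsed wall-clock time.

/* Minimal sketch (illustration only): each rank sleeps, then sends a
 * buffer to the root via MPI_Gather; the root prints the elapsed time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                       /* 1 Mi doubles per rank */
    double *buf = malloc(count * sizeof *buf);
    for (int i = 0; i < count; i++)
        buf[i] = rank;

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * count * sizeof *all);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    usleep(100 * 1000);                              /* the "sleep duration" */
    MPI_Gather(buf, count, MPI_DOUBLE,               /* the transfer to root */
               all, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d ranks: %.6f s\n", size, t1 - t0);

    free(buf);
    free(all);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with mpiexec under the different binding options listed above to compare the timings.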

Pinson answered 11/7, 2013 at 22:52 Comment(0)
  1. It depends on so many factors that it's impossible for a single "silver bullet" answer to exist. Among the factors are the computational intensity (FLOPS/byte) and the ratio of the amount of local data to the amount of data being passed between the processes. It also depends on the architecture of the system. Computational intensity can be estimated analytically or measured with a profiling tool like PAPI, Likwid, etc. The system's architecture can be examined with the lstopo utility, part of the hwloc library that ships with Open MPI (a short hwloc API sketch is given at the end of this answer). Unfortunately lstopo cannot tell you how fast each memory channel is or how fast/latent the links between the NUMA nodes are.

  2. Yes, there is: --report-bindings makes each rank print to its standard error output the affinity mask that applies to it. The output varies a bit among the different Open MPI versions:

Open MPI 1.5.x shows the hexadecimal value of the affinity mask:

mpiexec --report-bindings --bind-to-core --bycore

[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],0] to cpus 0001
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],1] to cpus 0002
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],2] to cpus 0004
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],3] to cpus 0008

This shows that rank 0 has its affinity mask set to 0001 which allows it to run on CPU 0 only. Rank 1 has its affinity mask set to 0002 which allows it to run on CPU 1 only. And so on.

mpiexec --report-bindings --bind-to-socket --bysocket

[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],0] to socket 0 cpus 003f
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],1] to socket 1 cpus 0fc0
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],2] to socket 0 cpus 003f
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],3] to socket 1 cpus 0fc0

In that case the affinity mask alternates between 003f and 0fc0. 003f in binary is 0000000000111111, and such an affinity mask allows each even rank to execute on CPUs 0 to 5. 0fc0 is 0000111111000000, and therefore the odd ranks are only scheduled on CPUs 6 to 11.
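
If you want to translate such a hexadecimal mask into CPU numbers yourself, a tiny standalone C program (just an illustration, not part of Open MPI) does the job:

/* Decode a hexadecimal affinity mask (as printed by Open MPI 1.5.x)
 * into the list of CPUs it allows. */
#include <stdio.h>

static void print_cpus(unsigned long mask)
{
    printf("mask %#06lx -> CPUs:", mask);
    for (unsigned cpu = 0; cpu < 8 * sizeof mask; cpu++)
        if (mask & (1UL << cpu))
            printf(" %u", cpu);
    printf("\n");
}

int main(void)
{
    print_cpus(0x003f);   /* even ranks above: CPUs 0-5  */
    print_cpus(0x0fc0);   /* odd ranks above:  CPUs 6-11 */
    return 0;
}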

Open MPI 1.6.x uses a nicer graphical display instead:

mpiexec --report-bindings --bind-to-core --bycore

[hostname:39646] MCW rank 0 bound to socket 0[core 0]: [B . . . . .][. . . . . .]
[hostname:39646] MCW rank 1 bound to socket 0[core 1]: [. B . . . .][. . . . . .]
[hostname:39646] MCW rank 2 bound to socket 0[core 2]: [. . B . . .][. . . . . .]
[hostname:39646] MCW rank 3 bound to socket 0[core 3]: [. . . B . .][. . . . . .]

mpiexec --report-bindings --bind-to-socket --bysocket

[hostname:13888] MCW rank 0 bound to socket 0[core 0-5]: [B B B B B B][. . . . . .]
[hostname:13888] MCW rank 1 bound to socket 1[core 0-5]: [. . . . . .][B B B B B B]
[hostname:13888] MCW rank 2 bound to socket 0[core 0-5]: [B B B B B B][. . . . . .]
[hostname:13888] MCW rank 3 bound to socket 1[core 0-5]: [. . . . . .][B B B B B B]

Each socket is represented graphically as a set of square brackets with each core represented by a dot. The core(s) that each rank is bound to is/are denoted by the letter B. Processes are bound to the first hardware thread only.

Open MPI 1.7.x is a bit more verbose and also knows about hardware threads:

mpiexec --report-bindings --bind-to-core

[hostname:28894] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[hostname:28894] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[hostname:28894] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[hostname:28894] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]

mpiexec --report-bindings --bind-to-socket

[hostname:29807] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[hostname:29807] MCW rank 1 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
[hostname:29807] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[hostname:29807] MCW rank 3 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]

Open MPI 1.7.x also replaces the --bycore and --bysocket options with the more general --rank-by <policy> option.
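
As a footnote to point 1: besides running the lstopo tool, the topology can also be queried programmatically through the hwloc C API that ships with Open MPI. A minimal sketch (my own illustration; it assumes hwloc 1.x, where sockets are exposed as HWLOC_OBJ_SOCKET):

/* Count sockets, cores and hardware threads using the hwloc C API.
 * Compile with something like: cc topo.c -lhwloc */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int sockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    int cores   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int pus     = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);

    printf("%d sockets, %d cores, %d hardware threads\n",
           sockets, cores, pus);

    hwloc_topology_destroy(topo);
    return 0;
}

On your machine this should report 8 sockets, 80 cores and 160 hardware threads if the topology is what you describe.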

Frontpage answered 12/7, 2013 at 12:17 Comment(13)
Would --report-bindings itself print something on the command line, or how do I use it further to get the actual binding used by the processes? It printed nothing extra for me! - Pinson
It makes each rank print its binding to the standard error output. - Frontpage
I am not getting anything on stderr even after redirecting it to stdout using this: char buf[BUFSIZ]; setbuf(stderr, buf); - Pinson
@AnkurGautam, I've updated the answer with sample outputs to be expected from --report-bindings. - Frontpage
Thanks a lot! You always try to leave no further questions. But I am still not able to figure out how to redirect stderr (since the binding will be shown there) to stdout. - Pinson
When I use 2>&1, I am not able to see even stdout (the regular cout prints from my code). - Pinson
Does it matter whether I use mpiexec or mpirun? I thought them to be the same. - Pinson
In Open MPI both are symlinks to orterun. The standard recommends that the launcher be called mpiexec, and in most cases both names are synonyms. - Frontpage
2>&1 is also not working with mpiexec :( (no output on the terminal) - Pinson
What version of Open MPI do you use? - Frontpage
Can you please tell me how to save the stderr to a file? - Pinson
#17605074 Can you please answer my last question :) - Pinson
... 2>/path/to/file-for-stderr.txt or ... 2>>/path/to/file-for-stderr.txt - Frontpage

1. If there is equal communication between each node and the root and no other communication pattern, then communication will not influence the performance of a specific process->socket mapping. (This is assuming a regular symmetric interconnect topology between the sockets.) Otherwise you usually try to place process pairs with heavy communication close to each other in the communication topology. With MPI on shared memory systems that may not be relevant, but on clusters it certainly is.

However, load balancing may also have an effect on the performance of the mapping. If some processes are waiting for a message or at a barrier, the other cores on that socket may be able to utilize a higher turbo frequency. This heavily depends on the runtime behavior of the application. An application consisting only of sleep and data transfer does not really make sense.

  2. You can use libnuma / sched_getaffinity to confirm your process pinning manually.
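
A minimal sketch of what that could look like inside an MPI program (my own illustration; it uses the Linux-specific sched_getaffinity() and sched_getcpu() calls rather than libnuma directly):

/* Each rank prints the CPU it is currently running on and the set of
 * CPUs its affinity mask allows.  Linux-specific; compile with mpicc. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t set;
    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof set, &set);      /* 0 = the calling process */

    char cpus[8192] = "";
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            snprintf(cpus + strlen(cpus), sizeof cpus - strlen(cpus), " %d", cpu);

    printf("rank %d on CPU %d, allowed CPUs:%s\n", rank, sched_getcpu(), cpus);

    MPI_Finalize();
    return 0;
}

Run it under mpiexec with the binding options you want to verify; the printed sets should match what --report-bindings shows.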

There are a number of performance analysis tools that would be helpful in answering your questions. For example, Open MPI comes with VampirTrace, which produces a trace containing information about the MPI communication and more. You can view it with Vampir.

Apophyge answered 12/7, 2013 at 10:32 Comment(0)
