ppr means processes per resource. Its syntax is ppr:N:resource and it means "assign N processes to each resource of type resource available on the host". For example, on a 4-socket system with 6-core CPUs, --map-by ppr:4:socket results in the following process map:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  A B C D        E F G H        I J K L        M N O P
(processes are labelled with letters starting from A instead of numeric ranks in this example)
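To try this out, the resulting map can be verified at run time. The following is a minimal sketch that assumes Open MPI's mpirun and a hypothetical executable ./my_app; --report-bindings prints each rank's placement to the standard error output:

  # 4 processes per socket; with ppr the total number of processes is
  # derived from the mapping, so -np can usually be omitted
  mpirun --map-by ppr:4:socket --report-bindings ./my_app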
What the manual means is that the whole ppr:N:resource is to be regarded as a single specifier and that options can be added after it, separated by :, e.g. ppr:2:socket:pe=2. This should read as "start two processes per socket and bind each of them to two processing elements" and should result in the following map on the same quad-socket system:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  A A B B        C C D D        E E F F        G G H H
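The corresponding command line, again only a sketch with a hypothetical ./my_app, would look like:

  mpirun --map-by ppr:2:socket:pe=2 --report-bindings ./my_app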
The sequential mapper reads the host file line by line and starts one process per host found there. It ignores the slot count, if given.
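As an illustration, the sketch below uses hypothetical host names (node01, node02) in a host file named myhosts; one process is started per line, in the order the lines appear. Depending on the Open MPI version, the sequential mapper is selected either with --map-by seq or by requesting the seq component of the rmaps framework:

  # contents of the host file "myhosts" (hypothetical host names)
  node01
  node02
  node01

  # launch one process per line of the host file, in that order
  mpirun --hostfile myhosts -mca rmaps seq ./my_app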
The dist mapper maps processes onto NUMA nodes by the distance of the latter from a given PCI resource. It only makes sense on NUMA systems. Again, let's use the toy quad-socket system, but this time extend the representation so as to show the NUMA topology:
Socket 0 ------------- Socket 1
   |                       |
   |                       |
   |                       |
Socket 2 ------------- Socket 3
   |
  ib0
The lines between the sockets represent CPU links, e.g. QPI links for Intel CPUs or HT links for AMD CPUs. ib0 is an InfiniBand HCA used to communicate with other compute nodes. Now, in that system, Socket 2 talks directly to the InfiniBand HCA, Socket 0 and Socket 3 have to cross one CPU link in order to talk to ib0, and Socket 1 has to cross two CPU links. That means processes running on Socket 2 will have the lowest possible latency while sending and receiving messages, and processes on Socket 1 will have the highest possible latency.
How does it work? If your host file specifies e.g. 16 slots on that host and the mapping option is --map-by dist:ib0, it may result in the following map:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  G H I J K L                   A B C D E F    M N O P
Six processes are mapped to Socket 2, which is closest to the InfiniBand HCA, then six more are mapped to Socket 0, the second closest, and the remaining four are mapped to Socket 3. It is also possible to spread the processes across the NUMA nodes instead of linearly filling the processing elements: --map-by dist:ib0:span results in:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  E F G H        M N O P        A B C D        I J K L
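Put together, and assuming a host file that provides 16 slots on the node plus the usual hypothetical ./my_app, the two mappings above correspond to command lines along the lines of:

  # fill the NUMA nodes closest to ib0 first
  mpirun -np 16 --map-by dist:ib0 --report-bindings ./my_app

  # spread the processes across the NUMA nodes by distance from ib0
  mpirun -np 16 --map-by dist:ib0:span --report-bindings ./my_app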
The actual NUMA topology is obtained using the hwloc library, which reads the distance information provided by the BIOS. hwloc includes a command-line tool called hwloc-ls (also known as lstopo) that can be used to display the topology of the system. Normally its output only includes the topology of the processing elements and the NUMA domains, but if you give it the -v option, it also includes the PCI devices.
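For example, assuming hwloc is installed on the node, the two relevant invocations are:

  # processing elements and NUMA domains only
  hwloc-ls

  # verbose output also lists the PCI devices, e.g. the InfiniBand HCA
  hwloc-ls -v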