assign two MPI processes per core

Asked 31/7, 2012 at 21:20 Answered 28/5, 2021 at 2:4

How do I assign 2 MPI processes per core?

For example, if I do mpirun -np 4 ./application then it should use 2 physical cores to run 4 MPI processes (2 processes per core). I am using Open MPI 1.6. I did mpirun -np 4 -nc 2 ./application but wasn't able to run it.

It complains that mpirun was unable to launch the specified application because it could not find an executable.

Queue answered 31/7, 2012 at 21:20 Comment(5)

maybe because you spelled application wrong? – Apologize 31/7, 2012 at 21:23

No. That was just a typo. 'application' is not a real application name. Thanks for pointing it out though. If I took '-nc 2' out then it worked! – Queue 31/7, 2012 at 21:29

In your comment, you said "nc -2" instead of "-nc 2". That's 2 typos in 2 messages. Are you sure you aren't just missing something silly because you're in a hurry? – Apologize 31/7, 2012 at 21:31

:-( I double checked. No typos in actual command. Worked without '-nc 2' – Queue 31/7, 2012 at 21:34

I would suggest that you merge the content of your other question here. – Granulation 2/8, 2012 at 8:12

orterun (the Open MPI SPMD/MPMD launcher; mpirun/mpiexec are just symlinks to it) has some support for process binding but it is not flexible enough to allow you to bind two processes per core. You can try with -bycore -bind-to-core but it will err when all cores already have one process assigned to them.

But there is a workaround - you can use a rankfile where you explicitly specify which slot to bind each rank to. Here is an example: in order to run 4 processes on a dual-core CPU with 2 processes per core, you would do the following:

mpiexec -np 4 -H localhost -rf rankfile ./application

where rankfile is a text file with the following content:

rank 0=localhost slot=0:0
rank 1=localhost slot=0:0
rank 2=localhost slot=0:1
rank 3=localhost slot=0:1

This will place ranks 0 and 1 on core 0 of processor 0 and ranks 2 and 3 on core 1 of processor 0. Ugly but works:

$ mpiexec -np 4 -H localhost -rf rankfile -tag-output cat /proc/self/status | grep Cpus_allowed_list
[1,0]<stdout>:Cpus_allowed_list:     0
[1,1]<stdout>:Cpus_allowed_list:     0
[1,2]<stdout>:Cpus_allowed_list:     1
[1,3]<stdout>:Cpus_allowed_list:     1
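The rankfile above follows a regular pattern, so for larger core counts it could be generated rather than typed by hand. Here is a minimal Python sketch of that idea; the function name make_rankfile and its parameters are hypothetical helpers for illustration, not part of Open MPI, and it assumes all cores sit on one socket and are numbered consecutively from 0.

```python
def make_rankfile(num_cores, ranks_per_core=2, host="localhost", socket=0):
    """Return rankfile text placing `ranks_per_core` consecutive ranks on each core.

    Assumes a single socket whose cores are numbered 0..num_cores-1, and uses
    the slot=<socket>:<core> syntax from the example rankfile.
    """
    lines = []
    for rank in range(num_cores * ranks_per_core):
        core = rank // ranks_per_core  # ranks 0,1 -> core 0; ranks 2,3 -> core 1; ...
        lines.append(f"rank {rank}={host} slot={socket}:{core}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # For a dual-core CPU this prints the same four lines as the rankfile above.
    print(make_rankfile(num_cores=2), end="")
```

Redirect the output to a file named rankfile and pass it to mpiexec with -rf as shown earlier.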

Edit: From your other question it becomes clear that you are actually running on a hyperthreaded CPU. Then you would have to figure out the physical numbering of your logical processors (it's a bit confusing, but physical numbering corresponds to the value of processor: as reported in /proc/cpuinfo). The easiest way to obtain it is to install the hwloc library. It provides the hwloc-ls tool that you can use like this:

$ hwloc-ls --of console
...
  NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)    <-- Physical ID 0
      PU L#1 (P#12)   <-- Physical ID 12
...

Physical IDs are listed after P# in the brackets. In your 8-core case the second hyperthread of the first core (core 0) would most likely have ID 8 and hence your rankfile would look something like:

rank 0=localhost slot=p0
rank 1=localhost slot=p8
rank 2=localhost slot=p1
rank 3=localhost slot=p9

(note the p prefix - don't omit it)

If you don't have hwloc or you cannot install it, then you would have to parse /proc/cpuinfo on your own. Hyperthreads would have the same values of physical id and core id but different processor and apicid. The physical ID is equal to the value of processor.
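As a rough illustration of the manual parsing described above, the following Python sketch groups processor values by their (physical id, core id) pair and emits a rankfile with p-prefixed slots. The function names are hypothetical, and the sketch assumes the usual x86 Linux layout of /proc/cpuinfo, with blank-line-separated blocks containing processor, physical id and core id fields. Other architectures may lack some of these fields, in which case each processor is treated as its own core.

```python
from collections import defaultdict


def hyperthread_siblings(cpuinfo_text):
    """Map (physical id, core id) -> list of `processor` values sharing that core."""
    cores = defaultdict(list)
    entry = {}
    # The extra empty string makes sure the final block is flushed too.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if "processor" in entry:
                proc = int(entry["processor"])
                # Fallbacks (assumption): socket 0, and one core per processor.
                key = (int(entry.get("physical id", 0)),
                       int(entry.get("core id", proc)))
                cores[key].append(proc)
            entry = {}
            continue
        name, _, value = line.partition(":")
        entry[name.strip()] = value.strip()
    return dict(sorted(cores.items()))


def rankfile_from_siblings(cores, host="localhost"):
    """Assign consecutive ranks to the hardware threads of each core in turn."""
    lines = []
    rank = 0
    for _, procs in sorted(cores.items()):
        for proc in sorted(procs):
            lines.append(f"rank {rank}={host} slot=p{proc}")
            rank += 1
    return "\n".join(lines) + "\n"


# Example use on a Linux machine:
#   with open("/proc/cpuinfo") as f:
#       print(rankfile_from_siblings(hyperthread_siblings(f.read())), end="")
```

This binds one rank to every hardware thread. To use only the first few cores, as in the four-rank example above, trim the dictionary before generating the rankfile.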

Granulation answered 1/8, 2012 at 14:56 Comment(1)

For future travellers, I have a followup question here – Spermatozoon 25/2, 2013 at 18:55

I'm not sure if you have multiple machines or not, and the exact details of how you want the processes distributed, but I'd consider reading up:

mpirun man page

The manual indicates that it has ways of binding processes to different things, including nodes, sockets, and cpu cores.

Note that you will get roughly this effect if you simply run twice as many processes as you have CPU cores, since the OS scheduler will tend to distribute them evenly over the cores to share the load.

Assuming you have a dual-core CPU, I'd try something like the following, though the manual is somewhat ambiguous and I'm not 100% sure it will behave as intended:

mpirun -np 4 -npersocket 4 ./application

Apologize answered 31/7, 2012 at 21:50 Comment(2)

I am running the application on one machine using shared mem option, – Queue 31/7, 2012 at 21:53

The workaround this uses is the -npersocket 4. Just set it to twice the number of CPU cores on each socket and it will dispatch 2 processes for each core. They won't be bound to the cores, but they will distribute themselves on their own. – Apologize 31/7, 2012 at 21:55

If you use PBS or a similar batch system, I would suggest a submission like this:

qsub -l select=128:ncpus=40:mpiprocs=16 -v NPROC=2048 ./pbs_script.csh

In this submission I select 128 compute nodes, request all 40 cores on each of them (ncpus=40), and run only 16 MPI processes per node (mpiprocs=16), which gives 128 x 16 = 2048 processes in total. In my case, each node has 20 physical cores.

Requesting all 40 cores reserves the whole node, so nobody else can use its resources. This keeps other people's jobs from landing on the same node and competing with yours.
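The arithmetic behind that submission line can be checked with a few lines of Python. This is only a sketch: pbs_select is a hypothetical helper, and it does nothing more than format the select string used above and multiply nodes by processes per node.

```python
def pbs_select(nodes, ncpus, mpiprocs):
    """Return the `select=...` resource string and the total MPI process count.

    nodes    -- number of chunks (here: whole nodes) requested
    ncpus    -- CPUs reserved on each node
    mpiprocs -- MPI processes started on each node
    """
    spec = f"select={nodes}:ncpus={ncpus}:mpiprocs={mpiprocs}"
    return spec, nodes * mpiprocs


if __name__ == "__main__":
    spec, total = pbs_select(nodes=128, ncpus=40, mpiprocs=16)
    print(f"qsub -l {spec} -v NPROC={total} ./pbs_script.csh")
```

With the values from the example, the total comes out to 2048, matching NPROC in the submission line.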

Sheritasherj answered 1/6, 2015 at 14:31 Comment(0)

Using Open MPI 4.0, the two commands:

mpirun --oversubscribe -c 8 ./a.out

and

mpirun -map-by hwthread:OVERSUBSCRIBE -c 8 ./a.out

worked for me (I have a Ryzen 5 processor with 4 cores and 8 logical cores).

I tested with a do loop that includes operations on real numbers. All logical threads are used, though there seems to be no speedup, since the computation takes twice as long as with the -c 4 option (with no oversubscribing).

Wagonette answered 7/3, 2019 at 11:6 Comment(0)

You can run mpirun --use-hwthread-cpus ./application

In this case, Open MPI treats each thread provided by Hyper-Threading as a processor. This contrasts with the default behavior, where a processor is a CPU core.

Open MPI refers to the threads provided by Hyper-Threading as "hardware threads" when you use this option, and allocates one Open MPI processor per hardware thread.

Hardspun answered 28/5, 2021 at 2:4 Comment(0)
