What does it mean to configure MPI for shared memory?

I have a bit of research related question.

Currently I have finished the implementation of a structured skeleton framework based on MPI (specifically Open MPI 6.3). The framework is supposed to be used on a single machine. Now I am comparing it with other, earlier skeleton implementations (such as Scandium, FastFlow, ...).

One thing I have noticed is that the performance of my implementation is not as good as that of the other implementations. I think this is because my implementation is based on MPI (and thus on two-sided communication that requires matching send and receive operations), while the other implementations I am comparing against are based on shared memory. (... but I still have no good explanation for this, and it is part of my question.)

There is a big difference in the completion times of the two categories.

Today I was also introduced to the shared-memory configuration of Open MPI, here => openmpi-sm

and here comes my question.

1st: what does it mean to configure MPI for shared memory? I mean, while MPI processes live in their own virtual address spaces, what do the flags in the following command really do? (I thought that in MPI every communication happens by explicitly passing a message and no memory is shared between processes.)

    shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out

2nd: why is the performance of MPI so much worse compared to the other skeleton implementations developed for shared memory? After all, I am also running it on one single multi-core machine. (I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)

Any suggestion or further discussion is very welcome.

Please let me know if I need to clarify my question further.

Thank you for your time!

Gapin answered 21/11, 2012 at 21:14 Comment(0)

Open MPI is very modular. It has its own component model, called the Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from: it is used to provide run-time values for MCA parameters exported by the different components in the MCA.
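As a quick illustration, you can list the parameters a component exports with ompi_info and override one of them for a single run with --mca. The exact parameter set differs between Open MPI versions, so treat the names below as examples and check what your own installation reports:

    shell$ ompi_info --param btl sm                              # list the MCA parameters of the sm BTL
    shell$ mpirun --mca btl_sm_eager_limit 8192 -np 16 ./a.out   # override one of them for this run only

The same values can also be set through environment variables of the form OMPI_MCA_<parameter name> or in $HOME/.openmpi/mca-params.conf, which is convenient when you do not want to clutter the mpirun command line.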

Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared-memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can reach the other node. It prefers fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback, provided the tcp BTL component is in the list of allowed BTLs.
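A simple experiment that makes the effect of the BTL choice visible on a single node is to force a particular set of BTLs and compare the run times. The self BTL must always be in the list since it handles messages a process sends to itself; otherwise the command mirrors the one from your question:

    shell$ mpirun --mca btl self,sm -np 16 ./a.out    # shared memory between ranks on the node
    shell$ mpirun --mca btl self,tcp -np 16 ./a.out   # TCP over the loopback interface, even on one node

The difference between the two timings gives a rough idea of how much the shared-memory path is already buying you.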

By default you do not need to do anything special in order to enable shared-memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared-memory part of the Open MPI FAQ, which gives hints on which parameters of the sm BTL can be tweaked in order to get better performance. My experience with Open MPI is that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems.

Note that the default shared-memory communication implementation copies the data twice - once from the send buffer into shared memory and once from shared memory into the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI can perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and goes directly from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
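If you still want to experiment with the sm parameters from that FAQ entry, a hedged sketch of such a run is shown below. The values are arbitrary examples rather than recommendations, and btl_sm_use_knem only does something if your Open MPI build was configured with KNEM support (verify which parameters your build actually exports with ompi_info --param btl sm):

    shell$ mpirun --mca btl_sm_eager_limit 8192 --mca btl_sm_use_knem 1 -np 16 ./a.out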

Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If the data is stored in shared memory, this could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory that belongs to a remote socket on the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it pin each MPI process to its own socket.
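For example, with the option spelling used by the Open MPI 1.4/1.6 series (newer releases write it as --bind-to socket), a bound launch that also prints the resulting placement looks like this:

    shell$ mpiexec --bind-to-socket --report-bindings -np 16 ./a.out

The --report-bindings output lets you confirm where each rank actually ended up before you start timing anything.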

Plessor answered 21/11, 2012 at 22:24 Comment(9)
FWIW, as of Linux 3.2, there are the process_vm_readv/writev syscalls, which do roughly the same as KNEM. See e.g. man7.org/linux/man-pages/man2/process_vm_readv.2.htmlCoquina
@janneb, thanks for pointing that out, but 3.x kernels are not very popular with the majority of production HPC systems now. Yet KNEM provides much more than simple data transfers, e.g. async operations, completion notifications, etc.Plessor
That's true, but then again, neither are kernels with the KNEM patch.Coquina
KNEM is not a patch. You can build it against the kernel that comes with your distribution and then simply modprobe it. It builds against any kernel version since 2.6.15.Plessor
@Hristo Iliev Hey Hristo Iliev, that is so informative, thank you very much. I will for sure look into KNEM. I will accept this as an answer as well. For now I will leave it as it is to get some answers to my second question too. Thanks :)Gapin
@hankol, both frameworks that you have linked to appear to be thread-based instead of process-based. Multithreaded applications share all of their data in the same address space and benefit from things like cache reuse and much simpler synchronisation mechanisms. It is perfectly OK for them to run faster than a typical MPI implementation. That's why hybrid programming (mixing MPI with threads) is becoming more and more popular nowadays - threading on each node, MPI between the nodes.Plessor
@Hristo Iliev. Thank you very much for your time. I appreciate all your help (also on my previous questions related to MPI) :) Thanks!Gapin
@Hristo Iliev. I have taken a look at KNEM; it looks like an interesting kernel module. One thing I do not understand is its usage, specifically the right flags that have to be used while running my MPI program. Is something like --mca btl_sm_knem_dma_min 4860 enough, or do I have to add more flags like --mca btl_sm_eager_limit 4276 in the same run? Or can you please suggest a good documentation link about KNEM flag usage? I have tried to look around but found no good info regarding this stuff. Otherwise I will end up testing each parameter with a different value each time. Thank youGapin
@hankol, unfortunately I can only point you to the source code of Open MPI. The documentation is scarce. You can address specific questions to the Open MPI User mailing list (or to the Development list - essentially the same people read both lists).Plessor
