MPI_SEND takes huge part of virtual memory
While debugging my program on a large number of cores, I ran into a very strange error about insufficient virtual memory. My investigation led me to the piece of code where the master sends small messages to each slave. I then wrote a small test program in which one master simply sends 10 integers with MPI_SEND and all slaves receive them with MPI_RECV. Comparing /proc/self/status before and after MPI_SEND showed that the difference in memory size is huge! The most interesting thing (which crashes my program) is that this memory is not deallocated after MPI_Send and still takes up huge space.

Any ideas?

 System memory usage before MPI_Send, rank 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   251400 kB                                                                                 
VmSize:   186628 kB                                                                                 
VmLck:        72 kB                                                                                  
VmHWM:      4068 kB                                                                                  
VmRSS:      4068 kB                                                                                  
VmData:    71076 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       148 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3                                                                                          

 System memory usage after MPI_Send, rank 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   456880 kB                                                                                 
VmSize:   456872 kB                                                                                 
VmLck:    257884 kB                                                                                  
VmHWM:    274612 kB                                                                                  
VmRSS:    274612 kB                                                                                  
VmData:   341320 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       676 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3        
Post answered 26/10, 2012 at 14:23 Comment(4)
Which MPI implementation are you using, and what kind of network is this on?Rinaldo
It's Friday and most regular SO visitors have long since depleted their mana and are unable to guess things that you do not provide in your question. From the amount of registered memory I would guess that you are running over InfiniBand or another RDMA-enabled network. From the size of the data segment and the amount of registered memory I would also guess that you are not reusing the same buffer for all send operations but rather constantly allocating a new one. Please prove me wrong by telling us what MPI library you use and showing us your sender's source code.Streaming
MPI: impi-4.0.3 COMPILER: intel-13.0Post
QDR Infiniband 4x/10G Ethernet/Gigabit EthernetPost
This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as a data source or target again. If the registered memory was heap memory, it is never returned to the operating system, and that's why your data segment only grows in size.

Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs high memory overhead. Most people don't really think about this and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) which are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.

There are certain parameters that control the IB queues used by Intel MPI. The most important in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of possible performance implications though. You can also try dynamic preallocated buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI would initially register small buffers and will later grow them if needed. Note also that IMPI opens connections lazily, and that's why you see the huge increase in used memory only after the call to MPI_Send.
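
For example, a job script fragment along these lines (the values are starting points to experiment with, not recommendations; the binary name is a placeholder):

```shell
export I_MPI_DAPL_BUFFER_NUM=8          # fewer preregistered buffers per connection (default 16)
export I_MPI_DAPL_BUFFER_ENLARGEMENT=1  # start with small buffers, grow them on demand
mpirun -np 400 ./test_send_size
```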

If you are not using the DAPL transport, e.g. you are using the ofa transport instead, there is not much that you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1. This should somewhat decrease the memory used. Also, enabling dynamic queue pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might decrease memory usage if the communication graph of your program is not fully connected (a fully connected program is one in which each rank talks to all other ranks).
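
A hypothetical fragment for the ofa path (whether XRC actually helps depends on HCA support; the binary name is a placeholder):

```shell
export I_MPI_FABRICS=shm:ofa      # shared memory + OpenFabrics verbs instead of DAPL
export I_MPI_OFA_USE_XRC=1        # share queues between processes on the same node
export I_MPI_OFA_DYNAMIC_QPS=1    # create queue pairs only for ranks that actually talk
mpirun -np 400 ./test_send_size
```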

Streaming answered 26/10, 2012 at 15:20 Comment(16)
Thank you both for the complete answers! I am really surprised, because I would understand a memory increase of 10-20 MB (1 master, 400 slaves, message size = 10 integers), but 300 MB!! And my problem must use thousands of cores... <br/> So, as a result, my big model starts on 1000 cores and the master distributes some small parameters at the initialization phase. The things that are still not clear to me: (1) Why is the memory increase so huge? The messages are so small! (2) Does this allocation mean that the master process now has a very large buffer for internal MPI purposes until the end of the program?Post
The huge buffers are mostly to win in benchmarks :) People look at the latency numbers and forget about memory usage. Basically, the vendor runs a benchmark and finds that it is faster to copy a message less than X bytes than to do the memory registration. So that becomes the eager buffer size. I guess this issue is now acknowledged: there is the dynamic resize feature that Hristo pointed out. The internal buffers will stay registered until the end of the program. What's worse, they are pinned in physical memory, and can't even be swapped out (hence your error about virtual memory).Rinaldo
(3) What do you mean by 'reuse buffers'? I hope MPI doesn't allocate new memory for every new message array that I use in MPI_SEND, am I right?Post
@user1671232 (3) No, the idea is you reuse buffers in your code. If your messages are larger than the eager threshold, then each time you access a new buffer it is registered first (which is slow). Using the same buffer more than once is much faster. But it won't help with small messages.Rinaldo
So, for example, I have two MPI_SEND calls: to the first I pass Array1(1), to the second Array2(1000). Does IMPI allocate two different memory buffers for these arrays?Post
@user1671232, by default the DAPL provider for IMPI preallocates 16 buffers of 16640 bytes each (260 KiB in total) per connection. With 400 slaves this results in 102 MiB. The documentation doesn't make it clear if these are used both as send and receive buffers or one has to multiply by two. The latter gives 204 MiB, quite close to 252 MiB in your case.Streaming
@user1671232 No, there are always I_MPI_DAPL_BUFFER_NUM buffers per connection. If you run out of them (say there are 16 buffers and you send 17 messages to the same rank in quick succession), MPI will wait for one of the internal buffers to free up. It may also aggregate the messages into a single one.Rinaldo
I have now tried both options (I_MPI_OFA_DYNAMIC_QPS and I_MPI_DAPL_BUFFER_ENLARGEMENT), but no changes... How can I find out what type of transport is used on my system? Maybe some command or environment variable?Post
You can try to enforce a specific fabric by setting I_MPI_FABRICS. You could specify shm:ofa in order to use shared memory and OpenFabrics verbs or shm:dapl for shared memory and DAPL. With DAPL you can try the UD mode by setting I_MPI_DAPL_UD to enable.Streaming
Alright, after a night I have formed 3 last questions for complete understanding. (1) I tried message sizes of 10, 1000 and 1000000 and the result was approximately the same. Does that mean that every time the master process allocates the predefined 16 buffers per connection and uses only them? (2) At initialization the master MPI_SENDs data; at the finalization phase it will MPI_RECV it from the same processors. Will the master reuse the buffers or allocate more? (3) Only the option I_MPI_DAPL_UD=enable brought super results - the memory remains approximately the same. Where is the magic? ))Post
UD is the Unreliable Datagram InfiniBand protocol. It is roughly equivalent to UDP. Don't let the name fool you - it is quite reliable in most cases and the library makes retransmissions if necessary. As for the different message sizes, when the size goes above a certain threshold the user buffer gets registered directly and the data is not copied to the preallocated buffers (ok, part of it is, but I don't want to go into much detail). But still those buffers are precreated for each connection, no matter if you are going to send small messages or not.Streaming
So, the UD protocol preallocates nothing and directly uses my buffers? Because I measured the run time and it is the same.Post
@Post No, even with UD there will be preallocated buffers. And Hristo is right: there really is no way to get rid of them :) As for your application crashes, how much memory do your nodes have? And what's the setting for memlock in /etc/security/limits.conf?Rinaldo
On the contrary, now everything is working fine: I run my job on 5000 cores and only a small amount of memory is added after MPI_SEND. And the run time is the same as before the modifications, therefore the new protocol (UD) doesn't affect performance. And my question is - where is the magic? ))Post
@Post I updated my answer with an explanation for this behavior.Rinaldo
@vovo, as an Intel MPI user you may find this presentation, which Michael Klemm from Intel gave recently at one of our tuning workshops, very handy - 23 tips for performance tuning with the Intel® MPI Library.Streaming
Hristo's answer is mostly right, but since you are using small messages there's a bit of a difference. The messages end up on the eager path: they first get copied to an already-registered buffer, then that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on their end. Reusing buffers in your code will only help with large messages.

This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.

These eager buffers are somewhat wasteful. For example, they are 16kB by default on Intel MPI with OF verbs. Unless message aggregation is used, each 10-int-sized message is eating four 4kB pages. But aggregation won't help when talking to multiple receivers anyway.

So what to do? Reduce the size of the eager buffers. This is controlled by setting the eager/rendezvous threshold (the I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Or change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.
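
As a sketch (DAPL transport assumed; the binary name and rank count are placeholders):

```shell
export I_MPI_RDMA_EAGER_THRESHOLD=2048  # shrink the eager buffers; may add latency
export I_MPI_DAPL_BUFFER_NUM=8          # and/or reduce the buffer count per connection
mpirun -np 1000 ./my_model
```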


Edit: So the final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.

IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.

There is an optimization possible: sharing the eager buffers between connections (this is called SRQ - Shared Receive Queue). There's a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing further: between the processes that are on the same node. By default Intel's MPI accesses the IB device through DAPL, and not directly through OF verbs. My guess is this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).

When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware to handle retransmissions. This is all transparent to the application code using MPI, of course.

Rinaldo answered 26/10, 2012 at 15:40 Comment(0)
