Communicate data with `count` value close to `INT_MAX`

The Message Passing Interface API always uses int as the type for count arguments. For instance, the prototype for MPI_Send is:

int MPI_Send(const void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

This may be a problem if the number of elements to be sent or received grows near or even beyond INT_MAX.

Of course, the issue may be worked around by lowering the value of count, either by:

  1. splitting a single call into multiple calls
  2. defining an (unnecessary) aggregate MPI_Datatype
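
For instance, the second workaround might look roughly like the following sketch in C (the function name and chunk size are illustrative, not from any library; it only handles the case where the total count is an exact multiple of the chunk size):

#include <mpi.h>

/* Pack a fixed chunk of elements into a derived datatype and send
   "total / chunk" of those chunks; total may exceed INT_MAX. */
int send_large(const double *buf, long long total, int dest, int tag, MPI_Comm comm)
{
   const int chunk = 1 << 20;            // elements per derived-type element
   MPI_Datatype chunk_type;

   MPI_Type_contiguous(chunk, MPI_DOUBLE, &chunk_type);
   MPI_Type_commit(&chunk_type);

   int err = MPI_Send(buf, (int)(total / chunk), chunk_type, dest, tag, comm);

   MPI_Type_free(&chunk_type);
   return err;
}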

Both approaches are, however, more of a hack than a real solution, especially if implemented with simple heuristics. What I would like to ask is therefore:

Is there a better idiom to handle these kinds of cases with standard MPI calls? If not, does anybody know of some (solid) wrapper library built around MPI to overcome this limitation?

Hellenhellene answered 12/5, 2014 at 14:23 Comment(0)

The MPI Forum is extremely reluctant to make major changes to the MPI API in order to introduce 64-bit support. The reasons for that are maintaining backward compatibility and not introducing seldom-used features - the process appears to be almost as rigorous as the one that keeps Fortran 2xxx largely compatible with prehistoric FORTRAN IV programs.

As is evident from the ticket, creating a large datatype to work around the limitation is actually viewed as a not-so-hackish solution by many, including William D. Gropp himself:

First, it is possible to send much larger data by simply creating an appropriate MPI Datatype (this could be easier, but it is possible). Second, sending such large data will take seconds (at least!) on current platforms (8GB just for 4-byte integers and a 2GB count) - so this should not be a common operation (and the overhead of creating and committing and freeing a datatype should be negligible).

The fact that MPI-3.0 introduced official support for building large (more than 2^31 elements) datatypes, while the proposal to change the count argument of calls like MPI_SEND to MPI_Count / INTEGER(KIND=MPI_COUNT_KIND) was rejected, should hint at the way of thinking that prevails in the MPI Forum. Even before MPI-3.0, 64-bit internal sizes had been in use by some implementations for years (e.g. Open MPI), while others have chosen to remain on the 32-bit bandwagon (e.g. Intel MPI).
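
A minimal sketch of that datatype-based approach for an arbitrary count (helper name and chunk size are mine; no error checking, and the number of chunks is assumed to fit into an int) could look like this: a "vector of full chunks" plus a "remainder" glued together with MPI_Type_create_struct.

#include <mpi.h>

/* Build a single datatype describing "count" elements of "oldtype",
   where count may exceed INT_MAX. */
int type_large(MPI_Count count, MPI_Datatype oldtype, MPI_Datatype *newtype)
{
   const MPI_Count chunk = 1 << 20;       // elements per chunk
   MPI_Count nchunks   = count / chunk;
   MPI_Count remainder = count % chunk;

   MPI_Aint lb, extent;
   MPI_Type_get_extent(oldtype, &lb, &extent);

   MPI_Datatype chunks, rest;
   MPI_Type_vector((int)nchunks, (int)chunk, (int)chunk, oldtype, &chunks);
   MPI_Type_contiguous((int)remainder, oldtype, &rest);

   // The remainder starts right after the nchunks*chunk leading elements.
   MPI_Aint     displs[2]  = { 0, (MPI_Aint)(nchunks * chunk) * extent };
   int          blklens[2] = { 1, 1 };
   MPI_Datatype types[2]   = { chunks, rest };
   MPI_Type_create_struct(2, blklens, displs, types, newtype);
   MPI_Type_commit(newtype);

   MPI_Type_free(&chunks);
   MPI_Type_free(&rest);
   return MPI_SUCCESS;
}

// Usage: MPI_Datatype big; type_large(n, MPI_DOUBLE, &big);
//        MPI_Send(buf, 1, big, dest, tag, comm); MPI_Type_free(&big);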

Vicarage answered 12/5, 2014 at 21:1 Comment(10)
You'd think that the first .0 release in 14 years would have been the perfect time to introduce a badly needed but breaking change... It wouldn't have even necessarily broken backwards compatibility in the modern fortran bindings. Bah.Sarcomatosis
@JonathanDursi It seems that Intel MPI is still on the 32-bit side (see slide 7), so I assume the only portable way to do this is to split calls (despite the mainstream way of thinking in the MPI Forum)Hellenhellene
@Massimiliano, while that is an old presentation, it is still correct as of today. I work for Intel as support for the Intel® MPI Library (among other things). We've got a release coming up (currently in Beta, see http://bit.ly/sw-dev-tools-2015-beta for details) that will support the MPI_Count approach used in MPI-3.Springclean
@Hellenhellene Splitting Send-Recv calls is not equivalent due to potential changes to the original match-ordering behavior.Karlynkarma
@JonathanDursi It was discussed. At length. blogs.cisco.com/performance/… has a summary of some of the key arguments.Karlynkarma
@Jeff - the only argument there is "it would be a backward compatibility nightmare", which is why the first .0 release in 14 years would have been the perfect place for it, presumably with some compatibility flag or something for old code. Yes, there are workarounds for the fact that in 2015, the self-proclaimed lingua franca of supercomputing has a hardcoded 32-bit limit in much of its API; awesome.Sarcomatosis
@JonathanDursi It's not a practical issue for at least 95% of users and the backwards compatibility issue is less trivial than you contend. Feel free to attend the MPI Forum some time and hash it out with the vendors, which have paying customers to support, or the DOE-NNSA, which has a nuclear weapons stockpile to certify.Karlynkarma
@Jeff I get that that's who drives MPI now - owners of poorly-maintained legacy codes, and the behind-the-fence shops at the DOE. I just wish the Forum would make even superficial effort to support new developers, who have other options - options that aren't so rigidly bound to an architecture designed 25 years ago.Sarcomatosis
@JonathanDursi If you attended the MPI Forum even once, I doubt you would make such statements.Karlynkarma
It took me a big chunk of the summer of 2014, but MPICH (and its derivatives, once they pick up the changes) is internally 64 bit clean -- or should be.Fearfully

I am the lead developer of BigMPI and co-authored a paper entitled To INT_MAX... and beyond!: exploring large-count support in MPI that discusses this exact topic in far more detail than space permits here.

If you cannot access the ACM DL freely, you can download the Argonne preprint or check out the paper's source repo.

Here are the key highlights from this effort:

  • BigMPI is a relatively high-quality interface to MPI that supports 64-bit integer counts (the type is technically MPI_Count, but MPI_Aint is used internally). Ironically, it does not make use of the MPI-3 large-count features. This is because BigMPI is not completely general, but rather aims to support the most common usage models (a usage sketch follows this list).

  • BigMPI was designed in part to be educational. It employs the ultra-permissive MIT License to make it possible for anyone to copy code from it into another project, possibly with changes to meet an unforeseen need.

  • Exceeding INT_MAX in the MPI-3 interface isn't just a minor problem. It is invalid ISO C code: the overflow behavior of signed integers is, unlike that of unsigned integers, undefined. So the primary problem isn't with MPI; it's with the fact that a C int cannot hold numbers larger than INT_MAX. It is a matter of debate whether it is a problem with MPI that the count argument is specified to be the C int type, as opposed to size_t, for example. Before saying it's obvious that MPI should have switched to size_t, you need to understand the history of MPI and the importance of ABI compatibility to a subset of MPI users.

  • Even with BigMPI or similar datatype-based methods, implementations may have bugs. This means that doing the standard-compliant thing will not work, because internally an MPI implementation might improperly store something like count*sizeof(type) into a 32b value, which can overflow for a valid count like one billion if sizeof(type) is eight, for example. As noted in the aforementioned paper, in addition to these bugs - which appear to be absent in recent versions of MPICH and Open-MPI - there are bugs in POSIX functions that must be mitigated.

  • The situation with Fortran is more complicated. The default Fortran INTEGER size is not specified, and MPI implementations should, in theory, respect whatever the compiler uses. However, this is often not the case in practice. I believe many MPI implementations are broken for counts above INT_MAX due to the use of C int internally. BigMPI does not have a Fortran interface, although I have some desire to write one some day. Until then, please pester MPI implementers to do the right thing w.r.t. Fortran INTEGER casting to C types internally.
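
To give a feel for the interface, here is a hypothetical usage sketch, assuming BigMPI's MPIX_..._x naming and header name (check the README for the exact signatures and the set of supported operations):

#include <mpi.h>
#include <stdlib.h>
#include <bigmpi.h>   // BigMPI's header; name assumed, check the repo

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);

   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   MPI_Count n = 3000000000;             // more elements than INT_MAX
   char *buf = malloc((size_t)n);

   // Large-count variants take MPI_Count instead of int for the count.
   if (rank == 0)
      MPIX_Send_x(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
   else if (rank == 1)
      MPIX_Recv_x(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

   free(buf);
   MPI_Finalize();
   return 0;
}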

Anyways, I do not wish to transcribe the entire contents of our paper into this post, particularly since it is freely available, as is the source code. If you feel this post is inadequate, please comment and I'll try to add more later.

Finally, BigMPI is research code and I would not say it is finished (however, you should not hit the unfinished code). Users are strongly encouraged to perform their own correctness testing of BigMPI and the MPI implementation prior to use in production.

Karlynkarma answered 1/4, 2015 at 21:33 Comment(0)

I am unaware of any existing wrappers that handle this, but you could write your own. Most MPI implementations have an additional layer intended for profiling (PMPI). You can use this layer for other purposes, in this case splitting a message. The way this layer works is that you call the desired MPI function, and it immediately calls the PMPI version of that function. You can write a wrapper of the MPI version that splits the message and calls the PMPI version for each chunk. Here is an extremely simple example I wrote long ago for splitting MPI_Bcast:

#include <mpi.h>

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype,
   int root, MPI_Comm comm ) {

   /*
      This function is a simple attempt at automatically splitting MPI
      messages, in this case MPI_Bcast.  By utilizing the profiling interface
      of MPI, this function is able to intercept a call to MPI_Bcast.  Then,
      instead of the typical profiling, the message size is checked.  If the
      message is larger than the maximum allowable size, it will be split into
      multiple messages, each of which will be sent individually.  This
      function is not intended for high performance; it is intended to add
      capability without requiring access to the source code of either the MPI
      implementation or the program using MPI.  The intent is to compile
      this as a shared library and preload this library to catch MPI calls.
   */

   int result;
   int typesize;
   long totalsize;
   long maxsize=1;

   // Set the maximum size (in bytes) of a single message: (1<<31)-1 == INT_MAX

   maxsize=(maxsize<<31)-1;

   // Get the size of the message to be sent

   MPI_Type_size(datatype, &typesize);
   totalsize=static_cast<long>(typesize)*static_cast<long>(count);

   // Check the size

   if (totalsize > maxsize) {
      // The message is too large, split it
      /*
         Ideally, this should be tailored to the system, possibly split into
         a minimum of equally sized messages that will fit into the maximum
         message size.  However, this is a very simple implementation, and
         is focusing on proof of concept, not efficiency.
      */
      int elementsPerChunk=maxsize/typesize;    // Number of elements per chunk
      int remCount=count;                       // Remaining number of elements
      char *address=static_cast<char*>(buffer); // Starting address
                                          // Cast to char to perform arithmetic
      int nChunks=count/elementsPerChunk;       // How many chunks to send
      if (count%elementsPerChunk!=0) nChunks++; // One more for any remaining elements
      int chunkCount;                           // Number of elements in current chunk

      // Send one chunk at a time

      for (int i=0;i<nChunks;i++) {
         // Determine how many elements to send

         if (remCount>elementsPerChunk) {
            chunkCount=elementsPerChunk;
         } else {
            chunkCount=remCount;
         }

         // Decrement the remaining elements

         remCount-=chunkCount;

         // Send the message chunk
         /*
            There is room for improvement here as well.  One key concern is the
            return value.  Normally, there would be a single return value for
            the entire operation.  However, as the operation is split into
            multiple operations, each with its own return value, a decision must
            be made as to what to return.  I have chosen to simply use the
            return value from the last call.  This skips over some error checking
            but is not critical at present.
         */

         result=PMPI_Bcast(static_cast<void*>(address),chunkCount,datatype,root,comm);

         // Update the address for the next chunk

         address+=chunkCount*typesize;
      }
   } else {
      // The message is small enough, just send as it is
      result=PMPI_Bcast(buffer,count,datatype,root,comm);
   }

   // Pass the return value back to the caller

   return result;

}

You can write something similar for MPI_Send (and MPI_Recv) and get the functionality you want. But if this is only for one program, you might be better off just modifying that program to send in chunks.

Springclean answered 12/5, 2014 at 15:13 Comment(4)
Thanks for sharing this code. I'll write my own wrappers if I am obliged to. Anyhow, as this issue has been known since MPI 1.0, I was wondering if something more reliable than a hand-written for loop existed out there :-)Hellenhellene
1) Why 31 in (maxsize<<31)-1. Why not maxsize=LONG_MAX? 2) What keeps static_cast<long>(typesize)*static_cast<long>(count); from overflowing?Bravin
maxsize is based on an int, so LONG_MAX would be too high. If you want, use INT_MAX, but this was done to ensure staying inside a 32-bit signed integer.Springclean
totalsize could very well overflow, this was not considered when I wrote it. Suggestions?Springclean

I haven't used it myself, but there is a wrapper called BigMPI that exists to help you out. You'll have to take a look at the GitHub README to find out more about how to use it, but I think it takes care of some of the nastiness of this.

Gallium answered 28/5, 2014 at 19:7 Comment(1)
Thanks for the pointer. I just wrote a very long post about BigMPI :-)Karlynkarma
