Why isn't Hadoop implemented using MPI?

Correct me if I'm wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes.

What are the technical reasons for this?

I could hazard a few guesses, but I don't know enough about how MPI is implemented "under the hood" to know whether I'm right.

Come to think of it, I'm not entirely familiar with Hadoop's internals either. I understand the framework at a conceptual level (map/combine/shuffle/reduce and how that works at a high level), but I don't know the nitty-gritty implementation details. I've always assumed Hadoop was transmitting serialized data structures (perhaps Google Protocol Buffers) over a TCP connection, e.g. during the shuffle phase. Let me know if that's not true.

Nariko answered 4/1, 2011 at 4:34 Comment(0)

One of the big features of Hadoop/MapReduce is its fault tolerance, and fault tolerance is not supported in most (any?) current MPI implementations. It is being thought about for future versions of Open MPI.

Sandia National Laboratories has a version of MapReduce which uses MPI, but it lacks fault tolerance.

Pancho answered 4/1, 2011 at 5:48 Comment(2)
So you're saying the reason isn't inherent to the MPI paradigm itself, just to current implementations? It sounds like, currently, corrupt network messages or fickle nodes could bring down an MPI system. Suppose both of those factors were removed: would there still be any reason not to implement Hadoop using MPI?Nariko
I think this is a reasonable answer.Rosalia

MPI is the Message Passing Interface. Right there in the name - there is no data locality. You send the data to another node for it to be computed on, which makes MPI network-bound in terms of performance when working with large data.

MapReduce with the Hadoop Distributed File System replicates data so that you can do your computation on local storage - streaming off the disk and straight to the processor. MapReduce thus takes advantage of local storage to avoid the network bottleneck when working with large data.

This is not to say that MapReduce doesn't use the network - it does, and the shuffle is often the slowest part of a job! But it uses the network as little, and as efficiently, as possible.

To sum it up: Hadoop (and Google's work before it) did not use MPI because it could not have used MPI and worked. MapReduce systems were developed specifically to address MPI's shortcomings in light of hardware trends: exploding disk capacity (and data size along with it), stagnant disk speeds, slow networks, peaked processor clock speeds, and multi-core taking over as the continuation of Moore's law.

Nitrosyl answered 20/10, 2012 at 20:36 Comment(2)
This is a pretty misguided answer. Most MPI programs don't send ALL data over the network. They're typically parallel simulations, and only send minimal updates to neighbors as the simulation progresses. E.g., halo exchange in a hydrodynamics code. For MapReduce, MPI doesn't make sense because it's not reliable: if one process dies, the whole job dies. This is the main reason MPI is not a great base for MapReduce. MPI is for tightly coupled apps on fast, reliable networks (supercomputers), while MapReduce is designed for running embarrassingly parallel workloads on slow, unreliable hardware.Pellagra
-1 for incorrect information. The "messages" being passed are not the entire data set, and MPI applications can definitely have data locality. MPI and Hadoop are somewhat orthogonal, and where they overlap is not where you have answered this question. Jobs executed using Hadoop could absolutely use MPI and work fine, it's just a much more bare-bones environment to work in that does less heavy lifting than Hadoop does (but with the benefit of more opportunities for optimization).Pruinose

The truth is that Hadoop could be implemented using MPI. MapReduce has been implemented on top of MPI for as long as MPI has been around. MPI has functions like MPI_Bcast (broadcast data to all ranks), MPI_Alltoall (send data from every rank to every rank), MPI_Reduce, and MPI_Allreduce. Hadoop removes the requirement to explicitly implement your data-distribution and result-gathering methods by packaging an outgoing communication command together with a reduce command.

The flip side is that you need to make sure your problem fits the 'reduce' pattern before you adopt Hadoop. It could be that your problem is a better fit for scatter/gather, in which case you should use Torque/Maui/SGE with MPI instead of Hadoop.

Finally, MPI does not write your data to disk as described in another answer, unless you follow your receive method with a write to disk. It works just as Hadoop does, by sending your process/data somewhere else to do the work. The important part is to understand your problem in enough detail to be sure MapReduce is the most efficient parallelization strategy, and to be aware that many other strategies exist.
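
To make those collectives concrete, here is a minimal sketch (in C, not tied to any particular MPI implementation) of the map/reduce pattern expressed directly with MPI_Reduce. The round-robin work split and the partial-sum "map" are illustrative assumptions, not Hadoop's actual mechanics:

    /* A minimal map/reduce sketch in raw MPI: each rank "maps" over its own
       slice of the input, then MPI_Reduce combines the partial results on
       rank 0. The round-robin work split here is purely illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Map": each rank computes a partial sum over its share of 0..999. */
        long local = 0;
        for (int i = rank; i < 1000; i += size)
            local += i;

        /* "Reduce": combine the partial sums on rank 0. */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }

Note how the data distribution and the result gathering are explicit here - exactly the boilerplate that Hadoop packages up for you.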

Mayfair answered 22/3, 2013 at 19:30 Comment(0)

There is no restriction that prevents MPI programs from using local disks. And of course MPI programs always attempt to work locally on data - in RAM or on local disk - just like all parallel applications. In MPI-2 (which is not a future version; it has been around for over a decade) it is possible to add and remove processes dynamically, which makes it possible to implement applications that can recover from, e.g., a process dying on some node.
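
As a rough illustration of that MPI-2 capability, here is a sketch of a manager spawning worker processes at runtime with MPI_Comm_spawn - the primitive a recovery scheme could use to replace a dead process. The separate "worker" executable and the count of four are hypothetical:

    /* Sketch of MPI-2 dynamic process management: a manager spawns workers
       at runtime. A fault-tolerant design could call MPI_Comm_spawn again
       to replace a worker that has died. The "worker" binary is hypothetical. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Comm workers;
        int errcodes[4];

        /* Launch 4 worker processes and get an intercommunicator to them. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, errcodes);

        /* ... exchange messages with the workers over `workers` ... */

        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }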

Perhaps Hadoop is not using MPI because MPI usually requires coding in C or Fortran and has a more scientific/academic developer culture, while Hadoop seems to be more driven by IT professionals with a strong Java bias. MPI is very low-level and error-prone, but it allows very efficient use of hardware, RAM, and network. Hadoop tries to be high-level and robust, at an efficiency penalty. MPI programming requires discipline and great care to be portable, and it still requires compilation from source code on each platform. Hadoop is highly portable, easy to install, and allows pretty quick-and-dirty application development. It's a different scope.

Still, perhaps the Hadoop hype will be followed by more resource-efficient alternatives, perhaps based on MPI.

Saxon answered 29/4, 2014 at 13:4 Comment(0)

With MapReduce 2.0 (MRv2), a.k.a. YARN, applications can be written (or ported to run) on top of YARN.

Thus, essentially, there will be a next-generation Apache Hadoop MapReduce (MAPREDUCE-279) and a way to support multiple programming paradigms on top of it, so one can write MPI applications on YARN. The MapReduce programming paradigm will always be supported as the default.

http://wiki.apache.org/hadoop/PoweredByYarn should give an idea of what applications have been developed on top of YARN, including Open MPI.

Costin answered 24/4, 2012 at 7:19 Comment(0)

If we just look at the map/reduce steps and the scheduling part of Hadoop, then I would argue that MPI is a much better methodology/technology. MPI supports many different exchange patterns like broadcast, barrier, allgather, and scatter/gather (or call it map-reduce). But Hadoop also has HDFS, which lets the data sit much closer to the processing nodes. And if you look at the problem space that Hadoop-like technologies were used for, the outputs of the reduction steps were actually fairly large, and you wouldn't want all that information to swamp your network. That's why Hadoop saves everything to disk. But the control messages could have used MPI, and the MPI messages could just carry pointers (URLs or file handles) to the actual data on disk...
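
To sketch that last idea (control plane over MPI, bulk data on disk): below, rank 0 sends only a path, and the receiving rank opens the file locally. The path "/data/part-00000", the message tag, and the buffer size are all made-up illustrative values:

    /* Sketch of a control plane over MPI: send a pointer to the data (a
       file path), not the data itself; the worker reads from disk locally.
       The path "/data/part-00000" is a made-up example. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char path[256];
        if (rank == 0) {
            /* Control message: where the data lives, not the data. */
            strcpy(path, "/data/part-00000");
            MPI_Send(path, 256, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(path, 256, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            FILE *f = fopen(path, "r");  /* the bulk data never crosses MPI */
            if (f) {
                /* ... process the file ... */
                fclose(f);
            }
        }

        MPI_Finalize();
        return 0;
    }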

Powel answered 29/7, 2014 at 16:55 Comment(0)