Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.

For example, one common rule of thumb I see can be summarized as follows (a rough code sketch of the two programming shapes appears after the list)...

  • Big data, non-iterative, fault tolerant => MapReduce
  • Speed, small data, iterative, non-Mapper-Reducer type => MPI
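
To make that rule of thumb concrete, here is a rough sketch of the two shapes, written in Python. The word-count functions are schematic rather than Hadoop's actual API, and the MPI part assumes mpi4py is installed; it is an illustration, not a recommended implementation.

```python
# --- MapReduce shape: stateless map and reduce over key/value records ---
def map_phase(line):
    # emit (word, 1) pairs; the framework shuffles and sorts them by key
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # all values for one key arrive together; no other process state exists
    return word, sum(counts)

# --- MPI shape: long-lived processes that keep state and talk directly ---
# run with e.g. `mpirun -n 4 python this_sketch.py`
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_value = float(comm.Get_rank())   # state stays in memory across steps
for _ in range(10):                    # iterative refinement, cheap per step
    total = comm.allreduce(local_value, op=MPI.SUM)
    local_value = total / comm.Get_size()
```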

But then I also see an implementation of MapReduce on top of MPI (MR-MPI), which does not provide fault tolerance but seems to be more efficient on some benchmarks than Hadoop MapReduce, and appears to handle big data via out-of-core processing.

Conversely, there are also MPI implementations (MPICH2-YARN) that run on the new-generation Hadoop YARN with its distributed file system (HDFS).

Besides, there seem to be provisions within MPI (Scatter-Gather collectives, Checkpoint-Restart, ULFM and other fault-tolerance work) that mimic several features of the MapReduce paradigm.
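
For example, the core map-shuffle-reduce data flow can be mimicked with nothing more than MPI's scatter and gather collectives. Below is a minimal mpi4py sketch of a word count done that way; it assumes the input fits in the root's memory and has none of Hadoop's fault tolerance, which is exactly what Checkpoint-Restart or ULFM would have to supply.

```python
from mpi4py import MPI
from collections import Counter

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root splits the input into one chunk per rank ("scatter" plays the role
# of Hadoop's input splits).
if rank == 0:
    lines = ["the quick brown fox", "the lazy dog", "the end"] * size
    chunks = [lines[i::size] for i in range(size)]
else:
    chunks = None
my_lines = comm.scatter(chunks, root=0)

# Local "map" plus combine: count words in this rank's chunk.
local_counts = Counter(word for line in my_lines for word in line.split())

# "Shuffle + reduce": gather the partial counts and merge them at the root.
partials = comm.gather(local_counts, root=0)
if rank == 0:
    totals = Counter()
    for p in partials:
        totals.update(p)
    print(totals.most_common(5))
```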

And how do Mahout, Mesos and Spark fit into all this?

What criteria can be used when deciding between (or combining) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?

Recognition answered 6/1, 2015 at 3:11 Comment(3)
Possible dup of #1530990? – Photomap
I did read that Q&A before posting mine. There, you will see that for every answer posted, there are comments saying the answer is not accurate. Take the first answer, for example: there are Finite Element implementations on MapReduce 1, 2. – Recognition
After I asked this question, I came across a few more options (to add to the confusion), like Akka, which seems not to be confined to "obviously parallel" scenarios like MapReduce, while also being fault tolerant and having bindings for Infiniband (TCP), etc. – Recognition

There might be good technical criteria for this decision, but I haven't seen anything published on them. There seems to be a cultural divide: MapReduce gets used for sifting through data in corporate environments, while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out which applies to you:

Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be a cluster of "white box" commodity systems connected via Ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027, which suggests that using Infiniband in a MapReduce environment is a relatively new thing.

So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).

If you have an application that would benefit from the Infiniband optimizations already baked into many MPI libraries, MPI may well be useful for you. If your app is relatively insensitive to network performance and spends most of its time on computation that doesn't require communication between processes, MapReduce is probably the better choice.

If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100 Mbit or Infiniband QDR to DDR, for example, then draw a line through the results and see whether buying a faster interconnect, with an MPI library tuned for it, would get you where you want to go.
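
The raw network numbers themselves are easy to measure at each throttle setting with a two-rank ping-pong micro-benchmark. Below is a rough mpi4py sketch of the idea (established suites such as the OSU micro-benchmarks do this far more rigorously); you would correlate its latency/bandwidth figures with your application's run times.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly two ranks, e.g. mpirun -n 2"

for size in (8, 8 * 1024, 8 * 1024 * 1024):       # message sizes in bytes
    buf = np.zeros(size, dtype='b')
    reps = 100
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    dt = (MPI.Wtime() - t0) / (2 * reps)           # one-way time per message
    if rank == 0:
        print(f"{size:>10} bytes: {dt * 1e6:8.1f} us, {size / dt / 1e6:8.1f} MB/s")
```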

Debug answered 8/4, 2015 at 12:38 Comment(0)

The link you posted about FEM being done on MapReduce (Link) actually uses MPI. It states this right there in the abstract: they combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.

Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic complexity point of view, and also from a measurement point of view.
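
To illustrate the kind of workload meant here (a hypothetical sketch, not from the answer): a simple 1-D stencil computation needs a halo exchange with its neighbours on every iteration. In MPI that is a pair of cheap point-to-point exchanges per step; forced into MapReduce, every step becomes a separate job with an HDFS write, a shuffle and a read.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

# Each rank owns 1000 cells of a 1-D grid plus one ghost cell on each side.
u = np.random.rand(1002)

for step in range(100):
    # Halo exchange with both neighbours: microseconds per step in MPI,
    # but a full MapReduce job per iteration if pushed through Hadoop.
    comm.Sendrecv(sendbuf=u[1:2],   dest=left,  recvbuf=u[-1:], source=right)
    comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[:1],  source=left)
    # Jacobi-style stencil update of the interior (owned) cells.
    u[1:-1] = 0.5 * (u[:-2] + u[2:])
```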

Quintin answered 21/4, 2015 at 17:49 Comment(0)
