Is there a canonical problem that provably can't be aided with map/reduce?

I'm trying to understand the boundaries of Hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist with.

It would certainly be interesting if changing one factor of the problem were enough to make it tractable with map/reduce.

Thank you

Marsupium answered 5/8, 2010 at 5:10 Comment(0)

Two things come to mind:

  1. Anything that requires real-time / interactive / low latency response times. There is a fixed cost incurred for any job submitted to Hadoop.

  2. Any problem that is not embarrassingly parallel. Hadoop can handle many problems that involve simple interdependencies between records, since records can be joined during the reduce phase. However, certain graph-processing and machine-learning algorithms are difficult to write in Hadoop because they have too many operations that depend on one another: every iteration ends up as its own job, paying that fixed cost again (see the sketch below). Some machine-learning algorithms also need very low-latency, random access to a large data set, which Hadoop does not provide out of the box.
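
To make the iteration point concrete, here is a minimal sketch in plain Python (not the Hadoop API) of a PageRank-style computation; the toy graph, damping factor and function names are made-up for illustration. Each pass of the loop corresponds to what would be a separate Hadoop job, each paying the full job-submission and HDFS read/write cost again.

    from collections import defaultdict

    # Hypothetical toy graph: node -> list of outgoing links.
    graph = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }
    rank = {node: 1.0 / len(graph) for node in graph}

    def map_phase(graph, rank):
        # Emit (target, contribution) pairs, one per outgoing link.
        for node, links in graph.items():
            for target in links:
                yield target, rank[node] / len(links)

    def reduce_phase(pairs, damping=0.85, n=3):  # n = number of nodes in the toy graph
        # Group contributions by node and recompute its rank.
        grouped = defaultdict(list)
        for node, contribution in pairs:
            grouped[node].append(contribution)
        return {node: (1 - damping) / n + damping * sum(contribs)
                for node, contribs in grouped.items()}

    # Each loop iteration would be a *separate* MapReduce job in practice:
    # read everything in, shuffle, write everything out, then start over.
    for _ in range(10):
        rank = reduce_phase(map_phase(graph, rank))

    print(rank)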

Apple answered 6/8, 2010 at 15:51 Comment(0)

It's not totally clear what you mean by "we know map/reduce can't assist in". The other answers seem content with examples where it isn't trivial or easy to use map/reduce to get a significant speed-up, but what's easy for one person might not be for another, and the same goes for "significant". I personally would feel more satisfied with a theorem that says something can't be done.

If we look at computational complexity, there is a class of hard-to-parallelize problems: the P-complete problems. Well-known examples are context-free grammar recognition (important for compilers), linear programming, and some problems in compression. The Wikipedia entry on P-completeness has more.

Some people are making up new complexity classes for map reduce. I am extremely skeptical, but the jury is out on how useful those will be.

The other question is: can we simulate any parallel algorithm in map/reduce? Sure, we can't map/reduce our way past P-complete problems, but maybe there are problems that are solvable in parallel yet not in map/reduce. I don't know of any such problems, but I do know of a paper that points in the opposite direction, albeit under some assumptions: "Simulating Parallel Algorithms in the MapReduce Framework with Applications to Parallel Computational Geometry" by Michael T. Goodrich.

In practice we have had very little time to think in the map/reduce way and to develop algorithmic techniques specific to this model, and I keep seeing new problems fall to map/reduce solutions. When the dust settles, we might find that map/reduce has more power than it seemed at first.

Courteous answered 5/10, 2010 at 7:57 Comment(0)

I'd think that any problem that can't be recast as divide and conquer would not work so well with Hadoop. If the algorithm can be modified to create subtasks, then I guess it can run under Hadoop well.

To add to your question (rather than the answer; should this be a comment?): what if there are subtasks I can break the problem down into, but there is no clear way to do the reduce phase in Hadoop? I guess one can still use Hadoop for the map phase and do the reduction on a single machine, as sketched below.
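
A minimal sketch of that pattern in plain Python (no Hadoop), assuming a hypothetical word-count-style workload: the map step is farmed out in parallel, and the "reduce" is simply done locally on one machine. (In Hadoop itself, running the job with a single reducer gives roughly the same shape.)

    from collections import Counter
    from multiprocessing import Pool

    def map_task(chunk_of_lines):
        # Independently processable subtask: word counts for one chunk.
        counts = Counter()
        for line in chunk_of_lines:
            counts.update(line.split())
        return counts

    def single_machine_reduce(partial_counts):
        # The step we chose not to distribute: combine everything locally.
        total = Counter()
        for partial in partial_counts:
            total.update(partial)
        return total

    if __name__ == "__main__":
        chunks = [
            ["the quick brown fox", "jumps over the lazy dog"],
            ["the dog barks", "the fox runs"],
        ]
        with Pool() as pool:
            partials = pool.map(map_task, chunks)   # the distributed part
        print(single_machine_reduce(partials))      # the single-machine part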

Kevel answered 6/8, 2010 at 19:46 Comment(0)

As stated by others: The problem must be easy to chop into pieces that can be processed independently.

So let me give you two examples from my past as a web-analytics architect (essentially I was trying to do what Hadoop offers now, without Hadoop, using bash shell scripting on a single system with multiple CPU cores). I hope these two give you some insight into what those boundaries may look like in real life.

Context:

Assume you have a large body of log lines from a web server and you want to analyse the behaviour of visitors. You need a property that reliably tells you which request was made by which user.

My two examples:

  1. The property you want to correlate behaviour on is unreliable in some parts (let's say: the session id). This often happens if the site sets this value using JavaScript (yuck!). It then becomes extremely hard (in my experience: near impossible) to partition the job of repairing this correlating property. The repair "must" be done in "one single huge partition", and as such MapReduce can't help you.
  2. If the property needed to chop all the data into manageable chunks is reliable (i.e. it really identifies everything a single browser did on the site), it becomes extremely easy to create lots of subsets of your data. You can easily scale the processing for a large site to hundreds of systems; see the sketch after this list.
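
A minimal sketch of example 2 in plain Python (no Hadoop); the log format and helper names are made-up assumptions. Because every line carries a reliable session id, the map phase can key each line on that id and the per-visitor analysis becomes embarrassingly parallel.

    from collections import defaultdict

    # Hypothetical log lines where the session id is reliable.
    log_lines = [
        "sess=abc GET /home",
        "sess=xyz GET /home",
        "sess=abc GET /checkout",
        "sess=xyz GET /about",
    ]

    def map_phase(lines):
        # Key every log line on its session id.
        for line in lines:
            session_id = line.split()[0].split("=", 1)[1]
            yield session_id, line

    def reduce_phase(keyed_lines):
        # Each visitor's lines can now be analysed independently, on any node.
        sessions = defaultdict(list)
        for session_id, line in keyed_lines:
            sessions[session_id].append(line)
        # Example per-visitor metric: number of requests in the visit.
        return {sid: len(lines) for sid, lines in sessions.items()}

    print(reduce_phase(map_phase(log_lines)))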

In the past someone once told me that the fluid dynamics equations (Navier–Stokes) cannot be split into "map-reducible" parts. Although I do see some logic in why this would be true (every part of the fluid influences every other part of the fluid), I should make clear that I'm not even going to try to understand these equations.

I hope these examples help you in finding the boundaries.

Piper answered 7/8, 2010 at 19:6 Comment(0)

Response to Gangadhar (sorry, not enough space in the comments box):

I like your response about divide and conquer, but I have a caveat to add. MapReduce doesn't handle the recursive aspect of divide-and-conquer algorithms very well, because combining the subproblems for certain d/c algorithms depends on eventually reducing down to one key. Take merge sort for example (ignoring for the moment that sorting is trivial in Hadoop thanks to the key-ordering properties of its MapReduce implementation): you would use your mapper to partition your data into chunks of two, using an arbitrary id as the key, and your reduce phase would merge the value lists for its respective keys together.

7 3 2 4 1 5 8 9 would map to 
1->[[7],[3]] 2->[[2],[4]] 3->[[1],[5]] 4->[[8],[9]] would reduce to
3,7 2,4 1,5 8,9 would map to 
1->[[3,7],[2,4]] 2->[[1,5],[8,9]] 
etc.

You see that the number of keys is halved each round (or reduced by whatever chunking factor you use for your d/c algorithm) and eventually gets down to one key for the fully sorted list. This is bad for parallelism.
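
Here is a minimal sketch of those rounds in plain Python (no Hadoop), just to show the key count shrinking; the function names are illustrative. Each pass of the loop corresponds to one map/reduce round in the trace above, and the last round has a single key, i.e. almost no parallelism left.

    from heapq import merge

    def map_pairs(runs):
        # Give adjacent runs the same key, two runs per key.
        for i, run in enumerate(runs):
            yield i // 2, run

    def reduce_merge(keyed_runs):
        # Merge the (at most two) sorted runs that share a key.
        grouped = {}
        for key, run in keyed_runs:
            grouped.setdefault(key, []).append(run)
        return [list(merge(*runs)) for _, runs in sorted(grouped.items())]

    runs = [[7], [3], [2], [4], [1], [5], [8], [9]]   # the example data above
    round_no = 0
    while len(runs) > 1:
        round_no += 1
        runs = reduce_merge(map_pairs(runs))
        print("after round %d: %d key(s) -> %s" % (round_no, len(runs), runs))
    # After the final round only one key remains: the fully sorted list.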

So the divide aspect of divide and conquer is clearly necessary for MapReduce, but we also need data parallelism in the conquer phase, so that the result is a collection of data-parallel items. The FFT is a good example of a divide-and-conquer algorithm that works well with MapReduce.

Influence answered 19/8, 2010 at 18:37 Comment(0)
