Is caching the only advantage of Spark over MapReduce?

I have started to learn about Apache Spark and am very impressed by the framework. One thing that keeps bothering me, though, is that in all Spark presentations they talk about how Spark caches RDDs, and therefore multiple operations that need the same data are faster than other approaches like MapReduce.

So my question is: if this is the case, why not just add a caching engine inside MR frameworks like YARN/Hadoop?

Why create a new framework altogether?

I am sure I am missing something here, and you will be able to point me to some documentation that educates me more on Spark.

Incubator answered 11/7, 2014 at 20:8 Comment(0)

Caching + in-memory computation is definitely a big thing for Spark, but there are other things:


RDD (Resilient Distributed Dataset): an RDD is the main abstraction of Spark. It allows recovery from failed nodes by re-computing the lineage DAG, while also supporting a recovery style more similar to Hadoop's by way of checkpointing, which reduces an RDD's dependencies. Storing a Spark job as a DAG allows lazy computation of RDDs and also lets Spark's optimization engine schedule the flow in ways that make a big difference in performance.
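
As a rough illustration of the lazy DAG, caching and checkpointing (a minimal sketch, assuming an existing SparkContext `sc`; the HDFS paths are placeholders):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val lines  = sc.textFile("hdfs:///data/events.log")   // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))        // still lazy: only the lineage DAG grows
errors.cache()        // keep the RDD in memory for reuse across actions
errors.checkpoint()   // truncate the lineage so recovery need not recompute from the source

// The first action materializes the DAG; later actions reuse the cached data.
val total  = errors.count()
val byHost = errors.map(l => (l.split(" ")(0), 1)).reduceByKey(_ + _).collect()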


Spark API: Hadoop MapReduce has a very rigid API that doesn't allow for as much versatility. Since Spark abstracts away many of the low-level details, it allows for more productivity. Also, things like broadcast variables and accumulators are much more versatile than DistributedCache and counters, IMO.
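
For example, a broadcast variable and an accumulator might be used together like this (a hedged sketch, assuming an existing SparkContext `sc`; the lookup table, input path and record format are made up):

val countryByCode = sc.broadcast(Map("us" -> "United States", "de" -> "Germany"))
val badRecords    = sc.longAccumulator("badRecords")

val visitsByCountry = sc.textFile("hdfs:///data/visits.csv").flatMap { line =>
  val cols = line.split(",")
  countryByCode.value.get(cols(1)) match {        // read-only lookup, shipped once per executor
    case Some(country) => Some((country, 1))
    case None          => badRecords.add(1); None // count unmatched rows for the driver
  }
}.reduceByKey(_ + _)

visitsByCountry.collect()
println(s"records with unknown country code: ${badRecords.value}")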


Spark Streaming: Spark Streaming is based on the paper Discretized Streams, which proposes a new model for doing windowed computations on streams using micro-batches. Hadoop doesn't support anything like this.
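
A hedged sketch of the micro-batch model (a windowed word count over a socket source; the host, port and durations are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedCounts")
val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches

val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))   // 30s window, sliding every 10s

counts.print()
ssc.start()
ssc.awaitTermination()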


As a product of in-memory computation, Spark sort of acts as its own flow scheduler, whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows.


The Hadoop project is made up of MapReduce, YARN, Commons and HDFS; Spark, however, is attempting to create one unified big-data platform with libraries (in the same repo) for machine learning, graph processing, streaming and multiple SQL-type libraries, and I believe a deep-learning library is in its early stages. While none of this is strictly a feature of Spark, it is a product of Spark's computing model. Tachyon and BlinkDB are two other technologies built around Spark.

Rhubarb answered 11/7, 2014 at 20:46 Comment(0)

So it's much more than just caching. aaronman covered a lot, so I'll only add what he missed.

Raw performance without caching is 2-10x faster due to a generally more efficient and better-architected framework, e.g. one JVM per node with Akka threads is better than forking a whole process for each task.

Scala API. Scala stands for Scalable Language and is clearly the best language to choose for parallel processing. They say Scala cuts down code by 2-5x, but in my experience from refactoring code in other languages - especially Java MapReduce code - it's more like 10-100x less code. Seriously, I have refactored hundreds of LOC of Java into a handful of lines of Scala / Spark. It's also much easier to read and reason about. Spark is even more concise and easy to use than Hadoop abstraction tools like Pig & Hive; it's even better than Scalding.
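
To give a sense of scale: the classic word count, which in Java MapReduce needs a mapper class, a reducer class and a driver, boils down to a few lines (a sketch assuming an existing SparkContext `sc`; the paths are placeholders):

sc.textFile("hdfs:///data/input")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs:///data/word-counts")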

Spark has a REPL / shell. The need for a compilation-deployment cycle in order to run simple jobs is eliminated. One can interactively play with data just like one uses Bash to poke around a system.
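
For instance, a hypothetical `spark-shell` session (the path is a placeholder; the shell provides `sc` automatically):

val logs = sc.textFile("hdfs:///data/events.log")
logs.take(5).foreach(println)                  // peek at the first few lines
logs.filter(_.contains("ERROR")).count()       // an ad-hoc question, no build/deploy step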

The last thing that comes to mind is ease of integration with "big table" DBs like Cassandra and HBase. In Cassandra, to read a table in order to do some analysis, one just does:

// With the DataStax spark-cassandra-connector on the classpath:
import com.datastax.spark.connector._

sc.cassandraTable[MyType](tableName).select(myCols).where(someCQL)

Similar things are expected for HBase. Now try doing that in any other MPP framework!!

UPDATE: I thought I'd point out that these are just the advantages of Spark itself; there are quite a few useful things built on top, e.g. GraphX for graph processing, MLlib for easy machine learning, Spark SQL for BI, BlinkDB for insanely fast approximate queries, and, as mentioned, Spark Streaming.

Smetana answered 12/7, 2014 at 8:39 Comment(0)

A lot was covered by aaronman and samthebest. I have a few more points.

  1. Spark has much lower per-job and per-task overhead. This gives it the ability to be applied to cases where Hadoop MR is not applicable - cases where a reply is needed within 1-30 seconds.
    Low per-task overhead also makes Spark more efficient even for big jobs with a lot of short tasks. As a very rough estimate, when a task takes 1 second, Spark will be 2 times more efficient than Hadoop MR.

  2. Spark has a lower-level abstraction than MR - a graph of computations. As a result it is possible to implement more efficient processing than MR, specifically in cases when sorting is not needed. In other words, in MR we always pay for the sort between map and reduce, but in Spark we do not have to; the sketch below illustrates this.
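
A hedged illustration of a per-key aggregation that needs no global sort (assuming an existing SparkContext `sc`; the input path and format are made up):

val maxPerKey = sc.textFile("hdfs:///data/metrics.csv")
  .map { line => val c = line.split(","); (c(0), c(1).toDouble) }
  .reduceByKey((a, b) => math.max(a, b))   // hash-based combine per key, no sort phase required
  .collect()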

Mirepoix answered 13/7, 2014 at 6:14 Comment(2)
Did you mean to say "20 times"? ... IME it takes around 20 seconds for Hadoop to spin up the JVMs. - Smetana
I mentioned the case when there are a lot of tasks, where job startup overhead can be neglected. In this case Hadoop's 1-2 seconds of per-task overhead makes 1-2 second tasks waste about 50% of their time, while Spark does just fine with a lot of small tasks. - Mirepoix
  • Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after a map or reduce action. But Spark needs a lot of memory.

  • Spark loads data into memory and keeps it there until further notice, for the sake of caching.

  • The Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk only when needed.

  • Since Spark works in memory, there's no synchronisation barrier slowing you down. This is a major reason for Spark's performance.

  • Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in near real time using Spark Streaming.

  • The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground up as an extension to the existing RDD API.

  • A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed (a short sketch follows this list).

  • Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you're dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.

  • Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.

  • The Spark code base is much smaller.
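
A hedged sketch of the DataFrame API mentioned above (assuming a SparkSession `spark`; the JSON file and its columns are made up):

val people = spark.read.json("hdfs:///data/people.json")

people.printSchema()                 // named columns, inferred from the data
people.filter(people("age") > 21)
      .groupBy("country")
      .count()
      .show()                        // the optimizer plans the query under the hood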

Hope this helps.

Melindamelinde answered 14/12, 2015 at 11:13 Comment(0)

I think there are three primary reasons.

The main two reasons stem from the fact that, usually, one does not run a single MapReduce job, but rather a set of jobs in sequence.

  1. One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs disk I/O equal to three times the size of the dataset (for replication) and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of an operation needs to be fed into another operation, Spark passes the data directly without writing to persistent storage. This is an innovation over MapReduce that came from Microsoft's Dryad paper, and is not original to Spark.
  2. The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don't need to be read from disk for each operation (see the sketch after this list).
  3. What about Spark jobs that would boil down to a single MapReduce job? In many cases also these run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds with loading JARs, JITing, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes in the single digits of milliseconds.
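
A rough sketch of point 2, with several jobs sharing one cached input (assuming an existing SparkContext `sc`; the path and record format are made up):

import org.apache.spark.storage.StorageLevel

val ratings = sc.textFile("hdfs:///data/ratings.csv")
  .map { line => val c = line.split(","); (c(0), c(2).toDouble) }   // (movieId, rating)
  .persist(StorageLevel.MEMORY_AND_DISK)   // read from HDFS once, reuse below

val ratingCount = ratings.count()                                   // job 1: materializes the cache
val avgByMovie  = ratings.mapValues(r => (r, 1))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum / n }
  .collect()                                                        // job 2: reads from the cache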

Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.

Undercharge answered 9/10, 2017 at 11:58 Comment(0)
