Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster that runs only Spark applications? Authentication aside, perhaps.

Google turns up a lot of answers, but most of them sound wrong to me, so I'm not sure what is actually true.

For example:

  1. DZone, Deep Dive Into Spark Cluster Management

    Standalone is good for small Spark clusters, but it is not good for bigger clusters (there is an overhead of running Spark daemons — master + slave — in cluster nodes)

    But other cluster managers also require agents running on the cluster nodes. For example, YARN's per-node agents are called NodeManagers, and they may consume even more memory than Spark's worker daemons (the Spark default is 1 GB).

  2. This answer

The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use

This contradicts "Spark Standalone # executor/cores control", which shows how to specify the amount of resources an application consumes in Standalone mode.
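To make the executor/cores control concrete, here is a minimal Python sketch that assembles a spark-submit command line for a Standalone master. The properties spark.cores.max, spark.executor.cores and spark.executor.memory are real Spark settings; the master URL, the application file name and both helper functions are hypothetical and exist only for illustration. The executor estimate assumes the workers have enough free cores and memory to grant everything that is requested.

```python
def standalone_submit_args(master, app, cores_max, executor_cores, executor_memory):
    """Build spark-submit arguments that cap an application's resources (hypothetical helper)."""
    conf = {
        "spark.cores.max": str(cores_max),            # total cores the application may take
        "spark.executor.cores": str(executor_cores),  # cores per executor
        "spark.executor.memory": executor_memory,     # heap size per executor
    }
    args = ["spark-submit", "--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


def max_executors(cores_max, executor_cores):
    """Upper bound on executors, assuming the cluster can satisfy the whole request."""
    return cores_max // executor_cores


# Assumed example values: master host and job file are placeholders.
args = standalone_submit_args("spark://master-host:7077", "my_job.py", 12, 4, "8g")
print(" ".join(args))
print("at most", max_executors(12, 4), "executors")
```

With these assumed values the application is capped at 12 cores in total, so it cannot occupy every node of a larger cluster.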

  3. Spark Standalone Mode documentation

The standalone cluster mode currently only supports a simple FIFO scheduler across applications.

This contradicts the fact that Standalone mode can use Dynamic Allocation, where you can set spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying that Standalone mode doesn't support the FairScheduler.
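For reference, a minimal Python sketch of the Dynamic Allocation settings mentioned here, rendered as spark-defaults.conf lines. The property names are real Spark settings; the helper functions and the bounds 2 and 10 are hypothetical. Enabling spark.shuffle.service.enabled is an assumption about the setup: it presumes the external shuffle service is also running on the Standalone workers.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties for dynamic allocation (hypothetical helper)."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Assumption: the external shuffle service is started on every worker.
        "spark.shuffle.service.enabled": "true",
    }


def to_conf_lines(conf):
    """Render properties in spark-defaults.conf style: key, whitespace, value."""
    return [f"{key} {value}" for key, value in sorted(conf.items())]


# Assumed example bounds for a single application.
conf = dynamic_allocation_conf(2, 10)
lines = to_conf_lines(conf)
print("\n".join(lines))
```

These bounds limit how far a single application can grow or shrink. They do not by themselves decide how cores are shared between competing applications.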

  4. This answer

YARN directly handles rack and machine locality

How can YARN know anything about data locality in my job? Suppose I store file locations in AWS Glue (used by EMR as the Hive metastore), and inside the Spark job I query some-db.some-table. How can YARN know which executor is the better choice for a given task?

UPD: I found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter when the data lives in S3, for example.

Appertain answered 6/11, 2019 at 12:54 Comment(9)
Maybe for dynamic resource allocation and resource queues; it's way easier to manage multi-tenancy via user groups, resource groups and allocation pools using YARN. - Goforth
@Goforth by resource queue, do you mean YARN's CapacityScheduler? Could you please give an example where CapacityScheduler is better than Standalone mode dynamic allocation? Also, I'm afraid I'm not familiar with YARN resource groups. - Appertain
I was talking about the resource pools: docs.cloudera.com/documentation/enterprise/latest/topics/… - Goforth
This question deserves a real answer. Getting hold of the detailed facts side by side, without having to wrestle with each one separately, is not an easy task. - Arius
@VB did you get any further with your inquiry since the date of this post? - Arius
@MehdiLAMRANI no more progress yet. For the last few years I've been working on AWS EMR and Azure Databricks, so Standalone was not an option. Let me put a bounty on this question. - Appertain
@Appertain I work heavily and collaborate with Databricks as well. But the question of resource allocation is still relevant for an internal understanding of the underlying mechanisms. #57045759 - Arius
@MehdiLAMRANI regarding your question: with Databricks you don't have that level of flexibility, everything is managed for you. You can only specify the number of workers and their type (balance of CPU to RAM). So with Databricks you shouldn't worry about Standalone vs YARN. - Appertain
@Appertain I am well aware of that. That is why I referred to "internal understanding of the underlying mechanisms". "Everything is managed for you", yes, but how does it work and which cluster manager do they use under the hood... (To my knowledge, this is not disclosed.) - Arius
