What is the relationship between workers, worker instances, and executors?
In Spark Standalone mode, there are master and worker nodes.

Here are a few questions:

  1. Do 2 worker instances mean one worker node with 2 worker processes?
  2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
  3. Is there a flow chart explaining how Spark works at runtime, for example a word count?
Methane answered 11/7, 2014 at 11:34 Comment(0)
I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes.

Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more than one worker per machine. So two worker nodes typically means two machines, each a Spark worker.

Workers hold many executors, for many applications. One application has executors on many workers.
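
To make that concrete, here is a minimal sketch of the application-side settings that drive this (the master URL, application name, and resource numbers are hypothetical); the standalone master then places the requested executors on whichever workers have free cores and memory, so one application ends up with executors spread across many workers:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sizing: each executor asks for 2 cores and 4 GB of memory,
    // and the application caps itself at 8 cores in total across the cluster.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // placeholder standalone master URL
      .setAppName("executor-layout-demo")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "4g")
      .set("spark.cores.max", "8")            // at most 8 / 2 = 4 executors for this app

    val sc = new SparkContext(conf)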

Your third question is not clear.

Lamarckism answered 11/7, 2014 at 13:53 Comment(4)
1. The first question comes from spark-env.sh: SPARK_WORKER_INSTANCES sets the number of worker processes per node. 2. Is the class StandaloneExecutorBackend the so-called Executor? 3. Could you explain with a picture how word count runs inside Spark and how data moves between nodes? :)Methane
'A node is a machine, and there's not a good reason to run more than one worker per machine.' Why do you say that? Can you please explain? If a node has enough memory it can have 2 or more executors on the same machine.Koppel
All else equal, it'd be better to let one process manage all those resources, not two. The exceptions I can think of are somewhat extreme cases - if you have 256GB of memory, you may not want a 256GB single JVM heap, as GC can take a while for example. Or you may have some non-thread-safe native library that requires you to run one task per executor, thus multiple executors. But those are exceptional.Lamarckism
Cloudera blog link broke; use this oneSew
Extending the other great answers, I would like to explain with a few images.

In Spark Standalone mode, there are a master node and worker nodes.

Here is how the master and the workers (each worker can have multiple executors if CPU and memory are available) fit together in standalone mode:

Spark Standalone mode

If you are curious about how Spark works with YARN, check this post: Spark on YARN.

1. Do two worker instances mean one worker node with two worker processes?

In general, we call a worker instance a slave, as it's a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:

1 Node = 1 Worker process

2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?

Yes, a worker node can hold multiple executors (processes) if it has sufficient CPU, memory, and storage.

Check the worker node in the image below. A worker node in a cluster

By the way, the number of executors on a worker node at a given point in time depends entirely on the workload on the cluster and the node's capacity to run executors.
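
For example (hypothetical numbers): if a worker node offers 16 cores and 64 GB of memory, and an application asks for executors of 4 cores and 16 GB each, that worker can host at most min(16 / 4, 64 / 16) = 4 of that application's executors; other applications already running on the node would leave room for fewer.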

3. Is there a flow chart explaining how Spark works at runtime?

Let's look at the execution, from the Spark perspective over any resource manager, of a program that joins two RDDs, does some reduce operation, and then a filter:

Spark runtime for a sample code
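
A minimal sketch of such a program (the data, key names, and threshold are made up for illustration) could look like the following; the join and reduceByKey are the shuffle boundaries that show up as stage breaks in the runtime picture above:

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinReduceFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("join-reduce-filter"))

        // Two hypothetical pair RDDs: (userId, purchaseAmount) and (userId, country)
        val purchases = sc.parallelize(Seq((1, 20.0), (2, 35.0), (1, 15.0), (3, 50.0)))
        val countries = sc.parallelize(Seq((1, "IN"), (2, "US"), (3, "DE")))

        val result = purchases
          .join(countries)                              // shuffle: co-locate records with the same key
          .map { case (_, (amount, country)) => (country, amount) }
          .reduceByKey(_ + _)                           // shuffle: sum amounts per country
          .filter { case (_, total) => total > 30.0 }   // narrow transformation, no shuffle

        result.collect().foreach(println)               // action: triggers execution of the whole DAG
        sc.stop()
      }
    }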

HTH (hope this helps).

Spectacles answered 7/12, 2017 at 7:7 Comment(4)
very nice and elaborative answer, too bad the poster has accepted an answer alreadyOctad
Excellent post - thanks! One question: you say that "Suggested mapping for node(a physical or virtual machine) and worker is, 1 Node = 1 Worker process". But the Spark docs at spark.apache.org/docs/latest/hardware-provisioning.html say "note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node" So, is your advice assuming this JVM RAM limit? Or (as I suspect) is this RAM limit based on older JVMs, which were a bit less robust with their memory collection?Velour
@Brian: I assume that the more memory the JVM has (especially heap), the longer the GC pause when it happens. Sorry for the late response; somehow I missed your comment.Spectacles
That's a good suggestion; I updated the answer. I was under the impression the second image would explain more about the worker.Spectacles
I know this is an old question and Sean's answer was excellent. My writeup is about the SPARK_WORKER_INSTANCES in MrQuestion's comment. If you use Mesos or YARN as your cluster manager, you are able to run multiple executors on the same machine with one worker, so there is really no need to run multiple workers per machine. However, if you use the standalone cluster manager, it currently still allows only one executor per worker process on each physical machine. Thus, if you have a very large machine and would like to run multiple executors on it, you have to start more than one worker process. That's what SPARK_WORKER_INSTANCES in spark-env.sh is for. The default value is 1. If you do use this setting, make sure you set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
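
As an illustration only, a typical conf/spark-env.sh on such a large worker machine might look roughly like this (the values are made-up examples, not recommendations):

    # conf/spark-env.sh on the worker machine (illustrative values)
    SPARK_WORKER_INSTANCES=2   # start two worker processes on this machine
    SPARK_WORKER_CORES=16      # each worker offers at most 16 cores to executors
    SPARK_WORKER_MEMORY=64g    # each worker offers at most 64 GB of memory to executors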

This standalone cluster manager limitation should go away soon. According to SPARK-1706, this issue will be fixed and released in Spark 1.4.

Marrakech answered 21/4, 2015 at 18:22 Comment(1)
So how does it work now in the latest Spark versions? I can manipulate the number of workers just by setting the number of cores for executors. For instance, if a worker has 16 cores and I set executor cores to 4, would I have 4 executors per worker? I asked such a question recently, which you may answer: #54364903Coppock
As Lan was saying, the use of multiple worker instances is only relevant in standalone mode. There are two reasons why you would want multiple instances: (1) garbage collector pauses can hurt throughput for large JVMs, and (2) heap sizes above 32 GB can't use CompressedOops.

Read more about how to set up multiple worker instances.

Kaylil answered 5/6, 2015 at 8:47 Comment(0)
