Apache Spark: Differences between client and cluster deploy modes

TL;DR: In a Spark Standalone cluster, what are the differences between the client and cluster deploy modes? How do I set which mode my application is going to run in?


We have a Spark Standalone cluster with three machines, all of them running Spark 1.6.1:

  • A master machine, which is also where we run our application using spark-submit
  • 2 identical worker machines

From the Spark Documentation, I read:

(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.

However, reading this I don't really understand the practical differences, and I don't see what the advantages and disadvantages of the different deploy modes are.

Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:

[Screenshot: Context UI]

So I am not able to test both modes to see the practical differences. That being said, my questions are:

1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

2) How do I choose which one my application is going to run in, using spark-submit?

Prophylactic answered 4/5, 2016 at 12:23 Comment(0)

What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

Let's try to look at the differences between client and cluster mode.

Client:

  • Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
  • Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
  • Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
  • If the driver process dies, you need an external monitoring system to restart it.
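
For illustration, a client-mode submission against a standalone cluster might look like the sketch below. The class name, master URL, and JAR path are placeholders, not values taken from the question:

# Client mode is the default; shown explicitly here for clarity.
# All names are placeholders.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  /path/to/my-app.jar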

Cluster:

  • Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
  • Driver runs as a dedicated, standalone process inside the Worker.
  • The driver program takes up at least 1 core and a dedicated amount of memory on one of the workers (this can be configured).
  • The driver program can be monitored from the Master node using the --supervise flag and restarted in case it dies.
  • When working in Cluster mode, all JARs related to the execution of your application need to be available to all of the workers. This means you can either manually place them in a shared location (such as HDFS) or in the same local path on each of the workers.
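
A cluster-mode submission with supervision enabled might look like the following sketch; all names are placeholders, and the JAR path must be resolvable from every worker:

# --supervise asks the Master to restart the driver if it dies.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --supervise \
  hdfs:///apps/my-app.jar   # must be reachable from all workers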

Which one is better? Not sure; that's actually for you to experiment with and decide. There is no universally better choice here: you gain different things from each mode, and it's up to you to see which one works better for your use case.

How do I choose which one my application is going to run in, using spark-submit?

The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
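
The deploy mode can also be supplied as the spark.submit.deployMode property the question mentions, but only at submit time (on the command line or in conf/spark-defaults.conf). Setting it programmatically in the application's SparkConf takes effect too late, because the mode is decided when spark-submit launches, which may explain the UI entry in the question. A sketch with placeholder names:

# Equivalent to --deploy-mode cluster; placeholder names throughout.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://spark-master:7077 \
  --conf spark.submit.deployMode=cluster \
  /path/to/my-app.jar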
Backache answered 4/5, 2016 at 12:53 Comment(14)
Something I've noticed is that the driver needs access to the data as well, although it won't be doing anything with it. So, if you are using a file system to keep some files, you need to have the same files both on the driver node and on the cluster. – Wavellite
@DurgaSwaroop Which mode are you referring to? – Backache
In client mode. – Wavellite
@DurgaSwaroop In client mode, if you execute the driver from your Master node, then sure, the master node must have all the files available, since it is the one starting your SparkContext. – Backache
Not just from the Master node. If I'm starting from a node completely outside of the cluster, it still expects the files to be accessible from the driver. – Wavellite
@DurgaSwaroop You're right, the Master node was only an example. – Backache
Do you know why the driver expects the data to be present with itself as well? Because I found that in the end it does nothing with it, but still expects the data. – Wavellite
@DurgaSwaroop Because the Driver is the one initializing the SparkContext, it must have the code itself present; how else would it start the job? – Backache
I'm talking about the data. The driver expecting the code is fine. But why does it expect the data to be present at the driver as well (in the case of a file-system file)? – Wavellite
@DurgaSwaroop Let us continue this discussion in chat. – Backache
What do you guys mean by "driver expects the data to be present within itself"? I am using a single-node setup, so it means I should have the data in this setup itself, right? It should not be in some external location. Am I correct? @YuvalItzchakov – Mansuetude
@YuvalItzchakov I think your description of the cluster mode is correct, but in client mode it seems like the driver does not run on the master node. See the Spark docs: spark.apache.org/docs/latest/spark-standalone.html. I'm not clear whether the standalone client and cluster modes mirror the YARN cluster and client modes, but in YARN client mode the driver is always outside the cluster. – Fulvi
@Fulvi For standalone, client mode will run wherever you submit the job from. Regarding YARN I'm unsure, although the following seems to indicate something similar: "In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do." – Backache
@YuvalItzchakov Thanks. I'm guessing you mean the driver will run wherever you submit the job from. – Fulvi

Let's say you are going to perform a spark-submit on EMR by SSH-ing into the master node. If you provide the option --deploy-mode cluster, then the following things will happen.

  1. You won't be able to see the detailed logs in the terminal.
  2. Since the driver is not created on the Master itself, you won't be able to terminate the job from the terminal.

But in the case of --deploy-mode client:

  1. You will be able to see the detailed logs in the terminal.
  2. You will be able to terminate the job from the terminal itself.

These are the basic differences that I have noticed so far.
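
To make the difference concrete, the two submissions might look like the sketch below when run from the master node's shell (class name and JAR path are placeholders):

# Cluster mode: the shell returns once the driver is launched on the cluster;
# driver logs go to the cluster's log store, and Ctrl+C does not kill the job.
spark-submit --deploy-mode cluster --class com.example.MyApp my-app.jar

# Client mode: driver logs stream into this terminal,
# and Ctrl+C terminates the driver and therefore the job.
spark-submit --deploy-mode client --class com.example.MyApp my-app.jar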

Extraterrestrial answered 9/7, 2018 at 7:4 Comment(0)

I had the same scenario; here the master node uses a standalone EC2 cluster. In this setup, client mode is appropriate: the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. Thus, this mode is especially suitable for applications that involve a REPL.

Otherwise, if your application is submitted from a machine far from the worker machines, it is quite common to use cluster mode to minimize the network latency between the driver and the executors.
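
A quick way to see the client-mode behavior described above is an interactive shell, which by its nature keeps the driver on the submitting machine (the master URL is a placeholder):

# REPLs such as spark-shell run their driver locally, i.e. in client mode;
# cluster deploy mode is not applicable to Spark shells.
./bin/spark-shell --master spark://spark-master:7077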

Coplin answered 26/7, 2017 at 4:39 Comment(0)
