What happens when Spark master fails?

Does the driver need constant access to the master node? Or is it only required for the initial resource allocation? What happens if the master is not available after the Spark context has been created? Does that mean the application will fail?

Roderic answered 5/3, 2016 at 14:38 Comment(0)

The first, and for the time being probably the most serious, consequence of a master failure or a network partition is that your cluster won't be able to accept new applications. This is why the Master is considered a single point of failure when the cluster is used with its default configuration.

Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:

  • the application won't be able to finish gracefully
  • if the master is down, or the network partition affects the worker nodes as well, the slaves will try to reregisterWithMaster. If this fails multiple times, the workers will simply give up. At that point long-running applications (like streaming apps) won't be able to continue processing, but it still shouldn't result in immediate failure. Instead, the application will wait for the master to come back online (file system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing. Both recovery modes are sketched below.
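
For reference, a minimal sketch of how those two recovery modes are enabled on the master through spark-env.sh; the ZooKeeper quorum address and the recovery directory here are placeholder values:

    # ZooKeeper mode: standby masters take over via leader election.
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # File system mode: a restarted master recovers state from local files.
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
      -Dspark.deploy.recoveryDirectory=/var/lib/spark/recovery"
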
Milburn answered 6/3, 2016 at 1:5 Comment(1)
But if the Master is restarted, can it recover from logs? I get the notion yes from your posting. I also mean batch jobs, not just Streaming. Does it matter if it is standalone or YARN? I think not. – Chimene

Below are the steps a Spark application goes through when it starts:

  1. Starts the Spark driver.
  2. The Spark driver connects to the Spark master for resource allocation.
  3. The Spark driver sends the jar attached to the Spark context to the master server.
  4. The Spark driver keeps polling the master server to get the job status.
  5. If there is shuffling or a broadcast in the code, data is routed via the Spark driver. That is why the Spark driver needs to have sufficient memory.
  6. If there is an operation like take, takeOrdered, or collect, data is accumulated on the driver.

So yes, a master failure will leave the executors unable to communicate with it, so they will stop working. It will also leave the driver unable to reach the master for job status, so your application will fail.
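
To make steps 2, 5 and 6 concrete, here is a minimal, self-contained sketch of a standalone-mode application; the master URL spark://master-host:7077 is a placeholder, and the rest uses the standard SparkContext API:

    import org.apache.spark.{SparkConf, SparkContext}

    object MasterFailureDemo {
      def main(args: Array[String]): Unit = {
        // Step 2: the driver contacts the standalone master for resources.
        val conf = new SparkConf()
          .setAppName("master-failure-demo")
          .setMaster("spark://master-host:7077") // placeholder master URL
        val sc = new SparkContext(conf)

        // Step 5: a broadcast ships data from the driver to the executors,
        // so the driver has to hold it in memory first.
        val lookup = sc.broadcast(Map(0 -> "even", 1 -> "odd"))

        // Step 6: collect() accumulates all results on the driver.
        val labels = sc.parallelize(1 to 10).map(i => lookup.value(i % 2)).collect()
        println(labels.mkString(", "))

        // Stopping the context also talks to the master; if the master is
        // unreachable at this point, the application cannot finish cleanly.
        sc.stop()
      }
    }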

Rutherford answered 19/3, 2016 at 4:45 Comment(0)

Yes, the driver and the master communicate constantly throughout the SparkContext's lifetime. That allows the driver to:

  • Display detailed status of jobs / stages / tasks on its Web Interface and REST API
  • Listen for job start and end events (you can add your own listeners; see the sketch after this list)
  • Wait for jobs to end (via the synchronous API - e.g. rdd.count() won't return until the job is completed) and get their results
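
As an illustration of the second point, a minimal sketch of a custom listener built on the standard SparkListener API; the JobLogger class name is made up for the example:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

    // Hypothetical listener that logs job lifecycle events on the driver.
    class JobLogger extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    }

    // Registered on the driver's SparkContext:
    //   sc.addSparkListener(new JobLogger)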

A disconnect between the driver and the master will fail the job.

Germanous answered 5/3, 2016 at 16:39 Comment(0)
