How to submit multiple Spark applications in parallel without spawning separate JVMs?

The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.

How can I submit multiple Spark applications simultaneously without manually spawning separate JVMs?

My app runs on a single server, within a single JVM. That appears to be a problem with the Spark session-per-JVM paradigm, which says:

1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config

I'd like to have different configurations per Spark application without launching extra JVMs manually. The configurations in question (a sketch of setting them via the session builder follows this list):

  1. spark.executor.cores
  2. spark.executor.memory
  3. spark.dynamicAllocation.maxExecutors
  4. spark.default.parallelism
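
For reference, a minimal sketch (via the Java API) of how these settings are fixed once, when the session and its JVM are created; the values below are illustrative only:

    import org.apache.spark.sql.SparkSession;

    public class LongRunningJob {
        public static void main(String[] args) {
            // All four settings are bound to this one session/JVM and cannot be
            // changed per job afterwards; the values here are examples only.
            SparkSession spark = SparkSession.builder()
                    .appName("long-running-job")
                    .config("spark.executor.cores", "2")
                    .config("spark.executor.memory", "28g")
                    .config("spark.dynamicAllocation.maxExecutors", "10")
                    .config("spark.default.parallelism", "200")
                    .getOrCreate();
            // ... job logic ...
            spark.stop();
        }
    }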

Use case

You have started a long-running job that takes, say, 4-5 hours to complete. The job runs within a session configured with spark.executor.memory=28GB and spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs only 1GB of RAM. What would you do? Submit the tiny job on behalf of the long-running job's session? Then it will claim 28GB.

What I've found

  1. Spark allows you to configure the number of cores and executors only at the session level. Spark scheduling pools let you slice and dice only the number of cores, not RAM or executors, right?
  2. Spark Job Server. It doesn't support Spark newer than 2.0, so it is not an option for me, but it does solve the problem for versions older than 2.0. Among the Spark JobServer features they state "Separate JVM per SparkContext for isolation (EXPERIMENTAL)", which means spawning a new JVM per context.
  3. Mesos fine-grained mode is deprecated.
  4. This hack, but it's too risky to use in production.
  5. The hidden Apache Spark REST API for job submission; read this and this. There is definitely a way to specify executor memory and cores there, but what is the behavior when submitting two jobs with different configs? As I understand it, this is a Java REST client for it. (A sketch of such a submission request follows this list.)
  6. Livy. I'm not familiar with it, but it looks like they have a Java API only for batch submission, which is not an option for me.
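
To illustrate point 5, here is a rough sketch of what a submission against that REST endpoint might look like. The port (6066), path, and JSON field names follow the write-ups linked above and should be treated as assumptions, since the API is undocumented; the host, jar path, and class name are placeholders.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RestSubmitSketch {
        public static void main(String[] args) throws Exception {
            // Per-application executor sizing goes into "sparkProperties".
            String json = "{"
                    + "\"action\": \"CreateSubmissionRequest\","
                    + "\"appResource\": \"file:/path/to/tiny-job.jar\","
                    + "\"mainClass\": \"com.example.TinyJob\","
                    + "\"appArgs\": [],"
                    + "\"clientSparkVersion\": \"2.1.0\","
                    + "\"environmentVariables\": {\"SPARK_ENV_LOADED\": \"1\"},"
                    + "\"sparkProperties\": {"
                    +     "\"spark.master\": \"spark://master-host:6066\","
                    +     "\"spark.app.name\": \"tiny-job\","
                    +     "\"spark.submit.deployMode\": \"cluster\","
                    +     "\"spark.jars\": \"file:/path/to/tiny-job.jar\","
                    +     "\"spark.executor.memory\": \"1g\","
                    +     "\"spark.executor.cores\": \"1\""
                    + "}}";

            // POST the submission request to the master's REST submission server.
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://master-host:6066/v1/submissions/create").openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(json.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }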
Dodiedodo answered 16/5, 2017 at 10:49 Comment(1)
I really don't understand the problem here. Cluster resources and application resources are two different things. Each application can use its own configuration, and spark.executor.memory is an application property.Gotten

With a use case, this is much clearer now. There are two possible solutions:

If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend (as SparkJobServer, Livy, etc. do). You don't need to use SparkJobServer either; it should be relatively easy to code if you have a fixed scope. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library covering this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework. In this case, you can size your executors according to your hardware; Spark will manage the scheduling of your jobs. With YARN's dynamic resource allocation, YARN will also free resources (kill executors) should your framework/app be idle. For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
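
A sketch of what that could look like inside a single driver JVM, assuming one shared SparkSession with the FAIR scheduler enabled and jobs submitted from separate threads; the pool names and the dummy jobs are made up for illustration:

    import org.apache.spark.sql.SparkSession;

    public class FairSchedulingSketch {
        public static void main(String[] args) throws Exception {
            // One long-lived session sized for the whole workload; FAIR mode lets a
            // short job get task slots while a long job is still running.
            SparkSession spark = SparkSession.builder()
                    .appName("shared-frontend")
                    .config("spark.scheduler.mode", "FAIR")
                    .getOrCreate();

            Thread longJob = new Thread(() -> {
                // The scheduler pool is a thread-local property.
                spark.sparkContext().setLocalProperty("spark.scheduler.pool", "batch");
                spark.range(0, 10_000_000_000L).selectExpr("sum(id)").show(); // stand-in for the 4-5 hour job
            });

            Thread tinyJob = new Thread(() -> {
                spark.sparkContext().setLocalProperty("spark.scheduler.pool", "interactive");
                System.out.println(spark.range(0, 1000).count()); // stand-in for the 5-10 second job
            });

            longJob.start();
            tinyJob.start();
            longJob.join();
            tinyJob.join();
            spark.stop();
        }
    }

Pool weights and minimum shares can be tuned further in a fair scheduler allocation file, but, as noted above, the executor size itself stays fixed for the session.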

If you don't need shared data, use YARN (or another resource manager) to assign resources in a fair manner to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you, but you need shared data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits, and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help you make this less annoying and more transparent to end users. This approach is also high-latency, since resource allocation and SparkSession initialization take up a more or less constant amount of time.
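
For the second option, a sketch of how the automation around spark-submit could look from plain Java, using Spark's launcher API (org.apache.spark.launcher.SparkLauncher, essentially a programmatic spark-submit). Note that this still results in a separate driver per application, in line with the answer; the jar path, class name, and sizing are placeholders:

    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class SubmitTinyJob {
        public static void main(String[] args) throws Exception {
            // Submit the on-demand job as its own YARN application with a small
            // resource profile; in cluster deploy mode its driver runs on the
            // cluster, not in this JVM.
            SparkAppHandle handle = new SparkLauncher()
                    .setAppResource("/path/to/tiny-job.jar")
                    .setMainClass("com.example.TinyJob")
                    .setMaster("yarn")
                    .setDeployMode("cluster")
                    .setConf(SparkLauncher.EXECUTOR_MEMORY, "1g")
                    .setConf(SparkLauncher.EXECUTOR_CORES, "1")
                    .startApplication();

            // Wait for the application to reach a terminal state.
            while (!handle.getState().isFinal()) {
                Thread.sleep(1000);
            }
        }
    }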

Cheesecake answered 16/5, 2017 at 14:48 Comment(6)
How does that solve the requirement of "How can I submit multiple Spark applications simultaneously without manually spawning separate JVMs?" + "I'd like to have different configurations per Spark application without launching extra JVMs manually."? I don't see how YARN could help here.Cola
@RickMoritz Thanks for your answer! But I don't want to maintain multiple drivers on a regular basis, since the second tiny-job request may never happen. I don't want to allocate extra resources for extra drivers. I want a solution that will start another JVM for me, execute the tiny job, and kill the extra JVM.Dodiedodo
With regard to the first option, in particular: YARN allows you to increase and decrease resources on the fly. That covers the underlying requirement of resource usage. Simultaneous execution is managed through use of fair scheduling - the same driver jvm schedules the additional job to run, using additional resources, if YARN can spare them for the application. Sure, your executors will be the same size, but there's very little resource blocking. Getting differently sized executors: #34876050Cheesecake
@VolodymyrBakhmatiuk : in that case, use e.g. the Yarn-API to reimplement spark-submit. Then you can have an event loop/webserver and you can actually spawn new SparkSessions for every request, with individual sizing. It's not pretty, and I'd prefer using dynamic allocation, which kills all unused JVMs, except for the AM, but you can do just that using YARN's Java-API, for example. Here's the simplest case: github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/…Cheesecake
@RickMoritz Thanks! That's a nice example. Please give more details about dynamic allocation! We use dynamic allocation to change the number of executors from session to session, but I don't know how to change RAM per job with dynamic allocation.Dodiedodo
Dynamic allocation will kill executors that haven't run jobs in X time, but restart them when the resources are available and the driver requests new executors. At the moment the algorithm is pretty naive: after a certain amount of time, Spark will instruct YARN that executors can be removed. You don't change the amount of RAM per executor, but you do change the amount of RAM reserved/used in the cluster by reducing the number of executors (down to 0, if possible). It's a fire-and-forget setting in the Spark configuration, per YARN app/SparkSession.Cheesecake

tl;dr I'd say it's not possible.

A Spark application is at least one JVM, and it's at spark-submit time that you specify the requirements of that single JVM (or the bunch of JVMs that act as executors).

If, however, you want to have different JVM configurations without launching separate JVMs, that does not seem possible (even outside Spark, as long as a JVM is in use).

Cola answered 16/5, 2017 at 13:56 Comment(11)
not JVM configs, but at least different spark.executor.cores and spark.executor.memory. Does Spark relaunch each executor JVM on session restart?Dodiedodo
These are the same as they directly map to JVM's settings.Cola
If you're using YARN (or any other means of submitting jobs), then couldn't you write a single Java application which replaces spark-submit, and submits jobs which then instantiate their driver JVMs remotely? The only thing I am not sure about is how you can make sure that the SparkSessions aren't actually instantiated in the principal JVM. I suspect that if you ship the code into YARN using just the jars, you might be able to achieve this - in a fashion, of course. -- On the other hand, I completely fail to see the use case. With dynamic scaling in Spark-on-YARN, this shouldn't be an issue.Cheesecake
@RickMoritz What did you mean by instantiating JVMs remotely? I don't instantiate any JVMs, since I'm working through the Java API (SparkSession), which does everything for me.Dodiedodo
@JacekLaskowski Please give more details! I'd very much appreciate them. By default, Spark allocates 1 executor per core in standalone mode. Suppose you have 2 cores on some node. When you submit a job from the 1st app, Spark starts 2 JVMs (executors). Suppose you submit another job from a 2nd app. Will Spark launch 2 extra JVMs on the node, meaning 4 JVMs in total?Dodiedodo
@RickMoritz I've created a Use case section in my question.Dodiedodo
I mean that you task YARN with starting your applications, but you don't do so using spark-submit, but instead YARN's actual API. In effect you end up writing your own spark-submit alternative, and since you use yarn-cluster deployment, you don't actually have the SparkSession running in the JVM that starts your job. Since you're spawning JVMs all over the cluster in any case, the extra driver/application-master JVMs don't matter, and they won't run where you launch your application. --- What remains impossible is to have multiple SparkSessions inside the same JVM.Cheesecake
@RickMoritz That's absolutely right. That's actually the problem, and I'm looking for a solution like github.com/spark-jobserver/spark-jobserver. You just make a request to Spark JobServer, and it spawns a separate JVM and manages the SparkSession for you. You can forget about the SparkSession-per-JVM limitation.Dodiedodo
@RickMoritz I'm just looking for a solution; I don't want to write a vehicle that starts separate JVMs myself. I'm sure people have faced this problem many times.Dodiedodo
@VolodymyrBakhmatiuk would you be surprised to know that you're alone? :)Cola
@JacekLaskowski Yes! That means I'm trying to solve a common problem in the wrong way. But how would you solve my use case, then?Dodiedodo
