How to submit Apache Spark job to Hadoop YARN on Azure HDInsight
I am very excited that HDInsight switched to Hadoop version 2, which supports Apache Spark through YARN. Apache Spark is a much better-fitting parallel programming paradigm than MapReduce for the task that I want to perform.

However, I was unable to find any documentation on how to do remote job submission of an Apache Spark job to my HDInsight cluster. For remote submission of standard MapReduce jobs I know there are several REST endpoints, like Templeton and Oozie. But as far as I was able to find, running Spark jobs is not possible through Templeton. I did find that it is possible to incorporate Spark jobs into Oozie, but I've read that this is very tedious to do, and I've also seen reports of job-failure detection not working in that case.

Surely there must be a more appropriate way to submit Spark jobs. Does anyone know how to remotely submit Apache Spark jobs to HDInsight?

Many thanks in advance!

Luff answered 10/7, 2014 at 9:14 Comment(4)
Difficult topic; you would need a way to get Scala onto the slave nodes, which is unlikely to be efficient when starting a job. As you already found out, you can't submit jobs from the outside; you must RDP into the headnode and submit them from there.Nuncle
Thanks for the comment. I tried submitting from the headnode via RDP. When I search the headnode for a Spark jar file to run Spark jobs from, I find nothing. Searching for Tez, one of the other new YARN computational models, I did find a jar file, and I am also able to use it to submit example Tez jobs to the cluster. Does the absence of Spark on the headnode perhaps indicate that the cluster does NOT support Spark after all?Luff
It looks like you can do this with a PowerShell script at install time of the HDInsight cluster. blogs.technet.com/b/dataplatforminsider/archive/2014/11/17/…Wandis
Great! Good to know that the support for Spark on Azure has improved!Luff
You can install Spark on an HDInsight cluster. You have to do it by creating a custom cluster and adding a Script Action that installs Spark on the cluster at the time the VMs for the cluster are created.

Installing with a Script Action at cluster-creation time is pretty easy; you can do it in C# or PowerShell by adding a few lines of code to a standard custom-create cluster script/program.

PowerShell:

# ADD SCRIPT ACTION TO CLUSTER CONFIGURATION
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1
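
For context, here is a minimal sketch of how that line might fit into a full custom-create script. This uses the classic (pre-ARM) Azure PowerShell cmdlets current when this answer was written; the cluster name, size, location, and credentials are placeholders:

```powershell
# Minimal sketch using the classic Azure HDInsight cmdlets.
# Cluster name, size, and location below are placeholders.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4

# Attach the Spark installer Script Action to the headnode role.
$config = Add-AzureHDInsightScriptAction -Config $config `
    -Name "Install Spark" `
    -ClusterRoleCollection HeadNode `
    -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1

# Create the cluster; the Script Action runs while the VMs are provisioned.
New-AzureHDInsightCluster -Config $config -Name "mysparkcluster" `
    -Location "North Europe" -Credential (Get-Credential)
```

In practice you would also attach a default storage account to the config before creating the cluster.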

C#:

// ADD THE SCRIPT ACTION TO INSTALL SPARK
clusterInfo.ConfigActions.Add(new ScriptAction(
  "Install Spark", // Name of the config action
  new ClusterNodeType[] { ClusterNodeType.HeadNode }, // List of nodes to install Spark on
  new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // Location of the script to install Spark
  null // The script used does not require any parameters
));

You can then RDP into the headnode and use the spark-shell, or use spark-submit to run jobs. I am not sure how you would run a Spark job without RDPing into the headnode, but that is another question.
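
For example, once RDPed into the headnode, submitting a job to YARN might look something like the following. The jar path, class name, and resource sizes are placeholders, and the exact flags vary with the Spark version installed:

```powershell
# Run from the headnode; paths, class name, and sizes are placeholders.
spark-submit `
    --master yarn `
    --deploy-mode cluster `
    --class com.example.MySparkApp `
    --num-executors 4 `
    --executor-memory 2g `
    C:\apps\myapp\my-spark-app.jar arg1 arg2
```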

Wandis answered 11/2, 2015 at 16:59 Comment(4)
At the time I needed it, there was unfortunately no Spark installer for Azure available yet. But good to hear that this is now possible on Azure! I accepted and upvoted your answer, as it will be of great help to anyone who stumbles on this page wanting to use Spark on Azure.Luff
The next step from Microsoft should be to provide some way to remotely submit Spark jobs without having to RDP into the master node manually.Struble
I would totally agree. I am currently working on getting Oozie to schedule Spark jobs on the cluster; not sure how hard it is, or whether it is even possible at this point, but it might be a good way to get around some of that... a little off topic, though.Wandis
There is already an ongoing suggestion here. It would be great if you could vote for this idea and make the Microsoft developers aware of this need.Struble
I also asked the same question of the Azure folks. The following is their answer:

"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li

Currently, this functionality is not supported. One workaround is to build a job-submission web service yourself:

  1. Create a Scala web service that will use the Spark APIs to start jobs on the cluster.
  2. Host this web service in a VM inside the same VNet as the cluster.
  3. Expose the web service endpoint externally through some authentication scheme. You could also employ an intermediate MapReduce job, though it would take longer.
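Once such a service exists, a remote client could submit jobs to it over HTTPS. A hypothetical sketch of the client side, where the endpoint URL, payload fields, and auth header are entirely made up for illustration:

```powershell
# Hypothetical client call; the endpoint, payload shape, and auth are illustrative only.
$body = @{
    jarPath   = "wasb://mycontainer@myaccount.blob.core.windows.net/jars/my-spark-app.jar"
    mainClass = "com.example.MySparkApp"
    args      = @("arg1", "arg2")
} | ConvertTo-Json

Invoke-RestMethod -Uri "https://my-submit-service.example.com/jobs" `
    -Method Post -Body $body -ContentType "application/json" `
    -Headers @{ Authorization = "Bearer <token>" }
```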
Microsporangium answered 24/4, 2015 at 1:45 Comment(0)
You might consider using Brisk (https://brisk.elastatools.com), which offers Spark on Azure as a provisioned service (with support available). There's a free tier, and it lets you access blob storage with wasb://path/to/files URIs, just like HDInsight.

It doesn't sit on YARN; instead, it is a lightweight, Azure-oriented distribution of Spark.

Disclaimer: I work on the project!

Best wishes,

Andy

Elfie answered 10/9, 2014 at 19:52 Comment(1)
Thank you, this seems like a very valuable solution, since there appears to be no other way to get Spark running on Azure out of the box without doing a lot of configuration work yourself.Luff
