Spark job on YARN cluster fails with exitCode=13
I am a Spark/YARN newbie and I run into exitCode=13 when I submit a Spark job on a YARN cluster. When the job runs in local mode, everything is fine.

The command I used is:

    /usr/hdp/current/spark-client/bin/spark-submit --class com.test.sparkTest --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/hdp/current/spark-client/conf/hive-site.xml /home/user/sparkTest.jar

Spark Error Log:

16/04/12 17:59:30 INFO Client:
         client token: N/A
         diagnostics: Application application_1459460037715_23007 failed 2 times due to AM Container for appattempt_1459460037715_23007_000002 exited with  exitCode: 13
For more detailed output, check application tracking page: http://b-r06f2-prod.phx2.cpe.net:8088/cluster/app/application_1459460037715_23007 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e40_1459460037715_23007_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)


**Yarn logs**

    16/04/12 23:55:35 INFO mapreduce.TableInputFormatBase: Input split length: 977 M bytes.
    16/04/12 23:55:41 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:55:51 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:01 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152f0b4fc0e7488
    16/04/12 23:56:11 INFO zookeeper.ZooKeeper: Session: 0x152f0b4fc0e7488 closed
    16/04/12 23:56:11 INFO zookeeper.ClientCnxn: EventThread shut down
    16/04/12 23:56:11 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 2). 2003 bytes result sent to driver
    16/04/12 23:56:11 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 82134 ms on localhost (2/3)
    16/04/12 23:56:17 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x4508c270df09803
    16/04/12 23:56:17 INFO zookeeper.ZooKeeper: Session: 0x4508c270df09803 closed
    ...
    16/04/12 23:56:21 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
    16/04/12 23:56:21 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
    16/04/12 23:56:21 INFO spark.SparkContext: Invoking stop() from shutdown hook
Bowing answered 10/4, 2016 at 20:50 Comment(6)
Could you share the yarn logs as well (not the whole logs, just the error messages in yarn logs)?Dumbbell
You could get yarn logs: $ yarn logs -applicationId application_1459460037715_18191Dumbbell
Thanks for the response. So it turns out exitCode 10 was because of a classNotFound issue. After a quick fix of that, I encountered the new issue with exit code 13 when the Spark job runs on the YARN cluster. It works well in local mode. I have updated the question as well as the logs so it won't confuse people :)Bowing
Have you set the master in your code? like doing SparkConf.setMaster("local[*]") ?Dumbbell
You are totally right! :) Thanks a lot. I have made the same mistake before in another place, and the exit code was 15. So when it was 13 this time, I didn't even look back at my code, just the log, so dumb.Bowing
Good.. I'll put it as an answer so you can mark your question as answered :)Dumbbell

It seems that you have set the master in your code to local:

    SparkConf.setMaster("local[*]")

You have to leave the master unset in the code and set it later when you issue spark-submit:

    spark-submit --master yarn-client ...
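As a sketch (assuming a Scala job; the object name `SparkTest` and the app name are hypothetical), leaving the master out of the code looks like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; note there is NO setMaster call here.
// The --master flag passed to spark-submit decides where the job runs,
// so the same jar works in local, yarn-client, and yarn-cluster modes.
object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkTest")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}
```

With the master unset in code, you can still pass `--master local[*]` on the command line for local testing.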

Dumbbell answered 13/4, 2016 at 17:46 Comment(2)
What if I want to submit with --master yarn --deploy-mode cluster ... it's giving an error.Synecdoche
What is the error? Because this is the new way in spark-submit for version 2+, so it shouldn't give an error.Dumbbell

If it helps someone

Another possibility for this error is specifying the --class parameter incorrectly.
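For reference (names here reuse the question's hypothetical example; adjust to your build), --class must be the fully qualified name of the main class packaged inside the jar:

```shell
# The value of --class must match the package + class name in the jar.
spark-submit \
  --class com.test.sparkTest \
  --master yarn \
  --deploy-mode cluster \
  /home/user/sparkTest.jar

# One way to double-check the class actually exists in the jar:
jar tf /home/user/sparkTest.jar | grep -i sparkTest
```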

Ogham answered 10/5, 2019 at 20:56 Comment(2)
What is this class parameter? And how do you know your class param?Shadrach
This is the main class that you want to execute with Spark; maybe this question can help you: #50206121Ogham

I had exactly the same problem but the above answer didn't work. Alternatively, when I ran this with spark-submit --deploy-mode client everything worked fine.

Squirt answered 16/8, 2019 at 1:18 Comment(5)
Does anyone understand the reason for this?Riles
Yes, this solved my problem. I was using spark-submit --deploy-mode cluster, but when I changed it to client, it worked fine. In my case, I was executing SQL scripts using a python code, so my code was not "spark dependent", but I am not sure what will be the implications of doing this when you want multiprocessing.Tune
--deploy-mode client only runs spark on the master driver node. This does not use Spark's worker (core & task) nodes. Instead, use --deploy-mode cluster to distribute work across workers.Petey
@BeerIsGood, that's only true of the single-threaded code you run. Any actual spark operations (reads, writes, maps, filters, etc) are distributed by the master node across the entire cluster, even in client mode. The difference between client and cluster modes is how the work gets submitted to the cluster and which nodes get used for what.Mundell
In case it's not obvious, piecing several answers together: you can get this error when your jars aren't available on the worker nodes but your job tries to access them there (via cluster mode). If they are available on the "master" node, then switching to client mode will work. It's a bad error message, made more confusing if you happen to be submitting "Steps" remotely to an AWS EMR cluster from another machine (or some other cloud provider's managed Spark/Hadoop service), because from that perspective you're accessing the "master" node as a server with your AWS CLI client.Troublesome

This exit code 13 is a tricky one...

For me it was a SyntaxError: invalid syntax in one of the scripts imported downstream of the spark-submit call.
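A small helper (a sketch; the file paths are whatever scripts your job ships) that byte-compiles Python scripts locally, so a SyntaxError surfaces before spark-submit buries it in container logs:

```python
import py_compile

def check_syntax(paths):
    """Return (path, error) pairs for scripts that fail to byte-compile."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns compile problems into PyCompileError
            # instead of printing them and returning None.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((path, str(err)))
    return failures
```

Running `check_syntax` over the driver script and everything it imports catches the invalid-syntax case described above without a round trip to the cluster.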

When debugging this on AWS, if the spark-submit was not initialized properly, you will not find the error on the Spark History Server; you will have to find it in the Spark logs: EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.

Dives answered 30/8, 2022 at 21:14 Comment(1)
Lesson learned: exit code 13 could mean a very wide variety of different errors in the spark job parameters. Bummer.Troublesome

I got this same error running a SparkSQL job in cluster mode. None of the other solutions worked for me, but looking in the job history server logs in Hadoop, I found this stack trace:

20/02/05 23:01:24 INFO hive.metastore: Connected to metastore.
20/02/05 23:03:03 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
...


Looking at the Spark source code, you'll find that basically the AM timed out waiting for the spark.driver.port property to be set by the thread executing the user class.
So it could be a transient issue, or you should investigate your code for the cause of the timeout.
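If the timeout turns out to be transient (e.g. a slow metastore connection), the wait can be lengthened: `spark.yarn.am.waitTime` is the Spark-on-YARN setting behind the 100000 ms in the log above, and it applies in cluster mode. A sketch (the class and jar names are placeholders):

```shell
# Give the AM more time to wait for the user class to create the SparkContext.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.example.Main \
  app.jar
```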

Octamerous answered 6/2, 2020 at 0:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.