Spark-submit not working when application jar is in HDFS
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar on my local filesystem, it works. However, when I copy my application jar to a directory in HDFS, I get the following exception:

Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar. java.lang.ClassNotFoundException: com.example.SimpleApp

Here's the command:

$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar

I'm using Hadoop version 2.6.0 and Spark version 1.2.1.

Epiphysis answered 26/2, 2015 at 10:18 Comment(1)
What did you finally decide here? Did you switch to YARN or find another workaround? Sanjiv, below, was pointing at a bug that seems peripherally relevant. Did you try --deploy-mode cluster? Thanks; interesting bug if it's really a bug, and it doesn't seem to have been directly submitted to JIRA. Perhaps check this. – Anther
The only way it worked for me was when I used

--master yarn-cluster
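
For instance, adapting the asker's command (a sketch; assumes HADOOP_CONF_DIR points at a working YARN/HDFS configuration):

$ ./bin/spark-submit \
  --class com.example.SimpleApp \
  --master yarn-cluster \
  hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar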

Xiaoximena answered 2/4, 2015 at 1:5 Comment(2)
What if they don't want to use YARN? I see this is the accepted answer, yet the OP was trying to use local[*]? Eeen-teresting. – Anther
--master yarn-cluster is not working for me. Following is a snippet of the logs: Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: master yarn-cluster Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: deployMode cluster Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: Warning: Skip remote jar hdfs://locahlost/user/MyUser/Sample-1.0.1Manish-SNAPSHOT.jar. – Edelman
To make a jar stored on HDFS accessible to a Spark job, you have to run the job in cluster mode:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --class <main_class> \
  --master yarn-cluster \
  hdfs://myhost:8020/user/root/myjar.jar

Also, there is a Spark JIRA open for client mode, which is not supported yet:

SPARK-10643: Support HDFS application download in client mode spark submit
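
Until that is resolved, a client-mode workaround (a sketch; the jar path and class come from the question, and /tmp is an arbitrary local destination) is to copy the jar out of HDFS and submit it as a local file:

hdfs dfs -get hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar /tmp/
$SPARK_HOME/bin/spark-submit --class com.example.SimpleApp --master local \
  /tmp/simple-project-1.0-SNAPSHOT.jar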

Holmberg answered 23/2, 2016 at 10:50 Comment(1)
Nice answer; to me this should be accepted :) But you are not showing cluster mode, you are showing YARN. You need --deploy-mode cluster and --master spark://yourmaster:7077 instead of --master yarn-cluster? If the OP said he's using YARN I missed it, though I guess HDFS is a good clue. I think, as stated, the OP is trying to use the Spark job manager and finding a bug with local mode? – Anther
There is a workaround: you can mount the directory in HDFS (which contains your application jar) as a local directory.

I did the same with Azure storage (it should be similar for HDFS).

Example mount command for an Azure file share:

sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
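
For HDFS itself, one option (a sketch; assumes the HDFS NFS Gateway service is running on the NameNode host) is an NFS mount of the HDFS root:

sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl localhost:/ {local directory path}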

Now, in your spark-submit command, you provide the path from the command above:

$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar

Solarize answered 3/3, 2016 at 18:37 Comment(0)
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py

For me it's working; I am using Hadoop 3.3.1 and Spark 3.2.1, and I am able to read the file from HDFS, which suggests newer Spark releases can fetch an application file from HDFS in client mode.

Assimilable answered 25/4, 2022 at 16:57 Comment(0)
Yes, it has to be a local file. I think that's simply the answer.
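
For reference, the same submit succeeds when the jar is referenced on the local filesystem, e.g. with an explicit file:// URL (the local path here is an assumed example):

$ ./bin/spark-submit --class com.example.SimpleApp --master local file:///home/user/simple-project-1.0-SNAPSHOT.jar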

Clea answered 26/2, 2015 at 10:30 Comment(6)
But in the official documentation, it is stated that: "application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." – Epiphysis
@dlim good point. That is worth a question to the user@ mailing list. From skimming the code, it looks like it specifically only allows local files. – Clea
Thanks. I'll try the mailing list for now. – Epiphysis
Was there an answer on the mailing lists? – Applejack
You have to use --master yarn-cluster in your spark-submit, provided that you use YARN as your cluster manager. – Epiphysis
The mailing list is not that useful; when there's an answer it's great, but so many questions go unanswered! They need gamification like SO; it really seems to work. Meanwhile, the answer from Sanjiv seems to have identified SPARK-10643, which deals with this, so you must use --deploy-mode cluster explicitly. Of course local[*] won't work with that. But that bug, now that I look at it, doesn't seem to deal with this directly. – Anther
