Passing additional jars to Spark via spark-submit

I'm using Spark with MongoDB, and consequently rely on the mongo-hadoop drivers. I got things working thanks to input on my original question here.

My Spark job is running; however, I receive a warning that I don't understand. When I run this command

$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar:/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --jars /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar:/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar my_application.py

it works, but gives me the following warning message

Warning: Local jar /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar:/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar does not exist, skipping.

When I was first trying to get this working, the job wouldn't run at all if I left out those paths. Now, however, it does run if I leave them out:

$SPARK_HOME/bin/spark-submit  my_application.py

Can someone please explain what is going on here? I have looked through similar questions here referencing the same warning, and searched through the documentation.

By setting the options once, are they stored as environment variables or something? I'm glad it works, but I'm wary that I don't fully understand why it does sometimes and not others.

Nablus answered 27/11, 2015 at 16:43 Comment(0)

The problem is that --driver-class-path expects a colon-separated classpath, while --jars expects a comma-separated list of jars:

$SPARK_HOME/bin/spark-submit \
--driver-class-path /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar:/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
--jars /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar,/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar my_application.py
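
For reference, the same separators apply if these settings are moved into Spark's configuration file instead of being passed on the command line. A minimal sketch of conf/spark-defaults.conf using the jar paths from the question (placing them there is an assumption, not part of the original setup):

spark.jars                   /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar,/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar
spark.driver.extraClassPath  /usr/local/share/mongo-hadoop/build/libs/mongo-hadoop-1.5.0-SNAPSHOT.jar:/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar

With those two properties set, a plain $SPARK_HOME/bin/spark-submit my_application.py picks up the jars without any flags, which may be one reason a job can appear to run fine without them if such a configuration is already in place.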
Elconin answered 28/11, 2015 at 0:27 Comment(2)
That solved it, thanks for your help. Since I had the jars colon-separated, does that mean they weren't being passed properly? If so, how come the job still ran? Does that mean the unpassed jar simply isn't needed? – Hymeneal
Regarding the first question, the answer is positive. Regarding the second one, as far as I can tell it is negative. It may work in local mode (classpath set using --driver-class-path), but it doesn't with remote workers, at least in standalone mode. Of course it is not required if these jars are already on the worker classpath. BTW, I've updated the previous answer and the docker image. – Elconin

Adding on top of Zero323's answer, I think a better way of doing this is:

$SPARK_HOME/bin/spark-submit \
--driver-class-path $(echo /usr/local/share/mongo-hadoop/build/libs/*.jar | tr ' ' ':') \
--jars $(echo /usr/local/share/mongo-hadoop/build/libs/*.jar | tr ' ' ',') my_application.py

With this approach you won't accidentally leave any jar out of the classpath, so no warning should appear.
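
A variant of the same idea that keeps the colon/comma distinction from the accepted answer and also picks up the second build directory mentioned in the question (a sketch, assuming those two directories hold all the jars you need):

MONGO_JARS=$(echo /usr/local/share/mongo-hadoop/build/libs/*.jar /usr/local/share/mongo-hadoop/spark/build/libs/*.jar)

$SPARK_HOME/bin/spark-submit \
--driver-class-path $(echo $MONGO_JARS | tr ' ' ':') \
--jars $(echo $MONGO_JARS | tr ' ' ',') my_application.py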

Wroughtup answered 30/4, 2016 at 16:7 Comment(0)
