How to add third-party Java JAR files for use in PySpark

I have some third-party database client libraries in Java. I want to access them through java_gateway.py.

E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:

java_import(gateway.jvm, "org.mydatabase.MyDBClient")

It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:

Py4JError: Trying to call a package

Also, comparing with Hive: the Hive JAR files are not loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.

Harmonics answered 30/12, 2014 at 0:43 Comment(0)

You can add external JARs as arguments to pyspark:

pyspark --jars file1.jar,file2.jar
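Once the shell starts with those JARs on its classpath, the class can be reached through the gateway. A minimal sketch using the hypothetical class name from the question (run inside the pyspark shell, where sc already exists):

# inside a shell started as: pyspark --jars file1.jar,file2.jar
from py4j.java_gateway import java_import
java_import(sc._jvm, "org.mydatabase.MyDBClient")
client = sc._jvm.org.mydatabase.MyDBClient()  # assumes a no-arg constructor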
Recommend answered 12/2, 2015 at 22:24 Comment(2)
Not in a position to check at this moment, but that sounds correct. The errors we were having actually had nothing to do with this, but in any case that does not invalidate your answer. – Harmonics
Note that there are no spaces after the commas! It will fail if you put spaces in there. – Absorbent

You can add the path to the JAR file using the Spark configuration at runtime.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)

Refer to the documentation for more information.
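A minimal sketch of the same setting with the SparkSession builder (Spark 2.x+; same example JAR path as above):

from pyspark.sql import SparkSession

# spark.jars must be set before the session (and its JVM) is created
spark = (SparkSession.builder
    .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    .getOrCreate())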

Enrich answered 28/3, 2018 at 7:0 Comment(2)
Does this require uploading and deploying the JARs to the driver and workers? Is the "/path-to-jar/.." the path on the driver node? – Eisenstein
@justincress Hi, I ran it as a standalone cluster, but I feel the driver is where the JAR files need to be present, as the workers/executors do as told by the driver. – Enrich

You can add --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar your_spark_script.py

or set the environment variable SPARK_CLASSPATH:

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py

Here, your_spark_script.py is written using the PySpark API.
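For reference, a minimal sketch of what such a script might contain (the client class is the hypothetical one from the question):

from pyspark import SparkContext

# the JARs passed via --jars or SPARK_CLASSPATH end up on this JVM's classpath
sc = SparkContext(appName="jar-demo")
client = sc._jvm.org.mydatabase.MyDBClient()  # hypothetical third-party class
sc.stop()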

Parallelogram answered 17/9, 2015 at 5:53 Comment(2)
@stanislav Thanks for your modification. – Parallelogram
I have spark-1.6.1-bin-hadoop2.6 and --jars doesn't work for me. The second option (setting SPARK_CLASSPATH) works. Does anyone have any idea why the first option doesn't work? – Eckert

Apart from the accepted answer, you also have the options below:

  1. If you are in a virtual environment, you can place the JAR inside the PySpark installation bundled there, e.g. lib/python3.7/site-packages/pyspark/jars (see the sketch after this list).

  2. If you want Java itself to discover it, place it under the ext/ directory of your JRE installation.
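A small sketch for locating that jars directory in the active environment (assumes pyspark is importable):

import os
import pyspark

# the jars/ folder that ships inside the installed pyspark package
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))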

Ethics answered 19/5, 2020 at 16:33 Comment(0)

None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions","8")
    .enableHiveSupport()
    .getOrCreate())

# Do this 

spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
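As the comment below notes, addPyFile targets Python dependencies rather than JARs. A hedged alternative for a notebook is to set spark.jars on the builder before the session (and its JVM) exists; the path here is a placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .config("spark.jars", "/path/to/jar/xxxx.jar")  # placeholder; must be set before getOrCreate()
    .enableHiveSupport()
    .getOrCreate())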

Charades answered 26/4, 2019 at 20:41 Comment(1)
addPyFile is for Python dependencies, not JARs: spark.apache.org/docs/0.7.2/api/pyspark/… – Lustig
  1. Extract the downloaded JAR file.
  2. Edit the system environment variables (see the sketch after these steps):
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

E.g., if you have extracted the JAR file into a folder named sparkts on the C drive, its value should be C:\sparkts.

  3. Restart your cluster.
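A minimal sketch of the same idea done from Python rather than the system settings dialog; this assumes your Spark version still honors SPARK_CLASSPATH (it is deprecated in favor of spark.driver.extraClassPath in newer releases):

import os

# must be set before the SparkContext (and its JVM) is launched
os.environ["SPARK_CLASSPATH"] = r"C:\sparkts"

from pyspark import SparkContext
sc = SparkContext(appName="classpath-demo")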
Anisette answered 10/12, 2016 at 22:45 Comment(0)

One more thing you can do is add the JAR to the jars folder of the PySpark installation, usually /python3.6/site-packages/pyspark/jars.

Be careful if you are using a virtual environment: the JAR needs to go into the PySpark installation inside that virtual environment.

This way you can use the JAR without passing it on the command line or loading it in your code.

Lobby answered 26/7, 2018 at 10:55 Comment(2)
I stumbled in here after googling for “add jar to existing sparksession”, so if this works I shall be delighted. Will try it out later today. – Before
Yep, adding the jar to the jars directory worked. I was then able to call a function in my jar that takes an org.apache.spark.sql.DataFrame like this: spark._sc._jvm.com.mypackage.MyObject.myFunction(myPySparkDataFrame._jdf) – Before

I've worked around this by dropping the JARs into a drivers directory and then creating a spark-defaults.conf file in the conf folder. Steps to follow:

To get the conf path:  
cd ${SPARK_HOME}/conf

vi spark-defaults.conf  
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*

Then run your Jupyter notebook.
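A quick sketch to verify from the notebook that the setting was picked up (assumes the session was created after the file was edited):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# should print the value from spark-defaults.conf; raises if the key was never set
print(spark.sparkContext.getConf().get("spark.driver.extraClassPath"))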

Bassoon answered 15/12, 2019 at 12:9 Comment(0)

For loading Java/Scala libraries from PySpark, neither --jars nor spark.jars worked for me in version 2.4.0 and earlier (I didn't check newer versions). I'm surprised how many people claim that it works.

The main problem is that for a class loader retrieved in the following way:

from pyspark.sql import SparkSession

jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.MyClass  # placeholder name; the literal word 'class' is a Python keyword
# or
clazz = jvm.java.lang.Class.forName('my.scala.MyClass')

it works only when you copy the JAR files to ${SPARK_HOME}/jars (this one works for me).

But when your only option is --jars or spark.jars, a different class loader is used (a child class loader), which is set on the current thread. So your Python code needs to look like:

clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
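Put together, a minimal sketch (the JAR path and object name are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars", "/path/to/my-lib.jar")  # placeholder path
    .getOrCreate())
jvm = spark._jvm

object_name = "my.scala.MyObject"  # placeholder; a Scala object's class name ends in $
loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
clazz = loader.loadClass(f"{object_name}$")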

Hope this explains your troubles. Give me a shout if not.

Klingensmith answered 30/7, 2020 at 14:27 Comment(0)