Running custom Java class in PySpark

I'm trying to run a custom HDFS reader class in PySpark. This class is written in Java and I need to access it from PySpark, either from the shell or with spark-submit.

In PySpark, I retrieve the JavaGateway from the SparkContext (sc._gateway).

Say I have a class:

package org.foo.module;

public class Foo {

    public int fooMethod() {
        return 1;
    }

}

I've packaged it into a jar, passed it to pyspark with the --jars option, and then run:

from py4j.java_gateway import java_import

jvm = sc._gateway.jvm
java_import(jvm, "org.foo.module.*")

foo = jvm.org.foo.module.Foo()

But I get the error:

Py4JError: Trying to call a package.

Can someone help with this? Thanks.

Shoreline answered 5/11, 2015 at 12:6 Comment(0)

In PySpark, try the following:

from py4j.java_gateway import java_import
java_import(sc._gateway.jvm, "org.foo.module.Foo")

func = sc._gateway.jvm.Foo()
func.fooMethod()

Make sure that you have compiled your Java code into a runnable jar, and submit the Spark job like so:

spark-submit --driver-class-path "name_of_your_jar_file.jar" --jars "name_of_your_jar_file.jar" name_of_your_python_file.py
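
For reference, one way to produce such a runnable jar from the class in the question (a sketch only; the directory layout, the availability of JDK tools, and the jar name are assumptions):

```shell
# Compile Foo.java and package the class files into a jar
# (paths and names are illustrative, matching the example above)
mkdir -p out
javac -d out org/foo/module/Foo.java
jar cf name_of_your_jar_file.jar -C out .
```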
Darcidarcia answered 1/3, 2016 at 14:12 Comment(2)
Also, remember: if you are adding multiple jars, use classpath syntax (colon-separated) for --driver-class-path and comma separation for --jars.Firstborn
Adding --driver-class-path causes tons of issues for me within AWS / EMR. Just adding --jars was enough for me and fixed tons of issues I saw when also adding the same jar to --driver-class-path (which broke Hive and S3 access, to name a few).Silicone

The problem you've described usually indicates that org.foo.module is not on the driver CLASSPATH. One possible solution is to use spark.driver.extraClassPath to add your jar file. It can be set, for example, in conf/spark-defaults.conf or provided as a command-line parameter.
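
For example (the path and jar name are illustrative), in conf/spark-defaults.conf:

```
spark.driver.extraClassPath  /path/to/name_of_your_jar_file.jar
```

or as a command-line parameter:

```shell
pyspark --conf "spark.driver.extraClassPath=/path/to/name_of_your_jar_file.jar"
```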

On a side note:

  • if the class you use is a custom input format, there should be no need to use the Py4j gateway at all. You can simply use the SparkContext.hadoop* / SparkContext.newAPIHadoop* methods.

  • using java_import(jvm, "org.foo.module.*") looks like a bad idea. Generally speaking, you should avoid unnecessary imports on the JVM; the gateway is not public for a reason, and you really don't want to mess with it, especially when you access the class in a way that makes the import completely obsolete. So drop java_import and stick with jvm.org.foo.module.Foo().
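
Taken together, a minimal sketch of the direct-access approach (this assumes a live PySpark session with a SparkContext sc, and that the jar from the question is already on the driver classpath):

```python
# Sketch only: requires a running PySpark session (sc) with the jar
# on the driver classpath; Foo is the class from the question above.
foo = sc._gateway.jvm.org.foo.module.Foo()  # fully qualified, no java_import needed
print(foo.fooMethod())  # the class above returns 1
```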

Coopersmith answered 6/11, 2015 at 0:35 Comment(3)
Using the classpath option actually worked, and I can use the classes in the Spark driver. However, when I try to use them inside transformations I get different kinds of errors. The SparkContext.hadoop* option doesn't fit my use case: I want to parallelize a list of paths and then apply a transformation that reads those files.Shoreline
Inside transformations? It is not possible (or at least not using this approach).Coopersmith
You can also add it to the classpath by passing it as a command-line parameter with --driver-class-path if you don't want to change your config files.Reid

If you run PySpark locally in an IDE (PyCharm, etc.) and want to use custom classes from a jar, you can put the jar into $SPARK_HOME/jars; it will be added to the classpath used to launch Spark. Check the code snippet in $SPARK_HOME/bin/spark-class2.cmd for details.
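
For example (the jar name is illustrative):

```shell
# Copy your jar into Spark's jars directory so it is picked up
# automatically when Spark launches
cp name_of_your_jar_file.jar "$SPARK_HOME/jars/"
```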

Calmas answered 12/1, 2021 at 3:8 Comment(0)

Rather than --jars, you should use --packages to import packages into your spark-submit action.

Advancement answered 5/11, 2015 at 12:43 Comment(1)
This is not always correct. --packages searches for Maven packages. If a user is attempting to load their own JAR that is not in a Maven repo, --jars is correct.Pentadactyl

© 2022 - 2024 — McMap. All rights reserved.