TypeError: 'JavaPackage' object is not callable (spark._jvm)

I'm setting up GeoSpark Python, and after installing all the prerequisites I'm running the very basic code example to test it.

from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator


spark = SparkSession.builder.\
        getOrCreate()

GeoSparkRegistrator.registerAll(spark)

df = spark.sql("""SELECT st_GeomFromWKT('POINT(6.0 52.0)') as geom""")

df.show()

I tried running it with both python3 basic.py and spark-submit basic.py; both give me this error:

Traceback (most recent call last):
  File "/home/jessica/Downloads/geo_pyspark/basic.py", line 8, in <module>
    GeoSparkRegistrator.registerAll(spark)
  File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 22, in registerAll
    cls.register(spark)
  File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 27, in register
    spark._jvm. \
TypeError: 'JavaPackage' object is not callable

I'm using Java 8, Python 3, and Apache Spark 2.4 on Linux Mint 19. My JAVA_HOME is set correctly, and my SPARK_HOME is also set:

$ printenv SPARK_HOME
/home/jessica/spark/

How can I fix this?

Adiathermancy answered 29/10, 2019 at 13:17

The JARs for GeoSpark are not correctly registered with your Spark session. There are a few ways around this, ranging from a tad inconvenient to pretty seamless. For example, when you call spark-submit you can specify:

--jars jar1.jar,jar2.jar,jar3.jar

then the problem will go away. You can also pass the same flag to pyspark if that's your poison.

If, like me, you don't really want to be doing this every time you boot (and setting this as a .conf() in Jupyter will get tiresome), then you can instead go into $SPARK_HOME/conf/spark-defaults.conf and set:

spark.jars jar1.jar,jar2.jar,jar3.jar

These will then be loaded whenever you create a Spark instance. If you've not used the conf file before, it will only exist as spark-defaults.conf.template; copy it to spark-defaults.conf first.

Of course, when I say jar1.jar..., what I really mean is something along the lines of:

/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar

but that's up to you to get the right ones from the geo_pyspark package.
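
If you'd rather set this in code (for example in a Jupyter notebook) than in spark-defaults.conf, the same property can go on the session builder before the session is created. A minimal sketch, assuming the jar paths above (point them at whatever your geo_pyspark install actually ships):

from pyspark.sql import SparkSession

# Hypothetical paths: substitute the jars shipped with your geo_pyspark install.
geospark_jars = ",".join([
    "/jars/geo_wrapper_2.11-0.3.0.jar",
    "/jars/geospark-1.2.0.jar",
    "/jars/geospark-sql_2.3-1.2.0.jar",
    "/jars/geospark-viz_2.3-1.2.0.jar",
])

# spark.jars is only read when the JVM is launched, so this must run before
# the first getOrCreate() in the process (an already-running session ignores it).
spark = SparkSession.builder \
    .config("spark.jars", geospark_jars) \
    .getOrCreate()

This is the same spark.jars property as in the conf file, just scoped to a single script.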

If you are using EMR: you need to set your cluster config JSON to

[
  {
    "classification":"spark-defaults", 
    "properties":{
      "spark.jars": "/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar"
      }, 
    "configurations":[]
  }
]

and also upload your JARs as part of your bootstrap action. You could pull them from Maven, but I just threw them in an S3 bucket:

#!/bin/bash
sudo mkdir /jars
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar /jars/

If you are using an EMR notebook: you need a magic cell at the top of your notebook:

%%configure -f
{
"jars": [
        "s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar"
    ]
}
Chapen answered 3/2, 2020 at 13:22 Comment(5)
Thank you so much! One addition here: if anyone is installing geospark as a package on the cluster, they can also use the location /usr/local/lib/python3.6/site-packages/geospark/jars/2_4/<JAR_FILE> when specifying spark.jars, because that is the location used on EMR for both Master and Core nodes. – Perpetua
Where can I download geo_wrapper.jar? – Jade
It's been a while, but I think we grabbed it from the geo_pyspark repo; just be sure to get the right version: github.com/Imbruced/geo_pyspark/tree/master/geo_pyspark/jars – Chapen
And just in case you see the same problem in a Databricks notebook, you could install the missing JARs via the UI for the cluster configuration. – Felishafelita
By the way, the error indicates that the Python code was installed (thus the Python imports work) but not the JARs that are used by that Python code. – Felishafelita

I was seeing a similar kind of issue with the SparkMeasure JARs on a Windows 10 machine:

self.stagemetrics = self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
TypeError: 'JavaPackage' object is not callable

So what I did was:

  1. Went to SPARK_HOME, started the PySpark shell, and installed the required JAR:

    bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.16

  2. Grabbed that JAR (ch.cern.sparkmeasure_spark-measure_2.12-0.16.jar) and copied it into the jars folder of SPARK_HOME.

  3. Reran the script, and it now worked without the above error.
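
As an alternative to copying the JAR by hand in step 2, you could let Spark resolve it from Maven at session creation via spark.jars.packages. A rough sketch, using the same coordinate as the --packages flag above:

from pyspark.sql import SparkSession

# Sketch: have Spark fetch the sparkmeasure JAR from Maven instead of copying it manually.
# The coordinate matches the --packages flag used in step 1.
spark = SparkSession.builder \
    .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.16") \
    .getOrCreate()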

Schultz answered 26/8, 2020 at 0:8
