Why can't PySpark find py4j.java_gateway?

I installed Spark, ran the sbt assembly, and can open bin/pyspark with no problem. However, I am running into problems loading the pyspark module into ipython. I'm getting the following error:

In [1]: import pyspark
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-c15ae3402d12> in <module>()
----> 1 import pyspark

/usr/local/spark/python/pyspark/__init__.py in <module>()
     61
     62 from pyspark.conf import SparkConf
---> 63 from pyspark.context import SparkContext
     64 from pyspark.sql import SQLContext
     65 from pyspark.rdd import RDD

/usr/local/spark/python/pyspark/context.py in <module>()
     28 from pyspark.conf import SparkConf
     29 from pyspark.files import SparkFiles
---> 30 from pyspark.java_gateway import launch_gateway
     31 from pyspark.serializers import PickleSerializer, BatchedSerializer, UTF8Deserializer, \
     32     PairDeserializer, CompressedSerializer

/usr/local/spark/python/pyspark/java_gateway.py in <module>()
     24 from subprocess import Popen, PIPE
     25 from threading import Thread
---> 26 from py4j.java_gateway import java_import, JavaGateway, GatewayClient
     27
     28

ImportError: No module named py4j.java_gateway
Lipscomb answered 23/10, 2014 at 16:46 Comment(3)
I don't know if this is a real answer, but sudo pip install py4j fixed this problem for me. I assume this error comes after you already added SPARK_HOME to your PYTHONPATH?Banneret
I provided an answer to this same (or similar) problem here; it may be helpful to you: #24250347Metzger
I also set my PYTHONPATH to point to all the needed Python dependencies but got the same error. To resolve the problem, I also had to 1) install another copy of py4j in the site-packages folder where regular Python packages are installed, and 2) change the permissions of everything in the py4j folder so YARN executor nodes can read/execute the relevant files.Betelgeuse

I ran into this in my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu). If you look at the pyspark shell script, you'll see that you need a few things added to your PYTHONPATH:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

That worked in IPython for me.

Update: as noted in the comments, the name of the py4j zip file changes with each Spark release, so look around for the right name.
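For example, one quick way to find the exact zip name in your own installation is a small Python check (just a sketch, assuming SPARK_HOME is already exported in your environment):

import glob
import os

# List the py4j source zip bundled with this Spark installation,
# e.g. .../python/lib/py4j-0.8.2.1-src.zip (the version part varies by release).
spark_home = os.environ["SPARK_HOME"]
print(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))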

Paapanen answered 9/12, 2014 at 5:21 Comment(2)
That's export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH in Spark 1.6.0Mummer
The name of the py4j zip file changes with every Spark version, so make sure the zip file you are pointing to in $PYTHONPATH actually exists.Derayne

I solved this problem by adding these paths to .bashrc:

export SPARK_HOME=/home/a141890/apps/spark
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

After this, it no longer raises ImportError: No module named py4j.java_gateway.
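One way to sanity-check that the paths took effect is to open a new Python or IPython session after sourcing .bashrc and look at where the modules are loaded from (a minimal sketch):

# Run in a fresh session after `source ~/.bashrc`.
import py4j
import pyspark

# Both should resolve to files under $SPARK_HOME/python; the py4j path
# typically points inside the py4j-*-src.zip added to PYTHONPATH.
print(py4j.__file__)
print(pyspark.__file__)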

Laflamme answered 16/1, 2015 at 3:35 Comment(3)
I am also facing the same problem. Where should I write these export statements? I tried the command prompt and the IPython notebook; it did not work for me in either of them.Dissonance
I set the Spark path, the Python 2.7 path, and the py4j zip file path as environment system variables. I couldn't solve the issue. When I run from pyspark import SparkContext I get the error.Dissonance
What does adding the ':$PYTHONPATH' do?Theron

Install the py4j module with pip:

pip install py4j

I got this problem with Spark 2.1.1 and Python 2.7.x. I'm not sure whether Spark stopped bundling this package in the latest distributions, but installing the py4j module solved the issue for me.
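If you go this route, it can help to confirm which py4j copy Python actually imports and whether its version matches the zip bundled with your Spark distribution (a rough sketch, assuming SPARK_HOME points at your Spark install; the fallback path is only a guess):

import glob
import os

import py4j

# Where py4j is actually being imported from (pip's site-packages or the bundled zip).
print("py4j imported from:", py4j.__file__)

# The py4j version Spark ships with, read off the bundled zip's file name.
# If the two differ, install the matching py4j version with pip instead.
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")  # adjust the fallback
print("Spark bundles:", glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))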

Coati answered 14/6, 2017 at 9:8 Comment(1)
You have to use the version of py4j that's shipped with Spark. Even upgrades like Spark 2.2 to 2.3 use incompatible versions of py4j.Paperweight

In PyCharm, before running the above script, ensure that you have unzipped the py4j*.zip file and added a reference to it in the script: sys.path.append("path to spark*/python/lib")

It worked for me.
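For instance, a minimal sketch of that setup, assuming Spark lives at /usr/local/spark and the py4j zip has been extracted into python/lib (adjust both paths to your own install):

import sys

# Hypothetical install location; point this at your own Spark directory.
SPARK_HOME = "/usr/local/spark"

sys.path.append(SPARK_HOME + "/python")       # makes the pyspark package importable
sys.path.append(SPARK_HOME + "/python/lib")   # the unzipped py4j-*-src.zip contents live here

from py4j.java_gateway import JavaGateway     # should now import cleanly
from pyspark import SparkContext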

Mcgregor answered 28/7, 2016 at 10:31 Comment(0)
#/home/shubham/spark-1.6.2
import os
import sys

# Set the path for the Spark installation
# (this is the path where you have built Spark using sbt/sbt assembly)
os.environ['SPARK_HOME'] = "/home/shubham/spark-1.6.2"
# os.environ['SPARK_HOME'] = "/home/jie/d2/spark-0.9.1"

# Append to PYTHONPATH so that pyspark can be found
sys.path.append("/home/shubham/spark-1.6.2/python")
sys.path.append("/home/shubham/spark-1.6.2/python/lib")
# sys.path.append("/home/jie/d2/spark-0.9.1/python")

# Now we are ready to import the Spark modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print "Hey nice"
except ImportError as e:
    print "Error importing Spark modules:", e
    sys.exit(1)
Mcgregor answered 28/7, 2016 at 10:44 Comment(0)

To set up PySpark with Python 3.8, add the paths below to your bash profile (on a Mac):

export SPARK_HOME=/Users/<username>/spark-3.0.1-bin-hadoop2.7
export PATH=$PATH:/Users/<username>/spark-3.0.1-bin-hadoop2.7/bin
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH

NOTE: Use the py4j zip path that is present in your downloaded Spark package.

Save the updated bash profile (Ctrl + X to exit and save if you are editing in nano).

Reload the profile: source ~/.bash_profile

Cement answered 26/12, 2020 at 19:55 Comment(1)
Thanks for the detailed explanation. Still not working for me in Windows.Indistinguishable
