ERROR: Unable to find py4j, your SPARK_HOME may not be configured correctly

I'm unable to run the line below in a Jupyter notebook.

findspark.init('home/ubuntu/spark-3.0.0-bin-hadoop3.2')

I'm getting the following error:

    ---------------------------------------------------------------------------
~/.local/lib/python3.6/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    144     except IndexError:
    145         raise Exception(
--> 146             "Unable to find py4j, your SPARK_HOME may not be configured correctly"
    147         )
    148     sys.path[:0] = [spark_python, py4j]

Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

I do have py4j installed, and I also tried adding the lines below to ~/.bashrc:

export SPARK_HOME=/home/ubuntu/spark-3.0.0-bin-hadoop3.2
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Switcheroo answered 25/8, 2020 at 5:55 Comment(1)
Have you tried other versions of Spark? – Narrative

Check that the Spark version you installed is the same one you declare in SPARK_HOME.

For example (in Google Colab), I've installed:

!wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

and then I declare:

os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"

Note that spark-3.0.1-bin-hadoop3.2 must be the same in both places.
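
Putting both pieces together, a minimal sketch of the Colab setup, assuming the same spark-3.0.1 download and that findspark is installed (the extraction step and the findspark.init() call are my additions, not part of the original answer):

import os
import findspark

# The archive fetched by wget still has to be extracted first,
# e.g. with !tar xf spark-3.0.1-bin-hadoop3.2.tgz in a Colab cell.
# The directory name below must match the extracted directory exactly.
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"

# With SPARK_HOME set in the environment, findspark locates py4j on its own.
findspark.init()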

Tremaine answered 2/12, 2020 at 23:4 Comment(3)
This returns the same problem! – Narrative
Check closely which versions you are using now. There must have been some update since. – Marv
This worked for me. Make sure the versions are the same in both lines. – Martinamartindale

The error message suggests that findspark is having trouble locating your SPARK_HOME directory.

I had a look through the source code for findspark, and it's a pretty straightforward error.

Background

The first thing the code does is set a variable spark_python to your SPARK_HOME path followed by /python.

Next, the code looks for the py4j path using the glob module, which finds all pathnames matching the pattern os.path.join(spark_python, "lib", "py4j-*.zip"). In your case that should equate to /home/ubuntu/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.7-src.zip (I made up the py4j version number based on mine, so yours might be slightly different). The code then takes the py4j path from the list returned by glob by selecting the first element. This is why the error is an IndexError: it is raised when no matching py4j path exists, which in turn depends only on SPARK_HOME being properly specified.
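
For reference, here is a sketch of that logic, paraphrased from the traceback above (the hard-coded path stands in for whatever init() was given):

import os
import sys
from glob import glob

spark_home = "/home/ubuntu/spark-3.0.0-bin-hadoop3.2"
spark_python = os.path.join(spark_home, "python")

try:
    # glob() returns a list of matches; [0] raises IndexError if it is empty
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
except IndexError:
    raise Exception(
        "Unable to find py4j, your SPARK_HOME may not be configured correctly"
    )

# Prepend both paths so that `import pyspark` resolves
sys.path[:0] = [spark_python, py4j]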

To solve the problem

The only likely culprit is the specification of SPARK_HOME, which, as you've said, is read into the environment variables from the ~/.bashrc file. So the three things to check are:

  1. That your SPARK_HOME path is correct (check it exists)
  2. That you have a py4j .zip file in /home/ubuntu/spark-3.0.0-bin-hadoop3.2/python/lib/
  3. That there aren't any formatting problems in the SPARK_HOME path specification in the ~/.bashrc file

I use quotes around my exported paths, e.g. export SPARK_HOME="/home/ubuntu/spark-3.0.0-bin-hadoop3.2", but I'm not sure if that makes a difference.
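
A quick way to run all three checks from a notebook cell (a sketch; adjust the path if your install lives elsewhere):

import os
from glob import glob

spark_home = os.environ.get("SPARK_HOME", "")

# Check 3: repr() exposes stray quotes or whitespace picked up from ~/.bashrc
print("SPARK_HOME =", repr(spark_home))

# Check 1: the path should be a real directory
print("exists:", os.path.isdir(spark_home))

# Check 2: there should be at least one py4j zip under python/lib
print("py4j zips:", glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))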

Muzzleloader answered 23/11, 2020 at 22:1 Comment(0)
