Py4JJavaError: An error occurred while calling

I am new to PySpark. I have been writing my code against a small test sample, but when I run it on the larger file (3 GB compressed) I keep getting errors from Py4J. My code only does some filtering and joins.

Any help would be useful and appreciated.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

ss = SparkSession \
      .builder \
      .appName("Example") \
      .getOrCreate()

# enable Arrow-based data transfer between the JVM and Python
ss.conf.set("spark.sql.execution.arrow.enabled", 'true')

df = ss.read.csv(directory + '/' + filename, header=True, sep=",")
# Some filtering and groupbys...
df.show()

Error returned:

Py4JJavaError: An error occurred while calling o88.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Caused by: java.lang.OutOfMemoryError: Java heap space

UPDATE: I was using Py4J 0.10.7 and just updated to 0.10.8

UPDATE(1): Adding spark.driver.memory:

ss = SparkSession \
      .builder \
      .appName("Example") \
      .config("spark.driver.memory", "16g") \
      .getOrCreate()

Summarized error output:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:38004)

py4j.protocol.Py4JNetworkError: Answer from Java side is empty
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Py4JError
Py4JError: An error occurred while calling o94.showString

UPDATE(2): I tried this by changing the spark-defaults.conf file. Still getting the error PySpark: java.lang.OutOfMemoryError: Java heap space
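
For reference, the change to conf/spark-defaults.conf was roughly along these lines (it is a plain key/value file; the exact values may not be what matters):

spark.driver.memory     16g
spark.executor.memory   16g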

SEMI-SOLVED: This seemed to be a general memory problem. I started a 2xlarge instance with 32 GB of memory, and the program runs with no errors.

Knowing this, is there another conf option that could help so I don't have to run an expensive instance?
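
Something along these lines is what I have in mind (just a sketch with standard Spark conf keys; I have not verified that any of them avoid the OOM on a smaller instance):

from pyspark.sql import SparkSession

ss = (SparkSession.builder
      .appName("Example")
      .config("spark.driver.memory", "16g")                  # driver JVM heap; only applies if set before the JVM starts
      .config("spark.driver.maxResultSize", "4g")            # cap data collected back to the driver
      .config("spark.sql.shuffle.partitions", "400")         # more, smaller shuffle partitions
      .config("spark.sql.execution.arrow.enabled", "false")  # rule Arrow out as a factor
      .getOrCreate())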

Thanks Everyone.

Kerch answered 6/2, 2019 at 4:13 Comment(6)
How much memory has been allocated to the Driver?Gib
@SurajRamesh I am using an AWS cloud instance. I have used .config("spark.executor.memory", "16g"); it didn't make a difference.Kerch
Try setting spark.driver.memory to 16g. Does your code work for smaller datasets? .config("spark.driver.memory", "16g")Theotheobald
@Theotheobald I took your advice and got a different error: Py4JError: An error occurred while calling o94.showStringKerch
You may have to post the filtering and groupby methods you are using. Spark's lazy evaluation leads to error messages being shown for the last method when it is earlier methods that are the cause.Theotheobald
You could try allocating more memory to the JVM by increasing the Java heap memory, and then reducing driver memory to see if you can run your application on a smaller instance.Gib

This is a current issue with pyspark 2.4.0 installed via conda. You'll want to downgrade to pyspark 2.3.0 via the conda prompt or a Linux terminal:

    conda install pyspark=2.3.0
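
After the install finishes, you can confirm which version the driver actually picks up:

    import pyspark
    print(pyspark.__version__)  # should print 2.3.0 after the downgrade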
Inconsequential answered 21/2, 2019 at 3:59 Comment(0)

You may not have the right permissions.

I had the same problem when I used the docker image jupyter/pyspark-notebook to run example PySpark code, and it was solved by running as root within the container.

Anyone else using that image can find some tips here.
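
For example (assuming the standard jupyter docker-stacks flags; the port mapping is just the usual notebook default):

    docker run -it --rm -p 8888:8888 --user root -e GRANT_SUDO=yes jupyter/pyspark-notebook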

Dandy answered 27/8, 2020 at 10:3 Comment(0)
