PySpark: java.lang.OutOfMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It runs on only one machine. In my process, I want to collect a huge amount of data, as shown in the code below:

train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                 .filter(lambda x: x[-1] != [])
                 .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                 .groupByKey()
                 .mapValues(list))

When I do

training_data = train_dataRDD.collectAsMap()

It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses the connection with Java and gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is too small. How can I set a bigger limit?

EDIT:

Things that I tried before running: sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

I changed the Spark options as per the documentation here (Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to be working.
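
For illustration, a minimal sketch of the timing involved, assuming a fresh Python process in which no SparkContext (and hence no JVM) has been created yet; the same idea the SparkSession answers further down rely on. Calling sc._conf.set(...) on an already-running context cannot enlarge a heap that has already been allocated; the settings have to be in place before the driver JVM starts:

from pyspark import SparkConf, SparkContext

# These must be set before the SparkContext (and its JVM) exists;
# the values are illustrative, not tuned for the 32 GB machine above.
conf = (SparkConf()
        .setMaster("local[*]")
        .set("spark.driver.memory", "16g")        # heap of the driver JVM
        .set("spark.driver.maxResultSize", "0"))  # 0 disables the result-size cap

sc = SparkContext(conf=conf)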

Sturm answered 1/9, 2015 at 16:45 Comment(3)
Check this question: #21139251 – Acidulent
@bcaceiro: I see a lot of Spark options being set in that post. I don't use Scala; I am using IPython. Do you know if I can set those options from within the shell? – Sturm
@bcaceiro: Updated the question with the suggestion from the post you directed me to. It seems like there is some problem with the JVM. – Sturm

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
# Uncomment spark.driver.memory and change it to suit your use case. I changed it to the value below.
spark.driver.memory 15g
# Press Esc, then type :wq! to save and quit vim.

Close your existing Spark application and re-run it. You will not encounter this error again. :)
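
As a quick sanity check, a small sketch (assuming sc is the active SparkContext in the restarted IPython session) to confirm that the new value actually reached the driver:

# Prints the driver memory the running context was started with, e.g. '15g'.
print(sc.getConf().get("spark.driver.memory", "not set"))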

Sturm answered 3/9, 2015 at 15:42 Comment(6)
Can you change this conf value from the actual script (i.e. set('spark.driver.memory','15g'))? – Cesar
I tried doing it but was not successful. I think it needs a restart with the new global parameters. – Sturm
From the docs on spark.driver.memory: "Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g). Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file." – Pound
I was running the Spark code using SBT run from the IDEA SBT Console; the fix for me was to add -Xmx4096M -d64 to the Java VM parameters that get passed on the SBT Console launch. This is under Other settings -> SBT. – Pound
Spark keeps evolving, so you might have to look into its documentation and find the configuration parameters that control memory allocation. – Sturm
I had to create the $SPARK_HOME/conf/spark-defaults.conf file, but it worked either way. Also, I did not need to restart Spark or anything; I just relaunched my Python application and the setting was immediately applied. – Sennacherib

If you're looking for a way to set this from within a script or a Jupyter notebook, you can do:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
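
A small usage sketch with toy data (assuming the spark session built above), mirroring the groupByKey()/collectAsMap() pattern from the question:

# The RDD API used in the question is reachable through the session's SparkContext.
sc = spark.sparkContext

pairs = sc.parallelize([("python", 1), ("spark", 2), ("python", 3)])
as_map = pairs.groupByKey().mapValues(list).collectAsMap()
print(as_map)  # {'python': [1, 3], 'spark': [2]}
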
Tour answered 17/2, 2020 at 17:44 Comment(1)
Great! Worked seamlessly inside a Jupyter notebook, as expected. Thanks! – Sovereignty

I had the same problem with PySpark (installed with Homebrew). In my case it was installed at the path /usr/local/Cellar/apache-spark.

The only configuration file I had was apache-spark/2.4.0/libexec/python/test_coverage/conf/spark-defaults.conf.

As suggested here, I created the file spark-defaults.conf at /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended the line spark.driver.memory 12g to it.
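
One way (a sketch, assuming a standard Homebrew layout where SPARK_HOME points at the libexec directory) to check from Python which spark-defaults.conf location your install will actually read:

import os

# Spark loads conf/spark-defaults.conf from SPARK_CONF_DIR if that is set,
# otherwise from $SPARK_HOME/conf.
spark_home = os.environ.get("SPARK_HOME", "")
conf_dir = os.environ.get("SPARK_CONF_DIR", os.path.join(spark_home, "conf"))
print(os.path.join(conf_dir, "spark-defaults.conf"))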

Turne answered 9/1, 2019 at 14:59 Comment(0)

I got the same error, and I fixed it by assigning memory to Spark while creating the session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[10]").config("spark.driver.memory", "10g").getOrCreate()

or

spark = SparkSession.builder.appName('test').config("spark.driver.memory", "10g").getOrCreate()
License answered 21/10, 2022 at 20:38 Comment(0)
