I found that AWS Glue sets up the executor instances with a 5 GB memory limit (--conf spark.executor.memory=5g), and sometimes, on big datasets, the job fails with java.lang.OutOfMemoryError. The same goes for the driver instance: --spark.driver.memory=5g.
Is there any option to increase this value?
The official Glue documentation suggests that Glue doesn't support custom Spark configuration:
There are also several argument names used by AWS Glue internally that you should never set:
--conf — Internal to AWS Glue. Do not set!
--debug — Internal to AWS Glue. Do not set!
--mode — Internal to AWS Glue. Do not set!
--JOB_NAME — Internal to AWS Glue. Do not set!
Any better suggestion on solving this problem?
I tried setting the key to --conf and the value to spark.driver.extraClassPath=s3://temp/jsch-0.1.55.jar to give precedence to the latest jsch jar over the version that Glue selects, but it doesn't work. Am I missing something? So, how should we go about resolving this? – Leak
Despite the AWS documentation stating that the --conf parameter should not be passed, our AWS support team told us to pass --conf spark.driver.memory=10g, which corrected the issue we were having.
You can override the parameters by editing the job and adding job parameters. The key and value I used are here:
Key: --conf
Value: spark.yarn.executor.memoryOverhead=7g
This seemed counterintuitive since the setting key is actually in the value, but it was recognized. So if you're attempting to set spark.yarn.executor.memory the following parameter would be appropriate:
Key: --conf
Value: spark.yarn.executor.memory=7g
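For anyone setting this from code rather than the console, here is a minimal sketch of the same key/value pair. It assumes boto3's Glue client and its start_job_run call; the helper names and the job name are hypothetical. Since AWS documents --conf as internal, whether Glue honors it rests on the reports in this thread, not on a documented guarantee.

```python
def conf_argument(spark_property: str, value: str) -> dict:
    """Build the job-parameter pair described above: key '--conf', value 'name=value'."""
    if not spark_property.startswith("spark.") or "=" in spark_property:
        raise ValueError(f"not a plain Spark property name: {spark_property!r}")
    return {"--conf": f"{spark_property}={value}"}


def start_run_with_conf(job_name: str, spark_property: str, value: str) -> str:
    """Start a Glue job run with the override (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so conf_argument works without boto3 installed

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,  # hypothetical job name supplied by the caller
        Arguments=conf_argument(spark_property, value),
    )
    return response["JobRunId"]


# Same pair as in the answer above.
arguments = conf_argument("spark.yarn.executor.memoryOverhead", "7g")
print(arguments)
```

Calling start_run_with_conf("my-etl-job", "spark.driver.memory", "10g") would pass the value AWS support suggested in the other answer; "my-etl-job" is a placeholder.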
I tried this in the DefaultArguments part: "--conf": "spark.yarn.executor.memory=8g" without luck. The job fails with the message "Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used." I can actually see the parameter in the Job Parameters. – Nielson
I tried setting the key to --conf and the value to spark.driver.extraClassPath=s3://temp/jsch-0.1.55.jar to give precedence to the latest jsch jar over the version that Glue selects, but it doesn't work. Am I missing something? Also, as @rileyss mentioned, the Glue documentation states that --conf cannot be set. So, how should we go about resolving this? – Leak
"spark.driver.memory=8g" – Prankster
- Open Glue > Jobs > Edit your job > Script libraries and job parameters (optional) > Job parameters, near the bottom.
- Set the following: key: --conf value: spark.yarn.executor.memoryOverhead=1024 spark.driver.memory=10g
I hit out-of-memory errors like this when I had a highly skewed dataset. In my case, I had a bucket of JSON files containing dynamic payloads that differed based on the event type indicated in the JSON. I kept hitting out-of-memory errors regardless of whether I used the configuration flags indicated here or increased the DPUs. It turned out that my events were highly skewed, with a couple of event types making up more than 90% of the total dataset. Once I added a "salt" to the event types and broke up the highly skewed data, I no longer hit any out-of-memory errors.
Here's a blog post for AWS EMR that talks about the same Out of Memory error with highly skewed data. https://medium.com/thron-tech/optimising-spark-rdd-pipelines-679b41362a8a
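To illustrate the salting idea, here is a small pure-Python sketch rather than the actual Glue job. The event names, record counts, salt count and partition count are all made up, and crc32 stands in for whatever hash partitioner the engine really uses. It shows how appending a random salt spreads one dominant key over several partitions.

```python
import random
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salt_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Append a random salt so one hot key becomes num_salts distinct keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def partition_sizes(keys, num_partitions: int) -> list:
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skew: one event type is 90% of the records.
events = ["click"] * 9000 + ["purchase"] * 600 + ["signup"] * 400

plain = partition_sizes(events, num_partitions=8)
salted = partition_sizes([salt_key(e, 16, rng) for e in events], num_partitions=8)

print("largest partition without salt:", max(plain))
print("largest partition with salt:   ", max(salted))
```

Note that after a salted group-by or join you need a second, much cheaper step that strips the salt and combines the partial results per original key.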
You can use the Glue G.1X and G.2X worker types, which give more memory and disk space to scale Glue jobs that need high memory and throughput.
You can also edit the Glue job and set the key --conf with the value spark.yarn.executor.memoryOverhead=1024 (or 2048) and spark.driver.memory=10g.
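If you start runs from code, the worker type can be requested per run. This is only a sketch: it assumes boto3's Glue start_job_run call with its WorkerType and NumberOfWorkers parameters, and the helper names, job name and worker count are hypothetical.

```python
ALLOWED_WORKER_TYPES = ("G.1X", "G.2X")  # the two types named in the answer above


def job_run_kwargs(job_name: str, worker_type: str, number_of_workers: int) -> dict:
    """Build keyword arguments for a Glue job run on larger workers."""
    if worker_type not in ALLOWED_WORKER_TYPES:
        raise ValueError(f"unexpected worker type: {worker_type!r}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be positive")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


def start_run(job_name: str, worker_type: str, number_of_workers: int) -> str:
    """Start the run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so job_run_kwargs works without boto3 installed

    glue = boto3.client("glue")
    response = glue.start_job_run(**job_run_kwargs(job_name, worker_type, number_of_workers))
    return response["JobRunId"]


# "my-etl-job" and the worker count are placeholders.
kwargs = job_run_kwargs("my-etl-job", "G.2X", 10)
print(kwargs)
```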
I tried passing --driver-memory 8g and --executor-memory 8g but have not seen any changes. The job still fails with java.lang.OutOfMemoryError when trying to load data over 5 GB. – Josephjosepha
--executor-memory 8g was passed in the run parameters. But since I can only pass script parameters, I see two --executor-memory flags: the first is part of the Spark job run parameters passed by Glue, and the second is mine. Like this: /usr/lib/spark/bin/spark-submit --master yarn --executor-memory 5g ... /tmp/runscript.py script_2018-03-16-11-09-28.py --JOB_NAME XXX --executor-memory 8g
After that, there is a log message like: 18/03/16 11:09:31 INFO Client: Will allocate AM container, with 5632 MB memory including 512 MB overhead – Josephjosepha
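The numbers in that log line, and the "5.5 GB" limit in the YARN error quoted earlier in the thread, fit together as heap plus overhead. The sketch below assumes Spark's documented default overhead of max(384 MB, 10% of the heap); the function name is made up, and the Spark version Glue runs may compute it differently.

```python
def yarn_container_mb(heap_mb: int, overhead_mb=None) -> int:
    """Container size YARN enforces: JVM heap plus memory overhead, in MB."""
    if overhead_mb is None:
        # Assumed Spark default: 10% of the heap, but at least 384 MB.
        overhead_mb = max(384, heap_mb * 10 // 100)
    return heap_mb + overhead_mb


default_limit = yarn_container_mb(5 * 1024)   # 5g heap with default overhead
print(default_limit)                          # 5632, as in the log message
print(default_limit / 1024)                   # 5.5, the limit in the YARN error

# With memoryOverhead=1024, as suggested in one of the answers:
print(yarn_container_mb(5 * 1024, overhead_mb=1024))  # 6144
```

This would explain why an extra --executor-memory placed after the script name changes nothing: the limit is still derived from the 5g that Glue passes to spark-submit.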