How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read data from S3 Parquet files totalling more than 10 TB, but the job kept failing during execution of a Spark SQL query with the error

java.io.IOException: No space left on the device

On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory, and 64 GB of disk, so we tried increasing the number of workers.

Even after increasing the number of Glue workers (G.1X) to 50, the job keeps failing with the same error.

Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?

I tried configuring the property in the SparkSession builder, but Spark still uses the local tmp directory:

from pyspark.sql import SparkSession
SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
Felipe answered 28/12, 2020 at 13:38 Comment(3)
I am not sure if this is the same issue with Glue, but whenever I encountered it on EMR it was due to a lot of logging to log files. Or you could try to increase the disk space allocated to each worker. – Tritium
@Tritium - I didn't find an option to increase the disk space of the Glue worker. Is it possible to do that? – Felipe
S3 is not a real filesystem; don't even try. – Kathlenekathlin

As @Prajappati stated, there are several solutions.

These solutions are described in detail in the AWS blog post that introduces the S3 shuffle feature. I am going to omit the shuffle-configuration tweaking since it is not very reliable. So, basically, you can either:

  • Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.

  • Disaggregate compute and storage, which in this case means using S3 as the storage service for spills and shuffle data.

    At the time of writing, to configure this disaggregation, the job must be configured with the following settings:

    • Glue 2.0 Engine
    • Glue job parameters:
    Parameter | Value | Explanation
    --write-shuffle-files-to-s3 | true | Main parameter (required)
    --write-shuffle-spills-to-s3 | true | Optional
    --conf | spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path> | Optional. If not set, the path --TempDir/shuffle-data will be used instead

    Remember to assign the proper IAM permissions to the job role so it can access the bucket and write under the S3 path provided (or the one configured by default).
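If you set up the job through the API rather than the console, a minimal boto3 sketch of applying these parameters could look like the following (the job name, IAM role, script location and bucket/paths are placeholder assumptions, not values from the question):

import boto3

glue = boto3.client("glue")

# Sketch: create a Glue 2.0 job with the S3 shuffle parameters described above.
# Name, Role, ScriptLocation and the bucket/paths are placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueJobRole",
    GlueVersion="2.0",
    WorkerType="G.1X",
    NumberOfWorkers=50,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--TempDir": "s3://my-bucket/temp/",
        "--write-shuffle-files-to-s3": "true",   # main parameter (required)
        "--write-shuffle-spills-to-s3": "true",  # optional
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-bucket/shuffle-data/",  # optional
    },
)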

Euniceeunuch answered 16/3, 2022 at 17:43 Comment(0)

According to the error message, it appears that the Glue job is running out of disk space when writing a DynamicFrame. As you may know, Spark performs a shuffle on certain operations and writes the intermediate results to local disk; when the shuffle data is too large for the workers' disks, the job fails.

There are two options to consider.

  1. Upgrade your worker type to G.2X and/or increase the number of workers.

  2. Implement the AWS Glue Spark shuffle manager with S3 [1]. To implement this option, you will need to downgrade to Glue version 2.0. The Glue Spark shuffle manager writes the shuffle-file and shuffle-spill data to S3, lowering the probability of your job running out of local disk space and failing. Add the following additional job parameters via these steps:

  1. Open the "Jobs" tab in the Glue console.
  2. Select the job you want to apply this to, then click "Actions" and then "Edit Job".
  3. Scroll down and open the drop-down named "Security configuration, script libraries, and job parameters (optional)".
  4. Under job parameters, enter the following key-value pairs:
  • Key: --write-shuffle-files-to-s3 Value: true
  • Key: --write-shuffle-spills-to-s3 Value: true
  • Key: --conf Value: spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>
  5. Click "Save", then run the job.

Remember to replace the angle brackets <> with the name of the S3 bucket where you would like to store the shuffle data.
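The same parameters can also be supplied for a single run instead of being saved on the job definition; a hedged boto3 sketch (the job and bucket names are placeholders):

import boto3

glue = boto3.client("glue")

# Sketch: override the shuffle parameters for one run only.
# "my-etl-job" and the bucket name are placeholders.
run = glue.start_job_run(
    JobName="my-etl-job",
    Arguments={
        "--write-shuffle-files-to-s3": "true",
        "--write-shuffle-spills-to-s3": "true",
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket/shuffle-data/",
    },
)
print(run["JobRunId"])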

Cuisse answered 28/2, 2022 at 14:2 Comment(2)
How does it perform when you use S3 for shuffle data? – Avoidance
@RedJohn it makes the process faster and it completes successfully. – Cuisse

FWIW, I discovered that the first thing you need to check is that the Spark UI is not enabled on the job: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html. The AWS documentation mentions that the logs generated for the Spark UI are flushed to the S3 path every 30 seconds, but they do not appear to be rotated on the worker. So sooner or later, depending on the workload and worker type, they fill the local disk and the run fails with "Command failed with exit code 10".
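A quick way to check whether a job has the Spark UI turned on is to look at its default arguments, for example with boto3 (the job name here is a placeholder):

import boto3

glue = boto3.client("glue")

# Sketch: inspect a job's default arguments to see if Spark UI event logging is enabled.
# "my-etl-job" is a placeholder.
args = glue.get_job(JobName="my-etl-job")["Job"].get("DefaultArguments", {})
print("Spark UI enabled:", args.get("--enable-spark-ui", "false"))
print("Spark event logs path:", args.get("--spark-event-logs-path", "<not set>"))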

Edita answered 6/12, 2022 at 0:28 Comment(0)

The documentation states that spark.local.dir is used to specify a local directory only.

This error can be addressed by modifying the logging properties or, depending on the cluster manager used, the cluster manager's properties, such as YARN's as described in this answer.
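For illustration, a minimal sketch of a valid spark.local.dir setting, which must be a local path and has to be in place before the SparkContext starts (the path below is hypothetical; on Glue workers this directory is managed by the service and cannot be redirected to S3):

from pyspark.sql import SparkSession

# spark.local.dir is scratch space for shuffle and spill files and must be a local
# filesystem path, not an s3a:// URI. The path below is purely illustrative.
spark = (
    SparkSession.builder
    .appName("app")
    .config("spark.local.dir", "/mnt/large-volume/spark-tmp")
    .getOrCreate()
)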

Glacis answered 10/10, 2021 at 11:4 Comment(0)
