I am using an AWS Glue job with PySpark to read more than 10 TB of data from Parquet files in S3, but the job keeps failing while executing a Spark SQL query with the error:
java.io.IOException: No space left on device
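For context, the job is roughly of this shape (the bucket paths, table name, and query below are placeholders, not the actual job code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app").getOrCreate()

# Roughly 10+ TB of Parquet data read from S3 (placeholder path).
df = spark.read.parquet("s3a://s3bucket/input/")
df.createOrReplaceTempView("events")

# The failure happens while a wide query like this runs; the shuffle and
# spill files it produces are written to the workers' local disks.
result = spark.sql("SELECT key, COUNT(*) AS cnt FROM events GROUP BY key")
result.write.mode("overwrite").parquet("s3a://s3bucket/output/")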
On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory, and 64 GB of disk, so I tried increasing the number of workers.
Even after increasing the number of G.1X workers to 50, the job keeps failing with the same error.
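For reference, here is a sketch of applying that scale-up with boto3 (the job name is a placeholder; this is illustrative rather than the exact change I made):

import boto3

glue = boto3.client("glue")

# Fetch the existing definition first: UpdateJob resets anything left out of
# JobUpdate, so required fields such as Role and Command must be carried over.
job = glue.get_job(JobName="my-glue-job")["Job"]

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "2.0"),
        "WorkerType": "G.1X",
        "NumberOfWorkers": 50,
    },
)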
Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?
I tried configuring the property in the SparkSession builder, but Spark still uses the local tmp directory:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
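To see what the session actually resolved, one can print the effective value (a debugging sketch, not part of the original job):

# Print the value Spark resolved for spark.local.dir in this session.
# Note: spark.local.dir must be a local filesystem path; Spark writes shuffle
# and spill files with ordinary file I/O, so an s3a:// URI is not usable here.
print(spark.sparkContext.getConf().get("spark.local.dir", "<not set>"))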