File already exists error while writing Spark dataframe to S3 using AWS Glue
I'm using this command to write a dataframe to S3:

df.write.option("delimiter", "|") \
    .option("header", True) \
    .option("compression", "gzip") \
    .mode("overwrite") \
    .format("csv") \
    .save("s3://bucketname/metrics/parsed/")

But I always get this error; only the filename changes:

An error occurred while calling o293.save. File already exists:s3://bucketname/metrics/parsed/part-01195-6ef08750-dbf5-41c6-b024-501403820268-c000.csv.gz

Full error:

"Failure Reason": "JobFailed(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1195 in stage 11.0 failed 4 times, most recent failure: 
Lost task 1195.3 in stage 11.0 (TID 3023) (172.36.67.235 executor 9):
 org.apache.hadoop.fs.FileAlreadyExistsException: File already exists

I tried the following, but each attempt failed with the same error:

  1. Adding coalesce(100) to the write command (see the sketch after this list)
  2. Writing to a new destination, with and without the .mode("overwrite") option
  3. Exporting the data in parquet format
  4. Writing with the .mode("append") option
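
For reference, a minimal sketch of attempt 1, assuming the same dataframe and destination path as above (the coalesce count of 100 is the one from my attempt):

    # Attempt 1: reduce the number of output files before writing.
    # Path and options are copied from the original write command.
    (df.coalesce(100)
        .write.option("delimiter", "|")
        .option("header", True)
        .option("compression", "gzip")
        .mode("overwrite")
        .format("csv")
        .save("s3://bucketname/metrics/parsed/"))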

I couldn't find anything helpful for resolving this, except this post, but since I'm using Glue 3.0 (Spark 3.1), that fix shouldn't be applicable.

Zecchino answered 26/10, 2022 at 16:59
It turns out the error displayed by Glue was not the real exception. The task stages did fail with this error, but before that there was an earlier stage failure caused by an exception in the code.

After setting up the Spark UI on Glue, I was able to find the first failure and its cause.

Here's how to set up the Spark UI.
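
A minimal sketch of enabling it with boto3, using the documented --enable-spark-ui and --spark-event-logs-path Glue job parameters; the job name and log path below are placeholders:

    import boto3

    glue = boto3.client("glue")

    # "my-glue-job" is a placeholder; use your own job's name.
    job = glue.get_job(JobName="my-glue-job")["Job"]

    # Turn on Spark event logging so the Spark UI can replay the job
    # and show the first failed stage.
    args = dict(job.get("DefaultArguments", {}))
    args["--enable-spark-ui"] = "true"
    args["--spark-event-logs-path"] = "s3://bucketname/sparkui-logs/"  # placeholder prefix

    glue.update_job(
        JobName="my-glue-job",
        JobUpdate={
            "Role": job["Role"],        # Role and Command are required in JobUpdate
            "Command": job["Command"],
            "DefaultArguments": args,
        },
    )

After a re-run, the event logs land under that S3 prefix, and a Spark history server pointed at the prefix can replay the job.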

Zecchino answered 27/10, 2022 at 16:9
