Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": output_dir, "partitionKeys": ["var1", "var2"]},
    format="parquet")

Is there anything like "mode": "overwrite" that replaces my parquet files?

Military answered 24/8, 2018 at 9:47 Comment(0)

Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature.

As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it using Spark instead of Glue:

(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))
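
If the rest of the job still needs a DynamicFrame after the overwrite (for further Glue transforms or sinks), you can convert back with DynamicFrame.fromDF. A minimal sketch, assuming table, output_dir, and glueContext are the same names as in the question:

from awsglue.dynamicframe import DynamicFrame

# Write with Spark to get overwrite semantics.
df = table.toDF()
df.write.mode("overwrite").format("parquet") \
    .partitionBy("var1", "var2").save(output_dir)

# Convert back to a DynamicFrame for any remaining Glue steps.
dyf = DynamicFrame.fromDF(df, glueContext, "overwritten_table")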
Heart answered 25/8, 2018 at 1:15 Comment(6)
Thanks for your answer! – Military
Would this replace all existing files in a partition? Or just those with conflicting names? – Sandberg
It will overwrite all files – Heart
@YuriyBondaruk: Does AWS Glue support this overwrite mode now? Particularly for writes to S3 or DynamoDB. – Thorn
Is the overwrite available now? Couldn't find anything in the AWS Glue documentation. @YuriyBondaruk – Alyssa
Did not find anything. I ended up using Spark data frames instead, as mentioned in the above answer. @Yefet – Alyssa

As mentioned earlier, AWS Glue doesn't support mode="overwrite". But converting a Glue DynamicFrame back to a PySpark DataFrame can cause a lot of issues with big data.

You just need to add a single command, purge_s3_path(), before writing the DynamicFrame to S3.

glueContext.purge_s3_path(s3_path, {"retentionPeriod": 0})
glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": s3_path, "partitionKeys": ["var1", "var2"]},
    format="parquet")

Please refer to the AWS documentation for purge_s3_path.
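
If only some partitions change, one option is to purge just those prefixes rather than the whole table path. A sketch under the assumption that table contains only rows for the partition values being rewritten; values_to_rewrite is a hypothetical list:

# retentionPeriod is in hours; 0 removes every file under the prefix.
# values_to_rewrite is a hypothetical list of var1 values being replaced,
# and table is assumed to contain only rows for those values.
for value in values_to_rewrite:
    glueContext.purge_s3_path(s3_path + "/var1=" + str(value) + "/",
                              {"retentionPeriod": 0})

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": s3_path, "partitionKeys": ["var1", "var2"]},
    format="parquet")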

Kalindi answered 3/2, 2023 at 6:7 Comment(2)
Write the code with more explanation. "partitionKeys" is not clear – Waterish
What kind of issues are you referring to when you say it can cause a lot of issues with big data? – Sinistrodextral

If you don't want your process to overwrite everything under "s3://bucket/table_name", you could use

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("date", "name")
    .save("s3://folder/<table_name>"))

This will only update the "selected" partitions in that S3 location. In my case, I have 30 date-partitions in my DynamicFrame "data".

I'm using Glue 1.0 - Spark 2.4 - Python 2.
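
For reference, here is a fuller sketch of the same approach with the Glue boilerplate the snippet above assumes (spark comes from the GlueContext's session); data, the partition columns, and the S3 path are the same placeholders as above:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Replace only the partitions present in `data`; other partitions under
# the path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("date", "name")
    .save("s3://folder/<table_name>"))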

Estranged answered 23/8, 2019 at 20:9 Comment(1)
Thank you, I couldn't get my job working right until I saw your spark.conf.set usage. – Wayless
