Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
When switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900 with this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, 
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404.
You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. 
Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

I tried everything to set the int96RebaseModeInRead config in Glue, and even contacted AWS Support, but it seems that Glue currently overwrites that flag and you cannot set it yourself.
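
A minimal sketch of the obvious attempt, setting the flag on the SparkSession builder as suggested in the comments below; in Glue 3.0 this does not take effect because Glue configures the Spark context itself:

from pyspark.sql import SparkSession

# Setting the rebase flag on the session builder: this is what one would
# normally do, but in Glue 3.0 the flag gets overwritten by Glue's own setup.
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .getOrCreate()
)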

If anyone knows a workaround, that would be great. Otherwise I will stick with Glue 2.0 and wait for the Glue dev team to fix this.

Addie answered 23/8, 2021 at 10:51 Comment(5)
Have you tried to set the conf directly when creating the SparkSession?Tapp
Yes, unfortunately that does not work; setting it via environment variables does not work either.Addie
Can you show what you've tried so far?Tapp
Try --conf as in docs.aws.amazon.com/glue/latest/dg/…Shoeshine
As I said, setting it as an environment variable does not work either.Addie

I made it work by setting the --conf job parameter to:

spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED
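
If you configure jobs programmatically instead of in the console, a hedged boto3 sketch of setting that job parameter could look like this ("my-glue-job" is a placeholder; Role and Command are copied over because update_job replaces the job definition):

import boto3

conf_value = (
    "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED"
)

glue = boto3.client("glue")
job = glue.get_job(JobName="my-glue-job")["Job"]  # placeholder job name
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "3.0"),
        # The job parameter key is literally "--conf"; the value carries the
        # remaining "--conf" pairs, matching the flag string above.
        "DefaultArguments": {**job.get("DefaultArguments", {}), "--conf": conf_value},
    },
)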

This is a workaround though; the Glue dev team is working on a fix, although there is no ETA.

This is also still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All of my jobs that call data_frame.rdd.isEmpty() also failed, don't ask me why.
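
A short sketch of those two quirks, assuming a DynamicFrame named dyf already exists; the head(1) emptiness check is just one possible way to avoid going through .rdd:

df = dyf.toDF()

# .show() on the DynamicFrame itself fails under this setup; on the DataFrame it works.
df.show()

# An emptiness check that does not touch .rdd:
if len(df.head(1)) == 0:
    print("input is empty")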

Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is also a workaround that can be done inside the script:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Get the current SparkConf as set up by Glue
conf = sc.getConf()
# Add the additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Stop and restart the Spark context with the modified conf
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create the Glue context from the restarted SparkContext
glueContext = GlueContext(sc)
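
Building on the restarted context above, a rough end-to-end sketch (database, table and S3 path are placeholders) of writing Parquet with the CORRECTED rebase mode in effect:

# Placeholders: "my_db", "my_table", and the S3 output path.
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# With the rebase flags set to CORRECTED, timestamps before 1900 are written as-is.
dyf.toDF().write.mode("overwrite").parquet("s3://my-bucket/output/")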
Addie answered 3/9, 2021 at 7:24 Comment(10)
Still not working for me. Tried to write a dataframe but it keeps failing no matter the configuration. I tried all possible combinations: setting the confs in the script, setting the confs on the job, setting the confs in both. The only way I made it work is by ignoring the glueContext and using a SparkSession instead, which is very disappointing, as I have Spark 3 in all scripts except the ones where I use some of the glueContext features (bookmarks, read_from_catalogue, etc.)Rip
This is weird, it works for me in > 200 jobs, by setting the --conf key in the job.Addie
Yes, you're right. It worked at last! Most likely a glitch or something - I did not change anything at all, just pressed "Run Job"Rip
@RobertKossendey how did you set it? ThanksValdivia
@RobertKossendey you set it as a long Job parameter. E.g: Key: --conf Value: spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf etc.Pee
Still have the same issue when handling RDDs! Any suggestion pleaseValdivia
Cannot run multiple SparkContexts at once. The stop doesn't seem to workLivingstone
That is weird, it worked for me. Since then we migrated to Databricks, so unfortunately I can not test it anymore.Addie
Awesome this works absolutely fine.Maria
Still works, but my configuration was: val spark = SparkSession.builder.appName("fromCSVtoParquet").master("local[*]").config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED").config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED").getOrCreate()Forman

As of Spark 3.2, the spark.sql.legacy.parquet.* configs are deprecated:

2023-04-03 21:27:13.362 thread=main, log_level=WARN , [o.a.s.s.internal.SQLConf], The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

So you need to use the following Spark configs instead:

conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
Authorization answered 3/4, 2023 at 16:3 Comment(0)

The issue is addressed in the official Glue Developer Guide: see the last bullet item under "Migrating from AWS Glue 2.0 to AWS Glue 3.0".

Jinja answered 22/11, 2021 at 18:10 Comment(0)

In some cases the job can fail even with the required configs properly set. For example, it will fail when reading a DataFrame and then calling the .rdd method. The issue is that in some cases, when reading data using the Glue API with a DynamicFrame as output, Glue uses the RDD API internally, and that code fails because the required flags are not propagated to the execution stage.

In order to fix this issue, we additionally wrap this code with

import org.apache.spark.sql.execution.SQLExecution

// withSQLConfPropagated takes the active SparkSession; GlueContext extends
// SQLContext, so its sparkSession can be passed here.
SQLExecution.withSQLConfPropagated(glueContext.sparkSession) {
  glueContext.getSource(...).getDynamicFrame
}

Then the required SQL configs are propagated properly to the execution stage and taken into account.
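
withSQLConfPropagated is an internal Scala API with no direct PySpark equivalent; as the comments below suggest, a PySpark-side workaround is to force evaluation before touching .rdd. A rough sketch (catalog names are placeholders):

df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table"  # placeholder names
).toDF()

# Force evaluation while the rebase flags are in effect on the driver session,
# before the .rdd call that otherwise fails.
df = df.cache()
df.count()
print(df.rdd.isEmpty())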

Appetizing answered 4/8, 2022 at 13:3 Comment(3)
Is there something similar for PySpark?Lounge
The code is quite simple, so similar functionality can be implemented in Python. Take a look here: github.com/apache/spark/blob/…Appetizing
I solved it by forcing evaluation before the call to the .rdd method.Reexamine
