Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
When switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900 with this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, 
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404.
You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. 
Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

I tried everything to set the int96RebaseModeInRead config in Glue, and even contacted AWS Support, but it seems that Glue currently overwrites that flag and you cannot set it yourself.
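
A minimal sketch of the obvious attempt, setting the flag on the SparkSession builder as suggested in the comments below; in Glue 3.0 this does not take effect because Glue configures the Spark context itself:

from pyspark.sql import SparkSession

# Setting the rebase flag on the session builder: this is what one would
# normally do, but in Glue 3.0 the flag gets overwritten by Glue's own setup.
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .getOrCreate()
)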

If anyone knows a workaround, that would be great. Otherwise I will stick with Glue 2.0 and wait for the Glue dev team to fix this.

Addie answered 23/8, 2021 at 10:51 Comment(5)
Have you tried to set the conf directly when creating the SparkSession?Tapp
Yes, unfortunately that does not work; setting it via environment variables does not work either.Addie
Can you show what you've tried so far?Tapp
Try --conf as in docs.aws.amazon.com/glue/latest/dg/…Shoeshine
As I said, setting it as an environment variable does not work either.Addie

I made it work by setting the --conf job parameter to:

spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED
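
If you configure jobs programmatically instead of in the console, a hedged boto3 sketch of setting that job parameter could look like this ("my-glue-job" is a placeholder; Role and Command are copied over because update_job replaces the job definition):

import boto3

conf_value = (
    "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED"
)

glue = boto3.client("glue")
job = glue.get_job(JobName="my-glue-job")["Job"]  # placeholder job name
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "3.0"),
        # The job parameter key is literally "--conf"; the value carries the
        # remaining "--conf" pairs, matching the flag string above.
        "DefaultArguments": {**job.get("DefaultArguments", {}), "--conf": conf_value},
    },
)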

This is a workaround though; the Glue dev team is working on a fix, although there is no ETA.

This is also still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All of my jobs that call data_frame.rdd.isEmpty() also failed, don't ask me why.
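
A short sketch of those two quirks, assuming a DynamicFrame named dyf already exists; the head(1) emptiness check is just one possible way to avoid going through .rdd:

df = dyf.toDF()

# .show() on the DynamicFrame itself fails under this setup; on the DataFrame it works.
df.show()

# An emptiness check that does not touch .rdd:
if len(df.head(1)) == 0:
    print("input is empty")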

Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is also a workaround that can be done inside the script:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Get the current SparkConf as set up by Glue
conf = sc.getConf()
# Add the additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Stop and restart the Spark context with the modified conf
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create the Glue context from the restarted SparkContext
glueContext = GlueContext(sc)
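
Building on the restarted context above, a rough end-to-end sketch (database, table and S3 path are placeholders) of writing Parquet with the CORRECTED rebase mode in effect:

# Placeholders: "my_db", "my_table", and the S3 output path.
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# With the rebase flags set to CORRECTED, timestamps before 1900 are written as-is.
dyf.toDF().write.mode("overwrite").parquet("s3://my-bucket/output/")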
Addie answered 3/9, 2021 at 7:24 Comment(10)
Still not working for me. Tried to write a dataframe but it keeps failing no matter the configuration. I tried all possible combinations: setting the confs in the script, setting the confs on the job, setting the confs in both. The only way I made it work is by ignoring the glueContext and using a SparkSession instead, which is very disappointing, as I have Spark 3 in all scripts except the ones where I use some of the glueContext features (bookmarks, read_from_catalogue, etc.)Rip
This is weird, it works for me in > 200 jobs, by setting the --conf key in the job.Addie
Yes, you're right. It worked at last! Most likely a glitch or something - I did not change anything at all, just pressed "Run Job"Rip
@RobertKossendey how did you set it? ThanksValdivia
@RobertKossendey you set it as a long Job parameter. E.g: Key: --conf Value: spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf etc.Pee
Still have the same issue when handling RDDs! Any suggestion pleaseValdivia
Cannot run multiple SparkContexts at once. The stop doesn't seem to workLivingstone
That is weird, it worked for me. Since then we migrated to Databricks, so unfortunately I can not test it anymore.Addie
Awesome this works absolutely fine.Maria
Still works, but my configuration was: val spark = SparkSession.builder.appName("fromCSVtoParquet").master("local[*]").config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED").config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED").getOrCreate()Forman

As of Spark 3.2, the spark.sql.legacy.parquet.* configs are deprecated:

2023-04-03 21:27:13.362 thread=main, log_level=WARN , [o.a.s.s.internal.SQLConf], The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

So you need to use the following Spark configs instead:

conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
Authorization answered 3/4, 2023 at 16:3 Comment(0)

The issue is addressed in the official Glue Developer Guide: see the last bullet item under "Migrating from AWS Glue 2.0 to AWS Glue 3.0".

Jinja answered 22/11, 2021 at 18:10 Comment(0)

In some cases the job can fail even with the required configs properly set. For example, it will fail when reading a DataFrame and then calling the .rdd method. The issue is that in some cases, when reading data using the Glue API with a DynamicFrame as output, Glue uses the RDD API internally, and that code fails because the required flags are not propagated to the execution stage.

In order to fix this issue, we additionally wrap this code with

import org.apache.spark.sql.execution.SQLExecution

// withSQLConfPropagated takes the active SparkSession; GlueContext extends
// SQLContext, so its sparkSession can be passed here.
SQLExecution.withSQLConfPropagated(glueContext.sparkSession) {
  glueContext.getSource(...).getDynamicFrame
}

Then the required SQL configs are propagated properly to the execution stage and taken into account.
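
withSQLConfPropagated is an internal Scala API with no direct PySpark equivalent; as the comments below suggest, a PySpark-side workaround is to force evaluation before touching .rdd. A rough sketch (catalog names are placeholders):

df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table"  # placeholder names
).toDF()

# Force evaluation while the rebase flags are in effect on the driver session,
# before the .rdd call that otherwise fails.
df = df.cache()
df.count()
print(df.rdd.isEmpty())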

Appetizing answered 4/8, 2022 at 13:3 Comment(3)
Is there something similar for PySpark?Lounge
The code is quite simple, so similar functionality can be implemented in Python. Take a look here: github.com/apache/spark/blob/…Appetizing
I solved it by forcing evaluation before the call to the .rdd method.Reexamine
