How to save spark dataframe to parquet without using INT96 format for timestamp columns?
I have a Spark DataFrame that I want to save as Parquet and then load using the parquet-avro library.

There is a timestamp column in my DataFrame that Spark writes as an INT96 timestamp column in Parquet. However, parquet-avro does not support the INT96 format and throws an exception when reading the file.

Is there a way to avoid this? Is it possible to change the format Spark uses when writing timestamps to Parquet to something that Avro supports?

I currently use:

data_frame.write.parquet("path")
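
For context, a minimal sketch that reproduces the default behaviour (the column names and values here are made up for illustration):

from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# any DataFrame with a TimestampType column behaves the same way
data_frame = spark.createDataFrame([(1, datetime(2019, 6, 13, 14, 16))], ["id", "ts"])
data_frame.write.parquet("path")  # by default, Spark writes "ts" as INT96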
Offhand answered 13/6, 2019 at 14:16
Reading the Spark source code, I found the spark.sql.parquet.outputTimestampType property:

spark.sql.parquet.outputTimestampType:
Sets which Parquet timestamp type to use when Spark writes data to Parquet files.
INT96 is a non-standard but commonly used timestamp type in Parquet.
TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch.
TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

So I can do the following:

# set on the session before writing; applies to all subsequent Parquet writes
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
data_frame.write.parquet("path")
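
To verify the written files, one option (assuming pyarrow is installed; the part-file glob pattern is illustrative) is to inspect the physical Parquet schema:

import glob
import pyarrow.parquet as pq

# pick any part file from the output directory
part_file = glob.glob("path/part-*.parquet")[0]
# the printed schema should show an INT64 column annotated as TIMESTAMP(MICROS)
# rather than INT96
print(pq.ParquetFile(part_file).schema)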
Offhand answered 14/6, 2019 at 10:8
Aren't you going to accept your own answer? :P Also thanks, this was helpful. - Cartagena
It's a shame that there isn't an option to just write as String (a sketch of that workaround follows below). It's also strange that Spark 3.0 outputs timestamps as INT96 without asking... Thanks for the answer, it did exactly what I wanted. - Wheelock
Similar issue running Spark 2.4: "Parquet does not support date. See HIVE-6384." Unfortunately this answer changes nothing in my case. - Latour
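
Regarding writing timestamps as plain strings (raised in a comment above), a minimal sketch of that workaround, with an assumed timestamp column named ts and an example format pattern:

from pyspark.sql import functions as F

# render the timestamp as a string column before writing;
# parquet-avro then sees an ordinary String (UTF8) column
as_string = data_frame.withColumn("ts", F.date_format("ts", "yyyy-MM-dd HH:mm:ss"))
as_string.write.parquet("path")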
