How to fix the Illegal Parquet type: INT64 (TIMESTAMP_MICROS) error
I use the sqlContext.read.parquet function in PySpark to read Parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-08-26T00:00:13.600Z. It reads fine in Databricks, but it gives an Illegal Parquet type: INT64 (TIMESTAMP_MICROS) error when I try to read it on a Spark cluster. How do I read this new column using the read.parquet function itself?

Currently I use from_unixtime(unix_timestamp(ts,"yyyy-MM-dd HH:mm:ss.SSS"),"yyyy-MM-dd") as ts to convert 2019-08-26T00:00:13.600+0000 to the 2019-08-26 format.

How do I convert 2019-08-26T00:00:13.600Z to 2019-08-26?

Portis answered 28/8, 2019 at 20:54 Comment(2)
I am not able to read the parquet files (through read.parquet(path)) because of the Illegal Parquet type: INT64 (TIMESTAMP_MICROS) error. – Portis
Did you manage to solve this in the end? – Cheryl

It might be because your data was written to Parquet by one system and you are trying to read it from another system, and the two systems use different versions of Parquet.

I had a similar issue. In my case, I prepared data with timestamp columns in Python and saved it using pandas.to_parquet. Later I tried to read the Parquet files in PySpark on EMR and got the error:

org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))

After some searching, it turns out this is caused by the different timestamp resolutions that Pandas and PySpark can handle. My dataset has several timestamp columns. In Pandas, timestamps are represented with nanosecond resolution. However, the Parquet readers used by PySpark may only support timestamps stored with millisecond ("ms") or microsecond ("us") resolution. See https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for additional discussion.

So when I save a Pandas dataframe with timestamp columns, I explicitly coerce the timestamp columns to microsecond resolution. I had to do the coercion in Python for each partition; a sketch is shown below.
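A minimal sketch of that coercion, assuming pyarrow as the Parquet engine and a hypothetical dataframe df with a nanosecond-resolution timestamp column ts (coerce_timestamps and allow_truncated_timestamps are pyarrow writer options passed through pandas.to_parquet):

import pandas as pd

# Hypothetical frame; pd.to_datetime produces nanosecond-resolution timestamps
df = pd.DataFrame({"id": [1], "ts": pd.to_datetime(["2019-08-26T00:00:13.600Z"])})

# Store the timestamp column as microseconds so Spark's Parquet reader accepts it
df.to_parquet(
    "part-0.parquet",                 # placeholder output path
    engine="pyarrow",
    coerce_timestamps="us",
    allow_truncated_timestamps=True,  # drop sub-microsecond digits instead of raising
)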

Clipboard answered 27/10, 2023 at 2:5 Comment(0)

I just wanted to share my fix for this after searching all day. You have to upgrade to a Spark version that includes the fix, or recompile an older version with the fix applied.

Fix Version/s: 3.2.4, 3.3.2, 3.4.0

Then you have to set this config parameter to true: spConfigParams.put("spark.sql.legacy.parquet.nanosAsLong", "true")

Check out this link for more info on the bug: https://issues.apache.org/jira/browse/SPARK-40819
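As a minimal PySpark sketch (the path is only a placeholder), the flag can be set when building the session and the files then read as usual:

from pyspark.sql import SparkSession

# Requires a Spark build that contains the SPARK-40819 fix (3.2.4, 3.3.2, or 3.4.0+)
spark = (
    SparkSession.builder
    .appName("read-nanos-parquet")
    # Read Parquet INT64 nanosecond timestamps as LongType instead of failing
    .config("spark.sql.legacy.parquet.nanosAsLong", "true")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/parquet")  # placeholder path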

Briones answered 11/3, 2024 at 19:11 Comment(3)
In Python, this parameter can be set easily by adding an extra config call while creating the Spark session, as in spark9 = SparkSession.builder.master("local[*]").config('spark.ui.port', '4050').config("spark.sql.legacy.parquet.nanosAsLong", "true").getOrCreate() – Immuno
spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true") – Braley
Thank you! In Databricks SQL this worked for me: SET spark.sql.legacy.parquet.nanosAsLong=true; – Weidman

Here is the Scala version:

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df2 = Seq(("a3fac", "2019-08-26T00:00:13.600Z")).toDF("id", "eventTime")
val df3 = df2.withColumn("eventTime1", to_date(unix_timestamp($"eventTime", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast(TimestampType)))

df3.show(false)
+-----+------------------------+----------+
|id   |eventTime               |eventTime1|
+-----+------------------------+----------+
|a3fac|2019-08-26T00:00:13.600Z|2019-08-26|
+-----+------------------------+----------+

The following expression converts the timezone-formatted timestamp to a date:

to_date(unix_timestamp($"eventTime", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast(TimestampType))

pyspark version:

>>> from pyspark.sql.functions import col, to_date,unix_timestamp
>>> df2=spark.createDataFrame([("a3fac", "2019-08-26T00:00:13.600Z")], ['id', 'eventTime'])
>>> df3=df2.withColumn("eventTime1", to_date(unix_timestamp(col("eventTime"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast('timestamp')))
>>> df3.show()
+-----+--------------------+----------+
|   id|           eventTime|eventTime1|
+-----+--------------------+----------+
|a3fac|2019-08-26T00:00:...|2019-08-26|
+-----+--------------------+----------+
Indiscrimination answered 28/8, 2019 at 21:21 Comment(1)
Is there any implicit way I can read/do this? The way you suggested is explicit, and I have to convert my data into a df, read the path, specify the columns and apply this. Is there any way I can directly read the parquet, using read.parquet itself? – Portis

You can use the to_date API from the functions module:

import pyspark.sql.functions as f

dfl2 = spark.createDataFrame([(1, "2019-08-26T00:00:13.600Z"),]).toDF('col1', 'ts')

dfl2.show(1, False)
+----+------------------------+
|col1|ts                      |
+----+------------------------+
|1   |2019-08-26T00:00:13.600Z|
+----+------------------------+

dfl2.withColumn('date',f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(1, False)

+----+------------------------+----------+
|col1|ts                      |date      |
+----+------------------------+----------+
|1   |2019-08-26T00:00:13.600Z|2019-08-26|
+----+------------------------+----------+

dfl2.withColumn('date',f.to_date('ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).printSchema()
root
 |-- col1: long (nullable = true)
 |-- ts: string (nullable = true)
 |-- date: date (nullable = true)

Shockheaded answered 28/8, 2019 at 21:36 Comment(5)
Is there any implicit way I can read/do this? The way you suggested is explicit, and I have to convert my data into a df, read the path, specify the columns and apply this. Is there any way I can directly read the parquet, using read.parquet itself? – Portis
@Portis Are you getting the error only while reading the parquet file? – Shockheaded
Yes, it's showing me the Illegal Parquet type: INT64 (TIMESTAMP_MICROS) error while trying to read the parquet itself. – Portis
@Portis: OK, then it's a different problem. This happens when the parquet schema has changed from the original one. Can you try setting the timestamp format using the .option API of read, like spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").parquet("/path/to/parquet")? In this case, all timestamp columns should have a similar pattern, though. – Shockheaded
In that case, I'd need to see the file and the schema of the parquet file. A couple of options: 1. Stop schema inference: spark.read.option("inferSchema", "false"). This will give you the date as a string, and once you read it you can cast the data types. 2. If your Parquet files are located in HDFS or S3 like mine, you can try something like the following and analyze the files: HDFS: parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:<PORT>/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet S3: parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet – Shockheaded
