Why do I get the "is not a Parquet file" error when reading a parquet file
The following error occurs when reading a parquet file from HDFS:

2020-06-04 14:11:23 WARN  TaskSetManager:66 - Lost task 44.0 in stage 1.0 (TID 3514, 192.168.16.41, executor 1): java.lang.RuntimeException: hdfs://data-hadoop-hdfs-nn.hadoop:8020/somedata/serviceName=someService/masterAccount=ma/siteAccount=sa/systemCode=111/part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [55, 49, 98, 48]

I found similar problems on the internet, but most people were trying to read file types other than parquet. I am 100% sure that this file is written in parquet format, as can be seen in the logs. The filename is part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet.

There is only one job writing into this somedata folder, and it only writes parquet (a Spark Structured Streaming job). The file ending also says it is a parquet file. Other parquet files written by the same job don't throw this error.
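For context, a valid parquet file both starts and ends with the 4-byte magic PAR1, i.e. [80, 65, 82, 49]; the tail bytes the error actually found, [55, 49, 98, 48], decode to the ASCII text "71b0" instead. Below is a minimal Python sketch to check the magic bytes directly (it assumes the file has first been copied out of HDFS, e.g. with hdfs dfs -get):

    # Minimal sketch: verify the 4-byte parquet magic "PAR1" at head and tail.
    # Assumes the file was copied locally first (e.g. hdfs dfs -get <path>).
    def check_parquet_magic(path: str) -> bool:
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, 2)  # whence=2: seek relative to the end of the file
            tail = f.read(4)
        print("head:", list(head), "tail:", list(tail))
        return head == b"PAR1" and tail == b"PAR1"

    # Expected tail: [80, 65, 82, 49] == b"PAR1";
    # the error above found [55, 49, 98, 48] == b"71b0".
    check_parquet_magic("part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet")

If the head check passes but the tail check fails, the file was most likely truncated or overwritten mid-write rather than written in a different format.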

Reed answered 4/6, 2020 at 14:28 Comment(2)
Have you tried this? thatbigdata.blogspot.com/2019/09/… – Coelho
Maybe try using parquet-tools (github.com/apache/parquet-mr/tree/master/parquet-tools) to validate your file is in proper parquet format. – Illustrate
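Complementing the parquet-tools suggestion above, the footer can also be validated from Python with pyarrow (an assumed dependency, not mentioned in the comments); read_metadata parses the footer and raises on anything that is not valid parquet:

    # Hedged sketch: validate a parquet footer with pyarrow (assumed available;
    # not part of the original suggestion). read_metadata parses the footer
    # and raises if the file is not valid parquet.
    import pyarrow.parquet as pq

    try:
        meta = pq.read_metadata("part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet")
        print(meta)  # row groups, schema, created_by, etc.
    except Exception as e:
        print("not a valid parquet file:", e)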

Got this same error today. For us, the problem was that we were generating parquet files larger than 2 GB, which breaks some clients.

https://issues.apache.org/jira/browse/SPARK-24296

Setting the Spark option maxRecordsPerFile to limit the file sizes fixed it for us.
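A sketch of how that can be wired up (the 1,000,000-record cap is an illustrative value, not from the answer); the limit is available both as a session-level setting and as a per-write option:

    # Hedged sketch: cap the number of records per output file so no single
    # parquet file grows past the ~2 GB threshold that breaks some readers.
    # The 1_000_000 cap is illustrative; tune it to your row width.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Session-wide setting, applies to all subsequent writes:
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

    # Or per write, via the DataFrameWriter option:
    df = spark.range(100_000_000)
    df.write.option("maxRecordsPerFile", 1_000_000).parquet("/tmp/somedata")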

Mawkish answered 15/2, 2022 at 20:23 Comment(0)
