Why do I get the "is not a Parquet file" error when reading a parquet file
The following error occurs when reading a parquet file from HDFS:

2020-06-04 14:11:23 WARN  TaskSetManager:66 - Lost task 44.0 in stage 1.0 (TID 3514, 192.168.16.41, executor 1): java.lang.RuntimeException: hdfs://data-hadoop-hdfs-nn.hadoop:8020/somedata/serviceName=someService/masterAccount=ma/siteAccount=sa/systemCode=111/part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [55, 49, 98, 48]

I found similar problems on the internet, but most people were trying to read file types other than parquet. I am 100% sure that this file is written in parquet format, as can be seen in the logs. The filename is part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet.

There is only one job writing into this somedata folder, and it only writes parquet (a Spark Structured Streaming job). The file ending also says it is a parquet file. Other parquet files written by the same job don't throw this error.
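For context, a valid parquet file both starts and ends with the 4-byte magic PAR1, i.e. [80, 65, 82, 49]; the tail bytes the error actually found, [55, 49, 98, 48], decode to the ASCII text "71b0" instead. Below is a minimal Python sketch to check the magic bytes directly (it assumes the file has first been copied out of HDFS, e.g. with hdfs dfs -get):

    # Minimal sketch: verify the 4-byte parquet magic "PAR1" at head and tail.
    # Assumes the file was copied locally first (e.g. hdfs dfs -get <path>).
    def check_parquet_magic(path: str) -> bool:
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, 2)  # whence=2: seek relative to the end of the file
            tail = f.read(4)
        print("head:", list(head), "tail:", list(tail))
        return head == b"PAR1" and tail == b"PAR1"

    # Expected tail: [80, 65, 82, 49] == b"PAR1";
    # the error above found [55, 49, 98, 48] == b"71b0".
    check_parquet_magic("part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet")

If the head check passes but the tail check fails, the file was most likely truncated or overwritten mid-write rather than written in a different format.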

Reed answered 4/6, 2020 at 14:28 Comment(2)
Have you tried this? thatbigdata.blogspot.com/2019/09/… – Coelho
Maybe try using parquet-tools (github.com/apache/parquet-mr/tree/master/parquet-tools) to validate your file is in proper parquet format. – Illustrate
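Complementing the parquet-tools suggestion above, the footer can also be validated from Python with pyarrow (an assumed dependency, not mentioned in the comments); read_metadata parses the footer and raises on anything that is not valid parquet:

    # Hedged sketch: validate a parquet footer with pyarrow (assumed available;
    # not part of the original suggestion). read_metadata parses the footer
    # and raises if the file is not valid parquet.
    import pyarrow.parquet as pq

    try:
        meta = pq.read_metadata("part-00170-7ff5ac19-98b7-4a5a-b93d-9e988dff07eb.c000.snappy.parquet")
        print(meta)  # row groups, schema, created_by, etc.
    except Exception as e:
        print("not a valid parquet file:", e)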

Got this same error today. For us, the problem was that we were generating parquet files larger than 2 GB, which breaks some clients.

https://issues.apache.org/jira/browse/SPARK-24296

Setting the Spark option maxRecordsPerFile to limit the file sizes fixed it for us.
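A sketch of how that can be wired up (the 1,000,000-record cap is an illustrative value, not from the answer); the limit is available both as a session-level setting and as a per-write option:

    # Hedged sketch: cap the number of records per output file so no single
    # parquet file grows past the ~2 GB threshold that breaks some readers.
    # The 1_000_000 cap is illustrative; tune it to your row width.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Session-wide setting, applies to all subsequent writes:
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

    # Or per write, via the DataFrameWriter option:
    df = spark.range(100_000_000)
    df.write.option("maxRecordsPerFile", 1_000_000).parquet("/tmp/somedata")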

Mawkish answered 15/2, 2022 at 20:23 Comment(0)
