Spark Parquet read error: java.io.EOFException: Reached the end of stream with XXXXX bytes left to read

While reading Parquet files in Spark, you may hit the following error:


Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44, 10.23.5.196, executor 2): java.io.EOFException: Reached the end of stream with 193212 bytes left to read
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)


The error is raised by the following Spark commands:

val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
Tamandua answered 30/10, 2019 at 6:14 Comment(0)

I think you can bypass this issue with

--conf  spark.sql.parquet.enableVectorizedReader=false
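If you prefer to set it in code rather than on spark-submit, the same option can be toggled on an existing session before the read is triggered. A minimal sketch (the truncated S3 path is from the question):

// Fall back to the row-based Parquet reader for this session
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)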
Handclap answered 16/4, 2020 at 8:52 Comment(1)
Careful here, because it could decrease the speed of reading your files significantly: issues.apache.org/jira/browse/SPARK-12854 – Pedro

For me the above didn't do the trick, but the following did:

--conf spark.hadoop.fs.s3a.experimental.input.fadvise=sequential

Not sure why, but what gave me a hint was this issue and some details about the options here.
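For reference, a minimal sketch of setting the same S3A option when the session is built, which is equivalent to passing the --conf flag on spark-submit (the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-sequential-read")  // placeholder name
  // read S3 objects sequentially instead of using random/range reads
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
  .getOrCreate()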

Slow answered 24/8, 2021 at 10:15 Comment(1)
This didn't help me; I am using Parquet with Iceberg. – Adder

For me, the following exceptions were showing up in different Spark apps:

Caused by: java.io.EOFException: Reached the end of stream with 1008401 bytes left to read
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)

and

Caused by: java.io.IOException: could not read page in col [X] optional binary X (UTF8) as the dictionary was missing for encoding PLAIN_DICTIONARY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:571)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:616)

Setting this Spark config

--conf  spark.sql.parquet.enableVectorizedReader=false

solved both issues.
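For completeness, a sketch of how the flag is passed on spark-submit (the class and jar names are placeholders):

spark-submit \
  --conf spark.sql.parquet.enableVectorizedReader=false \
  --class com.example.MyApp \
  my-app.jar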

Donothingism answered 19/2 at 13:4 Comment(0)
