Read Zstandard-compressed file in Spark 2.3.0

Apache Spark supposedly supports Facebook's Zstandard compression algorithm as of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), but I am unable to actually read a Zstandard-compressed file:

$ spark-shell

...

// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.

// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec

// Fully-qualified codec class bypasses the error, but results in corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

What do I need to do in order to read such a file?

Environment is AWS EMR 5.14.0.

Hodgkins asked 15/6, 2018 at 2:16

Per this comment, support for Zstandard in Spark 2.3.0 is limited to internal and shuffle outputs.
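
In other words, on stock Spark 2.3.0 the codec only applies to Spark's own intermediate data. A minimal sketch of what does work on this version (spark.io.compression.codec and spark.shuffle.compress are real Spark settings; the job itself is just illustrative):

$ spark-shell \
    --conf spark.io.compression.codec=zstd \
    --conf spark.shuffle.compress=true

// Shuffle blocks and other internal outputs are now zstd-compressed,
// but spark.read.json("data.zst") still fails on this version.
scala> spark.range(0, 1000000).repartition(8).count()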

Reading or writing Zstandard files relies on Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0; EMR 5.14.0 ships only Hadoop 2.8.3, so that codec is not available there.
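
If your cluster does ship Hadoop 2.9.0+ with libhadoop built with zstd support, reading should go through that Hadoop codec, which is resolved from the .zst file extension. A hedged sketch (io.compression.codecs is the standard Hadoop property for registering codecs; whether ZStandardCodec is already registered by default depends on your distribution):

$ hadoop checknative | grep zstd   # must report "true"
$ spark-shell --conf spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.ZStandardCodec

// The codec is picked up from the .zst extension when the file is split and decompressed
scala> val events = spark.read.json("data.zst")
scala> events.printSchema()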

Hodgkins answered 15/6, 2018 at 18:2
I'm using Hadoop 3.2.2, but when trying to read a zstd file it gives me java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support. Any ideas? Thanks – Jenniejennifer
Me too @cnstlungu. I'm running Hadoop 2.10; hadoop checknative -a reports zstd : false. Maybe the zstd license is not fully open and the Apache team decided to build without it? – Perforce
@DiegoScaravaggi here's how I sorted it out: #67099704 – Jenniejennifer
@Jenniejennifer I think you are right, but I'm not on a 3.x data platform. On my 2.10, when I added the native library I got org.apache.spark.sql.AnalysisException: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat. For now I will postpone the native library, set up a test platform with 3.x, and wait for the Apache Bigtop team's stable 1.6 build. – Perforce
