Read Zstandard-compressed file in Spark 2.3.0

Apache Spark supposedly supports Facebook's Zstandard compression algorithm as of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), but I am unable to actually read a Zstandard-compressed file:

$ spark-shell

...

// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.

// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec

// Fully-qualified codec class bypasses the error, but results in corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

What do I need to do in order to read such a file?

Environment is AWS EMR 5.14.0.

Hodgkins asked 15/6, 2018 at 2:16

Per this comment, support for Zstandard in Spark 2.3.0 is limited to internal and shuffle outputs.
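
In other words, on stock Spark 2.3.0 the codec only applies to Spark's own intermediate data. A minimal sketch of what does work on this version (spark.io.compression.codec and spark.shuffle.compress are real Spark settings; the job itself is just illustrative):

$ spark-shell \
    --conf spark.io.compression.codec=zstd \
    --conf spark.shuffle.compress=true

// Shuffle blocks and other internal outputs are now zstd-compressed,
// but spark.read.json("data.zst") still fails on this version.
scala> spark.range(0, 1000000).repartition(8).count()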

Reading or writing Zstandard files relies on Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0; EMR 5.14.0 ships only Hadoop 2.8.3, so that codec is not available there.
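
If your cluster does ship Hadoop 2.9.0+ with libhadoop built with zstd support, reading should go through that Hadoop codec, which is resolved from the .zst file extension. A hedged sketch (io.compression.codecs is the standard Hadoop property for registering codecs; whether ZStandardCodec is already registered by default depends on your distribution):

$ hadoop checknative | grep zstd   # must report "true"
$ spark-shell --conf spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.ZStandardCodec

// The codec is picked up from the .zst extension when the file is split and decompressed
scala> val events = spark.read.json("data.zst")
scala> events.printSchema()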

Hodgkins answered 15/6, 2018 at 18:2
I'm using Hadoop 3.2.2, but when trying to read a zstd file it gives me java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support. Any ideas? Thanks – Jenniejennifer
Me too @cnstlungu. I'm running Hadoop 2.10; hadoop checknative -a reports zstd : false. Maybe the zstd license is not fully open and the Apache team decided to build without it? – Perforce
@DiegoScaravaggi here's how I sorted it out: #67099704 – Jenniejennifer
@Jenniejennifer I think you are right, but I'm not on a 3.x data platform. On my 2.10, when I added the native library I got org.apache.spark.sql.AnalysisException: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat. For now I will postpone the native library, set up a test platform with 3.x, and wait for the Apache Bigtop team's stable 1.6 build. – Perforce
