Is gzip format supported in Spark?


Asked 30/4, 2013 at 14:30 Answered 30/4, 2013 at 22:1

Solved java scala mapreduce gzip apache-spark

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS.

However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files.

Is there a way to manually implement reading of gzipped files, or is decompression already done automatically when reading a .gz file?

Buffet answered 30/4, 2013 at 14:30 Comment(0)

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).
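As a minimal sketch of that call in a standalone Scala application: the object name, app name, local master URL, and the countLines helper are placeholders chosen for illustration, and myFile.gz is assumed to exist in the working directory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadGzipExample {
  // Hypothetical helper: counts the lines of a text file.
  // Hadoop selects the gzip codec from the .gz file extension,
  // so no explicit decompression step is needed here.
  def countLines(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // Assumed configuration: local mode, using all available cores.
    val conf = new SparkConf().setAppName("ReadGzipExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"line count: ${countLines(sc, "myFile.gz")}")
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell, where sc is already defined, the same thing reduces to sc.textFile("myFile.gz").count().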

As mentioned by @nick-chammas in the comments:

Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.
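The repartitioning mentioned in that comment could look roughly like the sketch below. The object name, the readAndRepartition helper, the file name, and the target of 8 partitions are all assumptions made for illustration; a suitable partition count depends on the cluster and the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionGzipExample {
  // Hypothetical helper: reads a gzipped text file and redistributes
  // its records across `numPartitions` partitions.
  def readAndRepartition(sc: SparkContext, path: String, numPartitions: Int): RDD[String] = {
    // A single .gz file is not splittable, so it arrives as one partition.
    val raw = sc.textFile(path)
    // repartition() shuffles the records so later stages can run in parallel.
    raw.repartition(numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumed configuration: local mode, using all available cores.
    val conf = new SparkConf().setAppName("RepartitionGzipExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val path = "myFile.gz" // assumed input file
      println(s"partitions as read: ${sc.textFile(path).partitions.length}")
      val spread = readAndRepartition(sc, path, 8)
      println(s"partitions after repartition: ${spread.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

Reading and decompressing the file itself still happens in a single task; only the operations that run after the shuffle are spread across cores.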

Guardian answered 30/4, 2013 at 22:1 Comment(5)

When I try logs = sc.textFile("logs/*.bz2"), I get an error on subsequent logs.count(). Any ideas why? – Orchidectomy 16/3, 2015 at 18:46

@Orchidectomy did you figure it out in the end? I'm getting the following error when loading tar.gz files: JsonParseException: Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens – Malcolm 27/10, 2015 at 17:1

@Leon, from this page: spark.apache.org/docs/latest/programming-guide.html, it says: All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz") I hope that helps. – Orchidectomy 29/10, 2015 at 22:14

I am trying to process something from Google Takeout, but it is one file (.mbox) I want from inside an archive. How can I specify that I want this one file? – Purl 3/4, 2016 at 9:2

It seems Spark checks for the .gz file extension to detect compressed files. I had a compressed file that reads fine with sc.textFile(), but it returns byte strings when I change the extension, e.g. to somefile.gz.bkp – Terpsichorean 27/5, 2016 at 18:54
