For a Big Data project, I'm planning to use Spark, which has some nice features such as in-memory computation for repeated workloads. It can run on local files or on top of HDFS.
However, in the official documentation I can't find any hint on how to process gzipped files. In practice, it can be quite efficient to process .gz files directly instead of unzipping them first.
Is there a way to manually implement reading of gzipped files, or is decompression done automatically when reading a .gz file?
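For illustration, this is the kind of call I'm hoping just works (a minimal sketch; the logs/*.gz path, master URL, and app name are placeholders):

    from pyspark import SparkContext

    # Local test context; master and app name are just placeholders.
    sc = SparkContext("local", "gzip-test")

    # If decompression is automatic, this should yield the uncompressed
    # lines of every .gz file matching the glob.
    logs = sc.textFile("logs/*.gz")
    print(logs.count())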
When I use logs = sc.textFile("logs/*.bz2"), I get an error on a subsequent logs.count(). Any ideas why? – Orchidectomy