I struggled with this challenge for a while, so I decided to run an experiment. I'm using Spark on AWS Glue, but I think this applies to a Spark cluster on EMR as well.
I've noticed several questions on SO with similar challenges (see below). All of us seem to struggle with large gzipped CSV or JSON files. As Tim points out above, these are unsplittable. My experiments also show that, as he suggested, the best way to deal with them is to either process them uncompressed, or split and recompress them.
Here is my approach:
- I timed a read of a 750MB gzipped CSV file (yes, not very large, but it keeps my test fast -- I think the results only magnify the differences as file size grows and as workers are added).
- I used AWS Glue's DynamicFrame as well as the generic Spark CSV reader.
- For each experiment, I did light processing: a show, a printSchema, and a count to materialize the frame.
- For the unsplittable gzip reads that come back as a single partition, I added a repartition step to simulate the split into partitions you'd need anyway before, say, writing out Parquet (a rough sketch of this processing follows this list).
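Here is a rough sketch of that light processing plus the repartition step, assuming a SparkSession `spark` is already available (the path, separator, and partition count are placeholders, not my exact harness):

    # Light processing applied in each experiment (the DynamicFrame variant is analogous)
    df = spark.read.csv(path, sep="\t")

    df.show(5)
    df.printSchema()

    # The unsplittable gzip read comes back as a single partition, so repartition
    # to simulate the split you'd need before, e.g., writing out Parquet.
    if df.rdd.getNumPartitions() == 1:
        df = df.repartition(20)   # partition count is arbitrary for the test

    print(df.count())             # materialize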
My file was about 75% compressed so the resulting decompressed file was around 2.7GB.
My results were:
Fastest was straight Spark or the AWS Glue DynamicFrame preceded by a download-and-decompress step. These were dramatically faster than reading the gzip file directly without first downloading and decompressing it. The Spark frame built from the downloaded, decompressed file had 21 partitions (the DynamicFrame had 42), vs. the single partition that results from an unsplittable gzip file.
The AWS DynamicFrame reading the gzip file without first decompressing it took around 40 minutes, vs. less than a minute for either Spark or the DynamicFrame on the decompressed file!
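For the decompress-first variants, the extra step is just pulling the object down and gunzipping it before pointing the reader at the uncompressed copy. A minimal sketch with boto3 (the /tmp paths are placeholders, and exact download code will vary with your environment and credentials):

    import gzip
    import shutil
    import boto3

    def download_and_decompress(bucket, key, local_gz, local_txt):
        """Download a gzipped S3 object and write an uncompressed copy locally."""
        boto3.client("s3").download_file(bucket, key, local_gz)
        with gzip.open(local_gz, "rb") as src, open(local_txt, "wb") as dst:
            shutil.copyfileobj(src, dst)

    download_and_decompress(
        "commoncrawl",
        "projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/domain/"
        "cc-main-2018-feb-mar-apr-domain-vertices.txt.gz",
        "/tmp/vertices.txt.gz",
        "/tmp/vertices.txt",
    )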
When I tried AWS Glue's DynamicFrame "optimizePerformance" option for the faster SIMD CSV reader, the DynamicFrame never finished processing, even though my tab-delimited CSV seemed to meet the spec.
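For reference, here is a minimal sketch of what enabling that option looks like, assuming it is passed through format_options as in the Glue CSV documentation (the S3 path is a placeholder):

    # Vectorized SIMD CSV reader attempt -- this never completed for me
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/vertices.txt"]},
        format="csv",
        format_options={"separator": "\t", "optimizePerformance": True},
    )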
For my source file, I used a vertices file from the commoncrawl public repository:
"s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/domain/cc-main-2018-feb-mar-apr-domain-vertices.txt.gz"
This is a simple tab-delimited file of single-line records with two columns: a from_id and a to_id.
Here are my split times (in seconds):
    task             splits.t0   splits.t_d   ...   splits.t_t    total
    df_read_gz             NaN          NaN   ...          NaN     69.0
    dyf_read_gz            NaN          NaN   ...          NaN   2266.0
    dyf_read_decomp        NaN         50.0   ...          NaN     50.0
    df_read_decomp         NaN         50.0   ...          NaN     50.0
The two readers were, roughly:

    df = spark.read.csv(path, sep="\t")                       # plain Spark CSV reader
    dyf = glue_context.create_dynamic_frame.from_options(     # AWS Glue DynamicFrame reader
        connection_type="s3", connection_options={"paths": [path]},
        format="csv", format_options={"separator": "\t"})

each run with and without a download/decompress step ahead of the read/transform. All times include download/decompress, extract, and transform time. The single-partition result from the unsplittable gzip read also includes the repartition step.
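As a rough illustration of how each total is composed (the helper here is hypothetical, not my exact harness; download_and_decompress is the sketch from earlier):

    import time

    def total_seconds(*steps):
        """Run each callable in order and return the summed wall-clock seconds."""
        t0 = time.perf_counter()
        for step in steps:
            step()
        return time.perf_counter() - t0

    # e.g. the decompress-first Spark task: download/decompress, then read + count
    t = total_seconds(
        lambda: download_and_decompress(bucket, key, local_gz, local_txt),
        lambda: spark.read.csv(local_txt, sep="\t").count(),
    )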
Test setup was AWS Glue 4.0 running on Docker.
Related SO Questions:
- Using AWS Glue to convert very big csv.gz (30-40 gb each) to parquet
- AWS Glue Crawler - Reading a gzip file of csv
- How do I split / chunk Large JSON Files with AWS glueContext before converting them to JSON?
- Dealing with a large gzipped file in Spark