Unzip a large ZIP file on Amazon S3 [closed]

I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes can read the file in parallel (i.e. starting from different points) to process it and store it in a database. Optionally the clients may zip the file before uploading.

  1. Am I correct that the ZIP format does not allow decompression of a single file in parallel? That is, there is no way to have multiple processes read the ZIP file from different offsets (maybe with some overlap between blocks) and stream uncompressed data from there?

If I am correct, then I want a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.

  2. Does Amazon provide any services that can perform this task simply? I was hoping that Data Pipeline could do the job, but it seems to have limitations. For example "CopyActivity does not support copying multipart Amazon S3 files" (source) seems to suggest that I can't unzip anything larger than 5GB using that. My understanding of Data Pipeline is very limited so I don't know how suitable it is for this task or where I would look.
  3. Is there any SaaS that does the job? Edit: someone answered this question with their own product https://zipit.run/, which I think was a good answer, but it was downvoted so they deleted it.

I can write code to download, unzip, and multipart upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (to avoid provisioning unneeded resources) but execution time is limited to 60 seconds. Plus the use case seems so simple and generic I expect to find an existing solution.
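
For reference, a rough sketch (not anyone's production code) of that "download, unzip, multipart upload" path follows. It assumes the zip has already been downloaded to local disk and contains a single CSV; the bucket, key, local path, and 16 MB part size are placeholders, and every part except the last must be at least 5 MB for an S3 multipart upload.

// Sketch only: stream the single CSV entry out of a locally downloaded zip
// and upload it to S3 as a multipart upload, 16 MB at a time.
// Requires the AWSSDK.S3 package; names marked "placeholder" are made up.
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

var s3 = new AmazonS3Client();
const string bucket = "client-uploads";   // placeholder
const string csvKey = "big-file.csv";     // placeholder

using var zip = ZipFile.OpenRead("/tmp/big-file.zip");     // placeholder local path
using var csvStream = zip.Entries[0].Open();               // the single CSV inside

var started = await s3.InitiateMultipartUploadAsync(bucket, csvKey);
var partETags = new List<PartETag>();
var buffer = new byte[16 * 1024 * 1024];
int partNumber = 1;
int filled;

while ((filled = await FillAsync(csvStream, buffer)) > 0)
{
    // Upload one full buffer of decompressed CSV as a part.
    var part = await s3.UploadPartAsync(new UploadPartRequest
    {
        BucketName = bucket,
        Key = csvKey,
        UploadId = started.UploadId,
        PartNumber = partNumber,
        InputStream = new MemoryStream(buffer, 0, filled)
    });
    partETags.Add(new PartETag(partNumber++, part.ETag));
}

await s3.CompleteMultipartUploadAsync(new CompleteMultipartUploadRequest
{
    BucketName = bucket,
    Key = csvKey,
    UploadId = started.UploadId,
    PartETags = partETags
});

// Fill the buffer completely (or to end of stream) so that every part
// except the last is the full 16 MB.
static async Task<int> FillAsync(Stream source, byte[] buffer)
{
    int total = 0, read;
    while (total < buffer.Length &&
           (read = await source.ReadAsync(buffer, total, buffer.Length - total)) > 0)
        total += read;
    return total;
}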

Shakti answered 21/9, 2015 at 14:28 Comment(7)
How big are these files before/after zip?Retributive
@Retributive about 10 GB before zip, although we want to be ready for any case in the future.Shakti
Have you considered a format that has faster decompression, such as snappy (quora.com/How-do-LZO-and-Snappy-compare)? Or can you decompress parts of the file in multiple Lambdas (spawned by the initial Lambda event handler), e.g. one top-level folder each?Retributive
@Retributive We're using ZIP because it's well known and easy for clients without technical expertise to create. Not sure what you're saying in the second question but remember it's just a single big CSV file that has been zipped.Shakti
Apologies, forgot that it was a single CSV file. Have you asked AWS Support if raising the 60-second Lambda limit is possible? Personally I'd also investigate snappy - remember that it's also faster to compress so that's an advantage for your customers, especially with a 10GB file.Retributive
Also, have you increased the available RAM for your event handler? Not only would the unzip process have more RAM to use, possibly making it faster, but Lambda gives you more CPU power with larger RAM sizes, also making it faster. See 'How are compute resources assigned to an AWS Lambda function?' at aws.amazon.com/lambda/faqs.Retributive
@Retributive it's not that I've made a lambda and found that it times out. It's that it needs to be able to handle arbitrarily large files in the future. Assume that I need to decompress files in the terabytes.Shakti

@E.J. Brennan is right. I had a chat with AWS Support, and they told me we cannot use Lambda for this operation. The following is the guidance I got from Support:

  • Whenever a file is dropped in S3, trigger a notification to SQS.

  • Have an EC2 instance listen to SQS.

  • Do the unzip there.

  • Add another notification to SQS so that the next Lambda function can do the further processing (a rough C# sketch of such a worker follows this list).
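
Purely for illustration, such a worker might look like this in C#; the queue URL is a placeholder, and the actual download/unzip/re-upload step is left as a hypothetical stub (UnzipToS3).

// Sketch only: long-poll SQS for S3 "object created" notifications and hand
// each one to an unzip routine. Requires AWSSDK.SQS and AWSSDK.S3.
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.SQS;
using Amazon.SQS.Model;

class UnzipWorker
{
    static async Task Main()
    {
        var sqs = new AmazonSQSClient();
        var s3 = new AmazonS3Client();
        const string queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/zip-uploads"; // placeholder

        while (true)
        {
            // Long-poll the queue so the instance isn't busy-waiting.
            var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                MaxNumberOfMessages = 1,
                WaitTimeSeconds = 20
            });

            foreach (var message in response.Messages)
            {
                // The body is the S3 event JSON: parse out bucket/key, then
                // download, unzip, and upload the CSV back to S3 (not shown).
                await UnzipToS3(s3, message.Body);
                await sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle);
            }
        }
    }

    // Hypothetical stub for the actual unzip work.
    static Task UnzipToS3(IAmazonS3 s3, string s3EventJson) => Task.CompletedTask;
}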

Hope it helps someone. I wasted a lot of time solving this issue.

Solution/workaround:

After a longer struggle, I got a solution from my tech lead: we can use AWS Glue to solve this issue. It has more memory available, and it gets the job done.

Hope it helps someone.

Sutherlan answered 21/3, 2018 at 23:18 Comment(4)
I did it with AWS Glue. It takes a very long time (2 minutes) just to unzip a 52 KB file from and back to S3 (this is the cold run, but since my script is used less often than hourly, it's not an option)Gargan
I have a similar requirement to unzip 6 GB files in S3. Could you please tell me the steps involved in using Glue?Cathe
Why did AWS support say you cannot use Lambda?Naphthalene
@Naphthalene Because AWS Lambda has memory limitationsMiracle

Your best bet is probably to have an S3 event notification sent to an SQS queue every time a zip file is uploaded to S3, and to have one or more EC2 instances polling the queue, waiting for files to unzip.

You may only need one running instance to do this, but you could also have an autoscaling policy that spins up more instances if the SQS queue grows too big for a single instance to do the unzipping fast enough (as defined by you).
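
For completeness, the S3-to-SQS wiring itself can be set up once in the console or with the SDK. A sketch with the AWS SDK for .NET follows; the bucket name and queue ARN are placeholders, and the queue's access policy is assumed to already allow S3 to send messages to it.

// Sketch only: send "object created" events for *.zip uploads to an SQS queue.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

class NotificationSetup
{
    public static Task ConfigureAsync(IAmazonS3 s3) =>
        s3.PutBucketNotificationAsync(new PutBucketNotificationRequest
        {
            BucketName = "client-uploads",  // placeholder
            QueueConfigurations = new List<QueueConfiguration>
            {
                new QueueConfiguration
                {
                    Queue = "arn:aws:sqs:us-east-1:123456789012:zip-uploads",  // placeholder
                    Events = new List<EventType> { EventType.ObjectCreatedAll },
                    Filter = new Filter
                    {
                        S3KeyFilter = new S3KeyFilter
                        {
                            FilterRules = new List<FilterRule>
                            {
                                new FilterRule { Name = "suffix", Value = ".zip" }
                            }
                        }
                    }
                }
            }
        });
}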

Cupronickel answered 21/9, 2015 at 18:32 Comment(2)
It's not a bad idea but it's not the level of simplicity I was hoping for / expecting. I still have to write and maintain code and manage instances. I was really thinking there'd be a trivial solution to this. The name "Data Pipeline" is especially suggestive.Shakti
It can't help in the case of big files. Only 512 MB of temporary disk space is available in a Lambda instance, so where do you suggest putting a 12 GB file to unzip it and store the unzipped result before uploading it back to S3?Uel

It's entirely possible to process a 10GB file with a single Lambda invocation, even when the content is in a ZIP file (and even though the maximum memory available to a Lambda is 3GB). It's certainly easier with a plain CSV, as you can use multiple Lambda invocations to read different sections of the file concurrently using Range requests (S3 supports this in the API as well as via plain HTTP if the objects are public).
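
As an illustration of the Range-request idea (this is not code from the linked repo), a ranged read with the AWS SDK for .NET might look like the following; the bucket, key, and byte range are placeholders:

// Sketch only: fetch just the first 8 MB of a large CSV object so that
// several Lambda invocations can each process a different byte range.
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

var s3 = new AmazonS3Client();
using var response = await s3.GetObjectAsync(new GetObjectRequest
{
    BucketName = "client-uploads",                  // placeholder
    Key = "uploads/big-file.csv",                   // placeholder
    ByteRange = new ByteRange(0, 8 * 1024 * 1024)   // only this slice is downloaded
});
using var reader = new StreamReader(response.ResponseStream);
// read and process only the lines that fall within this slice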

I've written a C# stream implementation that demonstrates how this is most easily done. The stream uses the S3 API to get sub-sets of the file, so that only parts of it are held in memory, but presents a standard Stream interface so that System.IO.Compression.ZipArchive can be used to read the contents (normally you'd need a file on disk, or the entire contents in a MemoryStream, to do that).

The GitHub repo includes an example that does what you need, albeit with a smaller (1GB) file and an intentionally under-powered Lambda (configured with just 256MB of memory). See ./Examples/Process1GBWith256MBLambda.

Basically, your code looks something like:

// Requires System.IO.Compression (ZipArchive) and the SeekableS3Stream type
// from the repo; s3, BUCKET, KEY, and FILENAME are defined elsewhere.
// The stream wrapper fetches only the parts of the S3 object actually read.
using var stream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, KEY, 12 * 1024 * 1024, 5);
using var zip = new ZipArchive(stream, ZipArchiveMode.Read);
var entry = zip.GetEntry(FILENAME);
using var file = entry.Open();
using var reader = new StreamReader(file);

string line = null;
while ((line = await reader.ReadLineAsync()) != null) {
   // process each CSV line here
}

No need for (often idle) EC2 instances or other reserved-capacity billed resources. A simple Lambda will do the trick. And, as other posters have mentioned, triggering such a Lambda from an S3 event notification would be a smart move.

Dizzy answered 16/11, 2020 at 7:37 Comment(3)
This approach is very slow when we want to extract files and save them to another S3 bucketThirzia
I’m not sure why you would say that. Any approach will involve both reading the S3 object, and writing extracted content. There isn’t anything magic about an EC2 that would make it faster in that regard. My approach involves only reading the parts of the file actually used, and is often much, much faster as a result. I’ve never seen it be slower.Dizzy
this code is not thread-safe; the moment you try to upload multiple files with threading it breaksThirzia

I'm using an EMR cluster with no applications and only one node (just the master node, no slaves), and having it run a single step that executes a shell script.

The shell script does the following (a C# equivalent of the same steps is sketched after the list):

  1. Download the thezeep.zip file from S3 to the /mnt folder on the master node
  2. Unzip the file contents to /mnt/thezeep/
  3. Upload the extracted files to S3.
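
The step itself is a plain shell script using the AWS CLI and unzip. Purely as an illustration, in the same language as the other code in this thread, those three steps could be sketched in C# with the AWS SDK for .NET roughly as follows (bucket names and paths are placeholders):

// Sketch only: download the zip, extract it locally, upload the results.
using System.IO.Compression;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Transfer;

class UnzipStep
{
    public static async Task RunAsync()
    {
        var transfer = new TransferUtility(new AmazonS3Client());

        // 1. Download the zip from S3 to local storage on the master node.
        await transfer.DownloadAsync("/mnt/thezeep.zip", "input-bucket", "thezeep.zip"); // placeholders

        // 2. Unzip the contents locally.
        ZipFile.ExtractToDirectory("/mnt/thezeep.zip", "/mnt/thezeep");

        // 3. Upload the extracted files back to S3.
        await transfer.UploadDirectoryAsync("/mnt/thezeep", "output-bucket"); // placeholder bucket
    }
}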

The whole process takes 20 minutes to process a 10 GB zip file containing files totalling 100 GB.

When the step is terminated, the EMR cluster shuts down automatically.

N.B.: The downside is that if there is not enough space in the /mnt folder to download and/or unzip the file, the step won't finish on its own. It will sit there waiting for input that you can't give, and you'll have to terminate the cluster manually.
So don't hesitate to add more space to the EBS volume to avoid that issue.

Cran answered 3/7, 2019 at 16:10 Comment(0)

You can use a Lambda function to trigger an AWS Glue job that downloads the file, unzips it, and uploads the result back to S3. This approach is serverless.
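
A minimal sketch of the "Lambda triggers Glue" part, assuming a Glue job (here hypothetically named unzip-to-s3) already exists to do the actual unzip; the job name and argument keys are made up:

// Sketch only: a C# Lambda handler for the S3 "object created" event that
// starts the Glue job with the uploaded object's location as arguments.
// Requires AWSSDK.Glue, Amazon.Lambda.Core, and Amazon.Lambda.S3Events.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.Glue;
using Amazon.Glue.Model;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;

public class ZipUploadedHandler
{
    private readonly IAmazonGlue _glue = new AmazonGlueClient();

    public async Task HandleAsync(S3Event s3Event, ILambdaContext context)
    {
        foreach (var record in s3Event.Records)
        {
            await _glue.StartJobRunAsync(new StartJobRunRequest
            {
                JobName = "unzip-to-s3",  // placeholder job name
                Arguments = new Dictionary<string, string>
                {
                    ["--bucket"] = record.S3.Bucket.Name,
                    ["--key"] = record.S3.Object.Key
                }
            });
        }
    }
}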

Moray answered 10/11, 2020 at 22:44 Comment(2)
Can you please update your comment to be more specific? E.g. what would a Glue job that performs the actions you suggested look like?Mistook
The Lambda container is not limitless. @Lee's solution is much more convenient.Nkvd

You can always use EC2 with active polling, but it will not be a cost-efficient solution.
There are other options as well, such as AWS EMR (Elastic MapReduce) or AWS Glue.
But the most cost-efficient solution is still a Lambda function.

You will not face any storage problems, because no data is stored locally; everything happens on the fly.

Glimmer answered 12/6, 2020 at 5:0 Comment(0)
