I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes read the file in parallel (i.e., starting from different offsets) to process it and store it in a database. Optionally, clients may zip the file before uploading.
- Am I correct that the ZIP format does not allow decompressing a single file in parallel? That is, there is no way for multiple processes to read the ZIP file from different offsets (perhaps with some overlap between blocks) and stream uncompressed data from there?
If I am correct, then I want a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.
- Does Amazon provide any service that can perform this task simply? I was hoping Data Pipeline could do the job, but it appears to have limitations; for example, "CopyActivity does not support copying multipart Amazon S3 files" (source), which seems to suggest I can't unzip anything larger than 5 GB that way. My understanding of Data Pipeline is very limited, so I don't know how suitable it is for this task or where else I would look.
- Is there any SaaS that does the job? Edit: someone answered this question with their own product https://zipit.run/, which I think was a good answer, but it was downvoted, so they deleted it.
I can write code to download, unzip, and multipart-upload the file back to S3 (a rough sketch of what I mean is below), but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (to avoid provisioning unneeded resources), but execution time is limited to 60 seconds. Plus, the use case seems so simple and generic that I'd expect an existing solution.
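For concreteness, here is a minimal sketch of the kind of code I mean, assuming boto3, one CSV member per archive, and made-up bucket/key names:

```python
# Sketch of the "download, unzip, re-upload" approach.
# Assumptions: boto3 is available, each ZIP contains a single CSV member,
# and the bucket/key names below are placeholders.
import tempfile
import zipfile

import boto3

s3 = boto3.client("s3")

def unzip_s3_object(bucket, zip_key, csv_key):
    # The ZIP central directory lives at the end of the archive, so the
    # archive is spooled to a local temp file first rather than streamed.
    with tempfile.TemporaryFile() as tmp:
        s3.download_fileobj(bucket, zip_key, tmp)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as archive:
            member = archive.namelist()[0]  # assume one CSV per archive
            with archive.open(member) as csv_stream:
                # upload_fileobj streams the decompressed data and switches
                # to multipart upload automatically for large objects.
                s3.upload_fileobj(csv_stream, bucket, csv_key)

unzip_s3_object("my-bucket", "uploads/data.zip", "uploads/data.csv")
```

This still pulls the whole archive onto one machine and decompresses it in a single stream, which is exactly the part I'd like to avoid having to provision and scale myself.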