How to minimize the time for unzipping and zipping files in Unix?

Asked 9/10, 2013 at 17:42 Answered 9/6, 2023 at 15:36

I have to unzip the source files and then, after processing, zip (archive) them again. The files are huge, typically around 200-250 GB unzipped (.dat format, 96 files in total). Unzipping takes around 2 hours, and zipping takes another 1.5 to 2 hours, which we cannot afford. Currently I use the "zcat" command to unzip and "gzip -3" to zip the files. Disk space is not an issue, as we have a 1.5 terabyte mount in place. Could you please suggest a more efficient way of doing this?

Looking forward to your suggestions, Thanks - Pushkar.

Memoried answered 9/10, 2013 at 17:42 Comment(3)

Can you do your processing 'in-line'? i.e. gzcat file.gz | ./fixingScript | gzip -9 - > file.tmp.gz && mv file.tmp.gz file.gz ? (Sorry, I don't have time to look up the exact syntax you'd use with zip utilities.) This should essentially cut your processing time down to the longer of the two steps, unzip or re-zip. Or, if this is something you can rearchitect, go for smaller files OR something that can be fed into a large parallel processing system, such as Hadoop and many others. Good luck. – Magel 9/10, 2013 at 18:21
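For what it's worth, the same in-line idea can be sketched in Python with only the standard library. This is a minimal sketch, not a drop-in replacement: `transform_gzip_inline` and its `transform` callback are hypothetical names standing in for `./fixingScript`, and it assumes the data is line-oriented and gzip-compressed.

```python
import gzip
import os


def transform_gzip_inline(path, transform, level=3):
    """Stream a .gz file through transform() line by line, then replace it.

    The uncompressed data is never written to disk: each line is read from
    the compressed source, transformed, and written straight into a new
    compressed temporary file.
    """
    tmp_path = path + ".tmp"
    with gzip.open(path, "rb") as src, \
            gzip.open(tmp_path, "wb", compresslevel=level) as dst:
        for line in src:
            dst.write(transform(line))
    # Swap the new file into place only after it was written completely.
    os.replace(tmp_path, path)
```

Because decompression, processing, and recompression overlap in one pass, the total time should be closer to that of the slowest stage than to the sum of all three.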

Thanks for the response! Actually, after unzipping, I use the files as input to the Informatica tool, so it can't be done in-line. – Memoried 9/10, 2013 at 18:25

Still not sure I understand your use case. If you're not processing the data and then zipping the revised version back up, AND you have plenty of disk space, then how about cp file.zip file.orig.zip && unzip file.zip && load_to_informatica file && rm file && mv file.orig.zip file.zip? That way you keep a copy of your zipped file, unzip it temporarily, and once the unzipped file is loaded, you just delete it and rename the saved copy of the .zip back to file.zip. Good luck. – Magel 9/10, 2013 at 18:40
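A rough Python equivalent of this "keep the compressed copy, decompress only temporarily" idea might look like the sketch below. It assumes gzip-format files as in the question (not .zip), and `with_decompressed_copy` and the `load` callback are hypothetical names; `load` stands in for whatever hands the file to Informatica.

```python
import gzip
import os
import shutil
import tempfile


def with_decompressed_copy(gz_path, load, workdir=None):
    """Decompress gz_path to a temporary file, call load(path), then delete it.

    The compressed original is only read, never modified, so nothing has to
    be recompressed afterwards.
    """
    fd, plain_path = tempfile.mkstemp(suffix=".dat", dir=workdir)
    try:
        with os.fdopen(fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
            # Copy in 1 MiB chunks so memory use stays flat for huge files.
            shutil.copyfileobj(src, dst, 1024 * 1024)
        return load(plain_path)
    finally:
        # Remove the uncompressed copy whether or not load() succeeded.
        os.remove(plain_path)
```

If the processing step does not change the data, this removes the compression pass entirely, which is where roughly half of the reported time goes.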

Try the quiet mode, -q, when unzipping. This can reduce the time considerably if the archive contains many files, since unzip otherwise writes each file name to stdout.

man unzip:

   -q     perform  operations  quietly  (-qq  = even quieter).  Ordinarily
          unzip prints the names of the files it's extracting or  testing,
          the extraction methods, any file or zipfile comments that may be
          stored in the archive, and possibly a summary when finished with
          each  archive.   The -q[q] options suppress the printing of some
          or all of these messages.

Berthoud answered 4/1, 2019 at 10:5 Comment(0)

If disk space is not an issue, then simply don't ever compress. Then you'll never need to decompress either.

You can try pigz to speed things up if you have multiple cores. It is a parallel implementation of gzip that will especially speed up compression.
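If the job is driven from a script, one way to try this is to prefer pigz when it is installed and fall back to gzip otherwise. The sketch below is only an illustration: `compress_command` and `compress_file` are hypothetical helper names, and it assumes `pigz` or `gzip` is on the PATH. The `-p` option sets the number of pigz threads.

```python
import os
import shutil
import subprocess


def compress_command(path, level=3, threads=None):
    """Build a pigz command line if pigz is installed, else a gzip one."""
    pigz = shutil.which("pigz")
    if pigz is None:
        # No pigz on PATH: fall back to single-threaded gzip.
        return ["gzip", f"-{level}", path]
    threads = threads or os.cpu_count() or 1
    return [pigz, f"-{level}", "-p", str(threads), path]


def compress_file(path, level=3, threads=None):
    """Compress path in place (path becomes path.gz); raise if the tool fails."""
    subprocess.run(compress_command(path, level, threads), check=True)
    return path + ".gz"
```

The output is still ordinary gzip format, so downstream tools that read .gz files should not need to change.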

I don't understand why your decompression is so slow compared to your compression. It should be about a factor of three to ten faster. Can you provide the actual code for what you're doing? There must be something wrong there.

By the way, your terminology is incorrect. Zipping and unzipping refer to the .zip format, not the .gz format. You would just say that you compress to and decompress from the gzip format.

Rhatany answered 9/10, 2013 at 20:23 Comment(3)

pigz only speeds up compression, not decompression. – Saharanpur 21/4, 2020 at 11:54

@Saharanpur Actually it does speed it up some. When decompressing, pigz has separate threads for reading, writing, decompression, and CRC calculation. I just did a quick test on a 100 MB gzip file, where decompressing with gzip took about 2.5 seconds, whereas pigz did it in about 1.5 seconds. – Rhatany 23/4, 2020 at 20:22

Man page says: "As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances." – Rhatany 24/4, 2020 at 16:36


Use parallel processing! Depending on the number of CPUs you have available, you can speed up the process by a factor of up to the number of CPUs. You can do this using a bash script. I personally prefer doing it using a Python script. I use the ProcessPoolExecutor class from the concurrent.futures module for this.
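A minimal sketch of what that could look like, assuming each of the files is compressed independently to `<name>.gz`. The function names and the `executor_cls` parameter are made up for illustration; `executor_cls` is only there so a thread pool can be swapped in for comparison.

```python
import gzip
import os
import shutil
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def compress_one(path, level=3):
    """Compress a single file to path + '.gz' and return the new path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, \
            gzip.open(gz_path, "wb", compresslevel=level) as dst:
        # Copy in 1 MiB chunks so memory use stays flat for huge files.
        shutil.copyfileobj(src, dst, 1024 * 1024)
    return gz_path


def compress_all(paths, level=3, workers=None,
                 executor_cls=ProcessPoolExecutor):
    """Compress many files concurrently and return the .gz paths in order."""
    paths = list(paths)
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(compress_one, paths, [level] * len(paths)))
```

When using the default process pool from a script, call `compress_all` under an `if __name__ == "__main__":` guard. The speedup is bounded by the number of cores and by how fast the disk can feed them, so it is worth measuring with a few files before running all 96.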

Amaleta answered 9/6, 2023 at 15:36 Comment(0)
