python zipfile module doesn't seem to be compressing my files

3

99

I made a little helper function:

import zipfile

def main(archive_list=[],zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w")
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()  

The problem is that my files are NOT being COMPRESSED! The files stay the same size and, effectively, only the extension is being changed to ".zip" (from ".xls" in this case).

I'm running python 2.5 on winXP sp2.

Louise answered 12/11, 2010 at 15:56 Comment(0)
190

This is because ZipFile requires you to specify the compression method. If you don't specify it, it assumes the compression method to be zipfile.ZIP_STORED, which only stores the files without compressing them. You need to specify the method to be zipfile.ZIP_DEFLATED. You will need to have the zlib module installed for this (it is usually installed by default).

import zipfile

def main(archive_list=[],zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w", zipfile.ZIP_DEFLATED) # <--- this is the change you need to make
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()  

Update: According to the documentation (Python 3.7), you must pass a value for the 'compression' argument to override the default, which is ZIP_STORED. The available options are ZIP_DEFLATED, ZIP_BZIP2 and ZIP_LZMA, which require the zlib, bz2 and lzma modules respectively.
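For readers on Python 3, the same fix can be sketched roughly as follows. This is not the answerer's code: the function name make_zip, the sample file and the use of arcname (which stores only the file name, not its directory) are illustrative assumptions.

```python
import os
import tempfile
import zipfile


def make_zip(archive_list, zfilename="default.zip"):
    """Write each path in archive_list into zfilename using DEFLATE."""
    # ZIP_DEFLATED needs the zlib module; ZipFile raises RuntimeError without it.
    with zipfile.ZipFile(zfilename, "w", compression=zipfile.ZIP_DEFLATED) as zout:
        for fname in archive_list:
            # Assumption: store only the base name inside the archive.
            zout.write(fname, arcname=os.path.basename(fname))
    return zfilename


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical, highly repetitive sample file so the effect is visible.
        sample = os.path.join(tmp, "sample.txt")
        with open(sample, "w") as fh:
            fh.write("spam and eggs\n" * 10000)
        target = make_zip([sample], os.path.join(tmp, "out.zip"))
        print(os.path.getsize(sample), "->", os.path.getsize(target))
```

The with block closes the archive even if writing a file fails, which the explicit close() call in the original does not guarantee.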

Backsword answered 12/11, 2010 at 16:1 Comment(4)
what a terrible default! Why?! - Osmious
Because the zlib module is not always available, especially in sandboxed installations. - Backsword
I ran into the same issue with zip files. I have to admit my fault was not reading documentation before trying example code from python docs. I think example code should include ZIP_DEFLATED parameter to make it less confusing. - Triphammer
If you use a ZipInfo() while writing to the ZipFile, you must also set zip_info.compress_type = ZIP_DEFLATED. - Deafmute
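The ZipInfo point in the last comment can be checked with a small in-memory sketch. The file names and payload are made up for illustration, and the helper name write_both is hypothetical.

```python
import io
import zipfile


def write_both(payload):
    """Write payload twice via ZipInfo; return compressed size per entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zout:
        # A hand-built ZipInfo starts as ZIP_STORED, whatever the archive setting is.
        stored = zipfile.ZipInfo("stored.txt")
        zout.writestr(stored, payload)

        # Setting compress_type on the ZipInfo makes this entry use DEFLATE.
        deflated = zipfile.ZipInfo("deflated.txt")
        deflated.compress_type = zipfile.ZIP_DEFLATED
        zout.writestr(deflated, payload)

    with zipfile.ZipFile(buffer) as zin:
        return {info.filename: info.compress_size for info in zin.infolist()}


if __name__ == "__main__":
    sizes = write_both(b"the same line over and over\n" * 2000)
    for name, size in sizes.items():
        print(name, size)
```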
17

Hope this is useful to someone. I tested all zip modes and benchmarked them on two data sets: one small (~30 MB) and one large (~1.5 GB). Each consisted of various types of files so it would be as close to a real-life scenario as possible. I ran two kinds of test on each data set: the “proportional” one and the “complete” one. Both tests were repeated 3 times in a row to get an average. The results may differ depending on your machine, but I think it’s still a good place to start.

I tested two methods because I’m building my own specialized backup solution. The proportional method creates more zip files, but it lets me transfer smaller packages of data when necessary, e.g. replacing only the things that changed. It's more complicated than that, but that is not important right now.

[Figure: the proportional method explanation]

The complete method simply compresses the whole folder into one archive.

[Figure: the complete method explanation]

Compression ratio calculation:

size_difference = source_size - compressed_size

compression_ratio = (size_difference * 100.0) / source_size

Basically the higher that number the better.
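As a small sketch, the calculation above can be wrapped in a function. The function name and the sizes in the example call are made up for illustration and are not figures from the benchmark.

```python
def compression_ratio(source_size, compressed_size):
    """Return the space saved as a percentage of the source size."""
    if source_size <= 0:
        raise ValueError("source_size must be positive")
    size_difference = source_size - compressed_size
    return (size_difference * 100.0) / source_size


if __name__ == "__main__":
    # Hypothetical sizes in bytes: a 30 MB source shrinking to 12 MB.
    print(compression_ratio(30000000, 12000000))  # 60.0
```

A result of 0 means nothing was saved, and a negative value means the archive came out larger than the source.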

Each zip archive was initialized like this:

# Mode tests
with zipfile.ZipFile(target_zip, 'w', compression_method) as ziph:
    ...  # write the files here

# Level tests (compresslevel requires Python 3.7+)
with zipfile.ZipFile(target_zip, 'w', compression_method, compresslevel=level) as ziph:
    ...  # write the files here
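A rough way to reproduce this kind of comparison is sketched below. It is not the benchmark used for the results in this answer: the benchmark helper, the in-memory archive and the synthetic sample data are all assumptions, and it needs Python 3.7+ for compresslevel plus the zlib, bz2 and lzma modules.

```python
import io
import time
import zipfile


def benchmark(data, method, level=None):
    """Compress data into an in-memory zip; return (archive_bytes, seconds)."""
    buffer = io.BytesIO()
    start = time.perf_counter()
    with zipfile.ZipFile(buffer, "w", method, compresslevel=level) as ziph:
        ziph.writestr("sample.bin", data)
    elapsed = time.perf_counter() - start
    return len(buffer.getvalue()), elapsed


if __name__ == "__main__":
    # Synthetic, repetitive data; real files will compress very differently.
    data = b"some fairly repetitive sample text\n" * 50000
    methods = {
        "ZIP_STORED": zipfile.ZIP_STORED,
        "ZIP_DEFLATED": zipfile.ZIP_DEFLATED,
        "ZIP_BZIP2": zipfile.ZIP_BZIP2,
        "ZIP_LZMA": zipfile.ZIP_LZMA,
    }
    # Mode tests: each method at its default level.
    for name, method in methods.items():
        size, seconds = benchmark(data, method)
        print(f"{name:13} {size:>9} bytes  {seconds:.4f} s")
    # Level tests: DEFLATE at a few levels.
    for level in (1, 5, 9):
        size, seconds = benchmark(data, zipfile.ZIP_DEFLATED, level)
        print(f"level {level}: {size:>9} bytes  {seconds:.4f} s")
```

Timings from a single in-memory run are noisy, so repeating each measurement and averaging, as described above, gives more stable numbers.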

Here are the results:

[Figure: research results]

It seems that, no matter the method, the best overall compression mode is ZIP_DEFLATED. Only ZIP_LZMA produced a smaller archive, but the difference was a fraction of a percent and it took about 8x longer for large data sets.

Furthermore, I tried different compression levels with the same data sets and methods, this time with only one run per level.

[Figure: research results]

It looks like ZIP_DEFLATED and ZIP_BZIP2 have similar compression capabilities, but the latter is much slower. For large data sets, a compression level of 1 or 2 should suffice; increasing it further has no significant effect on the final file size. If the workload demands a lot of “small” zip files, it is better to use level 9: it gives a high compression ratio but takes about the same amount of time as level 1.

Putman answered 7/5, 2021 at 21:25 Comment(0)
16

There is a really easy way to create a compressed zip archive: use shutil.make_archive.

For example:

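import shutil

# base_name: path of the archive to create, without the ".zip" extension
# root_dir: the directory whose contents go into the archive
shutil.make_archive(base_name, 'zip', root_dir)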

See the shutil.make_archive documentation for more details.
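A self-contained sketch of this approach is shown below. The helper name zip_directory, the directory layout and the file contents are illustrative assumptions; the only library call doing the work is shutil.make_archive.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(source_dir, base_name):
    """Archive the contents of source_dir as base_name + '.zip'; return its path."""
    return shutil.make_archive(base_name, "zip", root_dir=source_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical folder with one repetitive file to archive.
        source_dir = os.path.join(tmp, "reports")
        os.makedirs(source_dir)
        with open(os.path.join(source_dir, "data.txt"), "w") as fh:
            fh.write("row,value\n" * 20000)

        archive_path = zip_directory(source_dir, os.path.join(tmp, "backup"))
        with zipfile.ZipFile(archive_path) as zin:
            info = zin.getinfo("data.txt")
            print(archive_path, info.file_size, "->", info.compress_size)
```

Note that the first argument names the archive to create and the third names the folder to read from, so the output path and the source folder are separate values.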

Pontificate answered 22/10, 2017 at 12:1 Comment(1)
Thanks for posting this. This works to achieve archiving of files very easily without going through the hoops of zipping. It's a great bookend to processes that need to have their data dumped for record-keeping. I consolidated ~10 lines of code down to 3 with this. - Vinson
