python unzip -- tremendously slow?

Can somebody please explain the following mystery?

I created a binary file of about 37 MB. Zipping it in Ubuntu from the terminal took less than 1 second. I then tried Python: zipping it programmatically (using the zipfile module) also took about 1 second.

I then tried to unzip the zip file I had created. In Ubuntu, from the terminal, this took less than 1 second.

In Python, the code to unzip it (using the zipfile module) took close to 37 seconds to run! Any ideas why?

Himes answered 14/2, 2011 at 22:16 Comment(7)
Could you post the part where you are zipping the files? This way, we can make more accurate comments.Lagasse
I'm guessing the python zip/unzip code is interpreted instead of being a call out to some (compiled C) library.Encarnacion
@TomMD: Actually, it isn't, since it depends on zlib, at least when the file is actually compressed. The actual decompression is done in native code. It might be worth comparing unzip times when the zip file is not compressed, to see if the effect is coming from interpretation.Shien
@chinmay The poster never said how he was calling 'zip' so I didn't want to assume anything. Good to know that the normal Python {,un}zip is a zlib binding though, thanks!Encarnacion
Maybe you're not handling the stream of unzipped data efficiently. Loading a 37 MB-size string in memory will certainly take a long time due to memory allocation and swapping. You should send the output to a file directly. How are you using the zipfile module to unzip the compressed file?Reinhard
@scoffey: I find it hard to believe that memory allocation/swapping would take that long. 37 MB is nothing, even in Python.Reorientation
#61930945, #37141786Agnusago
T
2

I was struggling to unzip/decompress/extract zip files with Python as well, and the low-level approach of "create a ZipFile object, loop through its .namelist(), read the files and write them to the file system" didn't seem very Pythonic. So I started digging into ZipFile objects, which I believe are not very well documented, and listed all of the object's methods:

>>> from zipfile import ZipFile
>>> filepath = '/srv/pydocfiles/packages/ebook.zip'
>>> zip = ZipFile(filepath)
>>> dir(zip)
['NameToInfo', '_GetContents', '_RealGetContents', '__del__', '__doc__', '__enter__', '__exit__', '__init__', '__module__', '_allowZip64', '_didModify', '_extract_member', '_filePassed', '_writecheck', 'close', 'comment', 'compression', 'debug', 'extract', 'extractall', 'filelist', 'filename', 'fp', 'getinfo', 'infolist', 'mode', 'namelist', 'open', 'printdir', 'pwd', 'read', 'setpassword', 'start_dir', 'testzip', 'write', 'writestr'] 

There we go: the "extractall" method works just like tarfile's extractall (on Python 2.6 and 2.7, but NOT 2.5).

Then the performance concerns: the file ebook.zip is 84.6 MB (mostly PDF files) and the uncompressed folder is 103 MB; it was zipped by the default "Archive Utility" under Mac OS X 10.5. So I timed the extraction with Python's timeit module:

>>> from timeit import Timer
>>> t = Timer("filepath = '/srv/pydocfiles/packages/ebook.zip'; \
...         extract_to = '/tmp/pydocnet/build'; \
...         from zipfile import ZipFile; \
...         ZipFile(filepath).extractall(path=extract_to)")
>>> 
>>> t.timeit(1)
1.8670060634613037

This took less than 2 seconds on a heavily loaded machine with 90% of its memory in use by other applications.
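
On Python 3, the same measurement could be sketched with a context manager and time.perf_counter. The helper name timed_extractall is an illustrative assumption and not part of the session above:

```python
import time
from zipfile import ZipFile


def timed_extractall(zip_path, extract_to):
    """Extract every member of zip_path into extract_to.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    # The context manager closes the archive even if extraction fails.
    with ZipFile(zip_path) as archive:
        archive.extractall(path=extract_to)
    return time.perf_counter() - start
```

Calling timed_extractall('/srv/pydocfiles/packages/ebook.zip', '/tmp/pydocnet/build') would repeat the run above; the timing will of course depend on the machine and the archive.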

Hope this helps someone.

Tearing answered 6/11, 2011 at 13:53 Comment(2)
wow, zipfile objects documentation is just updated on docs.python.org a day after I gave this answer. perhaps it was some output issue or python is doing grreeat!Tearing
Nice info! However if we need to access just some files, or process them somehow instead of just uncompressing them, this won't help much I'm afraid :(Epizoon
I
0

I don't know what code you use to unzip your file, but the following works for me: After creating a zip archive "test.zip" containing just one file "file1", the following Python script extracts "file1" from the archive:

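from zipfile import ZipFile, ZIP_DEFLATED

# The compression argument only matters when writing an archive.
archive = ZipFile("test.zip", mode='r', compression=ZIP_DEFLATED, allowZip64=False)
data = archive.read("file1")  # decompressed contents of "file1"
print(len(data))  # the parenthesized form works on Python 2 and 3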

This takes nearly no time: I tried a 37MB input file which compressed down to a 15MB zip archive. In this example the Python script took 0.346 seconds on my MacBook Pro. Maybe in your case the 37 seconds were taken up by something you did with the data instead?
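
If the member is large, one way to avoid holding all of it in memory is to stream it to disk in chunks. This is a minimal sketch assuming Python 3; the helper name stream_member and the 1 MiB chunk size are illustrative choices, and nothing in this thread establishes that memory use was the cause of the 37 seconds:

```python
import shutil
from zipfile import ZipFile


def stream_member(zip_path, member, dest_path, chunk_size=1024 * 1024):
    """Copy one archive member to dest_path without reading it all at once."""
    with ZipFile(zip_path) as archive:
        # ZipFile.open returns a file-like object that decompresses as it is read.
        with archive.open(member) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
```

For the example above, stream_member("test.zip", "file1", "file1.out") would write the extracted data straight to a file.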

Isatin answered 7/3, 2011 at 20:25 Comment(1)
Reading just one file is easy - however a large zip archive with many small compressed files in it runs excruciatingly slow for me. Perhaps the file lookup within the zip is inefficient?Epizoon
A
0

Some options:

  • Use subprocess to defer it to some external tool. You can pipe data directly to it.
  • czipfile, but that does not seem to be maintained anymore (last release 2010). A somewhat recent fork is ziyuang/czipfile (last update 2019).
  • PyTorch has the internal native torch._C.PyTorchFileReader which can read zip files, see the torch.load logic, and _open_zipfile_reader. This does not support arbitrary zip files currently, but I think it only would need minor adaptations to support it.
  • libzip.py (2023) is a ctypes wrapper around libzip. But it seems very unknown?
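
The first option could be sketched as follows, assuming the Info-ZIP unzip program is installed and on PATH. The helper names are hypothetical; -q (quiet), -o (overwrite without prompting) and -d (destination directory) are standard unzip flags:

```python
import subprocess


def build_unzip_command(zip_path, dest_dir):
    """Return the argument list for extracting zip_path into dest_dir."""
    # A list (not a shell string) keeps names with spaces or newlines intact.
    return ["unzip", "-q", "-o", zip_path, "-d", dest_dir]


def extract_with_unzip(zip_path, dest_dir):
    """Run unzip and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_unzip_command(zip_path, dest_dir), check=True)
```

This sketch extracts a file that is already on disk; it does not cover piping data to the tool.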
Agnusago answered 24/8, 2023 at 20:59 Comment(0)
I
-1

Instead of using the Python module, we can call one of the archive tools available on Ubuntu (7z here) from Python. I use this because the Python zip module sometimes fails.

import os

filename = 'test'  # name of the file to archive
os.system('7z a %s.zip %s'% (filename, filename))
Ithaca answered 6/6, 2011 at 13:58 Comment(2)
You should use str.format() instead of the % formatting, like os.system('7z a {0}.zip {0}'.format(filename)). As they mention in the docs, it's going to be removed in the future and I believe it's already gone in 3+.Musetta
@Musetta Wrong. This approach should be avoided altogether; use import subprocess; subprocess.call(['7z', 'a', filename+'.zip', filename]) instead. Otherwise, what happens if filename contains a space or a newline?Kirghiz
