Fast zip decryption in python

E

3

1

I've a program which process a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2GB and it can become a memory problem.

with zipfile.Zipfile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....

Then I noticed that this process was being extremely slow to process my file. And I can understand that it may take some time, but at least it should use my machine resources: lets say the python process should use at 100% the core where it lives.

Since it doesn't, I started researching the possible root causes. I'm not an expert in compression matters, so first considered basic things:

Resources seem not to be the problem, there's plenty RAM available even if my coding approach wouldn't use it.
CPU is not in high level usage, not even for one core.
The file being open is just about 80MB when compressed, so disk reading should not be a slowing issue either.

This made me to think that the bottleneck could be in the most invisible parameters: RAM bandwidth. However I have no idea how could I measure this.

Then on the software side, I found on the zipfile docs:

Decryption is extremely slow as it is implemented in native Python rather than C.

I guess that if it's using native Python, it's not even using OpenGL acceleration so another point for slowliness. I'm also curious about how this method works, again because of the low CPU usage.

So my question is of course, how could I work in a similar way (not having the full uncompress file in RAM), but uncompressing in a faster way in Python? Is there another library or maybe another approach to overcome this slowliness?

Emulsion answered 21/5, 2020 at 8:34 Comment(0)

A

1

I've done some research and found the following:

You could "pip install czipfile", more information at https://pypi.org/project/czipfile/

Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/

Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?

Aflcio answered 21/5, 2020 at 8:40 Comment(2)

There is a github port for Python 3: github.com/ziyuang/czipfile – Aflcio 21/5, 2020 at 9:11

Hm interesting, I couldn't find a sigle reference to it. However I couldn't achieve to install it. It fails because tries to read README file instead of README.md, and the author hasn't issues allowed on the repo. – Emulsion 21/5, 2020 at 9:30

V

2

It's quite stupid that Python doesn't implement zip decryption in pure c.

So I make it in cython, which is 17 times faster.

Just get the dezip.pyx and setup.py from this gist.

https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec

And install cython and build a cython library

pip3 install cython
python3 setup.py build_ext --inplace

Then run the original script with two more lines.

import zipfile

# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)

z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')

Vesicatory answered 6/6, 2022 at 4:30 Comment(0)

A

1

I've done some research and found the following:

You could "pip install czipfile", more information at https://pypi.org/project/czipfile/

Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/

Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?

Aflcio answered 21/5, 2020 at 8:40 Comment(2)

There is a github port for Python 3: github.com/ziyuang/czipfile – Aflcio 21/5, 2020 at 9:11

Hm interesting, I couldn't find a sigle reference to it. However I couldn't achieve to install it. It fails because tries to read README file instead of README.md, and the author hasn't issues allowed on the repo. – Emulsion 21/5, 2020 at 9:30

J

1

There is this lib for python to handle zipping files without memory hassle.

Quoted from the docs:

Buzon - ZipFly

ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.

Never used but can help.

Janik answered 21/5, 2020 at 9:11 Comment(0)

Recommended topics

Hot tags