Python - Creating Zip in stream that exceeds RAM
Asked Answered
S

1

4

I'm trying to create a zip file on the fly of a directory and return it to the user over a Flask App. The below works great for smaller directories but I'd also like to achieve this with large directories full of images (>20GB).

def return_zip():
    dir_to_send = '/dir/to/the/files'
    base_path = pathlib.Path(dir_to_send)
    data = io.BytesIO()
    with zipfile.ZipFile(data, mode='w') as z:
        for f_name in base_path.iterdir():
            z.write(f_name, arcname=f_name.name)
    data.seek(0)
    return send_file(data, mimetype='application/zip', as_attachment=True, attachment_filename='data.zip')

When trying this on large directories the whole system crashes, is there any way to create zips in stream that exceed the system's memory?

I'd prefer not to write the ZIP to disk then send it to the user then delete it from disk as this just increases the R/W operations to wear out the storage drive everything is located on.

The OS is running on an SSD (not the same drive as the images to zip), maybe part of this could be turned into virtual RAM? I'm not very adept at working in memory.

Any ideas would be greatly appreciated!

Ubuntu 20.04, Python3 with Flask, 2TB storage drive and a 250GB OS SSD with 8GB RAM.

Seethe answered 7/11, 2022 at 2:1 Comment(1)
If you do it in virtual RAM (aka SWAP) you'll still be writing it to your SSD. The only way to do it without writing to your SSD is to upgrade your RAM. But really RAM is still an SSD in the form of a RAM card with high speed IO. So either way you'll be writing it into one or another form of memory that is capable of wearing out. Modern day SSDs are pretty good with IO operations and you shouldn't need to worry to much about wearing out the drive. So just check the size of the filesystem before zipping and if it exceeds 75% of your available RAM write it to a temporary file before sending.Impercipient
E
5

Python itself is sadly totally incapable of streaming ZIP data. It keeps the whole ZIP in memory until it's finished generating the ZIP file... It won't release the memory until the whole ZIP is completely written to disk.

However, there are two modern, maintained Python 3 libraries for generating live-streaming ZIP data with low memory usage.

  • https://github.com/sandes/zipfly: Project which started in March 2020 but has been in low-maintenance mode since July 2020, seems stable though; sadly it requires disk paths as input (it cannot use existing memory buffers/streams as input); this library is also basically a hack, which just replaces the stream-writer to make it flush data to disk constantly. It's famous, but I personally wouldn't use it.
  • https://github.com/uktrade/stream-zip: New project made by a contractor for the UK government in December 2021 and actively maintained ever since, and it's incredibly well-made with deep technical knowledge of the ZIP standard (the author even wrote a Deflate64 encoder/decoder in pure Python), and it has automated tests with CI workflow, so the codebase is reliable; it uses zlib to directly generate the compressed chunks via fast, native C code; this library's API allows you to generate the input data live (reading from streams, etc), and if you want to read "live" from file-descriptor objects (file handles), you could look at this project comment for an explanation of how to make a live data reader that conserves memory, and how to vary the compression level per-file. With this library, you can make extremely efficient ZIPs. I personally use this library to create a 8 GB ZIP file where Python ONLY uses 6 megabytes of RAM at all times, because I stream the input data and output (compressed ZIP) in 128K chunks all thanks to this library! And it's super fast! (Compare that to Python's native zipfile library which used 12.5 GB RAM for the same process, haha.)

And lastly, if you ever need streaming unzipping too, then this related project by the same author is the best (very well written, safe and conformant to ZIP standards):

Economize answered 22/6, 2023 at 6:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.