Python in-memory zip library
G

9

124

Is there a Python library that allows manipulation of zip archives in memory, without having to use actual disk files?

The ZipFile library does not allow you to update the archive. The only way seems to be to extract it to a directory, make your changes, and create a new zip from that directory. I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.

Something similar to Java's ZipInputStream/ZipOutputStream would do the trick, although any interface at all that avoids disk access would be fine.

Guildroy answered 17/3, 2010 at 15:57 Comment(1)
In this post I answered the same question. #60644357Demount
M
122

According to the Python docs:

class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])

  Open a ZIP file, where file can be either a path to a file (a string) or a file-like object. 

So, to open the file in memory, just create a file-like object (perhaps using BytesIO).

import io
import zipfile
file_like_object = io.BytesIO(my_zip_data)
zipfile_ob = zipfile.ZipFile(file_like_object)
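
As a rough sketch of how this could cover the download, change, upload case from the question: ZipFile cannot change an entry in place, so one option is to copy every entry into a second in-memory archive and swap in new content where needed. The helper name rewrite_zip and the file names are hypothetical, and the "downloaded" bytes are built on the spot here.

```python
import io
import zipfile


def rewrite_zip(zip_bytes, replacements):
    """Return new zip bytes; entries named in `replacements` get new content."""
    source = io.BytesIO(zip_bytes)
    target = io.BytesIO()
    with zipfile.ZipFile(source) as zin, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            # Use the replacement if there is one, otherwise copy the old data.
            data = replacements.get(info.filename, zin.read(info.filename))
            zout.writestr(info.filename, data)
    return target.getvalue()


# Build a small archive in memory to stand in for downloaded data.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as zf:
    zf.writestr("keep.txt", b"unchanged")
    zf.writestr("edit.txt", b"old")

updated = rewrite_zip(original.getvalue(), {"edit.txt": b"new"})

with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    names = zf.namelist()
    edited = zf.read("edit.txt")
```

Note that this simple copy keeps names and contents only; per-entry metadata such as timestamps is not carried over.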
Merely answered 17/3, 2010 at 16:2 Comment(2)
How do you write different files to the in-memory object, i.e. create a/b/c.txt and a/b/cc.txt in the archive?Circumscissile
This answer only works if my_zip_data is a bytes object containing a validly constructed zip archive (when mode='r', as is the default). Passing an empty memory buffer like zipfile.ZipFile(io.BytesIO(), mode='r') fails because ZipFile checks for an "End of Central Directory" record in the passed file-like object during instantiation when mode='r'. As a workaround, Vladimir's answer suggests a way to construct a buffer of a zip archive with an empty dummy file in it.Kristie
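
The two comments above can be illustrated with a short sketch, assuming Python 3: an empty buffer is rejected in read mode, while write mode accepts it, and entry names containing "/" produce the nested layout asked about. The file contents are made up for the example.

```python
import io
import zipfile

# Reading from an empty buffer fails: there is no End of Central Directory record.
try:
    zipfile.ZipFile(io.BytesIO(), mode="r")
    empty_read_failed = False
except zipfile.BadZipFile:
    empty_read_failed = True

# Writing to an empty buffer works, and entry names may contain "/" separators.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("a/b/c.txt", "first file")
    zf.writestr("a/b/cc.txt", "second file")

# Reopen the finished archive from its bytes to list what was stored.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as zf:
    names = zf.namelist()
```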
Z
131

PYTHON 3

import io
import zipfile

zip_buffer = io.BytesIO()

with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')),
                            ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with open('C:/1.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())
Zeba answered 6/7, 2017 at 10:44 Comment(5)
Link to the documentation. data can be either bytes or strings and this worked perfectly on Ubuntu and Python 3.6Tournedos
Why not writing directly bytes, instead of wrapping the data in io.BytesIO then using getvalue in each iteration ?Schaumberger
@freesoul, just read writestr doc, it wantsZeba
@Zeba What do you mean? Isn't the result the same if you change io.BytesIO(b'111') for b'111', and data.getvalue() by dataSchaumberger
@freesoul,BytesIO is common usage case hereZeba
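
For reference, a minimal sketch of the variant discussed in these comments, assuming Python 3: writestr takes bytes or str directly, so the BytesIO wrapper around each payload is optional.

```python
import io
import zipfile

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
    zip_file.writestr("1.txt", b"111")  # bytes are stored as given
    zip_file.writestr("2.txt", "222")   # str is encoded as UTF-8 first

# Read everything back to confirm both entries hold the expected bytes.
with zipfile.ZipFile(io.BytesIO(zip_buffer.getvalue())) as zip_file:
    contents = {name: zip_file.read(name) for name in zip_file.namelist()}
```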
R
60

From the article In-Memory Zip in Python:

Below is a post of mine from May of 2008 on zipping in memory with Python, re-posted since Posterous is shutting down.

I recently noticed that there is a for-pay component available to zip files in-memory with Python. Considering this is something that should be free, I threw together the following code. It has only gone through very basic testing, so if anyone finds any errors, let me know and I’ll update this.

import zipfile
import StringIO

class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of 
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0        

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()

if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")
Rana answered 17/3, 2010 at 16:2 Comment(6)
Useful link - this is a good example of how to use the ZipFile object in the way described by Jason's answer. ThanksGuildroy
No problem, glad you found it useful.Rana
Care to summarize the content of the link here, if it dies, so does your answerCalefactory
@IvoFlipse - Good point. I added all of that content to this post, just in case.Rana
Does not work for real under Windows or on Python 3.X, see my answer for an update of the code.Hamrah
I am sorry to revive an old post but I tried the solution proposed but I got a corrupted zip file in output. I am using python 2.7 in windows environment and the zip file has only one file. Filename inside the zip is up to 40 chars.Larrabee
H
22

The example Ethier provided has several problems, some of them major:

  • doesn't work for real data on Windows. A ZIP file is binary and its data should always be written with a file opened 'wb'
  • the ZIP file is appended to for each file, this is inefficient. It can just be opened and kept as an InMemoryZip attribute
  • the documentation states that ZIP files should be closed explicitly, this is not done in the append function (it probably works (for the example) because zf goes out of scope and that closes the ZIP file)
  • the create_system flag is set for all the files in the zipfile every time a file is appended instead of just once per file.
  • on Python < 3 cStringIO is much more efficient than StringIO
  • doesn't work on Python 3 (the original article was from before the 3.0 release, but by the time the code was posted 3.1 had been out for a long time).
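
To make the list above concrete, here is a hedged Python 3 sketch of how the class from the article might be reworked. It is not the ruamel.std.zipfile implementation; the class layout, the close method and the data property are assumptions made for illustration.

```python
import io
import time
import zipfile


class InMemoryZip:
    """Hypothetical Python 3 rework of the class from the article."""

    def __init__(self):
        self._buffer = io.BytesIO()
        # Open the archive once and keep it as an attribute.
        self._zip = zipfile.ZipFile(self._buffer, "w", zipfile.ZIP_DEFLATED)

    def append(self, filename_in_zip, file_contents):
        """Add one entry; file_contents may be bytes or str."""
        info = zipfile.ZipInfo(filename_in_zip,
                               date_time=time.localtime(time.time())[:6])
        info.compress_type = zipfile.ZIP_DEFLATED
        info.create_system = 0  # set once, for this entry only
        self._zip.writestr(info, file_contents)
        return self

    def close(self):
        """Finish the archive explicitly; calling this twice is harmless."""
        self._zip.close()

    @property
    def data(self):
        """The finished archive as bytes."""
        self.close()
        return self._buffer.getvalue()

    def writetofile(self, filename):
        # A ZIP file is binary, so the target is opened with 'wb'.
        with open(filename, "wb") as f:
            f.write(self.data)


imz = InMemoryZip()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
zip_bytes = imz.data
```

In this sketch, reading data closes the archive, so further append calls after that point would fail.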

An updated version is available if you install ruamel.std.zipfile (of which I am the author). After

pip install ruamel.std.zipfile

or including the code for the class from here, you can do:

import ruamel.std.zipfile as zipfile

# Run a test
imz = zipfile.InMemoryZipFile()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
imz.writetofile("test.zip")  

You can alternatively write the contents using imz.data to any place you need.

You can also use the with statement, and if you provide a filename, the contents of the ZIP will be written on leaving that context:

with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")

Because of the delayed writing to disk, you can actually read from an old test.zip within that context.

Hamrah answered 1/11, 2013 at 7:13 Comment(4)
Why not use io.BytesIO in python 2?Illume
@Illume No particular reason, except that you should check whether BytesIO on 2.7 uses the much faster underlying C implementation and is not a Python-only compatibility layer calling StringIO (instead of cStringIO).Hamrah
This really should include at least the skeleton of whatever code you made to actually answer the question, instead of just telling people to install a module. If nothing else, at least link to the module's home page.Charmainecharmane
For python 2.7 case I would suggest converting unicode strings to utf8-strings before passing to writestr() function. More details https://mcmap.net/q/182111/-zipfile-testzip-returning-different-results-on-python-2-and-python-3.Samovar
F
10

I am using Flask to create an in-memory zipfile and return it as a download. Builds on the example above from Vladimir. The seek(0) took a while to figure out.

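import io
import zipfile

from flask import send_file

def download_zip():  # body of a Flask view function
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
            zip_file.writestr(file_name, data.getvalue())

    # Rewind so send_file reads the archive from the start
    zip_buffer.seek(0)
    # Flask 2.0 and later use download_name; older versions call it attachment_filename
    return send_file(zip_buffer, download_name='filename.zip', as_attachment=True)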
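
Why the seek(0) matters can be seen without Flask. In this small sketch (standard library only), the buffer position sits at the end of the archive once ZipFile has finished writing, so anything that reads from the current position gets nothing until the buffer is rewound.

```python
import io
import zipfile

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
    zip_file.writestr("1.txt", b"111")

position_after_write = zip_buffer.tell()  # at the end of the archive
read_without_seek = zip_buffer.read()     # nothing left to read from here

zip_buffer.seek(0)
read_after_seek = zip_buffer.read()       # the complete archive
```

getvalue() ignores the position, which is why the examples that call getvalue() do not need the rewind.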
Faletti answered 20/1, 2022 at 15:34 Comment(1)
You deserve a medal for pointing out that seek(0).Bolger
A
2

Helper to create an in-memory zip file with multiple files from data like {'1.txt': 'string', '2.txt': b'bytes'}

import io, zipfile

def prepare_zip_file_content(file_name_content: dict) -> bytes:
    """Return the zip archive as bytes, ready to be saved, e.g.:
        with open('C:/1.zip', 'wb') as f: f.write(zip_bytes)
    file_name_content: dict like {'1.txt': 'string', '2.txt': b'bytes'}
    """
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, file_data in file_name_content.items():
            zip_file.writestr(file_name, file_data)

    zip_buffer.seek(0)
    return zip_buffer.getvalue()
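
One way to check the result of a helper like the one above is to read the bytes back. The function read_zip_file_content below is a hypothetical inverse, not part of the answer; the archive is built inline in the same way so the sketch runs on its own.

```python
import io
import zipfile


def read_zip_file_content(zip_bytes: bytes) -> dict:
    """Hypothetical inverse helper: map each entry name to its content as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_file:
        return {name: zip_file.read(name) for name in zip_file.namelist()}


# Build a small archive the same way the helper above does.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
    zip_file.writestr("1.txt", "string")
    zip_file.writestr("2.txt", b"bytes")

restored = read_zip_file_content(zip_buffer.getvalue())
```

Entries written from str come back as UTF-8 encoded bytes, so a round trip does not preserve the str/bytes distinction.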
Ashanti answered 15/12, 2022 at 9:15 Comment(1)
This works for me with python 3.10.11Accusation
Y
1

I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them

This is possible using the two libraries https://github.com/uktrade/stream-unzip and https://github.com/uktrade/stream-zip (full disclosure: written by me). And depending on the changes, you might not even have to store the entire zip in memory at once.

Say you just want to download, unzip, zip, and re-upload. Slightly pointless, but you could slot in some changes to the unzipped content:

from datetime import datetime
import httpx
from stream_unzip import stream_unzip
from stream_zip import stream_zip, ZIP_64

def get_source_bytes_iter(url):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def get_target_files(files):
    # stream-unzip doesn't expose perms or modified_at, but stream-zip requires them
    modified_at = datetime.now()
    perms = 0o600

    for name, _, chunks in files:
        # Could change name, manipulate chunks, skip a file, or yield a new file
        yield name.decode(), modified_at, perms, ZIP_64, chunks

source_url = 'https://source.test/file.zip'
target_url = 'https://target.test/file.zip'

source_bytes_iter = get_source_bytes_iter(source_url)
source_files = stream_unzip(source_bytes_iter)
target_files = get_target_files(source_files)
target_bytes_iter = stream_zip(target_files)

# httpx takes raw bytes or a byte iterator via content=; data= is for form fields
httpx.put(target_url, content=target_bytes_iter)
Ygerne answered 12/1, 2022 at 8:25 Comment(0)
Y
0

You can use the library libarchive in Python through ctypes - it offers ways of manipulating ZIP data in memory, with a focus on streaming (at least historically).

Say we want to uncompress ZIP files on the fly while downloading from an HTTP server. The below code

from contextlib import contextmanager
from ctypes import CFUNCTYPE, POINTER, create_string_buffer, cdll, byref, c_ssize_t, c_char_p, c_int, c_void_p, c_char
from ctypes.util import find_library

import httpx

def get_zipped_chunks(url, chunk_size=65536):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes(chunk_size=chunk_size)

def stream_unzip(zipped_chunks, chunk_size=65536):
    # Library
    libarchive = cdll.LoadLibrary(find_library('archive'))

    # Callback types
    open_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)
    read_callback_type = CFUNCTYPE(c_ssize_t, c_void_p, c_void_p, POINTER(POINTER(c_char)))
    close_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)

    # Function types
    libarchive.archive_read_new.restype = c_void_p
    libarchive.archive_read_open.argtypes = [c_void_p, c_void_p, open_callback_type, read_callback_type, close_callback_type]
    libarchive.archive_read_finish.argtypes = [c_void_p]

    libarchive.archive_entry_new.restype = c_void_p

    libarchive.archive_read_next_header.argtypes = [c_void_p, c_void_p]
    libarchive.archive_read_support_compression_all.argtypes = [c_void_p]
    libarchive.archive_read_support_format_all.argtypes = [c_void_p]

    libarchive.archive_entry_pathname.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.restype = c_char_p

    libarchive.archive_read_data.argtypes = [c_void_p, POINTER(c_char), c_ssize_t]
    libarchive.archive_read_data.restype = c_ssize_t

    libarchive.archive_error_string.argtypes = [c_void_p]
    libarchive.archive_error_string.restype = c_char_p

    ARCHIVE_EOF = 1
    ARCHIVE_OK = 0

    it = iter(zipped_chunks)
    compressed_bytes = None  # Make sure not garbage collected

    @contextmanager
    def get_archive():
        archive = libarchive.archive_read_new()
        if not archive:
            raise Exception('Unable to allocate archive')

        try:
            yield archive
        finally:
            libarchive.archive_read_finish(archive)

    def read_callback(archive, client_data, buffer):
        nonlocal compressed_bytes

        try:
            compressed_bytes = create_string_buffer(next(it))
        except StopIteration:
            return 0
        else:
            buffer[0] = compressed_bytes
            return len(compressed_bytes) - 1

    def uncompressed_chunks(archive):
        uncompressed_bytes = create_string_buffer(chunk_size)
        while (num := libarchive.archive_read_data(archive, uncompressed_bytes, len(uncompressed_bytes))) > 0:
            # .raw returns all bytes; .value would stop at the first null byte
            yield uncompressed_bytes.raw[:num]
        if num < 0:
            raise Exception(libarchive.archive_error_string(archive))

    with get_archive() as archive: 
        libarchive.archive_read_support_compression_all(archive)
        libarchive.archive_read_support_format_all(archive)

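        # Keep references to the callbacks so they are not garbage collected
        # while libarchive may still call them
        open_cb = open_callback_type(0)
        read_cb = read_callback_type(read_callback)
        close_cb = close_callback_type(0)
        libarchive.archive_read_open(archive, 0, open_cb, read_cb, close_cb)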
        entry = c_void_p(libarchive.archive_entry_new())
        if not entry:
            raise Exception('Unable to allocate entry')

        while (status := libarchive.archive_read_next_header(archive, byref(entry))) == ARCHIVE_OK:
            yield (libarchive.archive_entry_pathname(entry), uncompressed_chunks(archive))

        if status != ARCHIVE_EOF:
            raise Exception(libarchive.archive_error_string(archive))

can be used as follows to do that

zipped_chunks = get_zipped_chunks('https://domain.test/file.zip')
files = stream_unzip(zipped_chunks)

for name, uncompressed_chunks in files:
    print(name)
    for uncompressed_chunk in uncompressed_chunks:
        print(uncompressed_chunk)

In fact since libarchive supports multiple archive formats, and nothing above is particularly ZIP-specific, it may well work with other formats.

Ygerne answered 2/1, 2023 at 20:37 Comment(0)
F
0

It's important to note that if you want to use the newly created in-memory ZIP archive outside of Python, for example by saving it to a local disk or sending it in a POST request, the end of central directory record must have been written to it; otherwise, it won't be recognized as a valid ZIP file.

This would look like (for Python 3.11)

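import io
import zipfile

import requests

# addr (the upload URL) is assumed to be defined elsewhere
with (
    io.BytesIO() as raw,
    zipfile.ZipFile(raw, "a", zipfile.ZIP_DEFLATED, False) as zip_file
):
    # Placeholder entries: (name inside the archive, content as bytes)
    for file_name, file_data in [("example_dir/example_file.txt", b"example data")]:
        zip_file.writestr(file_name, file_data)

    zip_file.close()  # THIS is REQUIRED!

    # Read the bytes from the buffer; the ZipFile object has no getbuffer()
    requests.post(addr, files={"file": ("zip_name.zip", raw.getvalue())})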
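
The effect of close() can be checked with the standard library alone. This sketch snapshots the buffer before and after closing and asks zipfile.is_zipfile whether each snapshot looks like a valid archive; the entry name and payload are placeholders.

```python
import io
import zipfile

raw = io.BytesIO()
archive = zipfile.ZipFile(raw, "w", zipfile.ZIP_STORED)
archive.writestr("example_dir/example_file.txt", b"example data")

# Snapshot before close(): entry data is present, the central directory is not.
valid_before_close = zipfile.is_zipfile(io.BytesIO(raw.getvalue()))

archive.close()  # writes the central directory and the end record

valid_after_close = zipfile.is_zipfile(io.BytesIO(raw.getvalue()))
```

Leaving a `with zipfile.ZipFile(...)` block calls close() as well, so the explicit call is only needed when the bytes are used while still inside the block.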
Fannyfanon answered 31/10, 2023 at 0:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.