How to extract zip file recursively?
A

4

18

I have a zip file which contains three zip files in it like this:

zipfile.zip\  
    dirA.zip\
         a  
    dirB.zip\
         b  
    dirC.zip\
         c

I want to extract all of the inner zip files into directories named after them (dirA, dirB, dirC).
Basically, I want to end up with the following layout:

output\  
    dirA\
         a  
    dirB\
         b  
    dirC\
         c

I have tried the following:

import os, re
from zipfile import ZipFile

os.makedirs(directory)  # where directory is "\output"
with ZipFile(self.archive_name, "r") as archive:
    for id, files in data.items():
        if files:
            print("Creating", id)
            dirpath = os.path.join(directory, id)

            os.mkdir(dirpath)

            for file in files:
                match = pattern.match(file)
                new = match.group(2)
                new_filename = os.path.join(dirpath, new)

                content = archive.open(file).read()
                with open(new_filename, "wb") as outfile:
                    outfile.write(content)

But it only extracts the inner zip files themselves, without unpacking them, and I end up with:

output\  
    dirA\
         dirA.zip 
    dirB\
         dirB.zip 
    dirC\
         dirC.zip

Any suggestions, including code segments, would be much appreciated, because I have tried many different things and read the docs without success.

Attemper answered 29/3, 2016 at 13:18 Comment(2)
Please modify your question and provide a Minimal, Complete, and Verifiable example that includes what's in data.items(). - Baggett
@Baggett Thank you for your comment. As described above, data holds: \zipfile.zip > dirA.zip > a, \zipfile.zip > dirB.zip > b, \zipfile.zip > dirC.zip > c. I tried to make the question a bit more general and not dependent on whatever 'data' holds, except for the fact that there are zips inside of a zip. - Attemper
G
16

When extracting the outer zip file, you want to read the inner zip files into memory instead of writing them to disk. To do this, I've used BytesIO.

Check out this code:

import os
import io
import zipfile

def extract(filename):
    with zipfile.ZipFile(filename) as z:
        for f in z.namelist():
            # get directory name from the inner zip's file name
            dirname = os.path.splitext(f)[0]
            # create the directory (no error if it already exists)
            os.makedirs(dirname, exist_ok=True)
            # read inner zip file into an in-memory bytes buffer
            content = io.BytesIO(z.read(f))
            with zipfile.ZipFile(content) as zip_file:
                zip_file.extractall(dirname)

If you run extract("zipfile.zip") with zipfile.zip as:

zipfile.zip/
    dirA.zip/
        a
    dirB.zip/
        b
    dirC.zip/
        c

Output should be:

dirA/
  a
dirB/
  b
dirC/
  c
Ghassan answered 29/3, 2016 at 13:34 Comment(2)
Exactly what I was looking for, it does the extraction as described in my question. Thanks! - Attemper
If the original zip file contains "zip-like" files at the first level, such as .xlsx, they will be unzipped as well. I suggest checking the extension before unzipping. - Australia
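The extension check suggested in the comment above could look roughly like this. It is only a sketch, not the answerer's code: the function name extract_inner_zips and the output_dir parameter are made up for illustration, and it assumes the archive is trusted (entry names are used as directory names without sanitizing).

```python
import io
import os
import zipfile


def extract_inner_zips(filename, output_dir="."):
    """Extract each inner *.zip of `filename` into output_dir/<name without .zip>.

    Hypothetical helper: entries that do not end in ".zip" (for example
    .xlsx or .docx, which are zip containers internally) are skipped.
    """
    extracted = []
    with zipfile.ZipFile(filename) as outer:
        for name in outer.namelist():
            # Only treat entries with a .zip extension as nested archives.
            if not name.lower().endswith(".zip"):
                continue
            target = os.path.join(output_dir, os.path.splitext(name)[0])
            os.makedirs(target, exist_ok=True)
            # Read the inner archive into memory and unpack it from there.
            with zipfile.ZipFile(io.BytesIO(outer.read(name))) as inner:
                inner.extractall(target)
            extracted.append(target)
    return extracted
```

Calling extract_inner_zips("zipfile.zip", "output") on the archive from the question should then create output/dirA, output/dirB and output/dirC, while leaving any .xlsx entries alone.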
D
10

For a function that extracts a nested zip file (any level of nesting) and cleans up the original zip files:

import zipfile, re, os

def extract_nested_zip(zippedFile, toFolder):
    """ Extract a zip file including any nested zip files
        Delete the zip file(s) after extraction
    """
    with zipfile.ZipFile(zippedFile, 'r') as zfile:
        zfile.extractall(path=toFolder)
    os.remove(zippedFile)
    for root, dirs, files in os.walk(toFolder):
        for filename in files:
            if re.search(r'\.zip$', filename):
                fileSpec = os.path.join(root, filename)
                extract_nested_zip(fileSpec, root)
Dithyramb answered 10/5, 2017 at 14:54 Comment(1)
Can we use S3 paths here instead of a local disk path? - Diffractive
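Regarding the comment above: zipfile.ZipFile accepts any seekable binary file object, not only a path, so one possible approach is to download the archive bytes yourself and unpack them from memory. The sketch below shows only that second half. The name extract_nested_zip_stream is hypothetical, fetching the bytes from S3 is deliberately left out, and unlike the function above it never writes the nested .zip files to disk, so there is nothing to delete afterwards.

```python
import io
import os
import zipfile


def extract_nested_zip_stream(fileobj, to_folder):
    """Extract a zip from a seekable binary file object into `to_folder`.

    Nested *.zip entries are expanded in memory, into the folder where
    the nested archive would otherwise have been written.
    """
    with zipfile.ZipFile(fileobj) as zfile:
        for info in zfile.infolist():
            if info.filename.lower().endswith(".zip"):
                # Nested archive: recurse on its bytes instead of writing it out.
                nested_dir = os.path.join(to_folder, os.path.dirname(info.filename))
                extract_nested_zip_stream(io.BytesIO(zfile.read(info)), nested_dir)
            else:
                # Regular file or directory entry: extract normally.
                zfile.extract(info, to_folder)
```

With the archive bytes in a variable data, the call would be extract_nested_zip_stream(io.BytesIO(data), "output"). Everything is held in memory, so this assumes the archives are reasonably small.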
S
5

I tried some of the other solutions but couldn't get them to work "in place", so I'll post my solution for the "in place" version. Note: it deletes the zip files and 'replaces' them with identically named directories, so back up your zip files if you want to keep them.

The strategy is simple: unzip all zip files in the directory (and subdirectories), then rinse and repeat until no zip files remain. The rinse and repeat is needed when the zip files themselves contain zip files.

import os
import io
import zipfile
import re

def unzip_directory(directory):
    """ This function unzips (and then deletes) all zip files in a directory """
    for root, dirs, files in os.walk(directory):
        for filename in files:
            if re.search(r'\.zip$', filename):
                to_path = os.path.join(root, filename.split('.zip')[0])
                zipped_file = os.path.join(root, filename)
                if not os.path.exists(to_path):
                    os.makedirs(to_path)
                    with zipfile.ZipFile(zipped_file, 'r') as zfile:
                        zfile.extractall(path=to_path)
                    # deletes zip file
                    os.remove(zipped_file)

def exists_zip(directory):
    """ This function returns T/F whether any .zip file exists within the directory, recursively """
    is_zip = False
    for root, dirs, files in os.walk(directory):
        for filename in files:
            if re.search(r'\.zip$', filename):
                is_zip = True
    return is_zip

def unzip_directory_recursively(directory, max_iter=1000):
    """ Calls unzip_directory until all contained zip files (and new ones from previous calls)
    are unzipped
    """
    print("Does the directory path exist? ", os.path.exists(directory))
    iterate = 0
    while exists_zip(directory) and iterate < max_iter:
        unzip_directory(directory)
        iterate += 1
    pre = "Did not " if iterate < max_iter else "Did"
    print(pre, "time out based on max_iter limit of", max_iter, ". Took iterations:", iterate)

Assuming your zip files are backed up, you make this all work by calling unzip_directory_recursively(your_directory).
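For comparison, the same rinse-and-repeat idea can be condensed into one loop with pathlib. This is a sketch under my own assumptions rather than a drop-in replacement: the name unzip_in_place and the max_passes parameter are made up here, and a target directory that already exists is reused (files are extracted into it) instead of being skipped.

```python
import zipfile
from pathlib import Path


def unzip_in_place(directory, max_passes=1000):
    """Replace every *.zip under `directory` with a same-named folder.

    Repeats until no zip files remain and returns the number of passes
    that extracted something. Raises RuntimeError once max_passes passes
    have run and zip files are still present.
    """
    directory = Path(directory)
    passes = 0
    while True:
        archives = [p for p in directory.rglob("*.zip") if p.is_file()]
        if not archives:
            return passes
        if passes >= max_passes:
            raise RuntimeError("zip files still present after max_passes passes")
        for archive in archives:
            target = archive.with_suffix("")  # dirA.zip -> dirA
            target.mkdir(parents=True, exist_ok=True)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(target)
            archive.unlink()  # delete the zip once it has been extracted
        passes += 1
```

As with the code above, this deletes the original zip files, so run it on a copy if you need to keep them.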

Staffard answered 18/9, 2018 at 18:30 Comment(0)
B
3

This works for me. Just place this script in the same directory as the nested zip. It extracts each zip into a directory with the same name as the original zip and then deletes the original zip. It also counts the total number of files within the nested zip.

import os

from zipfile import ZipFile


def unzip (path, total_count):
    for root, dirs, files in os.walk(path):
        for file in files:
            file_name = os.path.join(root, file)
            if (not file_name.endswith('.zip')):
                total_count += 1
            else:
                currentdir = file_name[:-4]
                if not os.path.exists(currentdir):
                    os.makedirs(currentdir)
                with ZipFile(file_name) as zipObj:
                    zipObj.extractall(currentdir)
                os.remove(file_name)
                total_count = unzip(currentdir, total_count)
    return total_count

total_count = unzip ('.', 0)
print(total_count)
Buttock answered 28/4, 2020 at 20:13 Comment(0)
