Extract files from zip without keeping the top-level folder with Python zipfile

5

8

I'm using the following code to extract the files from a zip file while keeping the directory structure:

import zipfile

with zipfile.ZipFile('archive.zip', 'r') as zip_file:
    zip_file.extractall('/dir/to/extract/files/')

Here is a structure for an example zip file:

/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg

At the end I want this:

/dir/to/extract/file.jpg
/dir/to/extract/file1.jpg
/dir/to/extract/file2.jpg

But the top-level folder should be dropped only if the zip file has a single top-level folder containing all the files. So when I extract a zip with this structure:

/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg
/dir2/file.txt
/file.mp3

It should stay like this:

/dir/to/extract/dir1/file.jpg
/dir/to/extract/dir1/file1.jpg
/dir/to/extract/dir1/file2.jpg
/dir/to/extract/dir2/file.txt
/dir/to/extract/file.mp3

Any ideas?

Dickdicken answered 31/12, 2011 at 18:49 Comment(0)
E
7

If I understand your question correctly, you want to strip any common prefix directories from the items in the zip before extracting them.

If so, then the following script should do what you want:

import os
import sys
from zipfile import ZipFile

def get_members(archive):
    parts = []
    # get all the path prefixes
    for name in archive.namelist():
        # only check files (not directories)
        if not name.endswith('/'):
            # keep list of path elements (minus filename)
            parts.append(name.split('/')[:-1])
    # now find the common path prefix (if any); commonprefix also
    # accepts lists and compares them element by element
    prefix = os.path.commonprefix(parts)
    if prefix:
        # re-join the path elements
        prefix = '/'.join(prefix) + '/'
    # get the length of the common prefix
    offset = len(prefix)
    # now re-set the filenames
    for zipinfo in archive.infolist():
        name = zipinfo.filename
        # skip the entries for the prefix directories themselves
        if len(name) > offset:
            # remove the common prefix
            zipinfo.filename = name[offset:]
            yield zipinfo

args = sys.argv[1:]

if args:
    path = args[1] if len(args) > 1 else '.'
    # the context manager closes the archive once extraction is done
    with ZipFile(args[0]) as archive:
        archive.extractall(path, get_members(archive))
Each answered 1/1, 2012 at 4:54 Comment(2)
Could you add some comments to make it easier to understand what is happening here, please? - Adaline
@aturegano. I added some comments to the example code. The filenames of zipinfo objects are writable. So the script strips the common prefix from all the files in the archive, before extracting them to the destination directory. - Each
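To illustrate the point about writable filenames, here is a minimal sketch. The archive contents are made up for the demonstration and it is built in memory, so nothing here comes from the original script. It shows that renaming a ZipInfo before extraction changes only the destination path, not the data that is read.

```python
import io
import os
import tempfile
from zipfile import ZipFile

# Build a small archive in memory (hypothetical contents, for illustration only).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("dir1/file.jpg", b"data")

with tempfile.TemporaryDirectory() as target, ZipFile(buffer) as archive:
    info = archive.getinfo("dir1/file.jpg")
    # Renaming the ZipInfo changes where the member is written,
    # not which data is read from the archive.
    info.filename = "file.jpg"
    archive.extract(info, target)
    extracted = sorted(os.listdir(target))

print(extracted)  # ['file.jpg']
```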
C
1

Read the entries returned by ZipFile.namelist() to see if they're in the same directory, and then open/read each entry and write it to a file opened with open().
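A rough sketch of that idea follows. The helper name extract_entries_by_hand is hypothetical, and the sketch assumes the archive comes from a trusted source, because writing entries by hand skips the path sanitising that extractall() performs.

```python
import os
import posixpath
import shutil
from zipfile import ZipFile

def extract_entries_by_hand(zip_path, dest):
    """Hypothetical helper: copy entries manually, dropping a shared folder."""
    with ZipFile(zip_path) as archive:
        # Directory entries end with "/"; only real files are copied.
        files = [n for n in archive.namelist() if not n.endswith("/")]
        # Names inside a zip always use "/" as the separator.
        parents = {posixpath.dirname(n) for n in files}
        strip = ""
        if len(parents) == 1 and "" not in parents:
            # Every file lives in the same directory, so drop that directory.
            strip = parents.pop() + "/"
        for name in files:
            target = os.path.join(dest, name[len(strip):])
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Assumption: entry names are trusted (no ".." or absolute paths).
            with archive.open(name) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

Note that this flattens only when all files share exactly one directory; nested layouts are left unchanged.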

Cyclades answered 31/12, 2011 at 18:54 Comment(0)
A
1

This might be a problem with the zip archive itself. In a Python prompt, try this to see whether the files are in the correct directories in the zip file itself.

import zipfile

zf = zipfile.ZipFile("my_file.zip",'r')
first_file = zf.filelist[0]
print(first_file.filename)

This should print something like "dir1/". Repeat the steps above with an index of 1 into filelist, like so: first_file = zf.filelist[1]. This time the output should look like "dir1/file1.jpg". If this is not the case, then the zip file does not contain directories and everything will be unzipped into a single directory.
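Rather than checking one index at a time, the whole archive can be listed in one go. This is a small sketch; list_entries is a hypothetical helper name, and it accepts either a path such as "my_file.zip" or a file-like object.

```python
from zipfile import ZipFile

def list_entries(zip_source):
    """Return (name, is_directory) for every entry, in archive order."""
    with ZipFile(zip_source) as zf:
        return [(info.filename, info.is_dir()) for info in zf.infolist()]
```

Calling list_entries("my_file.zip") and printing the result shows at a glance whether the entries carry a directory prefix. Archives do not always contain explicit directory entries, so the absence of a "dir1/" entry does not by itself mean the files are stored flat.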

Ales answered 31/12, 2011 at 20:14 Comment(0)
A
0

Based on @ekhumoro's answer, I came up with a simpler function that extracts everything to the same level. It is not exactly what you are asking for, but I think it can help someone.

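import os
from zipfile import ZipFile

def _basename_members(zip_file: ZipFile):
    for zipinfo in zip_file.infolist():
        # skip directory entries, whose basename would be empty
        if zipinfo.is_dir():
            continue
        # drop every directory component; files that share a name overwrite each other
        zipinfo.filename = os.path.basename(zipinfo.filename)
        yield zipinfo

from_zip = "some.zip"
to_folder = "some_destination/"
with ZipFile(file=from_zip, mode="r") as zip_file:
    os.makedirs(to_folder, exist_ok=True)
    zip_infos = _basename_members(zip_file)
    zip_file.extractall(path=to_folder, members=zip_infos)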
Adonai answered 11/3, 2022 at 15:42 Comment(0)
C
0

Basically you need to do two things:

  1. Identify the root directory in the zip.
  2. Remove the root directory from the paths of other items in the zip.

The following should retain the overall structure of the zip while removing the root directory:

import typing
import zipfile

def _is_root(info: zipfile.ZipInfo) -> bool:
    if info.is_dir():
        parts = info.filename.split("/")
        # Handle directory names with and without trailing slashes.
        if len(parts) == 1 or (len(parts) == 2 and parts[1] == ""):
            return True
    return False

def _members_without_root(archive: zipfile.ZipFile, root_filename: str) -> typing.Iterator[zipfile.ZipInfo]:
    # Normalise the root so it always ends with exactly one slash.
    prefix = root_filename.rstrip("/") + "/"
    for info in archive.infolist():
        # Only entries strictly below the root are kept; the root entry itself is skipped.
        # Matching at the start of the path leaves a same-named subdirectory untouched.
        if info.filename.startswith(prefix) and len(info.filename) > len(prefix):
            info.filename = info.filename[len(prefix):]
            yield info

with zipfile.ZipFile("archive.zip", mode="r") as archive:
    # We will use the first directory with no more than one path segment as the root.
    # The None default avoids a StopIteration when no such directory entry exists.
    root = next((info for info in archive.infolist() if _is_root(info)), None)
    if root:
        archive.extractall(path="/dir/to/extract/", members=_members_without_root(archive, root.filename))
    else:
        print("No root directory found in zip.")
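Some archives contain no explicit directory entries, in which case no root entry is found. As a variant, the root can be inferred from the file names alone, and stripped only when every entry sits under a single top-level folder, as the question asks. This is a sketch; both function names are hypothetical, and it assumes entry names have no leading slash.

```python
from zipfile import ZipFile

def single_top_level_dir(names):
    """Return 'dir/' if every entry lives under one top-level folder, else ''."""
    tops = set()
    for name in names:
        head, sep, _rest = name.partition("/")
        if not sep:
            return ""  # a file sits directly at the archive root
        tops.add(head)
    return tops.pop() + "/" if len(tops) == 1 else ""

def extract_without_top_level(zip_source, dest):
    with ZipFile(zip_source) as archive:
        prefix = single_top_level_dir(archive.namelist())
        members = []
        for info in archive.infolist():
            if info.filename == prefix:
                continue  # skip the entry for the top-level folder itself
            # With an empty prefix the name is left unchanged.
            info.filename = info.filename[len(prefix):]
            members.append(info)
        archive.extractall(dest, members)
```

With the two example archives from the question, the first would have "dir1/" stripped, while the second (two folders plus a file at the root) would be extracted unchanged.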
Carnassial answered 16/5, 2022 at 16:38 Comment(0)
