Python zipfile module can't extract filenames with Chinese characters
Y

8

11

I'm trying to use a python script to download files from a Chinese service provider (I'm not from China myself). The provider is giving me a .zip file which contains a file which seems to have Chinese characters in its name. This seems to be causing the zipfile module to barf.

Code:

import zipfile

f = "/path/to/zip_file.zip"

if zipfile.is_zipfile(f):
    fz = zipfile.ZipFile(f, 'r')

The name of the zip file itself doesn't contain any non-ASCII characters, but the name of the file inside it does. When I run the above script I get the following exception:

Traceback (most recent call last):
  File "./temp.py", line 9, in <module>
    fz = zipfile.ZipFile(f, 'r')
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 859, in _RealGetContents
    x.filename = x._decodeFilename()
  File "/usr/lib/python2.7/zipfile.py", line 379, in _decodeFilename
    return self.filename.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbd in position 30: invalid start byte

I've tried looking through the answers to many similar questions.

Please correct me if I'm wrong, but it looks like an open issue with the zipfile module.

How do I get around this? Is there any alternative module for dealing with zipfiles that I should use? Or any other solution?

TIA.

Edit: I can access/unzip the same file perfectly with the linux command-line utility "unzip".

Yautia answered 7/12, 2016 at 14:11 Comment(0)
L
16

Python 2.x (2.7) and Python 3.x handle non-UTF-8 filenames in the zipfile module a bit differently.

First, both check ZipInfo.flag_bits for the entry: if ZipInfo.flag_bits & 0x800 is set, the filename is decoded as UTF-8.

If that check fails, Python 2.x returns the filename as a byte string, while Python 3.x decodes the filename as cp437 and returns the decoded result. In neither version does the module know the true encoding of the filename.

So, suppose you have a filename from a ZipInfo object or from the ZipFile.namelist method, and you already know that the filename is encoded with XXX. This is how you get the correct unicode filename:

# in python 2.x
filename = filename.decode('XXX')


# in python 3.x
filename = filename.encode('cp437').decode('XXX')
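
The two cases above can be folded into one small helper for Python 3. This is a minimal sketch, not part of the zipfile API: recover_name and its default encoding "gbk" are hypothetical choices for illustration, and it assumes you already know the real encoding of the names.

```python
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose flags: name is UTF-8


def recover_name(info, encoding="gbk"):
    """Return the member name, re-decoding it if it was not flagged as UTF-8.

    `encoding` is an assumption supplied by the caller; zipfile cannot detect it.
    """
    if info.flag_bits & UTF8_FLAG:
        # Python 3 already decoded this name as UTF-8.
        return info.filename
    try:
        # Undo the cp437 decoding done by zipfile, then decode properly.
        return info.filename.encode("cp437").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The guess was wrong; fall back to the name zipfile produced.
        return info.filename
```

It could be used as `[recover_name(i) for i in zf.infolist()]` on an open ZipFile.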
Larynx answered 6/10, 2017 at 14:34 Comment(1)
To find which codec name XXX corresponds to your language, check here for Python 2.4 or here for Python 3.x. Decent
P
8

I recently ran into the same problem. Here is my solution; I hope it is useful to you.

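import shutil
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fileinfo in f.infolist():
        # Python 3 decoded the raw name as cp437; undo that and decode it as GBK.
        filename = fileinfo.filename.encode('cp437').decode('gbk')
        # Writes into the current directory; assumes the archive has no subdirectories.
        with f.open(fileinfo) as source, open(filename, "wb") as outputfile:
            shutil.copyfileobj(source, outputfile)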

UPDATE: You can use the following simpler solution with pathlib:

from pathlib import Path
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fn in f.namelist():
        extracted_path = Path(f.extract(fn))
        extracted_path.rename(fn.encode('cp437').decode('gbk'))
Piedadpiedmont answered 4/4, 2018 at 11:26 Comment(0)
A
4

This is almost 6 years late, but this was finally fixed in Python 3.11 with the addition of the metadata_encoding parameter. I'm posting this answer anyway to help other people with similar issues.

import zipfile

f = "your/zip/file.zip"
t = "the/dir/where/you/want/to/extract/it/all"

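# metadata_encoding only applies to members without the UTF-8 flag set;
# pass the encoding the names were actually written in (e.g. "gbk").
with zipfile.ZipFile(f, "r", metadata_encoding="utf-8") as zf: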
    zf.extractall(t)
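
For code that has to run on both sides of Python 3.11, a version check can choose between metadata_encoding and the cp437 round trip used in other answers in this thread. The following is only a sketch: list_names is a hypothetical helper, and the default of "gbk" is an assumption about the archive.

```python
import sys
import zipfile


def list_names(path_or_file, encoding="gbk"):
    """List member names, decoding names that lack the UTF-8 flag with `encoding`."""
    if sys.version_info >= (3, 11):
        # Python 3.11+ can decode unflagged names directly.
        with zipfile.ZipFile(path_or_file, "r", metadata_encoding=encoding) as zf:
            return zf.namelist()
    # Older Python 3: zipfile decoded unflagged names as cp437, so reverse that.
    with zipfile.ZipFile(path_or_file, "r") as zf:
        return [
            info.filename
            if info.flag_bits & 0x800
            else info.filename.encode("cp437").decode(encoding)
            for info in zf.infolist()
        ]
```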
Adjacency answered 15/11, 2022 at 15:9 Comment(0)
F
2

What about this code?

import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    zipInfo = f.infolist()
    for member in zipInfo:
        member.filename = member.filename.encode('cp437').decode('gbk')
        f.extract(member)
Freckly answered 9/1, 2019 at 13:38 Comment(0)
M
1

The ZIP file is invalid. It has a flag that signals that filenames inside it are encoded as UTF-8, but they're actually not; they contain byte sequences that aren't valid as UTF-8. Maybe they're GBK? Maybe something else? Maybe some unholy inconsistent mixture? ZIP tools in the wild are unfortunately very very poor at handling non-ASCII filenames consistently.

A quick workaround might be to replace the library function that decodes the filenames. This is a monkey-patch as there isn't a simple way to inject your own ZipInfo class into ZipFile, but:

zipfile.ZipInfo._decodeFilename = lambda self: self.filename

would disable the attempt to decode the filename, and always return a ZipInfo with a byte string filename property that you can proceed to decode/handle manually in whatever way is appropriate.
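
Once the names are raw byte strings, one possible way to handle them manually is to try a few candidate encodings in order. This is only a heuristic sketch: decode_name and the candidate list are assumptions made for illustration, and a permissive codec such as GBK can accept bytes that were written in a different encoding, so the results are worth checking by eye.

```python
def decode_name(raw, candidates=("utf-8", "gbk", "big5")):
    """Decode a raw filename by trying each candidate encoding in turn."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # cp437 maps every byte value, so this last resort cannot fail.
    return raw.decode("cp437")
```

UTF-8 is tried first because it is strict: byte strings in legacy encodings rarely happen to be valid UTF-8.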

Matinee answered 9/12, 2016 at 22:16 Comment(3)
"It has a flag that signals that filenames inside it are encoded as UTF-8" I've never heard of that flag. Where would one find it? Killick
Sorry I didn't mention this in my question, but I can access/unzip it perfectly with the linux command-line utility "unzip". So I doubt if it's a problem with the file itself. Yautia
@Rhymoid: bit 11 of the file attribute flags word, see PKware appnote sec 4.1.4: “If this bit is set, the filename and comment fields for this file MUST be encoded using UTF-8”. hyperwiser: it is unwise to judge the validity of a file by any one tool's handling of it. Certainly a tool that predates the UTF-8 flag would ignore it, as would a byte-oriented tool that didn't care about encodings. Quite apart from the wildly varying reactions of tools in-the-wild to ZIP's various sloppily-defined edge cases. Matinee
A
1

@Mr.Ham's solution solved my problem perfectly. I'm using the Chinese version of Windows 10, where the default file system encoding is GBK.

For users of other languages, changing the decode target from GBK to their system's default encoding should also work, and Python can detect that default encoding automatically.

So the patched code looks like this:

import zipfile
import locale

default_encoding = locale.getpreferredencoding()

with zipfile.ZipFile("/path/to/zip_file.zip") as f:
    zipinfo = f.infolist()
    for member in zipinfo:
        member.filename = member.filename.encode('cp437').decode(default_encoding)
        # The second argument is the target directory (here, one next to the zip file); omit it to extract into the working directory.
        f.extract(member, "/path/to/zip_file")
Astronaut answered 14/10, 2021 at 5:51 Comment(0)
E
0

In my case, I solved the problem by adding the parameter metadata_encoding='utf-8' to the ZipFile call.

f = zipfile.ZipFile("./dataset_delete_test.zip",'r',metadata_encoding='utf-8')

A quick debugging session showed that an if statement in ZipFile falls through to the branch that decodes the filename as cp437. The reason is that flags is 0, so the UTF-8 bit is not set.

# Without metadata_encoding, class ZipFile in zipfile.py decodes the filename as cp437

f = zipfile.ZipFile("./dataset_delete_test.zip",'r')

The if statement that decodes the filename in zipfile.py:

if flags & _MASK_UTF_FILENAME:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode(self.metadata_encoding or 'cp437')
Eckhart answered 7/9, 2024 at 2:44 Comment(0)
G
-1

In my opinion, this is a better solution than the previous answers.

Change:

        with zipfile.ZipFile(file_path, "r") as zipobj:
            zipobj.extractall(path=dest_dir)
            print("Successfully extracted zip archive to {}".format(dest_dir))

to:

        with zipfile.ZipFile(file_path, "r") as zipobj:
            zipobj._extract_member = lambda a,b,c: _extract_member_new(zipobj, a,b,c)
            zipobj.extractall(path=dest_dir)
            print("Successfully extracted zip archive to {}".format(dest_dir))

where _extract_member_new is:

def _extract_member_new(self, member, targetpath, pwd):
    """Extract the ZipInfo object 'member' to a physical
        file on the path targetpath.
    """
    import os
    import shutil
    import zipfile
    if not isinstance(member, zipfile.ZipInfo):
        member = self.getinfo(member)

    # build the destination pathname, replacing
    # forward slashes to platform specific separators.
    arcname = member.filename.replace('/', os.path.sep)
    arcname = arcname.encode('cp437', errors='replace').decode('gbk', errors='replace')

    if os.path.altsep:
        arcname = arcname.replace(os.path.altsep, os.path.sep)
    # interpret absolute pathname as relative, remove drive letter or
    # UNC path, redundant separators, "." and ".." components.
    arcname = os.path.splitdrive(arcname)[1]
    invalid_path_parts = ('', os.path.curdir, os.path.pardir)
    arcname = os.path.sep.join(x for x in arcname.split(os.path.sep)
                                if x not in invalid_path_parts)
    if os.path.sep == '\\':
        # filter illegal characters on Windows
        arcname = self._sanitize_windows_name(arcname, os.path.sep)

    targetpath = os.path.join(targetpath, arcname)
    targetpath = os.path.normpath(targetpath)

    # Create all upper directories if necessary.
    upperdirs = os.path.dirname(targetpath)
    if upperdirs and not os.path.exists(upperdirs):
        os.makedirs(upperdirs)

    if member.is_dir():
        if not os.path.isdir(targetpath):
            os.mkdir(targetpath)
        return targetpath

    with self.open(member, pwd=pwd) as source, \
            open(targetpath, "wb") as target:
        shutil.copyfileobj(source, target)

    return targetpath
Gmt answered 25/2, 2024 at 11:44 Comment(2)
Please be careful with the words you use. Some words commonly used in speech are considered rude or impolite in written text. Heredia
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. Albatross
