namelist() from ZipFile returns strings with an invalid encoding

The problem is that for some archives or files uploaded to my Python application, ZipFile's namelist() returns badly decoded strings.

from zipfile import ZipFile
for name in ZipFile('zipfile.zip').namelist():
    print('Listing zip files: %s' % name)

How can I fix this code so that file names are always decoded to Unicode correctly (so that Chinese, Russian and other languages are supported)?

I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.

Willi answered 9/6, 2016 at 10:33 Comment(2)
#1807563 may give you some answers, specifically the second answer.Brandtr
Regardless of the workarounds people posted in answers here, none of them is an actually reliable solution with an explanation of how other programs in other languages handle the problem. There is also no information from the Python package developers on this.Willi
U
10

How can I fix this code so that file names are always decoded to Unicode correctly (so that Chinese, Russian and other languages are supported)?

Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.

There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately, files you receive from Windows users typically don't have it, so you're left guessing with inherently unreliable methods like chardet.

I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.

Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:

  • if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back

  • otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However, you can re-encode the string back to the original bytes and then decode again using the code page you actually want, e.g. name.encode('cp437').decode('cp1252').
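
The round trip from the second bullet can be sketched like this. The helper name redecode and the sample filename are illustrative assumptions, not part of zipfile. The sketch relies on Python's cp437 codec mapping every byte value, so encoding back to bytes loses nothing.

```python
def redecode(name: str, encoding: str) -> str:
    """Undo zipfile's cp437 decoding and decode with the intended code page."""
    raw = name.encode('cp437')   # recover the bytes as stored in the archive
    return raw.decode(encoding)

# Simulate what zipfile does to a cp866-encoded name that lacks the UTF-8 flag.
original = 'привет.txt'
mangled = original.encode('cp866').decode('cp437')

# Re-decoding with the right code page gives the original name back.
restored = redecode(mangled, 'cp866')
```

This only helps when you already know (or can guess) the code page used by the machine that created the archive.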

Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:

import chardet
from zipfile import ZipFile

ZIP_FILENAME_UTF8_FLAG = 0x800

for info in ZipFile('zipfile.zip').infolist():
    filename = info.filename
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
        # Flag not set: undo the cp437 decoding and guess the real encoding
        filename_bytes = filename.encode('cp437')
        guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
        filename = filename_bytes.decode(guessed_encoding, 'replace')
    ...
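
If you would rather not depend on chardet, one possible variant is to try strict UTF-8 first and only then fall back to a code page you pick yourself. This is a sketch under assumptions: the helper name fix_name and the default fallback of cp1252 are my own choices, and trying UTF-8 first is a heuristic (it catches archives that store UTF-8 names without setting the flag, since legacy-encoded non-ASCII names rarely happen to be valid UTF-8).

```python
from zipfile import ZipInfo

ZIP_FILENAME_UTF8_FLAG = 0x800

def fix_name(info: ZipInfo, fallback: str = 'cp1252') -> str:
    """Return a best-guess filename for one archive member."""
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG:
        return info.filename             # zipfile already decoded it as UTF-8
    raw = info.filename.encode('cp437')  # bytes as stored in the archive
    try:
        return raw.decode('utf-8')       # strict: rejects most legacy-encoded names
    except UnicodeDecodeError:
        return raw.decode(fallback, 'replace')
```

You would call it as fix_name(info) or fix_name(info, 'cp866') for each entry of infolist(). It is still a guess: a wrong fallback produces readable-looking but incorrect names.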
Uterus answered 12/6, 2016 at 10:59 Comment(2)
I'd like to note that I have encountered ZIP files from Mac OS X that do encode the file list as UTF-8 but forget to set the flag.Midis
Thanks!! it worked for my case when the name of the files was encoded with cp949. Just using: name.encode('cp437').decode('cp949') worked!Matriculate
R
7

Here's the code that decodes filenames in zipfile.py according to the zip spec that supports only cp437 and utf-8 character encodings:

        if flags & 0x800:
            # UTF-8 file names extension
            filename = filename.decode('utf-8')
        else:
            # Historical ZIP filename encoding
            filename = filename.decode('cp437')

As you can see, if the 0x800 flag is not set, i.e., if utf-8 is not used in your input zipfile.zip, then cp437 is used, and therefore the result for "Chineeze, Russian and other languages" is likely to be incorrect.

In practice, ANSI or OEM Windows codepages may be used instead of cp437.

If you know the actual character encoding, e.g., cp866 (the OEM (console) codepage that may be used on Russian Windows), then you can re-encode the filenames to get the original filenames back:

filename = corrupted_filename.encode('cp437').decode('cp866')

The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:

c:\> 7z.exe a -tzip -mcu archive.zip <files>..

or

$ python -mzipfile -c archive.zip <files>..
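
To check what the zipfile module does when it writes an archive, here is a small in-memory sketch. The file names are made up for illustration. As far as I can tell, CPython's zipfile stores non-ASCII names as UTF-8 and sets the 0x800 flag, while pure-ASCII names are written without it.

```python
import io
from zipfile import ZipFile

UTF8_FLAG = 0x800

# Write a small archive into memory.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('привет.txt', b'data')   # non-ASCII name
    archive.writestr('plain.txt', b'data')    # ASCII-only name

# Read it back and record whether each entry carries the UTF-8 flag.
with ZipFile(buffer) as archive:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in archive.infolist()}
```

An archive written this way lists correctly with namelist() with no re-encoding needed.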
Rarefaction answered 12/6, 2016 at 17:2 Comment(2)
This did it for me: f = filename.encode('cp437').decode('cp866') and vice versa: arch = archive.open( f.encode('cp866').decode('cp437') )Prentice
maybe try .encode('cp437').decode('gbk') for chinese namesChristiechristin
S
2

I had the same problem, but with a known language (Russian).

  1. The simplest solution is to convert the archive with this utility: https://github.com/vlm/zip-fix-filename-encoding For me it works on 98% of archives (it failed on 317 files from a corpus of 11388).

  2. A more complex solution: use the Python module chardet with zipfile. It depends on the Python version (2 or 3) you use, since zipfile behaves differently between them. For Python 3 I wrote this code:

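    import chardet
    original_name = name  # name as returned by zipfile; needed to open the member
    try:
        name = name.encode('cp437')  # old-style entry: recover the raw bytes
    except UnicodeEncodeError:
        name = name.encode('utf8')   # already decoded as UTF-8 by zipfile
    # chardet may return None for the encoding; fall back to UTF-8
    encoding = chardet.detect(name)['encoding'] or 'utf8'
    name = name.decode(encoding, 'replace')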
    

    This code first tries to handle old-style zips (names decoded as CP437, which leaves them broken); if that fails, the archive appears to be new-style (UTF-8). After determining the proper encoding, you can extract files with code like:

    from shutil import copyfileobj
    # Open the member by its original (zipfile-decoded) name, write it under the fixed one
    with archive.open(original_name) as fp, open(name, 'wb') as fp_out:
        copyfileobj(fp, fp_out)
    

In my case, this resolved the last 2% of failed files.
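
Putting the pieces together, a whole archive could be extracted under corrected names roughly like this. It is only a sketch: the function name extract_fixed is hypothetical, the encoding is assumed to be known in advance (no chardet), and the path check is a basic guard I added against entries that would land outside the target directory.

```python
import os
from shutil import copyfileobj
from zipfile import ZipFile

UTF8_FLAG = 0x800

def extract_fixed(zip_path, dest_dir, encoding):
    """Extract every file, writing it under its re-decoded name."""
    root = os.path.abspath(dest_dir)
    written = []
    with ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = info.filename
            if not info.flag_bits & UTF8_FLAG:
                # Undo zipfile's cp437 decoding, then apply the real encoding
                name = name.encode('cp437').decode(encoding, 'replace')
            target = os.path.abspath(os.path.join(root, name))
            # Refuse absolute paths or ".." components that escape dest_dir
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f'unsafe path in archive: {name!r}')
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Open by ZipInfo so the lookup uses the name zipfile knows
            with archive.open(info) as src, open(target, 'wb') as dst:
                copyfileobj(src, dst)
            written.append(target)
    return written
```

For a Russian archive created on Windows you might call extract_fixed('zipfile.zip', 'out', 'cp866'). The function returns the list of paths it wrote.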

Smoothspoken answered 19/7, 2019 at 17:29 Comment(0)
