unzipping file results in "BadZipFile: File is not a zip file"

16

68

I have two zip files, both of them open well with Windows Explorer and 7-zip.

However, when I open them with Python's zipfile module [ zipfile.ZipFile("filex.zip") ], one of them opens but the other raises the error "BadZipfile: File is not a zip file".

I've made sure that the latter one is a valid Zip File by opening it with 7-Zip and looking at its properties (says 7Zip.ZIP). When I open the file with a text editor, the first two characters are "PK", showing that it is indeed a zip file.

I'm using Python 2.5 and really have no idea how to approach this. I've tried it on both Windows and Ubuntu, and the problem exists on both platforms.

Update: Traceback from Python 2.5.4 on Windows:

Traceback (most recent call last):
  File "<module1>", line 5, in <module>
    zipfile.ZipFile("c:/temp/test.zip")
  File "C:\Python25\lib\zipfile.py", line 346, in __init__
    self._GetContents()
  File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
    self._RealGetContents()
  File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

Basically, when the _EndRecData function is called to get data from the "End of Central Directory" record, the comment length check fails [ endrec[7] == len(comment) ].

The values of the locals in the _EndRecData function are as follows:

 END_BLOCK: 4096,
 comment: '\x00',
 data: '\xd6\xf6\x03\x00\x88,N8?<e\xf0q\xa8\x1cwK\x87\x0c(\x82a\xee\xc61N\'1qN\x0b\x16K-\x9d\xd57w\x0f\xa31n\xf3dN\x9e\xb1s\xffu\xd1\.....', (truncated)
 endrec: ['PK\x05\x06', 0, 0, 4, 4, 268, 199515, 0],
 filesize: 199806L,
 fpin: <open file 'c:/temp/test.zip', mode 'rb' at 0x045D4F98>,
 start: 4073
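
The numbers above can be cross-checked by parsing the record directly. This is only a minimal sketch, not the code from zipfile.py: the helper name inspect_eocd is made up for illustration, and it assumes that the last PK\x05\x06 signature in the file is the real End of Central Directory record. It reports what the comment-length field claims and how many bytes actually follow the 22-byte record.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4s4H2LH"                  # fixed part of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def inspect_eocd(data):
    """Return (declared_comment_length, bytes_after_record), or None.

    Hypothetical helper: assumes the last signature found is the real record.
    """
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        return None
    fields = struct.unpack(EOCD_FORMAT, data[pos:pos + EOCD_SIZE])
    declared = fields[7]                      # comment length field, like endrec[7]
    trailing = len(data) - (pos + EOCD_SIZE)  # bytes actually present after the record
    return declared, trailing


# Build a small valid archive in memory, then append one stray byte,
# mimicking the comment '\x00' seen in the locals above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")
good = buf.getvalue()
bad = good + b"\x00"

print(inspect_eocd(good))  # declared length and trailing bytes agree
print(inspect_eocd(bad))   # one trailing byte that the record does not declare
```

The same arithmetic applies to the values shown: the central directory starts at offset 199515 and is 268 bytes long, so the record starts at 199783 and ends at 199805, one byte short of the 199806-byte file size, while endrec[7] says the comment is 0 bytes long.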
Nogas answered 21/6, 2010 at 8:55 Comment(8)
Try running the Unix file command on both of your files. Maybe it will give you some clue. - Slumber
For both files it says: Zip archive data, at least v2.0 to extract - Nogas
Bad news. I hoped it would say something different. Do all your files get uncompressed by 7zip without any errors? Can they both be uncompressed with Unix's unzip command as well? Did you update your Python libzip bindings to the latest version? - Slumber
Yes, both files get uncompressed by 7-zip as well as unzip without any errors. I haven't tried updating the libzip bindings to the latest version. How do I do that? - Nogas
Could it be this: bugs.python.org/issue1757072 ? - Buzzard
In my case the file I was decompressing was named like file.compress.2020.zip; renaming it to remove the extra dots (file_compress_2020.zip) solved the issue. - Plautus
For me this happened when the file wasn't downloaded fully, I think. So I just delete it in my download code. - Scrub
@CharlieParker same here - I just deleted the file and redownloaded it and it worked fine. (These were .npz files). - Billow
22

Files named file can confuse Python; try naming it something else. If it still won't work, try this code:

def fixBadZipfile(zipFile):
    # Cut off anything that follows the 22-byte 'end of central directory' record.
    with open(zipFile, 'r+b') as f:
        data = f.read()
        pos = data.find(b'\x50\x4b\x05\x06')  # End of central directory signature
        if pos >= 0:
            print("Truncating file at location " + str(pos + 22) + ".")
            f.seek(pos + 22)   # size of 'ZIP end of central directory record'
            f.truncate()
        else:
            # signature not found: the file itself is truncated
            raise ValueError("End of central directory signature not found")
Sprain answered 8/7, 2012 at 18:17 Comment(5)
Impressed with the signature thing - Ellinger
replace with b'\x50\x4b\x05\x06' to avoid TypeError: argument should be integer or bytes-like object, not 'str' - Zaller
For other people, I'd recommend trying this solution first before trying the one below. I thought the one by UltramaticOrange was an improved version of this. It's not; they are different. - Finned
It gives the same error regardless of the filename. - Chemulpo
This function won't run because self is undefined. What is self supposed to refer to in self._log(...)? Is this supposed to go in some class? - Modernism
16

I ran into the same issue. My problem was that the file was a gzip file rather than a zip file. I switched to the class gzip.GzipFile and it worked like a charm.
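
For reference, a minimal sketch of that switch, assuming the file is a single gzip-compressed file (not a tar archive) that merely carries a misleading name. The function name gunzip_to and the file names are placeholders. Copying with shutil.copyfileobj streams the data in chunks, so the whole file never has to fit in memory.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_to(src_path, dst_path):
    """Decompress a gzip file to dst_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in fixed-size chunks


# Demonstration with a throwaway file: gzip data that is named like a zip.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "mislabeled.zip")
dst = os.path.join(workdir, "payload.txt")
with gzip.open(src, "wb") as f:
    f.write(b"some payload\n")

gunzip_to(src, dst)
with open(dst, "rb") as f:
    restored = f.read()
print(restored)
```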

Splenius answered 18/9, 2013 at 13:39 Comment(4)
How do I extract using GzipFile? - Contrasty
@pyd you can do it with the tarfile module: with tarfile.open('somefile.zip', "r:gz") as f: f.extractall() - Volost
@pyd, have a look at the examples in the documentation of Python's standard library. - Splenius
This would work for smaller files, but not if the compressed data is large and there is not enough RAM. In my case, I am using a Colab notebook with 13 GB of RAM and my compressed file is around 25 GB, so it makes sense that the notebook crashes. - Armourer
13

astronautlevel's solution works for most cases, but the compressed data and CRCs in the zip can also contain the same 4 bytes. You should do an rfind (not find), seek to pos+20, and then write \x00\x00 to the end of the file (telling zip applications that the 'comments' section is 0 bytes long).


    # HACK: See http://bugs.python.org/issue10694
    # The zip file generated is correct, but because of extra data after the 'central directory' section,
    # some versions of Python (and some zip applications) can't read the file. By removing the extra data,
    # we ensure that all applications can read the zip without issue.
    # The ZIP format: http://www.pkware.com/documents/APPNOTE/APPNOTE-6.3.0.TXT
    # Finding the end of the central directory:
    #   https://mcmap.net/q/296734/-how-to-find-the-position-of-central-directory-in-a-zip-file
    #   https://mcmap.net/q/296735/-why-can-39-t-python-execute-a-zip-archive-passed-via-stdin
    #       This second link is only loosely related, but echoes the first, "processing a ZIP archive often requires backwards seeking"
    # zipFileContainer is a seekable binary file object opened for update, e.g. open(zipFile, 'r+b')
    content = zipFileContainer.read()
    pos = content.rfind(b'\x50\x4b\x05\x06') # reverse find: this string of bytes is the end of the zip's central directory.
    if pos >= 0:
        zipFileContainer.seek(pos+20) # +20: see section V.I in 'ZIP format' link above.
        zipFileContainer.truncate()
        zipFileContainer.write(b'\x00\x00') # Zip file comment length: 0 byte length; tell zip applications to stop reading.
        zipFileContainer.seek(0)

    return zipFileContainer
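
To see the effect end to end, here is a self-contained sketch that uses an in-memory file in place of zipFileContainer. The wrapper name strip_trailing_data is invented for this example, and it assumes the archive has no genuine comment worth keeping.

```python
import io
import zipfile


def strip_trailing_data(zip_file_container):
    """Drop everything after the end-of-central-directory record and set
    its comment length to zero.

    Hypothetical wrapper around the steps above; expects a seekable binary
    file object opened for update.
    """
    zip_file_container.seek(0)
    content = zip_file_container.read()
    pos = content.rfind(b"\x50\x4b\x05\x06")
    if pos >= 0:
        zip_file_container.seek(pos + 20)      # offset of the comment-length field
        zip_file_container.truncate()          # cut the file at that position
        zip_file_container.write(b"\x00\x00")  # comment length = 0
    zip_file_container.seek(0)
    return zip_file_container


# A valid archive with one stray byte appended after the record.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "data")
damaged = io.BytesIO(buf.getvalue() + b"\x00")

# Whether zipfile accepts the damaged buffer depends on the Python version,
# so only the repaired one is opened here.
repaired = strip_trailing_data(damaged)
with zipfile.ZipFile(repaired) as zf:
    names = zf.namelist()
print(names)
```

With a file on disk, the equivalent would be to pass a handle opened with open(path, 'r+b').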
Shivaree answered 24/2, 2014 at 18:52 Comment(3)
I haven't worked with Python before, but I need to solve my issue. I can't understand what zipFileContainer is and why there is a return statement. Could you please enhance this answer and explain how this piece of code works? - Scientistic
Hi @Aleksandrs, it's been a while since I've had to look at this code, but I'm pretty sure zipFileContainer is just the file handle. The line would look like zipFileContainer = open(zipFile, 'r+b'). In the job I held at the time, I would've been working with file-like objects and not actual files, hence the weird variable name. - Shivaree
@Scientistic I made a more complete example as a gist on GitHub. - Shivaree
4

I had the same problem and was able to solve it for my files; see my answer at "zipfile cant handle some type of zip data?"

Priscilapriscilla answered 17/9, 2011 at 21:3 Comment(0)
4

I'm very new to Python and was facing the exact same issue; none of the previous methods worked. Printing the 'corrupted' file just before unzipping it returned an empty bytes object.

It turned out that I was trying to unzip the file right after writing it to disk, without closing the file handle.

with open(path, 'wb') as outFile:
    outFile.write(data)
# Leaving the with block closes the file; closing it first was what I was missing.
with zipfile.ZipFile(path, 'r') as zf:
    zf.extractall(destination)

Closing the file stream then unzipping the file resolved my issue.

Leeds answered 16/6, 2022 at 13:46 Comment(1)
This is what I was missing. - Letta
3

Sometimes a zip file contains corrupted files, and unzipping it gives a BadZipFile error. Tools like 7-Zip and WinRAR ignore these errors and unzip the file successfully. You can create a subprocess and use this code to unzip your zip file without getting the BadZipFile error.

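import subprocess
ziploc = "C:/Program Files/7-Zip/7z.exe"  # location where 7zip is installed
zip_file = "your_Zip_file.zip"            # placeholder: path of the archive to extract
output_directory = "OutputDirectory"      # placeholder: where to extract to
cmd = [ziploc, 'e', zip_file, '-o' + output_directory, '-r']
sp = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
output, _ = sp.communicate()              # wait for 7-Zip to finish and collect its output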
Masquer answered 31/1, 2019 at 9:53 Comment(0)
3

I faced this problem and was looking for a good, clean solution, but found none until I came across this answer. I had the same problem as @marsl (among the answers): in my case the file was a gzip file rather than a zip file.

I could unarchive and decompress my gzip file with this approach:

import tarfile

with tarfile.open(archive_path, "r:gz") as gzip_file:
    gzip_file.extractall()
Volost answered 27/12, 2020 at 20:19 Comment(0)
2

Show the full traceback that you got from Python -- this may give a hint as to what the specific problem is. Unanswered: What software produced the bad file, and on what platform?

Update: Traceback indicates having problem detecting the "End of Central Directory" record in the file -- see function _EndRecData starting at line 128 of C:\Python25\Lib\zipfile.py

Suggestions:
(1) Trace through the above function
(2) Try it on the latest Python
(3) Answer the question above.
(4) Read this and anything else found by google("BadZipfile: File is not a zip file") that appears to be relevant

Mcatee answered 21/6, 2010 at 11:16 Comment(2)
Windows, Python 2.5.2: Traceback (most recent call last): File "<module1>", line 5, in <module> zipfile.ZipFile("c:/temp/test.zip") File "C:\Python25\lib\zipfile.py", line 346, in __init__ self._GetContents() File "C:\Python25\lib\zipfile.py", line 366, in _GetContents self._RealGetContents() File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents raise BadZipfile, "File is not a zip file" BadZipfile: File is not a zip file - Nogas
Thanks for the link. I've already gone through it but that didn't help. Tested on Python 2.5.4, 2.6.5 on Windows and Python 2.5.2 on Ubuntu 64-bit. - Nogas
1

Have you tried a newer Python or, if that is too much trouble, simply a newer zipfile.py? I have successfully used a copy of zipfile.py from Python 2.6.2 (the latest at the time) with Python 2.5 in order to open some zip files that weren't supported by Python 2.5's zipfile module.

Drop answered 22/6, 2010 at 7:3 Comment(0)
1

In some cases, you have to check whether the zip file is actually in gzip format. This was the case for me, and I solved it with:

import requests
import tarfile
url = ".tar.gz link"
response = requests.get(url, stream=True)
file = tarfile.open(fileobj=response.raw, mode="r|gz")
file.extractall(path=".")
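
One way to confirm the format before choosing a module is to look at the first bytes of the file. The sketch below is an illustration only: sniff_format is a made-up helper name, it needs a seekable binary file object (for example a file already saved to disk, not a raw network stream), and it checks only the leading signature, so it would miss a zip that has other data prepended.

```python
import gzip
import io
import zipfile

ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
GZIP_MAGIC = b"\x1f\x8b"


def sniff_format(fileobj):
    """Guess 'zip', 'gzip' or 'unknown' from the first bytes of a binary file object."""
    start = fileobj.tell()
    head = fileobj.read(4)
    fileobj.seek(start)  # leave the stream where we found it
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head in ZIP_MAGICS:
        return "zip"
    return "unknown"


# Build one sample of each kind in memory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("a.txt", "data")
zip_buf.seek(0)
gz_buf = io.BytesIO(gzip.compress(b"data"))
txt_buf = io.BytesIO(b"plain text")

print(sniff_format(zip_buf))
print(sniff_format(gz_buf))
print(sniff_format(txt_buf))
```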
Tamqrah answered 19/10, 2021 at 9:8 Comment(0)
0

For me, this happened when the file wasn't downloaded fully, I think. So I just delete it in my download code.

from pathlib import Path
from typing import Optional


def download_and_extract(url: str,
                         path_used_for_zip: Path = Path('~/data/'),
                         path_used_for_dataset: Path = Path('~/data/tmp/'),
                         rm_zip_file_after_extraction: bool = True,
                         force_rewrite_data_from_url_to_file: bool = False,
                         clean_old_zip_file: bool = False,
                         gdrive_file_id: Optional[str] = None,
                         gdrive_filename: Optional[str] = None,
                         ):
    """
    Downloads data and tries to extract it according to different protocols/file types.

    note:
        - to force a download do:
            force_rewrite_data_from_url_to_file = True
            clean_old_zip_file = True
        - to NOT remove file after extraction:
            rm_zip_file_after_extraction = False


    Tested with:
    - zip files, yes!

    Later:
    - todo: tar, gz, gdrive
    force_rewrite_data_from_url_to_file = removes the data from url (likely a zip file) and redownloads the zip file.
    """
    path_used_for_zip: Path = Path(path_used_for_zip).expanduser()
    path_used_for_zip.mkdir(parents=True, exist_ok=True)
    path_used_for_dataset: Path = Path(path_used_for_dataset).expanduser()
    path_used_for_dataset.mkdir(parents=True, exist_ok=True)
    # - download data from url
    if gdrive_filename is None:  # get data from url, not using gdrive
        import ssl
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        print("downloading data from url: ", url)
        import urllib.request
        import http.client
        response: http.client.HTTPResponse = urllib.request.urlopen(url, context=ctx)
        print(f'{type(response)=}')
        data = response
        # save zipfile like data to path given
        filename = url.rpartition('/')[2]
        path2file: Path = path_used_for_zip / filename
    else:  # gdrive case
        from torchvision.datasets.utils import download_file_from_google_drive
        # if zip not there re-download it or force get the data
        path2file: Path = path_used_for_zip / gdrive_filename
        if not path2file.exists():
            download_file_from_google_drive(gdrive_file_id, path_used_for_zip, gdrive_filename)
        filename = gdrive_filename
    # -- write downloaded data from the url to a file
    print(f'{path2file=}')
    print(f'{filename=}')
    if clean_old_zip_file:
        path2file.unlink(missing_ok=True)
    if filename.endswith('.zip') or filename.endswith('.pkl'):
        # if path to file does not exist or force to write down the data
        if not path2file.exists() or force_rewrite_data_from_url_to_file:
            # delete file if there is one if your going to force a rewrite
            path2file.unlink(missing_ok=True) if force_rewrite_data_from_url_to_file else None
            print(f'about to write downloaded data from url to: {path2file=}')
            # wb+ is used since the zip file was in bytes, otherwise w+ is fine if the data is a string
            with open(path2file, 'wb+') as f:
            # with open(path2file, 'w+') as f:
                print(f'{f=}')
                print(f'{f.name=}')
                f.write(data.read())
            print(f'done writing downloaded from url to: {path2file=}')
    elif filename.endswith('.gz'):
        pass  # the download of the data doesn't seem to be explicitly handled by me, that is done in the extract step by a magic function tarfile.open
    # elif is_tar_file(filename):
    #     os.system(f'tar -xvzf {path_2_zip_with_filename} -C {path_2_dataset}/')
    else:
        raise ValueError(f'File type {filename=} not supported.')

    # - unzip data written in the file
    extract_to = path_used_for_dataset
    print(f'about to extract: {path2file=}')
    print(f'extract to target: {extract_to=}')
    if filename.endswith('.zip'):
        import zipfile  # this one is for zip files, inspired from l2l
        zip_ref = zipfile.ZipFile(path2file, 'r')
        zip_ref.extractall(extract_to)
        zip_ref.close()
        if rm_zip_file_after_extraction:
            path2file.unlink(missing_ok=True)
    elif filename.endswith('.gz'):
        import tarfile
        file = tarfile.open(fileobj=response, mode="r|gz")
        file.extractall(path=extract_to)
        file.close()
    elif filename.endswith('.pkl'):
        # no need to extract it, but when you use the data make sure you torch.load it or pickle.load it.
        import torch  # only needed for the .pkl case
        print(f'about to test torch.load of: {path2file=}')
        data = torch.load(path2file)  # just to test
        assert data is not None
        print(f'{data=}')
        pass
    else:
        raise ValueError(f'File type {filename=} not supported, edit code to support it.')
        # path_2_zip_with_filename = path_2_ziplike / filename
        # os.system(f'tar -xvzf {path_2_zip_with_filename} -C {path_2_dataset}/')
        # if rm_zip_file:
        #     path_2_zip_with_filename.unlink(missing_ok=True)
        # # raise ValueError(f'File type {filename=} not supported.')
    print(f'done extracting: {path2file=}')
    print(f'extracted at location: {path_used_for_dataset=}')
    print(f'-->Success downloading & extracting dataset at location: {path_used_for_dataset=}')

You can use my code with pip install ultimate-utils for the most up-to-date version.
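
The "delete it if it is bad" idea on its own can be sketched much more briefly. This is not the code from ultimate-utils: remove_if_bad_zip is a hypothetical helper that tries to open the archive, runs zipfile's built-in CRC check, and removes the file when either step fails so that the next run downloads it again. The interrupted download is simulated by keeping only the first half of a valid archive.

```python
import os
import tempfile
import zipfile


def remove_if_bad_zip(path):
    """Return True if path is a readable zip; otherwise delete it and return False."""
    try:
        with zipfile.ZipFile(path) as zf:
            bad_member = zf.testzip()  # name of the first member failing its CRC check, or None
    except zipfile.BadZipFile:
        bad_member = path              # the archive could not be opened at all
    if bad_member is not None:
        os.remove(path)
        return False
    return True


workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "good.zip")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("a.txt", "data" * 100)

# Simulate an interrupted download: keep only the first half of the bytes.
partial = os.path.join(workdir, "partial.zip")
with open(good, "rb") as src, open(partial, "wb") as dst:
    payload = src.read()
    dst.write(payload[: len(payload) // 2])

good_ok = remove_if_bad_zip(good)
partial_ok = remove_if_bad_zip(partial)
print(good_ok, partial_ok, os.path.exists(partial))
```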

Scrub answered 12/11, 2022 at 3:46 Comment(0)
0

In another case, this error shows up when an ML/DL model file is in a different format from the one you are trying to load. For example, you try to open a pickle file, but the model was saved as .sav.

Solution: change the file back to its original format, e.g. pickle --> .pkl, TensorFlow --> .h5, etc.

Nunes answered 19/12, 2022 at 4:19 Comment(0)
0

In my case, the zip file itself was missing from that directory - thus when I tried to unzip it, I got the error "BadZipFile: File is not a zip file". It got resolved after I moved the .zip file to the directory. Please confirm that the file is indeed present in your directory before running the python script.

Clathrate answered 11/1, 2023 at 5:2 Comment(0)
0

In my case, the zip file was simply broken. Unzipping it with NanaZip or 7-Zip gave me an error message like "the zip file is broken".

Shushubert answered 26/9, 2023 at 6:53 Comment(0)
0

I also faced a similar problem when I tried to unzip my file from the drive. I used an online file-zipping website to zip my file, which did not break it, and the error was no longer raised for me.

Desirable answered 27/12, 2023 at 4:9 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Bonnee
-1

In my case, the zip file was corrupted. I was trying to download the zip file with urllib.request.urlretrieve but the file wouldn't completely download for some reason.

I connected to a VPN, the file downloaded just fine, and I was able to open the file.

Wizened answered 3/5, 2022 at 13:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.