I was unable to get Python 3 to extract from the archive. Here are some results from an investigation (on Mac OS X) that might be helpful.
Check the health of the archive
Make the file read-only in order to prevent accidental changes:
$ chmod -w vertnet_latest_amphibians.zip
$ ls -lh vertnet_latest_amphibians.zip
-r--r--r-- 1 lawh 2045336417 296M Jan 6 10:10 vertnet_latest_amphibians.zip
Check the archive using zip and unzip:
$ zip -T vertnet_latest_amphibians.zip
test of vertnet_latest_amphibians.zip OK
$ unzip -t vertnet_latest_amphibians.zip
Archive: vertnet_latest_amphibians.zip
testing: VertNet_Amphibia_eml.xml OK
testing: __MACOSX/ OK
testing: __MACOSX/._VertNet_Amphibia_eml.xml OK
testing: vertnet_latest_amphibians.csv OK
testing: __MACOSX/._vertnet_latest_amphibians.csv OK
No errors detected in compressed data of vertnet_latest_amphibians.zip
As also found by @sam-mussmann, 7z reports a CRC error:
$ 7z t vertnet_latest_amphibians.zip
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
Scanning the drive for archives:
1 file, 309726398 bytes (296 MiB)
Testing archive: vertnet_latest_amphibians.zip
--
Path = vertnet_latest_amphibians.zip
Type = zip
Physical Size = 309726398
ERROR: CRC Failed : vertnet_latest_amphibians.csv
Sub items Errors: 1
Archives with Errors: 1
Sub items Errors: 1
My zip and unzip are both rather old; 7z is pretty new:
$ zip -v | head -2
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
$ unzip -v | head -1
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
$ 7z --help |head -3
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
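The same kind of integrity check can also be run from Python itself. This is only a minimal sketch: first_bad_member is a name I made up for illustration, and it relies on zipfile's testzip(), which reads every member and returns the name of the first one that fails its check (or None if all pass).

```python
import zipfile


def first_bad_member(archive_path):
    """Return the name of the first member that fails its check, or None."""
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() reads each member in full and verifies its CRC-32
        return archive.testzip()
```

Calling first_bad_member('vertnet_latest_amphibians.zip') under each interpreter should show whether that interpreter's zipfile agrees with unzip or with 7z.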
Extract
Using unzip:
$ time unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
Archive: vertnet_latest_amphibians.zip
inflating: vertnet_latest_amphibians.csv
real 0m17.201s
user 0m14.281s
sys 0m2.460s
Extract using Python 2.7.13, using zipfile's command-line interface for brevity:
$ time ~/local/python-2.7.13/bin/python2 -m zipfile -e vertnet_latest_amphibians.zip .
real 0m19.491s
user 0m12.996s
sys 0m5.897s
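For reference, the command-line call above corresponds roughly to the following use of the zipfile API, restricted to a single member. It is a sketch rather than what the CLI does internally; extract_member is a hypothetical helper name, and the exception is caught only to show where a CRC failure would surface.

```python
import zipfile


def extract_member(archive_path, member, dest_dir="."):
    """Extract one member; return its path, or None if zipfile rejects it."""
    with zipfile.ZipFile(archive_path) as archive:
        try:
            return archive.extract(member, dest_dir)
        except zipfile.BadZipfile as exc:
            # BadZipfile is the Python 2 spelling; Python 3 keeps it as an alias
            print("extraction failed: {}".format(exc))
            return None
```

Note that a failed extraction may leave a partially written file in dest_dir.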
As you found, Python 3.6.0 (also 3.4.5 and 3.5.2) reports a bad CRC.
Hypothesis 1: The archive contains a bad CRC that zip, unzip and Python 2.7.13 are failing to detect; 7z and Python 3.4-3.6 are all doing the right thing.
Hypothesis 2: The archive is fine; 7z and Python 3.4-3.6 all contain a bug.
Given the relative ages of these tools, I would guess that H1 is correct.
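One way to gather more evidence would be to compare the CRC-32 recorded in the archive with a CRC-32 computed over the CSV that unzip wrote to disk. The sketch below assumes the CSV has already been extracted; stored_crc and file_crc are names I picked for illustration. If the two values differ, that would support H1. If they agree, the bytes unzip produced are at least consistent with the recorded checksum, although that alone does not say which tool is at fault.

```python
import zipfile
import zlib


def stored_crc(archive_path, member):
    """CRC-32 recorded in the archive's central directory for member."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.getinfo(member).CRC


def file_crc(path, chunk_size=1 << 20):
    """CRC-32 of a file on disk, computed in chunks to keep memory use flat."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    # Mask so the result is unsigned on Python 2 as well as Python 3
    return crc & 0xFFFFFFFF
```

For example, compare stored_crc('vertnet_latest_amphibians.zip', 'vertnet_latest_amphibians.csv') with file_crc('vertnet_latest_amphibians.csv').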
Workaround
If you are not using Windows and trust the contents of the archive, it might be more straightforward to use regular shell commands. Something like:
wget <the-long-url> -O /tmp/vertnet_latest_amphibians.zip
unzip /tmp/vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
rm /tmp/vertnet_latest_amphibians.zip
Or you could execute unzip from within Python:
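import subprocess
# check_call raises CalledProcessError if unzip exits with a non-zero status
subprocess.check_call(['unzip', 'vertnet_latest_amphibians.zip', 'vertnet_latest_amphibians.csv'])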
Incidental
It is slightly neater to catch ImportError than to check the version of the Python interpreter:
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve
3.6.0. I just added it to the question. – Beersheba
2.7.12 (working) and 3.4.3 (failing). I've got to leave this now, but I did verify that the CRC in the ZipInfo is the same in both Python versions, so I think this is a difference in CRC32 computation. As a side note, my version of 7-zip (which I haven't updated since 2010) also thinks this file is broken. – Crocked