Downloading and unzipping a .zip file without writing to disk
Asked Answered
M

10

123

I have managed to get my first python script to work which downloads a list of .ZIP files from a URL and then proceeds to extract the ZIP files and writes them to disk.

I am now at a loss to achieve the next step.

My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.

Here is my current script which works but unfortunately has to write the files to disk.

import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle

# check for extraction directories existence
if not os.path.isdir('downloaded'):
    os.makedirs('downloaded')

if not os.path.isdir('extracted'):
    os.makedirs('extracted')

# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
    downloadedLog = pickle.load(open('downloaded.pickle'))
else:
    downloadedLog = {'key':'value'}

# remove entries older than 5 days (to maintain speed)

# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()

# only parse urls
for url in parser.urls: 
    if "PUBLIC_P5MIN" in url:

        # download the file
        downloadURL = zipFileURL + url
        outputFilename = "downloaded/" + url

        # check if file already exists on disk
        if url in downloadedLog or os.path.isfile(outputFilename):
            print "Skipping " + downloadURL
            continue

        print "Downloading ",downloadURL
        response = urllib2.urlopen(downloadURL)
        zippedData = response.read()

        # save data to disk
        print "Saving to ",outputFilename
        output = open(outputFilename,'wb')
        output.write(zippedData)
        output.close()

        # extract the data
        zfobj = zipfile.ZipFile(outputFilename)
        for name in zfobj.namelist():
            uncompressed = zfobj.read(name)

            # save uncompressed data to disk
            outputFilename = "extracted/" + name
            print "Saving extracted file to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(uncompressed)
            output.close()

            # send data via tcp stream

            # file successfully downloaded and extracted store into local log and filesystem log
            downloadedLog[url] = time.time();
            pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))
Monstrosity answered 19/4, 2011 at 2:13 Comment(1)
ZIP format isn't designed to be streamed. It uses footers, meaning you need the end of the file to figure out where things belong inside it, meaning you need to have the whole file before you can do anything with a subset of it.Tetrasyllable
D
79

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:

# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'

import zipfile
from StringIO import StringIO

zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()

# output: "hey, foo"

Or more simply (apologies to Vishal):

myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]

In Python 3 use BytesIO instead of StringIO:

import zipfile
from io import BytesIO

filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
Damselfish answered 19/4, 2011 at 2:23 Comment(7)
"The StringIO object can accept either Unicode or 8-bit strings" Doesn't this mean that if the number of bytes you expect to write is not congruent to 0 mod 8, then you will either throw an exception or write incorrect data?Oosperm
Not at all -- why would you only be able to write 8 bytes at a time? Conversely, when do you ever write fewer than 8 bits at a time?Damselfish
@ninjagecko: You seem to fear a problem if the number of bytes expected to be written is not a multiple of 8. That is not derivable from the statement about StringIO and is quite baseless. The problem with StringIO is when the user mixes unicode objects with str objects that are not decodable by the system default encoding (which is typically ascii).Dyscrasia
Small comment on the above code: when you read multiple files out of the .zip, make sure you read the data out one by one, because calling zipfile.open two times will remove the reference in the first.Potto
Notice that as of Python 3 you have to use from io import StringIOJotter
@Damselfish how do you save the file you downloaded?Popelka
What do you mean by "get_zip_data() gets a zip archive"? What type of object does it return? How does it get it?Hers
V
115

Below is a code snippet I used to fetch zipped csv file, please have a look:

Python 2:

from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen

resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
    print line

Python 3:

from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content

resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
    print(line.decode('utf-8'))

Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,

resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Viscosity answered 19/4, 2011 at 2:53 Comment(3)
Note it's a bit confusing to use the same name for a class instances and a library - the identifier zipfile gets redefined in this code.Cylix
I just fixed that :)Ruppert
When I do resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip'), I get urllib.error.HTTPError: HTTP Error 406: Not Acceptable.Hers
D
79

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:

# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'

import zipfile
from StringIO import StringIO

zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()

# output: "hey, foo"

Or more simply (apologies to Vishal):

myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]

In Python 3 use BytesIO instead of StringIO:

import zipfile
from io import BytesIO

filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
Damselfish answered 19/4, 2011 at 2:23 Comment(7)
"The StringIO object can accept either Unicode or 8-bit strings" Doesn't this mean that if the number of bytes you expect to write is not congruent to 0 mod 8, then you will either throw an exception or write incorrect data?Oosperm
Not at all -- why would you only be able to write 8 bytes at a time? Conversely, when do you ever write fewer than 8 bits at a time?Damselfish
@ninjagecko: You seem to fear a problem if the number of bytes expected to be written is not a multiple of 8. That is not derivable from the statement about StringIO and is quite baseless. The problem with StringIO is when the user mixes unicode objects with str objects that are not decodable by the system default encoding (which is typically ascii).Dyscrasia
Small comment on the above code: when you read multiple files out of the .zip, make sure you read the data out one by one, because calling zipfile.open two times will remove the reference in the first.Potto
Notice that as of Python 3 you have to use from io import StringIOJotter
@Damselfish how do you save the file you downloaded?Popelka
What do you mean by "get_zip_data() gets a zip archive"? What type of object does it return? How does it get it?Hers
V
34

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.

from io import BytesIO
from zipfile import ZipFile
import urllib.request
    
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
        for line in my_zip_file.open(contained_file).readlines():
            print(line)
            # output.write(line)

Necessary changes:

  • There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO]2, because we will be handling a bytestream -- Docs, also this thread.
  • urlopen:

Note:

  • In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.

A few minor changes I made:

  • I use with ... as instead of zipfile = ... according to the Docs.
  • The script now uses .namelist() to cycle through all the files in the zip and print their contents.
  • I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
  • I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
    • Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
  • I am using an example file, the UN/LOCODE text archive:

What I didn't do:

  • NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.

Here's a way:

import urllib.request
import shutil

with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'w') as out_file:
    shutil.copyfileobj(response, out_file)
Verwoerd answered 8/2, 2017 at 16:49 Comment(2)
If you want to write all the files to disk the easier way is to use my_zip_file.extractall('my_target')` instead of looping. But that's great!Lanfranc
can you please help me with this question : #62417955Whitefish
C
27

I'd like to add my Python3 answer for completeness:

from io import BytesIO
from zipfile import ZipFile
import requests

def get_zip(file_url):
    url = requests.get(file_url)
    zipfile = ZipFile(BytesIO(url.content))
    files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
    return files.pop() if len(files) == 1 else files

        
Calm answered 18/1, 2016 at 19:58 Comment(1)
Could you please expand a bit on what exactly this function returns and how to go forward with the returned data? Thanks.Eisk
O
17

write to a temporary file which resides in RAM

it turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:

tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

New in version 2.6.

or if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming

Oosperm answered 19/4, 2011 at 2:16 Comment(1)
+1 -- didn't know about SpooledTemporaryFile. My inclination would still be to use StringIO explicitly, but this is good to know.Damselfish
H
15

Adding on to the other answers using requests:

 # download from web

 import requests
 url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
 content = requests.get(url)

 # unzip the content
 from io import BytesIO
 from zipfile import ZipFile
 f = ZipFile(BytesIO(content.content))
 print(f.namelist())

 # outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Use help(f) to get more functions details for e.g. extractall() which extracts the contents in zip file which later can be used with with open.

Hackathorn answered 7/3, 2018 at 11:0 Comment(1)
To read your CSV, do: with f.open(f.namelist()[0], 'r') as g: df = pd.read_csv(g)Riband
C
8

All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:

import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")
Charlesettacharleston answered 2/12, 2020 at 10:38 Comment(0)
H
4

Vishal's example, however great, confuses when it comes to the file name, and I do not see the merit of redefing 'zipfile'.

Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:

from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas

url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
    print("File in zip: "+  item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

(Note, I use Python 2.7.13)

This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 version by removing StringIO and adding IO library

Python 3 Version

from io import BytesIO
from zipfile import ZipFile
import pandas
import requests

url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))

for item in zf.namelist():
    print("File in zip: "+  item)

# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de     ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
Hemminger answered 10/10, 2017 at 21:35 Comment(0)
K
1

It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.

from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen

def unzip_string(zipped_string):
    unzipped_string = ''
    zipfile = ZipFile(StringIO(zipped_string))
    for name in zipfile.namelist():
        unzipped_string += zipfile.open(name).read()
    return unzipped_string
Kettledrummer answered 7/6, 2015 at 5:0 Comment(1)
This is a Python 2 answer.Indigotin
I
1

Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:

from urllib.request import urlopen

from io import BytesIO
from zipfile import ZipFile

zip_url = 'http://example.com/my_file.zip'

with urlopen(zip_url) as f:
    with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read())

If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:

from zipfile import ZipFile

zip_filename = 'my_file.zip'

with open(zip_filename, 'rb') as f:
    with ZipFile(f) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read().decode('utf-8'))

Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.

It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.

Indigotin answered 23/6, 2020 at 21:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.