What's the best way to download a file using urllib3?

3

27

I would like to download a file over HTTP using urllib3. I have managed to do this with the following code:

 url = 'http://url_to_a_file'
 connection_pool = urllib3.PoolManager()
 resp = connection_pool.request('GET', url)
 f = open(filename, 'wb')
 f.write(resp.data)
 f.close()
 resp.release_conn()

But I was wondering what the proper way of doing this is. For example, will it work well for big files, and if not, what should I do to make this code more robust and scalable?

Note: it is important to me to use the urllib3 library rather than, for example, urllib2, because I want my code to be thread-safe.

Darky answered 24/6, 2013 at 21:30 Comment(0)
A
41

Your code snippet is close. Two things worth noting:

  1. If you're using resp.data, it will consume the entire response and return the connection (you don't need to resp.release_conn() manually). This is fine if you're cool with holding the data in-memory.

  2. You could use resp.read(amt) which will stream the response, but the connection will need to be returned via resp.release_conn().

This would look something like...

import urllib3

chunk_size = 65536  # 64 KiB; url and path are assumed to be defined elsewhere
http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)

with open(path, 'wb') as out:
    while True:
        data = r.read(chunk_size)
        if not data:
            break
        out.write(data)

r.release_conn()

The documentation might be a bit lacking on this scenario. If anyone is interested in making a pull-request to improve the urllib3 documentation, that would be greatly appreciated. :)
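Since the question also asks about making the code more tolerant of errors, here is a minimal sketch of the same streaming loop wrapped in a function. The function name `download`, its parameters, and the choice to treat any status other than 200 as a failure are assumptions for illustration, not part of urllib3. The `try`/`finally` simply ensures `release_conn()` runs even if the read or the file write raises.

```python
import urllib3


def download(url, path, chunk_size=65536, http=None):
    """Stream `url` to `path` in chunks (hypothetical helper, not a urllib3 API)."""
    if http is None:
        http = urllib3.PoolManager()
    r = http.request('GET', url, preload_content=False)
    try:
        # Assumption: only a plain 200 counts as success for this sketch.
        if r.status != 200:
            raise IOError('unexpected HTTP status %d' % r.status)
        with open(path, 'wb') as out:
            while True:
                data = r.read(chunk_size)
                if not data:  # empty bytes means the body is exhausted
                    break
                out.write(data)
    finally:
        # Runs on success and on error, so the connection is always handed back.
        r.release_conn()
    return path
```

Passing in a shared `PoolManager` through the `http` argument lets several calls reuse the same pool instead of creating a new one per download.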

Amphithecium answered 24/6, 2013 at 22:3 Comment(6)
And one more question. Will it work with the POST method if I add r = http.request('POST', url)? - Darky
@Darky Err, that was a mistake in my code. You're right, the method should go first, and your snippet will work. (Updated my answer.) - Amphithecium
I tried the above code today using urllib3 1.15.1. It needs two modifications to be 100% correct. First, you need preload_content=False in http.request('GET', url, ...). Second, if data is None should be if not data, to account for data being an empty string rather than None. Otherwise, it works perfectly. Thank you. I also want to thank @Alecz below for providing more clues. - Riding
Works well! What's a reasonable chunk size? - Mola
Good question. 64 KB is probably a safe choice (2**16 or 65536). - Amphithecium
Is there a reason for while looping when for data in request.read(chunk_size)\n\tout.write(data) seems to achieve the same results? - Rhetorician
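On the for-loop question: iterating over the result of a single read(chunk_size) call walks through the bytes of that one chunk, not through successive chunks, so it is not equivalent to the while loop. urllib3 responses do offer a stream(amt) generator that yields chunks until the body is exhausted, which allows a for loop. A minimal sketch, where the function name `download_with_stream` and its parameters are made up for illustration:

```python
import urllib3


def download_with_stream(url, path, chunk_size=65536, http=None):
    """Write the response body to `path` using stream() (hypothetical helper)."""
    if http is None:
        http = urllib3.PoolManager()
    r = http.request('GET', url, preload_content=False)
    try:
        with open(path, 'wb') as out:
            # stream() yields one chunk of up to chunk_size bytes per iteration.
            for chunk in r.stream(chunk_size):
                out.write(chunk)
    finally:
        r.release_conn()
    return path
```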
P
9

The most correct way to do this is probably to get a file-like object that represents the HTTP response and copy it to a real file using shutil.copyfileobj as below:

import shutil
import urllib3

url = 'http://url_to_a_file'
c = urllib3.PoolManager()

with c.request('GET', url, preload_content=False) as resp, open(filename, 'wb') as out_file:
    shutil.copyfileobj(resp, out_file)

resp.release_conn()     # not 100% sure this is required though
Preiser answered 10/12, 2014 at 16:50 Comment(2)
Doing resp.release_conn() with preload_content=False is required so that the connection can be reused by the pool manager. See Streaming and IO. - Addington
According to the documentation, resp.release_conn() seems not to be required. This is the description of the release_conn parameter: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when preload_content=True). - Acidimetry
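Given the disagreement in these comments, one cautious option is to call release_conn() in a finally block regardless. As far as I can tell it does nothing when the connection has already gone back to the pool, and it covers the case where the copy fails partway through. A rough sketch, with the function name `download_with_copyfileobj` and the 64 KB buffer size chosen only for illustration:

```python
import shutil
import urllib3


def download_with_copyfileobj(url, filename, http=None):
    """Copy the response body to `filename` via shutil (hypothetical helper)."""
    if http is None:
        http = urllib3.PoolManager()
    resp = http.request('GET', url, preload_content=False)
    try:
        with open(filename, 'wb') as out_file:
            # Third argument is the buffer length copyfileobj passes to resp.read().
            shutil.copyfileobj(resp, out_file, 65536)
    finally:
        # Called whether or not the body was read to the end.
        resp.release_conn()
    return filename
```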
A
3

The easiest way with urllib3 is to let shutil handle the chunked copying for you.

import urllib3
import shutil

http = urllib3.PoolManager()
with open(filename, 'wb') as out:
    r = http.request('GET', url, preload_content=False)
    shutil.copyfileobj(r, out)
r.release_conn()  # hand the connection back to the pool
Affecting answered 28/5, 2020 at 22:10 Comment(0)
