How to perform a time-limited response download with Python requests?

When downloading a large file with Python, I want to put a time limit not only on the connection, but also on the download itself.

I am trying the following Python code:

import requests

r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', timeout=0.5, prefetch=False)
print r.headers['content-length']
print len(r.raw.read())

This does not work (the download is not time limited), as correctly noted in the docs: https://requests.readthedocs.org/en/latest/user/quickstart/#timeouts

It would be great if this were possible:

r.raw.read(timeout = 10)

The question is: how do I put a time limit on the download?

Lengel answered 26/11, 2012 at 21:5 Comment(3)
I'm not advocating this as the best solution, but here's a general way of putting time limits on function calls using signals: https://mcmap.net/q/94810/-how-to-limit-execution-time-of-a-function-call-duplicate. It's a kludge, and I don't recommend using it unless no more elegant solution is available.Elo
Yes, signals are not an option because of https://mcmap.net/q/94810/-how-to-limit-execution-time-of-a-function-call-duplicateLengel
Now you have a timeout parameter in requests :)Foudroyant

And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:

import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout

url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'

urls = [url5, url5, url10, url10, url10, url5, url5]

def fetch(url):
    response = bytearray()
    # Timeout(60, False) silently aborts the block after 60 seconds
    # instead of raising an exception, leaving response empty.
    with Timeout(60, False):
        response = urllib2.urlopen(url).read()
    return url, len(response)

pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
    if (not length):
        print "%s: timeout!" % (url)
    else:
        print "%s: %s" % (url, length)

Produces expected results:

http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
Lengel answered 27/11, 2012 at 20:42 Comment(3)
Have you seen GRequests: Asynchronous Requests?Craps
With this code, what happens when the timeout triggers? :) What guarantees do you have as to the state of a socket?Craps
AFAIK, there is no threading here, yet the operations run in parallel. When the timeout triggers, the non-blocking operation in progress is simply cancelled; nothing is killed. The socket is closed. I hope ;)Lengel

When using Requests' prefetch=False parameter, you get to pull in arbitrary-sized chunks of the response at a time (rather than all at once).

What you'll need to do is tell Requests not to preload the entire response and keep track of how much time you've spent reading so far, while fetching small chunks at a time. You can fetch a chunk with r.raw.read(CHUNK_SIZE). Overall, the code will look something like this:

import requests
import time

CHUNK_SIZE = 2**12  # Bytes
TIME_EXPIRE = time.time() + 5  # Seconds

r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)

data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
    data += buffer
    buffer = r.raw.read(CHUNK_SIZE)

    if TIME_EXPIRE < time.time():
        # Quit after 5 seconds.
        data += buffer
        break

r.raw.release_conn()

print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])

Note that this might sometimes use a bit more than the 5 seconds allotted, since the final r.raw.read(...) could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.
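
If you also want a limit on each individual read, a possible variant (a sketch, assuming a newer requests release where stream=True replaced prefetch=False and the timeout parameter also bounds each socket read while streaming) combines a per-read socket timeout with the wall-clock check above:

import time
import requests

CHUNK_SIZE = 2 ** 12          # bytes per chunk
DEADLINE = time.time() + 5    # overall wall-clock budget, in seconds

# stream=True is the successor of prefetch=False; timeout bounds the
# connection and each individual socket read, not the whole transfer.
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip',
                 stream=True, timeout=1)

data = bytearray()
try:
    for chunk in r.iter_content(CHUNK_SIZE):
        data.extend(chunk)
        if time.time() > DEADLINE:
            break  # wall-clock budget exhausted
except requests.exceptions.RequestException:
    pass  # a single read stalled past the per-read socket timeout
finally:
    r.close()

print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])

With that, the worst-case overshoot is bounded by roughly one socket timeout rather than one arbitrarily slow read.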

Grabowski answered 27/11, 2012 at 3:46 Comment(2)
Unfortunately this does not work, because not only the last, but every r.raw.read(...) could lag an arbitrary amount of time. This can often lead to missing the timeout when downloading from arbitrary URLs.Lengel
Then it sounds like a socket timeout is the only way to go.Grabowski

Run the download in a thread, which you can then abandon if it has not finished on time.

import requests
import threading

URL = 'http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT = 0.5

def download(return_value):
    return_value.append(requests.get(URL))

return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
# join() only waits up to TIMEOUT seconds; it does not stop the thread,
# which keeps downloading in the background.
download_thread.join(TIMEOUT)

if download_thread.is_alive():
    print 'The download was not finished on time...'
else:
    print return_value[0].headers['content-length']
Craps answered 26/11, 2012 at 21:14 Comment(7)
This is not a safe road to take. Threading in Python is problematic, and I can't just kill the thread on timeout, so this is not a clean solution.Lengel
You can replace the thread with a process if you like. Why can't you kill the thread?Craps
"It is generally a bad pattern to kill a thread abruptly, in python and in any language." https://mcmap.net/q/53213/-is-there-any-way-to-kill-a-thread There is no way to tell the thread to stop.Lengel
Using a process is too complicated; it would require inter-process communication.Lengel
With this code, what happens when the timeout triggers? The thread can potentially live forever; nobody stops it. With multiple slow downloads in parallel, this will lead to a thread-count explosion.Lengel
Yes, but you can switch to the multiprocessing module, where Process.terminate() lets you terminate the download process (see the sketch after these comments). However, if you have multiple downloads, it sounds like you would be better off using an async approach with grequests and timeouts at the gevent level.Craps
Threads actually cannot be stopped in Python. They can be marked as stopped with the stop method, but they do in fact continue running in the background.Peking
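
Following up on the multiprocessing suggestion in the comments above, here is a minimal sketch of aborting the download with Process.terminate(); the queue and the decision to send back only the content length are choices made for this illustration, not part of any answer above:

import multiprocessing
import requests

URL = 'http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT = 0.5  # seconds

def download(queue):
    # Report only the size, to keep the example simple.
    queue.put(len(requests.get(URL).content))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=download, args=(queue,))
    proc.start()
    proc.join(TIMEOUT)

    if proc.is_alive():
        proc.terminate()  # unlike a thread, a process can be killed
        proc.join()
        print 'The download was not finished on time...'
    else:
        print queue.get()

Unlike the threading version, the worker really is stopped when the deadline passes, although the connection is cut mid-transfer and any partial data in the child process is lost.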
