Read timeout using either urllib2 or any other http library
I have code for reading an url like this:

from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
res = urlopen(req, timeout = timeout)
# This line blocks
content = res.read()

The timeout works for the urlopen() call. But then the code gets to the res.read() call where I want to read the response data and the timeout isn't applied there. So the read call may hang almost forever waiting for data from the server. The only solution I've found is to use a signal to interrupt the read() which is not suitable for me since I'm using threads.

What other options are there? Is there an HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests and they seem to suffer from the same issue as above. I don't want to write my own nonblocking network code using the socket module because I think there should already be a library for this.

Update: None of the solutions below are doing it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:

from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()

At least on Windows with Python 2.7.3, the timeouts are being completely ignored.

Shepperd answered 3/3, 2012 at 18:51 Comment(2)
related to total connection timeout: HTTPConnection.request not respecting timeout? – Arabella
Does this issue affect Python 3 as well? Have any steps been taken to address it? It seems like an issue with the built-in Python HTTP library itself. – Cowgirl

It's not possible for any library to do this without using some kind of asynchronous timer through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. And what this actually does is explained in the documentation.

SO_RCVTIMEO

Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

The final clause is key. A socket.timeout is only raised if not a single byte has been received for the duration of the timeout window. In other words, this is a timeout between received bytes, not on the whole read.
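To see the difference in practice, here's a minimal Python 3 sketch (plain sockets; read_with_deadline is a name I'm making up) that turns the per-recv() timeout into a true total deadline by shrinking the socket timeout before every recv():

```python
import socket
import time

def read_with_deadline(sock, nbytes, total_timeout):
    """Read up to nbytes, enforcing a *total* deadline across recv() calls."""
    deadline = time.monotonic() + total_timeout
    chunks = []
    remaining = nbytes
    while remaining > 0:
        budget = deadline - time.monotonic()
        if budget <= 0:
            raise socket.timeout("total read deadline exceeded")
        sock.settimeout(budget)  # per-recv timeout shrinks as the deadline nears
        data = sock.recv(min(remaining, 8192))
        if not data:  # peer closed the connection
            break
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

With a plain sock.settimeout(t), a slow server that sends one byte every t-ε seconds never trips the timeout; here the remaining budget is recomputed before each recv(), so the call cannot outlive the deadline.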

A simple function using threading.Timer could be as follows.

import httplib
import socket
import threading

def download(host, path, timeout=10):
    content = None

    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()

    # Shut down the read side of the socket after `timeout` seconds,
    # which forces the blocked read() below to fail with IncompleteRead.
    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()

    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass

    timer.cancel()  # cancelling an already-triggered Timer is safe
    http.close()

    return content

>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False

Other than checking for None, it's also possible to catch the httplib.IncompleteRead exception not inside the function, but outside of it. The latter approach will not work, though, if the HTTP response doesn't carry a Content-Length header, since without one a truncated read() simply returns partial data instead of raising IncompleteRead.

Dubrovnik answered 20/9, 2015 at 21:51 Comment(5)
You don't need lambda here: Timer(timeout, sock.shutdown, [socket.SHUT_RDWR]). You should raise TimeoutError on timeout instead of returning None. – Arabella
@J.F.Sebastian Yep, there are numerous ways to signal a timeout here, such as raising a custom exception. Thanks for the args tip. – Dubrovnik
There are preferable ways to signal the timeout: the download() function may be buried several stack frames down from the place that sets its parameters, and the timeout may be triggered only for certain sites at certain times -- what do you expect intermediate functions to do if content is None? If even one place forgets to handle the error return value, it may have undesirable side effects. Exceptions are the mechanism that delivers the error from the place where it is detected to the place that knows what to do with it. And the default behavior (errors are not ignored) is more robust. – Arabella
btw, as far as I can tell, your answer is the only one that does limit the total read timeout (you should probably pass the timeout parameter to HTTPConnection to try to limit the connection timeout too). – Arabella
The absence of class TimeoutError(EnvironmentError): pass is not a reason to promote bad practice. – Arabella
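Following the commenters' suggestions, a variant that raises an exception instead of returning None, and that also passes the timeout to the connection itself, might look roughly like this (Python 3 names via http.client; ReadTimeoutError is a made-up exception class):

```python
import http.client
import socket
import threading

class ReadTimeoutError(Exception):
    """Raised when the total read deadline expires (hypothetical name)."""

def download(host, path, timeout=10):
    # the timeout= argument here bounds the *connection* attempt
    conn = http.client.HTTPConnection(host, timeout=timeout)
    conn.request('GET', path)
    response = conn.getresponse()

    timed_out = threading.Event()

    def abort():
        timed_out.set()
        conn.sock.shutdown(socket.SHUT_RD)  # force the blocked read() to fail

    timer = threading.Timer(timeout, abort)
    timer.start()
    try:
        content = response.read()
    except http.client.IncompleteRead:
        raise ReadTimeoutError("read of %s did not finish in %ss" % (path, timeout))
    finally:
        timer.cancel()
        conn.close()
    if timed_out.is_set():
        # without a Content-Length header a truncated body returns normally,
        # so the flag is checked explicitly
        raise ReadTimeoutError("read of %s did not finish in %ss" % (path, timeout))
    return content
```

The callers then get a normal exception they can catch at whatever stack frame knows how to handle it, rather than a None they might forget to check.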

I found in my tests (using the technique described here) that a timeout set in the urlopen() call also affects the read() call:

import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out

Maybe it's a feature of newer versions? I'm using Python 2.7 on Ubuntu 12.04, straight out of the box.

Footwear answered 10/5, 2012 at 13:41 Comment(2)
it may trigger the timeout for individual .recv() calls (which may return partial data), but it does not limit the total read timeout (until EOF). – Arabella
Yes, that clarification has its value. – Footwear
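The comment's point is easy to demonstrate: against a server that dribbles out one byte at a time, each arriving well within the timeout, read() runs far longer than the timeout without ever raising. A Python 3 sketch with a throwaway local server (start_dribble_server is a made-up helper):

```python
import socket
import threading
import time
from urllib.request import urlopen

def start_dribble_server(nbytes, interval):
    """Serve one HTTP response, sending one body byte every `interval` seconds."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    def serve():
        conn, _ = srv.accept()
        conn.recv(1024)  # swallow the request
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % nbytes)
        for _ in range(nbytes):
            conn.sendall(b"x")
            time.sleep(interval)
        conn.close()
    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]

port = start_dribble_server(nbytes=6, interval=0.5)
start = time.monotonic()
body = urlopen("http://127.0.0.1:%d/" % port, timeout=1.0).read()
elapsed = time.monotonic() - start
# read() completes after roughly 3 seconds without raising socket.timeout,
# because each byte arrives well inside the 1 second per-recv window
```

So the per-recv timeout fires only on total silence; it puts no bound on how long the whole read() takes.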

I'd expect this to be a common problem, and yet no answers are to be found anywhere... I just built a solution for this using a timeout signal:

import signal
import socket
import urllib2

timeout = 10
socket.setdefaulttimeout(timeout)

def timeout_catcher(signum, _):
    raise urllib2.URLError("Read timeout")

signal.signal(signal.SIGALRM, timeout_catcher)

def safe_read(url, timeout_time):
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    content = urllib2.urlopen(url, timeout=timeout_time).read()
    signal.setitimer(signal.ITIMER_REAL, 0)
    # you should also catch any exceptions escaping urlopen here,
    # reset the timer to 0, and re-raise them
    return content

By the way, the credit for the signal part of the solution goes here: python timer mystery

Deirdra answered 7/8, 2013 at 18:21 Comment(3)
But does it time out the read() call or the urlopen() one? I'd like to test this solution, but it's pretty hard to set up a situation in which the server times out during the client's recv call on the socket. – Shepperd
Bjorn, as for read vs urlopen: it times out both the read and the urlopen. I tested it with this url: "uberdns.eu", which, at least yesterday, caused my crawler to hang on read. This is the solution that I tested and that worked where both the socket default timeout and the urlopen timeout failed. – Deirdra
As for threads: no idea, you'd have to check the setitimer documentation. – Deirdra

One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:

import socket
import urllib2

# timeout in seconds
socket.setdefaulttimeout(10)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.
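If the global default is the only knob available, one way to contain the blast radius is to set it temporarily and restore the previous value afterwards; a small sketch (default_socket_timeout is a made-up helper, and this is still process-wide, so it is not safe when other threads open sockets concurrently):

```python
import socket
from contextlib import contextmanager

@contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the module-global default timeout, then restore it."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        socket.setdefaulttimeout(previous)
```

Usage would be `with default_socket_timeout(10): response = urllib2.urlopen(req)`, so code outside the with-block (such as the Celery worker machinery) keeps its original default.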

I'd be happy to hear any other solutions...

Schaffel answered 9/3, 2012 at 21:31 Comment(2)
At least on Windows with Python 2.7 it has no effect on the read() call. – Shepperd
setdefaulttimeout() does not limit the total read timeout; e.g., the server may send a byte every 5 seconds and the timeout never triggers. – Arabella

Any asynchronous network library should allow you to enforce a total timeout on any I/O operation; e.g., here's a gevent code example:

#!/usr/bin/env python2
import gevent
import gevent.monkey # $ pip install gevent
gevent.monkey.patch_all()

import urllib2

with gevent.Timeout(2): # enforce total timeout
    response = urllib2.urlopen('http://localhost:8000')
    encoding = response.headers.getparam('charset')
    print response.read().decode(encoding)

And here's the asyncio equivalent:

#!/usr/bin/env python3.5
import asyncio
import aiohttp # $ pip install aiohttp

async def fetch_text(url):
    response = await aiohttp.get(url)
    return await response.text()

text = asyncio.get_event_loop().run_until_complete(
    asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
print(text)

The test http server is defined here.
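The same total-timeout idea also works without aiohttp, using asyncio.wait_for around plain asyncio streams; a minimal sketch (fetch_raw is a made-up helper that speaks just enough raw HTTP for a demo):

```python
import asyncio

async def fetch_raw(host, port, request, total_timeout):
    """Send raw bytes and read until EOF, under one overall deadline."""
    async def _fetch():
        reader, writer = await asyncio.open_connection(host, port)
        writer.write(request)
        await writer.drain()
        data = await reader.read()  # read to EOF
        writer.close()
        await writer.wait_closed()
        return data
    # wait_for cancels _fetch (and with it the socket) once the deadline passes
    return await asyncio.wait_for(_fetch(), timeout=total_timeout)
```

If the server dribbles the body out slowly, the cancellation fires after total_timeout seconds regardless of how recently the last byte arrived, which is exactly what the per-socket timeout cannot do.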

Arabella answered 21/9, 2015 at 20:17 Comment(1)
This works great (the gevent snippet at least). I have a simple program to grab an image and store it with a timestamp, and this did the job of letting the program end if the url is unavailable. Thanks! – Fleece

pycurl.TIMEOUT option works for the whole request:

#!/usr/bin/env python3
"""Test that pycurl.TIMEOUT does limit the total request timeout."""
import sys
import pycurl

timeout = 2 #NOTE: it does limit both the total *connection* and *read* timeouts
c = pycurl.Curl()
c.setopt(pycurl.CONNECTTIMEOUT, timeout)
c.setopt(pycurl.TIMEOUT, timeout)
c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, 'http://localhost:8000')
c.setopt(pycurl.HTTPGET, 1)
c.perform()

The code raises the timeout error in ~2 seconds. I've tested the total read timeout with a server that sends the response in multiple chunks, with less time than the timeout between chunks:

$ python -mslow_http_server 1

where slow_http_server.py:

#!/usr/bin/env python
"""Usage: python -mslow_http_server [<read_timeout>]

   Return an http response with *read_timeout* seconds between parts.
"""
import time
try:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
except ImportError: # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer, test

def SlowRequestHandlerFactory(read_timeout):
    class HTTPRequestHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            n = 5
            data = b'1\n'
            self.send_response(200)
            self.send_header("Content-type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", n*len(data))
            self.end_headers()
            for i in range(n):
                self.wfile.write(data)
                self.wfile.flush()
                time.sleep(read_timeout)
    return HTTPRequestHandler

if __name__ == "__main__":
    import sys
    read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
    test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
         ServerClass=HTTPServer)

I've tested the total connection timeout with http://google.com:22222.

Arabella answered 21/9, 2015 at 0:35 Comment(0)

This isn't the behavior I see. I get a URLError when the call times out:

from urllib2 import Request, urlopen
req = Request('http://www.google.com')
res = urlopen(req,timeout=0.000001)
#  Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  ...
#  raise URLError(err)
#  urllib2.URLError: <urlopen error timed out>

Can't you catch this error and then avoid trying to read res? When I try to use res.read() after this I get NameError: name 'res' is not defined. Is something like this what you need?

from urllib2 import URLError

try:
    res = urlopen(req, timeout=3.0)
except URLError:
    print 'Doh!'
else:
    print 'yay!'
    print res.read()

I suppose the way to implement a timeout manually is via multiprocessing, no? If the job hasn't finished you can terminate it.
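That multiprocessing idea can be sketched roughly like this (Python 3; fetch_with_hard_timeout and _worker are made-up names, and the fork start method is Unix-only). terminate() is a blunt instrument, but it does bound the total time no matter where the download is blocked:

```python
import multiprocessing

def _worker(url, queue):
    # runs in a child process, so a hung download can be killed outright
    from urllib.request import urlopen
    queue.put(urlopen(url).read())

def fetch_with_hard_timeout(url, timeout):
    ctx = multiprocessing.get_context("fork")  # Unix-only start method
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(url, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # hard-kill, no matter where the child is blocked
        proc.join()
        raise TimeoutError("no response within %s seconds" % timeout)
    # note: fine for small bodies; large ones should be drained from the
    # queue before join() so the child's queue feeder thread isn't blocked
    return queue.get()
```

The cost is a process fork per request, which is heavy for a crawler, but unlike signals it works from any thread.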

Offcolor answered 3/3, 2012 at 19:1 Comment(1)
I think you misunderstand. The urlopen() call connects to the server successfully, but then the program hangs at the read() call because the server returns the data too slowly. That is where the timeout is needed. – Shepperd

I had the same issue with the socket timing out on the read statement. What worked for me was putting both the urlopen and the read inside a try statement. Hope this helps!

Spastic answered 12/10, 2013 at 2:57 Comment(0)
