How to handle urllib's timeout in Python 3?

First off, my problem is quite similar to this one. I would like a timeout of urllib.request.urlopen() to raise an exception that I can handle.

Doesn't this fall under URLError?

import logging
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error(
        'Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

The error message:

resp = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1156, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1141, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.2/http/client.py", line 1046, in getresponse
    response.begin()
  File "/usr/lib/python3.2/http/client.py", line 346, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.2/http/client.py", line 308, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.2/socket.py", line 276, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

There was a major change in Python 3 when the urllib and urllib2 modules were reorganised into urllib. Is it possible that a change made then causes this?

Monagan answered 6/1, 2012 at 19:36
An easy way to discover exception types is to except Exception as e: print(type(e)). Assuming you can reproduce your exceptions, that is. – Barnet
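
For instance, a minimal sketch of that probing approach (the target URL and the tiny timeout below are made up, chosen only to force a failure):

import urllib.request

try:
    # An aggressively small timeout makes a timeout failure likely.
    urllib.request.urlopen('http://example.com/', timeout=0.001)
except Exception as e:
    # Print the concrete exception class so you know what to catch.
    print(type(e), e)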

Catch the different exceptions with explicit clauses, and check the reason attribute of a URLError (thank you Régis B. and Daniel Andrzejewski):

import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('HTTP Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
except URLError as error:
    if isinstance(error.reason, timeout):
        logging.error('Timeout Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.error('URL Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

NB: for recent commenters, the original post referenced Python 3.2, where you needed to catch timeout errors explicitly with socket.timeout. For example:

    # Warning - Python 3.2 code
    import logging
    import urllib.request
    from socket import timeout

    try:
        response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
    except timeout:
        logging.error('socket timed out - URL %s', url)

Prot answered 6/1, 2012 at 19:45
This is absolutely incorrect! In Python 3.9, only the first exception is caught. Perhaps some change was introduced between 3 and 3.9? – Staggs
In Python 3: from urllib.error import HTTPError, URLError – Drag

The previous answer does not correctly intercept timeout errors. Timeout errors are wrapped in a URLError, so if we want to catch them specifically, we need to write:

import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened')
else:
    logging.info('Access successful.')

Note that a ValueError can be raised independently, e.g. if the URL is invalid. Like HTTPError, it is not associated with a timeout.
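
For illustration (the URL string below is a made-up bad input), a malformed URL raises ValueError before any network activity happens:

import urllib.request

try:
    urllib.request.urlopen('not-a-valid-url')  # hypothetical bad input
except ValueError as error:
    print('Invalid URL:', error)  # e.g. "unknown url type: 'not-a-valid-url'"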

Poteat answered 17/9, 2018 at 20:45
I once had a socket.timeout exception despite this code. It was not caught by this code. It happened just once among many attempts. The code is correct for the most part though, in that catching URLError catches most timeout errors. This is with Python 3.7.2. In summary, to be safer, I'm also catching socket.timeout. – Issie
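
A minimal sketch of that belt-and-braces approach (assuming url is defined as in the answer above):

import logging
import socket
import urllib.request
from urllib.error import URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (URLError, socket.timeout) as error:
    # Catches both the bare socket.timeout and the URLError-wrapped form.
    logging.error('request failed or timed out: %s - URL %s', error, url)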

What is a "timeout"? Holistically, I think it means "a situation where the server didn't respond in time, typically because of high load, and it's worth retrying."

HTTP status 504 "gateway timeout" would be a timeout under this definition. It's delivered via HTTPError.

HTTP status 429 "too many requests" would also be a timeout under that definition. It too is delivered via HTTPError.

Otherwise, what do we mean by a timeout? Do we include timeouts while resolving the domain name via DNS? Timeouts when trying to send data? Timeouts while waiting for the data to come back?

I don't know how to audit the source code of urllib to be sure that everything I might consider a timeout is raised in a way I'd catch; in a language without checked exceptions, I don't see how to. I have a hunch that connect-to-DNS errors might come back as socket.timeout, and connect-to-remote-server errors as URLError(socket.timeout). It's just a guess that might explain earlier observations.
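
One way to gather evidence is a quick probe against failure modes you can trigger yourself (the targets below are made up for the experiment; results vary by platform and network):

import urllib.request

# An unroutable address usually hangs at connect until the timeout fires;
# the .invalid TLD is guaranteed not to resolve, forcing a DNS failure.
for target in ('http://10.255.255.1/', 'http://no-such-host.invalid/'):
    try:
        urllib.request.urlopen(target, timeout=1)
    except Exception as e:
        # Print the concrete class and any wrapped reason attribute.
        print(target, type(e).__name__, getattr(e, 'reason', None))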

So I fell back to some really defensive coding. (1) I'm handling some HTTP status codes that are indicative of timeouts. (2) There are reports that some timeouts come via socket.timeout exceptions, and some via URLError(socket.timeout) exceptions, so I'm catching both. (3) And just in case, I threw in HTTPError(socket.timeout) as well.

import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

# Wrapped in a function here so the `return` is valid; `cache` is a
# path for a local copy of the downloaded content.
def fetch(url: str, cache: str) -> bytes:
    while True:
        reason: Optional[str] = None
        try:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                with open(cache, "wb") as file:
                    file.write(content)
                return content
        except urllib.error.HTTPError as e:
            if e.code == 429 or e.code == 504:  # 429=too many requests, 504=gateway timeout
                reason = f'{e.code} {str(e.reason)}'
            elif isinstance(e.reason, socket.timeout):
                reason = f'HTTPError socket.timeout {e.reason} - {e}'
            else:
                raise
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout):
                reason = f'URLError socket.timeout {e.reason} - {e}'
            else:
                raise
        except socket.timeout as e:
            reason = f'socket.timeout {e}'
        except:
            raise
        netloc = urllib.parse.urlsplit(url).netloc  # e.g. nominatim.openstreetmap.org
        print(f'*** {netloc} {reason}; will retry', file=sys.stderr)
        time.sleep(5)

So answered 27/1, 2021 at 19:59
This answer has six question marks, two "I don't know"'s and one "I think". There is little confidence one should copy and paste this code into their program. A lot of hacking to break into one's working system is probably required to test the copy and pasted code. To top it all off, there is an infinite loop with a 5-second sleep. – Amorphous
There are six question marks because I'm asking questions that the other answers didn't consider! I agree not to copy+paste. Instead, you should write whatever code you want, then verify whether it adequately handles the questions I raised, and if not then adjust. If you finally end up with code different from mine then you should ask yourself why. (PS. it's easy to audit that the infinite loop comes only in a finite set of cases 504, 429, socket.timeout, since that's what I wanted, and if you don't want that then it's clear what to change!) – So
