What should I do if socket.setdefaulttimeout() is not working?

I'm writing a multi-threaded script to retrieve content from a website, and the site is not very stable, so every now and then an HTTP request hangs and cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over that website, the only thing I can do is improve my code, but I'm running out of ideas right now.

Sample code:

import socket
import mechanize
import urllib2

socket.setdefaulttimeout(150)

MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
Url = "http://example.com"
Data = "Justatest=whatever&letstry=doit"
Request = urllib2.Request(Url, Data, Header)
Response = MechBrowser.open(Request)
Response.close()

What should I do to force the hanging requests to quit? Actually, I want to know why socket.setdefaulttimeout(150) is not working in the first place. Can anybody help me out?

Added: (and yes, the problem is still not solved)

OK, I've followed tomasz's suggestion and changed the code to MechBrowser.open(Request, timeout = 60), but the same thing happens. I still get hanging requests at random; sometimes it's after several hours, other times after several days. What do I do now? Is there a way to force these hanging requests to quit?

Concinnity answered 11/12, 2011 at 13:42 Comment(3)
Looks like the .open() method of the Browser has a timeout argument. Have you tried using that instead?Striation
@yak: Well, I don't really want to go through all the code and add a timeout argument to every call, so socket.setdefaulttimeout(150) looks like the better solution to me. And the funny thing is, to test whether it works I tried socket.setdefaulttimeout(0.5) just to see if it would override the timeout settings for everything, and it did work as expected. But with socket.setdefaulttimeout(150) I have no idea why some requests still hang after hours of running.Concinnity
With a small timeout it "works" as expected, but with a larger timeout the total duration explodes. It looks like you (or the libraries you're using) are making a lot of connections to the remote server, and if each of them takes 150 seconds before timing out, you can wait a long, long time for the final result.Easterner

While socket.setdefaulttimeout will set the default timeout for newly created sockets, if you're not using the sockets directly the setting can easily be overwritten. In particular, if the library calls socket.setblocking on its socket, it will reset the timeout.
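
A minimal sketch of that behaviour with plain stdlib sockets (nothing mechanize-specific here):

import socket

socket.setdefaulttimeout(150)                         # default for newly created sockets
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print s.gettimeout()                                  # 150.0 - the default was applied
s.setblocking(1)                                      # what a library might do internally...
print s.gettimeout()                                  # None - the timeout is gone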

urllib2.urlopen has a timeout argument; however, there is no timeout in urllib2.Request. As you're using mechanize, you should refer to its documentation:

Since Python 2.6, urllib2 uses a .timeout attribute on Request objects internally. However, urllib2.Request has no timeout constructor argument, and urllib2.urlopen() ignores this parameter. mechanize.Request has a timeout constructor argument which is used to set the attribute of the same name, and mechanize.urlopen() does not ignore the timeout attribute.

source: http://wwwsearch.sourceforge.net/mechanize/documentation.html
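
In practice that means the timeout has to travel with the mechanize call itself instead of the socket default - for example (a short sketch reusing the URL from the question):

import mechanize

MechBrowser = mechanize.Browser()
# Browser.open accepts a per-request timeout (in seconds)
Response = MechBrowser.open("http://example.com", timeout=60)
Response.close()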

---EDIT---

If either socket.setdefaulttimeout or passing timeout to mechanize works with small values but not with higher ones, the source of the problem might be completely different. One thing is that your library may open multiple connections (credit to @Cédric Julien here), so the timeout applies to every single connection attempt, and if the code doesn't give up at the first failure the total can reach timeout * num_of_conn seconds. The other thing is socket.recv: if the connection is really slow and you're unlucky enough, the whole request can take up to timeout * incoming_bytes seconds, because every socket.recv may return a single byte and each such call may take up to timeout seconds. You're unlikely to hit exactly this dark scenario (one byte per timeout seconds? the server would have to be hostile), but it is very likely that a request takes ages on a very slow connection when the timeout is very high - with timeout=150, a hundred slow chunks already adds up to more than four hours without any single socket operation exceeding the limit.

The only solution you have is to force a timeout for the whole request, and there's nothing you can do about it at the socket level. If you're on Unix, you can use a simple solution based on the ALARM signal: set the signal to be raised in timeout seconds, and your request will be interrupted (don't forget to catch the exception). You might like to wrap it in a with statement to make it clean and easy to use, for example:

import signal, time

def request(arg):
  """Your http request"""
  time.sleep(2)
  return arg

class Timeout():
  """Timeout class using ALARM signal"""
  class Timeout(Exception): pass

  def __init__(self, sec):
    self.sec = sec

  def __enter__(self):
    signal.signal(signal.SIGALRM, self.raise_timeout)
    signal.alarm(self.sec)

  def __exit__(self, *args):
    signal.alarm(0) # disable alarm

  def raise_timeout(self, *args):
    raise Timeout.Timeout()

# Run block of code with timeouts
try:
  with Timeout(3):
    print request("Request 1")
  with Timeout(1):
    print request("Request 2")
except Timeout.Timeout:
  print "Timeout"

# Prints "Request 1" and "Timeout"
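
Applied to the code from the question this might look as follows (a sketch that reuses the Timeout class above and the MechBrowser/Request objects the question defines; 150 is just the value the question already uses):

try:
  with Timeout(150):                      # hard limit for the whole request
    Response = MechBrowser.open(Request)
    content = Response.read()
    Response.close()
except Timeout.Timeout:
  print "Request did not finish within 150 seconds"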

If you want to be more portable than this, you have to use bigger guns, for example multiprocessing: you spawn a process to perform the request and terminate it if it's overdue. As this is a separate process, you need something to transfer the result back to your application; multiprocessing.Pipe will do. Here is an example:

from multiprocessing import Process, Pipe
import time

def request(sleep, result):
  """Your http request example"""
  time.sleep(sleep)
  return result

class TimeoutWrapper():
  """Timeout wrapper using separate process"""
  def __init__(self, func, timeout):
    self.func = func
    self.timeout = timeout

  def __call__(self, *args, **kargs):
    """Run func with timeout"""
    def pmain(pipe, func, args, kargs):
      """Function to be called in separate process"""
      result = func(*args, **kargs) # call func with passed arguments
      pipe.send(result) # send result to pipe

    parent_pipe, child_pipe = Pipe() # Pipe for retrieving result of func
    p = Process(target=pmain, args=(child_pipe, self.func, args, kargs))
    p.start()
    p.join(self.timeout) # wait for process to end

    if p.is_alive():
      p.terminate() # Timeout, kill
      return None # or raise exception if None is acceptable result
    else:          
      return parent_pipe.recv() # OK, get result

print TimeoutWrapper(request, 3)(1, "OK") # prints OK
print TimeoutWrapper(request, 1)(2, "Timeout") # prints None
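
For the real use case you would wrap the http call itself. A sketch, where fetch_url is a hypothetical helper (whatever function you pass just has to be callable from the child process):

import urllib2

def fetch_url(url):
  """Hypothetical helper: download a URL and return its body."""
  return urllib2.urlopen(url).read()

body = TimeoutWrapper(fetch_url, 150)("http://example.com")
if body is None:
  print "Request was killed after 150 seconds"
else:
  print "Got %d bytes" % len(body)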

You really don't have much choice if you want to force the request to terminate after a fixed number of seconds. socket.timeout provides a timeout for a single socket operation (connect/recv/send), but if the request needs many of them you can still suffer from a very long total execution time.

Superstructure answered 11/12, 2011 at 15:39 Comment(7)
Funny thing is, if you try socket.setdefaulttimeout(0.1); Request = urllib2.Request("http://google.com", None); Response = MechBrowser.open(Request), it raises urlopen error timed out, which means socket.setdefaulttimeout does have control over mechanize. If, like you said, "the library calls socket.setblocking on its socket, it'll reset the timeout", what can I do to avoid this, or is there a better way to set a global timeout for all connections?Concinnity
@Shane, there is a chance the default timeout affects connect but not recv (if the library changes the socket's parameters after a successful connection), and recv is the point where your application hangs. As the library can call socket.settimeout or socket.setblocking itself, there is nothing you can do about the default timeout. The general solutions are: use the SIGALRM signal to interrupt the request (on Unix), or run a subprocess and kill it when it's overdue. But it looks like mechanize supports timeouts.Superstructure
Thanks mate, I'll give mechanize timeouts a try and see if problem can be solved.Concinnity
Hey mate, I've followed your suggestion and still get the same results. What do I do now? Any good idea?Concinnity
Thanks for the detailed answer. After searching around for quite a while, I decided to go with the threading module and use threading.Thread to wrap these requests, then set a timeout limit in join(timeout). It works well right now.Concinnity
@Shane, the problem with the threading module is that you cannot terminate the thread. Using join(timeout) may seem to work, but - unless you exit the script immediately after the timeout - your request is still running in the background. Hence my example with multiprocessing, where you can directly kill the process. Keep it in mind!Superstructure
Thanks a lot for the information! I will try to use multiprocessing instead of threading. Cheers!Concinnity

From their documentation:

Since Python 2.6, urllib2 uses a .timeout attribute on Request objects internally. However, urllib2.Request has no timeout constructor argument, and urllib2.urlopen() ignores this parameter. mechanize.Request has a timeout constructor argument which is used to set the attribute of the same name, and mechanize.urlopen() does not ignore the timeout attribute.

Perhaps you should try replacing urllib2.Request with mechanize.Request.
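
Something along these lines (a sketch that reuses the URL, data and header from the question; the timeout argument is the one described in the documentation quoted above):

import mechanize

MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0'}
Url = "http://example.com"
Data = "Justatest=whatever&letstry=doit"
Request = mechanize.Request(Url, Data, Header, timeout=150)  # urllib2.Request would not accept timeout
Response = MechBrowser.open(Request)
Response.close()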

Fanny answered 23/12, 2011 at 2:14 Comment(0)

You could try using mechanize with eventlet. It does not solve your timeout problem by itself, but greenlets are non-blocking, so it can help with your performance problem.
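
A rough sketch of what that could look like, assuming mechanize runs on top of eventlet's monkey-patched (green) sockets; eventlet.Timeout is used here to bound each whole request:

import eventlet
eventlet.monkey_patch()                    # replace blocking stdlib sockets with green ones

import mechanize

def fetch(url):
  try:
    with eventlet.Timeout(150):            # limit for the whole request, not a single socket op
      return mechanize.Browser().open(url).read()
  except eventlet.Timeout:
    return None                            # request took too long

pool = eventlet.GreenPool(size=10)
for body in pool.imap(fetch, ["http://example.com"] * 5):
  print "got %d bytes" % (len(body) if body else 0)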

Inhumane answered 28/12, 2011 at 10:45 Comment(0)

I suggest a simple workaround - move the request to a different process, and if it fails to terminate, kill it from the calling process, like this:

from multiprocessing import Process
import os, signal, time

# yourFunction performs the request and reports its result via some_queue
checker = Process(target=yourFunction, args=(some_queue,))
timeout = 150
checker.start()
counter = 0
while checker.is_alive():
    time.sleep(1)
    counter += 1
    if counter > timeout:
        print "Child process consumed too much run-time. Going to kill it!"
        os.kill(checker.pid, signal.SIGKILL)  # or checker.terminate()
        break

simple, fast and effective.

Spinule answered 28/12, 2011 at 10:5 Comment(0)
