Performance difference between urllib2 and asyncore

I have some questions about the performance of this simple Python script:

import sys, urllib2, asyncore, socket, urlparse
from timeit import timeit

class HTTPClient(asyncore.dispatcher):
    def __init__(self, host, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect( (host, 80) )
        self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path
        self.data = ''
    def handle_connect(self):
        pass
    def handle_close(self):
        self.close()
    def handle_read(self):
        self.data += self.recv(8192)
    def writable(self):
        return (len(self.buffer) > 0)
    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

url = 'http://pacnet.karbownicki.com/api/categories/'

components = urlparse.urlparse(url)
host = components.hostname or ''
path = components.path

def fn1():
    try:
        response = urllib2.urlopen(url)
        try:
            return response.read()
        finally:
            response.close()
    except:
        pass

def fn2():
    client = HTTPClient(host, path)
    asyncore.loop()
    return client.data

if sys.argv[1:]:
    print 'fn1:', len(fn1())
    print 'fn2:', len(fn2())

time = timeit('fn1()', 'from __main__ import fn1', number=1)
print 'fn1: %.8f sec/pass' % (time)

time = timeit('fn2()', 'from __main__ import fn2', number=1)
print 'fn2: %.8f sec/pass' % (time)

Here's the output I'm getting on Linux:

$ python2 test_dl.py
fn1: 5.36162281 sec/pass
fn2: 0.27681994 sec/pass

$ python2 test_dl.py count
fn1: 11781
fn2: 11965
fn1: 0.30849886 sec/pass
fn2: 0.30597305 sec/pass

Why is urllib2 so much slower than asyncore in the first run?

And why does the discrepancy seem to disappear on the second run?

EDIT: Found a hackish solution to this problem here: Force python mechanize/urllib2 to only use A requests?

The five-second delay disappears if I monkey-patch the socket module as follows:

_getaddrinfo = socket.getaddrinfo

def getaddrinfo(host, port, family=0, socktype=0, proto=0, flags=0):
    # Always resolve as IPv4, so no AAAA (IPv6) lookup is ever issued.
    return _getaddrinfo(host, port, socket.AF_INET, socktype, proto, flags)

socket.getaddrinfo = getaddrinfo
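
For anyone who wants to verify that the delay comes from the name lookup rather than the HTTP transfer itself, timing getaddrinfo directly (before applying the patch) is a quick check. This is only a minimal sketch; note that the first call may also warm the resolver cache, so run order matters:

import socket
from timeit import timeit

host = 'pacnet.karbownicki.com'

# family=0 (AF_UNSPEC) lets getaddrinfo ask for AAAA as well as A records;
# restricting the family to AF_INET skips the AAAA lookup entirely.
t_unspec = timeit(lambda: socket.getaddrinfo(host, 80, 0, socket.SOCK_STREAM),
                  number=1)
t_inet = timeit(lambda: socket.getaddrinfo(host, 80, socket.AF_INET,
                                           socket.SOCK_STREAM), number=1)

print 'AF_UNSPEC: %.8f sec' % t_unspec
print 'AF_INET:   %.8f sec' % t_inet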
Helman answered 7/10, 2011 at 17:32 Comment(3)
After a little more research, I think I might have a lead on this. It appears that Python's socket module doesn't specify the address family when creating connections. It defaults to 0 (AF_UNSPEC) rather than AF_INET (which is used in my asyncore HTTPClient class above). This can cause a five-second delay in the DNS lookup when the IPv6 (AAAA) request receives no response. The only problem with this explanation, though, is that I have IPv6 disabled on my Linux box, so I'm not sure how this issue could still affect me...Helman
Having it disabled doesn't mean that a client program won't try using it, especially when the OS and libraries actually have the capability.Blaine
But the kernel module isn't loaded at all, so why would Python attempt to use it? I've tried re-building Python with the "--disable-ipv6" flag, but that made no difference. Is there anything else in my DNS setup that I could change to stop this delay happening?Helman

Finally found a good explanation of what causes this problem, and why:

This is a problem with the DNS resolver.

This problem will occur for any DNS request which the DNS resolver does not support. The proper solution is to fix the DNS resolver.

What happens:

  • Program is IPv6 enabled.
  • When it looks up a hostname, getaddrinfo() asks first for an AAAA record
  • the DNS resolver sees the request for the AAAA record, goes "uhmmm I dunno what it is, let's throw it away"
  • the DNS client (getaddrinfo() in libc) waits for a response... it has to time out as there is no response. (THIS IS THE DELAY)
  • No records received yet, thus getaddrinfo() goes for the A record request. This works.
  • Program gets the A records and uses those.

This does NOT only affect IPv6 (AAAA) records, it also affects any other DNS record that the resolver does not support.
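
A rough way to see which record types your resolver actually answers for a given host is to list what getaddrinfo returns for each address family. This is only a minimal sketch; which families come back depends entirely on your resolver and on the host's DNS records:

import socket

host = 'pacnet.karbownicki.com'

# With family=0 (AF_UNSPEC), getaddrinfo may issue both AAAA and A
# queries; each result tuple records the family it was resolved under.
for family, socktype, proto, canonname, sockaddr in \
        socket.getaddrinfo(host, 80, 0, socket.SOCK_STREAM):
    name = {socket.AF_INET: 'AF_INET',
            socket.AF_INET6: 'AF_INET6'}.get(family, family)
    print name, sockaddr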

For me, the solution was to install dnsmasq (but I suppose any other DNS resolver would do).

Helman answered 5/4, 2012 at 18:44 Comment(1)
A bit off-topic, but I found that powerdns-recursor is a faster and less memory-consuming resolver. I use it everywhere I need just a resolver, i.e. where I don't need to serve zones.Marquee

This is probably in your OS: if your OS caches DNS requests, the first request has to be answered by a DNS server, while subsequent requests for the same name are already at hand.

EDIT: as the comments show, it's probably not a DNS problem. I still maintain that it's the OS and not Python. I've tested the code both on Windows and on FreeBSD and didn't see this kind of difference; both functions need about the same time.

Which is exactly how it should be: there shouldn't be a significant difference for a single request. I/O and network latency probably make up about 90% of these timings.
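
One way to check that split is to time the name lookup and the HTTP round trip separately. A minimal sketch using the URL from the question - note that the first lookup also warms the resolver cache, so the HTTP timing below excludes most of the DNS cost:

import socket, urllib2
from timeit import timeit

url = 'http://pacnet.karbownicki.com/api/categories/'
host = 'pacnet.karbownicki.com'

# Measure name resolution on its own, then the HTTP round trip
# (which now reuses the freshly cached DNS answer).
t_dns = timeit(lambda: socket.gethostbyname(host), number=1)
t_http = timeit(lambda: urllib2.urlopen(url).read(), number=1)

print 'DNS lookup:      %.8f sec' % t_dns
print 'HTTP round trip: %.8f sec' % t_http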

Blaine answered 7/10, 2011 at 19:26 Comment(3)
I don't think this can be right. The asyncore function always takes the same amount of time, no matter how many times it's run. It's only the urllib2 function that is slower the first time it is run.Helman
Even when you start with the asyncore function? Because the code you posted always starts with the urllib2 variant.Blaine
Yes - if I start with the asyncore function (or run it twice alone), there is no five-second delay.Helman

Did you try doing the reverse? i.e. first via asyncore and then urllib?

Case 1: We first try with urllib and then with asyncore.

fn1: 1.48460957 sec/pass
fn2: 0.91280798 sec/pass

Observation: asyncore did the same operation in 0.57180159 secs less

Let's reverse it.

Case 2: We now try with asyncore and then urllib.

fn2: 1.27898671 sec/pass
fn1: 0.95816954 sec/pass

Observation: this time urllib took 0.32081717 secs less than asyncore

Two conclusions here:

  1. urllib2 would always take more time than asyncore, because urllib2 leaves the socket family type unspecified while asyncore lets the user define it - and in this case we have defined it as AF_INET (the IPv4 protocol).

  2. If two sockets are made to the same server, irrespective of asyncore or urllib, the second socket will perform better. This is because of default DNS cache behavior (see the sketch after this list). To understand more, check this out: https://mcmap.net/q/909224/-how-to-flush-cache-for-socket-gethostbyname-response
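
A quick way to observe that cache effect - a minimal sketch, assuming a caching resolver (such as dnsmasq or nscd) sits between you and the DNS server; Python itself does not cache these lookups:

import socket
from timeit import timeit

host = 'pacnet.karbownicki.com'

# The first lookup has to go out to a DNS server; repeats are normally
# answered from the local resolver's cache and return much faster.
for i in range(3):
    t = timeit(lambda: socket.gethostbyname(host), number=1)
    print 'lookup %d: %.8f sec' % (i + 1, t)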

References:

Want a general overview of how sockets work?

http://www.cs.odu.edu/~mweigle/courses/cs455-f06/lectures/2-1-ClientServer.pdf

Want to write your own sockets in Python?

http://www.ibm.com/developerworks/linux/tutorials/l-pysocks/index.html

To know about socket families or general terminology check this wiki:

http://en.wikipedia.org/wiki/Berkeley_sockets

Note: This answer was last updated on April 05, 2012, 2AM IST

Skimp answered 4/4, 2012 at 2:27 Comment(5)
You're completely wrong - follow the link in the edit to my original question for an explanation of the issues. Executive summary: the socket module uses AF_UNSPEC as the default address family when creating connections, which can result in a five second delay in DNS lookup for IPv6 requests.Helman
@ekhumoro: I have updated my answer. Hope this answer sounds more meaningful. 10x -:)Skimp
Both of your conclusions are still wrong: for me, urllib only takes more time on the first run (due to the IPv6 issue). For all subsequent runs, irrespective of order, the execution time is virtually identical. The real underlying cause (as hinted at by @knitti) is probably some obscure problem with my Linux DNS setup.Helman
@ekhumoro: that's what I am saying - urllib took more time because of the IPv6 issue.Skimp
No, you said urllib would always take more time than asyncore, which is not true (see the output in my original question, and my comments to @knitti's answer). As I said in my previous comment: this is not really a python problem - it's an OS-specific DNS setup problem.Helman
