Force python mechanize/urllib2 to only use A requests?
There is a related question, "how to force python httplib library to use only A requests", but I could not figure out how to apply its answer to mechanize/urllib2.

Basically, given this simple code:

#!/usr/bin/python
import urllib2
print urllib2.urlopen('http://python.org/').read(100)

This results in wireshark saying the following:

  0.000000  10.102.0.79 -> 8.8.8.8      DNS Standard query A python.org
  0.000023  10.102.0.79 -> 8.8.8.8      DNS Standard query AAAA python.org
  0.005369      8.8.8.8 -> 10.102.0.79  DNS Standard query response A 82.94.164.162
  5.004494  10.102.0.79 -> 8.8.8.8      DNS Standard query A python.org
  5.010540      8.8.8.8 -> 10.102.0.79  DNS Standard query response A 82.94.164.162
  5.010599  10.102.0.79 -> 8.8.8.8      DNS Standard query AAAA python.org
  5.015832      8.8.8.8 -> 10.102.0.79  DNS Standard query response AAAA 2001:888:2000:d::a2

That's a 5 second delay!

I don't have IPv6 enabled anywhere on my system (Gentoo compiled with USE=-ipv6), so I don't think Python has any reason to even try an IPv6 lookup.

The above referenced question suggested explicitly setting the socket type to AF_INET which sounds great. I have no idea how to force urllib or mechanize to use any sockets that I create though.

EDIT: I know that the AAAA queries are the issue because other apps had the delay as well and as soon as I recompiled with ipv6 disabled, the problem went away... except for in python which still performs the AAAA requests.

Saguache answered 6/1, 2010 at 16:43 Comment(1)
Same here, on different machines connected to different providers. I've resorted to libwww-perl and its GET command, which works instantly on all machines.Gastropod
E
17

I suffered from the same problem. Here is an ugly hack (use at your own risk) based on the information given by J.J.

It forces the family parameter of socket.getaddrinfo(..) to socket.AF_INET instead of socket.AF_UNSPEC (zero, which appears to be what socket.create_connection passes). It applies not only to calls from urllib2 but to every call to socket.getaddrinfo(..) in the process:

#--------------------
# do this once at program startup
#--------------------
import socket
origGetAddrInfo = socket.getaddrinfo

def getAddrInfoWrapper(host, port, family=0, socktype=0, proto=0, flags=0):
    return origGetAddrInfo(host, port, socket.AF_INET, socktype, proto, flags)

# replace the original socket.getaddrinfo by our version
socket.getaddrinfo = getAddrInfoWrapper

#--------------------
import urllib2

print urllib2.urlopen("http://python.org/").read(100)

This works for me at least in this simple case.
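
For Python 3, the same idea can be sketched as a context manager that restores the original function on exit, so the patch stays limited to the code that needs it. This is only a sketch and not part of the hack above: the name force_ipv4_lookups is hypothetical, and the keyword name type follows the Python 3 signature of socket.getaddrinfo.

```python
import contextlib
import socket


@contextlib.contextmanager
def force_ipv4_lookups():
    """Temporarily make every socket.getaddrinfo call ask for AF_INET only.

    Hypothetical helper name; the patch is process-wide while active.
    """
    original = socket.getaddrinfo

    def ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore whatever family the caller asked for and request IPv4.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = ipv4_only
    try:
        yield
    finally:
        # Always put the real function back, even if the body raised.
        socket.getaddrinfo = original


if __name__ == "__main__":
    # A numeric address needs no DNS traffic, so this demo runs offline.
    with force_ipv4_lookups():
        for family, _, _, _, sockaddr in socket.getaddrinfo(
                "127.0.0.1", 80, type=socket.SOCK_STREAM):
            print(family, sockaddr)
```

Anything that resolves names through socket.getaddrinfo inside the with block, including urllib.request.urlopen, should go through the wrapper. Because the patch is global, other threads resolving names at the same time are affected as well.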

Excited answered 11/6, 2011 at 23:3 Comment(1)
Just tested, still works flawlessly in python 3.5.2.Husk
C
4

No answer, but a few datapoints. The DNS resolution appears to be originating from httplib.py in HTTPConnection.connect() (line 670 on my python 2.5.4 stdlib)

The code flow is roughly:

for res in socket.getaddrinfo(self.host, self.port, 0, socket.SOCK_STREAM):
    af, socktype, proto, canonname, sa = res
    self.sock = socket.socket(af, socktype, proto)
    try:
        self.sock.connect(sa)
    except socket.error, msg: 
        continue
    break

A few comments on what's going on:

  • the third argument to socket.getaddrinfo() limits the socket families -- i.e., IPv4 vs. IPv6. Passing zero returns all families. Zero is hardcoded into the stdlib.

  • passing a hostname into getaddrinfo() will cause name resolution -- on my OS X box with IPv6 enabled, both A and AAAA records go out, both answers come right back and both are returned.

  • the rest of the connect loop tries each returned address until one succeeds

For example:

>>> socket.getaddrinfo("python.org", 80, 0, socket.SOCK_STREAM)
[
 (30, 1, 6, '', ('2001:888:2000:d::a2', 80, 0, 0)), 
 ( 2, 1, 6, '', ('82.94.164.162', 80))
]
>>> help(socket.getaddrinfo)
getaddrinfo(...)
    getaddrinfo(host, port [, family, socktype, proto, flags])
        -> list of (family, socktype, proto, canonname, sockaddr)
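
For comparison, here is a sketch of the same connect loop with the family argument set to socket.AF_INET instead of zero, written for Python 3. It is not what httplib does; the function name connect_ipv4 and the timeout parameter are assumptions made for illustration.

```python
import socket


def connect_ipv4(host, port, timeout=10.0):
    """Try each IPv4 address of host in turn; return the first connected socket.

    Hypothetical helper mirroring the stdlib connect loop, but passing
    AF_INET so that getaddrinfo only asks for IPv4 addresses.
    """
    last_error = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            host, port, socket.AF_INET, socket.SOCK_STREAM):
        sock = socket.socket(af, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sa)
        except OSError as exc:
            # Remember the failure and move on to the next address.
            last_error = exc
            sock.close()
            continue
        return sock
    raise last_error or OSError("getaddrinfo returned no IPv4 addresses")
```

The returned socket is an ordinary connected TCP socket. Getting urllib2 or mechanize to use it is a separate problem, since neither exposes a supported hook for swapping in a caller-made socket.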

Some guesses:

  • Since the socket family in getaddrinfo() is hardcoded to zero, you won't be able to choose between A and AAAA records through any supported API in urllib. Unless mechanize does its own name resolution for some other reason, mechanize can't either. Judging from the structure of the connect loop, this is By Design.

  • Python's socket module is a thin wrapper around the POSIX socket APIs; I expect it resolves every family that is available and configured on the system. Double-check Gentoo's IPv6 configuration.

Colza answered 10/1, 2010 at 1:19 Comment(1)
Seems to me that Python shouldn't pass 0 to socket.getaddrinfo if it is built without IPv6 support. Perhaps this could be considered a minor bug.Saguache
S
2

The DNS server 8.8.8.8 (Google DNS) replies immediately when asked about the AAAA record of python.org. Therefore, the fact that we do not see this reply in the trace you posted probably indicates that the packet never came back (which happens with UDP). If the loss is random, that is normal. If it is systematic, there is a problem in your network setup, perhaps a broken firewall that prevents the first AAAA reply from coming back.

The 5-second delay comes from your stub resolver timing out and retrying. If the loss is random, it is probably bad luck and unrelated to IPv6: the reply for the A record could have been lost just as easily.
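
One way to tell random loss from systematic loss is to time the lookup several times, once for all families and once for IPv4 only. A rough sketch in Python 3 follows; the function name time_lookup is made up for illustration, and port 80 is an arbitrary choice.

```python
import socket
import time


def time_lookup(host, family):
    """Return (elapsed_seconds, number_of_results) for one getaddrinfo call.

    Hypothetical diagnostic helper; a failed lookup counts as zero results.
    """
    start = time.monotonic()
    try:
        results = socket.getaddrinfo(host, 80, family, socket.SOCK_STREAM)
    except socket.gaierror:
        results = []
    return time.monotonic() - start, len(results)
```

Calling time_lookup("python.org", socket.AF_UNSPEC) and time_lookup("python.org", socket.AF_INET) a handful of times each should show whether only the all-families lookup is slow, and whether it is slow every time. Local caching resolvers can hide the delay on repeated runs.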

Disabling IPv6 seems a very strange move, only two years before the last IPv4 address is distributed!

% dig @8.8.8.8  AAAA python.org

; <<>> DiG 9.5.1-P3 <<>> @8.8.8.8 AAAA python.org
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50323
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;python.org.                    IN      AAAA

;; ANSWER SECTION:
python.org.             69917   IN      AAAA    2001:888:2000:d::a2

;; Query time: 36 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jan  9 21:51:14 2010
;; MSG SIZE  rcvd: 67
Shauna answered 9/1, 2010 at 20:56 Comment(1)
well, i'd be happy to use IPv6...once it stops adding a 5 second delay to my DNS queries :-P. And unfortunately, it isn't "bad luck" it is every single query.Saguache
P
2

The most likely cause of this is a broken egress firewall. Juniper firewalls can cause it, for instance, though they have a workaround available.

If you can't get your network admins to fix the firewall, you can try the host-based workaround. Add this line to your /etc/resolv.conf:

options single-request-reopen

The man page explains it well:

The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly only sends back one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
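
To check from a script whether the option is already set, a small parser along these lines may help. It is a simplified sketch in Python 3: the function name resolver_options is hypothetical, and it only skips lines that start with # or ; rather than implementing the full resolv.conf grammar.

```python
def resolver_options(resolv_conf_text):
    """Collect every token that follows an 'options' keyword in resolv.conf text.

    Hypothetical helper; lines starting with '#' or ';' are treated as comments.
    """
    options = set()
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "options":
            options.update(fields[1:])
    return options


if __name__ == "__main__":
    sample = "nameserver 8.8.8.8\noptions single-request-reopen\n"
    print("single-request-reopen" in resolver_options(sample))
```

Passing it the contents of /etc/resolv.conf (read with open("/etc/resolv.conf").read()) gives the set of options currently configured.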

Parnassian answered 3/12, 2012 at 6:50 Comment(1)
Thanks, this fixed the IPv6 name resolution segfault problem I was having in Python.Salvage
