getaddrinfo error with Mechanize
I wrote a script that goes through all of the customers in our database, verifies that their website URL works, and tries to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs are verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

require 'mechanize'

def scrape_url(url)
  url_found = false
  twitter_name = nil

  begin
    agent = Mechanize.new do |a|
      a.follow_meta_refresh = true
    end

    # The block only runs if the GET succeeds.
    agent.get(normalize_url(url)) do |page|
      url_found = true
      twitter_name = find_twitter_name(page)
    end

    @err << "[#{@current_record}] SUCCESS\n"
  rescue Exception => e
    @err << "[#{@current_record}] ERROR (#{url}): "
    @err << e.message
    @err << "\n"
  end

  [url_found, twitter_name]
end
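
normalize_url and find_twitter_name are helpers defined elsewhere in my script; their exact implementations aren't important to the problem, but a minimal sketch of what they do might look like this (the scheme-prefixing and link-matching details are illustrative, not the exact code):

def normalize_url(url)
  # Bare hostnames like "agilecommerce.com" need a scheme before Mechanize can fetch them.
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

def find_twitter_name(page)
  # Pull the handle from the first link on the page that points at twitter.com.
  href = page.links.map(&:href).compact.find { |h| h =~ %r{twitter\.com/(\w+)} }
  href && href[%r{twitter\.com/(\w+)}, 1]
end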

Note: I've also run a version of this code that creates a single Mechanize instance shared across all calls to scrape_url. It failed in exactly the same fashion.
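
For reference, the shared-instance variant was essentially this (a sketch of what I described, reusing the same helpers as above):

SHARED_AGENT = Mechanize.new do |a|
  a.follow_meta_refresh = true
end

def scrape_url(url)
  twitter_name = nil
  SHARED_AGENT.get(normalize_url(url)) { |page| twitter_name = find_twitter_name(page) }
  @err << "[#{@current_record}] SUCCESS\n"
  [true, twitter_name]
rescue Exception => e
  @err << "[#{@current_record}] ERROR (#{url}): #{e.message}\n"
  [false, nil]
end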

When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note: I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.
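
(To rule Mechanize out while testing this, I can query a nameserver directly with Ruby's standard-library resolver; a sketch, where 8.8.8.8 and the hostname are just example values:)

require 'resolv'

# Ask a specific nameserver directly, bypassing the system resolver.
Resolv::DNS.open(nameserver: ['8.8.8.8']) do |dns|
  puts dns.getaddress('agilecommerce.com')
end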

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?

Shabuoth answered 1/11, 2012 at 22:9 Comment(6)
Show us the URL it's failing on. – Dipterocarpaceous
It fails on around 9,000 of them. One example is agilecommerce.com. The URLs tend to work if plugged into a browser. – Shabuoth
Could you be running out of memory? – Dipterocarpaceous
Try adding something to throttle your requests; see the sketch just after these comments. I wouldn't be surprised if your DNS provider is getting upset and refusing your connections. – Houdini
You don't say what host OS you're running, but it looks like Fedora had some problems that returned the same error. – Houdini
I might have found a potential solution. I set keep_alive to false and set a 1-second idle timeout. My theory is that Mechanize was keeping the connections open until they timed out. At some point, a maximum number of connections was hit and it couldn't make another to do a DNS lookup. Strictly a theory at this point, but I'm just shy of 3,000 records processed. – Shabuoth
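
A minimal throttling sketch along the lines of Houdini's suggestion, assuming urls holds the list to check and using an arbitrary one-second pause:

urls.each do |url|
  scrape_url(url)
  sleep 1 # pause between requests so the resolver isn't hammered
end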

I found the solution. Mechanize was leaving connections open and relying on GC to clean them up. After a certain point, there were enough open connections that no additional outbound connection could be established to do a DNS lookup. Here's the code that fixed it:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false # close each connection as soon as its request completes
end

By setting keep_alive to false, each connection is closed and cleaned up immediately instead of lingering until garbage collection.
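
As mentioned in the comments above, I also paired this with a short idle timeout; Mechanize exposes that as an attribute too, so the combined setup looks like:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false
  a.idle_timeout = 1 # close idle connections after one second
end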

Shabuoth answered 3/11, 2012 at 20:59 Comment(0)

See if this helps:

agent.history.max_size = 10

It will keep Mechanize's page history from using too much memory.

Dipterocarpaceous answered 2/11, 2012 at 0:7 Comment(0)
