getaddrinfo error with Mechanize
I wrote a script that goes through all of the customers in our database, verifies that their website URL works, and tries to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs are verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

require 'mechanize'

def scrape_url(url)
  url_found = false
  twitter_name = nil

  begin
    agent = Mechanize.new do |a|
      a.follow_meta_refresh = true
    end

    # The block only runs if the GET succeeds.
    agent.get(normalize_url(url)) do |page|
      url_found = true
      twitter_name = find_twitter_name(page)
    end

    @err << "[#{@current_record}] SUCCESS\n"
  rescue Exception => e
    @err << "[#{@current_record}] ERROR (#{url}): "
    @err << e.message
    @err << "\n"
  end

  [url_found, twitter_name]
end
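
normalize_url and find_twitter_name are helpers defined elsewhere in my script; their exact implementations aren't important to the problem, but a minimal sketch of what they do might look like this (the scheme-prefixing and link-matching details are illustrative, not the exact code):

def normalize_url(url)
  # Bare hostnames like "agilecommerce.com" need a scheme before Mechanize can fetch them.
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

def find_twitter_name(page)
  # Pull the handle from the first link on the page that points at twitter.com.
  href = page.links.map(&:href).compact.find { |h| h =~ %r{twitter\.com/(\w+)} }
  href && href[%r{twitter\.com/(\w+)}, 1]
end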

Note: I've also run a version of this code that creates a single Mechanize instance shared across all calls to scrape_url. It failed in exactly the same fashion.
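
For reference, the shared-instance variant was essentially this (a sketch of what I described, reusing the same helpers as above):

SHARED_AGENT = Mechanize.new do |a|
  a.follow_meta_refresh = true
end

def scrape_url(url)
  twitter_name = nil
  SHARED_AGENT.get(normalize_url(url)) { |page| twitter_name = find_twitter_name(page) }
  @err << "[#{@current_record}] SUCCESS\n"
  [true, twitter_name]
rescue Exception => e
  @err << "[#{@current_record}] ERROR (#{url}): #{e.message}\n"
  [false, nil]
end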

When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note: I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.
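
(To rule Mechanize out while testing this, I can query a nameserver directly with Ruby's standard-library resolver; a sketch, where 8.8.8.8 and the hostname are just example values:)

require 'resolv'

# Ask a specific nameserver directly, bypassing the system resolver.
Resolv::DNS.open(nameserver: ['8.8.8.8']) do |dns|
  puts dns.getaddress('agilecommerce.com')
end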

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?

Shabuoth answered 1/11, 2012 at 22:9 Comment(6)
Show us the URL it's failing on. – Dipterocarpaceous
It fails on around 9,000 of them. One example is agilecommerce.com. The URLs tend to work if plugged into a browser. – Shabuoth
Could you be running out of memory? – Dipterocarpaceous
Try adding something to throttle your requests; see the sketch just after these comments. I wouldn't be surprised if your DNS provider is getting upset and refusing your connections. – Houdini
You don't say what host OS you're running, but it looks like Fedora had some problems that returned the same error. – Houdini
I might have found a potential solution. I set keep_alive to false and set a 1-second idle timeout. My theory is that Mechanize was keeping the connections open until they timed out. At some point, a maximum number of connections was hit and it couldn't make another to do a DNS lookup. Strictly a theory at this point, but I'm just shy of 3,000 records processed. – Shabuoth
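
A minimal throttling sketch along the lines of Houdini's suggestion, assuming urls holds the list to check and using an arbitrary one-second pause:

urls.each do |url|
  scrape_url(url)
  sleep 1 # pause between requests so the resolver isn't hammered
end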

I found the solution. Mechanize was leaving connections open and relying on GC to clean them up. After a certain point, there were enough open connections that no additional outbound connection could be established to do a DNS lookup. Here's the code that fixed it:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false # close each connection as soon as its request completes
end

By setting keep_alive to false, each connection is closed and cleaned up immediately instead of lingering until garbage collection.
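
As mentioned in the comments above, I also paired this with a short idle timeout; Mechanize exposes that as an attribute too, so the combined setup looks like:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false
  a.idle_timeout = 1 # close idle connections after one second
end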

Shabuoth answered 3/11, 2012 at 20:59 Comment(0)

See if this helps:

agent.history.max_size = 10

It will keep Mechanize's page history from using too much memory.

Dipterocarpaceous answered 2/11, 2012 at 0:7 Comment(0)
