I wrote a script that will go through all of the customers in our database, verify that their website URL works, and try to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs have been verified, we start getting getaddrinfo errors for every URL.
Here's a copy of the code that scrapes a single URL:
require 'mechanize'

def scrape_url(url)
  url_found = false
  twitter_name = nil
  begin
    # Build a fresh Mechanize agent for each URL and follow
    # <meta http-equiv="refresh"> redirects.
    agent = Mechanize.new do |a|
      a.follow_meta_refresh = true
    end
    agent.get(normalize_url(url)) do |page|
      url_found = true
      twitter_name = find_twitter_name(page)
    end
    @err << "[#{@current_record}] SUCCESS\n"
  rescue Exception => e
    # Rescues everything so one bad URL can't abort the run;
    # the failure is logged and we move on to the next record.
    @err << "[#{@current_record}] ERROR (#{url}): "
    @err << e.message
    @err << "\n"
  end
  [url_found, twitter_name]
end
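For completeness, here's roughly what the two helpers do (simplified sketches, not the exact code):

# Simplified sketches of the helpers referenced above.
def normalize_url(url)
  # Prepend a scheme if the stored URL lacks one.
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

def find_twitter_name(page)
  # Find the first anchor pointing at twitter.com and pull out the handle.
  link = page.link_with(href: %r{twitter\.com/[^/?#]+})
  link && link.href[%r{twitter\.com/([^/?#]+)}, 1]
end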
Note: I've also run a version of this code that creates a single Mechanize instance that gets shared across all calls to scrape_url. It failed in exactly the same fashion.
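That variant just memoized the agent instead of building one per call, along these lines (sketch; the actual wiring may have differed slightly):

# Shared-agent variant: build one Mechanize instance and reuse it
# across all calls to scrape_url (failed with the same errors).
def agent
  @agent ||= Mechanize.new do |a|
    a.follow_meta_refresh = true
  end
end

with scrape_url calling agent.get(normalize_url(url)) instead of constructing a new instance each time.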
When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:
getaddrinfo: Temporary failure in name resolution
Note: I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.
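A quick way to test a specific nameserver from Ruby, independent of Mechanize, is the standard-library Resolv (sketch; the server and hostname here are just placeholders):

require 'resolv'

# Query one specific DNS server directly, bypassing /etc/resolv.conf,
# to check whether the server itself is answering.
dns = Resolv::DNS.new(nameserver: ['8.8.8.8'])
puts dns.getaddress('example.com')  # raises Resolv::ResolvError if lookup fails
dns.close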
Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:
getaddrinfo: nodename nor servname provided, or not known
Does anyone know how I can get the script to make it through all of the records?