Best way to concurrently check HTTP status (e.g. 200, 301, 404) for many URLs stored in a database

Here's what I'm trying to accomplish. Let's say I have 100,000 URLs stored in a database, and I want to check each of them for its HTTP status and store that status. I want to be able to do this concurrently in a fairly small amount of time.

I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.

Ideas?

Headley answered 28/1, 2011 at 20:53 Comment(0)

Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.

The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.

Paul Dix, the original author, talked about his design goals on his blog.

This is some sample code I wrote to download archived mailing lists so I could run local searches. I deliberately removed the URL to avoid subjecting the site to a DoS if people start running the code:

#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

# The base URL is deliberately blank; fill in the site you want to pull from.
BASE_URL = ''

# Fetch the index page and parse it for links to gzipped archives.
url  = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc  = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request  = Typhoeus::Request.new(gzip_url.to_s)

  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    # Write in binary mode since the archives are gzipped.
    File.open("gz/#{gzip_filename}", 'wb') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run

Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests, your throughput will be better. You'll want to play with the concurrency setting, because there is a point where adding more concurrent sessions only slows you down and needlessly uses resources.

I give it an 8 out of 10; it's got a great beat and I can dance to it.


EDIT:

When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either gives you a response you can use to determine the freshness of your URLs.
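
As a rough sketch of the conditional-GET idea with Typhoeus (the URL and the last_checked timestamp are just placeholders for values you'd pull from your database):

require 'typhoeus'
require 'time'

last_checked = Time.now - 86_400 # pretend the last check was a day ago

request = Typhoeus::Request.new(
  'http://example.com/page',
  :method  => :get,
  :headers => { 'If-Modified-Since' => last_checked.httpdate }
)

request.on_complete do |response|
  if response.code == 304
    puts 'unchanged since the last check'
  else
    puts "status #{response.code}"
  end
end

hydra = Typhoeus::Hydra.new
hydra.queue(request)
hydra.run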

Umberto answered 28/1, 2011 at 22:49 Comment(3)
Thanks Tin Man. I started this originally with em-http, using EM::MultiRequest to build 100 HEAD requests and fire them off at the same time. If all goes well, it finishes in about 3-7 seconds (before database writes). The only problem is that if one of the URLs times out, it waits 60 seconds, which means the next batch can't fire until that request completes (a terrible worst case if there's one of those in every batch). I'm looking into changing this so one request doesn't impact the others. I'll check out Typhoeus and see how it differs. Thanks!Headley
I saw the same problem with some Perl code I wrote years ago. Setting the HTTP timeout can help a lot; if the request times out, I'd update the "next check time" timestamp to some reasonable time in the future so it'd be retried soon, but not immediately. I haven't encountered the problem with Typhoeus/Hydra, but it was written for RSS processing, which could hit that case easily, so hopefully it'll behave nicely. Let us know your results!Umberto
My time is split between multiple projects at the moment, but I'll definitely report back once I've got something working well.Headley

I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm

From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your URLs among several threads, let each thread process its chunk, and update the database with the results. E.g., create 100 threads and give each thread a range of 1,000 database rows to process.
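
A minimal sketch of that partitioning idea with plain Ruby threads; check_status and save_status are stubs standing in for the HTTP check (see the request_head example below) and your database update:

def check_status(url)
  '200' # stub: replace with an HTTP HEAD request
end

def save_status(url, code)
  puts "#{url} => #{code}" # stub: replace with a database update
end

urls = (1..10_000).map { |i| "http://example.com/page#{i}" } # stand-in for your DB rows

threads = urls.each_slice(1_000).map do |chunk|
  Thread.new do
    chunk.each { |url| save_status(url, check_status(url)) }
  end
end

threads.each(&:join) # wait for every chunk to finish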

You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.

To get the URL status, I think you'd do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in Ruby.
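
For example, a single HEAD check with Net::HTTP#request_head could look like this (the URL is only an illustration):

require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/')
response = Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_head(uri.request_uri) # HEAD request; returns the response without a body
end
puts response.code # e.g. "200"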

Dominance answered 28/1, 2011 at 21:54 Comment(0)

The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.

require 'work_queue'
require 'net/http'
require 'uri'

# A pool of 10 worker threads; tasks are pulled off the queue as workers free up.
wq = WorkQueue.new 10

urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI.parse(url))
    puts response.code
  end
end

wq.join # wait for all queued tasks to finish
Colleen answered 19/6, 2015 at 19:22 Comment(0)
