Method for self-rearranging job queue
Asked Answered

I have a job queue (using Amazon SQS) which hands off jobs to many machines for fetching and processing various documents over HTTP. There are hundreds of different hosts which are accessed, and there is no predictable order for the jobs.

In order to be polite, I don't want my system to hammer repeatedly on a single host. Thus, if I get a job #123 to fetch something from example.com, but I see that I have just fetched another thing from example.com in the past X seconds, I should move on to something else and save job #123 for later.

The question is, what's a good way to implement this pattern?

It seems the first step would be to have the job runners keep a list somewhere of all domains and the last time something on that domain was accessed. I suppose this could be a simple DB table.
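To make that first step concrete, here is a rough sketch of the last-access check (not from the original post; the names, the in-memory dict, and the 30-second window are illustrative — in production this would be a DB table or Redis with an atomic check-and-set):

```python
import time

# Illustrative politeness window: minimum seconds between hits to one domain.
POLITENESS_WINDOW = 30.0

last_access = {}  # domain -> unix timestamp of the last fetch

def may_fetch(domain, now=None):
    """Return True and record the access if the domain is 'cold' enough;
    return False if the job should be deferred."""
    now = time.time() if now is None else now
    last = last_access.get(domain)
    if last is not None and now - last < POLITENESS_WINDOW:
        return False  # too soon; defer this job
    last_access[domain] = now
    return True
```

Note the check and the update must be atomic once multiple processors share the table, or two workers can both see a "cold" domain and hit it simultaneously.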

There are then many possible options for what to do if a message processor gets a job that must be deferred.

  1. Simply push a copy of the message onto the end of the queue, and throw it away without executing it. Hopefully, by the next time it comes around, enough time will have passed. This may result in a lot of redundant SQS messages, especially if a large cluster of jobs for the same domain goes through at once.

  2. Sleep for however many seconds are necessary until politeness dictates that the job can be executed. This may result in a lot of queue processors simultaneously doing nothing.

  3. Accept the job, but save it in a local queue somewhere on each queue processor. I imagine each processor could "claim" a number of jobs this way, and then elect to process them in whatever order achieves maximum politeness. This can still be unpredictable, because each queue processor needs to be aware of the domains hit by all the others.

  4. Establish separate queues for every domain and have one process dedicated to each queue. Each process would have to pause for X seconds between doing each job, so there's a lot of sleeping process overhead, but maybe this isn't such a bad thing.
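For options 1 and 2, the arithmetic is the same: compute how long the job must still wait. A sketch (not from the original post; function and variable names are made up): with SQS you could pass this value as the message's DelaySeconds when re-enqueueing (SQS caps DelaySeconds at 900 seconds), rather than sleeping in the worker.

```python
SQS_MAX_DELAY = 900  # SQS's upper bound on DelaySeconds (15 minutes)

def defer_seconds(last_access, now, window):
    """Whole seconds this job must still wait to stay polite (0 = run now),
    rounded up and clamped to what SQS will accept as a message delay."""
    remaining = window - (now - last_access)
    return min(max(int(remaining + 0.999), 0), SQS_MAX_DELAY)
```

A deferral delay computed this way avoids both the busy re-queue loop of option 1 and the idle sleeping workers of option 2.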

Do you have any experience with designing this sort of thing? What strategy would you recommend?

Mulct answered 2/1, 2011 at 4:57 Comment(1)
Are you 100% stuck on SQS? There are good designs that do NOT force you into a queue-per-domain solution, but they require direct control of the queue, which I'm assuming SQS doesn't provide (to be precise, the ability to "browse" the queue without taking the top element, and to take the Nth element instead of the top; basically, treating the queue as a doubly linked list without insertion rather than a pure queue).Legnica

Separate queues for each domain and a queue of domains.

Each processor should:

  1. Pick a domain from queue of domains.
  2. If the domain was not recently accessed, pick the top task from that domain's queue.
  3. Put domain back to the end of domain queue.
  4. If we have a task to execute, do it.
  5. Sleep until it is time to check the head of the domain queue, or until the domain queue is updated.

It may help if you organize the queue of domains as a time-priority queue — store the domains in the order of the next update time.
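The steps above, with the time-priority queue of domains, can be sketched with a binary heap (this sketch is mine, not the answerer's; single process, illustrative names, and a fixed 30-second window):

```python
import heapq

WINDOW = 30.0  # illustrative per-domain politeness interval

# Heap of (next_allowed_time, domain); domain_tasks maps domain -> FIFO of jobs.
domain_tasks = {"example.com": ["job-123"], "example.org": ["job-456"]}
domain_heap = [(0.0, d) for d in sorted(domain_tasks)]  # all runnable at t=0
heapq.heapify(domain_heap)

def next_job(now):
    """Steps 1-4: pop the soonest domain, take a task if the domain is due,
    and push the domain back with an updated next-allowed time."""
    if not domain_heap:
        return None
    next_time, domain = heapq.heappop(domain_heap)
    if next_time > now or not domain_tasks.get(domain):
        # Not due yet (or nothing queued for it): put it back and let the
        # caller sleep until next_time (step 5). A real system would drop
        # empty domains instead of re-pushing them.
        heapq.heappush(domain_heap, (next_time, domain))
        return None
    job = domain_tasks[domain].pop(0)
    heapq.heappush(domain_heap, (now + WINDOW, domain))
    return job
```

Because the heap is keyed by next allowed access time, step 1 always yields the domain that becomes polite soonest, which is exactly the time-priority ordering suggested above.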

Unspoiled answered 2/1, 2011 at 6:6 Comment(8)
If you have a sufficient number of distinct domains, and you anticipate contention on the queue of domains, you could make it so that processors put domains back onto their own local queue. Then modify step 1 to "If local_queue_size < some_threshold, pick a domain from the global queue of domains, otherwise pick a domain from the local queue of domains." Whenever a processor tries to get a job from the global queue and finds there are none left, trigger a global "donation" of e.g. 50% of the domains from all local queues back to the global queue.Redmund
@j_random_hacker: Not sure that's a good idea. Local queues complicate dataflow and the benefits are doubtful. If you don't have enough processors, add more. If your domain queue is too big, add some sharding.Unspoiled
@Alex: If I understand what you mean by "sharding", the only dataflow complication my suggestion would introduce -- namely, the return of local domains back to the global queue -- would occur exactly when plain sharding leaves one or more processors idle. It is effectively "auto-sharding" plus a recovery mechanism. Of course you could omit the recovery mechanism for a simpler implementation that has all the benefits (and inefficiencies) of sharding without the need to define shards a priori.Redmund
@j_random_hacker: What if a processor steals a domain into a private queue and then gets overloaded? Who's going to steal the domain back from it? Too complicated. Anyway, I'd leave that for later — right now it looks like premature optimization.Unspoiled
@Alex: True, but I think sharding is vulnerable to the same overloading problem -- right? I'm assuming that by "sharding" you mean "dividing domains up between processors beforehand". If not could you explain please? Thanks.Redmund
@j_random_hacker: By sharding I mean divide domains between queues beforehand. How processors access sharded domain queues is another question — up to a queue of domain queues if things are that bad...Unspoiled
@Alex: Thanks. That will have better load balancing than what I thought you were suggesting. Still I don't see how it would be simpler than my approach -- it is still the case that either (a) processors require extra logic to determine what to do when they try to access a queue and find it empty, or (b) we avoid this extra logic and just let them starve. But nor do I see that it's more complicated than my approach :)Redmund
@j_random_hacker: IMO, it is architecturally important to separate the processor from the queue. In your case the separation is incomplete. Also, you did not say how you would solve the problem when a processor holds several domains and is overloaded. In my case this would not require extra logic.Unspoiled

I would recommend setting up a queue for each domain, and one processor per queue.

Most servers should have no problem with requests issued constantly in series, as long as you keep an eye on the total transfer volume (for example, avoid indexing files larger than a few hundred KB unless you have a real need for it).

I assume you're also obeying robots.txt rules.

Amazonite answered 2/1, 2011 at 5:7 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.