"Work stealing" vs. "Work shrugging"?

Why is it that I can find lots of information on "work stealing" and nothing on "work shrugging" as a dynamic load-balancing strategy?

By "work-shrugging" I mean pushing surplus work away from busy processors onto less loaded neighbours, rather than have idle processors pulling work from busy neighbours ("work-stealing").

I think the general scalability should be the same for both strategies. However, I believe it is much more efficient, in terms of latency & power consumption, to wake an idle processor when there is definitely work for it to do than to have all idle processors periodically polling all neighbours for possible work.

Anyway, a quick Google search didn't turn up anything under the heading of "Work Shrugging" or similar, so any pointers to prior art and the jargon for this strategy would be welcome.

Clarification

I actually envisage the work-submitting processor (which may or may not be the target processor) being responsible for looking around the immediate locality of the preferred target processor (chosen for data/code locality) to decide whether a near neighbour should be given the new work instead, because it has less work to do.

I don't think the decision logic would require much more than an atomic read of the immediate (typically 2 to 4) neighbours' estimated queue lengths. I do not think this is any more coupling than is implied by thieves polling & stealing from their neighbours. (I am assuming "lock-free, wait-free" queues in both strategies.)
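To make the idea concrete, here is a minimal sketch of that decision logic in C++. Everything here is hypothetical (the names Proc, est_len and choose_target, and the fixed 4-slot neighbourhood, are illustrative, not from any existing scheduler): the submitter reads each neighbour's atomically published queue-length estimate and only redirects the work if someone nearby is clearly lighter.

```cpp
#include <array>
#include <atomic>
#include <cstdio>

// Each processor advertises an estimated queue length that neighbours
// can read without locking (the underlying queues are assumed to be
// lock-free, as in the question).
struct Proc {
    std::atomic<int> est_len{0};
    std::array<Proc*, 4> neighbours{};  // typically 2 to 4 near neighbours
};

// Keep the preferred target (chosen for data/code locality) unless a
// near neighbour is less loaded; one relaxed atomic read per neighbour.
Proc* choose_target(Proc* preferred) {
    Proc* best = preferred;
    int best_len = preferred->est_len.load(std::memory_order_relaxed);
    for (Proc* n : preferred->neighbours) {
        if (!n) continue;
        int len = n->est_len.load(std::memory_order_relaxed);
        if (len < best_len) { best = n; best_len = len; }
    }
    return best;
}

int main() {
    std::array<Proc, 3> p{};
    p[1].neighbours = {&p[0], &p[2]};   // a 1-D neighbourhood
    p[0].est_len = 1; p[1].est_len = 5; p[2].est_len = 3;
    Proc* t = choose_target(&p[1]);     // preferred target is busiest...
    std::printf("shrugged to processor %ld\n", (long)(t - p.data()));  // ...so 0 gets the work
    t->est_len.fetch_add(1, std::memory_order_relaxed);
}
```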

Resolution

It seems that what I meant (but only partially described!) as a "Work Shrugging" strategy is in the domain of "normal" upfront scheduling strategies that happen to be smart about processor, cache & memory locality, and scalable.

I find plenty of references searching on these terms and several of them look pretty solid. I will post a reference when I identify one that best matches (or demolishes!) the logic I had in mind with my definition of "Work Shrugging".

Rozamond answered 31/3, 2010 at 12:26 Comment(4)
Why do you add "tm"?? SO isn't a marketing training ground. - Christo
This isn't a community wiki question; there's a right answer here. - Met
@Andres: Community Wiki doesn't mean "no right answer" — it means "editable by the community." - Curie
Touché! I am happier now. :-) - Met

Load balancing is not free; it costs a context switch (into the kernel), a search for the idle processors, and a choice of which work to reassign. Especially in a machine where tasks switch all the time, dozens of times per second, this cost adds up.

So what's the difference? Work-shrugging means you further burden the already over-committed resources (busy processors) with the overhead of load-balancing. Why interrupt a busy processor with administrivia when there's a processor next door with nothing to do? Work stealing, on the other hand, lets the idle processors run the load balancer while busy processors get on with their work. Work-stealing saves time.

Example

Consider: Processor A has two tasks assigned to it. They take time a1 and a2, respectively. Processor B, nearby (the distance of a cache bounce, perhaps), is idle. The processors are identical in all respects. We assume the code for each task and the kernel is in the i-cache of both processors (no added page fault on load balancing).

A context switch of any kind (including load-balancing) takes time c.

No Load Balancing

The time to complete the tasks will be a1 + a2 + c. Processor A will do all the work, and incur one context switch between the two tasks.

Work-Stealing

Assume B steals a2, incurring the context switch time itself. The work will be done in max(a1, a2 + c) time. Suppose processor A begins working on a1; while it does that, processor B will steal a2 and avoid any interruption in the processing of a1. All the overhead on B is free cycles.

If a2 + c ≤ a1, you have effectively hidden the cost of the context switch in this scenario; the total time is just a1.

Work-Shrugging

Assume B completes a2, as above, but A incurs the cost of moving it ("shrugging" the work). The work in this case will be done in max(a1, a2) + c time; the context switch is now always in addition to the total time, instead of being hidden. Processor B's idle cycles have been wasted, here; instead, a busy processor A has burned time shrugging work to B.
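To make the comparison concrete with some hypothetical numbers (mine, not from the analysis above): let a1 = 10ms, a2 = 6ms and c = 1ms. With no load balancing the total is 10 + 6 + 1 = 17ms. With work-stealing it is max(10, 6 + 1) = 10ms; the switch is hidden under a1. With work-shrugging it is max(10, 6) + 1 = 11ms; the switch is always paid.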

Met answered 31/3, 2010 at 12:26 Comment(3)
Thanks Andres - this community is great! But I am realising how difficult it is to phrase a question concisely and yet include sufficient context! I am not in fact concerned about processor virtualisation/context-switching issues! In my application I have N threads bound to N physical processors. The work they are given is guaranteed to be non-blocking. I want processors to step up to the mark very fast when there is work for them (no poll/sleep latency) and to get right out of the way (preferably don't even touch memory) when there isn't. - Rozamond
What I described applies directly to physical threads. "task" = "thread". As far as work being assigned immediately upon task creation, the mechanism you describe sounds good. I'd be shocked if modern schedulers didn't already do this, perhaps taking cache effects into account. - Met
Oh yes, I see now. Your equation applies just as well to the physical case (I was blinded by the mention of kernel/context switch). I am now looking into "modern schedulers" etc. - thnx - question updated. - Rozamond

I think the problem with this idea is that it makes the threads with actual work to do waste their time constantly looking for idle processors. Of course there are ways to make that faster, like having a queue of idle processors, but then that queue becomes a concurrency bottleneck. So it's just better to have the threads with nothing better to do sit around and look for jobs.

Tubby answered 31/3, 2010 at 15:18 Comment(1)
Thanks - I do not actually envisage any processors constantly looking for idle processors, nor a global queue - just peeking at neighbours when there is new work available. I don't think this coupling is any greater than a thief peeking at neighbours for work to do. I have included a bit more detail in my question. - Rozamond

The basic advantage of 'work stealing' algorithms is that the overhead of moving work around drops to 0 when everyone is busy. So there's only overhead when some processor would otherwise have been idle, and that overhead cost is mostly paid by the idle processor with only a very small bus-synchronization related cost to the busy processor.
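To illustrate that point, here is a minimal, hypothetical sketch in C++ (mutex-guarded deques stand in for the lock-free deques a real implementation would use; the names Worker, pop_local and steal are mine, not from any library). The key property is structural: a worker never touches anyone else's queue while it has local work, so the balancing overhead really is zero while everyone is busy.

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct Worker {
    std::deque<std::function<void()>> q;
    std::mutex m;

    bool pop_local(std::function<void()>& task) {  // owner pops LIFO
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        task = std::move(q.back()); q.pop_back();
        return true;
    }
    bool steal(std::function<void()>& task) {      // thief pops FIFO
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        task = std::move(q.front()); q.pop_front();
        return true;
    }
};

void run(Worker& self, std::vector<Worker*>& all) {
    std::function<void()> task;
    for (;;) {
        if (self.pop_local(task)) { task(); continue; }  // busy: no balancing cost
        bool stole = false;
        for (Worker* v : all)                            // idle: the idle worker pays
            if (v != &self && v->steal(task)) { task(); stole = true; break; }
        if (!stole) return;  // nothing anywhere; a real worker would park or poll here
    }
}

int main() {
    Worker a, b;
    std::vector<Worker*> all{&a, &b};
    for (int i = 0; i < 4; ++i)
        a.q.push_back([i] { std::printf("task %d\n", i); });
    run(b, all);  // b starts idle and steals all of a's tasks
}
```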

Misrule answered 31/3, 2010 at 12:26 Comment(2)
Yes, this is a clear benefit of "work stealing". One motivating issue for me in considering an alternative to stealing is the latency for an idle processor to grab surplus work arriving in its neighbourhood. The stealing papers and examples I found all came down to how rapid the polling cycle is. I guess, to summarise, my application needs bursty loads emanating on one processor to be spread rapidly to less busy, including idle, neighbours. I am prepared to pay some modest upfront overhead to minimise the latency involved. - Rozamond
OK, so really I am talking about upfront (cached-state & memory-node aware) "dynamic load distribution" - and I get plenty of solid hits on that. - Rozamond

Some issues... if a thread is busy, wouldn't you want it spending its time processing real work instead of speculatively looking for idle threads to offload onto?

How does your thread decide when it has so much work that it should stop doing that work to look for a friend that will help?

How do you know that the other threads don't have just as much work and you won't be able to find a suitable thread to offload onto?

Work stealing seems more elegant, because it solves the same problem (contention) in a way that guarantees that the threads doing the load balancing are only doing the load balancing while they would otherwise have been idle.

It's my gut feeling that what you've described will not only be much less efficient in the long run, but will require lots of per-system tweaking to get acceptable results.

Though in your edit you suggest that you want the submitting processor to handle this, not the worker threads as you suggested earlier and in some of the comments here. If the submitting processor is searching for the lowest queue length, you're potentially adding latency to the submit, which isn't really a desirable thing.

But more importantly, it's a supplementary technique to work-stealing, not a mutually exclusive one. You've potentially alleviated some of the contention that work-stealing was invented to control, but you still have a number of things to tweak before you'll get good results, those tweaks won't be the same for every system, and you still risk running into situations where work-stealing would help you.

I think your edited suggestion, with the submission thread doing "smart" work distribution, is potentially a premature optimization against work-stealing. Are your idle threads slamming the bus so hard that your non-idle threads can't get any work done? Then it's time to optimize work-stealing.

Thebaid answered 31/3, 2010 at 12:26 Comment(0)

Work stealing, as I understand it, is designed for highly-parallel systems, to avoid having a single location (single thread, or single memory region) responsible for sharing out the work. In order to avoid this bottleneck, I think it does introduce inefficiencies in simple cases.

If your application is not so parallel that a single point of work distribution causes scalability problems, then I would expect you could get better performance by managing it explicitly as you suggest.

No idea what you might google for though, I'm afraid.

Internee answered 31/3, 2010 at 13:00 Comment(1)
My vision of Work-Shrugging is no less scalable (I think!) - each processor would be responsible for pushing excess work to a less loaded neighbour (NUMA, LAN or WAN) rather than pulling work to itself. I have added some clarification to the question - thnx. - Rozamond

So, in contrast to "Work Stealing", what is really meant here by "Work Shrugging" is a normal upfront work-scheduling strategy that is smart about processor, cache & memory locality, and scalable.

Searching on combinations of the terms/jargon above yields many substantial references to follow up. Some address the added complication of machine virtualisation, which wasn't in fact a concern of the questioner, but the general strategies are still relevant.

Rozamond answered 31/3, 2010 at 12:26 Comment(0)
