Implementation of a work stealing queue in C/C++? [closed]

10

36

I'm looking for a proper implementation of a work-stealing queue in C/C++. I've looked around Google but haven't found anything useful.

Perhaps someone is familiar with a good open-source implementation? (I prefer not to implement the pseudo-code taken from the original academic papers).

Patagium answered 20/1, 2010 at 13:57 Comment(0)
17

No free lunch.

Please take a look at the original work-stealing paper. It is hard to read, and it contains theoretical proofs rather than pseudo code. Still, there is simply no version that is much simpler than TBB's, and if there were, it would not give optimal performance. Work stealing itself incurs some overhead, so optimizations and tricks are quite important. In particular, the deques must be thread-safe, and implementing highly scalable, low-overhead synchronization is challenging.

I'm really wondering why you need it. To me, a proper implementation means something like TBB or Cilk. Again, work stealing is hard to implement well.

Thaumatrope answered 26/1, 2010 at 6:49 Comment(1)
This library github.com/cpp-taskflow/cpp-taskflow supports work stealing since Dec 2018.Jugular
16

Implementing "work stealing" isn't hard in theory. You need a set of queues containing tasks, which do work by combining computation with generating other tasks to do more work. You need atomic access to the queues to place newly generated tasks into them. Finally, you need a procedure that each task calls at the end, to find more work for the thread that executed it; that procedure needs to look in the work queues to find work.

Most such work-stealing systems make the assumption that there are a small number of threads (backed up typically by real processor cores), and that there is exactly one work queue per thread. Then each thread first tries to take work from its own queue, and if that is empty, tries to steal from the others. What gets tricky is knowing which queues to look in; scanning them serially for work is pretty expensive and can create a huge amount of contention between threads looking for work.
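As a rough illustration of that structure, here is a minimal sketch (all names hypothetical) using one mutex-protected deque per worker; a production scheduler would use lock-free deques and smarter victim selection, but the shape is the same: the owner pushes and pops at one end, thieves steal from the other.

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
  std::mutex m;
  std::deque<Task> tasks;

  void push(Task t) {                         // owner adds newly generated work
    std::lock_guard<std::mutex> lk(m);
    tasks.push_back(std::move(t));
  }
  std::optional<Task> pop() {                 // owner takes from its own end
    std::lock_guard<std::mutex> lk(m);
    if (tasks.empty()) return std::nullopt;
    Task t = std::move(tasks.back());
    tasks.pop_back();
    return t;
  }
  std::optional<Task> steal() {               // thieves take from the opposite end
    std::lock_guard<std::mutex> lk(m);
    if (tasks.empty()) return std::nullopt;
    Task t = std::move(tasks.front());
    tasks.pop_front();
    return t;
  }
};

// Called by worker `me` whenever it finishes a task and needs more work.
std::optional<Task> find_work(std::vector<WorkerQueue>& queues, std::size_t me) {
  if (auto t = queues[me].pop()) return t;         // own queue first
  for (std::size_t i = 1; i < queues.size(); ++i)  // then scan the other queues
    if (auto t = queues[(me + i) % queues.size()].steal()) return t;
  return std::nullopt;                             // no work found; caller should back off
}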

So far this is all pretty generic stuff, with two major exceptions: 1) switching contexts (e.g., setting processor context registers such as the "stack") cannot be expressed in pure C or C++. You can resolve this by agreeing to write part of your package in target-platform-specific machine code. 2) Atomic access to the queues on a multiprocessor cannot be done purely in C or C++ (ignoring Dekker's algorithm), so you'll need to code it using assembly-language synchronization primitives like the x86 LOCK XCHG or compare-and-swap. Now, the code involved in updating the queues once you have safe access isn't very complex, and you could easily write that in a few lines of C.
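Since C++11, std::atomic does expose compare-and-swap portably (on x86 it compiles down to LOCK CMPXCHG), which covers exception 2 even though context switching still needs platform-specific code. A minimal sketch of claiming the next item index with it, with hypothetical names not tied to any particular deque design:

#include <atomic>
#include <cstddef>

std::atomic<std::size_t> top{0};   // index of the next item a thief may claim

// Try to advance `top` by one; succeeds only if no other thread got there first.
// On failure, `observed` is updated to the value another thread installed.
bool try_claim(std::size_t& observed) {
  observed = top.load(std::memory_order_acquire);
  return top.compare_exchange_strong(observed, observed + 1,
                                     std::memory_order_acq_rel);
}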

However, I think you will find that attempting to code such a package in C and C++ with mixed assembler is still rather inefficient, and you'll eventually end up coding the entire thing in assembler anyway. All that's left are C/C++-compatible entry points :-}

I did this for our PARLANSE parallel programming language, which offers the idea of an arbitrarily large number of parallel computations being live and interacting (synchronizing) at any instant. It is implemented behind the scenes on x86 with exactly one thread per CPU, and the implementation is entirely in assembler. The work-stealing code is probably 1000 lines total, and it's tricky code because you want it to be extremely fast in the non-contention case.

The real fly in the ointment for C and C++ is: when you create a task representing work, how much stack space do you assign? Serial C/C++ programs avoid this question by simply overallocating huge amounts (e.g., 10MB) of one linear stack, and nobody cares much about how much of that stack space is wasted. But if you can create thousands of tasks and have them all live at a particular instant, you can't reasonably allocate 10MB to each one. So now you either need to determine statically how much stack space a task will need (Turing-hard), or you'll need to allocate stack chunks (e.g., per function call), which widely available C/C++ compilers don't do (e.g., the one you are likely using). The last way out is to throttle task creation to limit it to a few hundred at any instant, and multiplex a few hundred really huge stacks among the tasks that are live. You can't do the last if the tasks can interlock/suspend state, because you'll run into your threshold. So you can only do this if the tasks only do computation. That seems like a pretty severe constraint.

For PARLANSE, we built a compiler that allocates activation records on the heap for each function call.

Thorndike answered 31/1, 2010 at 0:0 Comment(8)
Or you do the sane thing, and don't allocate space to tasks until they're actually running, and don't think of tasks as things to suspend and resume, but rather to run from execution to completion.Umbelliferous
Your solution isn't sane. If you build complex systems, where one piece of work can call arbitrary other pieces of work, you can't guarantee that your task won't need suspension. You can certainly force this property to be true; you'll then have a hard time building complex systems. We build million-line parallel programs in PARLANSE.Thorndike
How well does Linux do with a process with 10,000 threads? Windows bombs out at ~15,000 threads per process. blogs.technet.com/b/markrussinovich/archive/2009/07/08/…. I want to have literally millions of "threads" that individually need to wait on events. PARLANSE can do that. I don't think Linux or Windows OSes are configured to handle a million threads well. I'd expect all kinds of resource troubles, including managing just the thread handles.Thorndike
It never fails: You see 'Ira Baxter' as the author and you just know the post is littered with advertising for some 3rd party program. How this guy has not been banned yet for all the shilling is beyond me.Courtney
How do you approach the tricky part of "knowing which queues to look in"?Bedwarmer
@PSkocik: You mean "if I'm CPU k, which other queue 1..N do I look in for work to steal?" The awful way is for k to simply scan all the other queues if his is empty. With 4 queues this might be ok, not as attractive with 32-64 queues. A better way that adds some overhead is to keep a bit vector in a single word that tracks which queues have work; it can be updated cheaply with OR and AND. ...Thorndike
... You can make that bit vector accurate if you lock the operations, but that makes it expensive to update, destroying its purpose. So I do this unsynchronized, which means it is only advisory. Still, it's a pretty good hint about where to look first; a rough sketch of the idea follows these comments.Thorndike
Thanks. I rather like these parallel programming hacks. :)Bedwarmer
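That advisory bit vector might look roughly like the following sketch (hypothetical names, assuming at most 64 queues so the mask fits in one machine word; std::countr_zero is C++20):

#include <atomic>
#include <bit>
#include <cstdint>

std::atomic<std::uint64_t> occupancy{0};   // bit k set => queue k probably has work

void mark_nonempty(unsigned k) {           // cheap relaxed OR when a queue gains work
  occupancy.fetch_or(std::uint64_t{1} << k, std::memory_order_relaxed);
}
void mark_empty(unsigned k) {              // cheap relaxed AND when a queue drains
  occupancy.fetch_and(~(std::uint64_t{1} << k), std::memory_order_relaxed);
}

// Advisory only: a set bit may be stale, so the thief must still handle
// finding the chosen queue empty and fall back to scanning.
int pick_victim(unsigned self) {
  std::uint64_t hint = occupancy.load(std::memory_order_relaxed)
                       & ~(std::uint64_t{1} << self);
  if (hint == 0) return -1;                // nothing advertised right now
  return std::countr_zero(hint);           // index of the lowest set bit
}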
3

This open source library https://github.com/cpp-taskflow/cpp-taskflow supports work stealing thread pools since Dec 2018.

Take a look at the WorkStealingQueue class, which implements the work-stealing queue as described in the paper "Dynamic Circular Work-Stealing Deque," SPAA, 2005.

Jugular answered 31/12, 2018 at 2:2 Comment(0)
2

There exists a tool that does this in a very elegant way. It is a really effective way to parallelize your program in a very short time.

Cilk project

HPC Challenge Award

Our Cilk entry for the HPC Challenge Class 2 award won the 2006 award for "Best Combination of Elegance and Performance". The award was made at SC'06 in Tampa on November 14, 2006.

Device answered 27/1, 2010 at 14:47 Comment(0)
2

If you're looking for a standalone work-stealing queue class implementation in C++ built on pthread or boost::thread, good luck; to my knowledge there isn't one.

However, as others have said Cilk, TBB and Microsoft's PPL all have workstealing implementations under the hood.

The question is, do you want to use a work-stealing queue or implement one? If you just want to use one, then the choices above are good starting points; simply scheduling a 'task' in any of them will suffice.

As BlueRaja said, the task_group & structured_task_group classes in PPL will do this; note that these classes are also available in the latest version of Intel's TBB. The parallel loops (parallel_for, parallel_for_each) are implemented with work stealing as well.
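Scheduling a task through one of those classes really is that simple. A minimal sketch using TBB's task_group (PPL's concurrency::task_group has essentially the same shape); the recursive fib here is just an illustrative stand-in for real work:

#include <tbb/task_group.h>
#include <cstdio>

// Each task spawned via run() lands on the calling worker's deque and may be
// stolen by idle workers; the work-stealing machinery stays under the hood.
long fib(long n) {
  if (n < 2) return n;
  long a = 0, b = 0;
  tbb::task_group g;
  g.run([&] { a = fib(n - 1); });   // may execute on another worker thread
  b = fib(n - 2);                   // keep some of the work on this thread
  g.wait();                         // join the spawned task before combining
  return a + b;
}

int main() {
  std::printf("%ld\n", fib(30));
}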

If you must look at source rather than use an implementation, TBB is open source and Microsoft ships the sources for its CRT, so you can go spelunking.

You can also look on Joe Duffy's blog for a C# implementation (but it's C# and the memory model is different).

-Rick

Scarlettscarp answered 31/1, 2010 at 3:42 Comment(0)
1

The structured_task_group class of the PPL uses a work stealing queue for its implementation. If you need a WSQ for threading, I would recommend that.
If you are actually looking for the source, I don't know if the code is given in ppl.h or if there is a precompiled object; I will have to check when I get home tonight.

Despond answered 25/1, 2010 at 17:10 Comment(0)
1

The closest implementation of this work stealing algorithm I've found is something called Wool by Karl-Filip Faxén. src / report / comparison

Site answered 20/1, 2014 at 16:59 Comment(0)
1

OpenMP may very well support work stealing, although it's called recursive parallelism

OpenMP forum post

The OpenMP specification defines tasking constructs (which can be nested, so are very suitable for recursive parallelism) but does not specify the details of how they are implemented. OpenMP implementations, including gcc, typically use some form of work stealing for tasks, though the exact algorithm (and the resulting performance) may vary!

See #pragma omp task and #pragma omp taskwait
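A minimal recursive sketch of those two constructs (compile with OpenMP enabled, e.g. -fopenmp; whether the runtime actually steals is left to the implementation, as the quote above notes):

#include <cstdio>

long fib(long n) {
  if (n < 2) return n;
  long a, b;
  #pragma omp task shared(a)    // child task, eligible to be run/stolen by idle threads
  a = fib(n - 1);
  b = fib(n - 2);               // the parent keeps the other half of the work
  #pragma omp taskwait          // wait for the child task before combining results
  return a + b;
}

int main() {
  long r = 0;
  #pragma omp parallel          // create the team of worker threads
  #pragma omp single            // a single thread seeds the recursion
  r = fib(30);
  std::printf("%ld\n", r);
}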

Update

Chapter 9 of the book C++ Concurrency in Action describes how to implement "work stealing for pool threads". I haven't read/implemented it myself but it doesn't look too difficult.

Cromwell answered 17/11, 2015 at 15:30 Comment(0)
0

I have ported this C project to C++.

The original Steal may experience a dirty read when the array is expanded. I tried to fix the bug, but eventually gave up because I didn't actually need a dynamically growing stack. Instead of trying to allocate more space, the Push method simply returns false when the stack is full. The caller can then perform a spin-wait, i.e. while(!stack->Push(value)){}.

#pragma once
#include <atomic>
#include <cstddef> // size_t

  // A lock-free stack.
  // Push = single producer
  // Pop = single consumer (same thread as push)
  // Steal = multiple consumer

  // All methods, including Push, may fail. Re-issue the request
  // if that occurs (spinwait).

  template<class T, size_t capacity = 131072>
  class WorkStealingStack {

  public:
    inline WorkStealingStack() {
      _top = 1;
      _bottom = 1;
    }

    WorkStealingStack(const WorkStealingStack&) = delete;

    inline ~WorkStealingStack()
    {

    }

    // Single producer
    inline bool Push(const T& item) {
      auto oldtop = _top.load(std::memory_order_relaxed);
      auto oldbottom = _bottom.load(std::memory_order_relaxed);
      auto numtasks = oldbottom - oldtop;

      if (
        oldbottom > oldtop && // size_t is unsigned, validate the result is positive
        numtasks >= capacity - 1) {
        // The caller can decide what to do, they will probably spinwait.
        return false;
      }

      _values[oldbottom % capacity].store(item, std::memory_order_relaxed);
      _bottom.fetch_add(1, std::memory_order_release);
      return true;
    }

    // Single consumer
    inline bool Pop(T& result) {

      size_t oldtop, oldbottom, newtop, newbottom;

      oldbottom = _bottom.fetch_sub(1, std::memory_order_release);
      oldtop = _top.load(std::memory_order_acquire);
      newtop = oldtop + 1;
      newbottom = oldbottom - 1;

      // Bottom has wrapped around.
      if (oldbottom < oldtop) {
        _bottom.store(oldtop, std::memory_order_relaxed);
        return false;
      }

      // The queue is empty.
      if (oldbottom == oldtop) {
        _bottom.fetch_add(1, std::memory_order_release);
        return false;
      }

      // Make sure that we are not contending for the item.
      if (newbottom == oldtop) {
        auto ret = _values[newbottom % capacity].load(std::memory_order_relaxed);
        if (!_top.compare_exchange_strong(oldtop, newtop, std::memory_order_acquire)) {
          _bottom.fetch_add(1, std::memory_order_release);
          return false;
        }
        else {
          result = ret;
          _bottom.store(newtop, std::memory_order_release);
          return true;
        }
      }

      // It's uncontended.
      result = _values[newbottom % capacity].load(std::memory_order_acquire);
      return true;
    }

    // Multiple consumer.
    inline bool Steal(T& result) {
      size_t oldtop, newtop, oldbottom;

      oldtop = _top.load(std::memory_order_acquire);
      oldbottom = _bottom.load(std::memory_order_relaxed);
      newtop = oldtop + 1;

      if (oldbottom <= oldtop)
        return false;

      // Make sure that we are not contending for the item.
      if (!_top.compare_exchange_strong(oldtop, newtop, std::memory_order_acquire)) {
        return false;
      }

      result = _values[oldtop % capacity].load(std::memory_order_relaxed);
      return true;
    }

  private:

    // Circular array
    std::atomic<T> _values[capacity];
    std::atomic<size_t> _top; // queue
    std::atomic<size_t> _bottom; // stack
  };

Full Gist (including unit tests). I have only run the tests on a strongly-ordered architecture (x86/64), so as far as weakly-ordered memory models go, your mileage may vary if you try to use this on e.g. Neon/PPC.
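A hypothetical usage sketch, following the ownership rules in the header comment (Push/Pop only on the owning thread, Steal from any other thread):

#include <thread>

WorkStealingStack<int> stack;   // shared between the owner and the thief

void owner() {
  for (int i = 0; i < 1000; ++i)
    while (!stack.Push(i)) { /* spin: stack is full, wait for consumers */ }
  int v;
  while (stack.Pop(v)) { /* process v on the owning thread */ }
}

void thief() {
  int v;
  // A real thief would retry or back off; this one gives up on the first miss.
  while (stack.Steal(v)) { /* process the stolen v */ }
}

int main() {
  std::thread a(owner), b(thief);
  a.join();
  b.join();
}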

Executioner answered 30/12, 2014 at 10:34 Comment(0)
-1

I don't think JobSwarm uses work stealing, but it's a first step. I'm not aware of other open source libraries for this purpose.

Anticathexis answered 20/1, 2010 at 15:18 Comment(0)
