Does a multiple producer single consumer lock-free queue exist for c++? [closed]

Asked 18/1, 2012 at 22:23 Answered 4/8, 2012 at 2:25

The more I read the more confused I become... I would have thought it trivial to find a formally correct MPSC queue implemented in C++.

Every time I find another stab at it, further research seems to suggest there are issues such as ABA or other subtle race conditions.

Many talk of the need for garbage collection. This is something I want to avoid.

Is there an accepted correct open-source implementation out there?

Rogation answered 18/1, 2012 at 22:23 Comment(11)

My advice is to look for alternatives. For performance, lock free queues are not that great. It's just not possible to make anything fast whenever you have several threads writing to the same cache line. Using separate SPSC queue for each producer is way faster. The downside is that you lose the order of items. If you need an ordered queue and want to save yourself a lot of headache, just use a spinlock. In practice it's almost as good as lockfree and hundreds of times faster than e.g. Win32 critical sections. – Inartificial 19/1, 2012 at 3:6

Did you consider using ring buffer instead of queue? There are certain performance benefits - fixed memory location with slots allocated once, very simple counters to point to tail/head with no locking necessary, predictable cache-friendly memory access pattern etc. – Llanes 19/1, 2012 at 12:16

@Llanes could you elaborate please? Do you have a working example by any chance? – Rogation 19/1, 2012 at 17:51

Added an answer with link to distruptor. I don't know if you will like my implementation, it's definitely simpler than "real thing" but not portable, and there are dependencies on few obvious bits of code not attached. Probably best treat it as an ilustration of what I meant above :) – Llanes 19/1, 2012 at 22:44

@Llanes thanks - will def go over your example this weekend! – Rogation 20/1, 2012 at 14:30

@Inartificial excuse the lateness of my response (I have a valid excuse!) I don't think it's possible to use multiple consumers: Some colour on the scenario:... I welcome any suggestions/comments. This is for an algorithmic financial trading system. The typical run of things is that there is a dominant producer (market data), then there is a non-uniform periodic producer (response from orders sent as a result of market data plus asynchronous trade indications), and finally there is an infrequent producer (command-and-control parameters from a client app) - all need to be routed to the same consumer – Rogation 21/6, 2012 at 11:31

@Llanes - the disruptor definitely seems to be a candidate - thanks a lot! – Rogation 21/6, 2012 at 11:35

what about boost queue? boost.org/doc/libs/1_54_0/doc/html/lockfree/examples.html – Treen 26/9, 2013 at 13:18

@javapowered indeed Boost's lockfree queue is the perfect candidate. It didn't exist at the time I asked this question, but I have since started using it. – Rogation 26/9, 2013 at 21:40

@Timo:"The downside is that you lose the order of items" = two items pushed by the same producer can appear in reverse order in the consumer thread ? – Kunstlied 14/12, 2014 at 22:40

@Guillaume07 no the same producer will always push to the same queue, so the items it pushes will be in the same order in its queue. The downside is there is a queue per producer, and the consumer has no way of knowing the order in which each producer pushed items to its queue relative to other producers – Rogation 15/12, 2014 at 23:0

You may want to check disruptor; it's available in C++ here: http://lmax-exchange.github.io/disruptor/

You can also find explanation how it works here on stackoverflow Basically it's circular buffer with no locking, optimized for passing FIFO messages between threads in a fixed-size slots.

Here are two implementations which I found useful: Lock-free Multi-producer Multi-consumer Queue on Ring Buffer @ NatSys Lab. Blog and
Yet another implementation of a lock-free circular array queue @ CodeProject

NOTE: the code below is incorrect, I leave it only as an example how tricky these things can be.

If you don't like the complexity of google version, here is something similar from me - it's much simpler, but I leave it as an exercise to the reader to make it work (it's part of larger project, not portable at the moment). The whole idea is to maintain cirtular buffer for data and a small set of counters to identify slots for writing/written and reading/read. Since each counter is in its own cache line, and (normally) each is only atomically updated once in the live of a message, they can all be read without any synchronisation. There is one potential contention point between writing threads in post_done, it's required for FIFO guarantee. Counters (head_, wrtn_, rdng_, tail_) were selected to ensure correctness and FIFO, so dropping FIFO would also require change of counters (and that might be difficult to do without sacrifying correctness). It is possible to slightly improve performance for scenarios with one consumer, but I would not bother - you would have to undo it if other use cases with multiple readers are found.

On my machine latency looks like following (percentile on left, mean within this percentile on right, unit is microsecond, measured by rdtsc):

    total=1000000 samples, avg=0.24us
    50%=0.214us, avg=0.093us
    90%=0.23us, avg=0.151us
    99%=0.322us, avg=0.159us
    99.9%=15.566us, avg=0.173us

These results are for single polling consumer, i.e. worker thread calling wheel.read() in tight loop and checking if not empty (scroll to bottom for example). Waiting consumers (much lower CPU utilization) would wait on event (one of acquire... functions), this adds about 1-2us to average latency due to context switch.

Since there is verly little contention on read, consumers scale very well with number of worker threads, e.g. for 3 threads on my machine:

    total=1500000 samples, avg=0.07us
    50%=0us, avg=0us
    90%=0.155us, avg=0.016us
    99%=0.361us, avg=0.038us
    99.9%=8.723us, avg=0.044us

Patches will be welcome :)

// Copyright (c) 2011-2012, Bronislaw (Bronek) Kozicki
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#pragma once

#include <core/api.hxx>
#include <core/wheel/exception.hxx>

#include <boost/noncopyable.hpp>
#include <boost/type_traits.hpp>
#include <boost/lexical_cast.hpp>
#include <typeinfo>

namespace core { namespace wheel
{
  struct bad_size : core::exception
  {
    template<typename T> explicit bad_size(const T&, size_t m)
      : core::exception(std::string("Slot capacity exceeded, sizeof(")
                  + typeid(T).name()
                  + ") = "
                  + boost::lexical_cast<std::string>(sizeof(T))
                  + ", capacity = "
                  + boost::lexical_cast<std::string>(m)
                  )
    {}
  };        

  // inspired by Disruptor
  template <typename Header>
  class wheel : boost::noncopyable
  {
    __declspec(align(64))
    struct slot_detail
    {
      // slot write: (memory barrier in wheel) > post_done > (memory barrier in wheel)
      // slot read:  (memory barrier in wheel) > read_done > (memory barrier in wheel)

      // done writing or reading, must update wrtn_ or tail_ in wheel, as appropriate
      template <bool Writing>
      void done(wheel* w)
      {
        if (Writing)
          w->post_done(sequence);
        else
          w->read_done();
      }

      // cache line for sequence number and header
      long long sequence;
      Header header;

      // there is no such thing as data type with variable size, but we need it to avoid thrashing
      // cache - so we invent one. The memory is reserved in runtime and we simply go beyond last element.
      // This is well into UB territory! Using template parameter for this is not good, since it
      // results in this small implementation detail leaking to all possible user interfaces.
      __declspec(align(8))
      char data[8];
    };

    // use this as a storage space for slot_detail, to guarantee 64 byte alignment
    _declspec(align(64))
    struct slot_block { long long padding[8]; };

  public:
    // wrap slot data to outside world
    template <bool Writable>
    class slot
    {
      template<typename> friend class wheel;

      slot& operator=(const slot&); // moveable but non-assignable

      // may only be constructed by wheel
      slot(slot_detail* impl, wheel<Header>* w, size_t c)
        : slot_(impl) , wheel_(w) , capacity_(c)
      {}

    public:
      slot(slot&& s)
        : slot_(s.slot_) , wheel_(s.wheel_) , capacity_(s.capacity_)
      {
        s.slot_ = NULL;
      }

      ~slot()
      {
        if (slot_)
        {
          slot_->done<Writable>(wheel_);
        }
      }

      // slot accessors - use Header to store information on what type is actually stored in data
      bool empty() const          { return !slot_; }
      long long sequence() const  { return slot_->sequence; }
      Header& header()            { return slot_->header; }
      char* data()                { return slot_->data; }

      template <typename T> T& cast()
      {
        static_assert(boost::is_pod<T>::value, "Data type must be POD");
        if (sizeof(T) > capacity_)
          throw bad_size(T(), capacity_);
        if (empty())
          throw no_data();
        return *((T*) data());
      }

    private:
      slot_detail*    slot_;
      wheel<Header>*  wheel_;
      const size_t    capacity_;
    };

  private:
    // dynamic size of slot, with extra capacity, expressed in 64 byte blocks
    static size_t sizeof_slot(size_t s)
    {
      size_t m = sizeof(slot_detail);
      // add capacity less 8 bytes already within sizeof(slot_detail)
      m += max(8, s) - 8;
      // round up to 64 bytes, i.e. alignment of slot_detail
      size_t r = m & ~(unsigned int)63;
      if (r < m)
        r += 64;
      r /= 64;
      return r;
    }

    // calculate actual slot capacity back from number of 64 byte blocks
    static size_t slot_capacity(size_t s)
    {
      return s*64 - sizeof(slot_detail) + 8;
    }

    // round up to power of 2
    static size_t round_size(size_t s)
    {
      // enfore minimum size
      if (s <= min_size)
        return min_size;

      // find rounded value
      --s;
      size_t r = 1;
      while (s)
      {
        s >>= 1;
        r <<= 1;
      };
      return r;
    }

    slot_detail& at(long long sequence)
    {
      // find index from sequence number and return slot at found index of the wheel
      return *((slot_detail*) &wheel_[(sequence & (size_ - 1)) * blocks_]);
    }

  public:
    wheel(size_t capacity, size_t size)
      : head_(0) , wrtn_(0) , rdng_(0) , tail_(0) , event_()
      , blocks_(sizeof_slot(capacity)) , capacity_(slot_capacity(blocks_)) , size_(round_size(size))
    {
      static_assert(boost::is_pod<Header>::value, "Header type must be POD");
      static_assert(sizeof(slot_block) == 64, "This was unexpected");

      wheel_ = new slot_block[size_ * blocks_];
      // all slots must be initialised to 0
      memset(wheel_, 0, size_ * 64 * blocks_);
      active_ = 1;
    }

    ~wheel()
    {
      stop();
      delete[] wheel_;
    }

    // all accessors needed
    size_t capacity() const { return capacity_; }   // capacity of a single slot
    size_t size() const     { return size_; }       // number of slots available
    size_t queue() const    { return (size_t)head_ - (size_t)tail_; }
    bool active() const     { return active_ == 1; }

    // enough to call it just once, to fine tune slot capacity
    template <typename T>
    void check() const
    {
      static_assert(boost::is_pod<T>::value, "Data type must be POD");
      if (sizeof(T) > capacity_)
        throw bad_size(T(), capacity_);
    }

    // stop the wheel - safe to execute many times
    size_t stop()
    {
      InterlockedExchange(&active_, 0);
      // must wait for current read to complete
      while (rdng_ != tail_)
        Sleep(10);

      return size_t(head_ - tail_);
    }

    // return first available slot for write
    slot<true> post()
    {
      if (!active_)
        throw stopped();

      // the only memory barrier on head seq. number we need, if not overflowing
      long long h = InterlockedIncrement64(&head_);
      while(h - (long long) size_ > tail_)
      {
        if (InterlockedDecrement64(&head_) == h - 1)
          throw overflowing();

        // protection against case of race condition when we are overflowing
        // and two or more threads try to post and two or more messages are read,
        // all at the same time. If this happens we must re-try, otherwise we
        // could have skipped a sequence number - causing infinite wait in post_done
        Sleep(0);
        h = InterlockedIncrement64(&head_);
      }

      slot_detail& r = at(h);
      r.sequence = h;

      // wrap in writeable slot
      return slot<true>(&r, this, capacity_);
    }

    // return first available slot for write, nothrow variant
    slot<true> post(std::nothrow_t)
    {
      if (!active_)
        return slot<true>(NULL, this, capacity_);

      // the only memory barrier on head seq. number we need, if not overflowing
      long long h = InterlockedIncrement64(&head_);
      while(h - (long long) size_ > tail_)
      {
        if (InterlockedDecrement64(&head_) == h - 1)
          return slot<true>(NULL, this, capacity_);

        // must retry if race condition described above
        Sleep(0);
        h = InterlockedIncrement64(&head_);
      }

      slot_detail& r = at(h);
      r.sequence = h;

      // wrap in writeable slot
      return slot<true>(&r, this, capacity_);
    }

    // read first available slot for read
    slot<false> read()
    {
      slot_detail* r = NULL;
      // compare rdng_ and wrtn_ early to avoid unnecessary memory barrier
      if (active_ && rdng_ < wrtn_)
      {
        // the only memory barrier on reading seq. number we need
        const long long h = InterlockedIncrement64(&rdng_);
        // check if this slot has been written, step back if not
        if (h > wrtn_)
          InterlockedDecrement64(&rdng_);
        else
          r = &at(h);
      }

      // wrap in readable slot
      return slot<false>(r , this, capacity_);
    }

    // waiting for new post, to be used by non-polling clients
    void acquire()
    {
      event_.acquire();
    }

    bool try_acquire()
    {
      return event_.try_acquire();
    }

    bool try_acquire(unsigned long timeout)
    {
      return event_.try_acquire(timeout);
    }

    void release()
    {}

  private:
    void post_done(long long sequence)
    {
      const long long t = sequence - 1;

      // the only memory barrier on written seq. number we need
      while(InterlockedCompareExchange64(&wrtn_, sequence, t) != t)
        Sleep(0);

      // this is outside of critical path for polling clients
      event_.set();
    }

    void read_done()
    {
      // the only memory barrier on tail seq. number we need
      InterlockedIncrement64(&tail_);
    }

    // each in its own cache line
    // head_ - wrtn_ = no. of messages being written at this moment
    // rdng_ - tail_ = no. of messages being read at the moment
    // head_ - tail_ = no. of messages to read (including those being written and read)
    // wrtn_ - rdng_ = no. of messages to read (excluding those being written or read)
    __declspec(align(64)) volatile long long head_; // currently writing or written
    __declspec(align(64)) volatile long long wrtn_; // written
    __declspec(align(64)) volatile long long rdng_; // currently reading or read
    __declspec(align(64)) volatile long long tail_; // read
    __declspec(align(64)) volatile long active_;    // flag switched to 0 when stopped

    __declspec(align(64))
    api::event event_;          // set when new message is posted
    const size_t blocks_;       // number of 64-byte blocks in a single slot_detail
    const size_t capacity_;     // capacity of data() section per single slot. Initialisation depends on blocks_
    const size_t size_;         // number of slots available, always power of 2
    slot_block* wheel_;
  };
}}

Here is what polling consumer worker thread may look like:

  while (wheel.active())
  {
    core::wheel::wheel<int>::slot<false> slot = wheel.read();
    if (!slot.empty())
    {
      Data& d = slot.cast<Data>();
      // do work
    }
    // uncomment below for waiting consumer, saving CPU cycles
    // else
    //   wheel.try_acquire(10);
  }

Edited added consumer example

Llanes answered 19/1, 2012 at 21:51 Comment(1)

please can you explain what Header and Data are / what the difference are? If I want to store 3 * 64 bit words in each slot (that is the entirety of the payload) how would I use this? – Rogation 26/6, 2012 at 1:46

The most suitable implementation depends on the desired properties of a queue. Should it be unbounded or a bounded one is fine? Should it be linearizable, or less strict requirements would be fine? How strong FIFO guarantees you need? Are you willing to pay the cost of reverting the list by the consumer (there exists a very simple implementation where the consumer grabs the tail of a single-linked list, thus getting at once all items put by producers till the moment)? Should it guarantee that no thread is ever blocked, or tiny chances to get some thread blocked are ok? And etc.

Some useful links:
Is multiple-producer, single-consumer possible in a lockfree setting?
http://www.1024cores.net/home/lock-free-algorithms/queues
http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
https://groups.google.com/group/comp.programming.threads/browse_frm/thread/33f79c75146582f3

Hope that helps.

Sansom answered 18/1, 2012 at 23:22 Comment(1)

I created a version based on Dmitry Vyukov's implementation: github.com/samanbarghi/MPSCQ – Dupre 19/9, 2016 at 22:52

Below is the technique I used for my Cooperative Multi-tasking / Multi-threading library (MACE) http://bytemaster.github.com/mace/. It has the benefit of being lock-free except for when the queue is empty.

struct task {
   boost::function<void()> func;
   task* next;
};


boost::mutex                     task_ready_mutex;
boost::condition_variable        task_ready;
boost::atomic<task*>             task_in_queue;

// this can be called from any thread
void thread::post_task( task* t ) {
     // atomically post the task to the queue.
     task* stale_head = task_in_queue.load(boost::memory_order_relaxed);
     do { t->next = stale_head;
     } while( !task_in_queue.compare_exchange_weak( stale_head, t, boost::memory_order_release ) );

   // Because only one thread can post the 'first task', only that thread will attempt
   // to aquire the lock and therefore there should be no contention on this lock except
   // when *this thread is about to block on a wait condition.  
    if( !stale_head ) { 
        boost::unique_lock<boost::mutex> lock(task_ready_mutex);
        task_ready.notify_one();
    }
}

// this is the consumer thread.
void process_tasks() {
  while( !done ) {
   // this will atomically pop everything that has been posted so far.
   pending = task_in_queue.exchange(0,boost::memory_order_consume);
   // pending is a linked list in 'reverse post order', so process them
   // from tail to head if you want to maintain order.

   if( !pending ) { // lock scope
      boost::unique_lock<boost::mutex> lock(task_ready_mutex);                
      // check one last time while holding the lock before blocking.
      if( !task_in_queue ) task_ready.wait( lock );
   }
 }

Brakesman answered 4/8, 2012 at 2:25 Comment(2)

I believe that it should be pending = task_in_queue.exchange(0, boost::memory_order_acquire); since in the ISO C++11 standard is stated 29.3.2 "An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A." – Earplug 30/8, 2013 at 2:17

I wouldn't say its lock free when the "queue" isn't empty, due to memory allocation/reclamation – Yearbook 9/3, 2017 at 13:48

-1

I'm guessing no such thing exists - and if it does, it either isn't portable or isn't open source.

Conceptually, you are trying to control two pointers simultaneously: the tail pointer and the tail->next pointer. That can't generally be done with just lock-free primitives.

Confiteor answered 18/1, 2012 at 22:40 Comment(1)

Your guess is incorrect. Producers just need to move a tail. What you are describing is an intrusive queue. In that case, you could update the tail->next and then atomically move the tail. If you didn't succeed, loop again. – Auerbach 1/3, 2017 at 5:49

Recommended topics

Hot tags