Is there a production ready lock-free queue or hash implementation in C++ [closed]
F

15

92

I've been googling quite a bit for a lock-free queue in C++. I found some code and some trials - but nothing that I was able to compile. A lock-free hash would also be welcome.

SUMMARY: So far I have no positive answer. There is no "production ready" library, and amazingly none of the existing libraries conforms to the API of the STL containers.

Fortyfive answered 22/7, 2009 at 9:8 Comment(7)
Visual Studio 2010 contains a lock free queue in <concurrent_queue.h>Astray
And there is a hash_map and unordered_map at code.msdn.com/concrtextrasAstray
I read the documentation about concurrent_queue.h at http://msdn.microsoft.com/en-us/library/ee355358.aspx. It says nothing about locks. Where do I find such information?Fortyfive
concurrent_queue is lock-free, look at overview documentation for the concurrency runtime on msdn.Astray
Note that, curiously, the term "lock-free" does not necessarily mean that there are no locks. See en.wikipedia.org/wiki/Non-blocking_algorithm for one definition.Lithopone
Boost 1.53 has Lockfree library.Faceharden
Has somebody mentioned libcds.sourceforge.net (pretty new I guess)? Moreover, notice that it doesn't make sense for a lock-free queue to comply with the STL API, because front() and pop() need to be combined to be meaningful.Wideawake
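On that last point: a minimal sketch (lock-based for brevity; the class and method names here are illustrative, not taken from any library mentioned in this thread) of why a concurrent queue combines front() and pop() into a single try_pop() operation:

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// With separate front()/pop(), another consumer can pop between the two
// calls, so a concurrent queue must combine them into one atomic operation.
template <typename T>
class ConcurrentQueue {
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(v);
    }
    // Combined "front + pop": returns false on empty instead of exposing a
    // front() whose result could be stale by the time pop() is called.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::queue<T> q_;
};
```

With separate front() and pop(), another consumer could pop between the two calls and front()'s return value would refer to a removed element; combining them makes the check-and-remove a single atomic step.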
T
45

As of 1.53, boost provides a set of lock free data structures, including queues, stacks and single-producer/single-consumer queues (i.e. ring buffers).

Tbilisi answered 18/2, 2013 at 12:53 Comment(3)
boost::lockfree::queue works only with POD types, and most of the time it is not quite useful. I am sure that if there were a way to provide a more flexible structure, boost would have introduced it.Sere
@Sere where is the problem with that? If you want to pass anything else, especially objects with opaque possibly blocking side effects, you haven't understood the whole purpose of lock free design.Paisley
Why boost provides lock-free queue and stack, both based on a linked list. But they are not providing lock-free linked list?Reiko
R
25

The starting point would be either of Herb Sutter's DDJ articles, for either a single producer and consumer or multiple ones. The code he gives (in-line, starting on the second page of each article) uses the C++0x-style atomic<T> template type, which you can imitate using the Boost interprocess library.

The boost code is buried in the depths of the interprocess library, but having read through the appropriate header file (atomic.hpp) the implementations for the necessary compare-and-swap operations on the systems I am familiar with look sound.
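For reference, the compare-and-swap primitive those articles build on has since been standardized; a minimal sketch against the C++11 <atomic> header rather than the Boost.Interprocess internals (the helper name is illustrative, not from the articles):

```cpp
#include <atomic>
#include <cassert>

// CAS: store `desired` into `head` only if it still holds `expected`.
// In a real lock-free loop, a failure would reload the current value
// (compare_exchange writes it back into `expected`) and retry.
inline bool try_advance(std::atomic<int>& head, int expected, int desired) {
    return head.compare_exchange_strong(expected, desired);
}
```

For example, with head holding 5, try_advance(head, 5, 7) succeeds and stores 7, while a subsequent try_advance(head, 5, 9) fails because head no longer holds 5.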

Rusticate answered 22/7, 2009 at 9:14 Comment(9)
Steve, I'm also interested in Boost's atomic implementations, but they seem to reside in Interprocess' detail/ and aren't documented. Are they safe to use anyhow? Thanks!Haleyhalf
I know Herb Sutter's articles very well - where have you found the sources? They are not published by DDJ nor on his site (or maybe I am blind?).Fortyfive
The code is in-line in those articles starting on their respective second pages.Rusticate
"Imitating" an atomic template is not what I meant by production ready. Obviously that's why Herb did not publish a source file. Did you (or anybody else) extract a working implementation from the article?Fortyfive
Here's a transcription of Sutter's single-producer/single-consumer queue -- tinesware.blogspot.com/2009/07/…Rusticate
This one seems to work, but single-producer/single-consumer does not do the job. What is needed are at least multiple producers.Fortyfive
If you want truly lock-free code, for multiple producers or consumers, then I'm out. Sutter's multiple-producer queue example isn't lock-free - there is a lock to serialize producers, and a lock to serialize consumers. If you can find one, I'd be interested in that as well.Rusticate
There is a boost::lockfree project at tim.klingt.org/git?p=boost_lockfree.git; that you can look at. One of its goals is to provide a non ::details:: version of atomic primitives.Wastage
The links above are broken. Archive.org has the articles, e.g.: web.archive.org/web/20120331121935/http://drdobbs.com/parallel/…Bromleigh
B
21

Yes!

I wrote a lock-free queue. It has Features™:

  • Fully wait-free (no CAS loops)
  • Super fast (over a hundred million enqueue/dequeue operations per second)
  • Uses C++11 move semantics
  • Grows as needed (but only if you want it to)
  • Does lock-free memory management for the elements (using pre-allocated contiguous blocks)
  • Stand-alone (two headers plus a license and readme)
  • Compiles under MSVC2010+, Intel ICC 13, and GCC 4.7.2 (and should work under any C++11 fully-compliant compiler)

It's available on GitHub under the simplified BSD license (feel free to fork it!).

Caveats:

  • Only for single-producer single-consumer architecture (i.e. two threads)
  • Thoroughly tested on x86(-64); it should work on ARM, PowerPC, and other CPUs where aligned native-sized integer and pointer loads and stores are naturally atomic, but it has not been field-tested on non-x86 CPUs (if someone has one to test it on, let me know)
  • No idea if any patents are infringed upon (use at your own risk, etc.). Note that I did design and implement it myself from scratch.
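For illustration, the core of a bounded wait-free SPSC queue can be sketched in portable C++11. This is a minimal fixed-capacity sketch of the general technique, not the library's actual implementation (which adds growth, move semantics, and cache-line padding):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal fixed-capacity SPSC ring buffer: exactly one thread calls enqueue,
// exactly one other thread calls try_dequeue. The release store on each
// index pairs with the acquire load on the other side, publishing the data
// without any locks or CAS loops. Capacity is N - 1 elements.
template <typename T, size_t N>
class SpscQueue {
public:
    bool enqueue(const T& v) {                         // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                              // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }
    bool try_dequeue(T& out) {                         // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                              // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

Each side writes only its own index and reads the other's, which is why no compare-and-swap is needed and every operation completes in a bounded number of steps (i.e. wait-free).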
Bartlet answered 6/5, 2013 at 21:56 Comment(17)
Sounds very good but multiple producers and/or multiple consumers are needed to leverage real multi threading.Fortyfive
@RED: Depends on the application. Single-producer/-consumer was all I needed, so it's all I built ;-)Bartlet
@Cameron: Great stuff! Have you benchmarked your queue against Facebook's folly ProducerConsumerQueue? I have done it using your benchmark code and it appears to dramatically outperform both your RWQ and Dmitry's SPSC. I am on OS X 10.8.3 with a 3.06 GHz Core 2 Duo (T9900) and compiled the code using Clang with -O3. I did this because I am currently looking at a single-producer/single-consumer queue for one of my projects and I considered yours a candidate :)Isbell
@André: I just checked now :-) Facebook's folly one is slightly faster than mine when dequeueing from an empty queue, and slightly slower when dequeueing from a non-empty queue on a single thread. All other operations are almost exactly the same speed (this was with g++ -O3 on a VM). What size are you using for the folly queue? (I used MAX.) Mine and Dmitry's both grow as needed, whereas the folly one is fixed -- and of course, the fastest enqueue operation is when there's no room and it simply fails. Looking at the code, folly's seems to be using the same ideas as mine, but w/o resizeability.Bartlet
@André: Oh, one more thing I forgot to mention -- with my benchmark code, the "Raw empty remove" benchmark does by far the most iterations (since it's so simple, it takes more to get a measurable result), which tends to disproportionately affect the final "average ops/s" numbers. The multipliers (and flat timing values) are generally more useful. Anyway, at these speeds, all of these queues will be quite fast enough if they're actually being used for something meatier than my silly synthetic benchmarks ;-)Bartlet
@Cameron: I used MAX as well for the size (actually MAX + 1). I know that folly is fixed size, but I tested yours with MAX as initial size and folly still outperformed yours in nearly every metric. I may have made some mistake though. The average ops/s were in the thousands whereas yours and Dmitry's are in the 70's..100's. The multipliers were also all below 1x when comparing with your RWQ. Anyways, I think you are right as any of these queues will be super efficient and will be spend most of the time waiting to do stuff ;) Could you update your bench repo with folly in it too? CheersIsbell
@André: Sure, I'll update my benchmark repo. I'm curious as to how your benchmark results are so different from mine! 1000s of millions of ops/s on average is hitting up against the CPU instructions/s speed (GHz) -- have you checked the assembly (objdump -d -C [-S] nixbench) to make sure the benchmark's not somehow being circumvented entirely by the optimizer?Bartlet
@André: OK, I've updated the benchmark (results).Bartlet
@Cameron: I found it really strange as well that the difference was so huge. Here you can check some example runs from my code. Unfortunately I am a total noob at analyzing assembly :( I have made a dump with otool -tV (I am in OS X) and you can check it out here. Maybe you can find out if the code is being circumvented or if I screwed up when adding it to the bench. Meanwhile I will pull your bench repo and run it again here.Isbell
@Cameron: It appears that I had a bug somewhere in my initial code affecting the ops/s, that was giving those awkward results. I have run your new bench code and I got more "reasonable" results. You can check them here. I had to use TQueue q(MAX + 1); instead of TQueue q(MAX); since folly's only has size - 1 usable slots and one assert() would fail otherwise. Anyways, folly still outperforms both yours and Dmitry's queues and although the ops/s now appear more realistic, the factors remain almost always below 1x. I guess we have a winner :)Isbell
@André: Ah, that explains it. You should define NDEBUG when running the benchmarks, though. I've fixed the off-by-one bug in the repo. Keep in mind different platforms/compilers will give significantly different results :-) Mine is slower than Dmitry's under ICC on my Intel laptop, for example, but much faster on a Linode VM with GCC. A single instruction extra or less can increase/decrease the running time by a significant margin (e.g. 25%), because the time scales are so small.Bartlet
@Cameron: You are right, I forgot to define NDEBUG. I have defined it, re-run the benches and updated results in the previous gist. The results are better for all queues and are more stable. Yes, I am aware of the platform/compiler variance but since I will be using Clang for my specific project, I am tending towards folly's based on current results :). However, I will have to run these benches on ARM in a near future to be certain of it, because they may be quite different from Intel x64's. I will keep you posted!Isbell
@André: Fascinating, it's definitely significantly faster in your environment. Use what works best! :-) I only wrote mine because I couldn't find anything else that had all the features I wanted (fast, pre-allocated to avoid calls to malloc to make it real-time friendly, can grow if needed, C++11 move semantics w/ proper ctor/dtor support).Bartlet
@Cameron: I admire your effort and dedication to write such a complex structure from scratch using these advanced techniques. The fact that your queue is able to grow if needed is a very good feature that should not be overlooked for some types of situations. I shall keep it in my toolbox if such a situation presents itself ;) Keep up the good work!Isbell
@André: I made one last tweak to my queue; it may affect the performance (though folly's will probably still be faster). Let me know how it goes on ARM! And good luck with your application :-)Bartlet
@Cameron, have you thought about making the elements come out of your lock-free queue in exactly the same order they were put in?Fulgor
@Vladimir: Yes, that would require rewriting the queue from scratch (new algorithm). Maybe another time :-)Bartlet
B
18

Facebook's Folly seems to have lock free data structures based on C++11 <atomic>:

I would dare to say these are currently used in production, so I guess they could safely be used in other projects.

Cheers!

Brummett answered 15/4, 2013 at 15:51 Comment(1)
They also have an MPMC queue that they claim is "optionally blocking." It does not appear to be part of their regular documentation, so I am unsure whether its use is advised.Delorisdelorme
A
13

There is such a library, but it's in C.

Wrapping to C++ should be straightforward.

liblfds

Absorb answered 17/10, 2009 at 10:18 Comment(0)
A
11

boost.lockfree is an attempt to create C++ implementations of lock-free stack and FIFO classes.

public git repository

Ashla answered 28/8, 2009 at 10:21 Comment(2)
Documentation: tim.klingt.org/boost_lockfreeAshla
This is part of standard boost now - check out the doc at boost.org/doc/libs/1_53_0/doc/html/lockfree.htmlWinker
F
8

After having checked most of the given answers, I can only state:

The answer is NO.

There is no such thing that could be used right out of the box.

Fortyfive answered 27/8, 2009 at 19:51 Comment(4)
100% correct. I came to the same result with help from the comp.programming.threads newsgroup. One reason is that the area of lock-free data structures is a patent minefield. So even commercial libs like Intel's are avoiding it.Radioluminescence
This is C, not C++. Please read questions before down voting.Fortyfive
Apologies. I note SO will not let me undo my vote as it feels the vote is too old. I think the SO developers need more to do - they seem to be adding increasing numbers of unhelpful behaviours.Absorb
Why is this answer getting so many upvotes? The question could easily be edited, or this could be a comment.Equanimous
K
6

The closest thing I am aware of is Windows Interlocked Singly Linked Lists. Of course, it is Windows only.

Karakoram answered 9/10, 2009 at 16:6 Comment(3)
Wow - that seems to be it. I'll need some time to check it out (I can't do it currently), but I'll get back to you.Fortyfive
Interlocked Singly Linked List is a wonderful tool but sadly it is not FIFO.Weathertight
It's not a proper list, as I recall. You cannot unlink arbitrary elements; the only thing you can do is delete the whole list. Perhaps it's moved on since then...Absorb
R
5

If you have a multiple-producer / single-consumer queue/FIFO, you can easily make one lock-free using an SLIST or a trivial lock-free LIFO stack. What you do is have a second "private" stack for the consumer (which can also be done as an SLIST for simplicity, or any other stack model you choose). The consumer pops items off the private stack. Whenever the private LIFO is exhausted, instead of popping off the shared concurrent SLIST you do a Flush (grabbing the entire SLIST chain) and then walk the flushed list in order, pushing items onto the private stack.

That works for single-producer / single-consumer and for multiple-producer / single-consumer.

However, it does not work for multiple-consumer cases (with either single-producer or multiple-producers).
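The flush-and-reverse technique described above can be sketched in portable C++11, with a std::atomic exchange standing in for the Windows SLIST (a minimal sketch; node reclamation for a queue destroyed non-empty is omitted, and the names are illustrative):

```cpp
#include <atomic>
#include <cassert>

// Multi-producer / single-consumer queue built from a lock-free LIFO:
// producers push onto a shared Treiber-style stack; the single consumer
// keeps a private stack, and when it runs dry it "flushes" (grabs the whole
// shared chain with one exchange) and reverses it to restore FIFO order.
template <typename T>
class MpscQueue {
    struct Node { T value; Node* next; };
    std::atomic<Node*> shared_{nullptr};  // LIFO head, newest element first
    Node* priv_ = nullptr;                // consumer-private, FIFO order
public:
    void produce(const T& v) {            // callable from any thread
        Node* n = new Node{v, shared_.load(std::memory_order_relaxed)};
        while (!shared_.compare_exchange_weak(n->next, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed))
            ;                             // retry with the refreshed n->next
    }
    bool consume(T& out) {                // single consumer thread only
        if (!priv_) {
            // Flush: take the entire shared chain at once, then reverse it
            // so the oldest element ends up on top of the private stack.
            Node* chain = shared_.exchange(nullptr, std::memory_order_acquire);
            while (chain) {
                Node* next = chain->next;
                chain->next = priv_;
                priv_ = chain;
                chain = next;
            }
        }
        if (!priv_) return false;         // both stacks empty
        Node* n = priv_;
        priv_ = n->next;
        out = n->value;
        delete n;
        return true;
    }
};
```

Because the consumer drains its private stack completely before flushing again, elements come out in overall FIFO order; this is exactly why the scheme breaks down with more than one consumer.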

Also, as far as hash tables go, they are an ideal candidate for "striping", which is just dividing the hash into segments and having a lock per segment. This is how the Java concurrency library does it (using 32 stripes). If you have a lightweight reader-writer lock, the hash table can be concurrently accessed for simultaneous reads, and you will only stall when writes are occurring on contested stripes (and possibly if you allow growing the hash table).

If you roll your own, make sure to interleave your locks with the hash entries rather than put all your locks in an array next to each other so you're less likely to have false sharing.
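A minimal sketch of the striping idea (lock-based by design; the stripe count and the flat per-stripe bucket are simplifications, and a real table would interleave locks with entries as advised above):

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

// Striped hash map: the table is split into fixed stripes, each guarded by
// its own mutex, so writers hitting different stripes never contend.
template <typename K, typename V, size_t Stripes = 32>
class StripedHashMap {
    struct Stripe {
        std::mutex m;
        std::list<std::pair<K, V>> bucket;  // one flat bucket per stripe, for brevity
    };
    Stripe stripes_[Stripes];
    Stripe& stripe_for(const K& k) {
        return stripes_[std::hash<K>{}(k) % Stripes];
    }
public:
    void put(const K& k, const V& v) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> g(s.m);
        for (auto& kv : s.bucket)
            if (kv.first == k) { kv.second = v; return; }
        s.bucket.emplace_back(k, v);
    }
    bool get(const K& k, V& out) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> g(s.m);
        for (auto& kv : s.bucket)
            if (kv.first == k) { out = kv.second; return true; }
        return false;
    }
};
```

Operations on different stripes proceed fully in parallel; swapping each std::mutex for a reader-writer lock (std::shared_mutex in C++17) would additionally let reads on the same stripe overlap, as the answer suggests.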

Raeleneraf answered 9/10, 2009 at 16:0 Comment(1)
Thanks for your answer. I am looking for a "production ready" solution / template in C++. I don't want to roll my own. Do you know of such an implementation?Fortyfive
W
4

And then Intel Threading Building Blocks came. And for a time, it was good.

PS: you are looking for concurrent_queue and concurrent_hash_map

Willable answered 22/7, 2009 at 12:18 Comment(3)
Those are not lock-free. See software.intel.com/en-us/forums/intel-threading-building-blocks/…Fortyfive
I know, in the strict sense of lock-free, but I nevertheless thought it might help the OP with his problem, as the lock-free thing is just an implementation detail. I thought he was looking for collections that work well with concurrent access.Willable
The lock-free thing is not just an implementation detail. It's a different beast altogether.Proboscis
O
4

I may be coming a bit late to this.

The absence of solutions (at the time the question was asked) was mainly due to an important issue in C++ (before C++0x/11): C++ had no concurrent memory model.

Now, using std::atomic, you can control memory ordering issues and have proper compare-and-swap operations. I've written an implementation of Michael & Scott's lock-free queue (PODC '96) myself, using C++11 and Michael's hazard pointers (IEEE TPDS 2004) to avoid early frees and ABA problems. It works fine, but it's a quick and dirty implementation and I'm not satisfied with the actual performance. Code is available on bitbucket: LockFreeExperiment

It's also possible to implement a lock-free queue without hazard pointers using double-word CAS (though a 64-bit version is only possible on x86-64 using cmpxchg16b); I have a blog post about that (with untested code for the queue) here: Implementing generic double-word compare and swap for x86/x86-64 (LSE blog.)
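The tag-plus-pointer idea behind that double-word CAS can be shown on a single 64-bit word by pairing a 32-bit slot index with a 32-bit version tag (a narrow stand-in for cmpxchg16b on a full pointer pair; the names here are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// ABA avoidance via versioning: pack a 32-bit index with a 32-bit tag that
// is bumped on every update, so a CAS based on a stale read fails even if
// the index value itself has recurred in the meantime.
struct TaggedIndex {
    static uint64_t pack(uint32_t idx, uint32_t tag) {
        return (uint64_t(tag) << 32) | idx;
    }
    static uint32_t index(uint64_t v) { return uint32_t(v); }
    static uint32_t tag(uint64_t v)   { return uint32_t(v >> 32); }
};
```

For example, a thread that read (index 3, tag 0) and was then delayed will fail its compare_exchange if other threads have since popped and re-pushed index 3 as (index 3, tag 2): the index matches, but the tag does not. On x86-64 the same trick at full pointer width requires the 16-byte cmpxchg16b, which is what the post above covers.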

My own benchmarks show that the double-locked queue (also in the Michael & Scott 1996 paper) performs as well as the lock-free one (I haven't reached enough contention for the locked data structures to have performance issues, but my benchmarks are too light for now), and the concurrent queue from Intel's TBB seems even better (two times faster) for a relatively small number of threads (under FreeBSD 9, the lowest bound I've found so far, this number is 8 threads on an i7 with 4 HT cores, and thus 8 logical CPUs), but it shows very strange behavior beyond that (the execution time of my simple benchmark goes from seconds to hours!)

Another limitation of lock-free queues that follow the STL style: having iterators on a lock-free queue makes no sense.

Oath answered 30/7, 2013 at 19:26 Comment(0)
T
1

To the best of my knowledge, there is no such thing publicly available yet. One issue an implementor needs to solve is that you need a lock-free memory allocator, which exists, though I cannot seem to find the link right now.

Trebuchet answered 22/7, 2009 at 12:14 Comment(1)
It doesn't make sense to me why a memory allocator is needed. Just use data structures with intrusive pointers (you know, the good old way, before everyone went crazy with containers and lost the skills to even implement simple hash tables).Radioluminescence
M
1

The following is from Herb Sutter's article on a concurrent lock-free queue, http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=1 . I have made some changes, such as the compiler-reordering barriers. GCC v4.4+ is needed to compile this code.

#include <atomic>
#include <iostream>
using namespace std;

// compile with g++ using -std=c++0x

#define CACHE_LINE_SIZE 64

template <typename T>
struct LowLockQueue {
private:
    struct Node {
        Node( T* val ) : value(val), next(nullptr) { }
        T* value;
        atomic<Node*> next;
        char pad[CACHE_LINE_SIZE - sizeof(T*) - sizeof(atomic<Node*>)];
    };

    char pad0[CACHE_LINE_SIZE];

    // for one consumer at a time
    Node* first;
    char pad1[CACHE_LINE_SIZE - sizeof(Node*)];

    // shared among consumers
    atomic<bool> consumerLock;
    char pad2[CACHE_LINE_SIZE - sizeof(atomic<bool>)];

    // for one producer at a time
    Node* last;
    char pad3[CACHE_LINE_SIZE - sizeof(Node*)];

    // shared among producers
    atomic<bool> producerLock;
    char pad4[CACHE_LINE_SIZE - sizeof(atomic<bool>)];

public:
    LowLockQueue() {
        first = last = new Node( nullptr );
        producerLock = consumerLock = false;
    }
    ~LowLockQueue() {
        while( first != nullptr ) {     // release the list
            Node* tmp = first;
            first = tmp->next;
            delete tmp->value;          // no-op if null
            delete tmp;
        }
    }

    void Produce( const T& t ) {
        Node* tmp = new Node( new T(t) );
        asm volatile("" ::: "memory");  // prevent compiler reordering
        while( producerLock.exchange(true) )
            { }                         // acquire exclusivity
        last->next = tmp;               // publish to consumers
        last = tmp;                     // swing last forward
        producerLock = false;           // release exclusivity
    }

    bool Consume( T& result ) {
        while( consumerLock.exchange(true) )
            { }                         // acquire exclusivity
        Node* theFirst = first;
        Node* theNext = first->next;
        if( theNext != nullptr ) {      // if queue is nonempty
            T* val = theNext->value;    // take it out
            asm volatile("" ::: "memory");  // prevent compiler reordering
            theNext->value = nullptr;   // of the Node
            first = theNext;            // swing first forward
            consumerLock = false;       // release exclusivity
            result = *val;              // now copy it back
            delete val;                 // clean up the value
            delete theFirst;            // and the old dummy
            return true;                // and report success
        }
        consumerLock = false;           // release exclusivity
        return false;                   // report queue was empty
    }
};

int main(int argc, char* argv[])
{
    // Instead of this mumbo jumbo, one can use pthreads on Linux to test comprehensively.
    LowLockQueue<int> Q;
    Q.Produce(2);
    Q.Produce(6);

    int a;
    Q.Consume(a);
    cout << a << endl;
    Q.Consume(a);
    cout << a << endl;

    return 0;
}
Monostich answered 14/9, 2012 at 13:25 Comment(1)
This is not lock free. Sure it doesn't use a lock provided by the os, but the way it spins on (for example) "atomic<bool> consumerLock" is definitely locking behavior. If a thread crashes while it holds one of those locks, no more work can get done. Even Herb himself says so (I think on page 4 of that article).Publicize
F
0

I found another solution written in C:

http://www.ddj.com/hpc-high-performance-computing/219500200

Fortyfive answered 21/10, 2009 at 9:13 Comment(0)
L
0

I wrote this at some point, probably back in 2010, I'm sure with help from different references. It is multi-producer, single-consumer.

#include <windows.h>   // for InterlockedExchangePointer

template <typename T>
class MPSCLockFreeQueue 
{
private:
    struct Node 
    {
        Node( T val ) : value(val), next(NULL) { }
        T value;
        Node* next;
    };
    Node* Head;
    Node* volatile InsertionPoint;   // only ever modified via InterlockedExchangePointer

public:
    MPSCLockFreeQueue() 
    {
        InsertionPoint = new Node( T() );   // dummy node
        Head = InsertionPoint;
    }
    ~MPSCLockFreeQueue() 
    {
        // The list should be drained before the destructor is called, as there
        // is no way to know whether or not to delete the values; so we just do
        // our best.
        T result;
        while( Consume(result) ) 
        {
        }
        delete Head;   // free the remaining dummy node
    }

    void Produce( const T& t ) 
    {
        Node* node = new Node(t);
        // Atomically swing InsertionPoint to the new node, then link the
        // previous tail to it. InterlockedExchangePointer works on both
        // 32-bit and 64-bit Windows.
        Node* oldInsertionPoint = (Node*) InterlockedExchangePointer(
            (volatile PVOID*)&InsertionPoint, node );
        oldInsertionPoint->next = node;
    }

    bool Consume( T& result ) 
    {
        if (Head->next)
        {
            Node* oldHead = Head;
            Head = Head->next;
            delete oldHead;
            result = Head->value;
            return true;
        }
        return false;   // report empty
    }
};
Lamprophyre answered 24/10, 2017 at 17:13 Comment(0)
