Is there a production ready lock-free queue or hash implementation in C++ [closed]
F

15

92

I've been googling quite a bit for a lock-free queue in C++. I found some code and some trials - but nothing that I was able to compile. A lock-free hash would also be welcome.

SUMMARY: So far I have no positive answer. There is no "production ready" library, and amazingly none of the existing libraries conforms to the API of the STL containers.

Fortyfive answered 22/7, 2009 at 9:8 Comment(7)
Visual Studio 2010 contains a lock free queue in <concurrent_queue.h>Astray
And there is a hash_map and unordered_map at code.msdn.com/concrtextrasAstray
I read the documentation about concurrent_queue.h at http://msdn.microsoft.com/en-us/library/ee355358.aspx. It says nothing about locks. Where do I find such information?Fortyfive
concurrent_queue is lock-free, look at overview documentation for the concurrency runtime on msdn.Astray
Note that, curiously, the term "lock-free" does not necessarily mean that there are no locks. See en.wikipedia.org/wiki/Non-blocking_algorithm for one definition.Lithopone
Boost 1.53 has Lockfree library.Faceharden
Has somebody mentioned libcds.sourceforge.net (pretty new I guess)? Moreover, notice that it doesn't make sense for a lock-free queue to comply with the STL API, because front() and pop() need to be combined to be meaningful.Wideawake
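On that last point: a minimal sketch (lock-based for brevity; the class and method names here are illustrative, not taken from any library mentioned in this thread) of why a concurrent queue combines front() and pop() into a single try_pop() operation:

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// With separate front()/pop(), another consumer can pop between the two
// calls, so a concurrent queue must combine them into one atomic operation.
template <typename T>
class ConcurrentQueue {
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(v);
    }
    // Combined "front + pop": returns false on empty instead of exposing a
    // front() whose result could be stale by the time pop() is called.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::queue<T> q_;
};
```

With separate front() and pop(), another consumer could pop between the two calls and front()'s return value would refer to a removed element; combining them makes the check-and-remove a single atomic step.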
T
45

As of 1.53, boost provides a set of lock free data structures, including queues, stacks and single-producer/single-consumer queues (i.e. ring buffers).

Tbilisi answered 18/2, 2013 at 12:53 Comment(3)
boost::lockfree::queue works only with POD types, and most of the time it is not quite useful. I am sure that if there were a way to provide a more flexible structure, boost would have introduced it.Sere
@Sere where is the problem with that? If you want to pass anything else, especially objects with opaque possibly blocking side effects, you haven't understood the whole purpose of lock free design.Paisley
Why boost provides lock-free queue and stack, both based on a linked list. But they are not providing lock-free linked list?Reiko
R
25

The starting point would be either of Herb Sutter's DDJ articles, for either a single producer and consumer or multiple ones. The code he gives (in-line, starting on the second page of each article) uses the C++0x-style atomic<T> template type, which you can imitate using the Boost interprocess library.

The boost code is buried in the depths of the interprocess library, but having read through the appropriate header file (atomic.hpp) the implementations for the necessary compare-and-swap operations on the systems I am familiar with look sound.
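For reference, the compare-and-swap primitive those articles build on has since been standardized; a minimal sketch against the C++11 <atomic> header rather than the Boost.Interprocess internals (the helper name is illustrative, not from the articles):

```cpp
#include <atomic>
#include <cassert>

// CAS: store `desired` into `head` only if it still holds `expected`.
// In a real lock-free loop, a failure would reload the current value
// (compare_exchange writes it back into `expected`) and retry.
inline bool try_advance(std::atomic<int>& head, int expected, int desired) {
    return head.compare_exchange_strong(expected, desired);
}
```

For example, with head holding 5, try_advance(head, 5, 7) succeeds and stores 7, while a subsequent try_advance(head, 5, 9) fails because head no longer holds 5.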

Rusticate answered 22/7, 2009 at 9:14 Comment(9)
Steve, I'm also interested in Boost's atomic implementations, but they seem to reside in Interprocess' detail/ and aren't documented. Are they safe to use anyhow? Thanks!Haleyhalf
I know Herb Sutter's articles very well - where have you found the sources? They are not published by DDJ nor on his site (or maybe I am blind?).Fortyfive
The code is in-line in those articles starting on their respective second pages.Rusticate
"Imitating" an atomic template is not what I meant by production ready. Obviously that's why Herb did not publish a source file. Did you (or anybody else) extract a working implementation from the article?Fortyfive
Here's a transcription of Sutter's single-producer/single-consumer queue -- tinesware.blogspot.com/2009/07/…Rusticate
This one seems to work, but single-producer/single-consumer does not do the job. What is needed are at least multiple producers.Fortyfive
If you want truly lock-free code, for multiple producers or consumers, then I'm out. Sutter's multiple-producer queue example isn't lock-free - there is a lock to serialize producers, and a lock to serialize consumers. If you can find one, I'd be interested in that as well.Rusticate
There is a boost::lockfree project at tim.klingt.org/git?p=boost_lockfree.git; that you can look at. One of its goals is to provide a non ::details:: version of atomic primitives.Wastage
The links above are broken. Archive.org has the articles, e.g.: web.archive.org/web/20120331121935/http://drdobbs.com/parallel/…Bromleigh
B
21

Yes!

I wrote a lock-free queue. It has Features™:

  • Fully wait-free (no CAS loops)
  • Super fast (over a hundred million enqueue/dequeue operations per second)
  • Uses C++11 move semantics
  • Grows as needed (but only if you want it to)
  • Does lock-free memory management for the elements (using pre-allocated contiguous blocks)
  • Stand-alone (two headers plus a license and readme)
  • Compiles under MSVC2010+, Intel ICC 13, and GCC 4.7.2 (and should work under any C++11 fully-compliant compiler)

It's available on GitHub under the simplified BSD license (feel free to fork it!).

Caveats:

  • Only for single-producer single-consumer architecture (i.e. two threads)
  • Thoroughly tested on x86(-64); it should work on ARM, PowerPC, and other CPUs where aligned native-sized integer and pointer loads and stores are naturally atomic, but it has not been field-tested on non-x86 CPUs (if someone has one to test it on, let me know)
  • No idea if any patents are infringed upon (use at your own risk, etc.). Note that I did design and implement it myself from scratch.
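For illustration, the core of a bounded wait-free SPSC queue can be sketched in portable C++11. This is a minimal fixed-capacity sketch of the general technique, not the library's actual implementation (which adds growth, move semantics, and cache-line padding):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal fixed-capacity SPSC ring buffer: exactly one thread calls enqueue,
// exactly one other thread calls try_dequeue. The release store on each
// index pairs with the acquire load on the other side, publishing the data
// without any locks or CAS loops. Capacity is N - 1 elements.
template <typename T, size_t N>
class SpscQueue {
public:
    bool enqueue(const T& v) {                         // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                              // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }
    bool try_dequeue(T& out) {                         // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                              // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

Each side writes only its own index and reads the other's, which is why no compare-and-swap is needed and every operation completes in a bounded number of steps (i.e. wait-free).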
Bartlet answered 6/5, 2013 at 21:56 Comment(17)
Sounds very good but multiple producers and/or multiple consumers are needed to leverage real multi threading.Fortyfive
@RED: Depends on the application. Single-producer/-consumer was all I needed, so it's all I built ;-)Bartlet
@Cameron: Great stuff! Have you benchmarked your queue against Facebook's folly ProducerConsumerQueue? I have done it using your benchmark code and it appears to dramatically outperform both your RWQ and Dmitry's SPSC. I am on OS X 10.8.3 with a 3.06 GHz Core 2 Duo (T9900) and compiled the code using Clang with -O3. I did this because I am currently looking at a single-producer/single-consumer queue for one of my projects and I considered yours a candidate :)Isbell
@André: I just checked now :-) Facebook's folly one is slightly faster than mine when dequeueing from an empty queue, and slightly slower when dequeueing from a non-empty queue on a single thread. All other operations are almost exactly the same speed (this was with g++ -O3 on a VM). What size are you using for the folly queue? (I used MAX.) Mine and Dmitry's both grow as needed, whereas the folly one is fixed -- and of course, the fastest enqueue operation is when there's no room and it simply fails. Looking at the code, folly's seems to be using the same ideas as mine, but w/o resizeability.Bartlet
@André: Oh, one more thing I forgot to mention -- with my benchmark code, the "Raw empty remove" benchmark does by far the most iterations (since it's so simple, it takes more to get a measurable result), which tends to disproportionately affect the final "average ops/s" numbers. The multipliers (and flat timing values) are generally more useful. Anyway, at these speeds, all of these queues will be quite fast enough if they're actually being used for something meatier than my silly synthetic benchmarks ;-)Bartlet
@Cameron: I used MAX as well for the size (actually MAX + 1). I know that folly is fixed size, but I tested yours with MAX as initial size and folly still outperformed yours in nearly every metric. I may have made some mistake though. The average ops/s were in the thousands whereas yours and Dmitry's are in the 70's..100's. The multipliers were also all below 1x when comparing with your RWQ. Anyways, I think you are right as any of these queues will be super efficient and will be spend most of the time waiting to do stuff ;) Could you update your bench repo with folly in it too? CheersIsbell
@André: Sure, I'll update my benchmark repo. I'm curious as to how your benchmark results are so different from mine! 1000s of millions of ops/s on average is hitting up against the CPU instructions/s speed (GHz) -- have you checked the assembly (objdump -d -C [-S] nixbench) to make sure the benchmark's not somehow being circumvented entirely by the optimizer?Bartlet
@André: OK, I've updated the benchmark (results).Bartlet
@Cameron: I found it really strange as well that the difference was so huge. Here you can check some example runs from my code. Unfortunately I am a total noob at analyzing assembly :( I have made a dump with otool -tV (I am in OS X) and you can check it out here. Maybe you can find out if the code is being circumvented or if I screwed up when adding it to the bench. Meanwhile I will pull your bench repo and run it again here.Isbell
@Cameron: It appears that I had a bug somewhere in my initial code affecting the ops/s, that was giving those awkward results. I have run your new bench code and I got more "reasonable" results. You can check them here. I had to use TQueue q(MAX + 1); instead of TQueue q(MAX); since folly's only has size - 1 usable slots and one assert() would fail otherwise. Anyways, folly still outperforms both yours and Dmitry's queues and although the ops/s now appear more realistic, the factors remain almost always below 1x. I guess we have a winner :)Isbell
@André: Ah, that explains it. You should define NDEBUG when running the benchmarks, though. I've fixed the off-by-one bug in the repo. Keep in mind different platforms/compilers will give significantly different results :-) Mine is slower than Dmitry's under ICC on my Intel laptop, for example, but much faster on a Linode VM with GCC. A single instruction extra or less can increase/decrease the running time by a significant margin (e.g. 25%), because the time scales are so small.Bartlet
@Cameron: You are right, I forgot to define NDEBUG. I have defined it, re-run the benches and updated results in the previous gist. The results are better for all queues and are more stable. Yes, I am aware of the platform/compiler variance but since I will be using Clang for my specific project, I am tending towards folly's based on current results :). However, I will have to run these benches on ARM in a near future to be certain of it, because they may be quite different from Intel x64's. I will keep you posted!Isbell
@André: Fascinating, it's definitely significantly faster in your environment. Use what works best! :-) I only wrote mine because I couldn't find anything else that had all the features I wanted (fast, pre-allocated to avoid calls to malloc to make it real-time friendly, can grow if needed, C++11 move semantics w/ proper ctor/dtor support).Bartlet
@Cameron: I admire your effort and dedication to write such a complex structure from scratch using these advanced techniques. The fact that your queue is able to grow if needed is a very good feature that should not be overlooked for some types of situations. I shall keep it in my toolbox if such a situation presents itself ;) Keep up the good work!Isbell
@André: I made one last tweak to my queue; it may affect the performance (though folly's will probably still be faster). Let me know how it goes on ARM! And good luck with your application :-)Bartlet
@Cameron, have you thought about making the elements come out of your lock-free queue in exactly the same order they were put in?Fulgor
@Vladimir: Yes, that would require rewriting the queue from scratch (new algorithm). Maybe another time :-)Bartlet
B
18

Facebook's Folly seems to have lock free data structures based on C++11 <atomic>:

I would dare to say these are currently used in production, so I guess they could safely be used in other projects.

Cheers!

Brummett answered 15/4, 2013 at 15:51 Comment(1)
They also have an MPMC queue that they claim is "optionally blocking." It does not appear to be part of their regular documentation, so I am unsure whether its use is advised.Delorisdelorme
A
13

There is such a library, but it's in C.

Wrapping to C++ should be straightforward.

liblfds

Absorb answered 17/10, 2009 at 10:18 Comment(0)
A
11

boost.lockfree is an attempt to create C++ implementations of lock-free stack and FIFO classes.

public git repository

Ashla answered 28/8, 2009 at 10:21 Comment(2)
Documentation: tim.klingt.org/boost_lockfreeAshla
This is part of standard boost now - check out the doc at boost.org/doc/libs/1_53_0/doc/html/lockfree.htmlWinker
F
8

After having checked most of the given answers, I can only state:

The answer is NO.

There is no such thing that could be used right out of the box.

Fortyfive answered 27/8, 2009 at 19:51 Comment(4)
100% correct. I came to the same result with help from the comp.programming.threads newsgroup. One reason is that the area of lock-free data structures is a patent minefield. So even commercial libs like Intel's are avoiding it.Radioluminescence
This is C, not C++. Please read questions before down voting.Fortyfive
Apologies. I note SO will not let me undo my vote as it feels the vote is too old. I think the SO developers need more to do - they seem to be adding increasing numbers of unhelpful behaviours.Absorb
Why is this answer getting so many upvotes? The question could easily be edited, or this could be a comment.Equanimous
K
6

The closest thing I am aware of is Windows Interlocked Singly Linked Lists. Of course, it is Windows only.

Karakoram answered 9/10, 2009 at 16:6 Comment(3)
Wow - that seems to be it. I'll need some time to check it out (I can't do it currently), but I'll get back to you.Fortyfive
Interlocked Singly Linked List is a wonderful tool but sadly it is not FIFO.Weathertight
It's not a proper list, as I recall. You cannot unlink arbitrary elements; the only thing you can do is delete the whole list. Perhaps it's moved on since then...Absorb
R
5

If you have a multiple-producer / single-consumer queue/FIFO, you can easily make one lock-free using an SLIST or a trivial lock-free LIFO stack. What you do is have a second "private" stack for the consumer (which can also be done as an SLIST for simplicity, or any other stack model you choose). The consumer pops items off the private stack. Whenever the private LIFO is exhausted, instead of popping off the shared concurrent SLIST you do a Flush (grabbing the entire SLIST chain) and then walk the flushed list in order, pushing items onto the private stack.

That works for single-producer / single-consumer and for multiple-producer / single-consumer.

However, it does not work for multiple-consumer cases (with either single-producer or multiple-producers).
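The flush-and-reverse technique described above can be sketched in portable C++11, with a std::atomic exchange standing in for the Windows SLIST (a minimal sketch; node reclamation for a queue destroyed non-empty is omitted, and the names are illustrative):

```cpp
#include <atomic>
#include <cassert>

// Multi-producer / single-consumer queue built from a lock-free LIFO:
// producers push onto a shared Treiber-style stack; the single consumer
// keeps a private stack, and when it runs dry it "flushes" (grabs the whole
// shared chain with one exchange) and reverses it to restore FIFO order.
template <typename T>
class MpscQueue {
    struct Node { T value; Node* next; };
    std::atomic<Node*> shared_{nullptr};  // LIFO head, newest element first
    Node* priv_ = nullptr;                // consumer-private, FIFO order
public:
    void produce(const T& v) {            // callable from any thread
        Node* n = new Node{v, shared_.load(std::memory_order_relaxed)};
        while (!shared_.compare_exchange_weak(n->next, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed))
            ;                             // retry with the refreshed n->next
    }
    bool consume(T& out) {                // single consumer thread only
        if (!priv_) {
            // Flush: take the entire shared chain at once, then reverse it
            // so the oldest element ends up on top of the private stack.
            Node* chain = shared_.exchange(nullptr, std::memory_order_acquire);
            while (chain) {
                Node* next = chain->next;
                chain->next = priv_;
                priv_ = chain;
                chain = next;
            }
        }
        if (!priv_) return false;         // both stacks empty
        Node* n = priv_;
        priv_ = n->next;
        out = n->value;
        delete n;
        return true;
    }
};
```

Because the consumer drains its private stack completely before flushing again, elements come out in overall FIFO order; this is exactly why the scheme breaks down with more than one consumer.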

Also, as far as hash tables go, they are an ideal candidate for "striping", which is just dividing the hash into segments and having a lock per segment. This is how the Java concurrency library does it (using 32 stripes). If you have a lightweight reader-writer lock, the hash table can be concurrently accessed for simultaneous reads, and you will only stall when writes are occurring on contested stripes (and possibly if you allow growing the hash table).

If you roll your own, make sure to interleave your locks with the hash entries rather than put all your locks in an array next to each other so you're less likely to have false sharing.
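A minimal sketch of the striping idea (lock-based by design; the stripe count and the flat per-stripe bucket are simplifications, and a real table would interleave locks with entries as advised above):

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

// Striped hash map: the table is split into fixed stripes, each guarded by
// its own mutex, so writers hitting different stripes never contend.
template <typename K, typename V, size_t Stripes = 32>
class StripedHashMap {
    struct Stripe {
        std::mutex m;
        std::list<std::pair<K, V>> bucket;  // one flat bucket per stripe, for brevity
    };
    Stripe stripes_[Stripes];
    Stripe& stripe_for(const K& k) {
        return stripes_[std::hash<K>{}(k) % Stripes];
    }
public:
    void put(const K& k, const V& v) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> g(s.m);
        for (auto& kv : s.bucket)
            if (kv.first == k) { kv.second = v; return; }
        s.bucket.emplace_back(k, v);
    }
    bool get(const K& k, V& out) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> g(s.m);
        for (auto& kv : s.bucket)
            if (kv.first == k) { out = kv.second; return true; }
        return false;
    }
};
```

Operations on different stripes proceed fully in parallel; swapping each std::mutex for a reader-writer lock (std::shared_mutex in C++17) would additionally let reads on the same stripe overlap, as the answer suggests.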

Raeleneraf answered 9/10, 2009 at 16:0 Comment(1)
Thanks for your answer. I am looking for a "production ready" solution / template in C++. I don't want to roll my own. Do you know of such an implementation?Fortyfive
W
4

And then Intel Threading Building Blocks came. And for a time, it was good.

PS: you are looking for concurrent_queue and concurrent_hash_map

Willable answered 22/7, 2009 at 12:18 Comment(3)
Those are not lock-free. See software.intel.com/en-us/forums/intel-threading-building-blocks/…Fortyfive
I know, in the strict sense of lock-free, but I nevertheless thought it might help the OP with his problem, as the lock-free thing is just an implementation detail. I thought he was looking for collections that work well with concurrent access.Willable
The lock-free thing is not just an implementation detail. It's a different beast altogether.Proboscis
O
4

I may be coming a bit late to this.

The absence of solutions (at the time the question was asked) was mainly due to an important issue in C++ (before C++0x/11): C++ had no concurrent memory model.

Now, using std::atomic, you can control memory ordering issues and have proper compare-and-swap operations. I've written an implementation of Michael & Scott's lock-free queue (PODC '96) myself, using C++11 and Michael's hazard pointers (IEEE TPDS 2004) to avoid early frees and ABA problems. It works fine, but it's a quick and dirty implementation and I'm not satisfied with the actual performance. Code is available on bitbucket: LockFreeExperiment

It's also possible to implement a lock-free queue without hazard pointers using double-word CAS (though a 64-bit version is only possible on x86-64 using cmpxchg16b); I have a blog post about that (with untested code for the queue) here: Implementing generic double-word compare and swap for x86/x86-64 (LSE blog.)
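The tag-plus-pointer idea behind that double-word CAS can be shown on a single 64-bit word by pairing a 32-bit slot index with a 32-bit version tag (a narrow stand-in for cmpxchg16b on a full pointer pair; the names here are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// ABA avoidance via versioning: pack a 32-bit index with a 32-bit tag that
// is bumped on every update, so a CAS based on a stale read fails even if
// the index value itself has recurred in the meantime.
struct TaggedIndex {
    static uint64_t pack(uint32_t idx, uint32_t tag) {
        return (uint64_t(tag) << 32) | idx;
    }
    static uint32_t index(uint64_t v) { return uint32_t(v); }
    static uint32_t tag(uint64_t v)   { return uint32_t(v >> 32); }
};
```

For example, a thread that read (index 3, tag 0) and was then delayed will fail its compare_exchange if other threads have since popped and re-pushed index 3 as (index 3, tag 2): the index matches, but the tag does not. On x86-64 the same trick at full pointer width requires the 16-byte cmpxchg16b, which is what the post above covers.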

My own benchmarks show that the double-locked queue (also in the Michael & Scott 1996 paper) performs as well as the lock-free one (I haven't reached enough contention for the locked data structures to have performance issues, but my benchmarks are too light for now), and the concurrent queue from Intel's TBB seems even better (two times faster) for a relatively small number of threads (under FreeBSD 9, the lowest bound I've found so far, this number is 8 threads on an i7 with 4 HT cores, and thus 8 logical CPUs), but it shows very strange behavior beyond that (the execution time of my simple benchmark goes from seconds to hours!)

Another limitation of lock-free queues that follow the STL style: having iterators on a lock-free queue makes no sense.

Oath answered 30/7, 2013 at 19:26 Comment(0)
T
1

To the best of my knowledge, there is no such thing publicly available yet. One issue an implementor needs to solve is that you need a lock-free memory allocator, which exists, though I cannot seem to find the link right now.

Trebuchet answered 22/7, 2009 at 12:14 Comment(1)
It doesn't make sense to me why a memory allocator is needed. Just use data structures with intrusive pointers (you know, the good old way, before everyone went crazy with containers and lost the skills to even implement simple hash tables).Radioluminescence
M
1

The following is from Herb Sutter's article on a concurrent lock-free queue, http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=1 . I have made some changes, such as the compiler-reordering barriers. GCC v4.4+ is needed to compile this code.

#include <atomic>
#include <iostream>
using namespace std;

// compile with g++ using -std=c++0x

#define CACHE_LINE_SIZE 64

template <typename T>
struct LowLockQueue {
private:
    struct Node {
        Node( T* val ) : value(val), next(nullptr) { }
        T* value;
        atomic<Node*> next;
        char pad[CACHE_LINE_SIZE - sizeof(T*) - sizeof(atomic<Node*>)];
    };

    char pad0[CACHE_LINE_SIZE];

    // for one consumer at a time
    Node* first;
    char pad1[CACHE_LINE_SIZE - sizeof(Node*)];

    // shared among consumers
    atomic<bool> consumerLock;
    char pad2[CACHE_LINE_SIZE - sizeof(atomic<bool>)];

    // for one producer at a time
    Node* last;
    char pad3[CACHE_LINE_SIZE - sizeof(Node*)];

    // shared among producers
    atomic<bool> producerLock;
    char pad4[CACHE_LINE_SIZE - sizeof(atomic<bool>)];

public:
    LowLockQueue() {
        first = last = new Node( nullptr );
        producerLock = consumerLock = false;
    }
    ~LowLockQueue() {
        while( first != nullptr ) {     // release the list
            Node* tmp = first;
            first = tmp->next;
            delete tmp->value;          // no-op if null
            delete tmp;
        }
    }

    void Produce( const T& t ) {
        Node* tmp = new Node( new T(t) );
        asm volatile("" ::: "memory");  // prevent compiler reordering
        while( producerLock.exchange(true) )
            { }                         // acquire exclusivity
        last->next = tmp;               // publish to consumers
        last = tmp;                     // swing last forward
        producerLock = false;           // release exclusivity
    }

    bool Consume( T& result ) {
        while( consumerLock.exchange(true) )
            { }                         // acquire exclusivity
        Node* theFirst = first;
        Node* theNext = first->next;
        if( theNext != nullptr ) {      // if queue is nonempty
            T* val = theNext->value;    // take it out
            asm volatile("" ::: "memory");  // prevent compiler reordering
            theNext->value = nullptr;   // of the Node
            first = theNext;            // swing first forward
            consumerLock = false;       // release exclusivity
            result = *val;              // now copy it back
            delete val;                 // clean up the value
            delete theFirst;            // and the old dummy
            return true;                // and report success
        }
        consumerLock = false;           // release exclusivity
        return false;                   // report queue was empty
    }
};

int main(int argc, char* argv[])
{
    // Instead of this mumbo jumbo, one can use pthreads on Linux to test comprehensively.
    LowLockQueue<int> Q;
    Q.Produce(2);
    Q.Produce(6);

    int a;
    Q.Consume(a);
    cout << a << endl;
    Q.Consume(a);
    cout << a << endl;

    return 0;
}
Monostich answered 14/9, 2012 at 13:25 Comment(1)
This is not lock free. Sure it doesn't use a lock provided by the os, but the way it spins on (for example) "atomic<bool> consumerLock" is definitely locking behavior. If a thread crashes while it holds one of those locks, no more work can get done. Even Herb himself says so (I think on page 4 of that article).Publicize
F
0

I found another solution written in C:

http://www.ddj.com/hpc-high-performance-computing/219500200

Fortyfive answered 21/10, 2009 at 9:13 Comment(0)
L
0

I wrote this at some point, probably back in 2010, I'm sure with help from different references. It is multi-producer, single-consumer.

#include <windows.h>   // for InterlockedExchangePointer

template <typename T>
class MPSCLockFreeQueue 
{
private:
    struct Node 
    {
        Node( T val ) : value(val), next(NULL) { }
        T value;
        Node* next;
    };
    Node* Head;
    Node* volatile InsertionPoint;   // only ever modified via InterlockedExchangePointer

public:
    MPSCLockFreeQueue() 
    {
        InsertionPoint = new Node( T() );   // dummy node
        Head = InsertionPoint;
    }
    ~MPSCLockFreeQueue() 
    {
        // The list should be drained before the destructor is called, as there
        // is no way to know whether or not to delete the values; so we just do
        // our best.
        T result;
        while( Consume(result) ) 
        {
        }
        delete Head;   // free the remaining dummy node
    }

    void Produce( const T& t ) 
    {
        Node* node = new Node(t);
        // Atomically swing InsertionPoint to the new node, then link the
        // previous tail to it. InterlockedExchangePointer works on both
        // 32-bit and 64-bit Windows.
        Node* oldInsertionPoint = (Node*) InterlockedExchangePointer(
            (volatile PVOID*)&InsertionPoint, node );
        oldInsertionPoint->next = node;
    }

    bool Consume( T& result ) 
    {
        if (Head->next)
        {
            Node* oldHead = Head;
            Head = Head->next;
            delete oldHead;
            result = Head->value;
            return true;
        }
        return false;   // report empty
    }
};
Lamprophyre answered 24/10, 2017 at 17:13 Comment(0)
