Atomic shared_ptr for lock-free singly linked list

I'm wondering if it is possible to create a lock-free, thread-safe shared pointer for any of the "common" architectures, like x64 or ARMv7 / ARMv8.

In a talk about lock-free programming at cppcon2014, Herb Sutter presented a (partial) implementation of a lock-free singly linked list. The implementation looks quite simple, but it relies on an atomic shared_ptr implementation that doesn't exist in the standard library yet or on using the specialized std::atomic... functions. This is especially important as single push/pop calls potentially invoke multiple atomic loads/stores and compare_exchange operations.

The problem I see (and I think some of the questions in the talk went into the same direction) is that for this to be an actual lock-free data structure, those atomic operations would have to be lock-free themselves. I don't know of any standard library implementation for the std::atomic... functions that is lock-free and - at least with a short google / SO search - I also didn't find a suggestion of how to implement a lock-free specialization for std::atomic<std::shared_ptr>.
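
As a concrete illustration (my own example, not from the original question), you can at least ask the library whether the free atomic shared_ptr operations are lock-free for a given object; on the mainstream implementations mentioned in the comments below this typically prints 0 (false):

#include <memory>
#include <iostream>

int main() {
    auto sp = std::make_shared<int>(42);
    // Asks the library whether the std::atomic_... overloads for this
    // shared_ptr object avoid locks; common implementations report false.
    std::cout << std::atomic_is_lock_free(&sp) << '\n';
}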

Now, before I waste my time on this, I wanted to ask:

  • Do you know if it is possible to write a lock-free, atomic shared pointer at all?
  • Are there already any implementations that I've overlooked and that are - ideally - even compatible with what you would expect from a std::atomic<std::shared_ptr>? For the mentioned queue it would especially require a CAS operation.
  • If there is no way to implement this on current architectures, do you see any other benefit in Herb's implementation compared to a "normal" linked list that is protected by a lock?

For reference, here is the code from Herb Sutter (might contain typos from me):

template<class T> 
class slist {
    struct Node { T t;  std::shared_ptr<Node> next; };
    std::atomic<std::shared_ptr<Node>> head;        
public:
    class reference{
        std::shared_ptr<Node> p;
    public:
        reference(std::shared_ptr<Node> p_) : p(std::move(p_)) {}
        T& operator*(){ return p->t; }
        T* operator->(){ return &p->t; }
    };
    auto find(T t) const {
        auto p = head.load();
        while (p && p->t != t) {
            p = p->next;
        }
        return reference(std::move(p));
    }
    void push_front(T t) {
        auto p = std::make_shared<Node>();
        p->t = t;
        p->next = head;
        while (!head.compare_exchange_weak(p->next, p)) {}
    }
    void pop_front() {
        auto p = head.load();
        while (p && !head.compare_exchange_weak(p, p->next)) { ; }
    }
};
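
For completeness, a tiny usage sketch of the class above (my own addition; it assumes a C++20 standard library that actually provides the std::atomic<std::shared_ptr<Node>> specialization):

#include <cassert>

int main() {
    slist<int> list;
    list.push_front(1);
    list.push_front(2);
    auto r = list.find(2);
    assert(*r == 2);   // the reference keeps the node alive even if it is popped concurrently
    list.pop_front();  // removes the node holding 2
}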

Note that in this implementation, single instances of a shared_ptr can be accessed / modified by multiple different threads. They can be read/copied, reset and even deleted (as part of a node). So this is not about whether multiple different shared_ptr objects (that manage the same object) can be used by multiple threads without a race condition - that is already true for current implementations and required by the standard - but about concurrent access to a single pointer instance, which is - for standard shared pointers - no more thread-safe than the same operations on raw pointers would be.
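
To illustrate the distinction (my own example): the following has a data race, because both threads touch the same shared_ptr object, even though copying two distinct shared_ptrs that manage the same int would be perfectly fine:

#include <memory>
#include <thread>

int main() {
    std::shared_ptr<int> sp = std::make_shared<int>(1);

    std::thread t1([&] { auto copy = sp; });                 // reads the shared_ptr object
    std::thread t2([&] { sp = std::make_shared<int>(2); });  // writes it: undefined behaviour

    t1.join();
    t2.join();
}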


To explain my motivation:
This is mainly an academic question. I have no intention of implementing my own lock-free list in production code, but I find the topic interesting and at first glance, Herb's presentation seemed to be a good introduction. However, while thinking about this question and @sehe's comment on my answer, I remembered this talk, had another look at it and realized that it doesn't make much sense to call Herb's implementation lock-free if its primitive operations require locks (which they currently do). So I was wondering whether this is just a limitation of the current implementations or a fundamental flaw in the design.

Induplicate answered 7/7, 2015 at 18:59 Comment(6)
Are you aware of the proposals for atomic smart pointers?Stuyvesant
@dyp: I'm aware of proposals up to N4162, but it doesn't mention a lock free version. On the contrary: It speaks about performance gains, because the atomic flag used for spinlock implementations can be stored as part of the smart pointer.Induplicate
Some data points: Both libstdc++ and libc++ use a global array of mutexes to protect atomic access to shared_ptrs via the [util.smartptr.shared.atomic] functions. They hash the address of the shared_ptr object and use a fixed-size global hash table of mutexes.Stuyvesant
Have you read the proposals for the current atomic shared ptr operations? N2674 mentions they could be made lock-free..Stuyvesant
@dyp: No, not yet. Thanks for the link, I'll take a look.Induplicate
Atomic shared_ptr needs hardware support for CAS2 instructions, as far as I remember, which handle atomic operations on 2 independent memory locations. I.e., you need to test & set the inner pointer and the refCount atomically.Supremacy

I'm adding this as an answer since it's too long to fit in a comment:

Something to consider. A lock-free shared_ptr is not needed to implement lock-free/wait-free data structures.

The reason Sutter uses shared_ptr in his presentation is because the most complicated part of writing lock-free data structures is not the synchronization, but the memory reclamation: we cannot delete nodes while they're potentially accessed by other threads, so we have to leak them and reclaim later. A lock-free shared_ptr implementation essentially provides "free" memory reclamation and makes examples of lock-free code palatable, especially in a context of a time-limited presentation.

Of course, having a lock-free atomic_shared_ptr as part of the Standard would be a huge help. But it's not the only way of doing memory reclamation for lock-free data structures: there is the naive approach of maintaining a list of nodes to be deleted at quiescent points in execution (which works in low-contention scenarios only), hazard pointers, and roll-your-own atomic reference counting using split counts.
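
To make the first (naive) option concrete, here is a minimal sketch of a retire list; the names and structure are my own, and it deliberately ignores the hard part, namely deciding when a quiescent point has been reached:

#include <vector>

template <class Node>
class RetireList {
    std::vector<Node*> retired_;   // nodes already unlinked, but possibly still read by other threads
public:
    // Called by the thread that unlinked the node, instead of deleting it immediately.
    void retire(Node* n) { retired_.push_back(n); }

    // Must only run at a quiescent point, i.e. when no thread can still hold a
    // reference to any retired node. In real code the retire list itself would
    // have to be per-thread or otherwise synchronized.
    void reclaim() {
        for (Node* n : retired_) delete n;
        retired_.clear();
    }
};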

As for performance, @mksteve is correct, lock-free code is not guaranteed to outperform lock-based alternatives unless maybe it runs on a highly parallel system offering true concurrency. Its goal is to enable maximum concurrency, and because of that what we typically get is threads doing less waiting at the cost of performing more work.

PS If this is something that interests you, you should consider taking a look at C++ Concurrency in Action by Anthony Williams. It dedicates a whole chapter to writing lock-free/wait-free code, which offers a good starting place, walking through implementations of lock-free stack and queue.

Hiett answered 27/10, 2015 at 17:51 Comment(0)
  • Do you know if it is possible to write a lock-free, atomic shared pointer at all?

  • Are there already any implementations that I've overlooked and that are - ideally - even compatible with what you would expect from a std::atomic<std::shared_ptr>?

I think the std::atomic_... functions offer a form of implementation, where the slist would perform special atomic_... queries on the shared_ptr. The problem with this being separated into two classes (std::atomic and std::shared_ptr) is that they each have constraints which need to be adhered to in order to function. The class separation makes that knowledge of shared constraints impossible.

Within slist, which knows both items, it can help the situation, and thus the atomic_... functions will probably work.
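
To illustrate what that would look like, here is a sketch of my own using the C++11 free-function overloads std::atomic_load and std::atomic_compare_exchange_weak for shared_ptr (deprecated in C++20 in favour of std::atomic<std::shared_ptr>):

#include <memory>

template <class T>
struct Node { T t; std::shared_ptr<Node> next; };

// push_front written against a plain shared_ptr head, using the free atomic_... overloads.
template <class T>
void push_front(std::shared_ptr<Node<T>>& head, T value) {
    auto p = std::make_shared<Node<T>>();
    p->t = std::move(value);
    p->next = std::atomic_load(&head);
    // Retry until no other thread has changed head in between. Whether this is
    // actually lock-free depends entirely on how the library implements the overloads.
    while (!std::atomic_compare_exchange_weak(&head, &p->next, p)) {}
}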

  • If there is no way to implement this on current architectures, do you see any other benefit in Herb's implementation compared to a "normal" linked list that is protected by a lock?

From Wikipedia's article on non-blocking algorithms: the purpose of the lock-free property is to guarantee that at least one thread makes progress.

This does not give a guarantee of better performance than a locked implementation, but does give a guarantee that deadlocks will not occur.

Imagine T required a lock to perform a copy, and that this lock could also be held by code outside of the list. A deadlock would then be possible if that lock was held while a lock-based implementation of slist was called.
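
A sketch of that scenario (my own illustration, not from the answer): one thread holds T's lock and then enters the list, while another thread is already inside the list copying a T and waiting for the same lock - the classic circular wait:

#include <mutex>

std::mutex t_lock;     // taken by T's copy constructor
std::mutex list_lock;  // taken by a lock-based slist for every operation

void thread1() {
    std::lock_guard<std::mutex> g(t_lock);     // user code holds T's lock...
    std::lock_guard<std::mutex> l(list_lock);  // ...then calls into the list
}

void thread2() {
    std::lock_guard<std::mutex> l(list_lock);  // the list holds its lock...
    std::lock_guard<std::mutex> g(t_lock);     // ...and copies a T: deadlock with thread1
}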

I think CAS is exposed through std::atomic's compare_exchange_weak, so it would be implementation independent.

Current lock-free algorithms for complex structures (e.g. vector, map) tend to be significantly less efficient than locking algorithms (see Dr. Dobb's: Lock-Free Data Structures), but the benefit offered (better thread progress) could significantly improve the performance of computers, which tend to have large numbers of idle CPUs.

Further research into the algorithms may identify new instructions which could be implemented in the CPUs of the future, to give us wait-free performance and improved utilization of computing resources.

Reagent answered 31/8, 2015 at 8:16 Comment(0)

It is possible to write a lock-free shared ptr as the only thing that needs changing is the count. The ptr itself is only copied, so no special care is needed there. When deleting, this must be the last instance, so no other copies exist in other threads and nobody would increment the count at the same time.
But having said that, std::atomic<std::shared_ptr> would be a very specialized thing, as it's not exactly a primitive type.

I've seen a few implementations of lock-free lists but none of them was of shared pointers. These containers usually have a special purpose and therefore there is an agreement around their usage (when/who creates/deletes) so using shared pointers is not required.
Also, shared pointers introduce an overhead that is contrary to our low latency goals that brought us to the lock-free domain in the first place.

So, back to your question - I think it is possible, but I don't see why one would do that.
If you really need something like that, a refCount member variable would serve better.

I see no specific benefit in Herb's specific implementation, maybe except the academic one, but lock-free lists have the obvious motivation of not having a lock. They often serve as queues or just to share a collection of nodes between threads that are allergic to locks.
Maybe we should ask Herb... Herb? Are you listening?

EDIT: Following all the comments below, I've implemented a lock-free singly linked list. The list is fairly complex, to prevent shared ptrs from being deleted while they are accessed. It is too big to post here, but here are the main ideas:
- The main idea is to stash removed entries in a separate place - a garbage collector - to make them inaccessible to later actions.
- An atomic ref count is incremented on entry to every function (push_front, pop_front, and front) and auto-decremented on exit. On decrementing to zero, a version counter is incremented. All in one atomic instruction.
- When a shared ptr needs to be erased, in pop_front, it is pushed into a GC. There's a GC per version number. The GC is implemented using a simpler lock-free list that can only push_front or pop_all. I've created a circular buffer of 256 GCs, but some other scheme can be applied.
- A version's GC is flushed on version increment, and then the shared ptrs delete the holders.
So, if you call pop_front without anything else running, the ref count is incremented to 1, the front shared ptr is pushed into GC[0], the ref count goes back to zero and the version to 1, GC[0] is flushed - it decrements the shared ptr we popped and possibly deletes the object it owns.
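
A minimal sketch of the combined ref-count/version counter described above (entirely my own interpretation; the field widths, names, and the use of a CAS loop instead of a single instruction are assumptions):

#include <atomic>
#include <cstdint>

// Lower 32 bits: number of threads currently inside push_front/pop_front/front.
// Upper 32 bits: version, bumped whenever the count drops back to zero.
struct Guard {
    std::atomic<std::uint64_t> state{0};

    void enter() { state.fetch_add(1, std::memory_order_acquire); }

    // Returns the new version if this exit dropped the count to zero, 0 otherwise.
    std::uint64_t exit() {
        std::uint64_t old = state.load();
        for (;;) {
            std::uint64_t count   = old & 0xffffffffu;
            std::uint64_t version = old >> 32;
            std::uint64_t next    = (count == 1) ? (version + 1) << 32  // last one out: count 0, version+1
                                                 : old - 1;             // just decrement the count
            if (state.compare_exchange_weak(old, next))
                return (count == 1) ? version + 1 : 0;
        }
    }
};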

Now, wrt a lock-free shared_ptr: I believe this is doable. Here are the ideas I thought of (a rough sketch of the first one follows below):
- You can have a spin lock of sorts using the low bits of the pointer to the holder, so you can dereference it only after you've locked it. You can use a different bit for inc/dec etc. This is much better than a lock on the entire thing.
The problem here is that the shared ptr itself can be deleted, so whatever contains it would have to provide some protection from the outside, like the linked list does.
- You can have some central registry of shared pointers. This does not suffer from the problem above, but would be challenging to scale up without latency spikes once in a while.
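
A rough sketch of the "low bits as a spin lock" idea (my own illustration; it assumes the Holder control block is at least 2-byte aligned, so that bit 0 of its address is always free):

#include <atomic>
#include <cstdint>

// Hypothetical control block of a shared_ptr-like type.
struct Holder { long count; /* T* ptr; deleter; ... */ };

struct TaggedHolderPtr {
    std::atomic<std::uintptr_t> bits{0};   // Holder* with bit 0 used as a lock flag

    // Spin until bit 0 is clear, then set it and return the Holder pointer.
    Holder* lock() {
        std::uintptr_t old = bits.load(std::memory_order_relaxed);
        for (;;) {
            old &= ~std::uintptr_t(1);     // we expect the unlocked value
            if (bits.compare_exchange_weak(old, old | 1, std::memory_order_acquire))
                return reinterpret_cast<Holder*>(old);
        }
    }

    // Publish a (possibly new) Holder pointer and clear the lock flag.
    void unlock(Holder* h) {
        bits.store(reinterpret_cast<std::uintptr_t>(h), std::memory_order_release);
    }
};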

To summarize, I currently think this whole idea is moot. If you find some other approach that does not suffer from big problems - I'll be very curious to know about it :)
Thanks!

Spotlight answered 22/7, 2015 at 1:4 Comment(7)
Thank you, but you'd also need an atomic compare and swap operation. And even copying a shared_ptr atomically sounds non-trivial to me, because you have to bump the ref count and copy the address. Or am I missing some obvious technique to do this?Induplicate
CAS is needed for the list, but not in shared ptr. A lock-free shared_ptr would contain a pointer to struct Holder { T* ptr; int count; }; So the Holder is what you copy, and there's no need for atomicity in that. Only the count inc/decrement is atomic.Spotlight
You still have to dereference the pointer to the Holder and bump the ref count. Otherwise the Holder could be destroyed right after you dereferenced the pointer to it. What you are describing is the current implementation of std::shared_ptr (except for the missing weak ref counter), which is not threadsafe.Induplicate
The scenario you describe does not exist. Deletion occurs when it's the last copy, and this last copy can exist only in a local scope, so no other thread would be copying it at the same time. This is the shared ptr agreement - it should not be destructed explicitly in a shared container. Deletion in the shared container occurs only on container destruction, which in itself must occur from a single owner.Spotlight
Actually this is more tricky than I initially thought. Imagine one thread removes from the shared container while (= right after) another thread copies it out. The counter should increment on return from the container before it's decremented on removal. No CAS would help with that. Let me think it through, maybe write some code, and I'll edit the answer.Spotlight
I'd recommend to a) watch herb's talk and b) read this question. Maybe that makes my problem clearer. I'll see if I can incorporate more information in my question.Induplicate
@BitWhistler Post your code. I'm sure people would want to see it!Krell
