Why are these memory orders applied here in the implementation of the lock free queue in C++ Concurrency in Action?
Asked Answered
F

1

9

Brief Description:

The code below is an implementation of a lock-free queue from C++ Concurrency in Action 2nd Edition. The queue uses a linked list as the underlying data structure and is implemented with split reference counting technology.

What confuses me is that there are multiple compare_exchange_strong in the code that use the acquire memory order, but I don't understand why this order is used.

For example, among all accesses to the count member of a node object (see below), there are no store operations with release order, but there are many compare_exchange_strong operations with acquire order.

For the complete code for this lock-free queue, my explanation of the code, and a detailed description of my problem, please read the full description section.


Full description:

I am reading the second edition of C++ Concurrency in Action by Anthony Williams. The book describes how to implement a lock-free queue using split reference counting technique. I'll start by briefly explaining how the code works based on my understanding to help you read the code quickly. The full implementation of this queue will be given later.

The implementation uses a singly linked list to implement the queue. The queue holds pointers to the head node and tail node of the linked list, which are head and tail in the code respectively, and tail points to a dummy node.

When pushing an element into the queue, we need to put the element value in the dummy node pointed to by tail, then add a dummy node after the node, and let tail point to the new dummy node.

When popping an element from the queue, we should pop the element pointed to by head, and then set head to head->next. When head and tail point to the same node (i.e. a dummy node), the queue is empty.

The code uses reference counting to manage the lifetime of deleted nodes. The reference count associated with a node is divided into two parts, the external reference count and the internal reference count. The external count plus the internal count is the full reference count value for that node. The external count is stored in the pointer to the node (i.e. counted_node_ptr in the code), while the internal count is stored inside the node object (i.e. count in the code).

In order to prevent the object pointed to by a thread from being deleted by another thread before dereferencing the pointer, the external counter must be incremented first to ensure that the node is not deleted. This is done by increase_external_count() in the code.

When an external reference no longer refers to a node, the external count stored in it must be added to the node's internal count, which is done by free_external_counter(). The internal count is stored in count.internal_count in the node object. And count.external_counters represents the number of external references referencing this node. Because these external references each hold their own external counts, these counts must all be taken into account. The node can only be safely deleted if and only if the internal count is 0 and the number of the external counter is also 0.

For references in pop that did not successfully point to the node being popped (usually because another thread has already popped it), those references are simply discarded, but their reference counts must be subtracted. This is done by release_ref(). This function decrements the internal count of the node by one. Every time the reference count is decremented, or the inner count and the external count are merged, a check is made to see if the condition for the node to be deleted is met.

Here is the complete implementation of the lock-free queue in the book:

template<typename T>
class lock_free_queue
{
private:
    struct node;
    struct counted_node_ptr
    {
        int external_count;
        node* ptr;
    };

    std::atomic<counted_node_ptr> head;
    std::atomic<counted_node_ptr> tail;

    struct node_counter
    {
        unsigned internal_count:30;
        unsigned external_counters:2;
    };

    struct node
    {
        std::atomic<T*> data;
        std::atomic<node_counter> count;
        counted_node_ptr next;

        node()
        {
            node_counter new_count;
            new_count.internal_count=0;
            new_count.external_counters=2;
            count.store(new_count);
            next.ptr=nullptr;
            next.external_count=0;
            data.store(nullptr);
        }

        void release_ref()
        {
            node_counter old_counter=
                count.load(std::memory_order_relaxed);
            node_counter new_counter;
            do
            {
                new_counter=old_counter;
                --new_counter.internal_count;
            }
            while(!count.compare_exchange_strong(
                      old_counter,new_counter,
                      std::memory_order_acquire,std::memory_order_relaxed));
            if(!new_counter.internal_count &&
               !new_counter.external_counters)
            {
                delete this;
            }
        }
    };

    static void increase_external_count(
        std::atomic<counted_node_ptr>& counter,
        counted_node_ptr& old_counter)
    {
        counted_node_ptr new_counter;
        do
        {
            new_counter=old_counter;
            ++new_counter.external_count;
        }
        while(!counter.compare_exchange_strong(
                  old_counter,new_counter,
                  std::memory_order_acquire,std::memory_order_relaxed));
        old_counter.external_count=new_counter.external_count;
    }

    static void free_external_counter(counted_node_ptr &old_node_ptr)
    {
        node* const ptr=old_node_ptr.ptr;
        int const count_increase=old_node_ptr.external_count-2;
        node_counter old_counter=
            ptr->count.load(std::memory_order_relaxed);
        node_counter new_counter;
        do
        {
            new_counter=old_counter;
            --new_counter.external_counters;
            new_counter.internal_count+=count_increase;
        }
        while(!ptr->count.compare_exchange_strong(
                  old_counter,new_counter,
                  std::memory_order_acquire,std::memory_order_relaxed));
        if(!new_counter.internal_count &&
           !new_counter.external_counters)
        {
            delete ptr;
        }
    }
public:
    lock_free_queue() {
        counted_node_ptr h;
        h.ptr = new node;
        h.external_count = 1;

        head.store(h);
        tail.store(h);
    }

    lock_free_queue(const lock_free_queue&) = delete;
    lock_free_queue& operator=(const lock_free_queue&) = delete;

    ~lock_free_queue() {
        while (pop())
            ;
    }

    void push(T new_value)
    {
        std::unique_ptr<T> new_data(new T(new_value));
        counted_node_ptr new_next;
        new_next.ptr=new node;
        new_next.external_count=1;
        counted_node_ptr old_tail=tail.load();
        for(;;)
        {
            increase_external_count(tail,old_tail);
            T* old_data=nullptr;
            if(old_tail.ptr->data.compare_exchange_strong(
                   old_data,new_data.get()))
            {
                old_tail.ptr->next=new_next;
                old_tail=tail.exchange(new_next);
                free_external_counter(old_tail);
                new_data.release();
                break;
            }
            old_tail.ptr->release_ref();
        }
    }

    std::unique_ptr<T> pop()
    {
        counted_node_ptr old_head=head.load(std::memory_order_relaxed);
        for(;;)
        {
            increase_external_count(head,old_head);
            node* const ptr=old_head.ptr;
            if(ptr==tail.load().ptr)
            {
                ptr->release_ref();
                return std::unique_ptr<T>();
            }
            if(head.compare_exchange_strong(old_head,ptr->next))
            {
                T* const res=ptr->data.exchange(nullptr);
                free_external_counter(old_head);
                return std::unique_ptr<T>(res);
            }
            ptr->release_ref();
        }
    }
};

I think I fully understand how push and pop work and think they are correct. But what I don't understand is why some atomic operations in the code use that memory order, and the book doesn't give a reason.

The count member of the node object stores the node's internal count value, as well as the number of external references to the node. The count member is used only a few times in the code:

  1. In the constructor of node, there is a seq-cst store for count:
node()
{
    node_counter new_count;
    new_count.internal_count=0;
    new_count.external_counters=2;
    count.store(new_count);   // seq-cst store
    next.ptr=nullptr;
    next.external_count=0;
}
  1. In release_ref, there are relaxed load and compare_exchange_strong for count:
void release_ref()
{
    node_counter old_counter=
        count.load(std::memory_order_relaxed);  // relaxed load
    node_counter new_counter;
    do
    {
        new_counter=old_counter;
        --new_counter.internal_count;
    }
    while(!count.compare_exchange_strong(
           old_counter,new_counter,
           std::memory_order_acquire,std::memory_order_relaxed)); // acquire RMW operation
    // ...
}
  1. In free_external_counter, there are also relaxed load and acquire compare_exchange_strong:
static void free_external_counter(counted_node_ptr &old_node_ptr)
{
    node* const ptr=old_node_ptr.ptr;
    int const count_increase=old_node_ptr.external_count-2;
    node_counter old_counter=
        ptr->count.load(std::memory_order_relaxed);   // relaxed load
    node_counter new_counter;
    do
    {
        new_counter=old_counter;
        --new_counter.external_counters;
        new_counter.internal_count+=count_increase;
    }
    while(!ptr->count.compare_exchange_strong(
              old_counter,new_counter,
              std::memory_order_acquire,std::memory_order_relaxed));  // acquire RMW
    // ...
}

It can be seen that, except for a seq-cst store performed in the constructor of node, other operations are load operations with relaxed or acquire order, but no store operations with release order.

So what is the purpose of using acquire memory order here? What operation should it be synchronized with?

In addition, increase_external_count also uses acquire order, what is the reason here? While other atomic operations use the default seq-cst ordering, some of these operations seem to be able to be replaced with weaker memory ordering as well.

Filagree answered 19/10, 2022 at 15:4 Comment(11)
I was going to test it under ThreadSanitizer but it doesn't actually run because the first push dereferences a null tail pointer. Is there more code that you omitted, e.g. a constructor and destructor?Crosscurrent
@NateEldredge Really sorry I forgot to add the constructor of queue, I'll edit the question right away.Filagree
Ok, so TSAN does think there is a data race: godbolt.org/z/6GEhaxPsj. Will investigate when I have some time.Crosscurrent
@NateEldredge Thank you for investigating this issue. Looking forward to your answer.Filagree
I think the issue is that in free_external_counter, the dereference of ptr can theoretically be reordered after the compare_exchange_strong. That doesn't make sense in real life, since you had to finish dereferencing ptr in order to do the compare_exchange in the first place, but the memory model allows it. So if the compare_exchange decrements the count to 1, another thread can be deleting ptr in a race with the delayed dereference.Crosscurrent
@NateEldredge I found that if I change the memory order of compare_exchange_strong in free_external_counter from acquire to seq-cst, the sanitizer will not report errors.Filagree
I think release_ref has a similar race that the sanitizer does not detect. In both cases, at first glance it looks like acq_rel would suffice, but I'd want to sit down and write a proof.Crosscurrent
@NateEldredge The book also implements a lock-free stack, and uses the same reference counting technique. In the stack implementation, decrementing the reference count (just like what free_external_counter does here) has a release order, and once the reference count reaches 0, an acquire load operation should be added to establish synchronization. I just tested the same code by modifying compare_exchange_strong to release order, and adding a load with acquire order to the success branch of the if below, and now Sanitizer doesn't report an error.Filagree
Or you can put an acquire fence at the top of the success branch. (ThreadSanitizer doesn't really know about fences, though, so it will still report a race, but it's a false positive.)Crosscurrent
I went ahead and fixed the uninitialized data bug in the question.Crosscurrent
As to your last question, I think you are right that the compare_exchange in increase_external_count can be weakened to relaxed. When it's called, the counter must already be positive, or else accessing the counter itself is unsafe. So code that follows the compare_exchange would be equally well protected if it were reordered before it. (This is not a proof, of course.)Crosscurrent
C
2

Good catch. These are data races. ThreadSanitizer agrees in both cases.

In free_external_counter:

        while(!ptr->count.compare_exchange_strong(
                  old_counter,new_counter,
                  std::memory_order_acquire,std::memory_order_relaxed));
        if(!new_counter.internal_count &&
           !new_counter.external_counters)
        {
            delete ptr;
        }

We read the object *ptr before the compare_exchange (to get ptr->count). But since the compare_exchange is not release, the read of *ptr could in theory be reordered after it. If so, then by the time the read of *ptr happens, the counter has already been decremented, and in the meantime some other thread might have decremented it again and then done delete ptr.

I am not sure how this could happen in real life, short of having a time machine, because we had to read *ptr in order to even find the object ptr->count that we are supposed to be writing to. You'd need some miraculous machine that can speculatively make stores globally visible, and then roll back all other cores if it turns out the wrong address or value was stored. But anyway, the C++ memory model doesn't forbid it.

There's a similar bug in release_ref, where the object being raced on is *this.

I think these can be fixed by upgrading the compare-exchanges to acq_rel. That should ensure that all accesses to *ptr happen before its deletion. In fact, as you point out, we only need the acquire ordering in the case that the count has become zero, since it's only the delete ptr that must be protected from reordering before the compare_exchange. You could do that by having just release ordering on the compare_exchange, and putting an acquire fence inside the if block, just before delete ptr;. I have not taken the time to prove this correct, though.

On the other hand, I also think you're right that the compare_exchange in increase_external_count could be weakened to relaxed. When it's called, the counter must already be positive, or else accessing the counter itself is unsafe. So code that follows the compare_exchange would be equally well protected if it were reordered before it. (This is not a proof, of course.)

Crosscurrent answered 21/10, 2022 at 6:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.