Are memory orders for each atomic correct in this lock-free SPSC ring buffer queue?

I have a ring buffer that looks like:

template<class T>
class RingBuffer {
public:
  bool Publish(T value);
 
  bool Consume(T& value);

  bool IsEmpty(std::size_t head, std::size_t tail);

  bool IsFull(std::size_t head, std::size_t tail);

private:
  std::size_t Next(std::size_t slot);

  std::vector<T> buffer_;
  std::atomic<std::size_t> tail_{0};
  std::atomic<std::size_t> head_{0};
  static constexpr std::size_t kBufferSize{8};
};

This data structure is intended to work with two threads: a publisher thread and a consumer thread. The two major functions, with the memory orders for the atomics not yet filled in, are listed below:

bool Publish(T value) {
    const size_t curr_head = head_.load(/* memory order */);
    const size_t curr_tail = tail_.load(/* memory order */);

    if (IsFull(curr_head, curr_tail)) {
        return false;
    }

    buffer_[curr_tail] = std::move(value);
    tail_.store(Next(curr_tail) /*, memory order */);
    return true;
}

bool Consume(T& value) {
    const size_t curr_head = head_.load(/* memory order */);
    const size_t curr_tail = tail_.load(/* memory order */);

    if (IsEmpty(curr_head, curr_tail)) {
        return false;
    }

    value = std::move(buffer_[curr_head]);
    head_.store(Next(curr_head) /*, memory order */);
    return true;
}

I know that, at a minimum, tail_.store() in Publish() must use std::memory_order::release and tail_.load() in Consume() must use std::memory_order::acquire, to create a happens-before relationship between the write to buffer_ and the read of buffer_. Also, I can pass std::memory_order::relaxed to tail_.load() in Publish() and to head_.load() in Consume(), because a thread always sees its own last write to an atomic. The functions now look like this:

bool Publish(T value) {
    const size_t curr_head = head_.load(/* memory order */);
    const size_t curr_tail = tail_.load(std::memory_order::relaxed);

    if (IsFull(curr_head, curr_tail)) {
        return false;
    }

    buffer_[curr_tail] = std::move(value);
    tail_.store(Next(curr_tail), std::memory_order::release);
    return true;
}

bool Consume(T& value) {
    const size_t curr_head = head_.load(std::memory_order::relaxed);
    const size_t curr_tail = tail_.load(std::memory_order::acquire);

    if (IsEmpty(curr_head, curr_tail)) {
        return false;
    }

    value = std::move(buffer_[curr_head]);
    head_.store(Next(curr_head) /*, memory order */);
    return true;
}

The last step is to choose memory orders for the remaining pair: head_.load() in Publish() and head_.store() in Consume(). The line value = std::move(buffer_[curr_head]); must execute before head_.store() in Consume(); otherwise there would be a data race whenever the buffer is full, so, at the very least, that store must use std::memory_order::release to prevent the reordering. But does head_.load() in Publish() have to use std::memory_order::acquire? I understand that it helps head_.load() observe head_.store() within a reasonable time, unlike std::memory_order::relaxed, but if I don't need that guarantee of promptly seeing the store's side effect, can I use a relaxed memory order? If I can't, why not? Completed code:

bool Publish(T value) {
    const size_t curr_head = head_.load(std::memory_order::relaxed); // or acquire?
    const size_t curr_tail = tail_.load(std::memory_order::relaxed);
        
    if (IsFull(curr_head, curr_tail)) {
        return false;
    }
        
    buffer_[curr_tail] = std::move(value);
    tail_.store(Next(curr_tail), std::memory_order::release);
    return true;
}
    
bool Consume(T& value) {
    const size_t curr_head = head_.load(std::memory_order::relaxed);
    const size_t curr_tail = tail_.load(std::memory_order::acquire);
        
    if (IsEmpty(curr_head, curr_tail)) {
        return false;
    }
        
    value = std::move(buffer_[curr_head]);
    head_.store(Next(curr_head), std::memory_order::release);
    return true;
}

Are the memory orders for each atomic correct? Is my explanation of the use of each memory order on each atomic variable right?

Talca answered 28/12, 2021 at 20:51 Comment(10)
I don't think that a release without an acquire has any effect. – Trunnion
@user17732522 It could have the effect of std::atomic_thread_fence(std::memory_order_release), as far as I know. – Infinitesimal
@ShamilMukhetdinov That is stronger than a release operation on an atomic. See here. – Trunnion
@Trunnion You're right, but I mean that a store-release operation prevents all preceding writes from moving past the store-release. – Infinitesimal
@ShamilMukhetdinov But that is only observable in a thread performing an acquire operation on the same atomic (at least in the C++ memory model). – Trunnion
@user17732522 I'm a bit confused. If the head.store-release is complete, then I can tell that the operation on the value has also been completed by that point, regardless of the head.load operation, right? Or do I have to create a "synchronizes-with" connection? – Infinitesimal
@ShamilMukhetdinov Threads do not need to agree on the order in which operations happen, and only the release/acquire connection establishes such an agreement, as far as I know. Yes, that is the synchronizes-with property in the standard. I couldn't find any effect of the release order on an atomic store other than the synchronizes-with ordering between release/acquire. – Trunnion
@user17732522 As I understand it, store/load operations can be used as fences: eel.is/c++draft/…. – Infinitesimal
@ShamilMukhetdinov I am talking about the std::atomic release/acquire, i.e. the one associated with a memory location; I am not talking about the release fence, i.e. the one not associated with a memory location. If you mean that OP could use a release fence instead of the head store-release, then that might be true (I haven't thought about the details). – Trunnion
@ShamilMukhetdinov: A std::atomic_thread_fence(std::memory_order_release) is a release fence, not a release operation. (The standard apparently describes it as a "synchronization operation" in the part you linked, but it's not a "release operation". See preshing.com/20131125/… for more about the difference.) – Overweening

Previous answers may help as background:
c++, std::atomic, what is std::memory_order and how to use them?
https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/

Firstly, the system you describe is known as a Single Producer, Single Consumer (SPSC) queue. You can always look at the boost version of this container to compare. I often examine boost code, even when I work in situations where boost is not allowed, because examining and understanding a stable solution gives you insight into the problems you may encounter (why did they do it that way? Oh, I see it, etc.). Given your design, and having written many similar containers, I will say that you have to be careful about distinguishing empty from full. If you use a classic {begin, end} pair, you hit the problem that, due to wrapping,

{begin, begin+size} == {begin, begin} == empty

Okay, so back to the synchronisation issue.

Given that the memory order only affects reordering, the use of release in Publish seems a textbook use of the flag. Nothing will read the value until the tail index is incremented, so you don't care if the writes of the value itself happen in a random order; you only care that the value is fully written before the index is increased. So I concur: you are using the flag correctly in the Publish function.

I did question whether the release was required in Consume, but if you are moving out of the queue, and those moves have side effects, it may be required. I would say that if you are after raw speed, it may be worth making a second version, specialised for trivial objects, that uses relaxed ordering for incrementing the head.

You might also consider placement new and explicit destruction as you push/pop. While most moves will leave an object in an empty state, the standard only requires that a moved-from object is left in a valid state. Explicitly destroying the object after the move may save you from obscure bugs later.

You could argue that the two atomic loads in Consume could be memory_order_consume. That relaxes the constraint to say "I don't care in what order they are loaded, as long as they are both loaded by the time they are used", although I doubt it produces any gain in practice. I am also nervous about this suggestion, because when I look at the boost version it is remarkably close to what you have. https://www.boost.org/doc/libs/1_66_0/boost/lockfree/spsc_queue.hpp

template <typename Functor>
bool consume_one(Functor & functor, T * buffer, size_t max_size)
{
    const size_t write_index = write_index_.load(memory_order_acquire);
    const size_t read_index  = read_index_.load(memory_order_relaxed);
    if ( empty(write_index, read_index) )
        return false;

    T & object_to_consume = buffer[read_index];
    functor( object_to_consume );
    object_to_consume.~T();

    size_t next = next_index(read_index, max_size);
    read_index_.store(next, memory_order_release);
    return true;
}

In general, though, your approach looks good and very similar to the boost version. But if I were you, I might go through the boost version line by line to see what it does differently. It is very easy to make a mistake.

EDIT: Sorry, I just noticed you have the memory_order_acquire/memory_order_relaxed flags the wrong way around in your code. You should make the last write a "release" and the first read an "acquire". This doesn't affect the behaviour significantly, but it makes the code easier to read (sync at the start, sync at the end).

Replies to comments: as @user17732522 implies, the copy operation is also a write, so the optimisation for trivial objects will not be synchronised and will introduce UB (damn, it's easy to make a mistake).

Sproul answered 31/12, 2021 at 11:59 Comment(13)
I don't think whether the moved objects are trivial should matter. If the write/write operation pair races, then the write/read operation pair also races, and both are UB. – Trunnion
The release order flag only affects writes to memory. A trivial object is not changed when it is moved (copied) from, and so there are no writes that need to complete before the count is written. That said, it would be a fragile change, so it would require the compiler to enforce it (IMHO). – Sproul
No, on second thoughts I'm being thick. The copy itself is a write. Dohh. I will add that to the answer. – Sproul
The store in Consume needs to be release because it's like an unlock: as soon as the writer sees the new value, it can potentially store stuff into that entry. We mustn't allow that to happen until after reading the entry is finished, thus release, regardless of side effects. – Overweening
memory_order_consume isn't "I don't care what order they are loaded in, as long as they are both loaded by the time they are used". It means that further loads dependent on the value loaded must happen after the consume load; e.g. buffer_[curr_head] is ordered after curr_head = head_.load(consume), but other loads/stores like tail_.load(relaxed) are not required to complete before the load from the buffer. (Speculative execution can make that happen even though there's a check for empty involving tail.) – Overweening
BTW, mo_consume is temporarily deprecated and compilers promote it to acquire. See What does memory_order_consume really do? for details on that and more. – Overweening
Thank you, Peter. Did you know that some species of sea squirt have two stages of life? Once they have settled in a single home, they no longer move; they digest their own brains to become more like a plant. I think lockdown did the same to me :-) – Sproul
@Peter: If you have time to answer, I'd like to know why the branch on empty doesn't count as a "dependency". I understand that speculative execution might follow both branches, but if the retire operation is dependent on the value, is it that this is the wrong type of dependency (control-dependent rather than data-dependent)? – Sproul
Right, control dependencies are handled via branch prediction and speculative execution, so confirmation that we're on the right path of execution can come after later loads. Data dependencies work by actually waiting for the data to be ready for an execution unit. IDK if I've gone into more detail than that in my own answers, but some of the links in Memory order consume usage in C11 might explain more fully. (Sorry, there are a lot.) Maybe puppetmastertrading.com/images/hwViewForSwHackers.pdf – Overweening
Thanks for the info. I appreciate your time. – Sproul
The Linux kernel's memory-barriers.txt has a section on control dependencies: github.com/torvalds/linux/blob/…. (And search for "branch" in that file. mo_consume exposes the same concept that Linux RCU depends on, so see Linux's details about getting things to actually compile safely with data dependencies, and how optimizers can break things without a usable C11 mo_consume, e.g. by noticing that flag ^ flag only has one possible value, 0, and making the load address not dependent if you used that.) – Overweening
@Tiger4Hire: Came across Difference between data dependence and control dependence while looking for something else. My answer there is relevant to what you were asking and has some links to more stuff. It's mostly not about memory ordering, but rather how CPUs work in general. Also CPU prediction and memory barrier. – Overweening
I'm familiar with the xor trick. In the old days it was quicker to xor a register with itself than to load it with 0, so it was used a lot. As per your answer above, a branch dependency does imply a data dependency. I tend to ignore branch prediction when looking for dependencies, as it is a finite resource and may not be used. However, your answer also states that a branch dependency will not create more (pipeline) dependencies. I overlooked that. The documentation of the flag should be improved to make this clear, but from the sounds of it, it's pretty much deprecated at this stage. – Sproul

To understand the correct std::memory_order for this problem, let's consider for a moment what happens if there is only one thread rather than producer and consumer threads.

In the single-thread scenario, bool Publish(T value) will see all operations performed by a previous bool Consume(T& value), in order, and similarly bool Consume(T& value) will see all operations performed by a previous bool Publish(T value), in order. So in the multi-thread scenario, the publisher and consumer threads must be synchronised in a similar way. That synchronisation can be achieved with memory barriers.

In Consume, the release atomic store head_.store(Next(curr_head), std::memory_order::release); guarantees that operations issued before the store are performed before it and will not cross it. Publish can synchronise with it through an acquire atomic load of the head_ variable: if Publish sees the new head_, it is guaranteed to see all operations performed before the store.

bool Publish(T value) {
    const size_t curr_head = head_.load(std::memory_order::acquire);
    const size_t curr_tail = tail_.load(std::memory_order::relaxed);
        
    if (IsFull(curr_head, curr_tail)) {
        return false;
    }
        
    buffer_[curr_tail] = std::move(value);
    tail_.store(Next(curr_tail), std::memory_order::release);
    return true;
}

Similarly, Consume can synchronise with Publish's release atomic store tail_.store(Next(curr_tail), std::memory_order::release); through an acquire atomic load of the tail_ variable: if Consume sees the new tail_, it is guaranteed to see all operations performed before it.

bool Consume(T& value) {
    const size_t curr_tail = tail_.load(std::memory_order::acquire);
    const size_t curr_head = head_.load(std::memory_order::relaxed);
        
    if (IsEmpty(curr_head, curr_tail)) {
        return false;
    }
        
    value = std::move(buffer_[curr_head]);
    head_.store(Next(curr_head), std::memory_order::release);
    return true;
}
Meyeroff answered 6/1, 2022 at 12:16 Comment(1)
I get your point, but the problem is that a read doesn't have a side effect, unlike a write. So I think it is not necessary for the write to synchronise with the read, because the read did nothing that the write would need to see. The main thing to take into consideration is the data race that may occur in the Consume function due to reordering of the store operation and the buffer read operation. – Talca

© 2022 - 2024 — McMap. All rights reserved.