Does this envelope implementation correctly use C++11 atomics?
I have written a simple 'envelope' class to make sure I understand the C++11 atomic semantics correctly. I have a header and a payload, where the writer clears the header, fills in the payload, then fills the header with an increasing integer. The idea is that a reader then can read the header, memcpy out the payload, read the header again, and if the header is the same the reader can then assume they successfully copied the payload. It's OK that the reader may miss some updates, but it's not OK for them to get a torn update (where there is a mix of bytes from different updates). There is only ever a single reader and a single writer.

The writer uses release memory order and the reader uses acquire memory order.

Is there any risk of the memcpy being reordered with the atomic store/load calls? Or can the loads be reordered with each other? This never aborts for me but maybe I'm lucky.

#include <iostream>
#include <atomic>
#include <thread>
#include <cstdint>
#include <cstring>

struct envelope {
    alignas(64) uint64_t writer_sequence_number = 1;
    std::atomic<uint64_t> sequence_number{0};
    char payload[5000];

    void start_writing()
    {
        sequence_number.store(0, std::memory_order::memory_order_release);
    }

    void publish()
    {
        sequence_number.store(++writer_sequence_number, std::memory_order::memory_order_release);
    }

    bool try_copy(char* copy)
    {
        auto before = sequence_number.load(std::memory_order::memory_order_acquire);
        if(!before) {
            return false;
        }
        ::memcpy(copy, payload, 5000);
        auto after = sequence_number.load(std::memory_order::memory_order_acquire);
        return before == after;
    }
};

envelope g_envelope;

void reader_thread()
{
    char local_copy[5000];
    unsigned messages_received = 0;
    while(true) {
        if(g_envelope.try_copy(local_copy)) {
            for(int i = 0; i < 5000; ++i) {
                // if there is no tearing we should only see the same letter over and over
                if(local_copy[i] != local_copy[0]) {
                    abort();
                }
            }
            if(messages_received++ % 64 == 0) {
                std::cout << "successfully received=" << messages_received << std::endl;
            }
        }
    }
}

void writer_thread()
{
    const char alphabet[] = {"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    unsigned i = 0;
    while(true) {
        char to_write = alphabet[i % (sizeof(alphabet)-1)];
        g_envelope.start_writing();
        ::memset(g_envelope.payload, to_write, 5000);
        g_envelope.publish();
        ++i;
    }
}

int main(int argc, char** argv)
{
    std::thread writer(&writer_thread);
    std::thread reader(&reader_thread);

    writer.join();
    reader.join();

    return 0;
}
Marj answered 24/9, 2019 at 22:35 Comment(7)
@Tas good catch, fixed. still never aborts though ;)Marj
I don't see what would prevent reordering of these (in try_copy): ::memcpy(copy, payload, 5000); auto after = sequence_number.load(std::memory_order::memory_order_acquire); I feel like that should be a release? Similarly, start_writing seems like it should use acquire.Gap
@DavisHerring oops, fixedMarj
The ` = {"ABCDEFGHIJKLMNOPQRSTUVWXYZ"}` is a bizarre initialization.Dermatome
@Dermatome Why? Just could have done the ASCII math?Marj
@JosephGarvin I'm fine with the content of the string, I find the syntax choice str = { "..." } quite odd: curly braces are usually used for structures or arrays not for strings.Dermatome
This question belongs on codereview.stackexchange.comTrismus
This is called a seqlock; it has a data race simply because of the conflicting calls to memset and memcpy. There have been proposals to provide a memcpy-like facility to make this sort of code correct; the most recent is not likely to appear before C++26 (even if approved).
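The proposed facility (P1478's byte-wise atomic memcpy) can be roughly approximated in user code today by declaring the shared buffer as an array of atomic bytes. This is only a sketch of the idea, not the proposal's actual API, which works on ordinary (non-atomic) memory:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch: per-byte relaxed atomic copies have no formal data race, though
// current compilers emit byte-at-a-time code rather than wide loads/stores.
void load_bytes(unsigned char* dst, const std::atomic<unsigned char>* src,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i].load(std::memory_order_relaxed);
}

void store_bytes(std::atomic<unsigned char>* dst, const unsigned char* src,
                 std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i].store(src[i], std::memory_order_relaxed);
}
```

Tearing between bytes is still possible (and expected) under concurrency; the point is only that each byte access is individually well-defined, so the seqlock's sequence-number check can safely discard a torn copy.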

Arnie answered 26/9, 2019 at 5:0 Comment(4)
Very interesting, +1. So, for now, I guess, only seqlocks with atomic data loads and stores are race-condition free.Uriah
Thinking more about it: even if memcpy were atomic, the atomic_thread_fence(memory_order_acquire); in the P1478R2 example wouldn't do the right thing, because it doesn't prevent preceding operations from being reordered past it. It should probably be atomic_thread_fence(memory_order_acq_rel);.Uriah
@MaximEgorushkin: Yup, implementing a seqlock requires some 2-way fences, not just acquire loads and release stores. e.g. my attempt in Implementing 64 bit atomic counter with 32 bit atomics, and see also Optimal way to pass a few variables between 2 threads pinning different CPUs. What you want is to get the compiler to emit efficient wide loads, regardless of atomicity. C++ makes that impossible without UB, but at least C allows struct foo = volatile struct foo to let the compiler pick an efficient way to do volatileCovenantor
@MaximEgorushkin: You can of course avoid UB by using relaxed loads/stores of chunks of atomic<long> for the shared data, but that prevents current compilers from doing multiple atomic long loads/stores with a single SIMD load/store. (Especially since HW vendors like Intel fail to even document that aligned SIMD loads/stores have per-element atomicity: Per-element atomicity of vector load/store and gather/scatter? although I think everyone assumes that.)Covenantor
This is called a seqlock. It's a known pattern, and it works well for publish occasionally, read often. If you republish too often (especially for a buffer as large as 5000 bytes), you risk too many retries by the readers as they keep detecting possible tearing. It's commonly used to e.g. publish a 64-bit or 128-bit timestamp from a timer interrupt handler to all cores, where the fact that the writer doesn't have to acquire a lock is great, and so is the fact that readers are read-only and have negligible overhead in the fast-path.


Acq and Rel are one-way barriers.

You need atomic_thread_fence(mo_acquire) before the 2nd load of the sequence number in the reader, to make sure it doesn't happen earlier, before the memcpy finishes. Likewise, the writer needs atomic_thread_fence(mo_release) after the first store and before writing the data. Note that acquire / release fences are 2-way barriers, and do affect non-atomic variables1. (Despite misconceptions to the contrary, fences really are 2-way barriers, unlike acquire or release operations. Jeff Preshing explains and debunks the confusion)
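Concretely, the fence placement described above could look like this sketch (a rewrite of the question's envelope for illustration, not the asker's exact code; an odd sequence number marks a write in progress):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

struct envelope {
    std::atomic<uint32_t> seq{0};
    char payload[5000];

    void publish(const char* src) {
        uint32_t s = seq.load(std::memory_order_relaxed);     // single writer owns seq
        seq.store(s + 1, std::memory_order_relaxed);          // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // keep data stores after the seq store
        std::memcpy(payload, src, sizeof payload);
        seq.store(s + 2, std::memory_order_release);          // even: published
    }

    bool try_copy(char* dst) {
        uint32_t before = seq.load(std::memory_order_acquire);
        if (before == 0 || (before & 1))                      // nothing published, or mid-write
            return false;
        std::memcpy(dst, payload, sizeof payload);
        std::atomic_thread_fence(std::memory_order_acquire);  // keep the memcpy before the re-check
        return before == seq.load(std::memory_order_relaxed);
    }
};
```

The release fence stops the "write in progress" store from sinking below the payload stores; the acquire fence stops the re-check load from hoisting above the payload reads. (The copy of the payload itself still has a data race in the C++ sense, as discussed below.)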

See also Implementing 64 bit atomic counter with 32 bit atomics for my attempt at a templated SeqLock class. I required the template class T to provide an assignment operator to copy itself around, but using memcpy might be better. I was using volatile for extra safety against the C++ UB we include. That works easily for uint64_t but is a huge pain in C++ for anything wider, unlike in C where you can get the compiler to efficiently emit code to load from a volatile struct into a non-volatile temporary.

You're going to have C++ data-race UB either way (because C++ makes best efficiency impossible without UB: the whole point of a SeqLock is to let tearing potentially happen on data[], but detect that and never actually look at the torn data). You could avoid UB by copying your data as an array of atomic<unsigned long> or something, but current compilers aren't smart enough to use SIMD for that so the access to the shared data would be inefficient. (And HW vendors fail to document Per-element atomicity of vector load/store and gather/scatter?, even though we all know that current CPUs do give that and future CPUs almost certainly will too.)
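The UB-free variant mentioned here might look like the following sketch (assuming the 5000-byte payload divides evenly into words, which it does for 4- and 8-byte unsigned long):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// UB-free but slower: share the payload as an array of relaxed atomic words.
// Each word access is individually atomic (no data race), but current
// compilers emit one scalar load per word instead of SIMD copies.
constexpr std::size_t kWords = 5000 / sizeof(unsigned long);
std::atomic<unsigned long> shared_words[kWords];

void copy_out(unsigned long* dst) {
    for (std::size_t i = 0; i < kWords; ++i)
        dst[i] = shared_words[i].load(std::memory_order_relaxed);
}
```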

A memory barrier is probably sufficient, but it would be nice to do something to "launder" the value to make extra sure the compiler doesn't put another reload of the non-atomic data after the 2nd load. Like What is the purpose of glibc's atomic_forced_read function?. But as I said, I think atomic_thread_fence() is sufficient. At least in practice with compilers like GCC, which treat thread_fence like asm("":::"memory") that tells the compiler all values in memory might have changed.


Footnote 1: Maxim points out that atomic_thread_fence may be sort of a hack because ISO C++ specifies things only in terms of barriers and release-sequences synchronizing with loads that see the value stored.

But it's well known how fences and acq/rel loads/stores map to asm for any given target platform. It's implausible that a compiler will do enough whole-program inter-thread analysis to prove that it can break your code.

There might be an argument to be made in terms of the language used in the C++ standard about establishing happens-before relationships between the store of tmp+1 and at least some hypothetical reader. In practice that's enough to stop a compiler from breaking the writer: it can't know what code will be reading the data it's writing so it has to respect barriers. And probably the language in the standard is strong enough that a reader that sees an odd sequence number (and avoids reading data[]) can avoid data-race UB, so there would be a valid happens-before relationship between an atomic store that has to stay ahead of some non-atomic stores. So I'm not convinced that there's any room for a malicious all-seeing compiler to not respect atomic_thread_fence() there, let alone any real compiler.

In any case, you definitely do not want _mm_lfence() on x86. You want the compiler barrier against reordering, but you definitely do not want the main effect of lfence: blocking out-of-order execution. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths and Are loads and stores the only instructions that gets reordered?

i.e. you just want GNU C asm("":::"memory"), aka atomic_signal_fence(mo_seq_cst). Also equivalent to atomic_thread_fence(mo_acq_rel) on x86, which only has to block compile-time reordering to control runtime ordering, because the only runtime reordering x86's strong memory model allows is StoreLoad (except for NT stores). x86's memory model is seq_cst + a store-buffer with store forwarding (which weakens seq_cst to acq/rel, and occasionally has other funky effects, especially for loads that partially overlap a store).

For more about _mm_lfence() and so on vs. the asm instructions, see When should I use _mm_sfence _mm_lfence and _mm_mfence.


Other tweaks

Your sequence number is unnecessarily wide, and 64-bit atomics are less efficient on some 32-bit platforms, and very inefficient on a few. A 32-bit sequence number won't wrap in any reasonable thread-sleep time. (e.g. a 4GHz CPU will take about a whole second to do 2^32 stores at 1 store per clock, and that's with zero contention for writes to the cache line. And no cycles spent executing stores of the actual data. And practical use-cases don't have the writer in a tight loop publishing new values constantly: that could lead to something similar to livelock with readers constantly retrying and making no progress.)

unsigned long is never (AFAIK) too wide to handle efficiently, except on CPUs narrower than 32-bit. So atomic<long> or atomic<unsigned long> would use a 64-bit counter on CPUs where that's fine, but definitely avoid the risk of using a 64-bit atomic in 32-bit code. And long is required to be at least 32 bits wide.

Also, you don't need two copies of the write sequence number. Just have the writer do an atomic load into a tmp var, then separate atomic stores of tmp+1 and tmp+2.

(You're correct in wanting to avoid sequence_number++; it would be a bad idea to ask the compiler to do two atomic RMWs). The only advantage of a separate non-atomic var for the writer's private seq number is if this can inline into a write loop and keep it in a register so the writer never reloads the value.
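The tmp+1 / tmp+2 scheme from the paragraphs above, sketched with an unsigned long counter (illustrative only; the payload stores go between the two calls):

```cpp
#include <atomic>
#include <cassert>

// Single-writer counter update without any atomic RMW: one relaxed load
// (the writer is the only thread that modifies seq), then two plain stores.
std::atomic<unsigned long> seq{0};

void begin_publish() {
    unsigned long tmp = seq.load(std::memory_order_relaxed);
    seq.store(tmp + 1, std::memory_order_relaxed);        // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release);  // payload stores go after this
}

void end_publish() {
    unsigned long tmp = seq.load(std::memory_order_relaxed);
    seq.store(tmp + 1, std::memory_order_release);        // even: published
}
```

If this inlines into the write loop, the compiler can keep the counter in a register between iterations, matching the "no reload" benefit the non-atomic shadow variable was aiming for.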

Covenantor answered 30/9, 2019 at 10:58 Comment(3)
Preshing says: An acquire fence prevents the memory reordering of any read which precedes it in program order with any read or write which follows it in program order. However, the C++ reference says: no reads or writes in the current thread can be reordered before this load. Where does the C++ standard require preceding loads and stores not to be reordered past an acquire operation, please?Uriah
@MaximEgorushkin: An acquire fence isn't an acquire operation. So the C++ standard doesn't say that. That's exactly the misconception that many people had for a while (including Herb Sutter and other C++ experts) preshing.com/20131125/… is the link I meant to include in this answer. I forget Jeff Preshing had 2 blog posts with acq/rel fences in the title.Covenantor
That's quite an insightful article about fences, thanks for sharing. I see now that fences are stronger than loads and stores with the same memory order.Uriah
