Is memory barrier or atomic operation required in a busy-wait loop?

Consider the following spin_lock() implementation, originally from this answer:

void spin_lock(volatile bool* lock)  {  
    for (;;) {
        // inserts an acquire memory barrier and a compiler barrier
        if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
            return;

        while (*lock)  // no barriers; is it OK?
            cpu_relax();
    }
}
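
(For reference: the matching unlock needs a release operation; a minimal sketch, not part of the original question, using the same GCC builtins:)

    void spin_unlock(volatile bool* lock) {
        // release barrier: writes made inside the critical section become
        // visible to other CPUs before the lock is observed as free
        __atomic_clear(lock, __ATOMIC_RELEASE);
    }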

What I already know:

  • volatile prevents the compiler from optimizing away the re-read of *lock on each iteration of the while loop;
  • volatile inserts neither memory nor compiler barriers;
  • such an implementation actually works with GCC on x86 (e.g. in the Linux kernel) and on some other architectures;
  • at least one memory barrier and one compiler barrier are required in a spin_lock() implementation for a generic architecture; this example inserts them via __atomic_test_and_set().

Questions:

  1. Is volatile enough here, or are there architectures or compilers where a memory barrier, compiler barrier, or atomic operation is required in the while loop?

    1.1 According to C++ standards?

    1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?

  2. Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
  3. Is the while loop safe according to C++11 and its memory model?

There are several related questions, but I was unable to construct an explicit and unambiguous answer from them.

Koeppel answered 20/9, 2015 at 9:10 Comment(8)
The first assumption is incorrect - on a multi-CPU system, a volatile read is not guaranteed to synchronize caches. It is supposed to be used for device interfacing, not for threading. See Volatile and multithreading: is the following thread safe?Metrics
@BoPersson I believe he just means that it forces the compiler not to hoist the memory load operation out of the loop, so that it will at least re-read the local processor's cache. The question is about whether there actually exists an architecture for which cache coherency would actually be a real issue without a memory barrier, which implies that OP does understand that volatile does not create such a barrier.Derivative
@davmac, yes! This is exactly what I'm asking about.Koeppel
@g-v btw "doesn't insert neither" is a double negative. You meant "inserts neither".Derivative
@davmac, thanks, fixed.Koeppel
As a partial answer to your 1.2, here is how LLVM implements volatile and atomic memory orderings: llvm.org/docs/Atomics.html. So it's not safe there.Dysplasia
@Lorehead, thanks. LLVM states this for non-atomics (including volatile): "If there is a race on a given memory location, loads from that location return undef." Could you please clarify: does GCC explicitly promise stronger guarantees somewhere, and is there an actual difference in the operations generated for volatile loads by GCC and LLVM? BTW, in Microsoft's compiler volatile loads are explicitly promised to be atomic: #7007903Koeppel
@g-v The relevant section of the gcc manual: gcc.gnu.org/onlinedocs/gcc/Volatiles.html Note that g++ does not make the same guarantees for volatile references as for volatile pointers.Dysplasia
  1. Is volatile enough here, or are there architectures or compilers where a memory barrier, compiler barrier, or atomic operation is required in the while loop?

Will the volatile read see the change? Yes, but not necessarily as quickly as if there were a memory barrier. At some point, some form of synchronization will occur and the new state will be read from the variable, but there are no guarantees about how much has happened elsewhere in the code by then.

1.1 According to C++ standards?

From cppreference : memory_order

It is the memory model and memory order which define the generalized machine that the code needs to work on. For a message to pass between threads of execution, an inter-thread happens-before relationship needs to hold between the write (A) and the read (B). This requires one of the following:

  • A synchronizes-with B;
  • A is dependency-ordered before B;
  • A synchronizes-with some evaluation X, and X is sequenced before B;
  • A is sequenced before some evaluation X, and X inter-thread happens before B;
  • A inter-thread happens before some evaluation X, and X inter-thread happens before B.

As the while loop establishes none of these relationships, there are builds of your program, on some current hardware, where it may fail.
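
For contrast, here is a minimal sketch (mine, not from the original answer) of code that does establish such a relationship, pairing a release store with an acquire load:

    #include <stdatomic.h>
    #include <stdbool.h>

    int data;                        // plain, non-atomic payload
    atomic_bool ready = false;       // the flag the threads synchronize on

    void producer(void) {
        data = 42;                   // sequenced before the release store (A)
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    void consumer(void) {
        // acquire load (B); once it reads true, A synchronizes-with B,
        // so the read of data below is well-defined and must see 42
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        int seen = data;
        (void)seen;
    }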

In practice, the end of a time-slice will cause the memory to become coherent, or any form of barrier on the non-spinlock thread will ensure that the caches are flushed.

I am not sure what causes the volatile read to eventually observe the "current value".

1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?

As the code is not consistent with the C++11 abstract machine, it is likely this code will eventually fail with compilers that adhere strictly to the standard.

From cppreference : const volatile qualifiers:

"Volatile access stops optimizations from moving work from before it to after it, and from after it to before it. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution."

So an implementation has to ensure that each volatile read is performed from the memory location rather than from any locally cached copy (e.g. a register). But it does not have to ensure that the volatile write is flushed through the caches to produce a coherent view across all the CPUs. In this sense, there is no bound on how long after a write to a volatile variable it becomes visible to another thread.

Also see kernel.org on why volatile is nearly always wrong in the kernel.

2. Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)

There is no guarantee that the volatile write gets out of the thread which performs it, so it is not really safe in general. On Linux it may happen to be safe.

3. Is the while loop safe according to C++11 and its memory model?

No - as it doesn't establish any of the inter-thread happens-before relationships.

Pledge answered 21/9, 2017 at 19:6 Comment(1)
Thanks! A very clear answer. Could you please clarify this point? As the code is not consistent with the C++11 abstract machine, it is likely this code will eventually fail with compilers that adhere strictly to the standard. Do you have an example in mind of what exactly may happen?Koeppel

This is important: in C++, volatile has nothing at all to do with concurrency! The purpose of volatile is to tell the compiler that it shall not optimize accesses to the affected object. It does not tell the CPU anything, primarily because the CPU already knows whether a given memory region is volatile or not. The purpose of volatile is, effectively, to deal with memory-mapped I/O.

The C++ standard is very clear in section 1.10 [intro.multithread] that unsynchronized access to an object which is modified in one thread and is accessed (modified or read) in another thread is undefined behavior. The synchronization primitives avoiding undefined behavior are library components like the atomic classes or mutexes. This clause mentions volatile only in the context of signals (i.e., as volatile sig_atomic_t) and in the context of forward progress (i.e., that a thread will eventually do something which has an observable effect, like accessing a volatile object or doing I/O). There is no mention of volatile in conjunction with synchronization.

Thus, unsynchronized access to a variable shared across threads leads to undefined behavior. Whether it is declared volatile or not doesn't matter to this undefined behavior.
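
As an aside, the one concurrency-adjacent use the standard does sanction for volatile is the signal-handler flag mentioned above; a minimal sketch:

    #include <signal.h>

    static volatile sig_atomic_t got_signal = 0;

    // installed with signal() or sigaction(); volatile sig_atomic_t is
    // the one object type the standard permits a handler to write to
    static void handler(int sig) {
        (void)sig;
        got_signal = 1;
    }

    // the main loop then polls: if (got_signal) { /* handle and reset */ }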

Locomobile answered 20/9, 2015 at 9:30 Comment(4)
Thanks. So the answer for question 3. is definitely no, it is not safe according to C++ standard.Koeppel
That said, the GCC manual says that dereferencing a volatile* will actually reload the value from memory, and not be optimized away, so on GCC, the OP was correct about what it means.Dysplasia
@Lorehead: nope, that conclusion is false: the compiler will cause the data to be loaded but whether that affects what the CPU does in the context of a multi-core system is entirely unrelated. Some CPUs will actually reread the data (e.g., current x86 systems) while others will not.Lodger
I don't think we disagree at all, Dietmar. The OP realized volatile doesn't guarantee cache coherency.Dysplasia

From the Wikipedia page on memory barriers:

... Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively.

To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors, but this may in fact just be for purposes of ordering. The question, I think, really boils down to:

Does there exist an architecture where a processor might never update its local cache if not instructed to do so? I don't know the answer, but if you posed the question in this form then someone else might. In such an architecture your code potentially goes into an infinite loop where the read of *lock always sees the same value.

In terms of general C++ legality, the one atomic test-and-set in your example isn't enough, since it implements only a single fence, which allows you to see the initial state of *lock when entering the while loop but not to see when it changes. (Reading a variable that is changed in another thread without synchronisation is undefined behavior.) So the answer to your questions (1.1/3) is no.

On the other hand, in practice, the answer to (1.2/2) is yes (given GCC's volatile semantics), so long as the architecture guarantees cache coherence without explicit memory fences, which is true of x86 and probably for many architectures but I can't give a definite answer on whether it is true for all architectures that GCC supports. It is however generally unwise to knowingly rely on particular behavior of code that is technically undefined behavior according to the language spec, especially if it is possible to get the same result without doing so.

Incidentally, given that memory_order_relaxed exists, there seems little reason not to use it in this case rather than try to hand-optimise by using non-atomic reads, i.e. changing the while loop in your example to:

    // note: strictly, C11's atomic_load_explicit expects a pointer to an
    // _Atomic type (e.g. volatile atomic_bool*); with GCC's <stdatomic.h>
    // this also compiles for a plain volatile bool*
    while (atomic_load_explicit(lock, memory_order_relaxed)) {
        cpu_relax();
    }

On x86_64 for instance the atomic load becomes a regular mov instruction and the optimised assembly output is essentially the same as it would be for your original example.
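
Equivalently, staying with the GCC builtins used in the question, a sketch (mine) that keeps lock as a plain volatile bool* and assumes the same cpu_relax() as in your example:

    void spin_lock(volatile bool* lock) {
        for (;;) {
            // acquire barrier, exactly as in the original example
            if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
                return;

            // relaxed atomic load: defined behavior under the C11/C++11
            // memory model; on x86 it compiles to a plain mov, just like
            // the volatile read
            while (__atomic_load_n(lock, __ATOMIC_RELAXED))
                cpu_relax();
        }
    }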

Derivative answered 20/9, 2015 at 9:27 Comment(13)
"To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors." -- Is it really true that on Itanium (or some other architectures) changes may be not visible without a barrier? I though barriers only restrict order of such changes, but not affect visibility.Koeppel
"it depends on what cpu_relax does" -- Lets assume that is does nothing or nothing related to barriers/cache flushes/etc. "is this from the Linux kernel btw?" -- No, it's just an example. Kernel version is here: lxr.free-electrons.com/source/arch/x86/include/asm/…Koeppel
@g-v if cpu_relax does nothing relating to memory barriers then the answer to (3) is definitely "no" and I'm pretty sure the answer to (2) is also "no".Derivative
@g-v "Is it really true that on Itanium (or some other architectures) changes may be not visible without a barrier?" - it's certainly possible although I'm not knowledgable enough about other architectures to give you a definite answer. In any case correct ordering is critical to the operation of any type of mutex including a spinlock; you don't want to see a processor mutating the protected structure after it has released the lock, right?Derivative
@g-v (although I suppose this isn't actually an issue with your example, since there is always a barrier when the spinlock is actually acquired. Hmmm).Derivative
"although I suppose this isn't actually an issue with your example..." - Yes! Also, it seems that IA64 implements cache coherency for data cache, but not for instruction cache: see informit.com/articles/article.aspx?p=29961&seqNum=6 and puppetmastertrading.com/images/hwViewForSwHackers.pdf So my example probably is safe on Itanium.Koeppel
"The question, I think, really boils down to: Does there exist an architecture where a processor might never update its local cache if not instructed to do so?" -- Yes, thanks. And also "Is there a compiler where volatile will not work in such scenario?" and "Does this non-standard volatile behaviour always work in GCC?".Koeppel
Because volatile is generally accepted to mean that the address might refer to a location whose value is controlled by hardware (i.e. that is memory-mapped IO address), I think you can be fairly sure that GCC and other compilers will do what you want here. But see amended answer - there's no real need to rely on this when you can use an atomic load with memory_order_relaxed anyway.Derivative
Great hint! It seems that if we use memory_order_consume instead of memory_order_relaxed here, it will be safe even on architectures without cache coherency (if spin_unlock() uses a memory_order_release barrier). See "Release-Consume ordering" here: en.cppreference.com/w/cpp/atomic/memory_order It also states: "On all mainstream CPUs other than DEC Alpha, dependency ordering is automatic, no additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected"Koeppel
@g-v I am fairly certain that any atomic operation has to be safe without cache coherency anyway. The memory ordering constraints are for surrounding reads/writes to non-atomics. Long discussion here, though it's for C++ I think the same principle applies to C: #30691635Derivative
Ah, I see now. I also found this question: #27333811Koeppel
@Derivative As for what GCC does and does not guarantee, here is the relevant section of the manual: gcc.gnu.org/onlinedocs/gcc/Volatiles.htmlDysplasia
@Lorehead thanks, I've added this link to the answer.Derivative