Memory barriers force cache coherency?
I was reading this question about using a bool for thread control and got intrigued by this answer by @eran:

Using volatile is enough only on single cores, where all threads use the same cache. On multi-cores, if stop() is called on one core and run() is executing on another, it might take some time for the CPU caches to synchronize, which means two cores might see two different views of isRunning_.

If you use synchronization mechanisms, they will ensure all caches get the same values, at the price of stalling the program for a while. Whether performance or correctness is more important to you depends on your actual needs.
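
For context, the pattern being debated looks roughly like this. A minimal sketch, assuming a worker loop and a stop flag; the names run(), stop() and isRunning_ follow the quoted answer, and std::atomic is one way to get the synchronization the quote alludes to (the debated variant uses a plain or volatile bool instead):

    #include <atomic>

    class Worker {
        std::atomic<bool> isRunning_{true};
    public:
        void run() {
            while (isRunning_.load()) {
                // ... do work ...
            }
        }
        void stop() { isRunning_.store(false); }
    };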

I have spent over an hour searching for some statement that says synchronization primitives force cache coherency but have failed. The closest I have come is Wikipedia:

The keyword volatile does not guarantee a memory barrier to enforce cache-consistency.

Which suggests that memory barriers do force cache consistency, and since some synchronization primitives are implemented using memory barriers (again from Wikipedia) this is some "evidence".

But I don't know enough to be certain whether to believe this, or to be sure that I'm not misinterpreting it.

Can someone please clarify this?

Childs answered 20/6, 2015 at 20:5 Comment(1)
That's right, the volatile keyword in C and C++ does nothing for thread synchronization (I don't remember about C#). Memory barriers do enforce cache coherence. You might want to read up on strong/weak memory models and memory ordering.Pharmacopsychosis

As I understand it, synchronization primitives won't affect cache coherency at all. Cache comes from the French for hidden; it's not supposed to be visible to the user. A cache coherency protocol should work without the programmer's involvement.

Synchronization primitives will affect the memory ordering, which is well defined and visible to the user through the processor's ISA.

A good source with detailed information is A Primer on Memory Consistency and Cache Coherence from the Synthesis Lectures on Computer Architecture collection.

EDIT: To clarify your doubt

The Wikipedia statement is slightly wrong. I think the confusion might come from the terms memory consistency and cache coherency. They don't mean the same thing.

The volatile keyword in C means that the variable is always read from memory (as opposed to a register) and that the compiler won't reorder loads/stores around it. It doesn't mean the hardware won't reorder the loads/stores. That is a memory consistency problem: when using weaker consistency models, the programmer is required to use synchronization primitives to enforce a specific ordering. This is not the same as cache coherency. For example, if thread 1 modifies location A and thread 2 afterwards loads location A, it will receive the updated (consistent) value. This should happen automatically if cache coherency is in place. Memory ordering is a different problem. You can check out the famous paper Shared Memory Consistency Models: A Tutorial for more information. One of the better-known examples is Dekker's Algorithm, which requires sequential consistency or synchronization primitives.
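
To make the ordering problem concrete, below is the store-buffer litmus test at the heart of why Dekker's Algorithm fails without barriers, sketched in C++11 (the variable names are mine). With relaxed atomics the outcome r1 == 0 && r2 == 0 is permitted and does occur on real hardware; requesting memory_order_seq_cst on all four accesses forbids it:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);  // may still sit in the store buffer...
        r1 = y.load(std::memory_order_relaxed); // ...when this load executes
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // With relaxed ordering, r1 == 0 && r2 == 0 is a legal result:
        // each thread's store can be delayed past the other thread's load.
    }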

EDIT2: I would like to clarify one thing. While my cache coherency example is correct, there is a situation where memory consistency might seem to overlap with it. This happens when stores are executed in the processor but delayed on their way to the cache (they sit in a store queue/buffer). Since the processor's own cache hasn't received the updated value, the other caches won't either. This may seem like a cache coherency problem, but in reality it is not; it is part of the memory consistency model of the ISA. In this case synchronization primitives can be used to flush the store queue to the cache. With this in mind, the Wikipedia text that you highlighted in bold is correct but this other one is still slightly wrong: The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. It should say: The keyword volatile does not guarantee a memory barrier to enforce memory consistency.
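
A minimal sketch of the "flush the store queue" effect, using the portable C++11 fence API (the flag name is mine; on x86, for instance, a sequentially consistent fence compiles to an MFENCE-style instruction that drains the store buffer before later memory operations):

    #include <atomic>

    std::atomic<bool> flag{false};

    void writer() {
        flag.store(true, std::memory_order_relaxed);
        // Full fence: the store above must drain from the store buffer into
        // the coherent cache before any memory operation after the fence,
        // at which point the coherency protocol propagates it as usual.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }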

Sphygmomanometer answered 20/6, 2015 at 20:31 Comment(20)
Thanks for commenting, but this kind of disagrees with the quote that I posted in my question - are you saying that the poster of what I quoted is wrong?Childs
It's a bit wrong, yes. I've updated my answer with extra information for you.Sphygmomanometer
Right, I am still confused, so please accept my apologies. Firstly, for the record, the part I emboldened is not from Wikipedia, it's from eran's answer on the linked SO question. Now, the first line in your answer is "As I understand, synchronization primitives won't affect cache coherency at all". You then say that the emboldened part I quoted ("If you use synchronization mechanisms, they will ensure all caches get the same values") is right. These two contradict each other! Either sync. primitives do affect cache coherency, or they don't...Childs
I try to clarify this point in EDIT2, but I understand it can be confusing. Cache coherency is a hardware protocol and the user does not control it. However, there are cases when a new value may be delayed in being written to the cache. In these cases none of the caches sees the new value. Here you can use synchronization primitives to flush the store queue to the cache. Once it's in the local cache, the cache coherency protocol will automatically make the new value visible to the other caches. Do you see the difference? The important thing to note is that cache coherency ≠ memory consistency.Sphygmomanometer
The book I linked to, A Primer on Memory Consistency and Cache Coherence, is excellent and explains these things quite well. I'm sure reading it will clarify your doubts.Sphygmomanometer
Thanks for sticking with me on this! OK, so as soon as I update one CPU's cache, the cache coherency protocol will take care of flushing that to the caches of other CPUs, right? That being the case, why do we need to use a sync primitive instead of a bool to stop a thread (as in the link in my question)? As soon as I write my bool, the other CPUs' caches will be updated according to what you are saying? But this means that the whole comment I originally questioned is wrong... is that correct?Childs
So, if we rephrase your question "why use synchronization primitives instead of bools to force memory consistency?", then we are getting somewhere interesting. To summarise an answer, you need more than one variable to synchronise and those variables need special properties to be serialised and flushed within a single processor. Even then, you need to be able to flush your critical section before leaving it. Read this about the problems encountered with Dekker's Algorithm run on an x86 machine without synchronization primitives.Sphygmomanometer
"you need more than one variable to synchronise" But we can synchronize with a single mutex?Childs
Yes, but this is a whole other issue. Mutexes are often implemented with special instructions like atomic test and sets. These behave differently to regular loads/stores. This is a much different question to your original though.Sphygmomanometer
+1 - this is more correct than my 4-year-old answer. On most cases, consistency, not coherence, is the issue, and this is where volatile miserably fails. If I could, another +1 for referencing those two papers, authored by some of the most noticeable researchers in the computer architecture community.Myel
OK. @eran Perhaps I am a little slow, or perhaps the answer has been lost in all this info. Can someone please post a new (summary?) answer to what I ultimately asked (why using a sync. primitive enables all cores to see an updated value when a bool doesn't) so future searchers won't have to wade through all the comments.Childs
@eran please don't take offence anyone, but the link, which talks about load and store barriers (which we mentioned in another comment), talks about "making program state visible to all CPUs" - if sync. primitives use these barriers then surely that is what I'm looking for - something saying that sync. primitives force all CPUs to see updated state? Is this correct?Childs
Also, usr's answer was downvoted. It wasn't me as I don't have the reputation to do so, hence I am assuming it was potentially @eran or hayesti. We're all friends here, and the overall aim is (hopefully) in helping me understand this, so can I please learn what was wrong with this answer?Childs
@Childs But we've answered this question several times over. If you encapsulate a critical section with a boolean, many processors will try to be "clever" and may speculatively execute the critical section before the boolean. Why do they do this? Because of the memory consistency model. The weaker it is, the more optimization can happen. How do we fix it? By temporarily making the memory consistency model very strong. How do we do that? Using processor-specific synchronization instructions. These are what's inside mutexes, semaphores, monitors, etc.Sphygmomanometer
@Childs The literature that I've linked to explains this in detail. There isn't much more I can clarify if you don't read the background text.Sphygmomanometer
@Childs Your latest link is fine and the statement "sync. primitives force all CPUs to see updated state" is fine. The problem was that you originally asked if they force cache coherency, which they don't. The clarification and discussion is coming from this.Sphygmomanometer
Wad, I agree with hayesti's comments above. I'm a bit short on time and can't read any additional material now, so I can't comment on that link. I have known the papers in the answer for quite some time, and think they are excellent resources. Coherence, consistency, memory models and such are very complicated topics, and wrapping your head around them requires some serious reading. As for @usr's answer, I have no idea who downvoted it and why. All I can say is I think hayesti's answer is better IMHO.Myel
@Sphygmomanometer How do you differentiate "sync. primitives force all CPUs to see updated state" vs "if they force cache coherency, which they don't"? Forcing all CPUs to see the same state is achieved through cache coherency, which requires that no cache line holding a copy of the memory address contains a stale value.Pharmacopsychosis
@ChrisO Sorry, I don't understand what you're asking me. Can you rephrase?Sphygmomanometer
@Sphygmomanometer Even when the value is written to the cache, the cache coherency protocol does not guarantee that this value is available in all other caches. We need a read memory barrier to enforce that. Take a look at my answer.Sigmatism

Short answer: Cache coherency works most of the time, but not always. You can still read stale data. If you don't want to take chances, just use a memory barrier.

Long answer: The CPU core is no longer directly connected to main memory. All loads and stores have to go through the cache. The fact that each CPU has its own private cache causes new problems: if more than one CPU is accessing the same memory, it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back to main memory yet) and a second processor tries to read the same memory location, the read operation cannot just go out to main memory; instead, the content of the first processor's cache line is needed. The question now is when this cache line transfer has to happen. That question is pretty easy to answer: when one processor needs a cache line that is dirty in another processor's cache, for reading or writing. But how can a processor determine whether a cache line is dirty in another processor's cache? Assuming it is just because the cache line was loaded by another processor would be suboptimal (at best); usually the majority of memory accesses are read accesses, and the resulting cache lines are not dirty. This is where cache coherency protocols come in: CPUs maintain data consistency across their caches via MESI or some other cache coherence protocol.

With cache coherency in place, shouldn't we always see the latest value of a cache line, even if it was modified by another CPU? After all, that is the whole purpose of cache coherency protocols. Usually when a cache line is modified, the corresponding CPU sends an "invalidate cache line" request to all the other CPUs. It turns out that CPUs can send an acknowledgement to an invalidate request immediately, yet defer the actual invalidation of the cache line to a later point in time. This is done via invalidation queues. Now, if we are unlucky enough to read the cache line within this short window (between the CPU acknowledging an invalidation request and actually invalidating the cache line), we can read a stale value. Why would a CPU do such a horrible thing? The simple answer is performance. So let's look at different scenarios where invalidation queues can improve performance:

  • Scenario 1: CPU1 receives an invalidation request from CPU2. CPU1 also has a lot of stores and loads queued up for the cache. This means that the invalidation of the requested cache line takes time, and CPU2 would get stalled waiting for the acknowledgment.

  • Scenario 2: CPU1 receives a lot of invalidation requests in a short amount of time. It takes time for CPU1 to invalidate all the cache lines.

Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. So invalidation queues are the reason why we may not see the latest value even when doing a simple read of a single variable.

Now the keen reader might be thinking: when the CPU wants to read a cache line, couldn't it scan the invalidation queue first before reading from the cache? That should avoid the problem. However, the CPU and the invalidation queue are physically placed on opposite sides of the cache, which prevents the CPU from directly accessing the invalidation queue. (The invalidation queues of one CPU's cache are populated by cache coherency messages from other CPUs via the system bus, so it makes sense for the invalidation queues to be placed between the cache and the system bus.) So in order to actually see the latest value of any shared variable, we should empty the invalidation queue. Usually a read memory barrier does that.
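
As a rough illustration of how the write-side and read-side barriers pair up in practice, here is the classic message-passing pattern sketched in C++11 (the names payload and ready are mine; on a weakly ordered machine the release store plays the role of the write barrier, and the acquire load plays the role of the read barrier that processes the invalidation queue):

    #include <atomic>

    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // Release store: the payload store must be visible to any thread
        // that later observes ready == true (write-barrier semantics).
        ready.store(true, std::memory_order_release);
    }

    void consumer() {
        // Acquire load: loads after it cannot read values staler than the
        // flag (read-barrier semantics, e.g. draining the invalidation queue).
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        int v = payload;              // guaranteed to read 42
        (void)v;
    }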

I have just talked about invalidation queues and read memory barriers. [1] is a good reference for understanding the need for read and write memory barriers, and for the details of the MESI cache coherency protocol.

[1] http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

Sigmatism answered 27/1, 2019 at 8:34 Comment(3)
Despite invalidation queues, most ISAs have a memory model that guarantees all other cores agree on the order of two stores (the IRIW litmus test). PowerPC is one notable exception with hardware that can really do it in practice. (ARMv7 allowed it on paper but no hardware ever did it; ARMv8 is multi-copy atomic). Will two atomic writes to different locations in different threads always be seen in the same order by other threads?Varices
Do invalidation queues introduce any new reordering possibilities, or do they just make it look like the reading core was further "ahead" in what it was doing than the other cores? I've never been clear on why they're relevant when thinking about memory ordering. (But I'm not very familiar with PowerPC.) Is there some litmus test where a final result is allowed on some machines which would be impossible with a store buffer, OoO exec / hit-under-miss of loads, but not invalidate queues? I should probably ask that as a new question.Varices
Someone asked a followup to this, if you're interested in answering: Does Cache Coherence always prevent reading a stale value? Do invalidation queues allow it?Varices

What Wikipedia tells you is that volatile does not mean a memory barrier will be inserted to enforce cache consistency. A proper memory barrier will, however, enforce that memory access between multiple CPU cores is consistent; you may find reading the std::memory_order documentation helpful.
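
A minimal sketch of what that looks like in practice, revisiting the stop-flag scenario from the question (the names are mine; the defaulted memory_order_seq_cst store and load already imply the necessary barriers, so the compiler emits the right fence instructions for the target CPU):

    #include <atomic>
    #include <thread>

    std::atomic<bool> stop{false};

    void worker() {
        while (!stop.load()) {   // seq_cst load by default
            // ... do work ...
        }
    }

    int main() {
        std::thread t(worker);
        stop.store(true);        // seq_cst store by default
        t.join();
    }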

Smoot answered 20/6, 2015 at 20:9 Comment(4)
Thanks. I understand about volatile but what I'm asking for is something that explicitly states that "A proper memory barrier will however enforce that memory access between multiple CPU cores is consistent" - can you point me at anything?Childs
It's also confusing because what I've read about cache syncing is that it happens in hardware - that being the case, how can a software "concept" force it?Childs
@Childs Some examples are the CLFLUSH and MFENCE IA32 instructions, a large pile of documentation can be found herePharmacopsychosis
@Childs I pointed you at std::memory_order which, together with std::atomic_thread_fence, can be used to insert memory barriers in your code. As each CPU architecture has its own fences and even differently strict requirements (weakly ordered vs strongly ordered for example), you can use this high level concept and have the compiler insert the right instruction for the target CPU. And of course the cache is implemented in hardware, but so is the ALU and that can be driven by software as well.Smoot
