Does a memory barrier ensure that the cache coherence has been completed?

4

31

Say I have two threads that manipulate the global variable x. Each thread (or each core I suppose) will have a cached copy of x.

Now say that Thread A executes the following instructions:

set x to 5
some other instruction

Now when set x to 5 is executed, the cached value of x will be set to 5; this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of x.

Now my question is: when x is actually set to 5 in Thread A's cache, do the caches of the other cores get updated before some other instruction is executed? Or should a memory barrier be used to ensure that, like this:

set x to 5
memory barrier
some other instruction

Note: assume that the instructions are executed in order; also assume that when set x to 5 is executed, 5 is immediately placed in Thread A's cache (so the instruction was not placed in a queue or something to be executed later).
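
In C++-like terms (just for illustration; the atomic flag is a made-up stand-in for "some other instruction"), the scenario I have in mind is roughly:

    #include <atomic>

    int x = 0;                              // the shared global (read by Thread B elsewhere)
    std::atomic<bool> other{false};         // stand-in for "some other instruction"

    void thread_a() {
        x = 5;                                                 // "set x to 5"
        std::atomic_thread_fence(std::memory_order_seq_cst);   // "memory barrier" - needed?
        other.store(true, std::memory_order_relaxed);          // "some other instruction"
    }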

Bondon answered 12/3, 2017 at 11:23 Comment(6)
Just a guess: no. IMO it takes more than one cycle to update the caches of other cores, so you have to use lock on the set, to wait for it and make it distribute properly. Without a lock, Thread B may see a partial modification, or even partially overwrite x (or even fully overwrite it or see the full old value). And the memory barrier variant will IMO not help: if both threads are writing into the variable without locking, even with a barrier you may still end up with a combined value from the two threads, with each thread writing a different part of it.Meeker
Are you asking if synchronization methods ensure cache is updated in other processors?Dorren
@Tony Tannous Yes. For example: when Thread A unlocks a mutex, does the unlock code contain a memory barrier that will make sure that the caches of the other cores have been updated before actually making the mutex available for the other threads to lock? So by the time Thread B locks the mutex, Thread B can be sure that all of the modifications done on the global variables by Thread A will be seen by Thread B?Bondon
Very much a hardware thing and could be implementation specific (one generation of x86 may have a different answer than another), but should all be well documented. Where in your system do the cores come together? L1, L2, system memory? And for each of the not-shared layers what does the documentation say in order to push those items out one layer? And most importantly what happened when you tried or didn't try each of these things, did it work for you?Montreal
FWIW, cache coherence normally doesn't work as you suggest. A CPU that modifies a value is generally not "pushing out" that value to other CPUs' caches on each modification. Rather, prior to modifying the value, copies in other CPUs' caches are invalidated (if there are any), and then the CPU is free to privately modify the value as many times as it wants until some other CPU needs the value. It is then that other CPU that triggers a cache coherence transaction to get the modified value... at least in most MESI-like systems. It is pull, not push.Lala
Also, like Margaret makes clear in her answer, there are really two distinct parts to this question: what is formally guaranteed by the CPU memory model, and how it works under the hood. You are mixing the two parts: you'll find that the memory models are written in a very general way and you can't answer questions about "how" that happens: but it should answer what you need to know to write a correct program. The "how" is practically an EE question, and changes from arch to arch.Lala
46

The memory barriers present on the x86 architecture - but this is true in general - not only guarantee that all the previous1 loads, or stores, are completed before any subsequent load or store is executed - they also guarantee that the stores have become globally visible.

By globally visible it is meant that other cache-aware agents - like other CPUs - can see the store.
Other agents not aware of the caches - like a DMA-capable device - will usually not see the store if the target memory has been marked with a cache type that doesn't enforce an immediate write to memory.
This has nothing to do with the barrier itself; it is a simple fact of the x86 architecture: caches are visible to the programmer, and when dealing with hardware they are usually disabled.

Intel is purposely generic in its description of the barriers because it doesn't want to tie itself to a specific implementation.
You need to think in the abstract: globally visible implies that the hardware will take all the necessary steps to make the store globally visible. Period.

To understand the barriers, however, it is worth taking a look at the current implementations.
Note that Intel is free to turn the modern implementation upside down at will, as long as it keeps the visible behaviour correct.

A store in an x86 CPU is executed in the core, then placed in the store buffer.
For example mov DWORD [eax+ebx*2+4], ecx, once decoded, is stalled until eax, ebx and ecx are ready2, then it is dispatched to an execution unit capable of computing its address.
When the execution is done the store has become a pair (address, value) that is moved into the store buffer.
The store is said to be completed locally (in the core).

The store buffer allows the out-of-order execution part of the CPU to forget about the store and consider it completed even if an attempt to write it has not been made yet.

Upon specific events, like a serialization event, an exception, the execution of a barrier or the exhaustion of the buffer, the CPU flushes the store buffer.
The flush is always in order - First In, First written.

From the store buffer the store enters the realm of the cache.
If the target address is marked with a WC cache type, the store can be coalesced into yet another buffer, the Write Combining buffer (and later written to memory bypassing the caches). If the cache type is WB or WT, it can be written into the L1D cache, or into the L2, the L3 or the LLC if it is not present in one of the previous levels.
It can also be written directly to memory if the cache type is UC or WT.


As of today, that's what it means to become globally visible: leave the store buffer.
Beware of two very important things:

  1. The cache type still influences the visibility.
    Globally visible doesn't mean visible in memory; it means visible where loads from other cores will see it.
    If the memory region is WB cacheable, loads from other cores could be satisfied from the cache, so the store is globally visible there - but only for agents aware of the existence of the cache. (But note that most DMA on modern x86 is cache-coherent).
  2. This also applies to the WC buffer, which is non-coherent.
    The WC buffer is not kept coherent - its purpose is to coalesce stores to memory areas where the order doesn't matter, like a framebuffer. This is not really globally visible yet; only after the write-combining buffer is flushed can anything outside the core see it.

sfence does exactly that: it waits for all the previous stores to complete locally and then drains the store buffer.
Since each store in the store buffer can potentially miss, you see how heavy such an instruction is. (But out-of-order execution, including later loads, can continue. Only mfence would block later loads from becoming globally visible (reading from the L1d cache) until after the store buffer finishes committing to cache.)
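
For illustration, the place where an explicit sfence typically shows up in practice is with weakly-ordered non-temporal stores followed by a flag; a rough sketch using the standard intrinsics:

    #include <immintrin.h>
    #include <atomic>

    int buf[1024];
    std::atomic<bool> ready{false};

    void producer() {
        for (int i = 0; i < 1024; ++i)
            _mm_stream_si32(&buf[i], i);   // NT store: bypasses the caches via the WC buffers
        _mm_sfence();                      // make the NT stores globally visible before the flag
        ready.store(true, std::memory_order_release);
    }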

But does sfence wait for the store to propagate into other caches?
Well, no.
Because there is no propagation - let's see what a write into the cache implies from a high-level perspective.

The cache is kept coherent among all the processors with the MESI protocol (MESIF for multi-socket Intel systems, MOESI for AMD ones).
We will only see MESI.

Suppose the write indexes the cache line L, and suppose all the processors have this line L in their caches with the same value.
The state of this line is Shared, in every CPU.

When our store lands in the cache, L is marked as Modified and a special transaction is made on the internal bus (or QPI for multi-socket Intel systems) to invalidate line L in other processors.

If L was not initially in the S state, the protocol transitions change accordingly (e.g. if L is in state Exclusive no transactions on the bus are done[1]).

At this point the write is complete and sfence completes.

This is enough to keep the cache coherent.
When another CPU requests line L, our CPU snoops that request and L is flushed to memory or onto the internal bus so the other CPU will read the updated version.
The state of L is set to S again.

So basically L is read on-demand - this makes sense since propagating the write to other CPUs is expensive and some architectures do it by writing L back into memory (this works because the other CPU has L in state Invalid so it must read it from memory).
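
To make the write-invalidate step above concrete, here is a deliberately simplified toy model of it - just the state transitions described above, not how real hardware is organized:

    #include <vector>

    enum class State { Modified, Exclusive, Shared, Invalid };
    struct Line { State state = State::Invalid; int value = 0; };

    // One copy of line L per core's private cache.
    void write_line(std::vector<Line>& caches, int writer, int new_value) {
        if (caches[writer].state == State::Shared) {
            // Bus/QPI transaction: invalidate every other copy of L first.
            for (int c = 0; c < (int)caches.size(); ++c)
                if (c != writer) caches[c].state = State::Invalid;
        }
        // Nothing is pushed into the other caches: the writer just modifies its own copy.
        caches[writer].state = State::Modified;
        caches[writer].value = new_value;
        // A core that later reads L triggers a coherence transaction on demand (L goes back to S).
    }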


Finally, it is not true that sfence et al. are normally useless; on the contrary, they are extremely useful.
It is just that normally we don't care how other CPUs see our stores being made - but acquiring a lock without acquire semantics - as defined, for example, in C++ and implemented with fences - would be totally nuts.

You should think of the barriers as Intel describes them: they enforce the order of global visibility of memory accesses.
You can help yourself understand this by thinking of the barriers as enforcing the order of writes into the cache. Cache coherence will then take care of assuring that a write to a cache is globally visible.
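
For instance, the usual message-passing idiom relies on exactly this split: the ordering guarantees that the payload is written into the cache before the flag, and cache coherence makes each committed write visible to the reader on demand. A rough C++ sketch:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> flag{false};

    void writer() {
        payload = 42;
        flag.store(true, std::memory_order_release);   // ordering: payload is committed first
    }

    void reader() {
        while (!flag.load(std::memory_order_acquire)) { }   // wait until the flag is visible
        int r = payload;   // coherence guarantees the up-to-date payload is read here (r == 42)
        (void)r;
    }

On x86 both the release store and the acquire load compile to plain mov: the ordering is already guaranteed by the FIFO store buffer, and coherence does the rest.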

I can't help but stress one more time that cache coherency, global visibility and memory ordering are three different concepts.
The first guarantees the second, which is enforced by the third.

Memory ordering -- enforces --> Global visibility -- needs -> Cache coherency
'.______________________________'_____________.'                            '
                 Architectural  '                                           '
                                 '._______________________________________.'
                                             micro-architectural

Footnotes:

  1. In program order.
  2. That was a simplification. On Intel CPUs, mov [eax+ebx*2+4], ecx decodes into two separate uops: store-address and store-data. The store-address uop has to wait until eax and ebx are ready, then it is dispatched to an execution unit capable of computing its address. That execution unit writes the address into the store buffer, so later loads (in program order) can check for store-forwarding.

When ecx is ready, the store-data uop can dispatch to the store-data port, and write the data into the same store buffer entry.

This can happen before or after the address is known, because the store-buffer entry is probably reserved in program order, so the store buffer (aka memory order buffer) can keep track of load / store ordering once the address of everything is eventually known, and check for overlaps. (And for speculative loads that ended up violating x86's memory ordering rules, if another core invalidated the cache line they loaded from before the earliest point they were architecturally allowed to load. This leads to a memory-order mis-speculation pipeline clear.)

Ethe answered 12/3, 2017 at 17:39 Comment(23)
Why would barriers ensure the loads and stores are complete when you can make them globally visible (in the context of cache-aware components) long before they're complete?Mesothelium
@Mesothelium a store is complete when it reaches the caches and it cannot be visible any sooner.Ethe
Oh. Terminology disagreement, I guess.Mesothelium
excellent information!! but, when a write is done on a cache line which has Shared coherent state (MESI), in terms of event ordering, doesn't it change to M (S->M) and all other caches to (S->I), before the store is placed in store buffer?Sliest
@IsuruH The store buffer is before the cache. When the CPU drains the store buffer it writes into the caches (if applicable) and each write entails the management of the MESI (et al.) state.Ethe
@MargaretBloom, thanks! If I understand correctly, deciding when to drain the store buffer is what actually affects the event ordering and global visibility etc.Sliest
@IsuruH It affects the global visibility. The ordering is affected by how the stores enter the store buffer. The SB is drained in order, FIFO.Ethe
Where do you read that barriers imply global visibility? IMO barriers imply the intra-access ordering they are documented to enforce, but not necessarily anything more. Global visibility (however you want to define it) is another thing entirely. It is totally possible for a processor to execute a barrier and not have preceding stores be visible until some time later (I don't think "globally visible" has a meaning for reads, does it?). Indeed, that's exactly how SFENCE is often implemented on x86: it doesn't force any stores to be visible (more generally, it is almost a no-op on x86).Lala
It is better to avoid terms like "global visibility" anyway when discussing the nuances of memory models: it's hard to define that, since it's all relativistic anyway: there is no "global clock" that you can use to make decisions about what became visible when. That's why memory models are mostly defined in relative terms: what you observe about B if you have seen A, and what some other actor will observe about those two things, etc. You can talk about "sequential consistency", but that again sidesteps visibility. You don't even find "visibility" mentioned in the Intel ordering whitepaper.Lala
@Lala The sfence page in volume 2 reads: "... in program order becomes globally visible before any..". I understand that barriers should enforce the ordering only and leave the visibility to other instructions, but it seems they are fused on x86. sfence is not a nop, it drains the store buffer. I believe the "fusion" is due to the fact that the caches are architecturally visible while the store buffer is not (or should not be).Ethe
@Lala So, from an Intel POV, a store completes when it reaches the caches since, formally, there can't be any intermediate state. This is also true for ARM with the dmb instruction where the programmer has more control over the final region of destination. It is not true for PowerPC, where the barriers (like eieio or *sync) are just ordering and not visibility control (managed by dcbf for example).Ethe
Well, using a term like "the store is complete" is part of the confusion above. It's open to interpretation. I could argue that a store completes when it is no longer speculative in the ROB, or when it reaches DRAM, or when it reaches disk for a memory-mapped file, etc, etc. SFENCE does not drain the store buffer! SFENCE is only around for some "weird" types of stores that bypass the store buffer. The store buffer itself is inherently ordered: that's a big reason why it's there in the first place (also to kill speculative stores).Lala
(Normal) stores are already strongly ordered on Intel: stores cannot pass stores. So from their perspective SFENCE is a no-op. It does nothing to the store buffer, and you won't see (correct) code using it unless they are (a) using weird memory types like WC or the NT stores. That's why SFENCE executes in 1-6 cycles on various Intel and AMD archs (including Ryzen, which can issue 6 SFENCE a cycle!). Instructions that really drain the store buffers, like MFENCE and all the LOCK instructions, take 20+ cycles.Lala
You are right though that they do use the term "globally visible" in their instruction set guide! Note that in the SFENCE section they actually kind of obtusely explain that it's not for "normal" stores. Actually the fact that Intel was for a long time unclear on their exact memory model (you had to glean what you could from various dispersed descriptions including the barrier ones) was a big reason they put out the whitepaper, and there were even a few academic papers on the topic.Lala
@Lala Intel also says, in section 11.10, that sfence drains the store buffer, so I stayed loyal to that. But I've heard of the behaviour you are describing. I can't recall where, though. Do you mind sharing a link? P.S. what are those "weird" stores that bypass the SB?Ethe
Finally, from the PoV of the memory model both the caches and the store buffer, and all sorts of other speculative mechanisms are basically invisible. The memory model doesn't talk about them. It just says what other agents in the system are allowed and not allowed to observe. The store buffer often comes up in the discussion because it's one way, on a single chip, to easily see StoreLoad re-ordering (the only allowed re-ordering on x86 for normal ops). MESI does a fine job of keeping full apparent ordering, so without the buffers you might not even have that reordering.Lala
@MargaretBloom - well, read any section which primarily talks about P4 as the most current arch with a grain of salt, but this section seems right: it talks about draining the WC buffers. Those are very different from the store buffers. Those are the write-combining buffers for the "weird" stores I mentioned above, either the explicit NT stores, or for cases where the memory has been marked WC. Since these aren't cacheable, Intel has some limited (like 4 or so) WC buffers to absorb multiple stores to the same line before sending them out to DRAM. They don't play for normal stores.Lala
Here's a good link from Peter Cordes on the topic. Intel do a pretty bad job of explaining that you pretty much never need these barriers for normal code, and so when people read about barriers in a textbook or some other more idealized architecture (say SPARC where they explicitly have all the barrier types), they come over to Intel and naturally look for barriers. For useful standalone barriers you only need the full MFENCE for normal ops...Lala
... but it's usually more expensive than the LOCKed instructions which also give you a full barrier and an atomic op to boot! So on high-performance implementations you just see a redundant LOCKed op to the stack for a "barrier" and only when you need a StoreLoad barrier, since that's the only allowed re-ordering on Intel. Put another way, plain Intel stores already have "release" semantics. Check this mapping of barriers to instructions. BTW, if SFENCE did drain the store buffer you could massively speed up many concurrent algorithms and runtimes!Lala
@Lala Wait... Are you saying that the section 11.10, where Intel uses the term "store buffer" should actually read "WC buffer"? Thanks for those links, I didn't know that NT moves to/from WC memory types are weakly ordered (other than cache bypassing)! Anyway, I've found no proof that sfence doesn't actually drain the SB. Granted that it is useless for reordering normal stores, this alone doesn't imply that sfence has no accessory function (i.e. ordering + visibility) My version of Fog's inst table doesn't list the latency for the fences. Honestly, I'm confused... IDK what to think.Ethe
OK, so now I am correctly reading 11.10. Somehow I was reading from 11.3.1 before (which does talk about WC buffers). You know what? I was partly mixing up SFENCE performance with LFENCE performance, so there is no "1 cycle SFENCE" on Ryzen - it takes 20c there. More numbers. So yeah, I think you are actually right: SFENCE needs to drain both the store buffer and WC buffers to do its thing, otherwise how could it guarantee that [normal store, sfence, weak store] would be properly ordered?Lala
So where I'm at: 8.2, which is really the core doc for memory ordering, makes it pretty clear in several places that SFENCE is only useful when weak stores are in the mix, but it still relates to normal stores in its operation since you might have a mix of normal and weak stores. So I believe it does the equivalent of draining the store buffer. Now SFENCE seems to run in 6 cycles on recent Intel (but 20c on AMD), while the atomic ops are all around 20c, so if it could be usefully used for normal stores it would be great. But you have no guarantee that it actually prevents StoreLoad reordering.Lala
Let us continue this discussion in chat.Lala
4

Now when set x to 5 is executed, the cached value of x will be set to 5; this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of x.

There are multiple different x86 CPUs with different cache coherency protocols (none, MESI, MOESI), plus different types of caching (uncached, write-combining, write-protect, write-through, write-back).

In general when a write is being done (when setting x to 5) the CPU determines the type of caching being done (from MTRRs or TLBs), and if the cache line could be cached it checks its own cache to determine what state that cache line is in (from its own perspective).

Then the type of caching and the state of the cache line is used to determine if the data is written directly to the physical address space (bypassing caches), or if it has to fetch the cache line from elsewhere while simultaneously telling other CPUs to invalidate old copies, or if it has exclusive access in its own caches and can modify it in the cache without telling anything.

A CPU never "injects" data into another CPU's cache (and only tells other CPUs to invalidate/discard their copy of a cache line). Telling other CPUs to invalidate/discard their copy of a cache line causes them to fetch the current copy of it if/when they want it again.

Note that none of this has anything to do with memory barriers.

There are 3 types of memory barriers (sfence, lfence and mfence), which tell the CPU to complete stores, loads or both before allowing later stores, loads or both to occur. Because the CPU is normally cache coherent anyway these memory barriers/fences are normally pointless/unnecessary. However there are situations where the CPU is not cache coherent (including "store forwarding", when the write-combining caching type is being used, when non-temporal stores are being used, etc). Memory barriers/fences are needed to enforce ordering (if necessary) for these special/rare cases.
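
For example, the one reordering that normal loads/stores do allow on x86 (a later load passing an earlier store, due to the store buffer and store forwarding) is the case where a fence actually changes the outcome; a rough sketch:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_a() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // typically an mfence on x86
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread_b() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }

    // Without the fences, r1 == 0 && r2 == 0 is possible: each store can still be sitting
    // in its core's store buffer while the later load reads the old value.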

Koeninger answered 12/3, 2017 at 12:1 Comment(4)
"Because the CPU is normally cache coherent anyway these memory barriers/fences are normally pointless/unnecessary" But you said that memory barriers are used to tell the CPU to complete stores, loads or both before allowing later stores, loads or both to occur. I have read that a CPU can put store operations in a queue and execute them later, so we should use a memory barrier if we want them to be executed before continuing with the rest of our instruction. Am I missing something?Bondon
Your answer nails the point (MESI/MOESI doesn't push data into other caches, so the OP question is ill-formed - no need to wait for anything to complete), but the last paragraph is wrong. You are confusing memory ordering with cache coherence. Once in the cache, at least for x86 systems, data is globally visible. But due to reordering and the store buffer, the time a store becomes globally visible is not in program order or at the time the store is completed -> hence the barriers.Ethe
@Christopher: For normal RAM using normal write-back caching the CPU's memory ordering ensures that everything is ordered in a sane way without any barriers/fences. The "put store operations in a queue and execute them later" is a relatively abnormal special case (involving "write-combining caching and not write-back" and/or non-temporal stores) where the CPU's normal memory ordering is being deliberately bypassed (and causes the need for barriers/fences because normal memory ordering is deliberately bypassed).Koeninger
The caches do cache physical address space. I think you were trying to use a broad term to cover DRAM and I/O space, but as soon as a store commits to L1d cache and thus becomes globally visible, it has been written to "physical address space". IDK if non-cache-coherent DMA is still possible on modern x86; with integrated memory controllers, device DMA can (and does) normally snoop cache on the way to DRAM.Herbert
3

No, a memory barrier does not ensure that cache coherence has been "completed". It often involves no coherence operation at all and can be performed speculatively or as a no-op.

It only enforces the ordering semantics described in the barrier. For example, an implementation might just put a marker in the store queue such that store-to-load forwarding doesn't occur for stores older than the marker.

Intel, in particular, already has a strong memory model for normal loads and stores (the kind that compilers generate and that you'd use in assembly) where the only possible re-ordering is later loads passing earlier stores. In the terminology of SPARC memory barriers, every barrier other than StoreLoad is already a no-op.

In practice, the interesting barriers on x86 are attached to LOCKed instructions, and the execution of such an instruction doesn't necessarily involve any cache coherence at all. If the line is already in an exclusive state, the CPU may simply execute the instruction, making sure not to release the exclusive state of the line while the operation is in progress (i.e., between the read of the argument and writeback of the result) and then only deal with preventing store-to-load forwarding from breaking the total ordering that LOCK instructions come with. Currently they do that by draining the store queue, but in future processors even that could be speculative.

What a memory barrier or barrier+op does is ensure that the operation is seen by other agents in a relative order that obeys all the restrictions of the barrier. That certainly doesn't usually involve pushing the result to other CPUs as a coherence operation, as your question implies.
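
As a concrete illustration, the usual way to get such a barrier+op on x86 is a LOCKed read-modify-write; a rough C++ sketch of a spinlock built on one (the unlock needs no extra fence, since plain x86 stores already have release semantics):

    #include <atomic>

    std::atomic<int> lock_word{0};

    void lock() {
        // exchange compiles to an (implicitly LOCKed) xchg on x86:
        // the atomic op and the full barrier in a single instruction.
        while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
            // spin
        }
    }

    void unlock() {
        lock_word.store(0, std::memory_order_release);   // plain mov; no fence emitted
    }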

Lala answered 18/3, 2017 at 21:33 Comment(0)
-1

If no other processor has X in its cache, doing x=5 on processor A will not update the caches in any other processor. If processor B reads variable X, processor A will detect the read (this is called snooping) and will provide the data, 5, on the bus for processor B. Now processor B will have the value 5 in its cache. If no other processor reads variable X then their caches will never be updated with the new value 5.

Kuroshio answered 17/4, 2020 at 20:6 Comment(8)
That's a really misleading description. An x=5 store will invalidate any other cached copies of the line before it can modify its copy (i.e. it gets exclusive ownership); that's how other cores know they need to re-fetch the value instead of using a locally-cached value. You make it sound like they still have an old cached value (not possible with coherent caches that use MESI cache coherence) but somehow they still make a request that the writing core can see.Herbert
After one core invalidates every other core's copy of the line, yes it's true that if no other core reads x then they won't cache the new (or any) value for it.Herbert
Thanks Peter, that's right. I was assuming no other processor had X cached. Will edit and clarify.Kuroshio
That would be one possible way for caches to work in theory (and is how en.wikipedia.org/wiki/MESI_protocol describes it), but this is an x86 question about memory barrier instructions. Your answer doesn't mention the store buffer or barriers. That snoop model doesn't match how CPUs really work. Having every core snoop every off-core load done by every other core would not be scalable at all. They aren't all connected to a single shared bus to memory (or L3) where they all naturally see every other request. e.g. Intel CPUs use a ring bus between cores, with L3 tags as snoop filter.Herbert
How will a non-neighbor core get the data if it does a load of X?Kuroshio
On an Intel CPU like I was talking about, it misses in L1 then L2, then sends a message over the ring bus to request that line from L3. It could hit there if the other core has already written back, but if not it misses in L3 and the L3 tags indicate which core owns a modified copy of the line, so the L3 cache controller can send a share request over the ring bus to that core to write-back that cache line to L3, and satisfy that other core's load. This is what I meant by L3 tags acting as a snoop filter: instead of all cores snooping themselves, there's a shared cache that knows who has whatHerbert
related: How are the modern Intel CPU L3 caches organized? and Which cache mapping technique is used in intel core i7 processor? have some more details. This is en.wikipedia.org/wiki/Directory-based_cache_coherence. AMD CPUs actually can, I think, do direct cache-to-cache transfers of dirty data without writing back to a shared cache (MOESI instead of Intel's MESIF), but again use some kind of filter to avoid the scaling disaster for many-core chips of having each core broadcast to all.Herbert
Hey, @DavidP was writing about processors, not cores of a single processor. This must be kept in mind for server development. Though Xeons have a cross-processor link for cache coherency.Who
