What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core?

Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote).

What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously), so now a read of X wouldn't return 10 as the thread expects. Is there some L1 cache synchronization that occurs when a thread is scheduled on a different core?

Inherited answered 5/2, 2020 at 17:53 Comment(1)
I wanted to tag this with memory-order, but this tag is currently considered as a synonym to memory-barriers, which is confusing.Pietje

All that is required in this case is that the writes performed while on the first processor become globally visible before the process begins executing on the second processor. In the Intel 64 architecture this is accomplished by including one or more instructions with memory fence semantics in the code that the OS uses to transfer the process from one core to another. An example from the Linux kernel:

/*
 * Make previous memory operations globally visible before
 * sending the IPI through x2apic wrmsr. We need a serializing instruction or
 * mfence for this.
 */
static inline void x2apic_wrmsr_fence(void)
{
    asm volatile("mfence" : : : "memory");
}

This ensures that the stores from the original core are globally visible before execution of the inter-processor interrupt that will start the thread running on the new core.

Reference: Sections 8.2 and 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-071, October 2019).

Gaseous answered 5/2, 2020 at 18:46 Comment(0)

TL;DR: It depends on the architecture and the OS. On x86, this type of read-after-write hazard is mostly not an issue that has to be considered at the software level, except for weakly-ordered WC stores, which require a store fence to be executed in software on the same logical core before the thread is migrated.


Usually the thread migration operation includes at least one memory store. Consider an architecture with the following property:

  • The memory model is such that memory stores may not become globally observable in program order. This Wikipedia article has a not-accurate-but-good-enough table that shows examples of architectures that have this property (see the row "Stores can be reordered after stores").

The ordering hazard you mentioned may be possible on such an architecture because even if the thread migration operation completes, it doesn't necessarily mean that all the stores that the thread has performed are globally observable. On architectures with strict sequential store ordering, this hazard cannot occur.

On a completely hypothetical architecture where a thread could be migrated without performing a single memory store (e.g., by directly transferring the thread's context to another core), the hazard can occur even if all stores are sequentially ordered, provided the architecture has the following property:

  • There is a "window of vulnerability" between the time when a store retires and when it becomes globally observable. This can happen, for example, due to the presence of store buffers and/or MSHRs. Most modern processors have this property.

So even with sequential store ordering, it may be possible that the thread running on the new core may not see the last N stores.

Note that on a machine with in-order retirement, the window of vulnerability is a necessary but insufficient condition for a memory model that allows stores to become globally observable out of program order.

Usually a thread is rescheduled to run on a different core using one of the following two methods:

  • A hardware interrupt, such as a timer interrupt, occurs that ultimately causes the thread to be rescheduled on a different logical core.
  • The thread itself performs a system call, such as sched_setaffinity, that ultimately causes it to run on a different core.

The question is at which point the system guarantees that retired stores become globally observable. On Intel and AMD x86 processors, hardware interrupts are fully serializing events, so all user-mode stores (both cacheable and uncacheable) are guaranteed to be globally observable before the interrupt handler is executed, in which the thread may be rescheduled to run on a different logical core.

On Intel and AMD x86 processors, there are multiple ways to perform system calls (i.e., change the privilege level), including INT, SYSCALL, SYSENTER, and far CALL. None of them guarantees that all previous stores become globally observable. Therefore, the OS is supposed to do this explicitly when scheduling a thread on a different core, by executing a store fence operation. This is done as part of saving the thread context (architectural user-mode registers) to memory and adding the thread to the queue associated with the other core. These operations involve at least one store that is subject to the sequential ordering guarantee. So when the scheduler runs on the target core, the full architectural register and memory state of the thread (as of its last retired instruction) is available on that core.

On x86, if the thread uses stores of type WC, which are not sequentially ordered, the OS may not guarantee in this case that it will make these stores globally observable. The x86 spec explicitly states that in order to make WC stores globally observable, a store fence has to be used (either in the thread on the same core or, much more simply, in the OS). An OS generally should do this, as mentioned in @JohnDMcCalpin's answer. Otherwise, if the OS doesn't provide the program-order guarantee to software threads, then the user-mode programmer may need to take this into account. One way would be the following:

  1. Save a copy of the current CPU mask and pin the thread to the current core (or any single core).
  2. Execute the weakly-ordered stores.
  3. Execute a store fence.
  4. Restore the CPU mask.

This temporarily disables migration to ensure that the store fence is executed on the same core as the weakly-ordered stores. After executing the store fence, the thread can safely migrate without possibly violating program order.

Note that user-mode sleep instructions, such as UMWAIT, cannot cause the thread to be rescheduled on a different core because the OS does not take control in this case.


Thread Migration in the Linux Kernel

The code snippet from @JohnDMcCalpin's answer falls on the path to send an inter-processor interrupt, which is achieved using a WRMSR instruction to an APIC register. An IPI may be sent for many reasons, for example, to perform a TLB shootdown operation. In that case, it's important to ensure that the updated paging structures are globally observable before invalidating the TLB entries on the other cores. That's why x2apic_wrmsr_fence, which is invoked just before sending an IPI, may be needed.

That said, I don't think thread migration requires sending an IPI. Essentially, a thread is migrated by removing it from some data structure that is associated with one core and adding it to the one associated with the target core. A thread may be migrated for numerous reasons, such as when its affinity changes or when the scheduler decides to rebalance the load. As mentioned in the Linux source code, all paths of thread migration in the source code end up executing the following:

stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg)

where arg holds the task to be migrated and the destination core identifier. migration_cpu_stop is the function that does the actual migration. However, the task to be migrated may currently be running or waiting in some runqueue on the source core (i.e., the core on which the task is currently scheduled). The task has to be stopped before being migrated. This is achieved by adding a call to the function migration_cpu_stop to the queue of the stopper task associated with the source core. stop_one_cpu then marks the stopper task as ready for execution. The stopper task has the highest priority, so on the next timer interrupt on the source core (which could be the same as the current core), one of the tasks with the highest priority will be selected to run. Eventually, the stopper task will run and execute migration_cpu_stop, which in turn performs the migration. Since this process involves a hardware interrupt, all stores of the target task are guaranteed to be globally observable.


There appears to be a bug in x2apic_wrmsr_fence

The purpose of x2apic_wrmsr_fence is to make all previous stores globally observable before sending the IPI. As discussed in this thread, SFENCE is not sufficient here. To see why, consider the following sequence:

store
sfence
wrmsr

The store fence here can order the preceding store operation, but not the MSR write. The WRMSR instruction doesn't have any serializing properties when writing to an APIC register in x2APIC mode. This is mentioned in the Intel SDM volume 3 Section 10.12.3:

To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers.

The problem here is that MFENCE is also not guaranteed to order the later WRMSR with respect to previous stores. On Intel processors, it's documented to order only memory operations; only on AMD processors is it guaranteed to be fully serializing. So to make it work on Intel processors, there needs to be an LFENCE after the MFENCE (SFENCE is not ordered with LFENCE, so MFENCE must be used even though we don't need to order loads). Section 10.12.3 actually mentions this.

Pietje answered 9/2, 2020 at 19:37 Comment(19)
So you're claiming that movntps [mem], xmm0 ; syscall; mov eax, [mem] would be justified in having mov possibly reload a stale value of mem? (So the kernel could just use plain stores, doing acq/rel synchronization on the architectural state which doesn't respect NT stores). That doesn't seem right to me. (Of course weakly-ordered ISAs that can reorder normal stores need a barrier or release-store to make sure any pending user-space stores are visible to another core that loads the architectural state.) John McCalpin's answer is about a barrier in the kernel, not userspace.Tweedsmuir
@PeterCordes Why not? I don't see which part of the x86 spec guarantees that movntps [mem], xmm0 becomes observable from another core at any given time unless something happens that makes the store globally observable (such as executing a store fence). Yea now I notice that John McCalpin's answer is about a fence in the kernel. But I disagree. AFAIK, Linux and Windows don't guarantee that all stores of a thread will be observed on another core if it gets migrated to it. I think this is something that the thread itself has to do.Pietje
@HadiBrais How could the thread possibly do that since it has no idea where in its execution it might get migrated? The scheduler has to do this.Tybie
@DavidSchwartz It doesn't have to know that and it doesn't matter whether it may get migrated and when. The x86 manual specifies that if you want to make a WC store observable from another agent, you have to explicitly use a store fence. This applies irrespective of whether that other agent happens to be executing the same thread (because the thread migrated to it) or some other thread. But sure, an OS can provide this guarantee by always executing a store fence when migrating a thread. I don't think Linux or Windows provide this guarantee though. Fundamentally, the thread itself has to do it.Pietje
@DavidSchwartz Although I can see why that would be difficult. The thread may get migrated just before executing the store fence, which has to be executed on the same core, not the core it's being migrated to. So if the thread has to do it, it has to keep track of which core it's running on. Preferably the kernel just provides this guarantee. My doubt is that I've never read anywhere that Linux or Windows provide this guarantee in all past and future versions. But if the kernel does it, then the thread doesn't have to track which core it's running on, which is nice.Pietje
@HadiBrais See my answer. If a thread has the guarantee that a read will see a previous store, then anything that migrates threads must preserve this guarantee. It's absurd to put this burden on the user-space code in a pre-emptive multitasking OS because that code has no way to know where it might get switched. Not assuring that in the scheduler (or elsewhere in the OS) is a complete non-starter. (It's also absurdly inefficient. The CPU goes to great cost to provide this guarantee. For the OS to remove it for all user-space code for no great gain would be utterly self-defeating.)Tybie
context switch triggered by interrupts definitely have to respect reloads of NT stores because that can happen asynchronously. e.g. movnt / migrate / sfence leaves the NT store in flight on the old => disaster. @DavidSchwartz: I also don't buy Hadi's argument that a syscall between an NT store and a reload in the same thread could be allowed to break program order within a single thread, but that is something a thread can avoid. Context switch, even when triggered by a syscall, must not break that thread's program-order visibility of its own operations. That way lies madness.Tweedsmuir
@DavidSchwartz Single-thread memory ordering rules apply within a logical core that is executing a single instruction stream. At this level, there is no concept of thread migration. So a software thread has the guarantee that a read will see a previous store only if both accesses are executed on the same logical core. Also multiple software threads that are sequentially executed on the same logical core do get this guarantee as well. Otherwise, the multiprocessor memory ordering rules apply, including on the case of thread migration.Pietje
But the operations involved in thread migration would have to be considered as discussed in my answer. But yes, at a higher level of abstraction, I agree with you and @PeterCordes that a pre-emptive multitasking OS should provide this same guarantee at the level of a software thread in spite of possible migration. My answer says that this requires using an explicit store fence in x86 in the kernel. But it's not like any OS has to be like that by definition.Pietje
I don't see which part of the x86 spec guarantees that movntps [mem], xmm0 becomes observable from another core at any given time. But it is guaranteed that the thread that did the NT store can see it immediately, like any other store. Lack of visibility guarantee is exactly the problem; migration must not be allowed to break program order of a single thread even when it reloads its own NT stores. My example was for a single thread that (foolishly) did an NT store and immediate reload. (On x86, only NT stores are a problem, assuming plain mov acq/rel of other state in the kernel.)Tweedsmuir
@PeterCordes Right, the memory model guarantees this if both the store and load are executed on the same logical core. And yes, migration must not be allowed to break the program order at the level of a software thread, but the software/kernel may be required to do something to ensure this, such as executing a store fence.Pietje
It wasn't previously clear exactly what you were arguing. Your last 2 comments make sense, but earlier claims that a thread might have to worry about its own migration make no sense unless that's limited to migrations triggered synchronously (e.g. by syscall). That would be a plausible design, but probably not what any real OSes do. Flushing for async interrupts would guaranteed for free on x86 if x86 interrupts were truly serializing (draining the store buffer and WC buffers as well as ROB), but they aren't.Tweedsmuir
A store buffer full of graduated cache-miss stores can cause high interrupt latency, but my understanding is that's mostly because any normal stores done by the ISR will have to wait for the SB to drain before they can become visible. (And in / out instructions or lock-anything have to wait before they can happen at all). Also iret is serializing. Anyway, your answer says (ed:said) "I don't know which real OSes provide such a guarantee." I think the only sane answer is "all of them". Migration after acq/rel sync that didn't respect NT stores would be considered a bug by most OSes.Tweedsmuir
@PeterCordes I initially thought the thread has to use a store fence if it wants to get that guarantee, but after carefully thinking about it, most OSes should provide the program order guarantee in spite of thread migration. I think that's where I was wrong and the discussion with you and David helped me think more carefully about it. I've edited my answer to improve that part. If there is anything else that I've missed, please let me know.Pietje
Your wording here about an OS "maybe not" providing that guarantee doesn't look conditional on synchronous migrations. That would be the only sane thing. Another problem: you say "hardware interrupts are fully serializing events". But in Estimating of interrupt latency on the x86 CPUs, you say they aren't. (And other discussion with @Bee and maybe you has talked about ISRs starting to execute while there are still graduated stores in the store buffer left over from user-space.)Tweedsmuir
Much of this is probably moot on modern x86 OSes with slow Spectre mitigations, or even just Meltdown + MDS mitigations. Yay? :(Tweedsmuir
@PeterCordes Oh, I think that part of my other answer (which cites one of your answers) is wrong. Section 11.10 of the Intel manual V3 says that the store buffer is drained when an interrupt occurs. The same applies to WC buffers and on AMD. Hmm, but are they fully serializing? I gotta go get some food and will think about it later :)Pietje
@HadiBrais: When an interrupt is generated: could that be talking only about synchronous interrupts, or otherwise generated by the CPU? Rather than external interrupts which are merely handled by the core when they arrive? I think I got the idea that interrupts weren't serializing from having seen discussion of needing barriers in the kernel, but I could have misinterpreted something. Like maybe a statement that interrupts aren't "serializing". They're not on paper guaranteed to be, but in practice they drain at least the ROB because uarches don't rename the privilege level.Tweedsmuir
Updated the relevant section of my answer on Interrupting an assembly instruction while it is operating that was originally to blame for this misinformation. I should really put it somewhere else; it was just a fun fact and now it's a bunch of discussion.Tweedsmuir

If a platform is going to support moving a thread from one core to another, whatever code does that moving must respect whatever guarantees a thread is allowed to rely on. If a thread is allowed to rely on the guarantee that a read after a write will see the updated value, then whatever code migrates a thread from one core to another must ensure that guarantee is preserved.

Everything else is platform specific. If a platform has an L1 cache then hardware must make that cache fully coherent or some form of invalidation or flushing will be necessary. On most typical modern processors, hardware makes the cache only partially coherent because reads can also be prefetched and writes can be posted. On x86 CPUs, special hardware magic solves the prefetch problem (the prefetch is invalidated if the L1 cache line is invalidated). I believe the OS and/or scheduler has to specifically flush posted writes, but I'm not entirely sure and it may vary based on the exact CPU.

The CPU goes to great cost to ensure that a read will always see a previous write in the same instruction stream. For an OS to remove this guarantee and require all user-space code to work without it would be a complete non-starter, since user-space code has no way to know where in its code it might get migrated.

Tybie answered 11/2, 2020 at 1:14 Comment(5)
How can prefetches or posted writes make the cache partially coherent? I'm not sure what you mean by partially coherent.Pietje
@HadiBrais: David seems to be using "prefetch" to describe OoO exec of loads, reading from L1d cache ahead of when program order would. This is not normal usage of the technical term "prefetch"; instead it's called load-load reordering or hit under miss. And "posted writes" are how he's describing the store buffer. None of this makes cache non-coherent with other cores, but it makes execution decoupled from cache and introduces memory reordering on top of a coherent cache. ("non-coherent" has a specific meaning and I don't think this is really correct here.)Tweedsmuir
Good attempt to answer for the general case including non-cache-coherent multiprocessors. Nobody (AFAIK) transparently runs multiple threads of the same process across cores with non-coherent caches, but migration of a process to another coherency domain is certainly possible.Tweedsmuir
re: flushing the store buffer: the kernel presumably wants acquire/release sync between cores anyway to reload the architectural state. Things only get complicated when you have different memory ordering rules for some kinds of stores (like x86's NT stores) that don't respect the normal acq/rel mechanism. Thus mfence, or just sfence before the normal release-store of the fact that the task is not "running" on this core anymore, and can thus is up for grabs by the scheduler on other cores. (Scheduling is a distributed algorithm: you normally don't literally "send" a task to another core.)Tweedsmuir
@HadiBrais By "partially coherent", I mean that while there is cache coherence provided by hardware, the caches do not necessarily appear coherent from the point of view of a thread because of other hardware optimizations such as out of order loads and stores. From the point of view of the instruction stream, we don't care what the hardware issue is, whether it's buffering, caching, or whatever, all we care about is what we observe. And even with cache coherence guaranteed in hardware, we can still see the same effects we would see were it not coherent in hardware.Tybie

Adding my two bits here. At first glance, a barrier seems like overkill (see the answers above).

Consider this logic: when a thread wants to write to a cacheline, HW cache coherence kicks in and we need to invalidate all other copies of the cacheline that are present with other cores in the system; the write doesn't proceed without the invalidations. When a thread is then rescheduled to a different core, it will have to fetch the cacheline from the L1 cache that has write permission, thereby maintaining read-after-write sequential behavior.

The problem with this logic is that invalidations from cores aren't applied immediately, hence it is possible to read a stale value after being rescheduled (the read to the new L1-cache somehow beats the pending invalidation present in a queue with that core). This is ok for different threads because they are allowed to slip and slide, but with the same thread a barrier becomes essential.

Indissoluble answered 10/2, 2020 at 19:53 Comment(11)
Cache itself is always coherent. A core can't commit a new value until receiving acknowledgement of its invalidate or RFO (read-for-ownership) of the line. This is how MESI maintains coherence. en.wikipedia.org/wiki/MESI_protocol. The problem is the store buffer: if a store is still sitting in the store buffer, the core might not have even done an RFO to get exclusive ownership of that line yet, so other cores could still have it cached in other states. That's how migrating a thread without a full barrier could fail to respect a program-order RAW dependency.Tweedsmuir
(Without migration, that pending store would be "seen" via store-forwarding. A core can see its own stores before they become globally visible.)Tweedsmuir
With a split-transaction bus, what happens is that the bus controller will issue an invalidate without actually invalidating the cacheline. So, if P1 issues a write it will receive all the invalidates, but it is still possible that P2 gets a read of the old copy from its cache because invalidate (from the bus controller) hasn't been applied yet. This is ok because threads are allowed to slip and slide (It is as if P2 read its value long before the invalidate was issued)Indissoluble
I didn't get what you're trying to say in the first paragraph of the answer. Anyway, the details of cache coherence are not fundamentally important here because these details can only affect the time it takes to make a store globally observable. I've updated my answer to discuss the necessary conditions under which this type of RAW hazard can occur.Pietje
If coherence transitions occur immediately, we won't need a barrier. For example in a system with a atomic bus, and no store buffers, when P1 wants to write to a cacheline all other cores must invalidate their cacheline. Consequently, when you re-schedule a thread to a different core, the L1-cache in the new core must fetch the cacheline from the old core. In practice, coherence transitions don't register instantaneously and hence a barrier is needed.Indissoluble
You seem to be talking about coherence using a different strategy / model than CPUs actually use. AFAIK, all real CPUs use (some variant of) MESI, regardless of shared bus to memory, ring bus between cores (and L3 slices) and memory controllers, a mesh network, or some other kind of interconnect.Tweedsmuir
When stores execute locally, they just put the data (and physical address) into a store buffer. Once they become non-speculative (i.e. the store instruction retires from OoO exec), the store buffer will do an RFO to get exclusive ownership before it commits the store to L1d cache. If an invalidate isn't replied to immediately, that simply delays the completion of the RFO, making the store sit in the store buffer for longer. A barrier is needed to drain the store buffer of pending stores, not to "wait for coherence". Until commit, a core can see its own stores via store-forwarding.Tweedsmuir
@Indissoluble A memory barrier doesn't affect in any way how cache coherence transactions work, but it may affect the order in which they occur in to control the order in which memory accesses become globally observable and potentially stall the pipeline until certain memory access become globally observable. The implementation details of coherence are not pertinent.Pietje
@PeterCordes Even with MESI, the cores are connected through a bus (or some network) which is non-atomic, and transactions are split (no in order guarantees). As a result, coherence transactions don't register instantaneously.Indissoluble
@HadiBrais You are correct that a barrier imposes strict ordering guarantees. In between the barriers, coherence transactions may or may not be in some order (depends on the architecture, and in general sequential consistency is hard to guarantee). An atomic bus makes sure that all cores see the coherence transactions in the same order (like an implicit barrier). In such a case, we wouldn't need an explicit barrier. In practice, for performance reasons, we don't have atomic buses and hence need a barrier to enforce strict ordering.Indissoluble
As a result, coherence transactions don't register instantaneously Right, so the store buffer on the core doing the store waits for them to be acted on before making a store visible to other cores. It's not the timing of (responding to) invalidates that's the key issue, it's the timing of commit to L1d cache. Since stores aren't (usually) doing full-line writes, they usually need to RFO the old line, so even if you never had to wait for invalidate processing, you'd still have store buffer effects for cache-miss stores that could create this problem.Tweedsmuir
