TL;DR: It depends on the architecture and the OS. On x86, this type of read-after-write hazard is mostly not an issue that has to be considered at the software level, except for weakly-ordered WC stores, which require a store fence to be executed in software on the same logical core before the thread is migrated.
Usually the thread migration operation includes at least one memory store. Consider an architecture with the following property:
- The memory model is such that memory stores may not become globally observable in program order. This Wikipedia article has a not-accurate-but-good-enough table that shows examples of architectures that have this property (see the row "Stores can be reordered after stores").
The ordering hazard you mentioned may be possible on such an architecture because even if the thread migration operation completes, it doesn't necessarily mean that all the stores that the thread has performed are globally observable. On architectures with strict sequential store ordering, this hazard cannot occur.
On a completely hypothetical architecture where it's possible to migrate a thread without performing a single memory store (e.g., by directly transferring the thread's context to another core), the hazard can occur even with sequential store ordering if the architecture has the following property:
- There is a "window of vulnerability" between the time when a store retires and when it becomes globally observable. This can happen, for example, due to the presence of store buffers and/or MSHRs. Most modern processors have this property.
So even with sequential store ordering, the thread running on the new core may not see its own last N stores.
Note that on a machine with in-order retirement, the window of vulnerability is a necessary but not sufficient condition for a memory model in which stores may become observable out of program order.
Usually a thread is rescheduled to run on a different core using one of the following two methods:
- A hardware interrupt, such as a timer interrupt, occurs that ultimately causes the thread to be rescheduled on a different logical core.
- The thread itself performs a system call, such as `sched_setaffinity`, that ultimately causes it to run on a different core.
The question is at which point does the system guarantee that retired stores become globally observable? On Intel and AMD x86 processors, hardware interrupts are fully serializing events, so all user-mode stores (including cacheable and uncacheable) are guaranteed to be globally observable before the interrupt handler is executed, in which the thread may be rescheduled to run on a different logical core.
On Intel and AMD x86 processors, there are multiple ways to perform system calls (i.e., change the privilege level), including `INT`, `SYSCALL`, `SYSENTER`, and far `CALL`. None of them guarantees that all previous stores become globally observable. Therefore, the OS is supposed to do this explicitly when scheduling a thread on a different core by executing a store fence operation. This is done as part of saving the thread context (architectural user-mode registers) to memory and adding the thread to the queue associated with the other core. These operations involve at least one store that is subject to the sequential ordering guarantee. So when the scheduler runs on the target core, the full architectural register and memory state of the thread (at the point of the last retired instruction) will be visible on that core.
On x86, if the thread uses stores of type WC, which do not provide the sequential ordering guarantee, the OS may not guarantee that these stores become globally observable before the migration. The x86 spec explicitly states that in order to make WC stores globally observable, a store fence has to be used (either in the thread on the same core or, much simpler, in the OS). An OS generally should do this, as mentioned in @JohnDMcCalpin's answer. Otherwise, if the OS doesn't provide the program order guarantee to software threads, then the user-mode programmer may need to take this into account. One way would be the following:
- Save a copy of the current CPU mask and pin the thread to the current core (or any single core).
- Execute the weakly-ordered stores.
- Execute a store fence.
- Restore the CPU mask.
This temporarily disables migration to ensure that the store fence is executed on the same core as the weakly-ordered stores. After executing the store fence, the thread can safely migrate without possibly violating program order.
Note that user-mode sleep instructions, such as `UMWAIT`, cannot cause the thread to be rescheduled on a different core because the OS does not take control in this case.
Thread Migration in the Linux Kernel
The code snippet from @JohnDMcCalpin's answer falls on the path to send an inter-processor interrupt, which is achieved using a `WRMSR` instruction to an APIC register. An IPI may be sent for many reasons, for example, to perform a TLB shootdown operation. In that case, it's important to ensure that the updated paging structures are globally observable before invalidating the TLB entries on the other cores. That's why `x2apic_wrmsr_fence` may be needed, which is invoked just before sending an IPI.
That said, I don't think thread migration requires sending an IPI. Essentially, a thread is migrated by removing it from some data structure associated with one core and adding it to the one associated with the target core. A thread may be migrated for numerous reasons, such as when the affinity changes or when the scheduler decides to rebalance the load. As mentioned in the Linux source code, all paths of thread migration in the source code end up executing the following:
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg)
where `arg` holds the task to be migrated and the destination core identifier. `migration_cpu_stop` is the function that does the actual migration. However, the task to be migrated may be currently running or waiting in some runqueue on the source core (i.e., the core on which the task is currently scheduled). The task has to be stopped before migrating it. This is achieved by adding a call to `migration_cpu_stop` to the queue of the stopper task associated with the source core. `stop_one_cpu` then marks the stopper task as ready for execution. The stopper task has the highest priority, so on the next timer interrupt on the source core (which could be the same as the current core), one of the tasks with the highest priority will be selected to run. Eventually, the stopper task will run and execute `migration_cpu_stop`, which in turn performs the migration. Since this process involves a hardware interrupt, all stores of the target task are guaranteed to be globally observable.
There appears to be a bug in `x2apic_wrmsr_fence`
The purpose of `x2apic_wrmsr_fence` is to make all previous stores globally observable before sending the IPI. As discussed in this thread, `SFENCE` is not sufficient here. To see why, consider the following sequence:
store
sfence
wrmsr
The store fence here can order the preceding store operation, but not the MSR write. The WRMSR instruction doesn't have any serializing properties when writing to an APIC register in x2APIC mode. This is mentioned in the Intel SDM volume 3 Section 10.12.3:
To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers.
The problem here is that `MFENCE` is also not guaranteed to order the later `WRMSR` with respect to previous stores. On Intel processors, it's documented to only order memory operations; only on AMD processors is it guaranteed to be fully serializing. So to make it work on Intel processors, there needs to be an `LFENCE` after the `MFENCE` (`SFENCE` is not ordered with `LFENCE`, so `MFENCE` must be used even though we don't need to order loads). Actually, Section 10.12.3 mentions this.