If `i` and `j` are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.
But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.
So for example, if `int i;` is a local variable, after `sscanf("0", "%d", &i);` its address will have escaped the function, and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.
See my answer on Understanding volatile asm vs volatile variable, with an example of `asm volatile("" ::: "memory")` being a barrier for a local variable whose address escaped the function (`sscanf("0", "%d", &i);`), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.
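A minimal sketch of that difference (the function name `escape_demo` is hypothetical, and the inline-asm syntax assumes GCC/Clang):

```c
#include <stdio.h>

int escape_demo(void)
{
    int i = 1;                      /* purely local: its address never escapes */
    int j;
    sscanf("2", "%d", &j);          /* j's address escapes via this call */

    __asm__ __volatile__("" ::: "memory");  /* compiler barrier: j must be in sync
                                               with memory here, but i can stay in
                                               a register the whole time */
    return i + j;
}
```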
> I assume that the above quote is talking about CPU reordering and not about compiler reordering.
It's talking about both, because both are necessary for correctness.
This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)
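As a hedged sketch of the scenario the question describes (the function name `writer` and the mutex `m` are assumptions, not taken from the original code):

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int i, j;                           /* shared globals, as in the question */

void writer(void)
{
    pthread_mutex_lock(&m);
    i = 10;                         /* the compiler may not hoist these stores    */
    j = 20;                         /* above the lock call or sink them below the */
    pthread_mutex_unlock(&m);       /* unlock call: it must assume those opaque   */
}                                   /* calls can read or write i and j            */
```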
`pthread_mutex_lock` being a non-inline function call takes care of compile-time reordering, and the fact that it does a `lock`ed operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the `call` instruction itself, though, just the code in the function body.) This gives it acquire semantics.
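For illustration only, here is the acquire side of a toy spinlock in C11 atomics; this is a simplified stand-in, not the real glibc `pthread_mutex_lock`, which also handles contention through futexes:

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool locked;                 /* hypothetical spinlock flag */

void spin_lock(void)
{
    /* Atomic RMW with acquire ordering: on x86 this compiles to a lock-prefixed
       exchange, which is also a full StoreLoad barrier at runtime. */
    while (atomic_exchange_explicit(&locked, true, memory_order_acquire))
        ;                           /* spin until the previous value was false */
}
```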
Unlocking a spinlock only needs a release-store, not an RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still ok: it keeps everything in the critical section from getting out. It's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics.)
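And the matching release side, continuing the same toy spinlock sketch (same assumed `locked` flag as above):

```c
void spin_unlock(void)
{
    /* Plain release store, not an RMW: it keeps everything inside the critical
       section from reordering past it, but it is not a StoreLoad barrier on
       x86 (and doesn't need to be). */
    atomic_store_explicit(&locked, false, memory_order_release);
}
```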
On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM `dmb` (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.
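As an illustrative aside (not from the original answer), a standalone C11 release fence shows the same ISA difference: compilers typically emit a `dmb` variant for it on ARM/AArch64, while on x86 it costs no instruction at all, only a compile-time barrier:

```c
#include <stdatomic.h>

void release_fence(void)
{
    /* ARM/AArch64: compilers typically emit a dmb barrier instruction here.
       x86: no instruction is emitted, only a compile-time reordering barrier,
       because ordinary x86 stores already have release semantics. */
    atomic_thread_fence(memory_order_release);
}
```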
> Now what prevents the CPU from reordering `mov 10 into i` and `mov 20 into j` to above `call pthread_mutex_lock()`?
This isn't the important reason (because on a weakly-ordered ISA `pthread_mutex_unlock` would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the `call` instruction, let alone with the actual locking/unlocking of the mutex done by the function body before the function returns.
x86 has strong memory-ordering semantics (stores don't reorder with other stores), and `call` is a store (pushing the return address). So `mov [i], 10` must appear in the global store order between the stores done by the `call` instructions for lock and unlock.
Of course in a normal program, nobody is observing the call stack of other threads, just the `xchg` to take the mutex or the release-store to release it in `pthread_mutex_unlock`.
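To make the inter-thread side concrete, here is a hedged reader-side sketch pairing with the hypothetical `writer` above; what thread 2 synchronizes with is the lock's acquire and the unlock's release, not the return-address pushes:

```c
#include <pthread.h>
#include <stdio.h>

extern pthread_mutex_t m;           /* the hypothetical mutex and globals */
extern int i, j;                    /* from the writer sketch above       */

void *reader(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);         /* acquire: pairs with the writer's unlock  */
    printf("i=%d j=%d\n", i, j);    /* prints 10 and 20 if the writer ran first */
    pthread_mutex_unlock(&m);
    return NULL;
}
```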
Comments:

It is the implementations of `pthread_mutex_lock()` and `pthread_mutex_unlock()` that realize their promises about runtime ordering. CPUs that perform such reordering also have instructions for modulating it, and the mutex lock / unlock functions use these (among other things). – Harbourcall

... as if it were any other instruction. It can't. Reordering happens against the "dynamic trace". – Winograd

The expert answers explain how `pthread_mutex_lock`/`pthread_mutex_unlock` implement a CPU barrier, but not why these mutex functions are obliged to implement said CPU barrier, given the meaning of a mutex. Below I will give my answer to the first question ("Now what prevents ..."), in a newbie-for-newbie way. ... – Crossover

... To the CPU, `pthread_mutex_lock()` is nothing more than a sequence of instructions, so generally speaking again, the CPU has no idea that it is so special a function that processor memory reordering is prohibited. So, generally speaking for the third time, nothing can prevent the CPU from reordering `mov 10 into i` to above `call pthread_mutex_lock()`. As a result, it is natural to ask this question. Now let me answer below, starting with the meaning of mutex. ... – Crossover

If `mov 10 into i` is reordered above `call pthread_mutex_lock()` in thread 1, thread 2 might be running the critical section at that point, because thread 1 has not yet acquired the mutex; that is a violation of the meaning of a mutex. You can find a lot of examples of disasters caused by two threads running critical-section code simultaneously. ... – Crossover

That is why `pthread_mutex_lock()` has to serve as a memory barrier. Of course, simply naming a function mutex_lock or something like that does not mean it will function as a mutex acquisition; we have to implement it. In fact, we have to implement not only the traditional mutex acquisition that is written in many textbooks, but also the acquire semantics. As is said in the next paragraph of the Preshing article, "Every implementation of a lock, ..., should provide these guarantees." ... – Crossover

... `call pthread_mutex_lock()` solely out of conscientiousness and respect for the semantics of a mutex. In practice, it is likely that a programmer failed to write the CPU memory barrier in the function but added it later as a bug fix, or happened to use some instructions that automatically implement the barrier without the programmer even knowing it. These implementation details were described in great detail by the expert answers, so I will not repeat them. The same argument applies to `pthread_mutex_unlock()` to observe release semantics. – Crossover