How do mutex lock and unlock functions prevent CPU reordering?
As far as I know, a function call acts as a compiler barrier, but not as a CPU barrier.

This tutorial says the following:

acquiring a lock implies acquire semantics, while releasing a lock implies release semantics! All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries.

I assume that the above quote is talking about CPU reordering and not about compiler reordering.

But I don't understand how a mutex lock and unlock cause the CPU to give these functions acquire and release semantics.

For example, if we have the following C code:

pthread_mutex_lock(&lock);
i = 10;
j = 20;
pthread_mutex_unlock(&lock);

The above C code is translated into the following (pseudo) assembly instructions:

push the address of lock into the stack
call pthread_mutex_lock()
mov 10 into i
mov 20 into j
push the address of lock into the stack
call pthread_mutex_unlock()

Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to before call pthread_mutex_lock() or to after call pthread_mutex_unlock()?

If it is the call instruction that prevents the CPU from doing the reordering, then why does the tutorial I quoted make it seem like it is the mutex lock and unlock functions that prevent the CPU reordering? Why didn't the tutorial say that any function call will prevent CPU reordering?

My question is about the x86 architecture.
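For reference, the four-line snippet above can be fleshed out into a complete translation unit (variable declarations are assumptions added for compilability; the comments state the guarantees the pthread functions are expected to provide):

```c
#include <pthread.h>

/* Globals assumed for illustration; the question's snippet only shows
   the critical section itself. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int i, j;

void critical_section(void) {
    pthread_mutex_lock(&lock);   /* acquire: the stores below must not
                                    become visible before the lock is held */
    i = 10;
    j = 20;
    pthread_mutex_unlock(&lock); /* release: the stores above must be
                                    visible before the lock is released */
}
```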

Molt answered 20/6, 2018 at 14:47 Comment(12)
Why didn't you undelete and edit your previous version of this: #50949288? This is the same question with minor edits, and Martin's comment still applies.Zonda
"I assume that the above quote is talking about CPU reordering and not about compiler reordering." Certainly, locking/unlocking a pthread mutex MUST ensure the compiler does not re-order the instructions it is protecting. So the quote is talking about both compiler and CPU reordering - they're equally important from the point of a pthread mutex.Frumentaceous
It is the implementations of pthread_mutex_lock() and pthread_mutex_unlock() that realize their promises about runtime ordering. CPUs that perform such reordering also have instructions for modulating it, and the mutex lock / unlock functions use these (among other things).Harbour
@JohnBollinger: I already mentioned this in my answer. I highlighted it since it's an important point about why the mutex functions are special.Zonda
And I already upvoted your answer, @PeterCordes, and I declined to write one of my own. But your answer is so full of information that even with the highlighting, it is easy to overlook this point -- which I think is the crux -- within.Harbour
@JohnBollinger: That's fair, it's definitely a key point. I started writing before reading the question super carefully, so I probably answered some stuff that the question doesn't ask explicitly. Seemed like a good idea to cover a bunch of related ground. (And I started with compiler reordering because I'd just been looking at another Q&A about that. :P)Zonda
I wrote my answer because although @PeterCordes answer is great I thought this question also deserved one "little" answer just focusing on the narrow thing that I thought the OP was asking: how does the compiler/runtime/implementation prevent CPU reordering (not compiler reordering). The answer for that is simple: the implementation includes barriers which prevent CPU reordering. The OP commented on my answer which reveals their original misunderstanding: that reordering could occur around call as if it were any other instruction. It can't. Reordering happens against the "dynamic trace".Winograd
I am also a newbie on processor reordering, so although I'm undoubtedly unable to give a better answer than the experts here, I have a better understanding of what the OP was asking, because I had the same puzzlement when reading this paragraph. He was not asking how pthread_mutex_lock/pthread_mutex_unlock implements the CPU barrier, but why these mutex functions are obliged to implement said CPU barrier, from the meaning of mutex. Below I will give my answer to the first question ("Now what prevents ..."), in a newbie-for-newbie way, ...Crossover
... first of all, generally speaking, pthread_mutex_lock() is nothing more than a sequence of instructions, so generally speaking again, the CPU has no idea that it is so special a function that processor memory reordering is prohibited. So, generally speaking the third time, nothing can prevent the CPU from reordering mov 10 into i to above call pthread_mutex_lock(). As a result, it is natural to ask this question. Now let me answer below, starting with the meaning of mutex. ...Crossover
... We all know the meaning of a mutex is to specify the critical section, as written in all OS/Parallel Programming textbooks. A critical section is defined to be executed by at most one thread. So, if code like mov 10 into i is reordered above call pthread_mutex_lock() in thread 1, thread 2 might be running the critical section at that point because thread 1 has not yet acquired the mutex, which is a violation of the meaning of mutex. You can find a lot of examples of disasters caused by running critical-section code in two threads simultaneously. ...Crossover
... So, boiling down, it is the meaning of mutex that prevents memory reordering. This explains why pthread_mutex_lock() has to serve as a memory barrier. Of course, simply naming a function mutex_lock or something like that does not mean it will function as a mutex acquisition; we have to implement it. In fact, we have to implement not only the traditional mutex acquisition which is written in many textbooks, but also the acquire semantics. As is said in the next paragraph of the preshing article, "Every implementation of a lock, ..., should provide these guarantees." ...Crossover
... So, from a root viewpoint, a memory barrier is put in call pthread_mutex_lock() solely out of conscientiousness and respect for the semantics of mutex. In practice, it is likely that a programmer failed to write a CPU memory barrier in the function but added it later as a bug fix, or happened to use some instructions that automatically implement the barrier without the programmer even knowing it. These implementation details were described in great detail by the expert answers, so I will not repeat them. The same argument can be applied to pthread_mutex_unlock() to observe release semantics.Crossover

The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it. The instruction flow will move from the calling code into the lock and unlock functions via a call instruction, and it is this dynamic instruction trace you have to consider for the purposes of reordering - not the static sequence you see in an assembly listing.

On x86 specifically, you probably won't find explicit, standalone memory barriers inside those methods, since you'll already have lock-prefixed instructions in order to perform the actual locking and unlocking atomically, and these instructions imply a full memory barrier, which prevents the CPU reordering you are concerned about.

For example, on my Ubuntu 16.04 system with glibc 2.23, pthread_mutex_lock is implemented using a lock cmpxchg (compare-and-exchange) and pthread_mutex_unlock is implemented using lock dec (decrement), both of which have full barrier semantics.
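To make the structure visible without digging into glibc internals, here is a toy spinlock written with C11 atomics — a simplified sketch, not glibc's actual implementation. On x86, the acquire RMW in spin_lock compiles to a lock-prefixed instruction (a full barrier), while the release store in spin_unlock compiles to a plain mov:

```c
#include <stdatomic.h>

/* Minimal spinlock sketch. The ordering guarantees live in the
   acquire-RMW (lock) and the release-store (unlock), exactly the
   places where a real mutex implementation puts its barriers. */
typedef struct { atomic_int held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* Atomic exchange with acquire semantics: loads/stores after this
       point cannot be reordered above it. */
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
        ; /* spin until the previous holder releases */
}

static void spin_unlock(spinlock_t *l) {
    /* Release store: loads/stores before this point cannot be
       reordered below it. */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```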

Winograd answered 20/6, 2018 at 16:5 Comment(14)
"The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it." I don't understand why the body of these functions matters, I mean i = 10; could be executed before pthread_mutex_lock(&lock); is executed, and so what is inside the critical section has been moved to the outside of the critical section. Or is such reordering not allowed on all CPU architectures?Molt
@Molt - because it's the dynamic flow of instructions that (mostly) matters for re-ordering. The CPU doesn't execute the call as a single instruction and then continue on to the next instruction in source/assembly order, it goes into the body of the pthread function. So for the purpose of analysis, you can imagine the entire body of the lock and unlock calls sort of "inlined" in the assembly at the place the call instruction appears. If that weren't the case, mutexes and many other sync mechanisms would be impossible to implement as plain function calls!Winograd
@Molt - I updated my answer to make it clearer. The bottom line is that, at the assembly level, you can't talk about what types of reordering are possible across a call, jmp or any other control-flow change, because you need to know what happens at the location jumped to.Winograd
x86 call does a release-store (pushing the return address), so if you want to be pedantic the call itself does actually have an effect. But no moreso than any normal stores done inside the function. But yes, very good update, I think that might be getting at the heart of what the OP is missing.Zonda
@PeterCordes I never say that call doesn't have release effect do I? I'm saying you need to know what is inside the call to know about the entire effect of call on the instruction trace. Of course, the call itself may have some ordering semantics, such as those associated with the store to the stack, but that is practically unusable and is IMO a total red herring (and another answer covers it well :-). Note that release-acquire talks about a location: the lock part of a mutex doesn't synchronize with a push the return address on another thread, so that release isn't very useful.Winograd
I just meant that in reply to you can't talk about what types of reordering are possible across a call..." in your 2nd comment. Not suggesting you overcomplicate this answer with that, because I already did that in mine :P Agreed that it's a total red herring, and the memory-ordering part of call is totally unimportant because other threads aren't looking at your call stack.Zonda
@PeterCordes - well yes, I wondered if someone would come along and say something like that, and I half-considered making it more precise, like noting that the call itself may prevent some reorderings, but the body may then prevent more reordering. That said, I actually don't think it's meaningful to say that call has "release semantics" (with the implication perhaps being that you don't need a store inside the unlock implementation): release is meaningful when talking about a particular location, and in the case of call the location is an anonymous location on the stack.Winograd
Ah right, that's a good analysis of exactly why it's not meaningful or useful. It doesn't matter where call's store appears in the global order, just stores outside the critical section vs. the mutex lock vs. inside. I knew I was being pedantic, but hadn't thought through exactly how useless that was. If we're talking about the non-store part of instructions, then it's another out-of-order exec vs. memory ordering question.Zonda
... because in what sense can the acquire side (the lock call) synchronize with the store implied by call? I don't think it can. I can't see a way you could rely on the so-called "release" semantics of the store. [<-- this part of my comment raced with your last reply]. So we should not think of release stores as some type of memory barrier really: putting a "dummy" store with "release" semantics doesn't do anything meaningful. It's only useful to something that consumes its value, to establish a type of happens-before relationship or however you want to think of it.Winograd
@Winograd "The CPU doesn't execute the call as a single instruction and then continue on to the next instruction in source/assembly order, it goes into the body of the pthread function" I didn't mean in my question in the comments that I thought that the CPU will execute pthread_mutex_lock(&lock); and then immediately afterwards the CPU will execute i = 10; without going to the function body first. What I meant was: what if the CPU completely ignored pthread_mutex_lock(&lock); and instead executed i = 10; first, and then after that, the CPU executed pthread_mutex_lock(&lock);.Molt
@Molt - right, I understood. What I am saying though is that the CPU doesn't reorder at the assembly-listing level. You are imagining assembly like call mutex_lock; store 10 to I; and asking "what happens if the CPU reorders the store before the call?" The answer is that this wasn't the right way to look at it in the first place: the CPU doesn't reorder against the static assembly you find on disk in the binary, it operates against the dynamic stream of instructions, so you need to mentally trace out the execution and record that stream.Winograd
In this case the result is simple: you just end up with the instructions that are part of the body of the lock and unlock functions replacing the call instruction. Then you can talk about reordering against that list of instructions - that's what I call the dynamic trace.Winograd
Oh! in that case, is the purpose of the memory barrier inside the pthread_mutex_lock(&lock); function is to prevent the reordering of i = 10; to above the code that sets/locks the mutex variable inside the pthread_mutex_lock(&lock); function?Molt
@user - correct. As a consequence it also prevents it from moving above the lock function entirely.Winograd

If i and j are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.

But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.

So for example, if int i; is a local variable, after sscanf("0", "%d", &i); its address will have escaped the function and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.

See my answer on Understanding volatile asm vs volatile variable, with an example of asm volatile("":::"memory") being a barrier for a local variable whose address escaped the function (sscanf("0", "%d", &i);), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.


I assume that the above quote is talking about CPU reordering and not about compiler reordering.

It's talking about both, because both are necessary for correctness.

This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)

pthread_mutex_lock being a non-inline function call takes care of compile-time reordering, and the fact that it does a locked operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the call instruction itself, though, just the code in the function body.) This gives it acquire semantics.

Unlocking a spinlock only needs a release-store, not a RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still ok: it keeps everything in the critical section from getting out. It's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics)

On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM dmb (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.
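In portable C11 the same effect is expressed with atomic_thread_fence, which compiles to dmb on ARM but costs nothing beyond a compiler barrier for acquire/release on x86. The following is a sketch of the classic message-passing pattern, not of pthread internals:

```c
#include <stdatomic.h>

static int payload;
static atomic_int ready;

void producer(void) {
    payload = 42;                               /* plain store */
    atomic_thread_fence(memory_order_release);  /* dmb on ARM; compiler
                                                   barrier only on x86 */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ; /* wait for the flag */
    atomic_thread_fence(memory_order_acquire);  /* pairs with the release
                                                   fence in producer */
    return payload;                             /* guaranteed to see 42 */
}
```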


Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock()

This isn't the important reason (because on a weakly-ordered ISA pthread_mutex_unlock would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the call instruction, let alone with the actual locking/unlocking of the mutex done by the function body before the function returns.

x86 has strong memory-ordering semantics (stores don't reorder with other stores), and call is a store (pushing the return address).

So mov [i], 10 must appear in the global store order between the stores done by the two call instructions.

Of course in a normal program, nobody is observing the call stack of other threads, just the xchg to take the mutex or the release-store to release it in pthread_mutex_unlock.

Zonda answered 20/6, 2018 at 14:56 Comment(10)
Being a non-inline function call is not necessarily enough to prevent compile-time reordering across the call. For example, if a translation unit has file-scope static variables, and the compiler can prove that their addresses are not accessible to the call, then the compiler is free to reorder accesses to those variables, across the call, even though those variables could be accessed by more than one thread.Kellyekellyn
@ArchD.Robison: A compiler could only prove that if the only functions that touch the static var are themselves static, and not called from any non-static functions and don't have the function address passed to any black-box functions. Otherwise it has to assume that the unknown function can call back into this translation unit to eventually reach a function that reads or writes the static var. Thread-creation and signal-handler functions are a black-box, so there's no way to start another thread running a static function without having static var access "escape" the translation unit.Zonda
@PeterCordes If you have a set of variables V only manipulated by functions F1 that are only called by the set of functions F2 and none of these can be named outside the TU, and no other function inside the TU takes the address to put it in an object accessible elsewhere... at the end of the day, V, F1, F2... serve no purpose at all.Glindaglinka
@ArchD.Robison Presumably, in a realistic program static variables can be indirectly accessed from main.Glindaglinka
@curiousguy: Good point. Unless this is the TU containing main, then maybe the compiler could prove that those variables can only be used by the main thread (if it optimized based on it being UB to call main again from inside the program). I don't think there are any cases where a non-inline function call can fail to order any variables that are actually shared.Zonda
I concur with @PeterCordes' analysis. I had forgotten to consider the "call back in" possibility.Kellyekellyn
"if it can prove that nothing outside the current function has their address." Is there a name for that? Function private variables?Glindaglinka
@curiousguy: en.wikipedia.org/wiki/Escape_analysis is the name for the checking. I'm not aware of a widely-established name for variables that meet the criteria. I'd maybe say "variable that hasn't escaped".Zonda
How does this change with link-time optimization?Antonia
@user129393192: Without -fwhole-program, the compiler still has to assume that other threads can potentially be running code it can't see and reading + writing global variables. And even with -fwhole-program, the compiler probably isn't going to try to prove anything about whether two threads might both be calling the same function that accesses the same global or static variable, so memory barriers still work. If libpthread itself was compiled with LTO, then it had better be written with atomic_thread_fence or memory_order_acquire ordering on the RMWs and release on unlocking.Zonda
