Can the C++ compiler coalesce adjacent mutex locks?

Consider the following code, with adjacent mutex sections containing shared memory accesses:

std::mutex mutex;
int x;

void func() {
    {
        std::lock_guard lock{mutex};
        x++;
    }
    {
        std::lock_guard lock{mutex};
        x++;
    }
}

Without the mutexes, the compiler could obviously coalesce the increments, like so:

void func() {
    x += 2;
}

This got me thinking about what might prevent that optimisation. We know that even if x is std::atomic<int>, the coalescing is still legal (though in practice compilers tend not to do it).
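For comparison, here is a minimal sketch of the atomic case (the names are mine, not from the question): under the as-if rule a compiler is generally considered free to merge the two relaxed increments into a single fetch_add(2), since no conforming observer is entitled to see the intermediate value.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> counter{0};  // hypothetical name, for illustration

// Two adjacent relaxed increments. Under the as-if rule a compiler may
// merge them into a single counter.fetch_add(2): no conforming observer
// can demand to see the intermediate value. Mainstream compilers
// currently do not perform this merge, but it would be legal.
void bump_twice() {
    counter.fetch_add(1, std::memory_order_relaxed);
    counter.fetch_add(1, std::memory_order_relaxed);
}
```

Whether the increments are merged or not, the final value is the same; only the set of values a concurrent reader could observe shrinks.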

But what about with mutexes? Is it legal for the compiler to transform func into this?

void func() {
    std::lock_guard lock{mutex};
    x += 2;
}

It's clear that the writes to x must be visible side effects to other threads due to acquire-release semantics, but I'm not sure whether that forbids the optimisation. Perhaps one can argue under the as-if rule that the optimisation is legal because no observer could tell whether it was performed: the difference between "unlock then immediately re-lock" and "keep holding the lock" comes down purely to thread-scheduling timing.

On the other hand, this optimisation looks like a slippery slope: taken to its extreme, every program containing a mutex could be optimised into "lock mutex, run entire program, unlock mutex". Such behaviour would be obviously unhelpful for concurrent applications, so it might reasonably be forbidden.

What aspect of the C++ memory model, and wording in the Standard, allows or forbids this optimisation?

(Currently, Clang and GCC do not perform this optimisation, as can be seen here. However, that is not proof that the optimisation is, or should be, illegal.)

Captor answered 22/6, 2024 at 13:00. Comments (6):
I suspect there might be some wording about mutex (un)locking being a visible external effect akin to I/O; e.g. [intro.progress]/1 lists both. - Aguie
"... The lock() operation on a Mutex is also an Acquire operation", and memory_order_acquire: "... no reads or writes in the current thread can be reordered before this load ...". - Meshach
@Aguie If you are talking about synchronizes-with, happens-before, and how they affect visible side effects, then I'm not convinced. My reading is that IF thread B reads value V, then everything before thread A's write of V must be committed and visible. It seems plausible that there is no restriction on what value thread B MUST read. (I think this is a similar situation to coalescing std::atomic accesses, which I linked.) - Gratify
@RichardCritten I think this is true of std::atomic too, which we think can be coalesced: https://mcmap.net/q/14865/-why-don-39-t-compilers-merge-redundant-std-atomic-writes/7064452 - Gratify
Interesting question! MSVC doesn't optimize it into one lock either, btw. - Desimone
Allowed and useful are different things. Threads in C++ are allowed to run sequentially; that is not useful, and no implementation will ever do it. Same deal here. - Trangtranquada

Yes, the transformation is allowed, for the reasons you yourself gave.

For any given input, no matter how the rest of the code looks, the set of observable behaviors permitted with the transformed function will always be a subset of the observable behaviors permitted with the original function.

There is no requirement that unlocking the mutex immediately wakes another thread blocked on acquiring it, nor that such a thread acquires the lock atomically with the unlock. Therefore, it is perfectly fine to schedule the unlocking thread so that it immediately reacquires the lock, regardless of the actions of any other thread.

Whether the lock was released is not observable behavior. Observable behavior consists only of accesses through volatile glvalues, the final state of files written, and the input/output dynamics of interactive devices, in an implementation-defined manner. See [intro.abstract]/6.

So the release/reacquire can be skipped without affecting observable behavior under this scheduling scheme, which is a permissible scheme and therefore defines one of the permissible choices of observable behavior. The compiler only has to ensure that some one of the permissible observable behaviors is realized. See [intro.abstract]/5.
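To make the subset argument concrete, consider a hypothetical observer thread (the names below are mine, not from the answer). Against the original func, a reader that locks the mutex may observe x as 0, 1, or 2; against the coalesced version it can only observe 0 or 2. That is a strict subset of the original outcomes, so no reader can prove whether the merge happened.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex mtx;   // hypothetical names mirroring the question's setup
int shared_x = 0;

// Writer: the original two adjacent critical sections.
void writer() {
    { std::lock_guard<std::mutex> lock{mtx}; shared_x++; }
    { std::lock_guard<std::mutex> lock{mtx}; shared_x++; }
}

// Reader: depending on scheduling it may observe 0, 1, or 2.
// Under a coalesced writer it could only observe 0 or 2; a strict
// subset, which the as-if rule permits.
int reader() {
    std::lock_guard<std::mutex> lock{mtx};
    return shared_x;
}
```

Any run of the coalesced program corresponds to a run of the original in which the scheduler simply never preempted the writer between its two critical sections.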

On the other hand, it seems plausible this optimisation could slippery-slope to every program containing a mutex being optimised into: lock mutex, do entire program, unlock mutex. Such behaviour would be obviously unhelpful for concurrent applications, so it might be reasonably forbidden.

There is [intro.progress]/8, which recommends that the thread executing main and threads created with std::thread/std::jthread provide the concurrent forward progress guarantee, meaning that each such thread, as long as it has not terminated, should eventually make progress ([intro.progress]/6).

For the purpose of making progress, per [intro.progress]/4, a blocking library call is considered to continuously check the condition that causes it to block, and each such check counts as making progress. The other execution steps that constitute progress are accesses through volatile glvalues and the completions of library I/O calls, synchronization operations, and atomic operations.
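As a concrete illustration of these progress-making steps (a sketch with hypothetical names, not from the answer), a busy-wait on an atomic makes progress on every iteration, because each atomic load is itself one of the listed execution steps; the loop therefore may not be collapsed the way a loop reading a plain bool could be.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> done{false};  // hypothetical flag

// Busy-wait until another thread sets `done`. Every load of `done` is
// an atomic operation, i.e. a progress-making execution step under
// [intro.progress]/4, so each iteration of this loop makes progress.
void wait_for_done() {
    while (!done.load(std::memory_order_acquire)) {
        // spin; each check is itself an execution step
    }
}
```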

However, I don't think that this even forbids the transformation of

while(true)
{
    std::lock_guard lock{mutex};
    //A
    x++; // (assume x is unsigned here)
}

into

std::lock_guard lock{mutex};
while(true)
{
    x++;
}

Even if another thread attempts to lock the mutex, the following scheduling behavior would not violate this forward progress guarantee: repeatedly execute the loop until //A, then switch to the other thread so that it "makes progress" by checking its blocking condition, then switch back and repeat.

Forward progress and scheduling of threads is largely left as a quality-of-implementation decision to the implementation. See also N3209 discussing this from when multi-threaded execution was added to C++11.

I do not expect that any compiler will make any attempt at such transformations at the code level.

However, even common OS schedulers provide no strict guarantee that the scheduling scheme I described above won't happen. Generally it just becomes probabilistically unlikely over longer periods, because threads are preempted at more or less stochastically noisy intervals. If the scheduling happens to play out as described above, then even without the compiler's transformation the program behaves as if the transformation had been made.

I could imagine other environments where the described scheduling behavior occurs indefinitely. Suppose, for example, that mutex locking is implemented as a simple spin lock on a uniprocessor system, and that the scheduler is perfectly fair and deterministic in the number of instructions it lets each thread execute per time slice. In such an environment you may be unlucky, and the thread holding the lock may always be able to reacquire it within the same time slice in which it released it.

Or multithreading may be completely cooperative, in which case there may be a problem if the mutex's unlock operation does not yield to other threads.
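One way a mutex implementation could cope with such a cooperative environment is to yield after unlocking, so that a waiter gets a chance to run before the releasing thread can re-lock. A hypothetical sketch (not a production mutex, and not from the answer):

```cpp
#include <atomic>
#include <thread>

// Hypothetical spin lock whose unlock() yields to the scheduler, giving
// waiting threads a chance to acquire the lock before the releasing
// thread can immediately re-lock it. A sketch only; a real mutex would
// also need contention backoff, fairness handling, etc.
class yielding_spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();  // don't burn the whole time slice
    }
    void unlock() {
        flag.clear(std::memory_order_release);
        std::this_thread::yield();  // let a waiter in before re-locking
    }
};
```

The yield in unlock() is exactly what breaks the "release then immediately reacquire" schedule on a cooperative system.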

I think the standard can't make any stronger guarantees than it does, in part because it would make supporting such environments impossible.

Torey answered 22/6, 2024 at 14:45. Comments (11):
I am not sure this is correct. AFAIK scopes should never be optimized away (something to do with the memory model and memory ordering; I don't have time to look it up right now). - Haeckel
@PepijnKramer I do not see any reason why that would be the case. Scopes don't matter as long as the observable behavior isn't affected, and it is impossible to observe whether a scope has been "optimized away", whatever that is supposed to mean exactly. - Torey
@PepijnKramer In fact, it would be bad if a compiler wouldn't optimize across multiple scopes. For example, given int i = 0; { i = 1; } { i = 2; } I expect an optimizing compiler to emit only the store of 2 to i. - Torey
Locks are permitted to be (and in practice are) unfair. Releasing a lock and immediately reacquiring it will succeed if nobody can sneak in. (And on a uniprocessor system, nobody will sneak in if the thread still has quantum.) So the system behaves in practice as if the optimization had occurred. - Prosperus
I do not believe that, in practice, any compiler can optimize away a mutex's lock/unlock. This is less due to the standard's specification of what mutexes do than to what, in practice, compilers have to emit to lock/unlock a mutex: the code is way too complex. - Syllabic
@Syllabic Even if the complexity allowed it, I do not expect that a compiler writer would attempt such a transformation, for the same reason that they decide not to optimize consecutive loads/stores on the same atomic. It is against the programmer's expectations and can mess with forward progress. A typical example is a loading bar that updates its state in a loop reading from shared state (whether via atomic or mutex). It would be a quality-of-implementation issue if the loop did not regularly update the loading bar as expected. - Torey
@Torey Thank you for your well-written answer. I think I agree with your reasoning overall; the forward progress argument is solid. One thing: "Whether the lock was released is not observable behavior" - do you have any technical info from the Standard to back this up? I think many people believe it could well be observable behaviour. - Gratify
@Syllabic On typical Linux systems, std::mutex lock/unlock translates directly to pthread lock/unlock calls (you can see this in the Godbolt I linked). It seems plausible to me that compilers can hardcode special knowledge about such calls, like they have special knowledge about memory allocation, intrinsics, etc. Although I do agree that aggressively optimising these scenarios is against the spirit of how programmers want to do concurrency. - Gratify
@MCΔT I have added a reference to timsong-cpp.github.io/cppwp/n4950/intro#abstract-6 for the definition of observable behavior. - Torey
@MCΔT Which are function calls that do who knows what. I have some doubts that compilers perform such optimisations. Usually they simplify instructions, while function calls act as barriers unless inlined. - Syllabic
@Syllabic They do make behavioral assumptions about some C standard library functions for optimization purposes, but I haven't seen it for C++ or POSIX standard library functions (outside of once having magic behavior anyway). Typical examples: compilers replacing printf("%s", /*...*/) with puts(/*...*/), replacing memset/memcpy with direct machine instructions or the other way around, eliding malloc/free pairs, etc. - Torey
