Does std::mutex create a fence?

If I lock a std::mutex, will I always get a memory fence? I am unsure whether locking merely implies a fence or actually guarantees one.

Update:

Found this reference while following up on RMF's comments.

Multithreaded programming and memory visibility

Lucia answered 23/6, 2012 at 20:56 Comment(2)
On a system with a single unit of execution, you don't need a CPU fence. – Liliuokalani
Related: How do a mutex's lock and unlock functions prevent CPU reordering? – Formica

Unlocking a mutex synchronizes with locking the mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.
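
For illustration, a minimal sketch (the names m and data are mine) of what that guarantee means in practice: a plain write made under the lock in one thread is visible to the next thread that takes the lock:

#include <cassert>
#include <mutex>
#include <thread>

std::mutex m;
int data = 0;   // plain, non-atomic data protected by m

int main() {
    std::thread writer([] {
        std::lock_guard<std::mutex> lk(m);
        data = 42;   // happens-before the unlock (a release operation)
    });
    std::thread reader([] {
        std::lock_guard<std::mutex> lk(m);
        // This lock (an acquire operation) synchronizes with the writer's
        // unlock if the writer ran first, so we see either the initial 0
        // or the complete 42, never a torn value.
        assert(data == 0 || data == 42);
    });
    writer.join();
    reader.join();
}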

Salzhauer answered 23/6, 2012 at 21:2 Comment(9)
I think the OP is asking about fencing for memory locations other than the mutex itself. – Pauperism
I don't understand. Fences in C++ don't affect specific memory locations. – Salzhauer
I'm saying this: suppose core A writes to A[0] and then releases the mutex. Then core B acquires the mutex and reads A[0] (before cache coherence can propagate the new value of A[0] to core B). In other words, does a mutex force all memory locations to be up to date before returning? – Pauperism
Just like with a fence, it only forces some memory locations to be up to date (those with changes the memory model requires to be visible as per the synchronizes-with relationship). – Salzhauer
If it helps, fences are defined as simply establishing synchronizes-with relationships, nothing else. All the guarantees you get from a fence are just consequences of the memory model rules. Since mutexes also establish synchronizes-with relationships, their effects are the same (well, mutexes have other effects, but they are not relevant here). – Salzhauer
@R.MartinhoFernandes It does help, ty. I'm using a std container and having problems with size, which is probably part of the problem. I'll try to get a simple example up later in the evening. – Lucia
@Pauperism How can you outrun cache coherence? On which CPU? – Liliuokalani
@Liliuokalani You can on x86 using non-temporal stores, if you're relying on acquire/release semantics for synchronization. But a std::mutex will force a full memory barrier that includes non-temporal stores. So you can't "outrun" std::mutex, but you can outrun std::atomic::store(..., std::memory_order_release). – Pauperism
@Pauperism How is that guaranteed? – Liliuokalani

As I understand it, this is covered in:

1.10 Multi-threaded executions and data races

Para 5:

The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either a consume operation, an acquire operation, a release operation, or both an acquire and release operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics. [Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. —end note]
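
To illustrate the distinction this paragraph draws between synchronization operations on a location and standalone fences, here is a minimal sketch (the names ready and payload are mine): a release fence paired with an acquire fence establishes a synchronizes-with relationship through otherwise relaxed operations:

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;   // plain, non-atomic data

void producer() {
    payload = 42;                                         // ordinary store
    std::atomic_thread_fence(std::memory_order_release);  // release fence
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }    // spin until flagged
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
    // The fences pair through the relaxed store/load of `ready`,
    // so this read is guaranteed to see 42.
    int v = payload;
    (void)v;
}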

Quartas answered 23/6, 2012 at 21:10 Comment(0)

A mutex operation (lock or unlock) on a particular mutex M is useful for synchronization or memory visibility only if M is shared by different threads that perform these operations. A mutex defined locally and used by only one thread provides no meaningful synchronization.

[Note: the optimizations I describe here are probably not done by many compilers, which may view these mutex and atomic synchronization operations as "black boxes" that cannot be optimized (or even that should not be optimized, in order to preserve the predictability of code generation and of some particular patterns, which is a bogus argument). I would not be surprised if no compiler performed the optimization even in the simplest case, but there is no doubt that it is legal.]

A compiler can easily determine that some variables are never used by multiple threads (or by any asynchronous execution), notably an automatic variable whose address is never taken (and to which no reference is formed). Such objects are called here "thread-private". (All automatic variables that are candidates for register allocation are thread-private.)

For a thread-private mutex, no meaningful code needs to be generated for lock/unlock operations: no atomic compare-and-exchange, no fencing, and often no state needs to be kept at all, except in the case of a "safe mutex" whose behavior on recursive locking is well defined and must fail (to make the sequence m.lock(); bool locked = m.try_lock(); work, you need to keep at least one boolean of state).
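
As a hypothetical illustration (not something any particular compiler is known to do; the function name is mine), the mutex operations here could legally be compiled away entirely, because the mutex is thread-private:

#include <mutex>

int counter() {
    std::mutex m;      // automatic, address never escapes: thread-private
    int n = 0;
    for (int i = 0; i < 10; ++i) {
        m.lock();      // may legally generate no code at all
        ++n;
        m.unlock();    // likewise: no compare-and-exchange, no fence
    }
    return n;          // the whole function may reduce to "return 10;"
}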

The same is true for any thread-private atomic object: only the naked non-atomic type is needed, and normal operations can be performed (so a fetch-add of 1 becomes a regular post-increment).
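
Similarly, a sketch (the function name is mine) of a thread-private atomic whose operations may legally degrade to plain non-atomic code:

#include <atomic>

int private_atomic() {
    std::atomic<int> a{0};     // never shared: thread-private
    int old = a.fetch_add(1);  // may compile to a plain increment
    return old + a.load();     // no fence or atomicity needed in the generated code
}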

These transformations are legal for two reasons:

  • the obvious observation that if an object is accessed by only one thread of execution (it is not even accessed by an asynchronous signal handler), the use of non-atomic operations in the generated assembly is not detectable by any means;
  • the less obvious remark that no ordering or memory visibility is implied by any use of a thread-private synchronization object.

All synchronization objects are specified as tools for inter-thread communication: they can guarantee that side effects in one thread are visible in another thread; they cause a well defined order of operations to exist not just within one thread (the sequential order of execution of that thread's operations) but across multiple threads.

A common example is the publication of information through an atomic pointer type:

The shared data is:

std::atomic<T*> shared{nullptr};  // explicitly null (default construction leaves the value uninitialized before C++20)

The publishing thread does:

T *p = new T;
*p = load_info();                            // finish initializing the object
shared.store(p, std::memory_order_release);  // publish it: a release store

The consuming thread can check whether the data is available by loading the atomic object's value:

T *p = shared.load(std::memory_order_acquire);  // acquire load
if (p) use(*p);                                 // safe: the initialization is visible

(There is no defined way to wait for availability here; it's a simple example to illustrate publication and consumption of the published value.)

The publishing thread only needs to set the atomic variable after it has finished initializing all fields; the memory order is release in order to communicate the fact that the memory manipulations are finished.

The other threads only need an acquire memory order to "connect" with the release operation, if there was one. If the value is still null, the thread has learned nothing about the world and the acquire is meaningless; it can't act on it. (By the time the thread checks the pointer and sees null, the shared variable might already have changed. It doesn't matter, as the designer considered that not having a value in that thread is manageable; otherwise the operations would have been performed in sequence.)
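
Tying the fragments above together, a runnable version of the publication pattern (load_info is a hypothetical stand-in, and the consumer's busy-wait is mine, added purely for demonstration; as noted, the pattern itself defines no way to wait):

#include <atomic>
#include <cstdio>
#include <thread>

struct T { int value; };

std::atomic<T*> shared{nullptr};

T load_info() { return T{42}; }   // hypothetical stand-in for real work

int main() {
    std::thread producer([] {
        T *p = new T;
        *p = load_info();
        shared.store(p, std::memory_order_release);   // publish
    });
    std::thread consumer([] {
        T *p;
        while (!(p = shared.load(std::memory_order_acquire))) { }  // demo-only busy-wait
        std::printf("%d\n", p->value);   // guaranteed to print 42
        delete p;
    });
    producer.join();
    consumer.join();
}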

All atomic operations are intended to be potentially lock-free, that is, to finish in a short, finite time whatever other threads are doing, even if those threads are stuck. This means you can't use them to depend on another thread having finished a job.

At the other end of the spectrum of thread communication primitives, mutexes don't hold a value that can be used to carry information between threads (*), but they ensure that one thread can enter a lock-ops-unlock sequence only after another thread has finished its own lock-ops-unlock sequence.

(*) Not even a boolean value, as using a mutex as a general boolean signal (i.e., a binary semaphore) between threads is specifically prohibited.

A mutex is always used in connection with a set of shared variables: the protected variables or objects V. These V are used to carry information between threads, and the mutex makes access to that information well ordered (or serialized) between threads. In technical terms, every mutex lock operation on M except the first pairs with the previous unlock operation on M:

  • a lock of M is an acquire operation on M
  • an unlock of M is a release operation on M

The semantics of locking/unlocking are defined on a single M, so let's stop repeating "on M". Say we have threads A and B: the lock by B is an acquire that pairs with the unlock by A, and together the two operations form an inter-thread synchronization.

[What about a thread that locks M often and re-locks it without any other thread acting on M in the meantime? Nothing interesting happens: the acquire still pairs with a release, but A = B, so nothing is accomplished. The unlock was sequenced before the lock in the same thread of execution, so the pairing is pointless in that particular case; but in general a thread can't tell that it's pointless, and the case isn't even special-cased by the language semantics.]

The synchronization that occurs is between the set of threads T acting on a mutex: no thread outside T is guaranteed to be able to see any memory operation performed by the threads in T. Note that in practice, on most real computers, once a memory modification hits the cache, all CPUs will see it if they check that same address, by the power of cache coherence. But C/C++ threads(#) are specified neither in terms of a globally consistent cache nor in terms of ordering visible to a CPU, because the compiler itself can assume that non-atomic objects are not mutated in arbitrary ways by the program without synchronization (the CPU can assume no such thing, as it has no concept of atomic vs. non-atomic memory locations). That means a guarantee available on the CPU/memory system you are targeting is not in general available in the high-level C/C++ model. You absolutely cannot use normal C/C++ code as a high-level assembly; only by dousing your code with volatile (almost everywhere) can you even vaguely approach high-level assembly (and not quite even then).

(#) "C/C++ thread/semantics" not "C/C++ programming language thread semantics": C and C++ are based on the same specification for synchronization primitives, that doesn't mean that there is a C/C++ language)

Since the effect of mutex operations on M is only to serialize access to some data by the threads that use M, it's clear that other threads don't see any effect. In technical terms, the synchronizes-with relation is between threads using the same synchronization objects (mutexes in this context, the same atomic objects in the context of atomics).

Even when the compiler emits memory fences in assembly language, it doesn't have to assume that an unlock operation makes the changes performed before the unlock visible to threads outside the set T.

That allows sets of threads to be decomposed for program analysis: if a program runs two sets of threads U and V in parallel, and U and V are created such that they can't access any common synchronization object (though they can access common non-atomic objects), then you can analyze the interactions within U and within V separately from the point of view of thread semantics, as U and V cannot exchange information in well defined inter-thread ways (they can still exchange information via the system, for example via disk files, sockets, or system-specific shared memory).

(That observation might allow the compiler to optimize some threads without doing a full program analysis, even if some common mutable objects are "pulled in" via a third-party class that has static members.)

Another way to explain this is to say that the semantics of these primitives do not leak: only the threads that participate get a defined result.

Note that at the specification level this is only true for acquire and release operations, not for sequentially consistent operations (the default ordering for operations on an atomic object when you don't specify a memory order): all sequentially consistent actions (operations on atomic objects, or fences) occur in a single well defined global order. That, however, doesn't mean anything for independent threads that have no atomic objects in common.
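
The classic illustration of that global order is the IRIW (independent reads of independent writes) pattern; a sketch (variable names are mine):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void t1() { x.store(1, std::memory_order_seq_cst); }
void t2() { y.store(1, std::memory_order_seq_cst); }

// With seq_cst, all threads agree on a single order of the two stores, so the
// outcome r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 is impossible. With only
// acquire/release, the two readers may disagree about which store came first.
void t3() { r1 = x.load(std::memory_order_seq_cst); r2 = y.load(std::memory_order_seq_cst); }
void t4() { r3 = y.load(std::memory_order_seq_cst); r4 = x.load(std::memory_order_seq_cst); }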

An order of operations is unlike an order of elements in a container, where you can really navigate the container, or like files being presented as ordered by name. Only objects are observable; the orders of operations are not. Saying that there is a well defined order only means that values never provably appear to have changed backward (relative to some abstract order).

If you have two unrelated ordered sets, say the integers with the usual order and the words with lexicographical order, you can define the sum of these sets as having any order compatible with both orders. You might put the numbers before, after, or interleaved with the words. You are free to do what you want, because the elements in the sum of two sets have no relation to each other when they don't come from the same set.

You could say that there is a global order of all mutex operations; it's just not useful, like defining an order on the sum of two unrelated sets.

Liliuokalani answered 25/5, 2019 at 13:22 Comment(8)
Acquiring a mutex is a memory_order_acquire operation, according to the Atomics chapter (quoted in @Alok Save's answer), and releasing is a release operation (on the mutex object itself). This answer doesn't support its assertion that mutexes can be removed. Although that's plausible because they aren't defined as acq_rel, so the acquire can move arbitrarily early and the release arbitrarily late, allowing anything to reorder with anything. Your answer should make this argument, or something equivalent. Then you have an interesting point. – Formica
@PeterCordes I agree that they are acquire and release ops; they are not like fences (which act on everything), they are like atomic loads and stores (which act on some thing: the specified atomic object). The argument is the same as for why atomic ops can be optimized out when the object is not shared between threads. I will spell it out more clearly. – Liliuokalani
That's what I said: release operation, not release fence. Otherwise they couldn't reorder like I described. – Formica
@PeterCordes A release fence can reorder a lot, as non-atomic operations can pass the fence in both directions until they cross a previous atomic store. – Liliuokalani
@PeterCordes "This answer doesn't support its assertion that mutexes can be removed" They can be compiled as NOPs if they accomplish nothing. "Although that's plausible because they aren't defined as acq_rel" Irrelevant: these mutexes don't interact w/ other threads. "so the acquire can move arbitrarily early and the release arbitrarily late" Irrelevant: the operations are not moved around but compile to nothing (suppressed at codegen). "allowing anything to reorder with anything" I don't see how you would move an unlock around a lock. I hope you can't in general. – Liliuokalani
Move until it bumps into something else that it can't move past, obviously. Anyway, fences exist in the C++ abstract machine. You can only remove them after finding a compile-time ordering that corresponds to the asm you want to emit. – Formica
@PeterCordes Then you can only move the unlock to immediately before the next lock operation, which doesn't look like a good idea in general, and doesn't make any of these operations vanish. The whole point of my (now longer) answer is that they can vanish. – Liliuokalani
Let us continue this discussion in chat. – Liliuokalani
