Firstly...
Things to ignore:

`memory_order_consume` - apparently no major compiler implements it; they all silently replace it with the stronger `memory_order_acquire`. Even the standard itself says to avoid it. A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot. It also lets you ignore related features like `[[carries_dependency]]` and `std::kill_dependency`.
Data race: Writing to a non-atomic variable from one thread and reading/writing to it from a different thread, if nothing prevents the two actions from happening at the same time, is called a data race, and causes undefined behavior.
'Race condition' is not a synonym; it's just a loose term for an intermittent bug that depends on how fast the threads happen to execute. (The buggy program may or may not also have a data race.)
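To make this concrete, here's a minimal sketch (my own illustration, not from the standard) of a data race and of atomic accesses that avoid it:

```cpp
#include <atomic>
#include <thread>

int plain_counter = 0;               // non-atomic
std::atomic<int> atomic_counter{0};  // atomic

int main()
{
    auto work = []{
        plain_counter++;              // data race: unsynchronized writes from two threads, UB
        atomic_counter.fetch_add(1);  // fine: atomic accesses never cause data races
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
}
```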
Memory orders:
`memory_order_relaxed` is the weakest and supposedly the fastest memory order.

Any reads/writes to atomics can't cause data races (and the resulting UB). `relaxed` provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But this holds only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on the different variables are interleaved.

It's as if relaxed operations propagate between threads with slight unpredictable delays.

This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (i.e. you can't synchronize access to it with them).
By "threads agree on the order" I mean that:
- Each thread will access each separate variable in the exact order you tell it to. E.g. `a.store(1, relaxed); a.store(2, relaxed);` will write `1`, then `2`, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other (see the litmus test below).
- If thread A writes to a variable several times, and thread B then reads it several times, B will receive the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
- No other guarantees are given.
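For instance, here's the classic store-reordering litmus test (a sketch of my own; the surprising outcome is rarely observed in practice, but the standard allows it):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

int main()
{
    std::thread t1([]{
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    std::thread t2([]{
        int r1 = y.load(std::memory_order_relaxed);
        int r2 = x.load(std::memory_order_relaxed);
        // r1 == 1 && r2 == 0 is allowed: t2 may observe the second store
        // without the first, since relaxed promises no cross-variable order.
        (void)r1; (void)r2;
    });
    t1.join();
    t2.join();
}
```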
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on `shared_ptr`s that increment the reference count internally use `relaxed`.
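A minimal stop-flag sketch (my own illustration):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> stop_flag{false};

void worker()
{
    // The flag guards no other data, so relaxed is enough.
    while (!stop_flag.load(std::memory_order_relaxed))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

int main()
{
    std::thread t(worker);
    stop_flag.store(true, std::memory_order_relaxed);
    t.join();
}
```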
Fences: `atomic_thread_fence(relaxed);` does nothing.
`memory_order_release`, `memory_order_acquire` do everything `relaxed` does, and more (so they're supposedly slower or equivalent).

Only stores (writes) can use `release`. Only loads (reads) can use `acquire`. Read-modify-write operations such as `fetch_add` can be both (`memory_order_acq_rel`), but they can also be just `release` or just `acquire`.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to an atomic variable A, and stops touching memory M.

If thread 2 then performs an acquire load of the same variable A, and the load sees the value stored by thread 1, then the store in thread 1 is said to synchronize with this load in thread 2.

Now thread 2 can safely read/write that memory M (without the data race UB you would have otherwise).
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
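Here's that pattern as code (a sketch of my own; the spin loop guarantees the acquire load eventually sees the release store):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // the non-atomic memory M
std::atomic<bool> ready{false};  // the atomic variable A

int main()
{
    std::thread t1([]{
        payload = 42;                                  // touch M...
        ready.store(true, std::memory_order_release);  // ...then release-store to A
    });
    std::thread t2([]{
        while (!ready.load(std::memory_order_acquire)) {}  // acquire-load A until the store is seen
        int x = payload;  // safe: the release store synchronized with our acquire load, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
}
```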
There's a special rule that synchronization propagates across any number of read-modify-write operations, regardless of their memory order. E.g. if thread 1 does `a.store(1, release);`, then thread 2 does `a.fetch_add(2, relaxed);`, then thread 3 does `a.load(acquire)`, then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.

In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation), are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if `fetch_add` was using `acquire` or `acq_rel`, thread 1 would additionally synchronize with it, and conversely, if it used `release` or `acq_rel`, it would additionally synchronize with thread 3.
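That three-thread chain looks like this as code (a sketch of my own; the spin loops pin down which operation reads which value):

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<int> a{0};

int main()
{
    std::thread t1([]{
        data = 42;
        a.store(1, std::memory_order_release);  // head of the release sequence
    });
    std::thread t2([]{
        while (a.load(std::memory_order_relaxed) != 1) {}  // wait for thread 1's store
        a.fetch_add(2, std::memory_order_relaxed);  // RMW: continues the release sequence
    });
    std::thread t3([]{
        while (a.load(std::memory_order_acquire) != 3) {}  // reads from the fetch_add...
        int x = data;  // ...so it synchronizes with thread 1's store; safe, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
    t3.join();
}
```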
Example use: `shared_ptr` decrements its reference counter using something like `fetch_sub(1, acq_rel)`.
Here's why: imagine that thread 1 reads/writes to `*ptr`, then destroys its copy of `ptr`, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.

Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the `acq_rel` synchronization in `fetch_sub` is necessary. Otherwise you'd have a data race and UB.
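Here's a sketch of that refcount logic (a hypothetical, heavily simplified control block of my own, not the real `shared_ptr` implementation):

```cpp
#include <atomic>

struct control_block
{
    std::atomic<int> refcount{1};
    // ...the managed object would live here...

    void add_ref()
    {
        // Incrementing publishes nothing to other threads, so relaxed suffices.
        refcount.fetch_add(1, std::memory_order_relaxed);
    }

    void release_ref()
    {
        // acq_rel: each decrement *releases* this thread's accesses to the object,
        // and the final decrement *acquires* them all before running the destructor.
        if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;  // we were the last owner; destroying is now safe
    }
};

int main()
{
    auto *cb = new control_block;  // refcount == 1
    cb->add_ref();                 // refcount == 2
    cb->release_ref();             // refcount == 1
    cb->release_ref();             // refcount == 0, deletes the block
}
```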
Fences: Using `atomic_thread_fence`, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or a read with any other order) from one or more variables, then do `atomic_thread_fence(acquire)` in the same thread, then all those reads count as acquire operations.

Conversely, if you do `atomic_thread_fence(release)`, followed by any number of (possibly relaxed) writes, those writes count as release operations.
An `acq_rel` fence combines the effects of `acquire` and `release` fences.
Of course, code placed between the fence and the affected atomic operation doesn't benefit from the fence (i.e. code after a relaxed read but before the acquire fence, or conversely, after the release fence but before a relaxed write). Without this rule we would have time travel: you could, say, call a release fence at the beginning of a thread and an acquire fence at the end, and thus bless all operations in between, which doesn't make any sense.
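For example, the release/acquire example above could be rewritten with fences (a sketch of my own, equivalent to the earlier version):

```cpp
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

int main()
{
    std::thread t1([]{
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // fence first...
        ready.store(true, std::memory_order_relaxed);         // ...makes this relaxed write act as a release
    });
    std::thread t2([]{
        while (!ready.load(std::memory_order_relaxed)) {}     // relaxed read...
        std::atomic_thread_fence(std::memory_order_acquire);  // ...the fence makes it act as an acquire
        int x = payload;                                      // safe, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
}
```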
How to choose between a relaxed order plus a fence vs a stronger order without a fence? Either works; as noted above, a fence can cover several operations or be applied conditionally.
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
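E.g. a mutex hand-off behaves like the release/acquire pattern above (a minimal sketch of my own):

```cpp
#include <mutex>
#include <thread>

int shared_value = 0;
std::mutex m;

int main()
{
    std::thread t1([]{
        std::lock_guard<std::mutex> lock(m);
        shared_value = 42;  // writes made while holding the lock...
    });
    std::thread t2([]{
        std::lock_guard<std::mutex> lock(m);
        // ...are visible here if t1 unlocked before we locked:
        // t1's unlock synchronizes with our lock.
        int x = shared_value;
        (void)x;
    });
    t1.join();
    t2.join();
}
```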
`memory_order_seq_cst` does everything `acquire`/`release` do, and more. This is supposedly the slowest order, but also the safest.
`seq_cst` reads count as acquire operations. `seq_cst` writes count as release operations. `seq_cst` read-modify-write operations count as both.
`seq_cst` operations can synchronize with each other, and with acquire/release operations. Beware of the special effects of mixing them (see below).
`seq_cst` is the default order, e.g. given `atomic_int x;`, `x = 1;` does `x.store(1, seq_cst);`.
`seq_cst` has an extra property compared to acquire/release: all threads agree on the order in which all `seq_cst` operations happen. This is unlike the weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved; see `relaxed` above.
The presence of this global operation order seems to only affect which values you can get from `seq_cst` loads. It doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless `seq_cst` fences are involved, see below), and by itself it doesn't prevent any data race UB that acquire/release operations wouldn't.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (a release write syncing with a seq-cst load, or a seq-cst write syncing with an acquire load). Such a mix essentially demotes the affected seq-cst operation to acquire/release (it may retain some of the seq-cst properties, but you'd better not count on it).
Example use:
```cpp
atomic_bool x = true;
atomic_bool y = true;

// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}

// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
```
Let's say you want at most one thread to be able to enter the `if` body. `seq_cst` allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
Fences: `atomic_thread_fence(seq_cst);` does everything an `acq_rel` fence does, and more.
As you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of `seq_cst` fences, relative to one another and to any `seq_cst` operations (i.e. `seq_cst` fences participate in the global order of `seq_cst` operations, which was described above).
They essentially prevent atomic operations from being reordered across the fence.
E.g. we can transform the above example to:

```cpp
atomic_bool x = true;
atomic_bool y = true;

// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}

// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
```
Both threads can't enter the `if` at the same time, because that would require reordering a load to before the store, across the fence.
But formally, the standard doesn't describe the fences in terms of reordering. Instead, it just explains how the `seq_cst` fences are placed in the global order of `seq_cst` operations. Let's say:

Thread 1 performs operation A on an atomic variable X, using `seq_cst` order OR a weaker order preceded by a `seq_cst` fence.

Then:

Thread 2 performs operation B on the same atomic variable X, using `seq_cst` order OR a weaker order followed by a `seq_cst` fence.

(Here A and B can be any operations, except they can't both be reads, since then it's impossible to determine which one came first.)

Then the first `seq_cst` operation/fence is ordered before the second `seq_cst` operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the `if`) that imposes contradicting requirements on this order, that scenario is impossible.

E.g. in the example above, if the first thread enters the `if`, then the first fence must be ordered before the second one, and vice versa. This means that both threads entering the `if` would lead to a contradiction, and hence isn't allowed.
Interoperation between different memory orders
Summarizing the above:
|              | relaxed write | release write      | seq-cst write      |
|--------------|---------------|--------------------|--------------------|
| relaxed load | -             | -                  | -                  |
| acquire load | -             | synchronizes with  | synchronizes with* |
| seq-cst load | -             | synchronizes with* | synchronizes with  |

\* = The participating seq-cst operation gets a messed-up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequential consistency for data-race-free programs

The standard explains that if your program only uses `seq_cst` accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.
Various orders of operations
As explained above, there's no single timeline in which things happen in a multithreaded program. The standard defines a bunch of "orders" (or relations) between operations.
Those are just a more formal way to look at what has already been explained above.
A is sequenced before B — both happen in the same thread, and B comes after A.
Side note: you might remember that `i++ + i++` is UB. It's UB because neither `i++` is sequenced before the other (though it's not a data race, because everything happens in the same thread).
A synchronizes with B — This was explained in the "acquire/release" section above. Some library operations are said to synchronize with others: a release (or stronger) write synchronizes with an acquire (or stronger) read of the same variable (if the value the read gets comes from this write and not another one); unlocking a mutex synchronizes with the next lock of it; etc.
A happens before B — This is a combination of 'sequenced before' and 'synchronizes with', spanning across threads. A happens before B if and only if: A is sequenced before B, or A synchronizes with B, or transitively (there's a C such that A happens before C and C happens before B). (This is a slightly simplified definition, see below.)
'Happens before' is used in the definition of 'data race'. If neither A nor B happens before the other, and they are operations on the same non-atomic variable from different threads (at least one of them being a write), then they create a data race and UB.
Variable modification orders — Each individual atomic variable has an associated 'modification order', i.e. the order in which all writes to it happen. (Reads also respect this order, though they're not formally a part of it.)
Variable modification orders are consistent with 'happens before' (implying they are also consistent with the 'sequenced before' order of each individual thread).
Global seq-cst order — one single order in which all seq-cst operations (including fences) happen.
It's consistent with individual variable modification orders, and with sequenced-before.
It's usually consistent with 'happens before', but not when you mix seq-cst and acq/rel operations (see below).
It's also consistent with something called coherence-ordered before, which just looks like an extension of variable modification order that includes reads.
There are in fact several flavors of 'happens before':
Simply happens before — uses the simple definition of 'happens before' I gave above.
Inter-thread happens before — used in the definition of `memory_order_consume`, hence completely useless.
Happens before — a combination of 'simply happens before' and 'inter-thread happens before', hence for all practical purposes is equivalent to 'simply happens before'.
This is what's used in the definition of a data race, as stated above.
Strongly happens before — like 'simply happens before', but excludes mixed synchronization of seq-cst and acq-rel.
Therefore the global seq-cst order is consistent with 'strongly happens before', but not always with '(simply) happens before'.