If the caller does `g(); atomic_thread_fence(acquire);`, it could create a happens-before relationship with another thread that wouldn't exist if zero loads had been executed. It would have no way of knowing *what* it synced with, since the load result is thrown away, but it's not totally obvious that optimizing away the loads entirely would be equivalent.
Perhaps there's some timing reason on some real implementation that means some store in another thread should be visible to that relaxed load. This sounds pretty hand-wavy, but some run-time timing conditions could lead to UB-free executions that could be broken by optimizing away all the loads.
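To illustrate (a minimal sketch with hypothetical names, not code from the question): fence-to-fence synchronization only applies when some atomic load actually reads a value written before the other thread's release fence, so an execution's freedom from data races can hinge on a load whose result is discarded.

```c++
#include <atomic>

std::atomic<int> flag{0};
int payload;   // plain, non-atomic data

void writer() {
    payload = 42;                                         // A
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);             // B
}

void g() {
    flag.load(std::memory_order_relaxed);   // result discarded
}

int reader() {
    g();   // if this load happened to read the value stored by B...
    std::atomic_thread_fence(std::memory_order_acquire);
    // ...the release and acquire fences synchronize, A happens-before
    // this point, and this read of `payload` is race-free in that
    // execution. Deleting the "dead" load makes such an execution
    // impossible.
    return payload;
}
```

If some implementation-specific timing guarantee meant the load always saw B, the program would be UB-free; removing the load is what would break that.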
Collapsing 100 loads down to 1 rather than 0 wouldn't have this problem, but it would still require a specific optimization pass to look for unused loads not separated by any memory-ordering effects like fences. It seems hard to argue with the safety of that; if code was relying on each load happening, as a delay measure or for MMIO, it should use `volatile atomic`.
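For instance, in a hypothetical polling loop (not from the question), `volatile` already expresses "every access must really happen", even combined with atomics:

```c++
#include <atomic>

volatile std::atomic<int> status{0};

void poll_100_times() {
    // Every one of these loads must appear in the asm, even with the
    // result unused, because the object is volatile. With a plain
    // std::atomic, a compiler could legally collapse or remove them.
    for (int i = 0; i < 100; ++i)
        status.load(std::memory_order_relaxed);
}
```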
Collapsing multiple loads or stores in general is a separate thing that's been discussed: *Can and does the compiler optimize out two atomic loads?* / *Why don't compilers merge redundant std::atomic writes?*
Two back-to-back loads of the same atomic object might or might not produce the same value if other threads are writing concurrently. If a compiler makes asm that only loads once, it's effectively nailing down part of the run-time memory-ordering at compile time. That's usually ok to a limited extent, but as those WG21 papers point out, it's not always ok, especially for stores. (Compiling a C++ program for an ISA with a strong memory model also makes some ISO-C++-allowed memory orderings impossible in real executions, but it doesn't tend to remove possible interleavings of sequentially-consistent executions across threads.)
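Here's a sketch of what that nailing-down means, with hypothetical names:

```c++
#include <atomic>

std::atomic<int> a{0};

void observe(int& x, int& y) {
    x = a.load(std::memory_order_relaxed);
    y = a.load(std::memory_order_relaxed);
    // As written, a store from another thread can land between the two
    // loads, so x != y is a possible outcome. If the compiler merges
    // them into one load, x == y in every execution: part of the
    // run-time ordering has been nailed down at compile time.
}
```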
None of this is a real obstacle to collapsing 100 unused loads down to 1, but it helps explain why compilers currently choose not to optimize atomics at all: it's tied up with thornier issues.
## Practical reasons
Compiler internals may reuse some of their existing support for `volatile` to handle atomics, i.e. not assuming that multiple reads will give the same value. This has the side effect of basically treating them like `volatile atomic`.
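For example (a hypothetical function, not from the question): at the time of writing, GCC and clang compile this to two separate loads, exactly as they would for a `volatile atomic`, even though folding it into one load is allowed on paper.

```c++
#include <atomic>

// Current compilers emit two loads of `a` here; folding this into
// 2 * a.load(relaxed) would be legal, but they don't do it.
int sum_twice(std::atomic<int>& a) {
    return a.load(std::memory_order_relaxed)
         + a.load(std::memory_order_relaxed);
}
```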
For GCC, @o11c linked https://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations about GCC's limitations on optimizing atomics. It claims that [intro.races]/19 would forbid treating `int x=a.load(relaxed); int y=a.load(relaxed);` as `x=y=a.load(relaxed);`. But that's a misreading: the rule forbids doing the loads in the other order, which could give `x` a newer value than `y` from the modification order of `a`. Forcing both loads to take the same value from the modification order doesn't violate the read-read coherence rule ([intro.races]/16) that the note is summarizing. The last edit to that GCC wiki page was apparently in 2016, before WG21/P0062R1; hopefully most GCC devs involved with C/C++ atomics have realized since then that optimization is allowed on paper by the ISO standard, even for loads whose results are used.
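Spelling out the coherence argument as a sketch (hypothetical names): read-read coherence only forbids the second load from seeing an *older* value than the first; both reads taking the same value is always among the legal outcomes.

```c++
#include <atomic>

std::atomic<int> a{0};

void as_written(int& x, int& y) {
    x = a.load(std::memory_order_relaxed);
    y = a.load(std::memory_order_relaxed);
    // Read-read coherence: y's value may not be older than x's in a's
    // modification order. Newer or equal is fine.
}

void merged(int& x, int& y) {
    x = y = a.load(std::memory_order_relaxed);
    // Equal values trivially satisfy "y not older than x", so this is
    // a legal compilation of as_written().
}
```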
Also, compiler devs may be reluctant to add code that looks for optimizations that are very rarely profitable. A larger codebase for a compiler like GCC, LLVM, or MSVC takes more dev work to maintain, potentially slowing the addition of other features.
Looking for such optimizations also costs compile time. That might not be a problem here: modern ahead-of-time compilers already transform program logic into SSA form, where unused results should be easy to find even in less trivial cases than this (e.g. when load results are assigned to local vars that are unused after optimization). Compilers can already warn about unused return values for cases this trivial.
Comments:

- "…`volatile` to force loads and stores to use memory (for visibility). Then the lack of optimization is just a side effect." – Foilsman
- "…`std::atomic` implementation can achieve that is with `volatile`. For example, on x86, load/store operations to `volatile int` are usually equivalent to their `std::atomic<int>` (`mo_relaxed`) counterparts. Without `volatile`, the compiler might keep a store local to the CPU (i.e. in a register) and no other core would observe it. I am not saying `volatile` is required; an implementation may have a different approach to get the same result (e.g. gcc, see link above)." – Foilsman
- "…gcc, which treats all atomic operations as full compiler barriers; as such, it has the same limitations as `volatile`-qualified operations, i.e. two relaxed atomic operations (on different variables) will not be reordered even though the memory model allows that. An optimal implementation needs to know the full set of rules of the memory model." – Foilsman