Why are unnecessary atomic loads not optimized away?
Let's consider this trivial code:

#include <atomic>

std::atomic<int> a;
void f(){
    for(int k=0;k<100;++k)
        a.load(std::memory_order_relaxed);
}

MSVC, Clang and GCC all perform 100 loads of a, even though it seems obvious they could be optimized away. I expected the function f to be a nop (see the generated code here).

Actually, that is the code generation I expected for a volatile atomic:

volatile std::atomic<int> va;
void g(){
    for(int k=0;k<100;++k)
        va.load(std::memory_order_relaxed);
}

Why do compilers not optimize away unnecessary atomic loads?

Soave answered 8/5, 2019 at 17:52 Comment(12)
I'm not sure we can answer the question of why the compiler fails to perform some specific optimization that would rarely be used. I assume that sane programmers think carefully when writing any line of code that invokes atomic operations, and would not write code like this.Boob
GCC's TODO list: gcc.gnu.org/wiki/Atomic/GCCMM/OptimizationsVillanueva
@Villanueva Funny, I was told that compiler writers don't know how to implement consume semantics, but the GCC team just correctly described how!Mentor
@Brian "programmers think carefully" is a (non-)argument that could be applied to any redundant computation in any code: by that logic, compilers wouldn't need to optimize redundant operations or empty loops either. But reality isn't so simple, and real-world code isn't written with all the information available. Library code in particular needs to assume the worst.Mentor
@JesperJuhl "If you don't care about atomicity, then don't use atomics_" Who suggested that we don't care about atomicity? "And how do you know that a load is unnecessary?" The same ways you know a load of any other non volatile variable is unnecessary. Because it's clearly redundant or even unused. "If some hardware changed the value of a register or memory location via DMA" Why do you believe that non volatile atomics can be safely used for memory mapped registers or DMA? What suggests that in the std or in std related discussion?Mentor
Implementations may use volatile to force loads and stores to use memory (for visibility). Then the lack of optimization is just a side effect.Foilsman
@Foilsman What do you mean? What would happen if they didn't "force loads and stores to use memory (for visibility)"? Visibility of what specifically?Mentor
@LWimsey, considering the GCC document linked above, the induction is in the other direction. Atomic operations are implemented as functions that have unknown side effects. So atomic behave as if they were volatile qualified.Soave
@Mentor Atomic operations must become visible to other cores. One (simple) way a std::atomic implementation can achieve that is with volatile. For example, on X86, load/store operations to volatile int are usually equivalent to their std::atomic<int> (mo_relaxed) counterparts. Without volatile, the compiler might keep a store local to the CPU (ie. in a register) and no other core would observe it. I am not saying volatile is required; an implementation may have an different approach to get the same result (eg. gcc, see link above).Foilsman
@Soave The document is about one implementation, gcc, which treats all atomic operations as full compiler barriers; as such, it has the same limitations as volatile qualified operations; ie. 2 relaxed atomic operations (on different variables) will not be re-ordered even though the memory model allows that. An optimal implementation needs to know the full set of rules of the memory model.Foilsman
@Foilsman 1) I agree that a volatile word-sized variable is a proper impl of relaxed loads/stores. 2) I disagree here: "the compiler might keep a store local to the CPU (ie. in a register)". Correct, w/o volatile the impl may not immediately write to memory for each relaxed store; now how is that a bug?Mentor
@Foilsman In the presence of infinite loops with only mo_relaxed ops, you are right if you want the impl to make stores globally visible in finite time, which is a Good Thing but not strictly required (not a "shall" in standardese). So technically a compiler is allowed to keep a store to an atomic object "in a register" forever (in a purely computational function). The only time it shall make it visible is when a release operation is done. This is ridiculously perverse behavior and not acceptable QoI.Mentor
If the caller does g(); atomic_thread_fence(acquire);, it could create a happens-before relationship with another thread which wouldn't exist if zero loads had been executed. It would have no way of knowing what it synced-with since the load result is thrown away, but it's not totally obvious that optimizing away the loads entirely would be equivalent.

Perhaps there's some timing reason why, on some real implementation, a store in another thread should be visible to that relaxed load. This sounds pretty hand-wavy, but some run-time timing conditions could lead to UB-free executions that would be broken by optimizing away all the loads.

Collapsing 100 loads down to 1 rather than 0 wouldn't have this problem, but it would still require a specific optimization pass to look for unused loads not separated by any memory-ordering effects such as fences. It seems hard to argue with the safety of that; if code were relying on each load actually happening, as a delay measure (or for MMIO), it should use volatile atomic.


Collapsing multiple loads or stores in general is a separate thing that's been discussed: Can and does the compiler optimize out two atomic loads? / Why don't compilers merge redundant std::atomic writes?

Two back-to-back loads of the same atomic object might or might not produce the same value if other threads are writing concurrently. If a compiler makes asm that only loads once, it's effectively nailing down part of the run-time memory-ordering at compile time. That's usually ok to a limited extent, but as those WG21 papers point out, it's not always ok, especially for stores. (Compiling a C++ program for an ISA with a strong memory model also makes some ISO-C++-allowed memory orderings impossible in real executions, but it doesn't tend to remove possible interleavings of sequentially-consistent executions across threads.)

None of this is a real obstacle to collapsing 100 unused loads down to 1, but it's relevant to why compilers currently choose not to optimize atomics at all: the easy cases are tied up with thornier issues.


Practical reasons

Compiler internals may use some of their existing support for volatile to handle atomics, i.e. not assuming that multiple reads will give the same value. This has the side-effect of basically treating them like volatile atomic.

For GCC, @o11c linked https://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations about GCC's limitations on optimizing atomics. It claims that [intro.races]/19 would forbid treating int x=a.load(relaxed); int y=a.load(relaxed); as x=y=a.load(relaxed);. But that's a misreading: the rule forbids performing the loads in the other order, which could give x a newer value from a's modification order. Forcing both loads to return the same value from the modification order doesn't violate the read-read coherence rule ([intro.races]/16) that the note summarizes. The last edit to that GCC wiki page was apparently in 2016, before WG21/P0062R1; hopefully most GCC devs involved with C/C++ atomics have since realized that this optimization is allowed on paper by the ISO standard, even for loads whose results are used.

Also, compiler devs may be reluctant to add code that looks for optimizations that are very rarely profitable. Larger codebases for compilers like GCC, LLVM, and MSVC take more dev work to maintain, potentially slowing the addition of other features and maintenance.

Also, looking for such optimizations costs compile time. That might not be a problem here: modern ahead-of-time compilers already transform program logic into an SSA form where unused results should be easy to find, even in less trivial cases than this (e.g. when load results are assigned to local variables that end up unused after optimization). Compilers can already warn about unused return values in cases this trivial.

Transplant answered 9/12, 2023 at 14:31 Comment(3)
I'm having trouble seeing how you can get a happens-before. I realize that even without testing the value loaded, you may be able to deduce from other program state what the value must have been, and thus maybe get synchronization. But is there an actual example of this? It may be that the coherence rules would always permit you to observe some store which you already knew must have happened-before the load.Crossstitch
Even if there was such an example, I would think any imaginable implementation would still be fine with omitting the load. The necessary ordering on other operations would be imposed by the acquire barrier itself.Crossstitch
@NateEldredge: You can have a happens-before in some executions but not others. That's usually not useful, but it would be slightly more aggressive than normal for a compiler to break even the executions where you could have gotten lucky. "Lucky" if the happens-before was required for correctness; maybe it's not but leads to some observable difference? Like the ordering between two acquire loads on separate locations. Maybe the values you see have some property like x < y if there was a happens-before with a writer. (IDK if you could infer anything from seeing it or not.)Transplant
