Optimizations around atomic load stores in C++
Asked Answered
N

1

10

I have read about std::memory_order in C++ and understood partially. But I still had some doubts around it.

  1. Explanation on std::memory_order_acquire says that, no reads or writes in the current thread can be reordered before this load. Does that mean compiler and cpu is not allowed to move any instruction present below the acquire statement, above it?
auto y = x.load(std::memory_order_acquire);
z = a;  // is it leagal to execute loading of shared `b` above acquire? (I feel no)
b = 2;  // is it leagal to execute storing of shared `a` above acquire? (I feel yes)

I can reason out why it is illegal for executing loads before acquire. But why it is illegal for stores?

  1. Is it illegal to skip a useless load or store from atomic objects? Since they are not volatile, and as I know only volatile has this requirement.
auto y = x.load(std::memory_order_acquire);  // `y` is never used
return;

This optimization is not happening even with relaxed memory order.

  1. Is compiler allowed to move instructions present above acquire statement, below it?
z = a;  // is it leagal to execute loading of shared `b` below acquire? (I feel yes)
b = 2;  // is it leagal to execute storing of shared `a` below acquire? (I feel yes)
auto y = x.load(std::memory_order_acquire);
  1. Can two loads or stores be reordered without crossing acquire boundary?
auto y = x.load(std::memory_order_acquire);
a = p;  // can this move below the below line?
b = q;  // shared `a` and `b`

Similar and corresponding 4 doubts with release semantics also.

Related to 2nd and 3rd question, why no compiler is optimizing f(), as aggressive as g() in below code?

#include <atomic>

int a, b;

void dummy(int*);

void f(std::atomic<int> &x) {
    int z;
    z = a;  // loading shared `a` before acquire
    b = 2;  // storing shared `b` before acquire
    auto y = x.load(std::memory_order_acquire);
    z = a;  // loading shared `a` after acquire
    b = 2;  // storing shared `b` after acquire
    dummy(&z);
}

void g(int &x) {
    int z;
    z = a;
    b = 2;
    auto y = x;
    z = a;
    b = 2;
    dummy(&z);
}
f(std::atomic<int>&):
        sub     rsp, 24
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR b[rip], 2
        mov     DWORD PTR [rsp+12], eax
        mov     eax, DWORD PTR [rdi]
        lea     rdi, [rsp+12]
        mov     DWORD PTR b[rip], 2
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR [rsp+12], eax
        call    dummy(int*)
        add     rsp, 24
        ret
g(int&):
        sub     rsp, 24
        mov     eax, DWORD PTR a[rip]
        mov     DWORD PTR b[rip], 2
        lea     rdi, [rsp+12]
        mov     DWORD PTR [rsp+12], eax
        call    dummy(int*)
        add     rsp, 24
        ret
b:
        .zero   4
a:
        .zero   4
Narcis answered 6/7, 2022 at 19:13 Comment(5)
This has tremendous information: C++ and Beyond 2012: Herb Sutter - atomic Weapons 1 of 2Merited
It's illegal for stores because the acquire might be part of an atomic RMW or double-checked locking that takes a lock, giving mutual exclusion with other threads. Or otherwise seeing that another thread has finished what it was doing to do with a non-atomic object. Full locking would require an atomic RMW or seq_cst loads/stores, but you could have cases where another thread is done writing something, and signals that with a release-store to a flag.Paulus
re: not-done optimizations like removing unused loads: ISO C++ allows that, but current compilers treat atomic very much like volatile, pending further work on compilers and designing ways to prevent optimizations that could be problematic for timing: Why don't compilers merge redundant std::atomic writes?Paulus
Your example in Q1 is a little tricky. How would you complete the program such that the reordering in question could actually be observed? All the ways I can think of would create a data race, or at least a potential data race, and that makes the whole program UB so anything could happen.Persiflage
@CraigEstey Thanks, the talk was excellent. But I had a question regarding some claims made by him in the talk. I have posted the question regarding them here.Narcis
P
4

Q1

Generally, yes. Any load or store that follows (in program order) an acquire load, must not become visible before it.

Here is an example where it matters:

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> x{0};
std::atomic<bool> finished{false};
int xval;
bool good;

void reader() {
    xval = x.load(std::memory_order_relaxed);
    finished.store(true, std::memory_order_release);
}

void writer() {
    good = finished.load(std::memory_order_acquire);
    x.store(42, std::memory_order_relaxed);
}

int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t1.join();
    t2.join();
    if (good) {
        std::cout << xval << std::endl;
    } else {
        std::cout << "too soon" << std::endl;
    }
    return 0;
}

Try on godbolt

This program is free of UB and must print either 0 or too soon. If the writer store of 42 to x could be reordered before the load of finished, then it would be possible that the reader load of x returns 42 and the writer load of finished returns true, in which case the program would improperly print 42.

Q2

Yes, it would be okay for a compiler to delete the atomic load whose value is never used, since there is no way for a conforming program to detect whether the load was done or not. However, current compilers generally don't do such optimizations. Partly out of an abundance of caution, because optimizations on atomics are hard to get right and bugs can be very subtle. It may also be partly to support programmers writing implementation-dependent code, that is able to detect via non-standard features whether the load was done.

Q3

Yes, this reordering is perfectly legal, and real-world architectures will do it. An acquire barrier is only one way.

Q4

Yes, this is also legal. If a,b are not atomic, and some other thread is reading them concurrently, then the code has a data race and is UB, so it is okay if the other thread observes the writes having happened in the wrong order (or summons nasal demons). (If they are atomic and you are doing relaxed stores, then you can't get nasal demons, but you can still observe the stores out of order; there is no happens-before relationship mandating the contrary.)

Optimization comparison

Your f versus g examples is not really a fair comparison: in g, the load of the non-atomic variable x has no side effects and its value is not used, so the compiler omits it altogether. As mentioned above, the compiler doesn't omit the unnecessary atomic load of x in f.

As to why the compilers don't sink the first accesses to a and b past the acquire load: I believe it's simply a missed optimization. Again, most compilers deliberately don't try to do all possible legal optimizations with atomics. However, you could note that on ARM64 for instance, the load of x in f compiles to ldar, which the CPU can definitely reorder with earlier plain loads and stores

Persiflage answered 11/7, 2022 at 22:19 Comment(3)
One way to summarize the reason for acquire loads being ordered wrt. later stores is to make sure those stores don't get stepped-on by stores in code that "happens-before" the release-store it syncs with. BTW, std::endl is not necessary or useful here, you can just use "too soon\n". It's exactly equivalent to << '\n' and an fflush, even on systems that use a different newline convention in text streams. (i.e. std::endl's purpose is not portable line endings, despite the unfortunate choice of name.)Paulus
@Nate Why atomic load of x in f is considered as a side effect even if its value is not used?Narcis
@SouravKannanthaB: An atomic load is not a side effect. I may have worded that poorly. But as mentioned, compilers will usually not optimize it out. In that respect, it is similar to loading a volatile variable, which is considered a side effect.Persiflage

© 2022 - 2024 — McMap. All rights reserved.