How does the C++ Standard prevent deadlock in a spinlock mutex with memory_order_acquire and memory_order_release?

TL:DR: if a mutex implementation uses acquire and release operations, could an implementation do compile-time reordering of the kind that would normally be allowed, and overlap two critical sections that should be independent, from different locks? This would lead to a potential deadlock.


Assume a mutex is implemented on top of std::atomic_flag:

struct mutex
{
   void lock() 
   {
       while (flag.test_and_set(std::memory_order_acquire))
       {
          yield_execution(); // stand-in for e.g. std::this_thread::yield()
       }
   }

   void unlock()
   {
       flag.clear(std::memory_order_release);
   }

   std::atomic_flag flag; // = ATOMIC_FLAG_INIT in pre-C++20
};

So far this looks OK as far as a single such mutex is concerned: the std::memory_order_release in unlock() synchronizes-with the std::memory_order_acquire in lock().

The use of std::memory_order_acquire/std::memory_order_release here should not raise questions at first sight. It is similar to the cppreference example: https://en.cppreference.com/w/cpp/atomic/atomic_flag
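
To make that concrete, here is a minimal sketch (assuming the mutex above, with yield_execution() standing in for something like std::this_thread::yield()): the release in unlock() synchronizes-with the next successful acquire in lock(), so plain writes inside one critical section happen-before reads in the next one.

#include <atomic>

mutex m;         // the spinlock defined above
int payload = 0; // ordinary (non-atomic) data guarded by m

void writer()
{
    m.lock();
    payload = 42;       // plain write; unlock()'s release publishes it
    m.unlock();
}

void reader()
{
    m.lock();           // acquire synchronizes-with the writer's release
    int seen = payload; // no data race: sees 42 if writer's section ran first
    m.unlock();
    (void)seen;
}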

Now there are two mutexes guarding different variables, and two threads accessing them in different order:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();
    m1.unlock();

    m2.lock();
    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();
    m2.unlock();

    m1.lock();
    v1.use();
    m1.unlock();
}

Release operations can be reordered after an unrelated acquire operation (unrelated meaning a later acquire operation on a different object), so the execution could be transformed as follows:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();

    m2.lock();
    m1.unlock();

    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();

    m1.lock();
    m2.unlock();

    v1.use();
    m1.unlock();
}

So it looks like there is a deadlock.

Questions:

  1. How does the Standard prevent such mutexes from deadlocking?
  2. What is the best way to implement a spinlock mutex that does not suffer from this issue?
  3. Is the unmodified mutex from the top of this post usable in some cases?

(Not a duplicate of C++11 memory_order_acquire and memory_order_release semantics?, though it is in the same area)

Tumulus answered 19/4, 2020 at 4:51 Comment(9)
"Release operations can be reordered after unrelated acquire operation" What is an "unrelated acquire operation"?Laynelayney
Release operations can be reordered after unrelated acquire operation - note that the C++ standard does not specify things in those terms. It specifies in terms of what values a thread can see after doing an acquire load that synchronizes-with a release-store (or a release-sequence). But generally compilers making asm for some target ISA can at least in theory do the reordering you suggest at compile time. Certainly for a plain release store, but IDK if real compilers would do it for a release RMW.Janiuszck
@NicolBolas: unrelated = a later acquire operation on a different object. So in terms of a 1-way-barrier model like shown in preshing.com/20120913/acquire-and-release-semantics, they can reorder.Janiuszck
@PeterCordes that's correct what I meant by unrelated expanded post with your explanationTumulus
I added a TL:DR at the top so future readers don't have to wade through a long question and read the details of your code before they even know what problem you have in mind.Janiuszck
@PeterCordes,thanks for TL:DR section. But there's emphasis on compile-time reordering. Could such reordering also happen in runtime? See: godbolt.org/z/A3Te52 , I see that lock release is compiled as mov DWORD PTR std::atomic_flag f, 0 , so I'm also concerned about runtime reordering.Tumulus
@AlexGuteniev: If it happens once at runtime, who cares? It won't keep happening and stop the release from ever happening. There's obviously no problem in the pure ISO C++ standard itself, only in implementations that have to nail something down in asm for some ISA.Janiuszck
So, you are saying that compiler barriers after clear(release) and before test_and_set(acquire) fixes the issue? And there's no problem in C++ itself, only with implementation, so these compiler barriers should be parts of std::atomic_flag, and not the responsibility of std::atomic_flag users? This may be the answer already.Tumulus
No, I wasn't saying anything about compiler barriers being necessary. At that point I was still thinking through why this wasn't a real-world problem, and whether it was a QoI issue or a standards-compliance issue. I kept it short because I was part way through writing an answer about the retry loop being the key :PJaniuszck

There's no problem in the ISO C++ standard; it doesn't distinguish compile-time vs. run-time reordering, and the code still has to execute as if it ran in source order on the C++ abstract machine. So the effects of m2.test_and_set(std::memory_order_acquire) trying to take the 2nd lock can become visible to other threads while still holding the first (i.e. before m1.unlock()'s clear), but failure there can't prevent m1 from ever being released.

The only way we'd have a problem is if compile-time reordering nailed down that order into asm for some machine, such that the m2 lock retry loop had to exit before actually releasing m1.

Also, ISO C++ only defines ordering in terms of synchronizes-with and what can see what, not in terms of re-ordering operations relative to each other into some new order. That would imply such an order existed. No order that multiple threads can agree on is even guaranteed to exist for separate objects, unless you use seq_cst operations. (A modification order for each object separately is guaranteed to exist, though.)

The 1-way-barrier model of acquire and release operations (like the diagram in https://preshing.com/20120913/acquire-and-release-semantics) is a convenient way to think about things, and matches reality for pure-loads and pure-stores on x86 and AArch64 for example. But as far as language-lawyering, it's not how the ISO C++ standard defines things.
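
As a rough annotation of that model (a mental aid only, not standard wording; x and y are arbitrary atomics):

#include <atomic>

std::atomic<int> x{0}, y{0};

void one_way_barriers()
{
    // Acquire load: later operations may not be reordered before it,
    // but earlier operations may sink below it.
    int v = x.load(std::memory_order_acquire);

    // ... operations here are confined between the acquire and the release ...

    // Release store: earlier operations may not be reordered after it,
    // but later operations may hoist above it.
    y.store(v + 1, std::memory_order_release);
}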


You're reordering a whole retry loop, not just a single acquire

Reordering an atomic operation across a long-running loop is a theoretical problem allowed by the C++ standard. P0062R1: When should compilers optimize atomics? points out that delaying a store until after a long-running loop is technically allowed by the standard's wording of 1.10p28:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

But a potentially infinite loop would violate that: it is not finite in the deadlock case, for example. So compilers must not do that.
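
As a concrete picture of that store-delaying transformation (a hedged sketch; the names and the spin loops are illustrative, not from the paper), consider two threads that each publish a flag and then wait for the other's:

#include <atomic>

std::atomic<bool> f1{false}, f2{false};

void publisherA()
{
    f1.store(true, std::memory_order_release); // must become visible
                                               // "in a finite period of time"
    while (!f2.load(std::memory_order_acquire)) { } // spin
}

void publisherB()
{
    f2.store(true, std::memory_order_release);
    while (!f1.load(std::memory_order_acquire)) { } // spin
}

// If a compiler sank each release store below the following spin loop,
// neither flag would ever become visible and both loops would run forever,
// which is exactly the infinite delay the [intro.progress] wording forbids.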

It's not "just" a quality-of-implementation issue. A successful mutex lock is an acquire operation, but you should not look at the retry loop as a single acquire operation. Any sane compiler won't.

(The classic example of something that aggressive atomics optimization could break is a progress bar, where the compiler sinks all the relaxed stores out of a loop and then folds all the dead stores into one final store of 100%. See also this Q&A - current compilers don't, and basically treat atomic as volatile atomic until C++ solves the problem of giving programmers a way to let the compiler know when atomics can/can't be optimized safely.)
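
For concreteness, a minimal sketch of that progress-bar pattern (the names are illustrative):

#include <atomic>

std::atomic<int> progress{0}; // polled by a UI thread that draws the bar

void worker()
{
    for (int i = 1; i <= 100; ++i)
    {
        // ... do 1% of the work ...
        progress.store(i, std::memory_order_relaxed);
    }
}

// An aggressive optimizer could, in principle, sink the store out of the
// loop, folding 100 stores into a single progress = 100 at the end, so
// the bar would jump straight from 0% to done. Current compilers don't.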

Janiuszck answered 19/4, 2020 at 6:47 Comment(15)
So if I implement a timed_mutex and try to lock for an hour, it looks like that, until P0062R1 is addressed, the Standard allows moving such a timed_mutex::unlock() past that hour of another timed_mutex::try_lock_for. And if I implement this hour delay without access to some external clock, with the number of iterations known at compile time (say, I know the minimum duration of the pause instruction for some specific CPU), then a compiler might actually do that. Right? – Tumulus
@AlexGuteniev: IDK, maybe in theory, but then we're into quality-of-implementation territory. A good compiler that people want to use isn't going to reorder an atomic past anything possibly lengthy. And in practice, because it's hard to prove stuff, probably not past any loop whose trip-count isn't provably small at compile time, and not past any non-inline function call. Reordering past an asm volatile is plausible but I'd hope not, and certainly not in a loop. But if the loop was inside the asm statement, maybe spinning on rdtsc for 1 hour could be plausible. – Janiuszck
Or maybe 20x pause might be something real-world code would do, although with pause being ~5 cycles before Skylake but ~100 core clock cycles on SKL, it's maybe unwise to use a fixed number of pause iterations. Hopefully tpause (with a TSC deadline) will be supported on mainstream CPUs later, not just Tremont Atom. – Janiuszck
"The classic example of something that aggressive atomics optimization could break is a progress bar" But then aggressive write optimization could break the write(".",1); console progress bar too. And nobody has ever discussed that optimization because it's insane. Also, aggressive volatile write optimization could break probably almost any device driver, and it's insane too. – Outset
@AlexGuteniev If you really are well versed in semantics, you will notice that nothing in C or C++ is defined. Nothing. No program has defined behavior. It's all QoI. – Outset
(I disagree with curiousguy's paranoia about nothing being well-defined in multi-threaded C++. Certainly some things that many people take for granted are not 100% nailed down by the standard, but some things truly are, and aren't just QoI or implementation details. They've argued this position on numerous SO Q&As without presenting convincing evidence, such as an example of a real algorithm that's impossible to write in a way that a standards-compliant implementation couldn't break.) – Janiuszck
@PeterCordes Doesn't this part of the spec prohibit the re-ordering of the m1.unlock(); m2.lock() pair? (eel.is/c++draft/intro.multithread#intro.progress-7): "For a thread of execution providing concurrent forward progress guarantees, the implementation ensures that the thread will eventually make progress for as long as it has not terminated." Basically, a non-deadlocking program cannot be optimized into a potentially deadlocking one. – Enwrap
@Erik: That's guaranteeing that no thread will be starved indefinitely (in implementations where std::thread provides concurrent forward progress guarantees, which the standard says it should, but not must; see the next couple of bullet points). That's a stronger guarantee than simply not introducing deadlocks, but yes, it would rule out statically reordering to force m2.lock() to happen before m1.unlock(). It doesn't of course forbid the appearance on one run of an un-contended m2.lock() appearing to take the lock before other threads see m1.unlock()'s release operation. – Janiuszck
@Erik: ISO C++ does (or should) prevent introducing deadlocks even on implementations without "concurrent forward progress" guarantees, where some activities by other threads can starve a thread indefinitely as long as some other threads are making progress. But if those other threads eventually got stuck waiting for a mutex, yes, at that point even a DeathStation 9000 would have no choice but to let the m1.unlock() make progress in this thread. – Janiuszck
@PeterCordes I'm not sure actually that it means what I thought it means. It seems that making an iteration of the test-and-set loop would be considered forward progress. I'm also not really buying your argument based on "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time." A compiler I would want to use cannot optimize my non-deadlocking code into deadlocking code. I can't find a water-tight argument preventing that based on the standard's wording. – Enwrap
@PeterCordes I think the standard requires a well-formed program to not have any infinite loops. The lock-unlock re-ordering can transform a finite loop into an infinite loop. Hence it's not allowed. – Enwrap
@Erik: The C++ standard (unlike C, IIRC) requires no infinite loops that don't contain an atomic or volatile access, or an I/O operation. (Or something like that; I forget how it's phrased. But that might in theory allow a compliant implementation to implement std::thread with cooperative multi-threading where any of those operations are possible context-switch points.) – Janiuszck
@Erik: Agreed that nobody would want to use a compiler that could introduce deadlocks (in this or any other way). I'm 100% sure that compiler devs agree with that, and understand that statically doing this transformation at compile time is thus not allowed. The only question is how strongly we can say that it's formally forbidden by the standard, not just quality-of-implementation. IMO it would at least violate the as-if rule, producing a version of the program that didn't work the way the source did, with infinite loops or not being a qualitative difference. – Janiuszck
@PeterCordes Hmm, you are right, an infinite loop with an atomic access is allowed (eel.is/c++draft/intro.multithread#intro.progress-1). Seems to me that the standard technically allows the lock-unlock transformation. – Enwrap
@Erik: I'm not totally convinced. It's not disallowed for that reason, but it seems like a violation of as-if. I do find it convincing that infinitely delaying visibility of the unlock would violate at least the "should ... finite time" part I quoted. It doesn't seem like something even a DeathStation 9000 implementation should be allowed to do. I hope there's also some more relevant formalism somewhere. Maybe it would be interesting to ask a new language-lawyer question, or put a bounty on this one, if someone else can find something in the standard. – Janiuszck

You can try CppMem (help page) to test things like that.
CppMem is a tool by the authors of Mathematizing C++ Concurrency.
To prove that some behavior of a program is allowed by the C++ memory model, one has to find an execution of the program that violates none of the memory model's many rules.
CppMem can do this automatically.

I simplified your program to this (to reduce the number of executions that CppMem has to check):

// RA mutex deadlock
int main() {
  atomic_int m1 = 0;
  atomic_int m2 = 0;
  
  {{{ { 
        // m1.lock
        atomic_compare_exchange_strong_explicit(m1, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1.unlock
        m1.store(0,memory_order_release);
        
        // m2 is busy
        m2.load(memory_order_acquire).readsvalue(1);
      }
  ||| { 
        // m2.lock
        atomic_compare_exchange_strong_explicit(m2, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1 is busy
        m1.load(memory_order_acquire).readsvalue(1);
      }
  }}}
  return 0;
}

CppMem finds an execution where both threads are waiting for mutexes:

[CppMem execution graph: both threads' loads read the value 1]

In this execution, both threads are simultaneously trying to acquire a lock held by the other thread.
That is similar to a deadlock, but fortunately it cannot last forever, because the standard also guarantees this:

  • 6.9.2.3 [intro.progress]:

    18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

  • 33.5.4 [atomics.order]:

    11 Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

The quotes mean that the write m1.store(0,memory_order_release) (which is the last write in the modification order for m1) will have to become visible to m1.load(memory_order_acquire) eventually, and then the threads will be able to make progress.

Trichromatism answered 21/5, 2023 at 21:16 Comment(3)
That's not a deadlock, that's just a failure of some finite number of iterations of the lock-taking spin loop. Because, as you say, there is a pending store which the standard requires to become visible to future loads. – Janiuszck
Unfortunately, that tool only works with C++11. I wish it had support for the C++20 concurrency features too. – Chloramphenicol
@PeterCordes I've changed the wording. – Trichromatism
