Can std::atomic be used sometimes instead of std::mutex in C++?
I suppose that std::atomic can sometimes replace usages of std::mutex. But is it always safe to use an atomic instead of a mutex? Example code:

std::atomic_flag f, ready; // shared

// ..... Thread 1 (and others) ....
while (true) {
    // ... Do some stuff in the beginning ...
    while (f.test_and_set()); // spin, acquire system lock
    if (ready.test()) {
        UseSystem(); // .... use our system for 50-200 nanoseconds ....
    }
    f.clear(); // release lock
    // ... Do some stuff at the end ...
}

// ...... Thread 2 .....
while (true) {
    // ... Do some stuff in the beginning ...
    InitSystem();
    ready.test_and_set(); // signify system ready
    // .... sleep for 10-30 milli-seconds ....
    while (f.test_and_set()); // acquire system lock
    ready.clear(); // signify system shutdown
    f.clear(); // release lock
    DeInitSystem(); // finalize/destroy system
    // ... Do some stuff at the end ...
}

Here I use std::atomic_flag to protect use of my system (some complex library). But is this code safe? Here I assume that if ready is false the system is not available and I can't use it, and if it is true the system is available and I can use it. For simplicity, suppose the code above doesn't throw exceptions.

Of course I can use std::mutex to protect reads/modifications of my system. But right now I need very high-performance code in Thread-1, which should use atomics very often instead of mutexes (Thread-2 can be slow and use mutexes if needed).

In Thread-1 the system-usage code (inside the while loop) runs very often, each iteration taking around 50-200 nanoseconds, so using extra mutexes would be too heavy. But Thread-2 iterations are quite long: as you can see, in each iteration of its while loop, when the system is ready it sleeps for 10-30 milliseconds, so using mutexes only in Thread-2 is quite alright.

Thread-1 is just one example thread; in my real project there are several threads running the same (or very similar) code as Thread-1.

I'm concerned about memory-operation ordering, meaning that it can probably sometimes happen that the system is not yet in a fully consistent state (not yet fully inited) when ready becomes true in Thread-1. It may also happen that ready becomes false in Thread-1 too late, when the system has already performed some destroying (deinit) operations. Also, as you can see, the system can be inited/destroyed many times in the loop of Thread-2 and used many times in Thread-1 whenever it is ready.

Can my task be solved somehow without std::mutex and other heavy stuff in Thread-1, using only std::atomic (or std::atomic_flag)? Thread-2 can use heavy synchronization stuff if needed, mutexes etc.

Basically, Thread-2 should somehow propagate the whole inited state of the system to all cores and other threads before ready becomes true, and Thread-2 should also propagate ready equal to false before any single small destruction (deinit) operation is done. By propagating state I mean that all the system's inited data should be written 100% consistently to global memory and to the caches of other cores, so that other threads see a fully consistent system whenever ready is true.

It is even allowed to make a small (milliseconds) pause after system init and before ready is set to true, if that improves the situation and guarantees, and also to pause after ready is set to false and before starting system destruction (deinit). Doing some expensive CPU operations in Thread-2 is also alright, if there exist operations like "propagate all Thread-2 writes to global memory and to the caches of all other CPU cores and threads".
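As a minimal sketch of the publication requirement described above (my own illustration, not a verified solution for the full system): a release store on ready after init, paired with an acquire load in the reader, guarantees that everything written before the store is visible after the load. Here a plain int stands in for the system's state; the names are hypothetical.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int system_state = 0; // non-atomic payload standing in for the inited system

void init_thread() {
    system_state = 42;                            // InitSystem(): plain writes
    ready.store(true, std::memory_order_release); // publish: all prior writes
                                                  // become visible to acquirers
}

int reader_thread() {
    while (!ready.load(std::memory_order_acquire)) {} // spin until published
    // The acquire load synchronizes with the release store, so the
    // reader is guaranteed to observe system_state == 42 here.
    return system_state;
}
```

The key point is the release/acquire pairing on the same atomic: without it, the reader could see ready == true while system_state is still stale.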

Update: As a solution to my question above, in my project I decided to use the following code with std::atomic_flag to replace std::mutex:

std::atomic_flag f = ATOMIC_FLAG_INIT; // shared
// .... Later in all threads ....
while (f.test_and_set(std::memory_order_acquire)) // try acquiring
    std::this_thread::yield();
shared_value += 5; // Any code, it is lock-protected.
f.clear(std::memory_order_release); // release

This solution runs in 9 nanoseconds on average (measured over 2^25 operations) in a single thread (release build) on my Windows 10 64-bit 2 GHz 2-core laptop, while using std::unique_lock<std::mutex> lock(mux); for the same protection purpose takes 100-120 nanoseconds on the same Windows PC. If threads need to spin instead of sleeping while waiting, then instead of std::this_thread::yield(); in the code above I just use a semicolon ;. Full online example of usage and time measurements.

Bo answered 14/2, 2021 at 9:27 Comment(6)
I don't know why you think that mutexes are heavy. They aren't. There are only two issues: (1) a mutex can cause a long wait depending on usage, (2) the memory-fencing instructions can be problematic. You'd be much better off using a mutex rather than a spinlock.Palgrave
@Palgrave I tested std::mutex on my Win 10 64-bit; it turned out that just the std::unique_lock<std::mutex> lock(mux); line takes around 100-120 nanoseconds. On Linux it is much faster, around 20-30 nanoseconds. And atomic<size_t>/atomic_flag are both 17 nanoseconds on both Windows and Linux. All of this was tested in Clang release -O3. For me 17 nanoseconds is much more preferable than 100-120 nanoseconds.Bo
Properly testing the performance of such things is extremely complex, thus I have serious doubts that you could do it properly. Furthermore, you need to take into account various complex issues surrounding your code and the synchronisation within. E.g., how do you even use atomics? You need to place appropriate memory fencing everywhere in the atomics to even begin testing fairly. Then you need to think about what really takes more time during a program run: single instructions or memory fencing? The latter is normally the primary concern.Palgrave
@Palgrave That is basically the purpose of my question: how to correctly solve my task using only atomics, whether it is possible with only atomic values or whether it can't be solved without extra memory fencing. In general the question is when (in what cases) atomics can replace mutexes. In its most general form an atomics solution will probably not be faster than a mutex, but for some particular special-case tasks atomics may be significantly faster. So I wanted to figure out in what cases atomics can be used as a faster replacement for a mutex.Bo
In general, no. A mutex and an atomic variable do two different things. A mutex protects code, and an atomic variable protects data. The question asked here is not what the title suggests; the question is whether it's possible to implement a mutex using atomics, and the superficial answer is yes, of course.Amadeus
@PeteBecker there is atomic_flag, whose only purpose is to implement a spinlock - which is the same as a mutex, except that it is ill-advised, as most modern mutex implementations spin for a short while anyway.Palgrave
I'll ignore your code for the sake of the answer; the answer generally is yes.

A lock does the following things:

  1. allows only one thread to acquire it at any given time
  2. when the lock is acquired, a read (acquire) barrier is placed
  3. right before the lock is released, a write (release) barrier is placed

The combination of the 3 points above makes the critical section thread-safe: only one thread can touch the shared memory, all previous changes are observed by the locking thread because of the read barrier, and all of its changes become visible to other locking threads because of the write barrier.
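The three points above can be sketched with std::atomic_flag and explicit memory orders (a minimal illustration of the mapping, not a production lock; the names here are my own):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
int shared_value = 0; // plain (non-atomic) data protected by the flag

void locked_add(int n) {
    // (1) mutual exclusion: only one thread gets past test_and_set;
    // (2) memory_order_acquire acts as the "read barrier" on entry
    while (lock_flag.test_and_set(std::memory_order_acquire)) {}
    shared_value += n; // critical section: ordinary code, safely exclusive
    // (3) memory_order_release acts as the "write barrier" right before unlock
    lock_flag.clear(std::memory_order_release);
}
```

With this pairing, every writer's changes inside the critical section are guaranteed visible to the next thread that acquires the flag.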

Can you use atomics to achieve this? Yes. And real-life locks (provided, for example, by Win32/POSIX) ARE implemented using atomics and lock-free programming.

Now, realistically speaking, should you use a self-written lock instead of the standard locks? Absolutely not.

Many concurrency tutorials preserve the notion that spin-locks are "more efficient" than regular locks. I can't stress enough how foolish that is. A user-mode spinlock IS NEVER more efficient than a lock that the OS provides. The reason is simple: OS locks are wired into the OS scheduler. So if a thread tries to acquire a lock and fails, the OS knows to freeze this thread and not reschedule it to run until the lock has been released.

With user-mode spinlocks, this doesn't happen. The OS can't know that the relevant thread is trying to acquire the lock in a tight loop. Yielding is just a patch, not a solution: we want to spin for a short time, then go to sleep until the lock is released. With user-mode spinlocks, we might waste the entire thread quantum trying to lock the spinlock and yielding.

I will say, for the sake of honesty, that recent C++ standards do give us the ability to sleep on an atomic, waiting for it to change its value. So we can, in a very lame way, implement our own "real" locks that spin for a while and then sleep until the lock is released. However, implementing a correct and efficient lock when you're not a concurrency expert is pretty much impossible.

My own philosophical opinion is that in 2021, developers should rarely deal with these very low-level concurrency topics. Leave those things to the kernel guys. Use some high-level concurrency library and focus on the product you want to develop rather than micro-optimizing your code. This is concurrency, where correctness >>> efficiency.

A related rant by Linus Torvalds

Endosteum answered 14/2, 2021 at 11:1 Comment(7)
Just a note - in my case I have very tiny operations, and very many of them. Each operation is dozens of nanoseconds, so spin-locking on atomics should probably be faster than sleeping the thread via the OS scheduler, because there is no profit in switching to the scheduler, waiting there for 30 nanoseconds and waking the thread up again. Would be great if you could provide some code for the correct way of using atomics to model your 3 steps; the main thing I don't know is how to set read and write barriers. Because I have multiple cores, each thread spin-locking for dozens of nanoseconds is alright.Bo
Can you redesign your code so that every thread works on its own non-atomic data and then all threads combine their results? Usually this gives the most correct and efficient code. The best way to share data between threads is not to share it to begin with.Endosteum
Unfortunately this can't be redesigned in such a way; the whole code using the System is very chaotic. I put a simple while loop just as an example. In the real code of my project the System is used in a chaotic manner: a lot of operations at random points in time in each thread, and the threads all run different code. Each operation is very short, dozens of nanoseconds. But because they all use the same shared System, access to it should be uniquely locked by all worker threads, locked separately for every small operation.Bo
Locks in all threads are read-write locks, because each operation mixes reading and writing sub-operations on the System. Basically, the threads are not doing the same code and work, hence they can't be orchestrated as one system of parallel computations. So each operation needs to acquire a global unique lock on the System, do 20-50 nanoseconds of reads/writes from/to the System, and then release the lock.Bo
Would be great if you could, if it's not too difficult, show me the smallest possible example code in standard C++20 that does the following: 1) acquires a unique global lock by spin-locking on a single atomic, 2) issues a read barrier, 3) reads/writes to the system (any non-atomic code), 4) issues a write barrier, 5) releases the atomic lock. I just don't know how to do these 5 steps in standard C++20. For example, I see this doc but can't figure out how to use it totally correctly for my case of 5 steps.Bo
For example, is this correct code for those 5 steps mentioned in my comment above? Have I used correct memory orders? Is this kind of locking totally correct for any complex code inside the locked block (i.e., no inner operations can escape out of these fence barriers)?Bo
It sounds like RCU or flat-combining is what you need. I still think you can redesign your code.Endosteum
