No, you don't need barriers, but your code is broken anyway if readers and writers call these functions in different threads. Especially if a reader calls the read function in a loop.
TL:DR: use C++11 `std::atomic<long> m_value` with `return m_value++` in the increment and `return m_value` in the reader. That will give you sequential consistency in a data-race-free program: execution will work as if threads ran with some interleaving of source order. (Unless you violate the rules and have other non-atomic shared data.)

You definitely want to return a value from `Increment`, if you want threads doing increments to ever know what value they produced. Doing a separate load afterwards is totally broken for use cases like `int sequence_num = shared_counter++;`, where another thread's increment could become visible between `count++;` and `tmp = count;`.
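A minimal sketch of that TL:DR, assuming a class shape like the question's (the `Counter` name is illustrative):

```cpp
#include <atomic>

class Counter {
    std::atomic<long> m_value{0};   // default memory order is seq_cst
public:
    // Atomic RMW; post-increment returns the value *before* the increment,
    // so each thread learns which sequence number it produced.
    long Increment() { return m_value++; }
    long GetValue() const { return m_value.load(); }   // seq_cst load
};
```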
If you don't need such strong ordering with respect to operations on other objects in the same thread as the reader/writer, `return m_value.load(std::memory_order_acquire)` is enough for most uses, and `m_value.fetch_add(1, std::memory_order_acq_rel)` for the increment. Very few programs actually need StoreLoad barriers anywhere; atomic RMWs can't actually reorder very much even with `acq_rel`. (On x86, those will both compile the same as if you'd used `seq_cst`.)
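The weaker-ordered variant might look like this (same assumed class shape; as noted above, on x86 both functions compile to the same asm as the default `seq_cst` versions):

```cpp
#include <atomic>

class Counter {
    std::atomic<long> m_value{0};
public:
    long Increment() {
        // fetch_add returns the old value, like m_value++ does
        return m_value.fetch_add(1, std::memory_order_acq_rel);
    }
    long GetValue() const {
        return m_value.load(std::memory_order_acquire);
    }
};
```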
You can't force ordering between threads; the load either sees the value or it doesn't, depending on whether the reading thread saw the invalidate from the writer before or after it took / tried to take a value for the load. The whole point of threads is that they don't run in lock-step with each other.
Data-race UB:
A loop reading `m_value` can hoist the load out of the loop, since it's not `atomic` (or even `volatile` as a hack). This is data-race UB; compilers will break your reader in practice. See this and Multithreading program stuck in optimized mode but runs normally in -O0.
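For illustration, a hand-rolled reader of the kind that breaks (a hypothetical minimal repro, not the question's exact code):

```cpp
// BROKEN: data-race UB on a plain non-atomic variable.
long m_value = 0;                 // shared with a writer thread

void writer() { m_value = 42; }   // runs in another thread

void reader() {
    // With optimization, the compiler may load m_value once before the
    // loop and never re-check it, so this can spin forever even after
    // writer() has run.
    while (m_value != 42) {
    }
}
```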
Barriers aren't the problem/solution here; what's needed is just forcing a re-check of memory (or of the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). That's not what barriers really do: they order this thread's accesses to coherent cache, and C++ threads only run across cores with coherent caches.
But seriously, don't roll your own atomics without a very compelling reason. When to use volatile with multi threading? Pretty much never. That answer explains cache coherency and why you don't need barriers to avoid seeing stale values.
In many real-world C++ implementations, something like `std::atomic_thread_fence()` will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without `volatile`, but that's an implementation detail. So it may happen to work well enough, on some compilers for some ISAs. And it still isn't fully safe against the compiler inventing multiple loads; see the LWN article Who's afraid of a big bad optimizing compiler? for examples with details. It's primarily aimed at how the Linux kernel rolls its own atomics with `volatile`, which is de-facto supported by GCC/clang.
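For reference, this is roughly what the kernel-style idiom that article discusses boils down to (it relies on the GCC/clang `__typeof__` extension, and is shown only to illustrate the hand-rolled approach being advised against; prefer `std::atomic` in portable C++):

```cpp
// Kernel-style hand-rolled load: the volatile cast forces the compiler
// to emit exactly one load here, instead of reusing a cached value or
// inventing extra loads of x.
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))
```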
"latest value"
Beginners often panic over this, and think that RMW operations are somehow better because of the way they're specified. Since they're a read + write tied together, and there is a modification order for every memory location separately, RMW operations necessarily have to wait for write access to a cache line, and that means serializing all writes and RMWs on a single location.
Plain loads of atomic variables are still guaranteed (by real implementations) to see values promptly. (ISO C++ only suggests that values should be seen in finite time, and promptly, but of course real implementations can do much better because they run on cache-coherent CPU hardware.)
There's no such thing as "immediate" between two threads; either a load in another thread sees a value stored, or it ran before the store became visible to other threads and didn't. With thread scheduling and so on, it's always possible that a thread will load a value but then not use it for a long time; it was fresh when it was loaded.
So this is pretty much irrelevant for correctness; all that's left is worrying about inter-thread latency. That could in some cases be helped by barriers (by reducing contention from later memory operations, not by actively flushing your stores faster; barriers just wait for that to happen the normal way). It's usually a very minor effect, and not a reason to use extra barriers.
See MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?. And see my comments on https://github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? Often no, and if it does, not by much.
Certainly not enough to be worth slowing down the reader with lots of extra barriers just to make it look at this `atomic` variable later than other `atomic` variables, if you didn't need that ordering for correctness. Or slowing down the writer to make it sit there doing nothing, to maybe let it complete an RFO a few cycles sooner instead of getting other useful work done.
If your use of threading is so bottlenecked on inter-core latency that it was worth it, that's probably a sign you need to rethink your design.
Without barriers or ordering, just `std::atomic` with `memory_order_relaxed`, you'll still normally see data on other cores within maybe 40 nanoseconds (on a modern x86 desktop/laptop), about the same as if both threads were using atomic RMWs. And it's not possible for it to be delayed for any significant amount of time: a microsecond maybe, if you create lots of contention with lots of earlier stores so they all take a long time to commit before this one can. You definitely don't have to worry about going a long time with stores not being visible. (This of course only applies to `atomic` or hand-rolled atomics with `volatile`. Plain non-`volatile` loads may only check once at the start of a loop, and then never again; that's why they're unusable for multithreading.)
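For example, even a fully relaxed flag like this is seen promptly on real hardware (a sketch; the names are illustrative):

```cpp
#include <atomic>

std::atomic<bool> done{false};   // relaxed: no ordering of other data

void waiter() {
    while (!done.load(std::memory_order_relaxed)) {
        // Typically observes the store within tens of nanoseconds of it
        // committing; relaxed only drops ordering guarantees around this
        // load, not visibility or promptness of the flag itself.
    }
}

void finisher() {
    done.store(true, std::memory_order_relaxed);
}
```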
Comments:

A loop reading `m_value` can hoist the load out of the loop since it's not `atomic` or `volatile`. Barriers aren't the problem/solution here, just forcing re-checking of memory (or the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). But seriously, don't roll your own atomics without a very compelling reason. (When to use volatile with multi threading? Pretty much never.) – Abyss

Presumably one thread would call `Increment` and others would just call `GetValue`. But good point that `Increment` fails to return the value, so it's broken by design, making it impossible to use correctly for anything like `int sequence_number = val.fetch_add(1)`. – Abyss

My actual use is more like `if (InterlockedDecrement(&m_value) == 0) do_job();`. But in any case, thanks for the great answer, especially about "latest value". – Toitoiboid
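In `std::atomic` terms that pattern would be something like the following sketch (`do_job` and the initial count are placeholders); note that `fetch_sub` returns the old value, while `InterlockedDecrement` returns the new one:

```cpp
#include <atomic>

std::atomic<long> m_value{3};   // e.g. 3 outstanding pieces of work

void do_job();                   // hypothetical, as in the comment above

void finish_one() {
    // fetch_sub returns the value *before* the decrement, so the thread
    // that sees 1 here performed the final decrement. (InterlockedDecrement
    // returns the *new* value, hence its comparison against 0.)
    if (m_value.fetch_sub(1, std::memory_order_acq_rel) == 1)
        do_job();
}
```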
If you're loading a `Job*` and dereferencing it, you need ordering to make sure it sees up-to-date values in the pointed-to memory. But that doesn't need to involve any barriers, just `foo.load(std::memory_order_acquire)`. Hand-rolling your own atomics sucks, and separate barriers make it impossible to compile efficiently for AArch64, where the efficient way even for seq_cst is to use special load/store instructions. – Abyss

For `HasJobCompleted` itself, that's fine using `job->running.load(std::memory_order_relaxed)` or whatever; it's basically equivalent to an `exit_now` flag. The other loads involved in `GetJob()` and `Execute()` are, I assume, unrelated to members of the original `job`. – Abyss

A barrier could at most delay the load in `HasJobCompleted` as late as possible (defeating the CPU's natural tendency to do loads early), but the out-of-order execution window is only a few hundred instructions on current CPUs, and almost every time through the loop a barrier would just be costing you throughput for your side-jobs. As long as they're useful and you won't usually run out of them, I'd use a relaxed load, unless your caller is going to read other shared buffers for that job and needs sync. – Abyss

Regarding the `Foo` example class: that was an attempt to abstract the article's usage of `unfinishedJobs` in the job system. Apologies if that has actually resulted in more confusion in the Q&A. – Griffey

Use `std::atomic` with the defaults (i.e. `seq_cst`), because otherwise you can end up with a race pretty easily. Lock-free (or wait-free) is a pretty complex topic, and using correct "barriers" for such algorithms requires both an expert and attention. Once you are confident and versed enough in the existing sync primitives, you will know whether you should use a more relaxed memory order. But before that, you are more likely to do the wrong thing, as the topic is very complex. – Bioecology

The article declares `int32_t unfinishedJobs; // atomic` and mentions modifying it through `Interlocked*` functions on Windows if you don't use `std::atomic`, but doesn't mention how to safely read it. Probably they just got lucky and it happened to compile to asm that did a load in loops with other ops that prevented hoisting it. (And x86 asm gives acq/rel.) Without any context of what you were trying to synchronize and how you were using it, the only thing left in the question was the total lack of safety of your reader. – Abyss

The article itself uses `const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);`, which requires a return value from the inc/dec functions. That's critically important vs. a separate read of the variable, even when you're using `std::atomic` like @Bioecology says. 100% agreed with their last comment to use `std::atomic` with `seq_cst` for everything. When compiling for x86, only a pure store costs any extra for being `seq_cst` instead of just `release`. Any RMW costs the same as SC, and acquire and SC loads are the same thing and don't need any extra asm. – Abyss

`_mm_pause` does not yield the timeslice to the OS; it only tells the CPU hardware to pause this logical core for a few cycles (~100 on Skylake and later), yielding the shared physical core fully to the other hyperthread, if there is one, for a brief moment. It doesn't give control back to the OS, and doesn't even put this logical core to sleep. Except for the lack of C++11 `std::atomic` and that mistake, though, the core idea of the article seems reasonable. – Abyss

You don't need `GetValue()` at all, or `HasJobCompleted(job)`; you need to use the result of `unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);`. But the implementation here is not the best; I'd use something like this: pastebin.com/By881Nhz. – Toitoiboid