Is a memory barrier required to read a value that is atomically modified?

Given the following:

class Foo
{
public:
    void Increment()
    {
        _InterlockedIncrement(&m_value); // OSIncrementAtomic
    }

    long GetValue()
    {
        return m_value;
    }

private:
    long m_value;
};

Is a memory barrier required for the read of m_value? My understanding is that _InterlockedIncrement will generate a full memory barrier, which ensures the value is stored before any subsequent loads occur. So that sounds safe from that aspect, but can m_value be cached at all, i.e. can GetValue() return a stale value, even when incrementing atomically?

Jeff Preshing's excellent article(s) for reference: https://preshing.com/20120515/memory-reordering-caught-in-the-act/

Additional context: I'm following a series of articles on lock-free programming, and in particular, looking at the usage of the unfinishedJobs variable and the potential implementation of HasJobCompleted here: https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/

void Wait(const Job* job)
{
  // wait until the job has completed. in the meantime, work on any other job.
  while (!HasJobCompleted(job))
  {
    Job* nextJob = GetJob();
    if (nextJob)
    {
      Execute(nextJob);
    }
  }
}

Determining whether a job has completed can be done by comparing unfinishedJobs with 0.
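For reference, a naive implementation along the lines the article implies might look like this (my sketch; it assumes unfinishedJobs is a plain int32_t member, as in the article):

bool HasJobCompleted(const Job* job)
{
    return job->unfinishedJobs == 0; // plain, non-atomic read of the counter
}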

So, given that context, would a possible implementation of HasJobCompleted require a memory barrier?

Griffey asked 2/4, 2022 at 14:5 Comment(18)
No, you don't need any barrier (a barrier between what?). You need an atomic read (on x86/x64 an aligned long, and this one is aligned, will be atomic), but you don't need a barrier. The current code is ok. – Toitoiboid
@Toitoiboid and where exactly did you see x86/x64? (Let's forget about compiler reordering/caching for the moment.) – Bioecology
@Bioecology I didn't say this is x86/x64; I said that if this is x86/x64, reading m_value will be atomic. And the main point: no barrier is needed here (that's separate from atomicity). – Toitoiboid
A loop reading m_value can hoist the load out of the loop since it's not atomic or volatile. Barriers aren't the problem/solution here, just forcing re-checking of memory (or the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). But seriously don't roll your own atomics without a very compelling reason. (When to use volatile with multi threading? pretty much never.) – Abyss
@Abyss yes, volatile can be used here to prevent the compiler from caching the value. Any API call that is a black box to the compiler also forces it to re-read the value. But the main point: I think this example is wrong by design. GetValue isn't needed at all; you need to use the value returned by the interlocked increment or decrement of m_value. – Toitoiboid
@Toitoiboid: I was assuming that some threads would just call Increment, others would just call GetValue. But good point that Increment fails to return the value, so it's broken by design, making it impossible to use correctly for anything like int sequence_number = val.fetch_add(1). – Abyss
@Abyss based on the comment "The real world use of this sample is to check for completion of a job when m_value becomes zero", I think the real code is different, and the job should begin with if (InterlockedDecrement(&m_value) == 0) do_job();. But in any case, thanks for the great answer, especially the part about the "latest value". – Toitoiboid
Thanks @Toitoiboid, I've updated the question with additional context. – Griffey
Pretty clearly your updated version is going to need some memory ordering somewhere, at least acq/rel synchronization when loading a Job* and dereferencing it, to make sure it sees up-to-date values in the pointed-to memory. But that doesn't need to involve any barriers, just foo.load(std::memory_order_acquire). Hand-rolling your own atomics sucks, and separate barriers make it impossible to compile efficiently for AArch64, where the efficient way even for seq_cst is to use special load/store instructions. – Abyss
As for HasJobCompleted itself, that's fine using job->running.load(std::memory_order_relaxed) or whatever; it's basically equivalent to an exit_now flag. The other loads involved in GetJob() and Execute() are, I assume, unrelated to members of the original job. – Abyss
This 2nd part of the question seems entirely unrelated to your clunky attempt at hand-rolling a class with an increment; I don't see a use-case for that as part of this job-wait loop, and you haven't described one. Seems like it should be a separate question. (Which you don't need to ask, because it'd be a duplicate of Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`? and others.) My answer there even covers some of the same inter-thread latency issues as my answer here. – Abyss
Starting on another side-job instead of completing the main job is maybe something that could justify wanting to do the load in HasJobCompleted as late as possible (defeating the CPU's natural tendency to do loads early), but the out-of-order execution window is only a few hundred instructions on current CPUs, and almost every time through the loop a barrier would just be costing you throughput for your side-jobs. As long as they're useful and you won't usually run out of them, I'd use a relaxed load unless your caller is going to read other shared buffers for that job and needs sync. – Abyss
@Abyss RE: the Foo example class, that was an attempt to abstract the article's usage of unfinishedJobs in the job system. Apologies if that has actually resulted in more confusion in the Q&A. – Griffey
I think you are better off using std::atomic with the defaults (i.e. seq_cst), because otherwise you can end up with a race pretty easily. Lock-free (or wait-free) is a pretty complex topic, and using the correct "barriers" for such algorithms requires both an expert and attention. Once you are confident and versed enough in the existing sync primitives, you will know if you should use a more relaxed memory order. But before that, you are more likely to do the wrong thing, as the topic is very complex. – Bioecology
Ok, yeah, you made a mess of it by using a non-atomic member. The article you linked has int32_t unfinishedJobs; // atomic and mentions modifying it through Interlocked* functions on Windows if you don't use std::atomic, but doesn't mention how to safely read it. Probably they just got lucky and it happened to compile to asm that did a load in loops with other ops that prevented hoisting it. (And x86 asm gives acq/rel.) Without any context of what you were trying to synchronize and how you were using it, the only thing left in the question was the total lack of safety of your reader. – Abyss
Do note the const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);, which requires a return value from the inc/dec functions. That's critically important vs. a separate read of the variable, even when you're using std::atomic like @Bioecology says. 100% agreed with their last comment to use std::atomic with seq_cst for everything. When compiling for x86, only a pure store costs any extra for being seq_cst instead of just release. Any RMW costs the same as SC, and acquire or SC loads are the same thing and don't need any extra asm. – Abyss
Oh also, that linked article has another mistake: _mm_pause does not yield the timeslice to the OS; it only tells the CPU hardware to pause this logical core for a few cycles (or ~100 on Skylake and later), yielding the shared physical core fully to the other hyperthread if there is one, for a brief moment. It doesn't give control back to the OS, and doesn't even put this logical core to sleep. Except for the lack of C++11 std::atomic and that mistake, though, the core idea of the article seems reasonable. – Abyss
You don't need GetValue() at all, or HasJobCompleted(job); you need to use the result of unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);. But the implementation here isn't the best; I would do it like this: pastebin.com/By881Nhz – Toitoiboid

No, you don't need barriers, but your code is broken anyway if readers and writers call these functions in different threads. Especially if a reader calls the read function in a loop.

TL;DR: use C++11 std::atomic<long> m_value with return m_value++ in the increment and return m_value in the reader. That will give you sequential consistency in a data-race-free program: execution will work as if threads ran with some interleaving of source order. (Unless you violate the rules and have other non-atomic shared data.) You definitely want to return a value from Increment if you want threads doing increments to ever know what value they produced. Doing a separate load afterwards is totally broken for use cases like int sequence_num = shared_counter++;, where another thread's increment could be visible between count++; and tmp = count;.
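
That is, something like this (a sketch of the class rewritten with std::atomic; all operations default to seq_cst):

#include <atomic>

class Foo
{
public:
    long Increment()
    {
        return m_value++; // atomic RMW; returns the value before the increment
    }

    long GetValue()
    {
        return m_value;   // atomic seq_cst load
    }

private:
    std::atomic<long> m_value{0};
};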

If you don't need such strong ordering with respect to operations on other objects in the same thread as the reader/writer, return m_value.load(std::memory_order_acquire) is enough for most uses, and m_value.fetch_add(1, std::memory_order_acq_rel) for the increment. Very few programs actually need StoreLoad barriers anywhere; atomic RMWs can't actually reorder very much even with acq_rel. (On x86, both of those will compile the same as if you'd used seq_cst.)
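
Spelled out, that's the same sketch as above with explicit memory_order arguments:

long Increment()
{
    // acq_rel is sufficient for the RMW if you don't need seq_cst
    // ordering with respect to operations on other objects
    return m_value.fetch_add(1, std::memory_order_acq_rel); // returns the old value, like m_value++
}

long GetValue()
{
    return m_value.load(std::memory_order_acquire);
}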

You can't force ordering between threads; the load either sees the value or it doesn't, depending on whether the reading thread saw the invalidate from the writer before or after it took / tried to take a value for the load. The whole point of threads is that they don't run in lock-step with each other.


Data-race UB:

A loop reading m_value can hoist the load out of the loop since it's not atomic (or even volatile as a hack). This is data-race UB; compilers will break your reader. See this Q&A and Multithreading program stuck in optimized mode but runs normally in -O0.
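
For example, a spin-wait written against the original class can legally be transformed like this (hypothetical caller code; the second function shows one transformation the optimizer is entitled to make):

// hypothetical caller, using the original non-atomic Foo:
void WaitForZero(Foo& foo)
{
    while (foo.GetValue() != 0) { } // what you wrote
}

// what the compiler may effectively emit, since m_value is not atomic
// or volatile and nothing in this thread writes it:
void WaitForZero_asCompiled(Foo& foo)
{
    long tmp = foo.GetValue();      // one load, hoisted out of the loop
    while (tmp != 0) { }            // infinite loop unless it was already 0
}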

Barriers aren't the problem/solution here; what's needed is just forcing a re-check of memory (or rather, of the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). That's not what barriers really do; they order this thread's accesses to coherent cache. C++ threads only run across cores with coherent caches.

But seriously don't roll your own atomics without a very compelling reason. When to use volatile with multi threading? pretty much never. That answer explains cache coherency and that you don't need barriers to avoid seeing stale values.

In many real-world C++ implementations, something like std::atomic_thread_fence() will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without volatile, but that's an implementation detail. So it may happen to work well enough on some compilers for some ISAs. And it still isn't fully safe against the compiler inventing multiple loads; see the LWN article Who's afraid of a big bad optimizing compiler? for examples with details. It's primarily aimed at how the Linux kernel rolls its own atomics with volatile, which is de-facto supported by GCC/clang.
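
The kernel's trick amounts to forcing every read through a volatile access, so the compiler must emit exactly one load per source-level read. A sketch of the idea in C++ (my illustration of the concept, not the kernel's actual macro definitions):

// roughly the effect of the Linux kernel's READ_ONCE(): the volatile
// qualification stops the compiler from caching or duplicating the load
// (and in practice, for an aligned register-width scalar T, it compiles
// to a single load instruction)
template <typename T>
T read_once(const T& x)
{
    return *static_cast<const volatile T*>(&x);
}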


"latest value"

Beginners often panic over this, and think that RMW operations are somehow better because of the way they're specified. Since they're a read + write tied together, and there is a modification order for every memory location separately, RMW operations necessarily have to wait for write access to a cache line, and that means serializing all writes and RMWs on a single location.
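
For example, concurrent RMWs on one counter serialize against each other, so no increments are ever lost (a minimal demonstration, not code from the question):

#include <atomic>
#include <thread>

std::atomic<long> counter{0};

void hammer()
{
    for (int i = 0; i < 1'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed); // each RMW owns the cache line for its read+write
}

int main()
{
    std::thread t1(hammer), t2(hammer);
    t1.join();
    t2.join();
    // counter is now exactly 2'000'000: all RMWs on this location fall
    // into a single modification order, even with relaxed ordering
}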

Plain loads of atomic variables are still guaranteed (by real implementations) to see values promptly. (ISO C++ only suggests that values should be seen in finite time, and promptly, but of course real implementations can do much better because they run on cache-coherent CPU hardware.)

There's no such thing as "immediate" between two threads; either a load in another thread sees a value stored, or it ran before the store became visible to other threads and didn't. With thread scheduling and so on, it's always possible that a thread will load a value but then not use it for a long time; it was fresh when it was loaded.

So this is pretty much irrelevant for correctness, and all that's left is worrying about inter-thread latency. That could in some cases be helped by barriers (by reducing contention from later memory operations, not by actively flushing your stores faster; barriers just wait for that to happen the normal way). But that's usually a very minor effect, and not a reason to use extra barriers.

See MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?. And see my comments on https://github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? Often no, and if it does, not by much.

Certainly not enough to be worth slowing down the reader with lots of extra barriers just to make it look at this atomic variable later than other atomic variables, if you didn't need that ordering for correctness. Or slowing down the writer to make it sit there doing nothing to maybe let it complete an RFO a few cycles sooner instead of getting other useful work done.

If your use of threading is so bottlenecked on inter-core latency that it was worth it, that's probably a sign you need to rethink your design.

Without barriers or ordering, just std::atomic with memory_order_relaxed, you'll still normally see data on other cores within maybe 40 nanoseconds (on a modern x86 desktop/laptop), about the same as if both threads were using atomic RMWs. And it's not possible for it to be delayed for any significant amount of time; a microsecond maybe, if you create lots of contention from lots of earlier stores so they all take a long time to commit before this one can. You definitely don't have to worry about going a long time with stores not being visible. (This of course only applies to std::atomic or hand-rolled atomics with volatile. Plain non-volatile loads may only check once at the start of a loop, and then never again. That's why they're unusable for multithreading.)

Abyss answered 3/4, 2022 at 0:35 Comment(3)
@Bioecology: Thanks, you're right, one of those was an oversimplification. Updated to say in more detail what I meant. – Abyss
@Bioecology: Other answers seem incorrect in implying that this is a significant effect and a reason you might want to use a barrier, and that using an RMW to get "the latest value" is significantly different from what a normal load would see; so I should say I'm disagreeing with that. Thanks. I guess partly I wanted to disagree with the wrong claim in PSIAlt's answer that implies a barrier is necessary to see an updated value at all, not just faster, but I should reword that separately. – Abyss
@Bioecology: Updated again. I think the key misconception I wanted to debunk is that barriers actively flush the store buffer (or that they do anything to CPU cache). They don't; they just wait for earlier loads and/or stores to complete on their own before letting later things happen. So I changed my answer to say that instead of complaining about "other answers", because what yours said isn't technically wrong, but the way it's said may imply the wrong thing. And now my answer corrects that misconception for anyone who picked it up. – Abyss

Yes, barriers should be on both sides: reading and writing. Imagine you have a store buffer and a load queue where everything might be out of order. You flush the store buffer when writing your stuff, but the load queue you need to deal with is on the other thread (processor), which knows nothing about your flush. So it is always a paired process.
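
In std::atomic terms, the pairing looks like this (a sketch: the release in the writer synchronizes with the acquire in the reader):

#include <atomic>

int payload;                        // plain data, published via the flag
std::atomic<bool> ready{false};

void writer()
{
    payload = 42;                                  // 1. write the data
    ready.store(true, std::memory_order_release);  // 2. publish it
}

void reader()
{
    while (!ready.load(std::memory_order_acquire)) {} // wait for the publish
    int x = payload; // safe: the acquire load pairs with the release store
    (void)x;
}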

You can also think about it with your compiler hat on: unless a compiler is forced to serialize access, it has the liberty to reorder anything it can safely (from its point of view) reorder.

That said, this is all about serialization, not atomicity, which is another matter entirely. Your write (_InterlockedIncrement) is atomic, but your read (return m_value) isn't. So that's a race.

Also, I see that your code requires atomicity, but I don't see any need for serialization: you do not protect anything with your m_value. As to the "stale" value: usually you can't guarantee that you won't have a stale value at some point in time, even with barriers. RMW operations require the latest value, but others don't, so having a barrier will help you get the latest value faster, but that's it. Given your code, and setting the race aside, the compiler might safely assume that you do not modify m_value and cache it. The same might be said about the CPU.


All that said: just use std::atomic when you need a shared variable that isn't otherwise protected. It will make sure the value is not cached by any entity.

Bioecology answered 2/4, 2022 at 14:46 Comment(11)
Thanks for the explanation. The real-world use of this sample is to check for completion of a job when m_value becomes zero. I don't actually need the atomic increment, but as it acts as a barrier it was useful. Thinking about it, if I'd written it with a barrier only in Increment, the answer would appear more obvious, as the functions would be unbalanced. Thanks. – Griffey
@Griffey "when m_value becomes zero": do you initially assign a negative value to m_value? And about "check for completion of a job": post your actual code; I think you need to check the result of the interlocked operation instead of checking the value itself. However, for the current code the answer is wrong: no barrier is needed. In any case you atomically read some value of m_value. – Toitoiboid
@Toitoiboid there is no atomic read. – Bioecology
@Bioecology formally, yes, the read of m_value is not atomic. In practice I'm sure that on any platform (x86/x64 at a minimum) a read of an aligned(!) long value will be atomic. But the question here is about whether a memory barrier is needed, so I ask: between which operations is a barrier needed? What must be ordered here? – Toitoiboid
@Toitoiboid the question is about C++, so the answer should also be more general, since no specifics were requested. Also, in practice, if you don't somehow ensure that m_value is actually read instead of being cached, you can easily end up in the situation where the cached value is always used. I agree that the code in question doesn't need barriers, but since the question is also pretty generic, I pointed out that barriers should be used in pairs. What the code needs is a proper variable which no one would be able to cache. I suggested one: std::atomic. So I don't see where my answer is wrong. – Bioecology
@Bioecology I think the stricter question would be: which memory order should be used for an atomic read of m_value? You didn't say exactly which memory order, and neither did I, but I meant memory_order_relaxed when I said that no barrier is needed, only atomicity. However, this code isn't the actual code, as I understand from the first comment, and I'm sure the GetValue() function isn't needed at all: you need to use the value returned by the interlocked operation, and do something exactly when InterlockedDecrement (not increment!) returns 0. – Toitoiboid
@Bioecology, I assume GetValue will require a StoreLoad barrier in this scenario (it ensures all previous stores are visible to other processors, so all loads receive the latest value). – Griffey
@Griffey: No. If you want the expense of sequential consistency for ordering of operations on multiple different locations, the sane way to do that is with StoreLoad barriers as part of the store, and just acquire as part of the load, so loads are cheap. See cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for how std::atomic operations map to asm on various ISAs; e.g. a seq_cst store on x86 is done with xchg, but a seq_cst load is just a plain mov. – Abyss
@Griffey: If you're hand-rolling your own atomics for some insane reason (instead of using C++ std::atomic for .fetch_add(1) and .load()), you do need m_value to be at least volatile, though. lwn.net/Articles/793253. Caching "by the CPU" doesn't break coherency; only the compiler keeping a value in registers is a problem. See When to use volatile with multi threading?. – Abyss
Re: "latest value", see MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?, and especially, re: this answer's incorrect suggestion that memory barriers make data visible faster, see my comments on github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? – Abyss
@Abyss thanks for your comprehensive answer. I've updated the original question to include additional context: I'm following a series of articles on lock-free programming. – Griffey

Short answer: Yes.

Your "reading" should have "acquire" order, so your Increment() result will be visible in another thread when it do a "release" after an incremnet.

Turin answered 2/4, 2022 at 15:51 Comment(2)
No, you don't need any barrier. Between what operations?! – Toitoiboid
No, those barriers would mean that if the load happens to see a value from a store, then later operations "happen after" operations in the writing thread that were before its release-store. In many real-world C++ implementations, something like std::atomic_thread_fence() will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without volatile, but that's an implementation detail. And it still isn't fully safe against the compiler inventing multiple loads; see lwn.net/Articles/793253 for example. – Abyss
