Should Interlocked.CompareExchange also use a volatile variable?
The following example comes from MSDN.

public class ThreadSafe
{
    // Field totalValue contains a running total that can be updated
    // by multiple threads. It must be protected from unsynchronized 
    // access.
    private float totalValue = 0.0F;

    // The Total property returns the running total.
    public float Total { get { return totalValue; }}

    // AddToTotal safely adds a value to the running total.
    public float AddToTotal(float addend)
    {
        float initialValue, computedValue;
        do
        {
            // Save the current running total in a local variable.
            initialValue = totalValue;

            // Add the new value to the running total.
            computedValue = initialValue + addend;

            // CompareExchange compares totalValue to initialValue. If
            // they are not equal, then another thread has updated the
            // running total since this loop started. CompareExchange
            // does not update totalValue. CompareExchange returns the
            // contents of totalValue, which do not equal initialValue,
            // so the loop executes again.
        }
        while (initialValue != Interlocked.CompareExchange(ref totalValue,
            computedValue, initialValue));
        // If no other thread updated the running total, then 
        // totalValue and initialValue are equal when CompareExchange
        // compares them, and computedValue is stored in totalValue.
        // CompareExchange returns the value that was in totalValue
        // before the update, which is equal to initialValue, so the 
        // loop ends.

        // The function returns computedValue, not totalValue, because
        // totalValue could be changed by another thread between
        // the time the loop ends and the function returns.
        return computedValue;
    }
}

Should the totalValue not be declared as volatile to get the freshest value possible? I imagine that if you get a dirty value from a CPU cache, then the call to Interlocked.CompareExchange should take care of getting the freshest value and cause the loop to try again. Would the volatile keyword potentially save one unnecessary loop iteration?

I guess it isn't 100% necessary to have the volatile keyword, because the method has overloads that take data types such as long that don't support the volatile keyword.
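For context, the semantics the loop relies on can be exercised directly. This is a minimal single-threaded sketch of my own (not part of the MSDN sample), using the int overload for simplicity:

```csharp
using System;
using System.Threading;

class Demo
{
    static void Main()
    {
        int value = 10;

        // CompareExchange stores the new value only when the current
        // contents equal the comparand, and always returns the value
        // that was there before the call.
        int before = Interlocked.CompareExchange(ref value, 42, 10);
        // before == 10, value == 42: the exchange succeeded.

        before = Interlocked.CompareExchange(ref value, 99, 10);
        // before == 42, value == 42: the comparand no longer matched,
        // so nothing was stored -- the case where AddToTotal retries.

        Console.WriteLine($"{before} {value}"); // prints "42 42"
    }
}
```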

Pushed answered 5/3, 2021 at 7:54 Comment(0)
No, volatile wouldn't be helpful at all, and certainly not for this reason. It would just give that first read "acquire" semantics instead of effectively relaxed, but either way will compile to similar asm that runs a load instruction.

if you get a dirty value from a CPU cache

CPU caches are coherent, so anything you read from CPU cache is the current globally agreed-on value for this line. "Dirty" just means it doesn't match DRAM contents, and will have to get written-back if / when evicted. A load value can also be forwarded from the store buffer, for a value this thread stored recently that isn't yet globally visible, but that's fine, Interlocked methods are full barriers that result in waiting for the store buffer to drain as well.

If you mean stale, then no, that's impossible, cache coherency protocols like MESI prevent that. This is why Interlocked things like CAS aren't horribly slow if the cache line is already owned by this core (MESI Modified or Exclusive state). See Myths Programmers Believe about CPU Caches which talks some about Java volatiles, which I think are similar to C# volatile.

This C++11 answer also explains some about cache coherency and asm. (Note that C++11 volatile is significantly different from C#, and doesn't imply any thread-safety or ordering, but does still imply the asm must do a load or a store, not optimize into a register.)


On non-x86, running extra barrier instructions after the initial read (to give those acquire semantics) before you even try a CAS just slows things down. (On x86 including x86-64, a volatile read compiles to the same asm as a plain read, except it prevents compile-time reordering).

A volatile read can't be optimized into just using a value in a register if the current thread just wrote something via a non-interlocked = assignment. That's not helpful either; if we just stored something and remember in a register what we stored, a load that does store-forwarding from the store buffer is morally equivalent to just using the register value.

Most of the good use-cases for lock-free atomics are when contention is lowish, so usually things can succeed without the hardware having to wait a long time for access / ownership of the cache line. So you want to make the uncontended case as fast as possible. Avoid volatile even if there were anything to gain from it in highly-contended cases, which I don't think there is anyway.

If you ever did any plain stores (assignments with =, not interlocked RMW), volatile would have an effect on those, too. That might mean waiting for the store buffer to drain before later memory ops in this thread can run, if C# volatile gives semantics like C++ memory_order_seq_cst. In that case, you'd be slowing down the code involving the stores a lot, if you didn't need ordering wrt. other loads/stores. If you did such a store before this CAS code, yeah you'd be waiting until the store (and all previous stores) were globally visible to try reloading it. This would mean a reload + CAS the CPU is waiting to do right after are very likely to not have to spin because the CPU will have ownership of that line, but I think you'd effectively get similar behaviour from the full barrier that's part of an Interlocked CAS.
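In C# terms, the plain-store vs volatile-store distinction discussed above looks roughly like this. A sketch of my own, not from the question; the asm-level costs described above are implementation details of the target CPU:

```csharp
using System;

class Flags
{
    private int plain;            // plain stores: may be reordered or coalesced
    private volatile int flagged; // volatile stores: release semantics, so
                                  // earlier writes can't be reordered past them

    public void Publish(int payload)
    {
        plain = payload; // ordinary store
        flagged = 1;     // volatile store: a thread that observes flagged == 1
                         // via a volatile read is guaranteed to also see 'plain'
    }

    static void Main()
    {
        var f = new Flags();
        f.Publish(7);
        // Single-threaded, so trivially: plain == 7, flagged == 1.
        Console.WriteLine($"{f.plain} {f.flagged}"); // prints "7 1"
    }
}
```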

Selfassertion answered 5/3, 2021 at 9:58 Comment(0)
You could get some insights by studying the source code of the ImmutableInterlocked.Update method:

/// <summary>
/// Mutates a value in-place with optimistic locking transaction semantics
/// via a specified transformation function.
/// The transformation is retried as many times as necessary to win the
/// optimistic locking race.
/// </summary>
public static bool Update<T>(ref T location, Func<T, T> transformer)
    where T : class
{
    Requires.NotNull(transformer, "transformer");

    bool successful;
    T oldValue = Volatile.Read(ref location);
    do
    {
        T newValue = transformer(oldValue);
        if (ReferenceEquals(oldValue, newValue))
        {
            // No change was actually required.
            return false;
        }

        T interlockedResult = Interlocked.CompareExchange(ref location,
            newValue, oldValue);
        successful = ReferenceEquals(oldValue, interlockedResult);
        oldValue = interlockedResult; // we already have a volatile read
            // that we can reuse for the next loop
    }
    while (!successful);
    return true;
}

You can see that the method starts by making a volatile read on the location argument. I think that there are two reasons for that:

  1. This method has a little twist by avoiding the Interlocked.CompareExchange operation, in case the new value happens to be the same as the already stored value.
  2. The transformer delegate has an unknown computational complexity, so invoking it on a potentially stale value could be much more costly than the cost of the initial Volatile.Read.
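A small usage sketch of the early-exit in reason (1). This is my own example, assuming the System.Collections.Immutable package is referenced; ImmutableList<T>.Remove returns the same instance when the item isn't found, which makes the transformer a no-op:

```csharp
using System;
using System.Collections.Immutable;
using System.Threading;

class UpdateDemo
{
    static ImmutableList<int> _list = ImmutableList<int>.Empty;

    static void Main()
    {
        // The transformer returns a new list, so Update performs the
        // CAS and reports that a change happened.
        bool changed = ImmutableInterlocked.Update(ref _list, l => l.Add(1));

        // Remove of a missing element returns the same reference, so
        // Update returns false without ever reaching CompareExchange.
        bool changedAgain = ImmutableInterlocked.Update(ref _list, l => l.Remove(99));

        Console.WriteLine($"{changed} {changedAgain} {_list.Count}"); // "True False 1"
    }
}
```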
Ioneionesco answered 5/3, 2021 at 12:9 Comment(5)
How does Volatile.Read make it less likely that the value is stale by the time you try to CAS it? I think that premise of the question is faulty, as I pointed out in my answer. – Selfassertion
I suspect that Volatile.Read is useful here to make sure transformer is only invoked on a value that was actually visible (locally at least) at one point, not a "torn" read if it's a wide type. (e.g. in case the transform could do something bad on an out-of-range value). Your reason (1) about skipping the CAS certainly justifies that; if a bogus values transforms to itself, you'd never reach the CAS. Maybe there's also some useful effect I'm not thinking of, but I don't think your reason (2) makes sense. – Selfassertion
@PeterCordes the (2) makes sense based on my shallow and naive understanding of how memory works. It seems that you have a much more in-depth knowledge regarding this topic. πŸ˜ƒ – Ioneionesco
CPU caches are coherent. A normal read could get a value from a register, not memory at all, so could potentially be somewhat out of date I guess. But only if this inlined into code that earlier stored to this variable with a non-Volatile write, otherwise the compiler wouldn't think it still knew the value and would emit load instructions. (The same as Volatile.Read but (on non x86) without a barrier afterwards.) – Selfassertion
And only if that code was slow enough for there to be some meaningful chance for the store to have left the store buffer and then been overwritten, but little enough code that the value could still be in a register. (e.g. a slow function call, keeping this value in a call-preserved register). But if the value is shared, you're unlikely to be doing a non-Volatile write to it, so this whole situation is unlikely. A normal read (that compiles to an asm load) or volatile read will see the same value, but normal can avoid a barrier. – Selfassertion
It does not matter since Interlocked.CompareExchange inserts memory barriers.

initialValue = totalValue;

At this point totalValue could be anything: a stale value from cache, a value that was just replaced, who knows. While volatile would prevent reading a cached value, the value might become stale just after it was read, so volatile does not solve anything.

Interlocked.CompareExchange(ref totalValue, computedValue, initialValue)

At this point we have memory barriers that ensure totalValue is up to date. If it is equal to initialValue, then we also know that initialValue was not stale when we started the computation. If it is not equal, we try again, and since we have issued a memory barrier we do not risk getting the same stale value on the next iteration.

Edit: I find it very unlikely that there would be any performance difference. If there is no contention there is little reason for the value to be stale. If there is high contention the time will be dominated by needing to loop.
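The retry path can be simulated deterministically in a single thread by deliberately changing the shared value between the read and the CAS. This is a sketch of my own (the injectOnce flag is a hypothetical stand-in for another thread winning the race), not code from the question:

```csharp
using System;
using System.Threading;

class RetryDemo
{
    static int total = 0;
    static bool injectOnce = true; // hypothetical: simulates a racing writer

    static int AddToTotal(int addend)
    {
        int initial, computed;
        do
        {
            initial = total;
            computed = initial + addend;

            // Simulate another thread updating 'total' exactly once,
            // after we read it but before our CAS.
            if (injectOnce) { injectOnce = false; total += 100; }
        }
        while (initial != Interlocked.CompareExchange(ref total, computed, initial));
        return computed;
    }

    static void Main()
    {
        // First CAS fails (total became 100 behind our back), the loop
        // rereads, and the second CAS succeeds: 100 + 5.
        Console.WriteLine(AddToTotal(5)); // prints "105"
    }
}
```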

Librarianship answered 5/3, 2021 at 9:57 Comment(4)
Stale value from cache - There's no such thing. See my answer (which I was almost finished when you posted yours :P), and software.rajivprab.com/2018/04/29/…. Also, I think you're missing the point. This question is about performance, not correctness. – Selfassertion
@PeterCordes as far as I know, there is nothing in the .NET memory model that guarantees that caches are coherent. The linked article talks about x86/x64 processors, but that is irrelevant. – Librarianship
The OP asked about performance, so it's safe to assume they're using a real-world commercial CPU that a .Net implementation exists for. Therefore, they have coherent data caches. (That's true for all non-x86 systems as well, at least ones where real OSes run multiple threads across the different cores in question. There are a few hybrid ARM boards with microcontroller + DSP that share memory but not coherent caches, but an OS wouldn't run threads of the same process on microcontroller + DSP cores, for precisely that reason among others.) – Selfassertion
The key correct point your answer is making is that there's time for the value to become stale between the read and the CAS. Especially if the read only gets the line in "Shared" state (so the CAS can't finish until it waits for MESI exclusive ownership, i.e. an RFO Read For Ownership), there's lots of time for a store from another core to become visible. And yeah, an acquire barrier after the load wouldn't make that any better; no reason to expect it to decrease the chance of a CAS retry. – Selfassertion

© 2022 - 2024 β€” McMap. All rights reserved.