Does Interlocked.CompareExchange use a memory barrier?
I'm reading Joe Duffy's post about Volatile reads and writes, and timeliness, and I'm trying to understand something about the last code sample in the post:

while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
… 

When the second CMPXCHG operation is executed, does it use a memory barrier to ensure that the value of m_state is indeed the latest value written to it? Or will it just use some value that is already stored in the processor's cache? (Assuming m_state isn't declared as volatile.)
If I understand correctly, if CMPXCHG doesn't use a memory barrier, then the whole lock-acquisition procedure won't be fair, since it's highly likely that the thread that was first to acquire the lock will be the one to acquire all of the following locks as well. Did I understand correctly, or am I missing something here?

Edit: The main question is actually whether a call to CompareExchange causes a memory barrier before it attempts to read m_state's value, i.e. whether the assignment of 0 will be visible to all threads when they next call CompareExchange.

Ivon answered 17/10, 2009 at 8:15 Comment(0)
Any x86 instruction that carries a lock prefix is a full memory barrier. As shown in Abel's answer, the Interlocked* APIs, CompareExchange included, use lock-prefixed instructions such as lock cmpxchg, so they imply a full memory fence.

Yes, Interlocked.CompareExchange uses a memory barrier.

Why? Because the x86 processors guarantee it. From Intel's Volume 3A: System Programming Guide Part 1, Section 7.1.2.2:

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

volatile has nothing to do with this discussion; this is about atomic operations. To support atomic operations, the x86 CPU guarantees that all previous loads and stores have completed.

Coburn answered 11/11, 2009 at 16:57 Comment(6)
Worth mentioning that it provides a FULL FENCE and not a half fence.Romp
Is it true for Interlocked.CompareExchange on ARM / AArch64 as well, or is this only a C# implementation detail for x86 that's not part of the language standard guarantees?Holt
@PeterCordes It looks like ARM is different see devblogs.microsoft.com/oldnewthing/20130913-00/?p=3243 - two specific atomic instructions are used for a load-link/conditional strategy.Default
@KindContributor: Right, before ARMv8.1, there was no single-instruction atomic compare-exchange. But ARM / AArch64 do also have a full memory barrier instruction, so it's possible that dmb ish was also required. MS's docs for the C++ _InterlockedIncrement intrinsic indicate that there are _acq and _rel versions, presumably implying that the plain version is seq_cst. (since there isn't a _relaxed or _sc).Holt
A seq_cst RMW operation on ARM (with ldaxr/stlxr) may not be exactly equivalent to a full barrier, but I think it will still stop operations on opposite sides from reordering with each other, even if its own load/store can appear to split up. It's much closer than a fully "relaxed" CAS like ARM can do with just ldxr/stxr (exclusive, but without the acquire and release properties, like C++ memory_order_relaxed).Holt
I tested in C++ godbolt.org/z/c95shc9Yc and MSVC uses both ldaxr/stlxr and dmb ish (data memory barrier: Inner Shareable) for C++ _InterlockedIncrement. So at least that implementation makes it a truly full barrier, going beyond a seq-cst operation, and I'd guess C# might too. Leaving only the question of whether any written spec requires that.Holt
ref doesn't respect the usual volatile rules, especially in things like:

volatile bool myField;
...
RunMethod(ref myField);
...
void RunMethod(ref bool isDone) {
    while(!isDone) {} // silly example
}

Here, RunMethod is not guaranteed to spot external changes to isDone even though the underlying field (myField) is volatile; RunMethod doesn't know about it, so doesn't have the right code.

However! This should be a non-issue:

  • if you are using Interlocked, then use Interlocked for all access to the field
  • if you are using lock, then use lock for all access to the field

Follow those rules and it should work OK.


Re the edit; yes, that behaviour is a critical part of Interlocked. To be honest, I don't know how it is implemented (memory barrier, etc - note they are "InternalCall" methods, so I can't check ;-p) - but yes: updates from one thread will be immediately visible to all others as long as they use the Interlocked methods (hence my point above).

Ammonal answered 17/10, 2009 at 8:23 Comment(2)
I'm not asking about volatiles, only whether an Interlocked.Exchange is necessary when releasing the lock (or whether Thread.VolatileWrite would be more appropriate), and whether the only problem that could arise from this code is a tendency toward "unfairness" (as Joe mentions at the beginning of the post).Ivon
@Marc: the source of InternalCall methods can be viewed (for the most part) through the Shared Source CLI SSCLI, aka Rotor. The Interlocked.CompareExchange is explained in this interesting read: moserware.com/2008/09/how-do-locks-lock.htmlLoretaloretta
There seems to be some comparison with the Win32 API functions of the same name, but this thread is all about the C# Interlocked class. From its very description, it is guaranteed that its operations are atomic. I'm not sure how that translates to the "full memory barriers" mentioned in other answers here, but judge for yourself.

On uniprocessor systems, nothing special happens, there's just a single instruction:

FASTCALL_FUNC CompareExchangeUP,12
        _ASSERT_ALIGNED_4_X86 ecx
        mov     eax, [esp+4]    ; Comparand
        cmpxchg [ecx], edx
        retn    4               ; result in EAX
FASTCALL_ENDFUNC CompareExchangeUP

But on multiprocessor systems, a hardware lock is used to prevent other cores from accessing the data at the same time:

FASTCALL_FUNC CompareExchangeMP,12
        _ASSERT_ALIGNED_4_X86 ecx
        mov     eax, [esp+4]    ; Comparand
  lock  cmpxchg [ecx], edx
        retn    4               ; result in EAX
FASTCALL_ENDFUNC CompareExchangeMP

An interesting read, with some wrong conclusions here and there but all in all excellent on the subject, is this blog post on CompareExchange.

Update for ARM

As so often, the answer is "it depends". It appears that prior to 2.1, the ARM implementations used a half barrier; with the 2.1 release, this behavior was changed to a full barrier for the Interlocked operations.

The current code can be found here and the actual implementation of CompareExchange here. Discussion of the generated ARM assembly, as well as examples of the generated code, can be found in the aforementioned PR.

Loretaloretta answered 10/11, 2009 at 0:18 Comment(4)
Yes, it has to be on x86, but is it also a full barrier on ARM or AArch64 where the hardware can do weakly-ordered atomic RMW?Holt
@PeterCordes, this was answered by me in 2009. The ARM versions of .NET didn't exist back then, it was all x86/x64 (and maybe PowerPC). But since .NET is now all open source, it's trivial to check for both Mono and RyuJIT.Loretaloretta
I found this answer while I was trying to check; this is one of the things that came up on Google. I was looking for documented guarantees so checking the source code wouldn't be great. (And I'd hardly say "trivial". Straightforward probably, but probably time consuming.) This answer was fine for 2009, agreed; my point was that it's not as useful as it could be to current readers. It turns out that another answer on this question cites a standard for evidence that Interlocked ops are at least Acquire / Release.Holt
@PeterCordes, your comment piqued my interest. I've checked the actual implementation and the GitHub discussions, and it appears that it depends on the version. I've updated my answer to include this; feel free to edit my answer further if you find more info on the subject.Loretaloretta
MSDN says about the Win32 API functions: "Most of the interlocked functions provide full memory barriers on all Windows platforms"

(the exceptions are Interlocked functions with explicit Acquire / Release semantics)

From that I would conclude that the C# runtime's Interlocked makes the same guarantees, as they are documented with otherwise identical behavior (and they resolve to intrinsic CPU instructions on the platforms I know of). Unfortunately, with MSDN's tendency to put up samples instead of documentation, it isn't spelled out explicitly.

Fermat answered 17/10, 2009 at 16:33 Comment(0)
According to ECMA-335 (section I.12.6.5):

5. Explicit atomic operations. The class library provides a variety of atomic operations in the System.Threading.Interlocked class. These operations (e.g., Increment, Decrement, Exchange, and CompareExchange) perform implicit acquire/release operations.

So, these operations follow the principle of least astonishment.

Smoothbore answered 27/7, 2017 at 9:37 Comment(0)
The interlocked functions are guaranteed to stall the bus and the CPU while resolving the operands. The immediate consequence is that no thread switch, on your CPU or another one, can interrupt the interlocked function in the middle of its execution.

Since you're passing a reference to the C# function, the underlying assembler code works with the address of the actual integer, so the variable access won't be optimized away. It will work exactly as expected.

edit: Here's a link that explains the behaviour of the asm instruction better: http://faydoc.tripod.com/cpu/cmpxchg.htm
As you can see, the bus is stalled by forcing a write cycle, so any other "threads" (read: other CPU cores) that try to use the bus at the same time are put into a waiting queue.

Marchesa answered 17/10, 2009 at 9:15 Comment(2)
Actually, the reverse (partially) is true. Interlocked does an atomic operation and uses the cmpxchg assembly instruction. It does not require putting the other threads in a wait state, hence it is very performant. See section "Inside InternalCall" on this page: moserware.com/2008/09/how-do-locks-lock.htmlLoretaloretta
There is no shared bus in modern CPUs; lock cmpxchg on an aligned value can just get that CPU core to delay responding to MESI invalidates / share requests, i.e. a cache lock not a bus lock. Anyway, this only tells us about x86, not C# in general for other ISAs.Holt
