If you don't care about the old value, and don't need a full memory barrier (including an expensive StoreLoad, i.e. draining the store buffer before later loads), always use Volatile.Write.
Volatile.Write - atomic release store

Volatile.Write is a store with "release" semantics, which AArch64 can do cheaply, and which x86 can do for free (well, same cost as a non-atomic store, except of course for contention with other cores also trying to write the line). It's basically equivalent to C++ std::atomic<T>::store(value, memory_order_release).
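As a sketch of what that buys you (untested C#, made-up names): a release store lets you publish data with a plain store followed by Volatile.Write of a flag, and a reader that sees the flag via Volatile.Read is guaranteed to also see the data.

    using System.Threading;

    class Publisher
    {
        private double _result;   // payload written with a plain store
        private bool _ready;      // flag published with a release store

        public void Publish(double value)
        {
            _result = value;                    // plain store of the payload
            Volatile.Write(ref _ready, true);   // release: earlier stores can't reorder past this
        }

        public bool TryRead(out double value)
        {
            if (Volatile.Read(ref _ready))      // acquire: pairs with the release store
            {
                value = _result;                // guaranteed to see the store above
                return true;
            }
            value = 0;
            return false;
        }
    }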
For example, in the case of a double, Volatile.Write for x86 (including 32-bit and x86-64) could compile to an SSE2 8-byte store directly from an XMM register, like movsd [mem], xmm0, because x86 stores already have as much ordering as MS's documentation specifies for Volatile.Write. And assuming the double is naturally aligned (which any C# runtime would do, right?), it's also guaranteed to be atomic. (On all x86-64 CPUs, and on 32-bit x86 since P5 Pentium.)
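So for the double case itself, something like this sketch (hypothetical names, untested) gets you an atomic, tear-free store and load without any barrier instructions on x86:

    using System.Threading;

    class LatestSample
    {
        private double _latest;   // assumed naturally aligned by the runtime, as discussed above

        // Release store: atomic (no tearing) and orders earlier memory operations before it,
        // without draining the store buffer.
        public void Publish(double value) => Volatile.Write(ref _latest, value);

        // Acquire load: atomic, never sees half of an old value and half of a new one.
        public double ReadLatest() => Volatile.Read(ref _latest);
    }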
The older Thread.VolatileWrite method in practice uses a full barrier, instead of just being a release operation that can reorder in one direction. That makes it no cheaper than Interlocked.Exchange on x86, and not much cheaper on non-x86. But Volatile.Write/Read don't have that problem of an overly strong implementation that some software probably relies on. They don't have to drain the store buffer, just make sure all earlier stores (and loads) are visible by the time this one is.
Interlocked.Exchange - atomic RMW plus full barrier (at least acq/rel)

This is a wrapper for the x86 xchg instruction, which acts as if it had a lock prefix even if the machine code omits one. That means an atomic RMW, and a "full" barrier as part of it (like x86 mfence).
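The return value is the main reason to reach for it; here's a minimal sketch (made-up names, untested) of a case where the atomic swap plus old value is exactly what you want:

    using System.Threading;

    class DroppedEventCounter
    {
        private int _pending;

        public void Record() => Interlocked.Increment(ref _pending);

        // Atomically grab the current count and reset it to zero in one RMW,
        // so no Record() calls are lost between the read and the reset.
        public int Drain() => Interlocked.Exchange(ref _pending, 0);
    }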
In general, I think the Interlocked class methods originated as wrappers for x86 instructions with the lock prefix; on x86 it's impossible to do an atomic RMW that isn't a full barrier. There are MS C++ functions with those names, too, so this history predates C#.
The current documentation for Interlocked methods (other than MemoryBarrier) on MS's site doesn't even bother to mention that these methods are a full barrier, even on non-x86 ISAs where atomic RMW operations don't require that.
I'm not sure if the full barrier is an implementation detail rather than part of the language spec, but it's certainly the case currently. That makes Interlocked.Exchange a poor choice for efficiency if you don't need that.
This answer quotes the ECMA-335 spec as saying that Interlocked operations perform implicit acquire/release operations. If that's like C++ acq_rel, it's fairly strong ordering, since it's an atomic RMW with the load and store somewhat tied together, and each one prevents reordering in one direction. (But see For purposes of ordering, is atomic read-modify-write one operation or two? - it's possible to observe a seq_cst RMW reordering with a later relaxed operation on AArch64, within the limits allowed by C++ semantics. It's still an atomic RMW, though.)
@Theodor Zoulias found multiple sources online saying that C# Interlocked methods imply a full fence/barrier. For example, Joseph Albahari's online book: "The following implicitly generate full fences: [...] All methods on the Interlocked class". And on Stack Overflow, Memory barrier generators includes all Interlocked class methods in its list. Both of these may just be cataloguing actual current behaviour, rather than what's mandated by the language spec.
I'd assume there's plenty of code that now depends on it, and would break if Interlocked methods changed from being like C++ std::memory_order_seq_cst to relaxed, as the MS docs imply by saying nothing about memory ordering wrt. the surrounding code. (Unless that's covered somewhere else in the docs.)
I don't use C# myself so I can't easily cook up an example on SharpLab with JITted asm to check, but MSVC compiles its _InterlockedIncrement intrinsic to include a dmb ish for AArch64. (Comment thread.) So it seems MS compilers go beyond even the acquire/release guaranteed by the ECMA language spec and add a full barrier, if they do the same thing for C# code.
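If someone does want to check on SharpLab, a probe along these lines (hypothetical, untested) should be enough: look at the ARM64 JIT output and see whether a dmb ish shows up in addition to the atomic increment itself.

    using System.Threading;

    static class BarrierProbe
    {
        private static int s_counter;

        // Inspect the JITted ARM64 asm for this method: a separate dmb ish would mean
        // a full barrier beyond what atomicity of the increment requires.
        public static int Bump() => Interlocked.Increment(ref s_counter);
    }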
BTW, some people only use the term "atomic" to describe RMW operations, not atomic loads or atomic stores. MS's documentation says the Interlocked class "Provides atomic operations for variables that are shared by multiple threads", but the class doesn't provide pure stores or pure loads, which is weird.
(Except for Read([U]Int64), presumably intended to expose 32-bit x86 lock cmpxchg8b with desired=expected, so you either replace a value with itself or load the old value. Either way it dirties the cache line (so it contends with reads by other threads just like any other Interlocked RMW operation) and is a full barrier, so you wouldn't normally read a 64-bit integer this way in 32-bit asm. Modern 32-bit code can just use SSE2 movq xmm0, [mem] / movd eax, xmm0 / pextrd edx, xmm0, 1 or similar, like G++ and MSVC do for std::atomic<uint64_t>; this is much better and can scale to multiple threads reading the same value in parallel without contending with each other.)
(ISO C++ gets this right: std::atomic<T> has load and store methods, as well as exchange, fetch_add, etc. But ISO C++ defines literally nothing about what happens with unsynchronized read+write or write+write of a plain non-atomic object. A memory-safe language like C# has to define more.)
Inter-thread latency
Is it possible that the Volatile.Write has some hidden disadvantage, like updating the memory "less instantaneously" (if this makes any sense) than the Interlocked.Exchange?
I wouldn't expect any difference. Extra memory ordering just makes later stuff in the current thread wait until after a store commits to L1d cache. It doesn't make that happen any sooner, since CPUs already do that as fast as they can. (To make room in the store buffer for later stores.) See Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? for more.
Certainly not on x86; IDK if things could be any different on weakly-ordered ISAs where a relaxed atomic RMW could load+store without waiting for the store buffer to drain, and might "jump the queue". But Interlocked.Exchange doesn't do a relaxed RMW; it's more like C++ memory_order_seq_cst.
Examples in the question:
In the first example, with .Set() and .WaitOne() on a separate variable, that already provides sufficient synchronization that a plain non-atomic assignment to a double is guaranteed to be fully visible to that reader. Volatile.Write and Interlocked.Exchange would both be entirely pointless.
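That first case presumably looks something like this (a made-up reconstruction, not the question's actual code), where the event already does all the ordering work:

    using System.Threading;

    class Handoff
    {
        private double _payload;
        private readonly ManualResetEvent _done = new ManualResetEvent(false);

        public void Writer(double value)
        {
            _payload = value;   // plain, non-atomic assignment is enough here
            _done.Set();        // Set() publishes everything the writer did before it
        }

        public double Reader()
        {
            _done.WaitOne();    // WaitOne() synchronizes with the Set()
            return _payload;    // guaranteed to see the value stored before Set()
        }
    }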
For releasing a lock, yes you just want a pure store, especially on x86 where that doesn't take any barrier instructions. If you want to detect double-unlocking (unlocking an already-unlocked lock), load the spinlock variable first, before storing. (That can possibly miss double-unlocks, unlike an atomic exchange, but should be sufficient to find buggy usages unless they always only happen with tight timing between both unlockers.)
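A minimal spinlock sketch (an assumption about the lock's layout, not code from the question) showing that split: an atomic RMW to acquire, a pure release store to release, and an optional read first to catch most double-unlocks.

    using System;
    using System.Threading;

    class SpinLockSketch
    {
        private int _held;   // 0 = free, 1 = held

        public void Enter()
        {
            // Acquiring needs an atomic RMW so two threads can't both win.
            while (Interlocked.Exchange(ref _held, 1) != 0)
            {
                Thread.SpinWait(64);   // back off a bit while the lock is held
            }
        }

        public void Exit()
        {
            // Optional sanity check: a cheap read before the store catches most
            // double-unlock bugs without paying for an atomic RMW.
            if (Volatile.Read(ref _held) == 0)
                throw new InvalidOperationException("unlocking an already-unlocked lock");

            // A pure release store is all the unlock itself needs.
            Volatile.Write(ref _held, 0);
        }
    }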