Memory barrier vs Interlocked: impact on memory cache coherency timing

Simplified question:

Is there a difference in the timing of memory cache coherency (or "flushing") caused by Interlocked operations compared to memory barriers? Let's consider C#: any Interlocked operation vs Thread.MemoryBarrier(). I believe there is a difference.
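
For concreteness, here is a minimal sketch of the two constructs being compared (the class and field names are just illustrative):

    using System.Threading;

    class FreshnessSketch
    {
        private int _flag; // illustrative shared field

        public void PublishWithInterlocked()
        {
            // Atomic read-modify-write; in .NET this also implies
            // a full memory barrier.
            Interlocked.Exchange(ref _flag, 1);
        }

        public void PublishWithBarrier()
        {
            _flag = 1;
            // Orders surrounding reads/writes relative to the barrier -
            // but does it also force the store to become visible "now"?
            Thread.MemoryBarrier();
        }
    }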

Background:

I have read quite a lot of information about memory barriers - all of it about how they prevent specific types of memory-instruction reordering - but I couldn't find consistent info on whether they should cause immediate flushing of read/write queues.

I actually found a few sources mentioning that there is NO guarantee on the immediacy of the operation (only the prevention of specific reordering is guaranteed). E.g.:

Wikipedia: "However, to be clear, it does not mean any operations WILL have completed by the time the barrier completes; only the ORDERING of the completion of operations (when they do complete) is guaranteed"

Freebsd.org (barriers are HW specific, so I guess a specific OS doesn't matter): "memory barriers simply determine relative order of memory operations; they do not make any guarantee about timing of memory operations"

On the other hand, Interlocked operations - by their definition - cause the memory subsystem to lock the entire cache line holding the value, preventing access (including reads) from any other CPU/core until the operation is done.

Am I correct or am I mistaken?

Disclaimer:

This is an evolution of my original question here: Variable freshness guarantee in .NET (volatile vs. volatile read)

EDIT1: Fixed my statement about Interlocked operations - inline in the text.

EDIT2: Completely removed the demonstration code + its discussion (as some complained about too much information)

Reilly answered 13/7, 2014 at 20:48 Comment(10)
"On the other hand Interlocked operations - from their definition - causes immediate flushing of all memory buffers to guarantee the most recent value of variable was updated" - which definition? As far as I know, the only guarantee is that the operation will be atomic.Cherey
@Cherey That's a fair point! My statement about Interlocked was not correct - I edited my question and attempted to fix it. Basically, Interlocked operations require exclusive access to the entire cache line (effectively preventing any possible stale reads); however - to my knowledge - this is not true of (any type of) memory barrier or volatile variables.Reilly
Way too much information. If you want a good answer use this: stackoverflow.com/questions/how-to-askLeaves
It's a fair point insofar as the documentation doesn't state anything (really a big omission for something like this). The Win32 equivalents do - unnecessarily - create full memory barriers instead of the more reasonable acquire/release semantics.Disrate
@Leaves I completely removed the code sample + discussion. Please do let me know if this feels sufficient to revert the downvote (if it was yours). Please keep in mind that this topic is really complicated - so it needs some references. ThanksReilly
@Downvoter Please do let me know if you feel there is further improvement needed to the question. This is quite a complicated topic - I did quite a lot of research on this (and based the question on that research) and couldn't find any answer, and I guess that many people would benefit from knowing the expected behavior. Therefore I want my question to draw some attention (which a downvoted question obviously does not)Reilly
I believe that barriers are much faster than contended RMW operations because no cache coherency traffic is needed to execute a barrier. (I can't prove this)Accentor
@Lazin I cannot imagine how barriers would be able to satisfy ordering requirements without any cache coherency. It's true that they might not force it immediately (compared to interlocked operations) - but that would only mean that interlocked operations have a guarantee of faster state-change propagation (as stated in the question)Reilly
mechanical-sympathy.blogspot.ru/2011/07/… "A store barrier, “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued." This article states that fences are local to the core.Accentor
A memory fence doesn't create any cache coherency traffic, but stores and loads do.Accentor

To understand C# interlocked operations, you need to understand Win32 interlocked operations.

The "pure" interlocked operations themselves only affect the freshness of the data directly referenced by the operation.

But in Win32, interlocked operations used to imply a full memory barrier. I believe this is mostly to avoid breaking old programs on newer hardware. So InterlockedAdd does two things: an interlocked add (very cheap, does not affect caches) and a full memory barrier (a rather heavy op).

Later, Microsoft realized this is expensive, and added versions of each operation that do no memory barrier, or only a partial one.

So there are now (in the Win32 world) four versions of almost everything: e.g. InterlockedAdd (full fence), InterlockedAddAcquire (read fence), InterlockedAddRelease (write fence), and the pure InterlockedAddNoFence (no fence).

In the C# world there is only one version, and it matches the "classic" InterlockedAdd - the one that also does a full memory fence.
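
A minimal C# sketch of that mapping, assuming only standard .NET APIs (the class and field names are illustrative): Interlocked exposes just the full-fence flavor, while acquire/release-only accesses live in Volatile.

    using System.Threading;

    class FenceFlavors
    {
        private long _value;

        public long AddFullFence(long amount)
        {
            // Matches the classic Win32 InterlockedAdd: atomic RMW
            // plus a full memory fence.
            return Interlocked.Add(ref _value, amount);
        }

        public long ReadAcquire()
        {
            // Acquire-only semantics, comparable to the Win32
            // "Acquire" flavors; no RMW, no full fence.
            return Volatile.Read(ref _value);
        }

        public void WriteRelease(long value)
        {
            // Release-only semantics, comparable to the "Release" flavors.
            Volatile.Write(ref _value, value);
        }
    }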

Erasme answered 10/9, 2015 at 7:47 Comment(6)
Thank you - that's very valuable info. Do you have any reference (online preferred, but a book is also fine) to back this up? I'd like to read a bit more about this.Reilly
The best reference for Win32 is of course MSDN: msdn.microsoft.com/en-us/library/windows/desktop/…Erasme
Thanks! Looks like the 'Acquire' and 'Release' versions are supported only on Itanium processors - which is why this is not surfaced at all to .NET. The basic Interlocked operations (those surfaced to .NET) are based on interlocked processor instructions - and those currently always perform a full barrier (by taking a lock on the memory bus + invalidating the cache lines)...Reilly
... Technically there might be some solution in the future which flushes only the variable, but all high-level synchronization constructs (Mutex, Semaphore, Lock, etc.) would be broken by that - as all of those are based on interlocked operations - as mentioned in my answer above #24727404Reilly
The API is available for all processors; the intrinsics are CPU- and compiler-version dependent. My link was to an old VS; to see which intrinsics are supported, switch to the current VS version. E.g. it seems a lot of release/acquire/no-fence intrinsics are supported by ARM: msdn.microsoft.com/en-us/library/ttk2z1ws(v=vs.140).aspx On Intel the no-fence version might be equal to the full version.Erasme
Thanks Michael! All the info you provided here is very helpful! Can you elaborate a bit more on the timing of interlocked vs memory barriers? If you update your answer so that it actually answers my question above, then I'll mark your answer.Reilly

Short answer: CAS (Interlocked) operations have been (and most likely will remain) the quickest cache flushers.

Background:

- CAS operations are supported in HW by a single uninterruptible instruction. Compare that with a thread calling a memory barrier, which can be swapped out right after placing the barrier but just before performing any reads/writes (so the consistency guaranteed by the barrier is still met).
- CAS operations are the foundation of the majority of (if not all) high-level synchronization constructs (mutexes, semaphores, locks - look at their implementations and you will find CAS operations; see the sketch below). They would not likely be used this way if they didn't guarantee immediate cross-thread state consistency, or if there were another, faster mechanism.
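
A minimal sketch of that second point, assuming nothing beyond Interlocked.CompareExchange: a toy spinlock built directly on CAS (purely illustrative, not .NET's actual SpinLock implementation):

    using System.Threading;

    class CasSpinLock
    {
        private int _held; // 0 = free, 1 = held

        public void Enter()
        {
            // CAS: atomically set _held to 1 only if it is currently 0.
            // The interlocked operation makes acquisition atomic and
            // immediately visible to other cores.
            while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            {
                Thread.SpinWait(1); // brief back-off before retrying
            }
        }

        public void Exit()
        {
            // An interlocked write releases the lock and publishes the
            // writes made inside the critical section.
            Interlocked.Exchange(ref _held, 0);
        }
    }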

Reilly answered 18/11, 2014 at 19:54 Comment(0)

At least on Intel devices, a number of machine-code operations can be given a LOCK prefix, which ensures that the following operation is treated as atomic, even if the underlying data type won't fit on the data bus in one go. For example, LOCK REPNE SCASB will scan a string of bytes for a terminating zero and won't be interrupted by other threads. As far as I am aware, the Memory Barrier construct is basically a CAS-based spinlock that causes a thread to wait for some condition to be met, such as no other threads having any work to do. That is clearly a higher-level construct, but make no mistake: there's a condition check in there, it's likely to be atomic and CAS-protected, and you're still going to pay the cache-line price when you reach a memory barrier.
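
For reference, this LOCK prefix is what the .NET JIT typically emits for Interlocked calls on x86/x64 (a sketch; the lowering described in the comment is the typical one, not an API guarantee):

    using System.Threading;

    class LockPrefixSketch
    {
        private int _counter;

        public int Increment()
        {
            // On x86/x64 this typically compiles down to a single
            // LOCK XADD instruction: the LOCK prefix makes the
            // read-modify-write atomic and acts as a full memory barrier.
            return Interlocked.Increment(ref _counter);
        }
    }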

Jillene answered 13/4, 2015 at 13:53 Comment(0)
