InterlockedExchange and memory visibility
I have read the article Synchronization and Multiprocessor Issues and I have a question about InterlockedCompareExchange and InterlockedExchange. The question is actually about the last example in the article. They have two variables iValue and fValueHasBeenComputed and in CacheComputedValue() they modify each of them using InterlockedExchange:

InterlockedExchange ((LONG*)&iValue, (LONG)ComputeValue());  // don't understand
InterlockedExchange ((LONG*)&fValueHasBeenComputed, TRUE); // understand

I understand that I can use InterlockedExchange for modifying iValue, but is it enough just to do

iValue = ComputeValue();

So is it actually necessary to use InterlockedExchange to set iValue, or will other threads see iValue correctly even with a plain iValue = ComputeValue();? I mean, will the other threads see iValue correctly because there is an InterlockedExchange after it?
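For illustration, the pattern I have in mind can be sketched in portable C++11; this is an assumption of mine, not the article's code — std::atomic stands in for the Win32 interlocked call, the names mirror the MSDN example, and ComputeValue is a placeholder:

```cpp
#include <atomic>

// Names mirror the MSDN example; std::atomic is a stand-in for the
// Win32 interlocked APIs (an assumption, for illustration only).
int iValue = 0;                             // plain (non-interlocked) variable
std::atomic<int> fValueHasBeenComputed{0};  // only the flag is atomic

int ComputeValue() { return 42; }           // placeholder computation

void CacheComputedValue() {
    iValue = ComputeValue();            // plain store
    fValueHasBeenComputed.exchange(1);  // full barrier: the store above
                                        // cannot be reordered past this
}

bool FetchComputedValue(int* out) {
    if (fValueHasBeenComputed.load() != 0) {  // observing the flag...
        *out = iValue;  // ...guarantees the plain store is visible too
        return true;
    }
    return false;
}
```

If this ordering argument holds, the first InterlockedExchange in the MSDN version would only be needed for atomicity of the store itself (e.g. if iValue could be misaligned), not for visibility.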

There is also the paper A Principle-Based Sequential Memory Model for Microsoft Native Code Platforms. Its example 3.1.1 contains more or less the same code. One of the recommendations is "Make y interlocked". Notice: not both y and x.

Update
Just to clarify the question. The issue is that I see a contradiction. The example from "Synchronization and Multiprocessor Issues" uses two InterlockedExchange calls. On the contrary, in example 3.1.1 "Basic Reordering" (which I think is quite similar to the first example) Herb Sutter gives this recommendation:

"Make y interlocked: If y is interlocked, then there is no race on y because it is atomically updatable, and there is no race on x because a -> b -> d."

In this draft Herb does not use two interlocked variables (if I am right, he means use InterlockedExchange only for y).
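Sutter's example might be sketched like this in portable C++ (my assumed translation, not his code: std::atomic replaces the interlocked operation, and the letters mark his a -> b -> d chain):

```cpp
#include <atomic>

int x = 0;              // plain variable, written only before y is set
std::atomic<int> y{0};  // y is the only interlocked/atomic variable

void writer() {
    x = 42;             // (a) plain store
    y.store(1);         // (b) atomic store: (a) cannot move past it
}

int reader() {
    while (y.load() == 0)  // (c) spin until (b) is observed
        ;
    return x;              // (d) a -> b -> c -> d, so this must see 42
}
```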

Draff answered 7/10, 2011 at 9:41 Comment(14)
I'd say you're right, only accesses to fValueHasBeenComputed need to be interlocked. Why they made it like this, I don't know. – Grammar
Yeah, it's necessary. InterlockedExchange promises a full memory barrier, so you can be sure that both variables are visible. But it does not guarantee in which order the pending buffer writes become visible. You are still screwed if the fValueHasBeenComputed update is flushed first. – Supposititious
@HansPassant: Wrong. Quoting the linked document: "A full memory barrier ensures that memory read and write operations that appear before the memory barrier instruction are committed to memory before any memory read and write operations that appear after the memory barrier instruction." In fact, the barriers are there to ensure ordering, not visibility. – Grammar
@Grammar: you are not protected in the case when other instructions occur in another thread between these two. – Geum
@Oleg: OMG, protected against what? What do "instructions in another thread" have to do with this? – Grammar
@Grammar: Ever heard of race conditions? – Geum
@Oleg: writes to well-aligned memory locations are atomic on x86. Even without an interlocked exchange, iValue would be updated atomically (unless your code does something horrible to break alignment). So instructions in another thread have nothing to do with it. Moreover, even when both variables are updated with interlocked operations, there is still an interval in between them where instructions in another thread will see the new value of iValue and the old value of fValueHasBeenComputed. So the race condition is still there with interlocked. – Ontine
@Oleg: Yes. The code (in either modification) does nothing against races between two calls of CacheComputedValue(). However, races between Cache... and Fetch... are equally impossible in both. Arguing for InterlockedExchange (especially if you don't use its return value) on grounds of a race condition is lame. – Grammar
@jalf: Ever heard of multicore systems? – Geum
@Grammar: I believe jalf explained this once again. Did you get it now? The order of operations is guaranteed; you are not guaranteed that the operations are executed immediately one after another. – Geum
@Oleg: yes, I have. Rather than asking whether we have heard of a seemingly endless stream of unrelated words, perhaps you could explain what it is about them that you consider relevant. – Ontine
@Oleg: you are never guaranteed that two operations are executed immediately after one another (depends on what you mean by immediately, obviously). Interlocked... doesn't solve this either. – Grammar
No matter how many CPUs or threads you have, writing a LONG on x86 (using Win32's definition of LONG) will be atomic, assuming your data is well aligned. And no matter how many interlocked operations you use, there will be a gap between iValue and fValueHasBeenComputed getting updated, so there will be a potential race condition there if other code assumes that the two variables will be updated as a single atomic operation. – Ontine
@jpalecek: OK, just imagine that another piece of code checks fValueHasBeenComputed and gets false even though iValue was already updated. This is obviously not solved by InterlockedExchange, but in my initial comment I only wanted to say that Hans Passant was correct about this. – Geum

They did that to prevent partial reads/writes if the address of iValue is not aligned to an address that guarantees atomic access. This problem would arise when two or more physical threads try to write the value concurrently, or when one reads while another tries to write at the same time.

As a secondary point, it should be noted that stores are not always globally visible; they only become visible when serialized, either by a fence or by a bus lock.
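The misalignment this answer warns about can be demonstrated with a packed struct. This is a contrived illustration of mine, not code from the article:

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
struct PackedRecord {
    char tag;       // a one-byte field pushes the next member off alignment
    int32_t value;  // ends up at offset 1: plain access to it may be torn,
                    // and interlocked operations require natural alignment
};
#pragma pack(pop)

// offsetof(PackedRecord, value) == 1, i.e. the 4-byte field does not sit
// on a 4-byte boundary, so the atomicity guarantee for aligned stores no
// longer applies to it.
```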

Deryl answered 7/10, 2011 at 11:10 Comment(6)
How, in this particular example, can iValue be read partially modified even if iValue is not aligned to an address that guarantees atomic access? I mean that FetchComputedValue() reads iValue only after fValueHasBeenComputed has been set to TRUE. And at that moment iValue must be fully modified even if its modification was not atomic. – Draff
@skwllsp: although that may be partially true for x86, what happens if the thread is interrupted between the check and the read? Other threads are still free to modify the value, so when the interrupted thread receives control again, it can still read a partial value. The same applies to instructions: just because read2 directly follows read1 doesn't mean there can't be a write from a separate HW thread in between. Such are the perils of multi-core/threaded programming; you can never assume things, which is why early lockless algorithms fell prey to the ABA problem. – Deryl
In the Microsoft example there are no other threads. So no other thread can change the two variables. – Draff
I have found a discussion about the same topic here: social.msdn.microsoft.com/Forums/pl-PL/parallelcppnative/thread/…. The summary: "On Windows platforms, InterlockedXxx functions have full fence semantics". So it means that only one InterlockedExchange is necessary (the second one). It guarantees that all modifications before the InterlockedExchange are correctly seen by other threads. – Draff
@skwllsp: the MS example is meant to apply to systems with two or more threads of execution. Yes, the barrier from an interlocked operation stops reordering and serializes all writes (as I mentioned in my answer, they use interlocked operations, aka bus locks, to globalize writes), but it still does not prevent an interruption between the two writes, and you need both interlocked operations to make the writes atomic and visible without MFENCE, as they plainly state in the article under 'Fixing a Race Condition'. That other article also plainly states that not every system fences interlocks implicitly. – Deryl
I think I don't understand your point about interlocked without MFENCE. Why are you mentioning "without MFENCE"? In my situation I use Windows on x86/x64, and Microsoft here msdn.microsoft.com/en-us/library/windows/desktop/… says "this function generates a full memory barrier (or fence) to ensure that memory operations are completed in order." – Draff

You get an atomic operation with InterlockedExchange. Why do you need it? Because InterlockedExchange does two things:

  1. Replaces the value of a variable
  2. Returns the old value

If you do the same thing in two operations (that is, first check the value, then replace it), you can get screwed if other instructions (on another thread) occur between those two.

You also prevent data races on this value. Here you get a good explanation of why a read/write on a LONG is not atomic.
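The point about the returned old value can be sketched with a once-flag. This is a portable C++ analogue with names I made up; exchange plays the role of InterlockedExchange:

```cpp
#include <atomic>

std::atomic<int> fInitialized{0};

// Racy version: the check and the set are two separate steps, so two
// threads can both observe 0 and both believe they won.
bool TryInitRacy() {
    if (fInitialized.load() == 0) {  // step 1: check
        fInitialized.store(1);       // step 2: set (another thread may
        return true;                 //          run between the steps)
    }
    return false;
}

// Atomic version: exchange replaces the value and hands back the old one
// as a single operation, so exactly one caller can ever see 0.
bool TryInitAtomic() {
    return fInitialized.exchange(1) == 0;
}
```

The racy version is exactly the "do the same thing in two operations" pattern described above; the atomic version is why the returned old value matters.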

Geum answered 7/10, 2011 at 11:10 Comment(10)
All the InterlockedExchange examples in the OP's code do is #1. The returned value is discarded. – Ontine
@jalf: Did you take a look at the link? – Geum
you've got a strange way of discussing. Yes, I have taken a look at the link; and yes, I have heard of race conditions; and yes, I have heard of multicore machines. SO WHAT? Neither of those is relevant. On x86 machines, memory writes of aligned variables (at least if they're no wider than 32 bits) are atomic. And LONG in the Win32 API is 32 bits wide. Therefore, writing an object of type LONG to a well-aligned address is going to be atomic. InterlockedExchange also gives us access to the old value, but the OP doesn't use that. – Ontine
@jalf: Hmm, maybe I was not polite. Sorry, just a hard day. Memory writes are atomic in the sense that you never get a half-changed variable. On the other hand, you are not guaranteed that your updated variable is immediately synchronized back from the CPU cache, so on a multicore system you can reach a situation where the value was updated by one core but the value stored in another core's cache (and in memory) is different. So you have to use a memory barrier (fence) to verify the value was flushed before the next access, and this is basically what an atomic operation implementation does. – Geum
Oleg is right. Relying on proper alignment, the size of a defined type, and the behavior of the underlying hardware is absolutely terrible code that will someday be somebody's debugging headache. +1 – Downward
@Carey: It is impossible to write code without depending on those things. It's pretty fundamental to multithreaded code that you are going to rely on the properties of the underlying hardware. Unless you propose to put a memory barrier between every line of code, that is. – Ontine
@Oleg: most common CPU architectures (x86 included) have cache coherence between cores, so you will see the value updated on all cores or not at all. And looking at the code, the first value is not intended to be used until the second one has also been updated. A single memory barrier will flush both writes, and there is no need for the first one to be flushed before the second has been updated. – Ontine
@jalf: I thoroughly disagree. Unless you're writing device drivers you shouldn't be assuming much of anything about hardware, and you certainly shouldn't be relying on a whole set of assumptions that may or may not be true with another platform, compiler, data item size, or OS. In a multithreaded app, operations on shared objects that need to be atomic require mutex protection from either the OS or the compiler. Relying on quirks of the hardware is very poor practice. – Downward
@jalf, I think the issue is what happens if your code is recompiled in ten years on a different architecture. If you've used InterlockedExchange, it will still work. If not, all bets are off. – Expunge
@Harry, what happens if your code is recompiled in ten years on a different architecture is not an issue. The issue is that I see a contradiction. The example from "Synchronization and Multiprocessor Issues" uses two InterlockedExchange calls. On the contrary, in example 3.1.1 "Basic Reordering" (which I think is quite similar to the first example) Herb Sutter gives this recommendation: "Make y interlocked: If y is interlocked, then there is no race on y because it is atomically updatable, and there is no race on x because a -> b -> d." In this draft Herb does not use two interlocked variables! – Draff

There are two plausible resolutions to the contradiction you've observed.

One is that the second document is simply wrong in that particular respect. It is, after all, a draft. I note that the example you refer to specifically states that the programmer cannot rely on the writes to be atomic, which means that both writes must indeed be interlocked.

The other is that the additional interlock might not actually be required in that particular example, because it is a very special case: only a single bit of the variable is being changed. However, the specification being developed doesn't appear to mention this as a premise, so I doubt that this is intentional.

Expunge answered 8/10, 2011 at 22:44 Comment(0)

I think this discussion has the answer to the question: Implicit Memory Barriers.

Question: does calling InterlockedExchange (implicit full fence) on T1 and T2 guarantee that T2 will "see" the writes done by T1 before the fence (the A, B and C variables), even though those variables are not placed on the same cache line as Foo and Bar?

Answer: Yes -- the full fence generated by the InterlockedExchange will guarantee that the writes to A, B, and C are not reordered past the fence implicit in the InterlockedExchange call. This is the point of memory barrier semantics. They do not need to be on the same cache line.
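The quoted guarantee might be sketched like this (a portable C++ stand-in of mine; A, B, C and the flag are assumed names following the linked discussion):

```cpp
#include <atomic>

int A = 0, B = 0, C = 0;    // plain variables on arbitrary cache lines
std::atomic<int> ready{0};  // stand-in for the InterlockedExchange target

void t1() {
    A = 1; B = 2; C = 3;    // ordinary stores before the fence
    ready.exchange(1);      // full fence: the stores above cannot be
                            // reordered past this point
}

bool t2(int* sum) {
    if (ready.load() == 0)  // flag not seen yet: A, B, C may be stale
        return false;
    *sum = A + B + C;       // flag seen: all three writes are visible
    return true;
}
```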

Memory Barriers: a Hardware View for Software Hackers and Lockless Programming Considerations for Xbox 360 and Microsoft Windows are also interesting.

Draff answered 19/10, 2011 at 10:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.