Memory Model: preventing store-release and load-acquire reordering
Asked Answered
F

4

7

It is known that, unlike Java's volatiles, .NET's ones allow reordering of volatile writes with the following volatile reads from another location. When it is a problem MemoryBarier is recommended to be placed between them, or Interlocked.Exchange can be used instead of volatile write.

It works but MemoryBarier could be a performance killer when used in highly optimized lock-free code.

I thought about it a bit and came with an idea. I want somebody to tell me if I took the right way.

So, the idea is the following:

We want to prevent reordering between these two accesses:

 volatile1 write

 volatile2 read

From .NET MM we know that :

 1) writes to a variable cannot be reordered with  a  following read from 
    the same variable
 2) no volatile accesses can be eliminated
 3) no memory accesses can be reordered with a previous volatile read 

To prevent unwanted reordering between write and read we introduce a dummy volatile read from the variable we've just written to:

 A) volatile1 write
 B) volatile1 read [to a visible (accessible | potentially shared) location]
 C) volatile2 read

In such case B cannot be reordered with A as they both access the same variable, C cannot be reordered with B because two volatile reads cannot be reordered with each other, and transitively C cannot be reordered with A.

And the question:

Am I right? Can that dummy volatile read be used as a lightweight memory barrier for such scenario?

Fates answered 15/5, 2013 at 18:34 Comment(3)
Seems odd that you would be using volatile at all in "highly optimized lock-free code." Your volatile read is costing, what, a hundred or more cycles? So a volatile read costs about half as much as an uncontended lock. Possibly even more than that. My suggestion would be to re-think your design so as to avoid volatile. bluebytesoftware.com/blog/2010/12/04/SayonaraVolatile.aspxSquander
@JimMischel that's on ARM (your comment is valid there). On x86 it only hurts in so far that it prevents the compiler/JIT from reordering and eliminating accesses. It does not cause different instructions to be emitted.Rhombohedral
@Jim Mischel: My question is about load-acquire and store-release barriers. Volatile is just a C# way to enforce them. Btw, in 4.5 instead of volatile keyword we can use Volatile class's Read and Write methods.Fates
F
2

I forgot to post the soon found answer back to SO. Better late than never..

Turns out it is impossible thanks to how processors (at least x86-x64 kind of them) optimize memory accesses. I found the answer when was reading Intel manuals on its procs. Example 8-5:" Intra-Processor Forwarding is Allowed" was looking suspicious. Googling for "store buffer forwarding" lead to Joe Duffy's blog posts (first and second - read them pls).

To optimize writes processor uses store buffers (per processor queues of write ops). Buffering writes locally allows it to do next optimization: satisfying reads from the previously buffered writes to the same memory location and which haven't left the processor yet. The technique is called store-buffer forwarding (or store-to-load forwarding).

The end result in our case is that as reading at B is satisfied from a local storage (store buffer) it is not considered a volatile read and can be reordered with further volatile reads from another memory location (C).

It seems like a violation of the rule "Volatile reads don't reorder with each other". Yes, it is a violation, but very rare and exotic one. Why did it happen? Probably because Intel's released its first formal document on memory model of its processors years after .NET (and its JIT compiler) saw the sunlight.

So the answer is: no, the dummy reading (B) doesn't prevent reordering between A and C and cannot be used as a lightweight memory barrier.

Fates answered 24/9, 2015 at 12:7 Comment(1)
The data race Intel "allowed" was already a data race. If those queued store instructions happened to be issued a few cycles later, it would have been the same. That is why it is allowed and can still be called a strongly ordered memory model.Zermatt
A
3

Here I will use an arrow notation to conceptualize the memory barriers. I use an up arrow ↑ and a down arrow ↓ to represent volatile writes and reads respectively. Think of the arrow head as pushing away any other reads or writes. So no other memory access can move past the arrow head, but they can move past the tail.

Consider your first example. This is how it would be conceptualized.

↑          
volatile1 write  // A
volatile2 read   // B
↓

So clearly we can see that the read and the write are allowed to switch positions. You are correct.

Now consider your second example. You claimed that introducing a dummy read would prevent the write of A and the read of B from getting swapped.

↑          
volatile1 write  // A
volatile1 read   // A
↓
volatile2 read   // B
↓

We can see that B is prevented from floating up by the dummy read of A. We can also see that the read of A cannot float down because, by inference, that would be the same as B moving up before A. But, notice that we have no ↑ arrow that would prevent the write to A from floating down (remember it can still move past the tail of an arrow). So no, at least theoretically, injecting a dummy read of A will not prevent the original write of A and the read of B from getting swapped because the write to A is still allowed to move downward.

I had to really think about this scenario. One thing I pondered for a quite some time is whether the read and write to A are locked together in tandem. If so then that would prevent the write to A from moving down because it would have to take the read with it which we already said was prevented. So if you go with that school of thought then your solution might just work. But, I read the specification again and I see nothing special mentioned about volatile accesses to the same variable. Obviously, the thread has to execute in a manner that is logically consistent with the original program sequence (that is mentioned in the specification). But, I can visualize ways the compiler or hardware could optimize (or otherwise reorder) that tandem access of A and still get the same result. So, I simply have to side with caution here and assume that the write to A can move down. Remember, a volatile read does not mean "fresh read from main memory". The write to A could be cached in a register and then the read comes from that register delaying the actual write to a later time. Volatile semantics do not prevent that scenario as far as I know.

The correct solution would be to put a call to Thread.MemoryBarrier in between the accesses. You can see how this is conceptualized with the arrow notation.

↑          
volatile1 write       // A
↑
Thread.MemoryBarrier
↓
volatile2 read        // B
↓

Now you can see that the read is not allowed to float up and the write is not allowed to float down preventing the swap.


You can see some of my other memory barrier answers using this arrow notation here, here, and here just to name a few.

Apostrophe answered 13/7, 2013 at 14:18 Comment(3)
Brian, good explanation. I imagine teeth when I think of barriers :) But I don't completely agree with you. There are explicit barriers that compilers, runtimes and CPUs give to us (because they are aren't clever enough) and there implicit barriers enforced by them when they have information. Though these barriers work only within a single thread they are still barriers. For example, write and reads to the same variable can't be reordered to preserve correctness. From this point of view it should look like this: ↑ volatile1 write ↨ volatile1 read ↓ volatile2 read ↓Fates
@OmariO: You make a good point about the reordering of reads/write with respect to the same variable. I need to consider if it is worth mentioning in future answer. I'm afraid the salient points might get overlooked trying to hammer home a point that should be plainly obvious to begin with. Still, good point nonetheless.Apostrophe
Brian I unaccepted your answer because found the right one. Your answer is still useful for readers.Fates
F
2

I forgot to post the soon found answer back to SO. Better late than never..

Turns out it is impossible thanks to how processors (at least x86-x64 kind of them) optimize memory accesses. I found the answer when was reading Intel manuals on its procs. Example 8-5:" Intra-Processor Forwarding is Allowed" was looking suspicious. Googling for "store buffer forwarding" lead to Joe Duffy's blog posts (first and second - read them pls).

To optimize writes processor uses store buffers (per processor queues of write ops). Buffering writes locally allows it to do next optimization: satisfying reads from the previously buffered writes to the same memory location and which haven't left the processor yet. The technique is called store-buffer forwarding (or store-to-load forwarding).

The end result in our case is that as reading at B is satisfied from a local storage (store buffer) it is not considered a volatile read and can be reordered with further volatile reads from another memory location (C).

It seems like a violation of the rule "Volatile reads don't reorder with each other". Yes, it is a violation, but very rare and exotic one. Why did it happen? Probably because Intel's released its first formal document on memory model of its processors years after .NET (and its JIT compiler) saw the sunlight.

So the answer is: no, the dummy reading (B) doesn't prevent reordering between A and C and cannot be used as a lightweight memory barrier.

Fates answered 24/9, 2015 at 12:7 Comment(1)
The data race Intel "allowed" was already a data race. If those queued store instructions happened to be issued a few cycles later, it would have been the same. That is why it is allowed and can still be called a strongly ordered memory model.Zermatt
C
0

EDIT The conclusions I drew from the C# specs are wrong, see below. END EDIT

I surely am not someone 'authorized', but I think you haven't understood the memory model correctly.

Quoting the C# specification, section §10.10 Execution order, third bullet point on page 105:

The ordering of side effects is preserved with respect to volatile reads and writes.

Volatile reads and writes are defined as "side-effects" and this paragraph states that the ordering of side-effects is preserved.

So it is my understanding that your whole question is based on an incorrect assumption: volatile reads and writes cannot be reordered.

I think you got confused with that fact that with respect to non-volatile memory operations, volatile reads and writes are only half-fences.

EDIT this article: The C# Memory Model in Theory and Practice, Part 2 states exactly the opposite and supports your assertion that volatile reads can move up past an unrelated volatile write. The suggested solution is to introduce a MemoryBarrier where it matters.

Comment by Daniel below also says that the CLI spec is more specific about this than the C# spec and allows this reordering.

Now I find that the C# spec I quoted above is confusing! But given that on x86 the same instructions are used for a volatile memory access and a regular memory access, it makes perfect sense that they are subject to the same half-fence reordering issues. END EDIT

Conductivity answered 1/6, 2013 at 19:6 Comment(1)
Wrong. In C#, a volatile write followed by a volatile read can be reordered. The C# spec isn't very clear on this, but the CLI spec (ECMA-335) has more details: volatile only gives you acquire/release semantics, not the full sequential consistency offered by Java's volatile.Luciolucita
D
0

Let me disagree with the accepted answer from Brian Gideon.

OmariO your solution to the problem (dummy read) looks perfectly correct to me. As you mentioned correctly, writes to a variable cannot be reordered with a following read from the same variable. If that reordering was possible then it would make the code incorrect in a single-threaded case (the read operation could return not the same value that has been written by the previous write operation). That is it would violate the fundamental rule of any memory model: single-threaded execution of a program must not be logically changed.

Also guys, Brian and OmariO, please don't mix up memory operations with acquire/release semantics and acquire/release memory fences. E.g. a read-acquire operation is not the same as an acquire fence. They have similar semantics but the distinction between them is very important. The best explanation of that terms that I know is in the great blog of Jeff Preshing:
Acquire and Release Semantics
Acquire and Release Fences

Deckhouse answered 28/11, 2013 at 18:25 Comment(1)
Thanks for useful links.Fates

© 2022 - 2024 — McMap. All rights reserved.