Reasoning about IORef operation reordering in concurrent programs

Asked 2/2, 2014 at 3:40 Answered 2/2, 2014 at 12:14

The docs say:

In a concurrent program, IORef operations may appear out-of-order to another thread, depending on the memory model of the underlying processor architecture...The implementation is required to ensure that reordering of memory operations cannot cause type-correct code to go wrong. In particular, when inspecting the value read from an IORef, the memory writes that created that value must have occurred from the point of view of the current thread.

Which I'm not even entirely sure how to parse. Edward Yang says

In other words, “We give no guarantees about reordering, except that you will not have any type-safety violations.” ... the last sentence remarks that an IORef is not allowed to point to uninitialized memory

So... it won't break the whole haskell; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).

Things that seem clear to me from the documentation

within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered

What isn't clear to me

Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?

What's the straightforward mental model I should have? Is it something like:

within and from the point of view of an individual thread, the ordering of IORef operations will appear sane and sequential; but the compiler may actually reorder operations in such a way that break certain assumptions in a concurrent system; however when a thread does atomicModifyIORef, no threads will observe operations on that IORef that appeared above the atomicModifyIORef to happen after, and vice versa.

...? If not, what's the corrected version of the above?

If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.

Terribly answered 2/2, 2014 at 3:40 Comment(1)

I think all reasonable memory models will guarantee the operations executed in the same thread to observe stuff from the same thread in program order. So readIORef v will always return True, if there are no other threads modifying the same IORef concurrently. – Moiety 2/2, 2014 at 12:49

I don't know Haskell concurrency, but I know something about memory models.

Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.

Typically (one of the simplest models), you want two types of ordered instructions: ordered load that does not go ahead of any other ordered loads or stores, and ordered store that does not go ahead of any instructions at all, and a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about IRIW kind of problem:

Thread 1: x=1

Thread 2: y=1

Thread 3: r1=x;
          r2=y;

Thread 4: r4=y;
          r3=x;

If all of these operations are ordered loads and ordered stores, then you can conclude the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, ordered stores in Threads 1 and 2 should appear in some order to all threads, and r1=1,r2=0 is witness that y=1 is executed after x=1. In its turn, this means that Thread 4 can never observe r4=1 without observing r3=1 (which is executed after r4=1) (if the ordered stores happen to be executed that way, observing y==1 implies x==1).

Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments to appear even in different orders: one might see x=1 appear before y=1, the other might see y=1 appear before x=1, so any combination of values r1,r2,r3,r4 is permitted.

This is sufficiently implemented like so:

ordered load:

load x
load-load  -- barriers stopping other loads to go ahead of preceding loads
load-store -- no one is allowed to go ahead of ordered load

ordered store:

load-store
store-store -- ordered store must appear after all stores
            -- preceding it in program order - serialize all stores
            -- (flush write buffers)
store x,v
store-load -- ordered loads must not go ahead of ordered store
           -- preceding them in program order

Of these two, I can see IORef implements a ordered store (atomicWriteIORef), but I don't see a ordered load (atomicReadIORef), without which you cannot reason about IRIW problem above. This is not a problem, if your target platform is x86, because all loads will be executed in program order on that platform, and stores never go ahead of loads (in effect, all loads are ordered loads).

A atomic update (atomicModifyIORef) seems to me a implementation of a so-called CAS loop (compare-and-set loop, which does not stop until a value is atomically set to b, if its value is a). You can see the atomic modify operation as a fusion of a ordered load and ordered store, with all those barriers there, and executed atomically - no processor is allowed to insert a modification instruction between load and store of a CAS.

Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before atomicWriteIORef - so, seeing z==vz, you can conclude at this moment x==vx and y==vy, and you can conclude IORef t was loaded after stores to x, y, z can be observed by other threads. This latter point requires readIORef to be a ordered load, which is not provided as far as I can tell, but it will work like a ordered load on x86.

Typically you don't use concrete values of x, y, z, when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, like in IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4), if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: this means something is mutually excluded without acquiring any mutex or lock.

An example (not in Haskell) that will not work if loads are not ordered, or ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).

Suppose, z will show either x until y is computed, or y, if x has been computed, too. Don't ask why, it is not very easy to see outside the context - it is a kind of a queue - just enjoy what sort of reasoning is possible.

Thread 1: x=1;
          if (z==0) compareAndSet(z, 0, y == 0? x: y);

Thread 2: y=2;
          if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));

So, two threads set x and y, then set z to x or y, depending on whether y or x were computed, too. Assuming initially all are 0. Translating into loads and stores:

Thread 1: store x,1
          load z
          if ==0 then
            load y
            if == 0 then load x -- if loaded y is still 0, load x into tmp
            else load y -- otherwise, load y into tmp
            CAS z, 0, tmp -- CAS whatever was loaded in the previous if-statement
                          -- the CAS may fail, but see explanation

Thread 2: store y,2
          load x
          if !=0 then
          loop: load z -- into tmp
                load y
                if !=tmp then -- compare loaded y to tmp
                  CAS z, tmp, y  -- attempt to CAS z: if it is still tmp, set to y
                  if ! then goto loop -- if CAS did not succeed, go to loop

If Thread 1 load z is not a ordered load, then it will be allowed to go ahead of a ordered store (store x). It means wherever z is loaded to (a register, cache line, stack,...), the value is such that existed before the value of x can be visible. Looking at that value is useless - you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed - otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important as will become clear below.

If Thread 2 load x or load z are not ordered loads, they may go ahead of store y, and will observe the values that were written before y is visible to other threads.

However, see that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending z. For example, if Thread 2 observes x==0, there is guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.

If Thread 1 observes z==0, then it should try to set z to x or y. So, first it will check if y has been set already. If it wasn't set, it will be set in the future, so try to set to x - CAS may fail, but only if Thread 2 concurrently set z to y, so no need to retry. Similarly there is no need to retry if Thread 1 observed y has been set: if CAS fails, then it has been set by Thread 2 to y. Thus we can see Thread 1 sets z to x or y in accordance with the requirement, and does not contend z too much.

On the other hand, Thread 2 can check if x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then need to set z to y. Here we do need a CAS loop, because a single CAS may fail, if Thread 1 is attempting to set z to x or y concurrently.

The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them _will_take_in_the_future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a cond_var z", so it is not only a matter of loading something from memory)

Moiety answered 2/2, 2014 at 12:14 Comment(5)

Thanks, a lot. Ah, so an atomicModifyIORef x acts as a barrier for all IORef operations, not just operations on IORef x? And can you help me understand your point about readIORef t in the second to last paragraph: That operation has to be last because of the atomicWriteIORef correct? If so what distinction are you making about readIORef needing to be an ordered load? – Terribly 2/2, 2014 at 16:20

@Terribly Correct. readIORef t after atomiWriteIORef may or may not be a ordered read. A memory model distinguishes between ordered and unordered reads, so a unordered read can go ahead of ordered stores (it would be possible to reorder readIORef t and atomicWriteIORef z vz at compile time). However, since the other commenters remark that Haskell does not have a memory model, the compiler will not attempt to reorder them, and the processor will not do it, because atomicWriteIORef will have the barriers forbidding that. I'll add more. – Moiety 2/2, 2014 at 18:15

@SassaNF: there's one problem with the "in effect, all loads are ordered loads" line of reasoning. GHC itself will reorder loads, although not across an atomic*IORef. Which is a bit of a sticky wicket, because to simulate an ordered load it's necessary to use an atomicModifyIORef, which is essentially a full barrier. – Sterling 2/2, 2014 at 18:39

@JohnL thanks. I didn't know that, I only assumed they must be ordered loads, because requiring full barriers to enforce ordered loads is very very expensive. It is a very common operation and extremely cheap in other languages, pity it cannot be ordered cheaply in Haskell. – Moiety 2/2, 2014 at 20:3

@SassaNF: the full story is a bit complicated, I'd highly recommend Edward Yang's post (linked from the question) for it. But it's a common viewpoint that these sorts of operations shouldn't be exposed to the Haskell programmer at all, as it's simpler to reason about TVars and MVars. I don't think there's any technical reason it can't be done, just that nobody has put forth the requisite effort to allow for it. – Sterling 2/2, 2014 at 20:53

within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.

"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").

A readIORef x might be moved before writeIORef y, at least when there are no data dependencies

True

Logically I don't see how something like readIORef x >>= writeIORef y could be reordered

Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.

For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).

In the myprint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:

writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef

Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myref/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them).

I think your mental model is correct, however it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.

Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):

[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]

So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.

Sterling answered 2/2, 2014 at 7:21 Comment(2)

Thanks. So with the myPrint example with seq, you're saying that of course the compiler might now move x <- readIORef myRef as well as x seq` readIORef yourRef` above the writeIORef myRef? – Terribly 2/2, 2014 at 16:6

@jberryman: I've edited to address this, but in a nutshell yes, the read could theoretically be moved before the seq. – Sterling 2/2, 2014 at 18:32

http://en.wikipedia.org/wiki/Memory_ordering for non atomic concurrent reads and writes. (basically when you dont use atomics, just look at the memory ordering model for your target CPU)

Currently ghc can be regarded as not reordering your reads and writes for non atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non atomic operations will have the ordering semantics of the underlying CPU model, as I link to above.

In other words, Currently GHC has no formal concurrency memory model, and because any optimization algorithms tend to be wrt some model of equivalence, theres no reordering currently in play there.

that is: the only semantic model you can have right now is "the way its implemented"

shoot me an email! I'm working on some patching up atomics for 7.10, lets try to cook up some semantics!

Edit: some folks who understand this problem better than me chimed in on ghc-users thread here http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html . Assume that i'm wrong in both this comment and anything i said in the ghc-users thread :)

Manado answered 2/2, 2014 at 7:5 Comment(2)

Are you sure GHC doesn't reorder reads? Edward seemed pretty clear that's a large part of what CmmSink does. – Sterling 2/2, 2014 at 7:23

@JohnL, you may be absolutely correct. Edward Yang did talk about this a wee bit on the mailing and I'm quite likely somewhat wrong. – Manado 2/2, 2014 at 15:16

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags