Handling out-of-order execution

I recently stumbled upon this Wikipedia article. From my experience with multi-threading I am aware of the multitude of issues caused by the scheduler being able to switch between threads at any time. However, I never knew that compiler and hardware optimisations could reorder operations in a way that is guaranteed to work for a single thread, but not necessarily for multi-threading. Can anyone explain how to correctly deal with the possibility of reordered operations in a multi-threaded environment?

UPDATE: I originally linked to the Out-of-Order Execution article by accident; the Memory barrier article has a better explanation of the problem.

Hemorrhage answered 28/1, 2010 at 23:31 Comment(3)
This is another example of trouble with concurrent code: drdobbs.com/go-parallel/article/… and this is an interesting article (with examples on a 24-core CPU) on reducing contention and locks in a multithreaded program: drdobbs.com/hpc-high-performance-computing/212201163Crap
You explicitly talk about multi-threading, but this is an issue for any kind of concurrent code, e.g., code with interrupt handlers and code that talks to memory-mapped hardware peripherals.Miles
The short answer is that you probably don't need to worry. The only time that reordering is an issue is when two threads are accessing shared resources, and in those cases, you are already using a lock mechanism which includes a memory barrier. Perhaps there are compilers out there that don't properly support multiprocessing and therefore make stupid optimizations, but the only answer to that is to avoid them, or at least disable the bad tweaks. That's all.Rambouillet

I will address your question as one about multithreading in a high-level language, rather than discussing CPU pipeline optimization.

Can anyone explain how to correctly deal with the possibility of reordered operations in a multi-threaded environment?

Most, if not all, modern high-level multithreaded languages provide constructs for managing this potential for the compiler to reorder the logical execution of instructions. In C#, these include field-level constructs (the volatile modifier), block-level constructs (the lock keyword), and imperative constructs (Thread.MemoryBarrier).

Applying volatile to a field causes all reads and writes of that field to be executed in the same relative order in which they occur in the instruction sequence (source code).

Using lock around a block of code causes the enclosed instruction sequence to be executed in the same relative order in which it occurs in the parent block of code.

The Thread.MemoryBarrier method tells the compiler and CPU not to reorder memory accesses across this point in the instruction sequence. This enables a more advanced technique for specialized requirements.

The techniques above are described in order of increasing complexity and performance. As with all concurrency programming, determining when and where to apply these techniques is the challenge. When synchronizing access to a single field, the volatile keyword will work, but it could prove to be overkill. Sometimes you only need to synchronize writes (in which case a ReaderWriterLockSlim would accomplish the same thing with much better performance). Sometimes you need to manipulate the field multiple times in quick succession, or you must check a field and conditionally manipulate it. In these cases, the lock keyword is a better idea. Sometimes you have multiple threads manipulating shared state in a very loosely-synchronized model to improve performance (not typically recommended). In that case, carefully placed memory barriers can prevent stale and inconsistent data from being used in threads.
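For readers outside .NET, the three constructs above have rough C++11 analogues; the sketch below is illustrative (the names `publish`, `consume`, and `locked_increment` are invented, not from the answer), not a definitive mapping:

```cpp
#include <atomic>
#include <mutex>

// Roughly the "volatile field" idea: ordered accesses without a lock.
std::atomic<bool> ready{false};
int payload = 0;

// Roughly the "lock block" idea: mutual exclusion plus implied barriers.
std::mutex m;
int shared_counter = 0;

void locked_increment() {
    std::lock_guard<std::mutex> guard(m);
    ++shared_counter;
}

// Roughly Thread.MemoryBarrier(): an explicit full fence.
void publish(int value) {
    payload = value;                                   // write the data first
    std::atomic_thread_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);      // then raise the flag
}

int consume() {
    while (!ready.load(std::memory_order_relaxed)) { } // spin until published
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return payload;                                    // now sees the write above
}
```

As in the C# list, the fence version is the most delicate: it only works because both sides pair a fence with the flag accesses.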

Frig answered 4/2, 2010 at 19:56 Comment(0)

Let me ask a question: given program code (say, a single-threaded application), what counts as a correct execution? Intuitively, the CPU executing instructions in the order the code specifies would be correct. This illusion of sequential execution is what programmers rely on.

However, modern CPUs don't obey this restriction. Unless dependences are violated (data dependences, control dependences, and memory dependences), CPUs execute instructions out of order. However, this is completely hidden from programmers; they can never see what is going on inside the CPU.

Compilers exploit the same fact. As long as the program's semantics (i.e., the inherent dependencies in your code) are preserved, compilers will reorder any instructions they can to achieve better performance. One notable optimization is code hoisting: compilers may hoist load instructions to minimize memory latency. But don't worry, compilers guarantee correctness; they will not crash your program through such reordering, since they must at least preserve dependencies. (Still, compilers might have bugs :-)

If you're only considering a single-threaded application, you do not need to worry about such out-of-order execution, whether by compilers or by CPUs.

(To learn more, I recommend taking a look at the concept of ILP (instruction-level parallelism). Single-thread performance mostly depends on how much ILP can be extracted from a single thread, so both CPUs and compilers do whatever they can for better performance.)

However, when you consider multithreaded execution, a potential problem called the memory consistency problem arises. Intuitively, programmers expect sequential consistency, but modern multi-core architectures perform dirty, aggressive optimizations (e.g., caches and write buffers), and sequential consistency is hard to implement with low overhead on modern hardware. So very confusing situations can arise from out-of-order execution of memory loads and stores: you may observe that some loads and stores were executed out of order. Read some articles on relaxed memory models, such as Intel x86's memory model (read Chapter 8, Memory Ordering, of Volume 3A of the Intel 64 and IA-32 Architectures Software Developer's Manual). Memory barriers are needed in the situations where you must enforce the order of memory instructions for correctness.
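The store-to-load reordering that Chapter 8 allows is usually shown with the classic store-buffering litmus test. A hedged C++11 sketch (the function name is invented): with plain accesses, real hardware can let both threads read 0, but default sequentially consistent atomics forbid that outcome.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

// Store-buffering litmus test. With ordinary (non-atomic) variables,
// x86 store buffers may reorder each thread's store with its later load,
// so BOTH r1 and r2 can end up 0. With default (seq_cst) atomics, that
// outcome is forbidden: at least one load must observe the other store.
std::pair<int, int> store_buffer_test() {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] { x.store(1); r1 = y.load(); });
    std::thread t2([&] { y.store(1); r2 = x.load(); });
    t1.join();
    t2.join();
    return {r1, r2};
}
```

Run it in a loop with relaxed atomics instead and the (0, 0) outcome will eventually show up on common desktop hardware; that is exactly the confusion described above.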

THE ANSWER TO THE QUESTION: It's not easy to answer this question in short. There are no good tools that detect such problematic out-of-order behaviors arising from the memory consistency model (though there are research papers), so it is hard even to find such bugs in your code. However, I strongly suggest reading articles on double-checked locking and its detailed paper. With relaxed memory consistency and compiler reordering (note that compilers are not aware of multi-threaded behavior unless you explicitly tell them, e.g., with memory barriers), double-checked locking can misbehave.
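For concreteness, here is a sketch of a double-checked locking singleton that is correct under the C++11 memory model (the `Config` type and all names are invented for illustration); the acquire/release pair on the pointer is precisely what the broken pre-C++11 versions were missing.

```cpp
#include <atomic>
#include <mutex>

struct Config { int value = 7; };   // stand-in for an expensive-to-build object

std::atomic<Config*> instance{nullptr};
std::mutex init_mutex;

Config* get_config() {
    // First check: lock-free fast path. The acquire load ensures that if we
    // see a non-null pointer, we also see the fully constructed object.
    Config* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        // Second check: another thread may have initialized while we waited.
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Config();
            // Release store: publishes the construction above to acquire loads.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

Without the acquire/release ordering, the compiler or CPU could make the pointer visible before the object's fields, which is the misbehavior described in the double-checked locking papers.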

In sum:

  • If you're only working on a single-threaded program, then you don't need to worry about out-of-order behaviors.
  • On multi-core, you may need to consider memory consistency problems. But it's actually rare that you really need to worry about memory consistency issues; mostly, data races, deadlocks, and atomicity violations kill your multi-threaded program.
Cyclone answered 28/1, 2010 at 23:32 Comment(5)
You don't ever actually answer the question. You're just elaborating what the OP already stated in his question.Farthermost
@RBarryYoung, I elaborated some principles about that and also answered it: in a single thread you don't need to worry, but in a multithreaded program you may, due to relaxed memory consistency.Cyclone
I agree that it doesn't really answer the question, but it contains useful information so +1Hemorrhage
@Casebash, thanks. So, I added some answers :D Honestly speaking, I can't give a simple answer to such a big question. I wrote this long post (not an answer!) to give some background that is critical to understanding the problem. I don't think there is a simple answer to your question; it really depends on the situation.Cyclone
You're wrong; it is actually quite easy to know how code is being executed out of order. Look at the hardware specs. And it's even easier to tell when code is generating reordered instructions: look at the asm listing. It's much easier to say it is just too hard than it is to actually know what you're talking about.Baguio

It's not the compiler, it's the CPU. (Well, both actually, but the CPU is the harder to control.) Regardless of how your code gets compiled, the CPU will look ahead in the instruction stream and execute things out of order. Typically, for example, it will start a read early, since memory is slower than the CPU (i.e., start it early and hope the read is done before you actually need the result).

Both the CPU and the compiler optimize based on the same rule: reorder anything as long as it doesn't affect the results of the program, *assuming a single-threaded, single-processor environment*.

So there's the problem: they optimize for single-threadedness when your program isn't single-threaded. Why? Because otherwise everything would be 100x slower. Really. And most of your code is single-threaded (i.e., has only single-threaded interaction); only small parts need to interact in a multi-threaded way.

The best/easiest/safest way to control this is with locks - mutexes, semaphores, events, etc.

Only if you really, really, need to optimize (based on careful measurement), then you can look into memory barriers and atomic operations - these are the underlying instructions that are used to build mutexes etc, and when used correctly limit out-of-order execution.
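To illustrate how those underlying instructions build the mutexes mentioned above, here is a minimal spinlock in C++11 built from an atomic test-and-set; this is an illustrative sketch, not a production lock (no backoff, no fairness):

```cpp
#include <atomic>
#include <thread>

// A minimal spinlock: test_and_set is an atomic read-modify-write, and the
// acquire/release orderings are what "limit out-of-order execution" so that
// the critical section cannot leak outside the lock.
class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { flag.clear(std::memory_order_release); }
};

Spinlock spin;
int counter = 0;   // plain shared data, protected only by the spinlock

// Two threads each increment the shared counter per_thread times.
int run(int per_thread) {
    counter = 0;
    auto work = [per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            spin.lock();
            ++counter;
            spin.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

A real mutex adds blocking and fairness on top of exactly this kind of atomic primitive, which is why "just use locks" is the safe default above.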

But before doing that kind of optimization, check that the algorithms and code-flow are correct and whether you could further minimize multi-threaded interactions.

Haul answered 29/1, 2010 at 0:11 Comment(5)
I'm not worrying too much about performance ATM, just how to write code that correctly deals with these optimisations. I kind of just stumbled onto the memory barrier article.Hemorrhage
Then just use locks. Any variable that is shared between threads must be written AND read within a lock.Haul
I suppose a lock would prevent the compiler rearranging the operationsHemorrhage
Yes locks prevent compiler reordering as well. Typically not with much magic but just because any call to an opaque function (such as lock/unlock) forces the compiler to reload globals and avoid other optimizations because it doesn't know what that opaque function might change.Haul
Unless you run -O3 where gcc will break everything not marked with volatile ;)Extern

Let's be clear: out-of-order execution refers to the processor execution pipeline, not to the compiler per se, as your link clearly demonstrates.
Out-of-order execution is a strategy employed by most modern CPU pipelines that allows them to reorder instructions on the fly, typically to minimise read/write stalls, which are the most common bottleneck on modern hardware due to the disparity between CPU execution speed and memory latency (i.e., how fast the processor can fetch and process data compared to how fast it can write the result back to RAM).
So this is primarily a hardware feature, not a compiler feature.
You can override this feature if you know what you're doing, typically by use of memory barriers. PowerPC has a wonderfully named instruction called eieio (enforce in-order execution of I/O) that forces the CPU to flush all pending reads and writes to memory. This is particularly important in concurrent programming (whether multi-threaded or multi-processor), as it ensures that all CPUs or threads see a synchronised value for every memory location.
If you want to read about this in depth, then this PDF is an excellent (though detailed) introduction.
HTH

Mortician answered 29/1, 2010 at 0:5 Comment(1)
Thanks, the PDF does look pretty interesting on the hardware level, but it doesn't seem to answer the question on how to deal with the problem.Hemorrhage

The compiler does not generate out-of-order execution errors; it optimizes and reorders however it likes, as long as what it produces yields the result your source code says it should.

But when dealing with multithreading, this can indeed blow up, though that generally has little to do with how the compiler has reordered your code (though the reordering could make things worse in otherwise optimistic cases).

When dealing with threads operating on the same data, you need to be very, very careful and make sure your data is properly protected with the appropriate primitives (semaphores/mutexes/atomic operations and the like).

Engrail answered 28/1, 2010 at 23:39 Comment(4)
Could you elaborate on the circumstances when you need guards to prevent this problem?Hemorrhage
Basically any time you have more than 1 thread accessing the same data.Engrail
There is no such thing as a single-threaded application. If you're using hardware, then you're using hardware interrupts, and you're using multithreading whether you like it or not.Baguio
@Baguio No, an application might not, and usually doesn't, use interrupts. The fact that a single-threaded application runs on a multithreaded operating system is orthogonal to this.Engrail

The fact of the matter is that if you're only just starting to deal with multithreaded code (to the point that you're explicitly talking about thread scheduling as if it's somewhat scary [not to say it isn't, but for different reasons]), this is happening at a much, much lower level than you need to worry about. As others have said, compilers will not do this if they cannot guarantee correctness, and while it's good to know that technologies like this exist, unless you're writing your own compiler or doing really bare-metal stuff, it shouldn't present an issue.

Isolationist answered 28/1, 2010 at 23:42 Comment(0)

The compiler and the cpu both implement algorithms which ensure that sequential semantics are preserved for a given execution stream. For them to not implement said algorithms qualifies as a bug. It is safe to assume that instruction reordering will not affect your program semantics.

As noted elsewhere, memory is the only place where non-sequential semantics may arise; synchronization back to sequential semantics can be obtained there via various well-known mechanisms (at the assembly level there are atomic memory access instructions; higher-level constructs such as mutexes, barriers, spinlocks, etc. are implemented with those atomic instructions).

In answer to your title: You don't handle OOO execution.

Similitude answered 4/2, 2010 at 19:28 Comment(0)

However, I never knew that compiler and hardware optimisations could reorder operations in a way that is guaranteed to work for a single thread, but not necessarily for multi-threading.

As neither C nor C++ has had a strongly defined memory model, compilers could reorder operations in ways which might cause issues for multi-threading. But compilers which are designed for use in multi-threaded environments don't.

Multi-threaded code either writes to memory, and uses a fence to ensure visibility of the writes between threads, or it uses atomic operations.

Since the values used in the atomic-operation case are observable within a single thread, the reordering does not affect them: they have to have been calculated correctly prior to the atomic operation.
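The "write to memory, then ensure visibility" pattern described above can be sketched in C++11, where a release store paired with an acquire load plays the role of the fence (the names `writer`, `reader`, and `message` are invented for illustration):

```cpp
#include <atomic>

int message = 0;                 // plain data; ordered only via the flag below
std::atomic<bool> flag{false};

void writer() {
    message = 123;                               // (1) write the data
    flag.store(true, std::memory_order_release); // (2) publish: release store
}

int reader() {
    // (3) acquire load: once it sees true, the write in (1) is visible
    while (!flag.load(std::memory_order_acquire)) { }
    return message;                              // (4) guaranteed to read 123
}
```

If `flag` were a plain bool, both the compiler and the CPU would be free to reorder (1) past (2), which is exactly the breakage this answer says fences suppress.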

Compilers intended for multi-threaded applications do not reorder across memory fences.

So the reordering either does not affect the behaviour, or is suppressed as a special case.

If you are already writing correct multi-threaded code, the compiler reordering doesn't matter. It's only an issue if the compiler isn't aware of memory fences, in which case you probably shouldn't be using it to write multi-threaded code in the first place.

Bunin answered 1/2, 2010 at 0:45 Comment(4)
In many cases, the same compilers are used in single-threaded and multi-threaded programs and, if you are not careful, they may re-order operations in a way that is noticeable in multithreaded programs. In C, the volatile attribute is used to denote a memory location that is visible to other threads. C compilers will not reorder volatile memory reads and writes.Miles
@Jamey Hicks you seem to have missed the point. A given compiler is either targeted at a multithreaded environment or it is not. Whether you happen to write multithreaded programs with said compiler doesn't matter. Please read the C specification for the semantics of volatile - it does not mention threads. 'If you are not careful' implies you are not using the memory barriers and atomic operations correctly. The point I was making is that if you are careful to avoid the large reorderings you get between threads, then it already avoids the weaker reorderings the compiler can create.Bunin
I think the whole point of this question is whether and how one writes thread-safe code. If the first sentence were rewritten to say that a compiler will not reorder memory operations in thread-safe code in a way that changes the behavior of the program, I would be in complete agreement. As written, it sounds to me like you are implying it won't do so even if the code is not thread-safe.Miles
The OP says he knows about the issues writing multi-threaded code, but is worried about the compiler/CPU reordering the instructions within each thread. The compilers which target multithreaded environments will not introduce any reordering of the operations within a thread which will break code which already handles possible reorderings of operations between thread using fences, and the memory fences also enforce the order at the CPU level. So if he knows how to overcome the issues of multithreaded code, and has checked that the compiler docs, there is nothing more that needs doing.Bunin

So essentially you're asking about the memory consistency model. Some languages/environments, such as Java and .NET, define a memory model, and it's the responsibility of the programmer not to do things which are not allowed or which result in undefined behavior. If you're unsure about the atomicity behavior of "normal" operations, it's better to be safe than sorry and just use the mutex primitives.

For C and C++ the situation is not as nice, as those language standards don't define a memory model. And no, contrary to the unfortunately popular opinion, volatile doesn't guarantee anything wrt atomicity. In this case, you have to rely on the platform threads library (which among other things executes required memory barriers) or compiler/hw-specific atomic intrinsics, and hope that the compiler doesn't do any optimizations that break program semantics. As long as you avoid conditional locking within a function (or translation unit if using IPA) you ought to be relatively safe.
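To make the volatile-vs-atomicity point concrete in C++11 terms (the function name here is invented): `volatile` does not make `++` atomic, since it compiles to a separate load, add, and store, whereas `std::atomic` provides a single atomic read-modify-write.

```cpp
#include <atomic>
#include <thread>

volatile int racy = 0;      // volatile orders accesses for the compiler only;
                            // concurrent ++racy can still lose increments
std::atomic<int> safe{0};   // fetch_add is one indivisible operation

// Two threads each add per_thread increments to the atomic counter.
int count_atomically(int per_thread) {
    safe = 0;
    auto work = [per_thread] {
        for (int i = 0; i < per_thread; ++i)
            safe.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return safe.load();
}
```

Replacing `safe` with `racy` in the loop would be a data race: the total would come up short under contention, which is exactly why volatile is not a synchronization primitive in C and C++.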

Luckily, C++0x and the next C standard are rectifying this issue by defining a memory model. I asked a question related to this and, as it turned out, conditional locking here; the question contains links to some documents that go into the issue in some detail. I recommend reading those documents.

Chacma answered 5/2, 2010 at 0:23 Comment(0)

How do you prevent the possibility of out-of-order execution occurring and blowing up in your face?

You don't - the compiler can only change the order of execution when doing so doesn't alter the end result.

Ibnsina answered 28/1, 2010 at 23:34 Comment(4)
That's not strictly true; on Windows and at least two other major gaming platforms the C/C++ compilers provide memory ordering intrinsics that prevent compiler reordering. (As well as CPU reordering.)Marcie
They are only necessary when the order of instructions matters as well as the end result (for example, in locking algorithms).Ibnsina
The compiler may make any optimisation that would work for a single thread, but then when multithreading is enabled, these optimisations fail. en.wikipedia.org/wiki/Memory_barrierHemorrhage
This is true given the assumptions the compiler is making. In most cases, the compiler assumes single-threaded operation except for things that you explicitly mark as being visible to concurrent operations, e.g., volatile in C/C++.Miles

Most compilers nowadays have explicit ordering intrinsics. C++0x has memory ordering intrinsics as well.

Marcie answered 28/1, 2010 at 23:38 Comment(1)
Explicit ordering intrinsics?Hemorrhage

You should not be running things that need to happen in order in different threads. Threads are for processing things in parallel, so if the order is important, it needs to be done serially.

Unrepair answered 28/1, 2010 at 23:35 Comment(3)
It's very rare for processes to be entirely parallel. Usually there needs to be some communication, if only to gather the results of computation.Bunin
It's OK to have an ordering dependence in multiple threads, as long as you synchronize to enforce the ordering. Producer-consumer parallelism is a common pattern but requires synchronization between the producer and consumer.Miles
Please don't try to answer questions you know nothing about. This is a very long page of idiotic and moronic posts, most of which clearly show the ignorance of the poster.Baguio
