Why do weak memory models exist and how is their instruction order selected?

Asked 15/11, 2019 at 3:40 Answered 6/4, 2021 at 6:30

multithreading memory parallel-processing memory-model

CPUs such as ARM have a weak memory model. Assume we have two threads, T1 and T2.

| T1      | T2      |
|---------|---------|
| Instr A | Instr C |
| Instr B | Instr D |

Under weak ordering, any instruction can run at any time, which means that an execution order such as "D -> A -> B -> C" is possible.

I have the following questions:

Why is this beneficial?
How is the selection (optimization) done? Is the CPU picking the instructions randomly, or are there algorithms behind it? Is the CPU doing the picking, or is there another chip doing the work (a memory chip or something)?

Bootle answered 15/11, 2019 at 3:40 Comment(0)

In addition to the above answers:

If there are no fences, the only ordering that needs to be preserved is the data dependency order. So on a single CPU, a load of X must see the most recent store to X that precedes it. But if instructions do not have any data dependency, they can be executed in any order.

Modern CPUs use out-of-order execution to maximize the amount of parallelism in the instruction stream. This way, independent instructions can run in parallel, which keeps the CPU from stalling on memory accesses.

CPUs also make use of other techniques such as store buffers, load buffers, and write coalescing, all of which can lead to loads and stores being executed out of order. This is fine, because it isn't visible to the core that executes these loads and stores. The problem arises when the core shares memory with other cores; then these reorderings can become visible.

For Sequential Consistency (SC), no reordering is allowed, so all four fences need to be preserved:

[LoadLoad]
[LoadStore]
[StoreLoad]
[StoreStore]

On x86, the store buffers can cause older stores to be reordered with newer loads to a different address, so [StoreLoad] is dropped and only [LoadLoad][LoadStore][StoreStore] are preserved. This memory model is called TSO (Total Store Order).
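
To make the store-buffer effect concrete, here is a minimal, deterministic toy model (not real hardware; `Core`, `Memory`, `Outcome` and `store_buffer_litmus` are names made up for this sketch). Each core queues its store in a private buffer, so a later load by the other core can still read the old value from memory. Draining the buffer before the load plays the role of a [StoreLoad] fence.

```cpp
#include <array>
#include <cassert>
#include <deque>
#include <utility>

using Memory = std::array<int, 2>;  // index 0 is X, index 1 is Y

// Hypothetical toy core: stores wait in a FIFO buffer before reaching memory.
struct Core {
    std::deque<std::pair<int, int>> store_buffer;  // (address, value)

    void store(int addr, int value) { store_buffer.emplace_back(addr, value); }

    int load(const Memory& memory, int addr) const {
        assert(addr >= 0 && addr < 2);
        // Store-to-load forwarding: a core always sees its own newest store.
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it) {
            if (it->first == addr) return it->second;
        }
        return memory[addr];
    }

    // Models a [StoreLoad] fence: drain all buffered stores to memory in order.
    void fence(Memory& memory) {
        for (const auto& entry : store_buffer) memory[entry.first] = entry.second;
        store_buffer.clear();
    }
};

struct Outcome { int r1; int r2; };

// Classic store-buffering litmus test, run in one fixed interleaving.
Outcome store_buffer_litmus(bool use_fence) {
    Memory memory{0, 0};
    Core core0, core1;
    core0.store(0, 1);               // core 0: X = 1
    core1.store(1, 1);               // core 1: Y = 1
    if (use_fence) {
        core0.fence(memory);
        core1.fence(memory);
    }
    int r1 = core0.load(memory, 1);  // core 0: r1 = Y
    int r2 = core1.load(memory, 0);  // core 1: r2 = X
    return {r1, r2};
}
```

In this model, `store_buffer_litmus(false)` yields `r1 == 0` and `r2 == 0`, an outcome that no SC interleaving can produce, while `store_buffer_litmus(true)` yields `r1 == 1` and `r2 == 1`.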

TSO can be relaxed by allowing writes from the same core to be reordered (e.g. write coalescing or store buffers that don't retire in order). This results in PSO (Partial Store Order).

The problem with SC/TSO/PSO is that certain reorderings aren't allowed, and this can lead to reduced performance. Imagine there are 2 independent loads on the same CPU; these loads can't be reordered because of [LoadLoad]. In practice this can be resolved by executing instructions speculatively and, if an out-of-order load is detected, flushing the pipeline and starting again. This makes CPUs more complex and can reduce performance.

Models like SC, TSO and PSO are strong consistency models because every load and every store has certain ordering semantics. In a weakly ordered consistency model, by contrast, there is a separation between plain loads/stores (no ordering semantics) and synchronization actions, e.g. an acquire load and a release store, that do provide ordering semantics. The weak memory model with acquire loads and release stores is called release consistency.
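
The plain vs. synchronization split can be sketched in C++ with `std::atomic`. This is only an illustration; the function name `message_passing` and the value 42 are made up. The payload is written with a plain store, and only the flag uses a release store and an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: a plain payload published through a release/acquire flag.
int message_passing() {
    int payload = 0;                 // plain data, no ordering semantics
    std::atomic<bool> ready{false};  // synchronization variable
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // release store
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // acquire load
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's plain store
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // follows from the release/acquire pairing
    return seen;
}
```

The release store keeps the earlier plain store from moving after it, and the acquire load keeps the later plain load from moving before it, so the consumer cannot observe the flag without also observing the payload.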

The big advantage of these weak models is that they allow for a much higher degree of parallelism and simpler CPU design. It shifts the burden to the software.

In practice you normally program using a programming language/API that provides a certain memory model. The language implementation needs to make sure that the compiler doesn't violate the model and that sufficient ordering is added for the hardware, e.g. in the form of fences. If you have a look at Java or C11, and you are using it correctly, then the same code will run fine on a CPU with a strong memory model like x86 and on a CPU with a weak memory model like ARM.
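
As a rough illustration of that portability, here is the store-buffering pattern written with C++ atomics at their default `memory_order_seq_cst` ordering (the function name `both_zero_observed` is invented for this sketch). The source contains no explicit fences; the compiler is responsible for emitting whatever the target CPU needs so that the "both loads see 0" outcome cannot happen.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test repeatedly with seq_cst atomics and
// reports whether the outcome forbidden by sequential consistency showed up.
bool both_zero_observed(int iterations) {
    assert(iterations > 0);
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        // Default ordering for store() and load() is memory_order_seq_cst.
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) return true;  // would violate the language model
    }
    return false;
}
```

On a conforming implementation this returns `false` on both x86 and ARM. If the operations are weakened to `memory_order_relaxed`, the language no longer forbids the both-zero outcome, and whether it actually shows up depends on the hardware and the timing.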

Whet answered 6/4, 2021 at 6:30 Comment(0)

Why do weak memory models exist?

For performance reasons. Weak memory models allow compiler and hardware optimizations that improve system performance. The cost of enforcing a strong memory model (the sequential-consistency model) in compilation and hardware implementation is severe performance degradation.

What are the allowed instruction reorderings (how is the selection done)?

It is specific to each memory model. There are several weak memory models, and the instruction reordering rules are part of their specifications.

Instruction reordering is ubiquitously used in compiler and hardware optimizations to achieve higher performance. The basic premise for these optimizations is that the instructions can be reordered as long as the functional correctness of the program is preserved.

In a sequential (single-threaded) program, functional correctness can be guaranteed by simply ensuring that "two operations are executed in program order if they are accessing the same memory location and one of them is a write or if there is a data or control dependence between them."
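
The quoted single-threaded rule can be written down as a small predicate. This is just a sketch of the rule as stated above; `MemOp` and `must_keep_program_order` are hypothetical names, and the dependence information is passed in as a flag rather than computed.

```cpp
// One memory operation: a load or a store to some address.
struct MemOp {
    bool is_write;
    int address;
};

// True if the two operations must stay in program order in a sequential
// program: they conflict on the same location, or a data/control dependence
// links them.
constexpr bool must_keep_program_order(MemOp first, MemOp second, bool has_dependence) {
    return has_dependence ||
           (first.address == second.address && (first.is_write || second.is_write));
}

// A store followed by a load of the same location must stay ordered.
static_assert(must_keep_program_order({true, 0x10}, {false, 0x10}, false),
              "store then load, same location");
// Two loads of the same location may be swapped.
static_assert(!must_keep_program_order({false, 0x10}, {false, 0x10}, false),
              "two loads, same location");
// Independent stores to different locations may be swapped.
static_assert(!must_keep_program_order({true, 0x10}, {true, 0x20}, false),
              "two stores, different locations");
```

Note that the last case is exactly the kind of reordering that is harmless in a single thread but can be observed by another thread, which is why multithreaded programs need more than this rule.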

For multithreaded programs, functional correctness also depends on the relative order of loads and stores to different memory locations in the same thread. It is the memory model specification that specifies the conditions under which two memory instructions can be reordered without affecting the functional correctness.

Danyelledanyette answered 1/2, 2020 at 16:20 Comment(1)

Refer to (preshing.com/20120930/weak-vs-strong-memory-models) for a simpler explanation of weak vs. strong memory models. – Danyelledanyette 10/3, 2020 at 5:25

There is no global arbiter that would do any such thing. If there were, it would be just as efficient to always do things in order.

The only data available immediately is local. Each execution decision is based on information that is quickly available.

There is no pressure to execute anything in reverse order rather than in written order. Reverse order is not a priori better. But the data for B might be available before the data for A, and then B might be executed first, since waiting for A to complete would leave computing resources unused.

So it's all a matter of having all the data available when needed, and of the communication delays between processors. You could view it as a team working cooperatively with people who can only exchange information through very slow means of communication: they get as much work done as they can based on their locally available information. No central authority would ever have an accurate picture of the latest completed work.
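
A toy sketch of that idea, under heavy simplifying assumptions (one instruction issued per cycle, operand arrival times known up front; `Instr` and `issue_order` are invented names): in each cycle, issue the oldest instruction whose operands have already arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction: a name plus the cycle at which its data arrives.
struct Instr {
    std::string name;
    int operands_ready_at;
};

// Returns the order in which instructions issue when, in every cycle, the
// oldest instruction with available operands is picked (at most one per cycle).
std::vector<std::string> issue_order(const std::vector<Instr>& program) {
    for (const auto& instr : program) assert(instr.operands_ready_at >= 0);

    std::vector<bool> issued(program.size(), false);
    std::vector<std::string> order;
    for (int cycle = 0; order.size() < program.size(); ++cycle) {
        for (std::size_t i = 0; i < program.size(); ++i) {
            if (!issued[i] && program[i].operands_ready_at <= cycle) {
                issued[i] = true;
                order.push_back(program[i].name);
                break;  // one instruction per cycle in this toy model
            }
        }
    }
    return order;
}
```

With the program `{{"A", 5}, {"B", 1}}`, B issues in cycle 1 and A in cycle 5, so the issue order is B, A. When both operands are ready at cycle 0, the order is simply the written one, A then B.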

Cacka answered 19/11, 2019 at 15:32 Comment(0)
