In addition to the above answers:
If there are no fences, the only ordering that needs to be preserved is the data dependency order. So on a single CPU, a load of X should see the most recent store to X that precedes it in program order. But if instructions do not have any data dependency, they can be executed in any order.
Modern CPUs use out-of-order execution to maximize the amount of parallelism in the instruction stream. This way independent instructions can run in parallel, and the CPU doesn't have to stall while waiting for a memory access.
CPUs also make use of other techniques like store buffers, load buffers, write coalescing etc., all of which can lead to loads and stores being executed out of order. This is fine as long as it isn't visible to the core that executes these loads and stores. The problem arises when the core is sharing memory with other cores; then these reorderings can become visible.
For Sequential Consistency (SC) no reordering is allowed; so all 4 fences need to be preserved:
- [LoadLoad]
- [LoadStore]
- [StoreLoad]
- [StoreStore]
On the X86, the store buffers can cause older stores to be reordered with newer loads to a different address; so the [StoreLoad] is dropped and only [LoadLoad][LoadStore][StoreStore] are preserved. This memory model is called TSO (Total Store Order).
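The [StoreLoad] case can be illustrated with the classic store-buffering litmus test. The following is only a sketch in C++; the function names `both_loads_saw_zero` and `count_both_zero` are made up for this example. With `memory_order_seq_cst` the outcome "both threads read 0" is forbidden by the language; with release stores and acquire loads it is allowed, because nothing orders a store before a later load of a different variable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of the store-buffering litmus test (hypothetical helper).
// Returns true when both threads read 0, i.e. each load appeared to run
// before the other thread's store became visible.
bool both_loads_saw_zero(bool use_seq_cst) {
    const std::memory_order store_order =
        use_seq_cst ? std::memory_order_seq_cst : std::memory_order_release;
    const std::memory_order load_order =
        use_seq_cst ? std::memory_order_seq_cst : std::memory_order_acquire;

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, store_order);  // older store
        r1 = y.load(load_order);  // newer load from a different address
    });
    std::thread t2([&] {
        y.store(1, store_order);
        r2 = x.load(load_order);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the litmus test and counts how often both loads saw 0.
int count_both_zero(int rounds, bool use_seq_cst) {
    assert(rounds >= 0);
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_loads_saw_zero(use_seq_cst)) ++hits;
    }
    return hits;
}
```

`count_both_zero(n, true)` must always return 0. With `false` the count may be non-zero on hardware that buffers stores, although starting fresh threads every round makes the window small, so seeing 0 there proves nothing.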
TSO can be relaxed by allowing writes from the same core to different addresses to be reordered (e.g. write coalescing or store buffers that don't retire in order), so [StoreStore] is dropped as well. This results in PSO (Partial Store Order).
The problem with SC/TSO/PSO is that certain reorderings aren't allowed and this can lead to reduced performance; imagine there are 2 independent loads on the same CPU, then these loads can't be reordered because of the [LoadLoad]. In practice this can be resolved by executing instructions speculatively and, if an out-of-order load is detected, flushing the pipeline and starting again. This makes CPUs more complex and less performant.
Models like SC, TSO and PSO are strong consistency models because every load and every store has certain ordering semantics. In a weakly ordered consistency model, there is a separation between plain loads/stores (no ordering semantics) and synchronization actions, e.g. an acquire load and a release store, that do provide ordering semantics. The weak memory model with acquire loads and release stores is called release consistency.
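As a rough illustration of that separation, here is a minimal message-passing sketch in C++. The function name `receive_via_release_acquire` and the value 42 are just assumptions for the example. `payload` is a plain variable without ordering semantics; `ready` is the synchronization variable, written with a release store and read with an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes a value from one thread to another using
// a release store / acquire load pair on a flag.
int receive_via_release_acquire() {
    int payload = 0;                 // plain data, no ordering semantics
    std::atomic<bool> ready{false};  // synchronization variable
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // release store
    });
    std::thread consumer([&] {
        // Acquire load: once it observes true, everything written before
        // the matching release store is visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;                                // plain load
        assert(seen == 42);
    });
    producer.join();
    consumer.join();
    return seen;
}
```

If both accesses to `ready` were `memory_order_relaxed` instead, the read of `payload` would be a data race and the guarantee would be gone.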
The big advantage of these weak models is that they allow for a much higher degree of parallelism and simpler CPU design. It shifts the burden to the software.
In practice you normally program against a programming language/API that provides its own memory model. The implementation of that language then needs to make sure the compiler doesn't violate the model and that sufficient ordering is imposed on the hardware, e.g. in the form of fences. If you have a look at Java or C11 and use it correctly, the same code will run fine on a CPU with a strong memory model like the X86 and on a CPU with a weak memory model like ARM.
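To make the "in the form of fences" part a bit more concrete, here is a sketch of the same kind of hand-off written with explicit language-level fences in C++. The name `receive_via_fences` is hypothetical. The source is identical for every target; the compiler decides what the fences turn into. Typically a release or acquire fence needs no instruction on X86, where it only restricts compiler reordering, and becomes a barrier instruction on ARM.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: relaxed accesses to the flag, with the ordering
// supplied by standalone fences instead.
int receive_via_fences() {
    int payload = 0;
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Keeps the store to payload ordered before the store to ready.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Keeps the load of payload ordered after the load of ready.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
        assert(seen == 42);
    });
    producer.join();
    consumer.join();
    return seen;
}
```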
References: