There are lots of ways for a DeathStation 9000 C++ implementation to break your program, e.g. by compiling `foo = 5` into a store of 4 and then a memory increment, so a value is visible that never existed in the abstract machine. But that doesn't seem plausible on any real compiler anyone would want to use, except maybe when compiling `shared = tmp ? 1234 : 5678;` it could unconditionally store 1234 and then conditionally store 5678, instead of doing an ALU select. Probably still unlikely.
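To make that second case concrete, here's a sketch of the hypothetical lowering (the function name `store_select` is mine; no real compiler is known to do this):

```c++
// Hypothetical sketch only: the pathological lowering described above,
// written out as source. The abstract machine stores exactly one value,
// but this lowering would briefly make 1234 visible even when tmp is false.
int shared;

void store_select(bool tmp) {
    shared = tmp ? 1234 : 5678;       // what the source says: one store
    // pathological compiler output, expressed as source:
    //   shared = 1234;               // unconditional store of one arm
    //   if (!tmp) shared = 5678;     // then conditionally overwrite
}
```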
One major real-world effect that can't go unmentioned is hoisting a non-atomic load (or sinking a store) out of a loop. MCU programming - C++ O2 optimization breaks while loop explains the details. But that would just stop you from ever seeing the store, not give you a value other than `0` or `5`.
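Here's a minimal sketch of the pattern that question describes (my own names; with a plain non-atomic global this is a data race, i.e. UB):

```c++
// Sketch: the compiler may hoist the load of `flag` out of the loop,
// effectively turning this into `if (!flag) while (true) {}`, so a store
// from another thread (or interrupt handler) is never noticed.
bool flag = false;   // plain non-atomic global: no visibility guarantee

void spin_wait() {
    while (!flag) {
        // spin; at -O2 this can become an infinite loop if flag was false once
    }
}
```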
Other practical effects include `int tmp = shared;` compiling later uses of `tmp` into re-reads of `shared`, so effectively your local variable can have multiple inconsistent values. See the "Invented loads" section in Who's afraid of a big bad optimizing compiler? on LWN. (The context of that article is Linux kernel programming, where they use GCC/Clang's semantics for `volatile` (via `WRITE_ONCE` or `READ_ONCE` macros) as basically equivalent to `std::atomic<>` with `memory_order_relaxed`, for types of register width or narrower.)
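A sketch of how that can bite (the helpers `use` and `other_use` are hypothetical stand-ins):

```c++
// Sketch of "invented loads": the compiler may rematerialize `tmp` as
// fresh loads of `shared`, so the value tested and the value used can differ.
int shared;             // written by another thread: a data race, UB

void use(int);          // hypothetical helpers, declared for illustration
void other_use(int);

void example() {
    int tmp = shared;   // may not be kept in a register
    if (tmp > 0)
        use(tmp);       // `tmp` here may be a re-read of `shared`...
    else
        other_use(tmp); // ...that disagrees with the value tested above
}
```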
Definitely read that whole article; it's written from exactly the perspective you're looking for, describing real-world badness that can plausibly happen on real CPUs with real compilers. It also has a section about invented stores, with a more plausible example than my DeathStation 9000 first paragraph, involving multiple small members of a struct.
But you asked about practical cases that could cause tearing or bad effects other than lack of visibility across one single change, with CPUs and compilers that are used in practice. Tearing between `0` and `5` is implausible¹, but let's talk about `0` and `0x12345678`.
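To make the difference concrete, here's a small worked example (mine, not from the linked Q&As) of the values a reader could see if a 32-bit store of `0x12345678` over `0` were split into two halfword stores:

```c++
// Worked example: the two half-updated values a concurrent reader could
// observe if a 32-bit store were torn into two 16-bit halves.
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t old_v = 0u, new_v = 0x12345678u;
    uint32_t low_first  = (old_v & 0xFFFF0000u) | (new_v & 0x0000FFFFu);  // 0x00005678
    uint32_t high_first = (new_v & 0xFFFF0000u) | (old_v & 0x0000FFFFu);  // 0x12340000
    std::printf("%#010x %#010x\n", (unsigned)low_first, (unsigned)high_first);
}
```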
Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes -- none are guaranteed; Nate's answer shows possible tearing when assigning certain constants to a `uint64_t` on AArch64 before ARMv8.4, and that you can coax a compiler into doing an unaligned 16-bit load from the middle of a 32-bit `unsigned` integer just by reading the whole variable and doing some innocent-looking computations on the temporary result:
```c++
unsigned x;

unsigned foo(void) {
    return (x >> 8) & 0xffff;   // compiler may load just 2 bytes from x+1
}
```
For x86-64, compilers use `movzx eax, WORD PTR x[rip+1]`, which is safe: both AMD and Intel separately guarantee that 1-, 2-, or 4-byte unaligned loads / stores to cacheable memory are atomic as long as they're contained within an aligned 8-byte chunk. (See Why is integer assignment on a naturally aligned variable atomic on x86?) And both mainstream x86-64 ABIs have `alignof(unsigned) == 4`.
But that's not guaranteed on all other ISAs, such as ARMv7-M or ARMv8-A. Unaligned loads are supported, and clang will use them (Godbolt), but the architecture only guarantees atomicity for aligned loads, e.g. "All halfword accesses to halfword-aligned locations." Instead of putting significant shift hardware into their load execution units like x86 CPUs are required to do, ARM chips are allowed, if they want, to do separate byte loads and combine the bytes. From the ARMv8-A architecture reference manual: "All other memory accesses are regarded as streams of accesses to bytes, and no atomicity between accesses to different bytes is ensured by the architecture."
According to a blog post, falling back to separate byte loads is what happened in practice on real ARMv6 hardware. I'm not sure if later low-power ARM cores still do that; maybe not, since compiler tuning heuristics are willing to use unaligned loads for them. (Even `-march=armv6t2 -mtune=cortex-a53` gets clang to avoid unaligned loads (Godbolt), but that might be because ARMv6 allows a config choice under which unaligned word loads use the offset-within-word bits as a byte rotate count like ARMv5(!), and unaligned halfword loads are undefined, rather than behaving as a normal byte offset like ARMv7.)
But we can get a compiler to emit code that's not guaranteed on paper to be atomic on the multi-core `-mcpu` we asked it to compile for, such as Cortex-M55 (ARMv8.1-M, with dual-core versions available) or Cortex-A53 (ARMv8-A, often found in multi-core CPUs). (Godbolt)
```asm
# ARM clang 11 -O2 -Wall  -mcpu=cortex-a53 or -mcpu=cortex-m55
foo():
        movw    r0, :lower16:x
        movt    r0, :upper16:x
        ldrh.w  r0, [r0, #1]      @@@ unaligned load from x+1
        bx      lr
```
Similar with AArch64 clang. I don't have the hardware to test whether tearing is actually visible in practice. I suspect it might not be on Cortex-A53; otherwise it might have been better for clang to compile the way GCC does, to a word load and a `ubfx` unsigned bitfield-extract. If tearing is possible, it means the CPU took extra cycles to make multiple accesses to L1d cache even within an aligned word.
The architecture manual says tearing is possible for any misaligned halfword load/store, but on some cores it probably only happens when a cacheable load/store crosses a cache-line boundary, or maybe a 4- or 8-byte boundary, depending on the CPU internals.
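For completeness, here's a sketch of the well-defined way to read the earlier example tear-free, using `std::atomic<>` with `memory_order_relaxed` as mentioned above (analogous to the kernel's `READ_ONCE`):

```c++
// Sketch: a relaxed atomic load is guaranteed tear-free for the whole
// variable; the shift/mask then operates on a local copy. On x86-64 the
// relaxed load compiles to a plain mov.
#include <atomic>

std::atomic<unsigned> x;

unsigned foo() {
    unsigned tmp = x.load(std::memory_order_relaxed);  // one atomic load
    return (tmp >> 8) & 0xffff;
}
```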
Related Q&As:
Footnote 1: `0` and `5` differ only in the low byte. Tearing between `0` and `5` in a `uint32_t` is pretty much impossible even on an 8-bit machine: the only HW guarantee required is that byte stores are atomic. `0u` and `5u` differ only in the low 3 bits; their top 3 octets are the same. (This would also be the case for signed `0` or `5` in any of the three signed-integer representations allowed by the standard: two's complement, one's complement, and sign/magnitude.)

`uint32_t` is required to be exactly 32 bits with zero padding bits, so padding can't split the low 3 value bits across a byte boundary. Any endianness is possible, but within each `unsigned char` chunk (of at least 8 bits, the minimum `CHAR_BIT`) the bits have to be in base-2 place-value order for any unsigned type. (It is well-defined to use `memcpy` or `unsigned char*` to examine the object representation of other types.)

So tearing at byte boundaries couldn't create values other than `0u` or `5u`. Obviously with values like `0` and `0x12345678u` such tearing would be visible.
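As a small illustration of examining the object representation (my own sketch):

```c++
// Sketch: inspecting the object representation of a uint32_t through
// unsigned char*, which is well-defined for any object type.
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t v = 5u;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
    for (unsigned i = 0; i < sizeof v; ++i)
        std::printf("byte %u = 0x%02x\n", i, (unsigned)p[i]);  // only one byte differs from 0u
}
```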
All CPU architectures I'm aware of have atomic byte loads / byte stores, if they have byte accesses at all (unlike early DEC Alpha). See Can modern x86 hardware not store a single byte to memory? (which covers more than just x86). Commit to L1d cache might involve an extra cycle on non-x86 CPUs to update a larger ECC granule, but it's done like an atomic RMW.
Comments:

- `volatile uint32_t foo;` – Interact
- `volatile` helps compiler optimizations, but doesn't do anything about the read and write pipelines on modern processors. (Arguably, it should, but in reality, it doesn't.) – Someway
- `volatile` doesn't help compiler optimization, it hinders it. It directs the compiler that reads and writes cannot be folded / optimized away. The primary purpose is for accessing (memory-mapped) hardware registers. – Breaux
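As a sketch of that primary purpose (the register name and address are hypothetical):

```c++
// Sketch: volatile's intended use per the comment above: a memory-mapped
// hardware status register. Every access written in the source must be
// performed; the compiler may not fold, cache, or eliminate it.
#include <cstdint>

volatile uint32_t *const STATUS_REG =
    reinterpret_cast<volatile uint32_t *>(0x40000000u);  // hypothetical MMIO address

bool device_ready() {
    return (*STATUS_REG & 0x1u) != 0;  // re-read from hardware on every call
}
```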