The cardinal rules of speculative out-of-order (OoO) execution are:
- Preserve the illusion of instructions running sequentially, in program order
- Make sure speculation is contained to things that can be rolled back if mis-speculation is detected, and that can't be observed by other cores to be holding a wrong value. Physical registers, the back-end itself that tracks instruction order yes, but not cache. Cache is coherent with other cores so stores must not commit to cache until after they're non-speculative.
OoO exec is normally implemented by treating everything as speculative until retirement. Every load or store could fault, every FP instruction could raise an FP exception. Branches are special (compared to exceptions) only in that branch mispredicts are not rare, so a special mechanism to handle early detection and roll-back for branch misses is helpful.
Yes, cacheable loads can be executed speculatively and OoO because they have no side effects.
Store instructions can also be executed speculatively thanks to the store buffer. The actual execution of a store just writes the address and data into the store buffer. (related: Size of store buffers on Intel hardware? What exactly is a store buffer? gets more techincal than this, with more x86 focus. This answer is I think applicable to most ISAs.)
Commit to L1d cache happens some time after the store instruction retires from the re-order buffer (ROB), i.e. when the store is known to be non-speculative, the associated store-buffer entry "graduates" and becomes eligible to commit to cache and become globally visible. A store buffer decouples execution from anything other cores can see, and also insulates this core from cache-miss stores so it's a very useful feature even on in-order CPUs.
Before a store-buffer entry "graduates", it can just be discarded along with the ROB entry that points to it, when rolling back on mis-speculation.
(This is why even strongly-ordered hardware memory models still allow StoreLoad reordering https://preshing.com/20120930/weak-vs-strong-memory-models/ - it's nearly essential for good performance not to make later loads wait for earlier stores to actually commit.)
The store buffer is effectively a circular buffer: entries allocated by the front-end (during alloc/rename pipeline stage(s)) and released upon commit of the store to L1d cache. (Which is kept coherent with other cores via MESI).
Strongly-ordered memory models like x86 can be implemented by doing commit from the store buffer to L1d in order. Entries were allocated in program order, so the store buffer can basically be a circular buffer in hardware. Weakly-ordered ISAs can look at younger entries if the head of the store buffer is for a cache line that isn't ready yet.
Some ISAs (especially weakly ordered) also do merging of store buffer entries to create a single 8-byte commit to L1d out of a pair of 32-bit stores, for example.
Reading cacheable memory regions is assumed to have no side effects and can be done speculatively by OoO exec, hardware prefetch, or whatever. Mis-speculation can "pollute" caches and waste some bandwidth by touching cache lines that the true path of execution wouldn't (and maybe even triggering speculative page-walks for TLB misses), but that's the only downside1.
MMIO regions (where reads do have side-effects, e.g. making a network card or SATA controller do something) need to be marked as uncacheable so the CPU knows that speculative reads from that physical address are not allowed. If you get this wrong, your system will be unstable - my answer there covers a lot of the same details you're asking about for speculative loads.
High performance CPUs have a load buffer with multiple entries to track in-flight loads, including ones that miss in L1d cache. (Allowing hit-under-miss and miss-under-miss even on in-order CPUs, stalling only if/when an instruction tries to read load-result register that isn't ready yet).
In an OoO exec CPU, it also allows OoO exec when one load address is ready before another. When data eventually arrives, instructions waiting for inputs from the load result become ready to run (if their other input was also ready). So the load buffer entries have to be wired up to the scheduler (called the reservation station in some CPUs).
See also About the RIDL vulnerabilities and the "replaying" of loads for more about how Intel CPUs specifically handle uops that are waiting by aggressively trying to start them on the cycle when data might be arriving from L2 for an L2 hit.
Footnote 1: This downside, combined with a timing side-channel for detecting / reading micro-architectural state (cache line hot or cold) into architectural state (register value) is what enables Spectre. (https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)#Mechanism)
Understanding Meltdown as well is very useful for understanding the details of how Intel CPUs choose to handle fault-suppression for speculative loads that turn out to be on the wrong path. http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/
And, for sure, read/write operations are supported
Yes, by decoding them to separate logically separate load / ALU / store operations, if you're talking about modern x86 that decodes to instructions uops. The load works like a normal load, the store puts the ALU result in the store buffer. All 3 of the operation can be scheduled normally by the out-of-order back end, just like if you'd written separate instructions.
If you mean atomic RMW, then that can't really be speculative. Cache is globally visible (share requests can come at any time) and there's no way to roll it back (well, except whatever Intel does for transactional memory...). You must not ever put a wrong value in cache. See Can num++ be atomic for 'int num'? for more about how atomic RMWs are handled, especially on modern x86, by delaying response to share / invalidate requests for that line between the load and the store-commit.
However, that doesn't mean that lock add [rdi], eax
serializes the whole pipeline: Are loads and stores the only instructions that gets reordered? shows that speculative OoO exec of other independent instructions can happen around an atomic RMW. (vs. what happens with an exec barrier like lfence
that drains the ROB).
Many RISC ISAs only provide atomic RMW via load-linked / store-conditional instructions, not a single atomic RMW instruction.
[read/write ops ...], to some extent at least, due to the fact that the registers themselves, on some CPUs, are physically located on the CPU cache as I understand.
Huh? False premise, and that logic doesn't make sense. Cache has to be correct at all times because another core could ask you to share it at any moment. Unlike registers which are private to this core.
Register files are built out of SRAM like cache, but are separate. There are a few microcontrollers with SRAM memory (not cache) on board, and the registers are memory-mapped using the early bytes of that space. (e.g. AVR). But none of that seems at all relevant to out-of-order execution; cache lines that are caching memory are definitely not the same ones that are being used for something completely different, like holding register values.
It's also not really plausible that a high-performance CPU that's spending the transistor budget to do speculative execution at all would combine cache with register file; then they'd compete for read/write ports. One large cache with the sum total read and write ports is much more expensive (area and power) than a tiny fast register file (many read/write ports) and a small (like 32kiB) L1d cache with a couple read ports and 1 write port. For the same reason we use split L1 caches, and have multi-level caches instead of just one big private cache per core in modern CPUs. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Related reading / background:
- https://en.wikipedia.org/wiki/Memory_disambiguation - how the CPU handles forwarding from the store buffer to a load, or not if the store was actually younger (later in program order) than this load.
- https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ - Store-to-Load Forwarding and Memory Disambiguation in x86 Processors. Very detailed test results and technical discussion of store-forwarding, including from narrow loads that overlap with different parts of a store, and near cache-line boundaries. (https://agner.org/optimize/ has some simpler-to-understand but less detailed info about when store-forwarding is slow vs. fast in his microarch PDF.)
- https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake - modern CPUs dynamically predict memory dependencies for loads when there are earlier stores with unknown address in flight. (i.e. store-address uop not executed yet.) This can result in having to roll back if the prediction is wrong.
- Globally Invisible load instructions - store forwarding from loads that partially overlap a recent store and partially don't gives us a corner case that sheds some light on how CPUs work, and how it does/doesn't make sense to think about memory (ordering) models. Note that C++ std::atomic can't create code that does this, although C++20 std::atomic_ref could let you do an aligned 4-byte atomic store that overlaps an aligned 8-byte atomic load.