Can a speculatively executed CPU branch contain opcodes that access RAM?

As I understand it, when a CPU speculatively executes a piece of code, it "backs up" the register state before switching to the speculative branch, so that if the prediction turns out to be wrong (rendering the branch useless), the register state can be safely restored without damaging the overall state.

So, my question is: can a speculatively executed CPU branch contain opcodes that access RAM?

I mean, accessing RAM isn't an "atomic" operation - a single opcode that reads from memory can cause an actual RAM access if the data isn't currently in the CPU cache, which can turn into an extremely time-consuming operation from the CPU's perspective.

And if such access is indeed allowed in a speculative branch, is it only for read operations? Because reverting a write operation, depending on its size, could be extremely slow and tricky if the branch is discarded and a "rollback" is performed. And surely read/write operations are supported to some extent at least, given that on some CPUs the registers themselves are physically located in the CPU cache, as I understand it.

So, maybe a more precise formulation would be: what are the limitations of a speculatively executed piece of code?

Covering answered 30/9, 2020 at 15:57 Comment(0)

The cardinal rules of speculative out-of-order (OoO) execution are:

  1. Preserve the illusion of instructions running sequentially, in program order
  2. Make sure speculation is contained to things that can be rolled back if mis-speculation is detected, and that can't be observed by other cores to be holding a wrong value. Physical registers and the back-end itself that tracks instruction order, yes; cache, no. Cache is coherent with other cores, so stores must not commit to cache until they're non-speculative.

OoO exec is normally implemented by treating everything as speculative until retirement. Every load or store could fault, every FP instruction could raise an FP exception. Branches are special (compared to exceptions) only in that branch mispredicts are not rare, so a special mechanism to handle early detection and roll-back for branch misses is helpful.


Yes, cacheable loads can be executed speculatively and OoO because they have no side effects.

Store instructions can also be executed speculatively, thanks to the store buffer. Actually executing a store just writes the address and data into the store buffer. (Related: Size of store buffers on Intel hardware? What exactly is a store buffer? gets more technical than this, with more x86 focus. This answer is, I think, applicable to most ISAs.)

Commit to L1d cache happens some time after the store instruction retires from the re-order buffer (ROB), i.e. when the store is known to be non-speculative. At that point the associated store-buffer entry "graduates" and becomes eligible to commit to cache and become globally visible. A store buffer decouples execution from anything other cores can see, and also insulates this core from cache-miss stores, so it's a very useful feature even on in-order CPUs.

Before a store-buffer entry "graduates", it can just be discarded along with the ROB entry that points to it, when rolling back on mis-speculation.

(This is why even strongly-ordered hardware memory models still allow StoreLoad reordering (https://preshing.com/20120930/weak-vs-strong-memory-models/) - it's nearly essential for good performance not to make later loads wait for earlier stores to actually commit.)
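
As a concrete illustration, here's a minimal C++ sketch of the classic store-buffer litmus test (the variable and function names are mine, not from the answer). With relaxed atomics, r1 == 0 && r2 == 0 is a permitted and genuinely observable outcome even on x86, because each load can execute before the other core's store has left its store buffer; you'd typically need to run it in a loop to catch it:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Store-buffer litmus test: each thread stores to one variable and
    // loads the other. r1 == 0 && r2 == 0 means each load ran before the
    // other thread's store became visible - StoreLoad reordering.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join(); t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);  // "r1=0 r2=0" is allowed
    }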

The store buffer is effectively a circular buffer: entries are allocated by the front-end (during the alloc/rename pipeline stage(s)) and released upon commit of the store to L1d cache (which is kept coherent with other cores via MESI).

Strongly-ordered memory models like x86's can be implemented by committing from the store buffer to L1d in order. Entries were allocated in program order, so in-order commit falls naturally out of the circular-buffer layout. Weakly-ordered ISAs can commit younger entries first if the one at the head of the store buffer is for a cache line that isn't ready yet.

Some ISAs (especially weakly-ordered ones) also merge store-buffer entries, e.g. creating a single 8-byte commit to L1d out of a pair of adjacent 32-bit stores.
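
To make the circular-buffer idea concrete, here's a toy software model (purely illustrative; real hardware uses CAMs and per-entry state machines, and the names and the 56-entry size, which happens to match Skylake, are just for the sketch):

    #include <cstddef>
    #include <cstdint>

    struct StoreEntry {
        uint64_t addr;
        uint64_t data;
        bool     retired;  // set when the store retires from the ROB
    };

    // Toy store buffer: allocate at rename in program order,
    // commit to L1d in order from the head (x86-style).
    struct StoreBuffer {
        static constexpr size_t N = 56;      // e.g. Skylake-sized
        StoreEntry e[N];
        size_t head = 0, tail = 0;           // head = oldest entry

        bool alloc(uint64_t a, uint64_t d) { // front-end: alloc/rename stage
            if ((tail + 1) % N == head)
                return false;                // buffer full: stall the front-end
            e[tail] = {a, d, false};
            tail = (tail + 1) % N;
            return true;
        }
        void try_commit() {                  // in-order commit from the head
            while (head != tail && e[head].retired) {
                // ...write e[head].data to L1d once the line is owned (MESI)
                head = (head + 1) % N;       // entry "graduates" and is freed
            }
        }
    };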


Reading cacheable memory regions is assumed to have no side effects, so it can be done speculatively by OoO exec, hardware prefetch, or whatever. Mis-speculation can "pollute" caches and waste some bandwidth by touching cache lines that the true path of execution wouldn't (and maybe even trigger speculative page walks for TLB misses), but that's the only downside¹.

MMIO regions (where reads do have side-effects, e.g. making a network card or SATA controller do something) need to be marked as uncacheable so the CPU knows that speculative reads from that physical address are not allowed. If you get this wrong, your system will be unstable - my answer there covers a lot of the same details you're asking about for speculative loads.
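
For example (this device and its address are entirely made up), driver code pairs a volatile access with an uncacheable mapping: volatile stops the compiler from reordering or eliding the access, while the UC memory type is what tells the CPU not to touch the location speculatively:

    #include <cstdint>

    // Hypothetical MMIO status register; the physical address is invented,
    // and the page containing it must be mapped uncacheable (e.g. UC on x86)
    // so the hardware never reads it speculatively or out of order.
    volatile uint32_t *const status_reg =
        reinterpret_cast<volatile uint32_t *>(0xFED00030);

    uint32_t poll_status() {
        return *status_reg;  // each call performs one real device read
    }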

High-performance CPUs have a load buffer with multiple entries to track in-flight loads, including ones that miss in L1d cache. (This allows hit-under-miss and miss-under-miss even on in-order CPUs, stalling only if/when an instruction tries to read a load-result register that isn't ready yet.)

In an OoO exec CPU, the load buffer also allows executing loads out of order when one load's address is ready before another's. When data eventually arrives, instructions waiting for the load result become ready to run (if their other inputs were also ready). So load-buffer entries have to be wired up to the scheduler (called the reservation station in some CPUs).

See also About the RIDL vulnerabilities and the "replaying" of loads for more about how Intel CPUs specifically handle uops that are waiting by aggressively trying to start them on the cycle when data might be arriving from L2 for an L2 hit.


Footnote 1: This downside, combined with a timing side-channel for detecting / reading micro-architectural state (cache line hot or cold) into architectural state (register value) is what enables Spectre. (https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)#Mechanism)
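
The canonical Spectre v1 gadget (variable names follow the paper) shows how little code it takes: if the bounds-check branch is mispredicted for an out-of-bounds x, the two loads still execute speculatively, leaving a cache footprint indexed by the secret byte that a timing attack can measure afterwards.

    #include <cstddef>
    #include <cstdint>

    uint8_t array1[16];
    uint8_t array2[256 * 4096];
    size_t  array1_size = 16;

    uint8_t victim(size_t x) {
        uint8_t y = 0;
        if (x < array1_size)               // trained to predict "taken"
            y = array2[array1[x] * 4096];  // speculative loads pollute cache
        return y;
    }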

Understanding Meltdown as well is very useful for understanding the details of how Intel CPUs choose to handle fault-suppression for speculative loads that turn out to be on the wrong path. http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/


And, for sure, read/write operations are supported

Yes, by decoding them into logically separate load / ALU / store operations, if you're talking about modern x86 that decodes instructions to uops. The load works like a normal load, and the store puts the ALU result in the store buffer. All 3 of these operations can be scheduled normally by the out-of-order back-end, just as if you'd written separate instructions.
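
For instance (the asm and the uop breakdown in the comments are a simplified sketch of typical x86 codegen, not exact for any particular microarchitecture):

    #include <cstdint>

    // A memory-destination add; a compiler may emit
    //     add dword ptr [rdi], esi
    // which the front-end decodes into roughly:
    //     load   tmp <- [rdi]        ; load port
    //     add    tmp <- tmp + esi    ; ALU port
    //     store  [rdi] <- tmp        ; store-address + store-data uops
    // and the OoO back-end schedules each piece independently.
    void add_to_mem(uint32_t *p, uint32_t v) {
        *p += v;
    }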

If you mean atomic RMW, then that can't really be speculative. Cache is globally visible (share requests can come at any time) and there's no way to roll it back (well, except whatever Intel does for transactional memory...). You must not ever put a wrong value in cache. See Can num++ be atomic for 'int num'? for more about how atomic RMWs are handled, especially on modern x86, by delaying response to share / invalidate requests for that line between the load and the store-commit.
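
In C++ terms (a small sketch; the codegen note in the comment describes typical x86 output):

    #include <atomic>

    std::atomic<int> num{0};

    void inc() {
        // Typically compiles to `lock add dword ptr [num], 1` on x86.
        // The core holds the line in Modified state and delays responses to
        // other cores' share/invalidate requests between the load and the
        // store-commit, so once the RMW touches the line it can't be rolled
        // back - it can't be freely speculative like a plain store.
        num.fetch_add(1, std::memory_order_seq_cst);
    }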

However, that doesn't mean that lock add [rdi], eax serializes the whole pipeline: Are loads and stores the only instructions that gets reordered? shows that speculative OoO exec of other independent instructions can happen around an atomic RMW. (vs. what happens with an exec barrier like lfence that drains the ROB).

Many RISC ISAs only provide atomic RMW via load-linked / store-conditional instructions, not a single atomic RMW instruction.
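
In C++, compare_exchange_weak maps onto this directly (a sketch; the comment describes the usual codegen on LL/SC machines such as AArch64 before ARMv8.1's atomics, or RISC-V using LR/SC):

    #include <atomic>

    // On LL/SC ISAs this compiles to a load-linked / store-conditional loop;
    // "weak" permits spurious failure, which is exactly what SC gives you,
    // so the explicit retry loop is idiomatic.
    int fetch_add_via_cas(std::atomic<int> &a, int v) {
        int old = a.load(std::memory_order_relaxed);
        while (!a.compare_exchange_weak(old, old + v,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed)) {
            // on failure, `old` is reloaded with the current value; retry
        }
        return old;  // value before the successful increment
    }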

[read/write ops ...], to some extent at least, due to the fact that the registers themselves, on some CPUs, are physically located on the CPU cache as I understand.

Huh? False premise, and that logic doesn't make sense. Cache has to be correct at all times because another core could ask you to share it at any moment. Unlike registers which are private to this core.

Register files are built out of SRAM like cache, but are separate. There are a few microcontrollers with SRAM memory (not cache) on board, and the registers are memory-mapped using the early bytes of that space. (e.g. AVR). But none of that seems at all relevant to out-of-order execution; cache lines that are caching memory are definitely not the same ones that are being used for something completely different, like holding register values.

It's also not really plausible that a high-performance CPU spending the transistor budget on speculative execution at all would combine cache with the register file; then they'd compete for read/write ports. One large cache with the sum total of read and write ports is much more expensive (area and power) than a tiny fast register file (many read/write ports) plus a small (like 32kiB) L1d cache with a couple of read ports and 1 write port. For the same reason we use split L1 caches, and have multi-level caches instead of just one big private cache per core in modern CPUs. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?


Related reading / background:


  • https://en.wikipedia.org/wiki/Memory_disambiguation - how the CPU handles forwarding from the store buffer to a load, or not if the store was actually younger (later in program order) than this load.
  • https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ - Store-to-Load Forwarding and Memory Disambiguation in x86 Processors. Very detailed test results and technical discussion of store-forwarding, including from narrow loads that overlap with different parts of a store, and near cache-line boundaries. (https://agner.org/optimize/ has some simpler-to-understand but less detailed info about when store-forwarding is slow vs. fast in his microarch PDF.)
  • https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake - modern CPUs dynamically predict memory dependencies for loads when there are earlier stores with unknown address in flight. (i.e. store-address uop not executed yet.) This can result in having to roll back if the prediction is wrong.
  • Globally Invisible load instructions - store forwarding from loads that partially overlap a recent store and partially don't gives us a corner case that sheds some light on how CPUs work, and how it does/doesn't make sense to think about memory (ordering) models. Note that C++ std::atomic can't create code that does this, although C++20 std::atomic_ref could let you do an aligned 4-byte atomic store that overlaps an aligned 8-byte atomic load.
Ierna answered 1/10, 2020 at 2:46 Comment(5)
Thank you for the highly informative and detailed answer. – Covering
Wow, what a nice answer! – Thomey
@MargaretBloom: Thanks. I'd written some answers previously where I intended to explain what a store buffer was and what it was for, but they ended up getting bogged down in specific details and got super technical really quickly. I think this time I managed to write a more beginner-friendly actual intro to the relevant concepts. – Ierna
Typical nice answer. Cache can contain speculative state; hardware transactional memory can be implemented by allowing speculative writes to cache and not making them visible to other agents. However, complicating an already complex concept may not be wise. Even more off-the-wall, MMIO accesses could be cached, in theory, though the complexity of guaranteeing correct behavior would limit the total payoff (many I/O reads have no side effects, and even some writes would be safe, similar to some speculative stack/TLS writes). Cached MMIO is even more of an "unnecessary complication". – Analyzer
Indeed, Intel's TSX does use L1d cache to track the write set of a transaction, at least in the Haswell implementation: realworldtech.com/haswell-tm - David Kanter wrote that before details were confirmed, but his guesswork was correct, IIRC. It's come and gone many times, with microcode updates disabling all or part of it (for correctness in the first two attempts, then later for MDS vulnerabilities, specifically TAA). I'm not sure if the latest CPUs even support it at all. (And if so, maybe only Xeons, since presumably the E-cores in hybrid CPUs never did.) – Ierna
