Will Speculative Execution Follow Into an Expensive Operation?

If I understand branching correctly (x86), the processor will sometimes speculatively take a code path, perform its instructions, and 'cancel' the results if that path turns out to be wrong. What if the operation on the wrong code path is very expensive, like a memory read that causes a cache miss, or some expensive math operation? Will the processor try to perform something expensive ahead of time? How would a processor typically handle this?

if (likely) {
    // do something lightweight (addition, subtraction, etc.)
} else {
    // do something expensive (cache-miss, division, sin/cos/tan etc.)
}
Zellers asked 8/3, 2016 at 0:46
At least as far as I know, it executes a stream of instructions the same way whether it's executing them speculatively or not. In fact, I don't think the fact that code is being executed speculatively is transmitted to the execution units at all. It's afterward (in the retirement unit) that it decides what to retire and what to just throw away. – Mahican
@JerryCoffin I don't understand what it means for an instruction to be executed but not transmitted to the execution units; can you rephrase? Do you mean the speculative instructions don't take up any CPU cycles? – Zellers
Note that branch prediction and speculative execution are two separate techniques. The original title mentioned branch prediction, and the question body then talks about speculative execution. And the instructions that are executed speculatively do of course consume CPU execution resources. – Androcles
@Androcles Oh, sorry; what's the etiquette here, do I edit the title? – Zellers
@user1043761: Yeah, edit the title and/or question body to make it a better post, as long as you don't change it into a different question that invalidates the answer. In this case probably just changing the title is good. – Link
@PeterCordes - Anders Fogh rained on everyone's parade: Negative Result: Reading Kernel Memory From User Mode. As I understand it, speculative execution and that blog post led to the Meltdown and Spectre bugs. – Coh

TL;DR: the impact isn't as bad as you might think, because the CPU no longer has to wait for the slow thing even if it doesn't cancel it. Almost everything is heavily pipelined, so many operations can be in flight at once. The mis-speculated operations don't prevent new ones from starting.


Current x86 designs do not speculate on both sides of a branch at once. They only speculate down the predicted path.

I'm not aware of any specific microarchitecture that does speculate down both sides of a branch under any circumstances, but that doesn't mean there aren't any. I've mostly only read up on microarchitectures (see the tag wiki for links to Agner Fog's microarch guide). I'm sure it's been suggested in academic papers, and maybe even implemented in a real design somewhere.


I'm not sure exactly what happens in current Intel and AMD designs when a branch mispredict is detected while a cache-miss load or store is still pending, or while a divide is occupying the divide unit. Certainly out-of-order execution doesn't have to wait for the result, because no future uops depend on it.

On uarches other than P4, bogus uops in the ROB/scheduler are discarded when a mispredict is detected. From Agner Fog's microarch doc, talking about P4 vs. other uarches:

the misprediction penalty is unusually high for two reasons ... [long pipeline and] ... bogus μops in a mispredicted branch are not discarded before they retire. A misprediction typically involves 45 μops. If these μops are divisions or other time-consuming operations then the misprediction can be extremely costly. Other microprocessors can discard μops as soon as the misprediction is detected so that they don't use execution resources unnecessarily.

uops that are currently occupying execution units are another story:

Almost all execution units except the divider are fully pipelined, so another multiply, shuffle, or whatever can start without cancelling an in-flight FP FMA. (Haswell: 5-cycle latency, two execution units each capable of one per clock, for a total sustained throughput of one per 0.5c. That means max throughput requires keeping 10 FMAs in flight at once, typically with 10 vector accumulators.)

Divide is interesting, though. Integer divide is many uops, so a branch mispredict will at least stop the remaining ones from issuing. FP div is only a single uop instruction, but it's not fully pipelined, especially in older CPUs. It would be useful to cancel an FP div that was tying up the divide unit, but IDK if that's possible. If adding the ability to cancel would have slowed down the normal case, or cost more power, it would probably be left out. It's a rare special case, which probably wasn't worth spending transistors on.
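As a rough illustration of the multiple-accumulators point above (a sketch only, assuming AVX2 + FMA and compilation with something like -march=haswell; it uses 4 accumulators rather than the full 10 just to keep it short):

#include <immintrin.h>
#include <stddef.h>

// Dot product with several independent vector accumulators, so a new FMA can
// issue every cycle instead of waiting ~5 cycles for the previous result.
// (Accumulators needed ~= FMA latency * FMAs started per clock.)
float dot(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    // Combine the accumulators and finish with a scalar horizontal sum.
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += tmp[k];
    for (; i < n; i++) sum += a[i] * b[i];   // scalar tail
    return sum;
}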

x87 fsin or something is a good example of a really expensive instruction. I didn't notice that until I went back to re-read the question. It's microcoded, so even though it has a latency of 47-106 cycles (Intel Haswell), it's also 71-100 uops. A branch mispredict would stop the frontend from issuing the remaining uops, and cancel all the ones that are queued, like I said for integer division. Note that real libm implementations typically don't use fsin and so on because they're slower and less accurate than what can be achieved in software (even without SSE), IIRC.
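A sketch for comparison, assuming GCC-style inline asm on an x86 target (link with -lm): calling fsin directly versus letting libm compute sin in software looks something like this; the libm version is normally the faster and more accurate of the two.

#include <math.h>
#include <stdio.h>

// x87 fsin: one microcoded instruction that expands to many uops internally.
static double fsin_x87(double x) {
    __asm__ ("fsin" : "+t" (x));   // "t" = top of the x87 register stack, st(0)
    return x;
}

int main(void) {
    double x = 1.0;
    printf("x87 fsin : %.17g\n", fsin_x87(x));
    printf("libm sin : %.17g\n", sin(x));   // software implementation, no x87 needed
    return 0;
}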


For a cache miss, the request might be cancelled, potentially saving bandwidth in L3 cache (and maybe main memory). Even if it isn't, the instruction no longer has to retire, so the ROB won't fill up waiting for it to finish. That's normally why cache misses hurt out-of-order execution so much, but here the worst case is just tying up a load or store buffer entry. Modern CPUs can have many outstanding cache misses in flight at once. Often code doesn't make this possible because future operations depend on the result of a load that missed in cache (e.g. pointer chasing in a linked list or tree), so the memory operations can't be pipelined. Even if a branch mispredict doesn't cancel much of an in-flight memory op, it avoids most of the worst effects.
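Roughly, the difference between the serialized case and the parallel case looks like this (an illustrative sketch, not code from the question):

#include <stddef.h>

struct node { struct node *next; long payload; };

// Serialized misses: each load's address depends on the previous load's data,
// so cache misses are exposed one full memory latency at a time.
long sum_list(const struct node *p) {
    long sum = 0;
    while (p) {
        sum += p->payload;
        p = p->next;        // can't start the next load until this one completes
    }
    return sum;
}

// Independent misses: the addresses are known up front, so the CPU can have
// many outstanding cache misses in flight at once.
long sum_array(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}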


I have heard of putting a ud2 (illegal instruction) at the end of a block of code to stop instruction prefetch from triggering a TLB miss when the block is at the end of a page. I'm not sure when that technique is necessary. Maybe if there's a conditional branch that's always actually taken? That doesn't make sense; you'd just use an unconditional branch. There must be something I'm not remembering about when you'd do that.

Link answered 8/3, 2016 at 2:43
Isn't branch prediction something that's normally done at compile time (more specifically, during optimization)? Or is there something inherent in the x86 architecture that can perform branch prediction accurately? – Unutterable
@Qix No, branch prediction is entirely a hardware thing. – Cirrus
@Qix: You can hint the compiler about which way a branch will typically go, which affects its code-layout decisions (so the fast path is mostly not-taken branches, which are slightly better even when predicted correctly, especially for code density in I-cache: jumping around in a function often requires more cache lines of code to be pulled in). P4 had branch-predictor hints, but all other uarches ignore them. See https://mcmap.net/q/14998/-is-it-possible-to-tell-the-branch-predictor-how-likely-it-is-to-follow-the-branch – Link
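In GCC/Clang, the kind of hint described in the comment above is usually written with __builtin_expect; a minimal sketch (the likely/unlikely macro names are just the common convention, not part of the compiler):

// Layout hint only: it tells the compiler which side is the fast path so that
// path can stay on the fall-through side. It does not program the hardware
// branch predictor on modern x86.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int x) {
    if (likely(x >= 0)) {
        return x + 1;        // lightweight fast path, kept inline
    } else {
        return x / 7;        // rarely-taken, more expensive path
    }
}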
Great answer. Memory accesses may also have a "heavy" side effect if you get a page fault (for example when mispredicting a loop exit and running off the end of a page), but since those operations are in the shadow of a clearing branch, they never reach the commit point and so never trigger the fault. By the way, there's a whole discussion about the fact that the wrong path consumes energy; some research has been done on avoiding speculation altogether on hard-to-predict branches (you need to predict that attribute, though :). – Carsick
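The mispredicted-loop-exit case mentioned in that comment looks roughly like this (an illustrative sketch, not code from the thread):

#include <stddef.h>

// The exit branch is predicted "keep looping", so on the last iteration the
// CPU may speculatively issue the load of a[n], one element past the end.
// That wrong-path load is squashed before retirement, so even if a + n sits
// at the edge of an unmapped page it never raises an architecturally
// visible fault.
ptrdiff_t find(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] == key)
            return (ptrdiff_t)i;
    }
    return -1;
}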
