How does branch prediction interact with the instruction pointer?
It's my understanding that at the beginning of a processor's pipeline, the instruction pointer (which points to the address of the next instruction to execute) is updated by the branch predictor after fetching, so that this new address can then be fetched on the next cycle.

However, if the instruction pointer is modified early in the pipeline, wouldn't this affect instructions currently in the execute stage that rely on the old instruction-pointer value? For instance, when executing a call, the current EIP needs to be pushed onto the stack; wouldn't that value be affected when the instruction pointer is updated during branch prediction?

Kwh answered 21/8/2018 at 6:04. Comments (2):
In many pipelined architectures the "program counter" is in some sense bogus: the one software can see has the architecturally correct value, while the logic doing the real heavy lifting uses several other instruction-address values: one or more branch-prediction computations, the actual pointer that goes to fetch memory, etc. ARM is a simple example: "the program counter is two instructions ahead" has not literally been true for a long while, since the pipes are deeper with prediction, yet we still have an r15 that gives the as-designed, instruction-set-level result. - Powerless
A usable (pseudo) register like EIP would have the correct value for the instruction set being used, independent of any latched or combinational addresses used for actual fetching. - Powerless

You seem to be assuming that there's only one physical EIP register that's used by the whole CPU core.

That doesn't work because every instruction that could take an exception needs to know its own address. Or when an external interrupt arrives, the CPU could decide to service the interrupt after any instruction, making that one the architectural EIP. In long mode (x86-64), there are also RIP-relative addressing modes, so call isn't the only instruction that needs the current program-counter as data.

A simple pipelined CPU might have an EIP for each pipeline stage.

A modern superscalar out-of-order x86 associates an EIP (or RIP) with each in-flight instruction (or maybe each uop; but multi-uop instructions have all their uops associated with each other so an instruction can't partially retire.)

Unlike other parts of the architectural state (e.g. EFLAGS, EAX, etc.), the value is statically known after decode; actually it's known even earlier than immediate values, because instruction boundaries are detected in a pre-decode stage (or marked in L1i cache) so that multiple instructions can be fed to multiple decoders in parallel.

The early fetch/decode stages might track only the addresses of 16-byte or 32-byte fetch blocks, but after decode I assume there's an address field in the internal uop representation. For non-branch instructions it might just be a small offset from the previous instruction's address (to save space), so the full address can be calculated whenever it's needed, but we're deep into implementation details here. Out-of-order execution maintains the illusion of instructions running in program order, and they do issue and retire in order (i.e. enter and leave the out-of-order part of the core in order).

Related: the question x86 registers: MBR/MDR and instruction registers makes a similar wrong assumption based on looking at toy CPUs. There is no "current instruction" register holding the machine-code bytes, either. See the links in my answer there for more about out-of-order / pipelined CPUs.


Branch prediction has to work before a block is even decoded: given that we just fetched a block at address abc, we need to predict which block to fetch next. That means predicting the existence (and targets) of jumps within a 16-byte block of instructions that will be decoded in parallel.

Related: Why did Intel change the static branch prediction mechanism over these years?

Hite answered 21/8/2018 at 6:26. Comments (8):
Isn't "a modern superscalar out-of-order x86 has an EIP (or RIP) associated with each instruction" a bit misleading? I believe EIP is just like any other input register when a uop is executed, i.e. the micro-architectural value of EIP is used. - Bifocal
@MargaretBloom: Not really; it wouldn't be stored in the register file, because it's statically known for each instruction at decode time and can't be an output. Control dependencies are handled differently from data dependencies. I did reword that sentence, though, since it didn't sound exactly like what I meant to say. - Hite
Yes, of course! A uArch EIP won't do; I forgot about OoO execution. Each uop must somehow carry its instruction's EIP. That's a lot of space, though; it's probably an offset as you said (maybe even from the architectural EIP, I don't know). - Bifocal
Most instructions never really need to know their own IP, at least not efficiently, so it's entirely possible the address isn't stored with each instruction but is only calculated retroactively, in some not-necessarily-fast way, if an interrupt or exception occurs. Instructions that use the IP directly, like call or RIP-relative addressing, would still get it populated near decode time, though. As you point out, this is far into implementation details and I'm just guessing. - Flor
BTW, I think the fetch/decode unit pretty much has to calculate "exact" addresses for predictions, rather than using a larger granularity like 32 bytes, because (a) right from decode you already need the correct offset into the chunk to decode the instructions properly and to form the correct contiguous instruction stream, and (b) the predictor itself needs the actual address in order to make its next prediction, since there might be several branches within a chunk. So (a) means the window in which a coarse prediction would suffice is small, and (b) means it may be zero. - Flor
It's possible that predictors don't actually behave as I suggest in (b), i.e. that their "tightest" loop predicts only at the level of chunks and they refine the target to feed to the decoders later. That would mean they get fooled by multiple branches in the same chunk with different targets and patterns, which could be testable from software. - Flor
Yeah, I hate the MBR/MDR rubbish from A-level textbooks. Anyway, there's obviously an instruction pointer that controls the IFETCH block fetch, and it's updated after the branch-prediction consultation in one atomic action: either it advances to the next 16-byte boundary, or the BPU sets it to a branch destination address, which may be in the middle of a 16-byte block. I think the L1i ignores the lower 4 bits, as it always fetches at 16-byte granularity, but the lower 4 bits are later indicated to the predecoder so it can shave off those instructions for a new IFETCH block... - Cavicorn
...and if there is no delay at the decoders (i.e. no two decode groups / two complex instructions, which would produce a 1-cycle delay), then the offset will also have to be indicated to the decoders, as the IFETCH block will have been started at a 16-byte boundary. - Cavicorn
