x86 Program Counter abstracted from microarchitecture?
I'm reading the book The RISC-V Reader: An Open Architecture Atlas. The authors, to explain the isolation of an ISA (Instruction Set Architecture) from a particular implementation (i.e., microarchitecture) wrote:

The temptation for an architect is to include instructions in an ISA that helps performance or cost of one implementation at a particular time, but burden different or future implementations.

As far as I understand, it states that when designing an ISA, the ISA should ideally refrain from exposing the details of a particular microarchitecture that implements it.


Keeping the quote above in mind: When it comes to the program counter, on the RISC-V ISA, the program counter (pc) points to the instruction being currently executed. On the other hand, on the x86 ISA, the program counter (eip) does not contain the address of the instruction being currently executed, but the address of the one following the current instruction.

Is the x86 Program Counter abstracted away from the microarchitecture?

Uproot answered 23/7, 2019 at 19:48 Comment(1)
working on an answer, but no, x86 instruction decoding already needs to know the start and end address of an instruction to decode + execute it anyway. It's not like ARM where PC = 2 instructions ahead; that is exposing the pipelined fetch/decode. It's not really exposing anything for call to push a return address. Until x86-64 RIP-relative addressing, that was basically the only way to read EIP.Donica

I'm going to answer this in terms of MIPS instead of x86, because (1) MIPS and x86 have a similarity in this area, and because (2) RISC V was developed by Patterson, et al, after decades of experience with MIPS.  I feel these statements from their books are best understood in this comparison, because x86 and MIPS both encode branch offsets relative to the end of the instruction (pc+4 in MIPS).

In both MIPS and x86, PC-relative addressing modes were only found in branches in early ISA versions. Later revisions added PC-relative address calculation (e.g. MIPS auipc or x86-64's RIP-relative addressing mode for LEA or load/store). These are all consistent with each other: the offset is encoded relative to (one past) the end of the instruction (i.e. the next instruction start) — whereas, as you're noting, in RISC V, the encoded branch offset (and auipc, etc..) is relative to the start of the instruction instead.
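To make the two conventions concrete, here is a toy sketch in Python (the helper names and the 2-byte instruction length are mine, purely for illustration; MIPS's word-scaled branch offsets are ignored):

```python
def branch_target_end_relative(insn_start, insn_len, offset):
    """x86/MIPS style: the encoded offset is relative to the end of
    the instruction, i.e. the next instruction's start address."""
    return insn_start + insn_len + offset

def branch_target_start_relative(insn_start, offset):
    """RISC-V style: the encoded offset is relative to the start of
    the instruction; its length is not needed for the calculation."""
    return insn_start + offset

# A 2-byte branch at 0x100 with encoded offset 0x10:
print(hex(branch_target_end_relative(0x100, 2, 0x10)))   # 0x112
print(hex(branch_target_start_relative(0x100, 0x10)))    # 0x110
```

Note that the start-relative form needs only one adder (pc + offset), while the end-relative form needs the instruction length as an extra input, which is the datapath difference discussed below.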

The value of this is that it removes an adder from certain datapaths, and sometimes one of these datapaths can be on the critical path, so for some implementations this minor shortening of the datapath means a higher clock rate.

(RISC V, of course, still has to produce pc + 4 for pc-next and for the return address of call instructions, but that is much less on the critical path.  Note that neither of the diagrams below shows the capture of pc+4 as a return address.)


Let's compare hardware block diagrams:

[Figure: MIPS datapath (simplified)]

[Figure: RISC V datapath (simplified)]

You can see on the RISC V datapath diagram that the line tagged #5 (in red, just above the control oval) bypasses the adder (#4, which adds 4 to the pc for pc-next).


Attribution for diagrams


Why did x86 / MIPS make that different choice back in their initial versions?

Of course, I can't say for sure.  What it looks like to me is that there was a choice to be made and it simply didn't matter for the earliest implementations, so they probably were not even aware of the potential issue.  Almost every instruction needs to compute instruction-next anyway, so this probably seemed like the logical choice.

At best, they might have saved a few wires, as pc-next is indeed required by other instructions (e.g. call) and pc+0 is not necessarily otherwise needed.

An examination of prior processors might show this was just the way things were done back then, so this might have been more of a carry over of existing methods rather than a design choice.

8086 is not pipelined (other than the instruction prefetch buffer) and variable-length decoding has already found the end of an instruction before it starts to execute.

With years of hindsight, this datapath issue is now addressed in RISC V.

I doubt they made the same level of conscious decision about this, as was done for example, for branch delay slots (MIPS).


As per discussion in comments, 8086 may not have had any exceptions that push the instruction start address. Unlike on later x86 models, divide exceptions pushed the address of the instruction after div/idiv. And in 8086, interrupt-resume after cs rep movsb (or other string instruction) pushed the address of the last prefix, not the whole instruction including multiple prefixes. This "bug" is documented in Intel's 8086 manual (scanned PDF). So it's quite possible 8086 really didn't record the instruction start address or length, only the address where decoding finished before starting execution. This was fixed by at least 286, maybe 186, but applies to all 8086 / 8088 CPUs.

MIPS had virtual memory from the start, so it did need to be able to record the address of a faulting instruction so it could be rerun after exception-return. Plus software TLB-miss handling also required re-running a faulting instruction. But exceptions are slow and flush the pipeline anyway, and aren't detected until well after fetch, so presumably some calculation would be needed regardless.

Archivist answered 23/7, 2019 at 21:33 Comment(14)
even first-gen x86 (8086) pipelined instruction prefetch separate from the rest of the non-pipelined decode/exec CPU internals. But it could be multiple instructions ahead; and doesn't know about instruction boundaries, so it isn't necessarily still holding the next-instruction fetch address when a call needs to read it. But decode did already have to work out how long an instruction was as part of decoding. (Or more likely, just record its start and end address). If 8086 had any exceptions that push the address of the faulting instruction (like 386 #PF), both were potentially needed.Donica
I'm pretty sure some possible exceptions even on 8086 need the instruction-start address, like probably divide exception #DE for divide by zero or other overflow of the quotient. #SS or other segment limit exceptions probably weren't possible with only 16-bit addressing modes and segment limits implicitly fixed at 64kiB. And call + exceptions from int 0x21 or whatever need the end address. 8086 didn't have a #UD (undefined instruction) exception, so I guess using a lock prefix with instructions where it didn't apply wouldn't do anything, unlike now where that's documented to #UD.Donica
I don't see why that adder would affect performance in any way. It's not like the address of the next instruction is needed before the instruction gets fetched. So the adder works in parallel with instruction fetch. Is there any study on this? This answer looks wrong.Extrauterine
@HadiBrais, this won't be on the critical path in many implementations (such as those described by the above simplified single cycle, non-pipelined diagrams), but this simplification could help some timings in alternative implementations.Archivist
Well then your answer should discuss at least one of these alternative implementations to back up your claim. I can't think of one case where the way PC is defined in RISC-V has any advantage (in terms of performance, energy, or area) over the way it is defined in x86. It's really just an architectural characteristic of the ISA and may influence the design of the ISA I guess (but not the implementation in any significant way).Extrauterine
@Peter Cordes: Divide exceptions on the 8086/8088 did not point to the faulting instruction. css.csail.mit.edu/6.858/2014/readings/i386/s14_07.htm "On the 8086/8088, the CS:IP value points to the next instruction."Picro
@ecm: Oh interesting! Are there any exceptions that push the faulting instruction's address on 8086? If not, maybe 8086 never needs the instruction start address and simply uses the address that decoding stopped at. That would make sense for its tiny transistor budget; we also know 8086 doesn't have any instruction-length limits and will keep decoding a whole 64kiB CS segment of prefixes. I guess I need to fix my edit to this answer.Donica
@Peter Cordes: I think the start of the instruction (or rather, start of first prefix) is used when a repeated string op is interrupted. (Which has the famous bug on original generations of dropping all but the last prefix. That is, if "rep cs movsw" is interrupted, the processor will restart with "cs movsw" having lost the rep prefix. But that was considered a bug and fixed in later generations of the processor.)Picro
@ecm: Oh yes, 8086 still had interruptible string instructions. I wonder if Intel tried to be too cute there in the first design, and maybe instead of recording the real instruction start, just calculated the exception address from the end address assuming only a rep prefix. i.e. designed the HW/microcode to take an interrupt with a hardcoded IP-=2 which works for rep + opcode but not cs + rep + opcode or anything else. Presumably the later stepping had to spend more instructions. That first stepping also had a bug where mov ss, src didn't properly delay interrupts until after nextDonica
@Peter Cordes: Writing of which, do you know whether the lack of interrupt lockout for mov/pop to ss also causes a trace interrupt before the subsequent mov to sp, with TF=1 ? That'd be very bad news for debuggers.Picro
@ecm: no clue, you'd have to ask MichaelPetch. I have no practical experience with original 8086 and both those interrupt bugs you and I mentioned are things I learned from his comments. IIRC it was only in some circumstances that an interrupt could happen, not always, or Intel probably would have caught it in validation.Donica
@Peter Cordes: Link for reference #47278202Picro
@ecm: Oh yes, right it's only early 8088. I'd guess it doesn't apply to synchronous interrupts like TF, only external. But clearly Intel didn't have nearly the amount of validation testing they do for modern x86 where they can simulate everything, as well as having the budget to fuzz real hardware with randomized instruction streams.Donica
@ecm: Unlike that early-8088 mov ss bug, the last-prefix-only interrupt resume applies to all 8086 CPUs. I was thinking later steppings of 8086, but no you were right, not until later generations like 286 (or maybe 186), according to this forum thread. I re-edited that part of this answer.Donica

As far as I understand, it states that when designing an ISA, the ISA should ideally refrain from exposing the details of a particular microarchitecture that implements it.

If your metric for an ideal ISA is simplicity, then I might agree with you. But in some cases, it can be beneficial to expose some characteristics of the microarchitecture through the ISA to improve performance, and there are ways to make the burden of doing that negligible. Consider, for example, the software prefetch instructions in x86. The behavior of these instructions is architecturally defined to be microarchitecturally dependent: Intel can even design a future microarchitecture where these instructions behave as no-ops, without violating the x86 spec. The only burden there is defining the functionality of these instructions1. However, if a prefetch instruction were architecturally defined to prefetch a 64-byte aligned line into the L3 cache, with no CPUID bit to make support optional, then supporting such an instruction might indeed become a substantial burden in the future.

Is the x86 Program Counter abstracted away from the microarchitecture?

Before it was edited by @InstructionPointer, your question referred to the "first implementation" of x86, which is the 8086. This is a simple processor with two pipeline stages: fetch and execute. One of the architectural registers is IP, defined to contain the 16-bit offset (from the code segment base) of the next instruction. So the architectural value of IP at every instruction equals that instruction's offset plus its size.

How is this implemented in the 8086? There is actually no physical register that stores the architectural IP value. There is a single physical instruction pointer register, but it points to the next 16 bits to be fetched into the instruction queue, which can hold up to 6 bytes (see: https://patents.google.com/patent/US4449184A/en). If the instruction currently being executed is a control transfer instruction, the target address is calculated on the fly from the relative offset encoded in the instruction, the current value of the physical IP, and the number of valid bytes in the instruction queue. For example, if the relative offset is 15, the physical IP is 100, and the instruction queue contains 4 valid bytes, then the target offset is: 100 - 4 + 15 = 111. The physical address can then be calculated by adding the 20-bit code segment address. Clearly, the architectural IP does not expose any of these microarchitectural details. In modern Intel processors, there can be many instructions in flight, so each instruction needs to carry with it enough information to reconstruct its own address or the address of the following instruction.
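The on-the-fly target calculation described here can be sketched in Python, using the same numbers as the example (the function names are mine; this illustrates the arithmetic, not the actual 8086 circuitry):

```python
def target_offset_8086(physical_ip, queue_valid_bytes, rel_offset):
    # The prefetch pointer has run ahead of execution by however many
    # bytes are sitting in the prefetch queue, so the architectural IP
    # (the next instruction's offset) is physical_ip - queue_valid_bytes.
    arch_ip = physical_ip - queue_valid_bytes
    return (arch_ip + rel_offset) & 0xFFFF  # 16-bit offset within CS

def physical_address(segment, offset):
    # The 8086 forms a 20-bit physical address as segment*16 + offset.
    return ((segment << 4) + offset) & 0xFFFFF

print(target_offset_8086(100, 4, 15))  # 111, as in the example above
```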

What if the x86 architectural IP were defined to point to the current instruction instead of the next one? How would this impact the design of the 8086? Well, the relative offset in a control transfer instruction would become relative to the offset of the current instruction, not the next one. In the previous example, we would have to subtract the length of the current instruction from 111 to get the target offset. So additional hardware might be needed to track the size of the current instruction and include it in the calculation. But in such an ISA, we could define all control transfer instructions to have a uniform length2 (other instructions could still be variable-length), which eliminates most of that overhead. I can't think of a realistic example where defining the program counter one way is significantly better than the other. However, it may influence the design of the ISA.
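That hypothetical start-relative definition can be sketched the same way, continuing the numbers from the previous example (the 2-byte instruction length is an arbitrary choice of mine for illustration):

```python
def target_offset_start_relative(arch_ip_next, insn_len, rel_offset):
    # arch_ip_next is the next instruction's offset (what the 8086
    # effectively tracks); to get the *current* instruction's start we
    # must also know its length -- the extra piece of state this
    # convention would require the hardware to carry.
    insn_start = arch_ip_next - insn_len
    return (insn_start + rel_offset) & 0xFFFF

# Continuing the example: the architectural IP of the next instruction
# is 96 (physical IP 100 minus 4 queued bytes); if the branch itself
# is 2 bytes long and the encoded offset is 15:
print(target_offset_start_relative(96, 2, 15))  # 109
```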


Footnotes:

(1) The decoders may still have to be able to recognize that the prefetch instructions are valid and emit the corresponding uops. However, this burden is not a consequence of defining microarchitecturally-dependent instructions, but rather of defining new instructions, irrespective of their functionality.

(2) Alternatively, the length of the current instruction can be stored in a tiny register. IIRC, the maximum instruction length in the 8086 is 6 bytes, so it takes at most 3 bits to store the length of any instruction. This overhead is very small even for the 8086 days.

Extrauterine answered 24/7, 2019 at 18:16 Comment(7)
8086 decodes prefixes separately (1 cycle at a time) and has no limit on total instruction length. e.g. a 64kiB CS segment full of rep prefixes will IIRC loop forever, whether or not there's an opcode in there or just prefixes. But yes, something like 6 bytes is I think the upper limit not counting any prefixes. Opcode + modrm + disp16 + imm16. Fun fact: 8088 only had a 4-byte prefetch buffer, down from 6 in 8086, but apparently no circuit differences outside the bus interface. So that prefetch buffer wasn't also a decode buffer, really just prefetch.Donica
@PeterCordes Aha, what about the sizes of the control transfer instructions (call and jmp)? Is there any limit on their lengths? The fetch unit really only needs to maintain the length of control transfer instructions. The length of any other instruction can be considered as zero as far the as the fetch unit is concerned.Extrauterine
felixcloutier.com/x86/call call far ptr16:16 is 5 bytes: opcode + new_IP + new_CS. It has to push a CS:IP return address even though the branch target itself is absolute, not relative. With repeated segment-override prefixes, a call [mem] can be arbitrary length. Or I guess with useless prefixes on a call rel16 it could also be any length. That's probably a good reason for x86 calculating from the end, not the start!Donica
All of the reasoning in your answer is of course very different for a fixed-instruction-width ISA like RISC-V where you can calculate the start of an instruction given the end address, or calculate as far ahead as you want (assuming no branches) with an adder that runs in parallel. 8086 was clearly not designed with a superscalar implementation in mind (and later complexity added to the variable length encoding led to the current disaster). Probably even a pipelined CISC implementation wasn't on the radar for 8086; that didn't happen until 486 and 586.Donica
@PeterCordes Despite the arbitrary length, the length is still limited by the maximum size of the code segment. So a 16-bit length register would be required in the case of 8086, if the program counter was defined to point to the current instruction. Or the ISA could have been originally defined to impose a hard limit on the maximum length (e.g., 7 or 15 bytes).Extrauterine
Indeed. Taking and holding a 16-bit snapshot (before decoding starts) of the instruction-start address would probably be more sane than accumulating a length. Hmm, I wonder how 8086 handled async interrupts while churning through redundant lock, rep, and segment prefixes. I wonder if the mechanism is related to the cs/es/ss rep movs bug (which @Picro brought up) in some 8086 CPUs where the interrupt-return address only points at the last prefix, changing the meaning of the instruction on resume. Only string instructions are normally interruptible, AFAIK; maybe prefix-decoding isn't.Donica
Hmm, I wonder how the 8086 fetch stage tracks IP wrap-around. I was wondering if it would just use a 20-bit linear address, but then it would need to re-steer if IP wrapped around to the start of CS.Donica
