Why does the ARM PC register point to the instruction after the next one to be executed?
According to the ARM Information Center:

In ARM state, the value of the PC is the address of the current instruction plus 8 bytes.

In Thumb state:

  • For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.
  • For all other instructions that use labels, the value of the PC is the address of the current instruction plus 4 bytes, with bit[1] of the result cleared to 0 to make it word-aligned.
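The rules quoted above can be sketched as a small helper. This is a simplified model for illustration only (the function name is made up, and real behavior has more special cases per instruction and architecture version):

```python
# Sketch of the quoted PC-value rules (simplified; illustrative only).

def pc_value(instr_addr, thumb=False, uses_label_other_than_branch=False):
    if not thumb:
        return instr_addr + 8          # ARM state: current instruction + 8
    pc = instr_addr + 4                # Thumb state: current instruction + 4
    if uses_label_other_than_branch:
        pc &= ~0b10                    # clear bit[1] to make it word-aligned
    return pc

print(hex(pc_value(0x8390)))                       # ARM state -> 0x8398
print(hex(pc_value(0x1002, thumb=True,
                   uses_label_other_than_branch=True)))  # -> 0x1004
```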

Simply put, the value of the PC register points to the instruction after the next instruction. That is the part I don't get. Usually (particularly on x86) the program counter register points to the address of the next instruction to be executed.

So, what are the premises underlying that? Conditional execution, maybe?

Cid answered 6/6, 2014 at 22:28 Comment(5)
I'm sure someone more at home with the architecture can give a more detailed explanation, but in short: R15 contains the address of the next instruction to be fetched; due to prefetching, it's (for ARM state) 8 or in some cases 12 bytes ahead of the currently executing instruction.Dumbarton
@JoachimIsaksson In which cases should the value of R15 be the address of the current instruction plus 12 bytes?Cid
@Notlikethat You can read RIP directly on x86-64: lea rax, [rip]. On x86-32, the most direct way is probably with a call instruction, which pushes EIP as the return address. It's nowhere near as exposed as it is on ARM, though, where it can be a src or dst for pretty much any instruction or addressing mode, IIRC.Plastometer
@Peter OK, I concede ;) I suppose I take "register" here to mean "something which can be an operand to an instruction", and my x86 knowledge kinda fades out beyond the 32-bit SSE2 era...Musicale
A related thread: #59405344Notecase
It's a nasty bit of legacy abstraction leakage.

The original ARM design had a 3-stage pipeline (fetch-decode-execute). To simplify the design they chose to have the PC read as the value currently on the instruction fetch address lines, rather than that of the currently executing instruction from 2 cycles ago. Since most PC-relative addresses are calculated at link time, it's easier to have the assembler/linker compensate for that 2-instruction offset than to design all the logic to 'correct' the PC register.
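That fetch/decode/execute arrangement can be illustrated with a toy trace. This is a minimal sketch of the 3-stage model described above, with made-up addresses and 4-byte ARM instructions; it is not a model of any real core:

```python
# Toy 3-stage pipeline trace: the "PC register" is wired to the fetch-stage
# address, so the instruction sitting in execute sees PC as its own address + 8.

def pipeline_trace(start=0x8000, cycles=5):
    trace = []
    for cycle in range(cycles):
        fetch = start + 4 * cycle    # address on the fetch lines == value of PC
        execute = fetch - 8          # instruction two stages (cycles) behind
        if execute >= start:         # skip cycles before the pipeline fills
            trace.append((hex(execute), hex(fetch)))
    return trace

print(pipeline_trace())
# every (execute, pc) pair differs by exactly 8
```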

Of course, that's all firmly on the "things that made sense 30 years ago" pile. Now imagine what it takes to keep a meaningful value in that register on today's 15+ stage, multiple-issue, out-of-order pipelines, and you might appreciate why it's hard to find a CPU designer these days who thinks exposing the PC as a register is a good idea.

Still, on the upside, at least it's not quite as horrible as delay slots. Instead, contrary to what you suppose, having every instruction execute conditionally was really just another optimisation around that prefetch offset. Rather than always having to take pipeline flush delays when branching around conditional code (or still executing whatever's left in the pipe like a crazy person), you can avoid very short branches entirely; the pipeline stays busy, and the decoded instructions can just execute as NOPs when the flags don't match*. Again, these days we have effective branch predictors and it ends up being more of a hindrance than a help, but for 1985 it was cool.
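The branch-versus-predication trade-off can be put in a toy cost model. The cycle counts below are assumptions picked to match the 3-stage pipeline described above, not measurements of any real ARM core:

```python
# Toy cost model (assumed numbers): skipping a short body with a branch costs
# a pipeline refill when the branch is taken, while predicated instructions
# always occupy their slot but retire as NOPs when the condition fails.

FLUSH = 2  # assumed refill cost of a 3-stage pipeline after a taken branch

def branch_cost(cond_true, body_len):
    # One branch instruction, plus either the body or a flush.
    return (1 + body_len) if cond_true else (1 + FLUSH)

def predicated_cost(cond_true, body_len):
    # No branch at all: the body always flows through the pipeline.
    return body_len

for body in (1, 2, 3):
    print(body, branch_cost(False, body), predicated_cost(False, body))
```

For one- or two-instruction bodies the predicated version never loses, which is exactly the "avoid very short branches entirely" argument.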

* "...the instruction set with the most NOPs on the planet."

Musicale answered 7/6, 2014 at 0:10 Comment(22)
Love your answer! May I ask you what do you think about the use of the least significant bit of the PC register to determine CPU state? Isn't that weird?Cid
@Cid It's not the lsb of the PC - that would cause an alignment fault - it's only the lsb of the target address of a bx, blx or bxj instruction that controls an instruction set switch. The current state is indicated in bit 5 of the CPSR.Musicale
Oops! I know about alignment exceptions, but I was thinking that the lsb is just ignored when the PC register is actually fetched. It's clear now, thank you! :)Cid
I have wondered how many times ARM designers have cursed having to keep CPUs compatible with that old behavior. Also, as bad as delay slots are, the worst thing about them is how poorly documented they are, especially with how assemblers deal with them (assemblers often try to hide their existence from you, which seems to be the worst/most confusing thing to do in my opinion).Costate
@MichaelBurr: Whether using the LSB of the PC would cause an alignment fault would depend upon whether fetches are defined as using address [PC] or [PC & ~1]. What I find weird is the requirement to use BX or BLX rather than being able to use, e.g. ldr r15,[r0] to jump to a pointer stored in memory identified by r0. Any idea why that was required on e.g. ARM7-TDMI?Airdry
@Airdry On the early cores, Thumb state involved switching in what was more or less a separate bolted-on pre-decode stage which converted the Thumb encoding into the equivalent ARM encoding and fed it into the start of the regular ARM pipeline. Note "Thumb instruction controller", and that ARM7DMI (no T) was also a thing. The ARMv7 architecture (once Thumb-2 was commonplace) had a big cleanup and did redefine most writes to the PC to be interworking.Musicale
@Notlikethat: The documentation I used when writing code for the ARM7 said the BX was required, and the compiler I was using generated it.Airdry
@Airdry Well, on ARM7 it is required (presumably because there's only so much pipeline state it's practical to wire out to control an optional external block). Don't confuse the ARMv7 architecture version (implemented by Cortex-A8 and later cores) with the ARM7 core (implementing the ARMv4 architecture).Musicale
@Notlikethat: Awesome nomenclature (grrr). One thing I've sometimes pondered is whether it might have been useful, instead of having discrete thumb/arm modes, to repurpose some ranges of opcode space for "instruction pairs". There would be a lot fewer than 65536 possible opcodes for the paired instructions, but with such an approach they could be more freely intermixed with ordinary ARM code so one wouldn't need as many different opcodes.Airdry
@Airdry ...funnily enough, the ARM architects thought more or less exactly that a decade ago. What you've done there is invent the way that 32-bit Thumb-2 encodings work :PMusicale
@Notlikethat: The 32-bit thumb encodings don't have a rigid pairing of instructions, and thus require an extra PC bit and the ability to form instructions using half of each of two words. I was pondering a simplified form which would steal half the opcode space from the branch instructions (a lot of programs will never need branches anywhere near that big) and always execute both instructions of a pair together, thus avoiding the need to add an extra bit to the program counter.Airdry
"it's hard to find a CPU designer these days who thinks exposing the PC as a register is a good idea." Hmm why does RISC-V apparently expose a PC then?Jello
Soundcloud link is broken. *sigh* linkrot.Avruch
@supercat, years later now, your wish came true: in RISC-V the encoding indicates the size, so you can have various-sized instructions back to back that execute right through, with no need to change modes like ARM or MIPS to get between the 32- and 16-bit instructions.Gramercy
When I started I had to wait for gcc to switch from mov pc,lr and add thumb support, but then they did, and then the linker got better and started adding trampolines for you, etc. And as mentioned above, from armv4 to armv5 to armv6 to armv7, each step added more instructions that supported mode switching if the pc was the destination.Gramercy
Whether it is a few stages or not, these days the idea that the pc reflects the current instruction, the next, etc. is fake: there are multiple pc's, if you will, for fetching, doing math, branch destinations, etc. So you can fake it to be the current instruction, the next instruction (extremely common), or, this way, two ahead (well, except for thumb2).Gramercy
@old_timer: The Cortex-M3 adds two-word encodings for most of the instructions that were omitted from Thumb, but requires a fair bit of hardware complexity to allow instructions to span word boundaries. I was envisioning a design that would retain the hardware simplicity of having 32-bit instructions be word-aligned, but include some "do two operations" instructions to improve code density. Incidentally, another feature I would have liked to have seen would have been to have a source format which would indicate that some operand bits should come from the next instruction word (which...Airdry
I wouldn't agree that it was a case of being omitted from thumb; more that later down the road they thought of extending thumb and adding it to the full-sized cores, with the feature that the instructions don't have to be aligned. We can each interpret it however we like though.Gramercy
...should already have been fetched by the time the current instruction has reached the "execute" stage of the pipeline). If one needs to e.g. add a 32-bit constant to a value or use one as a mask, being able to do so with two words and two cycles would be better than having to use a two-cycle "load" instruction to fetch the constant and use up a register for it.Airdry
armv6-m added the first thumb2 instructions (around 20), then armv7-m (cortex-m3) added around 100 more, or maybe it was 150; I hand-counted them once. The m3 hit the streets before the m0...Gramercy
@old_timer: When the Thumb instruction set was invented, the design intention was that it be available as a mode on machines that would also have the ARM instruction set. Speed-sensitive parts of a program which would benefit from being able to have instructions which could combine two arbitrary source registers, one with a shift applied, would be compiled to use ARM, while parts that were less speed sensitive or would receive less benefit from all those features would use Thumb mode. Most of the instructions that are present on Cortex-M3 but not Cortex-M0 were present in the 32-bit ARM.Airdry
@old_timer: The hardware complexity of using 32-bit instruction words but allowing some of them to encode two operations would likely have been comparable to that of processing Cortex-M0 instructions. The Cortex-M3 approach may be nicer to program for than what I envisioned, but what I envisioned would be nicer than the Cortex-M0, with a cost that I would expect would have been much closer to the latter than the former.Airdry
That's true...

One example is below, a C program:

int f,g,y;//global variables
int sum(int a, int b){
     return (a+b);
}
int main(void){
    f = 2;
    g = 3;
    y = sum(f, g);
    return y;
}

compile to assembly:

    00008390 <sum>:
int sum(int a, int b) {
return (a + b);
}
    8390: e0800001 add r0, r0, r1
    8394: e12fff1e bx lr
    00008398 <main>:
int f, g, y; // global variables
int sum(int a, int b);
int main(void) {
    8398: e92d4008 push {r3, lr}
f = 2;
    839c: e3a00002 mov r0, #2
    83a0: e59f301c ldr r3, [pc, #28] ; 83c4 <main+0x2c> 
    83a4: e5830000 str r0, [r3]
g = 3;
    83a8: e3a01003 mov r1, #3
    83ac: e59f3014 ldr r3, [pc, #20] ; 83c8 <main+0x30>
    83b0: e5831000 str r1, [r3]
y = sum(f,g);
    83b4: ebfffff5 bl 8390 <sum>
    83b8: e59f300c ldr r3, [pc, #12] ; 83cc <main+0x34>
    83bc: e5830000 str r0, [r3]
return y;
}
    83c0: e8bd8008 pop {r3, pc}
    83c4: 00010570 .word 0x00010570
    83c8: 00010574 .word 0x00010574
    83cc: 00010578 .word 0x00010578

See the above LDRs' use of the PC value: here it is used to load the addresses of the global variables f, g, and y into r3.

    83a0: e59f301c ldr r3, [pc, #28] ; 83c4 <main+0x2c>
    PC = 0x83c4 - 28 = 0x83c4 - 0x1C = 0x83a8

The PC's value is exactly the instruction after the next one, relative to the currently executing instruction. ARM uses 32-bit instructions but byte addressing, so +8 means 8 bytes, i.e. two instructions' length.
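The arithmetic from the listing can be checked directly (a quick sanity check, using only the addresses and the #28 immediate from the disassembly above):

```python
# Re-checking the literal-pool arithmetic: the ldr at 0x83a0 reads [pc, #28],
# and in ARM state the PC reads as the instruction's own address plus 8.

ldr_addr = 0x83a0
offset = 28                # the #28 immediate encoded in e59f301c
pc_seen = ldr_addr + 8     # PC is two instructions (8 bytes) ahead
target = pc_seen + offset

print(hex(target))  # 0x83c4, the .word holding f's address
```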

So, attached is the ARM architecture's 5-stage pipeline: fetch, decode, execute, memory, writeback.

[Image: ARM's 5-stage pipeline]

The PC register is incremented by 4 each clock, so by the time an instruction has bubbled down to execute (the "current" instruction), the PC is already 2 clocks ahead: it reads as +8. That actually means: the PC points at the instruction being fetched, while the "current" instruction is the one being executed, so the PC points at the next-next instruction to be executed.

BTW: the picture is from Harris's book Digital Design and Computer Architecture, ARM Edition.

Proselytism answered 10/4, 2018 at 10:41 Comment(1)
the OP asked why it's like that, not if it is true or notMandatory

© 2022 - 2024 — McMap. All rights reserved.