Any reason to use BX R over MOV pc, R except Thumb interworking pre-ARMv7?

Linux defines an assembler macro to use BX on CPUs that support it, which makes me suspect there is some performance reason.
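
For reference, a simplified sketch of the kind of macro meant here, modeled on the ret macro in the kernel's arch/arm/include/asm/assembler.h (the real macro also generates conditional variants such as reteq, and details vary by kernel version):

        .macro  ret reg
#if __LINUX_ARM_ARCH__ < 6
        mov     pc, \reg        @ older targets keep the plain mov
#else
        .ifeqs  "\reg", "lr"
        bx      \reg            @ bx lr can engage the return-address predictor
        .else
        mov     pc, \reg
        .endif
#endif
        .endm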

This answer and the Cortex-A7 MPCore Technical Reference Manual also state that it helps with branch prediction.

However, my benchmarking efforts have not found a performance difference on ARM1176, Cortex-A17, Cortex-A72 and Neoverse-N1 CPUs.

Is there, therefore, any reason to prefer BX over MOV pc on CPUs that have an MMU and implement the 32-bit ARM instruction set, other than interworking with Thumb code?
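
For context, the two return forms compared have the same architectural effect when returning to ARM code; on ARMv7, a MOV to pc from a register also interworks, so even the Thumb difference disappears there:

        bx   lr            ; branch to lr, entering Thumb state if bit 0 of lr is set
        mov  pc, lr        ; on ARMv7 this interworks too; on ARMv6 and earlier it does not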

Edited to add benchmark code, all aligned to 64 bytes:

Perform useless calculations on lr and return using BX:

div_bx  ; the return address (lr) is the result of a dependent mul/udiv chain
        mov  r9, #2
        mul  lr, r9, lr
        udiv lr, lr, r9
        mul  lr, r9, lr
        udiv lr, lr, r9
        bx   lr

Perform useless calculations on another register and return using BX:

div_bx2 ; lr is available immediately; the useless chain runs on r3
        mov  r9, #2
        mul  r3, r9, lr
        udiv r3, r3, r9
        mul  r3, r9, r3
        udiv r3, r3, r9
        bx   lr

Perform useless calculations on lr and return using MOV:

div_mov ; same dependent chain on lr as div_bx, but returns with MOV
        mov  r9, #2
        mul  lr, r9, lr
        udiv lr, lr, r9
        mul  lr, r9, lr
        udiv lr, lr, r9
        mov  pc, lr

Call using the classic function-pointer sequence:

movmov
        push {lr}
loop    mov  lr, pc        ; pc reads as '.'+8: lr = the instruction after the next mov
        mov  pc, r1        ; indirect jump with no call hint for the predictor
        mov  lr, pc
        mov  pc, r1
        mov  lr, pc
        mov  pc, r1
        mov  lr, pc
        mov  pc, r1
        subs r0, r0, #1
        bne  loop
        pop  {pc}

Call using BLX:

blx
        push {lr}
loop    nop
        blx  r1            ; indirect call: the predictor can pair it with the bx lr return
        nop
        blx  r1
        nop
        blx  r1
        nop
        blx  r1
        subs r0, r0, #1
        bne  loop
        pop  {pc}

Removing the nops makes it slower.

Results, in seconds per 100,000,000 loops:

Neoverse-N1 r3p1 (AWS c6g.medium)
           mov+mov   blx 
div_bx        5.73  1.70 
div_mov       5.89  1.71 
div_bx2       2.81  1.69 

Cortex-A72 r0p3 (AWS a1.medium)
           mov+mov   blx 
div_bx        5.32  1.63 
div_mov       5.39  1.58 
div_bx2       2.79  1.63 

Cortex-A17 r0p1 (ASUS C100P)
           mov+mov   blx 
div_bx       12.52  5.69 
div_mov      12.52  5.75 
div_bx2       5.51  5.56 

It appears the three processors benchmarked above all recognise both mov pc, lr and bx lr as return instructions. However, the ARM1176 in the Raspberry Pi 1 is documented as having return prediction that recognises only BX lr and certain loads as return instructions, yet on it I find no evidence of return prediction at all.
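
For reference, a return-address stack special-cases a small set of return idioms. As a hedged illustration (the exact set is core-specific and listed in each core's TRM), the ARM-state forms such predictors recognise typically look like:

        bx      lr                      @ branch-exchange on the link register
        ldr     pc, [sp], #4            @ pop a single return address
        ldmia   sp!, {r4-r11, pc}       @ pop multiple registers including pc

The program below times each return sequence when called with a real BL, which can push a return address onto such a stack, and when called with MOV lr, pc followed by a plain B, which cannot; the Difference column therefore isolates any benefit from return prediction.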

header: .string "      Callee      BL       B  Difference"
format: .string "%12s %7i %7i %11i\n"
        .align

        .global main
main:   push    {r3-r5, lr}
        adr     r0, header
        bl      puts

        @ Warm up
        bl      clock
        mov     r0, #0x40000000
1:      subs    r0, r0, #1
        bne     1b
        bl      clock

        .macro  run_test test
2:      bl      1f              @ one warm-up call to the sequence under test
        nop
        bl      clock
        mov     r4, r0
        ldr     r0, =10000000
        .balign 64
3:      mov     lr, pc          @ dead: overwritten by the bl below; pads to match loop 5
        bl      1f
        nop
        mov     lr, pc
        bl      1f
        nop
        mov     lr, pc
        bl      1f
        nop
        subs    r0, r0, #1
        bne     3b
        bl      clock
        mov     r5, r0
        ldr     r0, =10000000

        .balign 64
5:      mov     lr, pc          @ pc reads as '.'+8, so lr = the nop after the b
        b       1f
        nop
        mov     lr, pc
        b       1f
        nop
        mov     lr, pc
        b       1f
        nop
        subs    r0, r0, #1
        bne     5b
        bl      clock
        sub     r2, r5, r4      @ ticks for the BL loop
        sub     r3, r0, r5      @ ticks for the B loop
        sub     r0, r3, r2      @ difference: B minus BL
        str     r0, [sp]
        adr     r1, 4f
        ldr     r0, =format
        bl      printf
        b       2f
        .ltorg
4:      .string "\test"
        .balign 64
1:
        .endm

        run_test mov
        mov     lr, lr          @ filler; keeps the return as the second instruction
        mov     pc, lr

        run_test bx
        mov     lr, lr
        bx      lr

        run_test mov_mov
        mov     r2, lr
        mov     pc, r2

        run_test mov_bx
        mov     r2, lr
        bx      r2

        run_test pp_mov_mov
        push    {r1-r11, lr}
        pop     {r1-r11, lr}
        mov     r12, lr
        mov     pc, r12

        run_test pp_mov_bx
        push    {r1-r11, lr}
        pop     {r1-r11, lr}
        mov     r12, lr
        bx      r12

        run_test pp_mov_mov_f
        push    {r0-r11}
        pop     {r0-r11}
        mov     r12, lr
        mov     pc, r12

        run_test pp_mov_bx_f
        push    {r0-r11}
        pop     {r0-r11}
        mov     r12, lr
        bx      r12

        run_test pp_mov
        push    {r1-r11, lr}
        pop     {r1-r11, lr}
        mov     r12, lr
        mov     pc, lr

        run_test pp_bx
        push    {r1-r11, lr}
        pop     {r1-r11, lr}
        mov     r12, lr
        bx      lr

        run_test pp_mov_f
        push    {r0-r11}
        pop     {r0-r11}
        mov     r12, lr
        mov     pc, lr

        run_test pp_bx_f
        push    {r0-r11}
        pop     {r0-r11}
        mov     r12, lr
        bx      lr

        run_test add_mov
        nop
        add     r2, lr, #4
        mov     pc, r2

        run_test add_bx
        nop
        add     r2, lr, #4
        bx      r2

2:      pop     {r3-r5, pc}

Results on Cortex-A17 are as expected:

      Callee      BL       B  Difference
         mov   94492  255882      161390
          bx   94673  255752      161079
     mov_mov  255872  255806         -66
      mov_bx  255902  255796        -106
  pp_mov_mov  506079  506132          53
   pp_mov_bx  506108  506262         154
pp_mov_mov_f  439339  439436          97
 pp_mov_bx_f  439437  439776         339
      pp_mov  247941  495527      247586
       pp_bx  247891  494873      246982
    pp_mov_f  230846  422626      191780
     pp_bx_f  230850  422772      191922
     add_mov  255997  255896        -101
      add_bx  255900  256288         388

However, the results on my Raspberry Pi 1 with ARM1176, running Linux 5.4.51+ from Raspberry Pi OS, show no advantage for predictable instructions:

      Callee      BL       B  Difference
         mov  464367  464372           5
          bx  464343  465104         761
     mov_mov  464346  464417          71
      mov_bx  464280  464577         297
  pp_mov_mov 1073684 1074169         485
   pp_mov_bx 1074009 1073832        -177
pp_mov_mov_f  769160  768757        -403
 pp_mov_bx_f  769354  769368          14
      pp_mov  885585 1030520      144935
       pp_bx  885222 1032396      147174
    pp_mov_f  682139  726129       43990
     pp_bx_f  682431  725210       42779
     add_mov  494061  493306        -755
      add_bx  494080  493093        -987
Leviticus answered 9/8, 2020 at 0:0 Comment(6)
div is often a poor choice for a throughput benchmark because it's not fully pipelined, so correct branch prediction to allow out-of-order execution doesn't help as much. But clearly there was still an effect; interesting. – Autoradiograph
Making the calls with indirect branches (blx r1) means those indirect branches need to be correctly predicted. (Even direct branches need some prediction on pipelined superscalar CPUs to avoid fetch bubbles, but indirect is harder.) Probably the CPU has limited ability to handle multiple predictions within one aligned 8-byte chunk of machine code, which is why spacing them out with nops helps. Effects like this are not rare in general; e.g. some x86 CPUs I'm familiar with have limitations like that on their predictors. – Autoradiograph
Why are you using 32-bit ARM? Thumb2 should always be faster. arm-thumb-interworking-confusion-regarding-thumb-2. Also, the commit message says "This allows us to detect the 'mov pc, lr' case and fix it up"... most likely for kprobes. On modern cores like ARM1176, Cortex-A17, Cortex-A72 and Neoverse-N1, Thumb2 will be faster than 32-bit ARM unless you have some extremely fast memory (almost zero chance such a system exists). – Knotting
@artlessnoise Because porting 900K lines of pre-UAL assembly (mostly dating from 1985-1995) is a major undertaking. – Leviticus
Then I think you have taken the Linux header out of context. Linux has no issue building with Thumb2. For ARMv5/ARMv6 machines, bx may be faster. Since every ARMv7 has Thumb2, which is more efficient than 32-bit ARM, most people will use that for ARMv7 (or even v6). Converting 900K lines should not be that difficult, as most of the assembler is identical; unless there are significant conditional-execution opcodes (addcs, subgt, etc.). You will get a speed-up by doing this. – Knotting
If you are concerned about performance, bx is not an issue in your case. – Shrew

If you're testing simple cases where mov pc, ... always jumps to the same return address, regular indirect-branch prediction might do fine.

I'd guess that bx lr might use a return-address predictor that assumes matched call/return pairs (blx / bx lr) to correctly predict returns to varying call sites, without also wasting space in the normal indirect-branch predictor.

According to Timothy's testing on Cortex-A17, Cortex-A72 and Neoverse-N1, on those CPUs mov pc, lr is recognized as a return idiom that can pair with blx. So the guess in this answer appears to be wrong for those CPUs.


To test this hypothesis, try something like

testfunc:
   bx lr         @ or mov pc, lr

caller:
 mov  r0, #100000000
.p2align 4
 .loop:
  bl   testfunc      @ bl, not blx: the label form of blx always switches instruction set
  bl   testfunc      @ different return address than the previous call
  bl   testfunc
  bl   testfunc
  subs r0, r0, #1
  bne  .loop
  bx   lr

If my hypothesis is right, I predict that mov pc, lr will be slower for this than bx lr.

(A more complicated pattern of target addresses (callsites in this case) might be needed to confound indirect branch prediction on some CPUs. Some CPUs have an indirect branch predictor that can only remember 1 target address, but somewhat more sophisticated predictors can handle a simple repeating pattern of 4 addresses.)
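
As a hedged sketch of that idea (hypothetical layout: eight static call sites, with testfunc returning via mov pc, lr, so a predictor that tracks only one indirect target, or a short repeating pattern, keeps mispredicting the return):

caller8:
 mov  r0, #100000000
.p2align 4
 .loop8:
  bl   testfunc      @ eight distinct return addresses per iteration
  bl   testfunc
  bl   testfunc
  bl   testfunc
  bl   testfunc
  bl   testfunc
  bl   testfunc
  bl   testfunc
  subs r0, r0, #1
  bne  .loop8
  bx   lr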


(This is a guess; I don't have experience with any of these chips, but the general CPU-architecture technique of a return-address predictor is well known, and I've read that it's used in practice on multiple ISAs. I know for sure x86 uses it: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ Mismatched call/ret is definitely a problem there.)

Autoradiograph answered 9/8, 2020 at 0:20 Comment(4)
I find both BX lr and MOV pc, lr are using a return-address predictor on Cortex-A17, Cortex-A72 and Neoverse-N1 CPUs; performance gets equally worse if called using MOV pc, lr; MOV pc, r1 instead of nop; blx r1. – Leviticus
@TimothyBaldwin: By "equally worse", you mean those last 2 options cost the same as each other, like 1 nop slower than just bx lr? I think you mangled the code sequences, e.g. mov pc, r1 is after a mov-to-pc so never reached? Did you mean mov r1, lr; mov pc, r1 vs. nop; bx lr? Or are you using blx r1 to make indirect calls? Oh, I see you edited your question with your test results. – Autoradiograph
Yes, I mangled the comment; I meant MOV lr, pc; MOV pc, r1. – Leviticus
@TimothyBaldwin: Ok, so that was in the caller. Maybe try a function that returns with mov r2, lr / mov pc, r2 in case some CPUs recognize mov pc, lr as a return idiom. – Autoradiograph
