Indexed branch overhead on x86 64-bit mode

This is a follow-up to some comments made in this prior thread:

Recursive fibonacci Assembly

The following code snippets calculate Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unrolled loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64-bit with an Intel 3770K 3.5 GHz processor. With a single loop testing fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump version ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds and the computed jump version about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than the computed jump version.

Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.

Partial answer: The computed jump version branches to 94 possible target locations within a 280-byte range, and apparently a branch target buffer (cache) does a good job of predicting this. For the loop version, using align 16 to put the assembly-based fib() function on a 16-byte boundary solved the loop version's timing issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.

Loop version (note: I've read that dec / jnz is faster than loop):

        align   16
fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

Computed jump (indexed branch) into unrolled loop version:

        align   16
fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

Test code that runs 1,048,576 (0x100000) outer loops, each calculating fib(0) to fib(93) using multiples of 37 mod 93 so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.

#include <stdio.h>
#include <time.h>

typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;

extern "C" uint64_t fib(uint64_t);

/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] = 
     {0,37,74,18,55,92,36,73,17,54,
     91,35,72,16,53,90,34,71,15,52,
     89,33,70,14,51,88,32,69,13,50,
     87,31,68,12,49,86,30,67,11,48,
     85,29,66,10,47,84,28,65, 9,46,
     83,27,64, 8,45,82,26,63, 7,44,
     81,25,62, 6,43,80,24,61, 5,42,
     79,23,60, 4,41,78,22,59, 3,40,
     77,21,58, 2,39,76,20,57, 1,38,
     75,19,56,93};

/* x used to avoid compiler optimizing out result of fib() */
int main()
{
size_t i, j;
clock_t cbeg, cend;
uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}

The output for x is 0x812a62b1dc000000. The sum of fib(0) to fib(93) in hex is 0x1bb433812a62b1dc0; appending five more hex zeros for the 0x100000 outer loops gives 0x1bb433812a62b1dc000000. The upper 6 nibbles are truncated by the 64-bit math.
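That truncation is easy to sanity-check outside the benchmark. The following standalone snippet (just a sketch, not part of the test harness) sums fib(0) to fib(93) in a uint64_t, lets it wrap the same way the benchmark's accumulator does, and multiplies by the 0x100000 outer loop count; it should print the same x value:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t f0 = 0, f1 = 1, sum = 0;    /* f0 = fib(n), f1 = fib(n+1) */
    for (int n = 0; n <= 93; n++) {
        sum += f0;                       /* wraps mod 2^64, like the test's x */
        uint64_t next = f0 + f1;
        f0 = f1;
        f1 = next;
    }
    /* one pass is summed 0x100000 times; only the low 64 bits survive */
    printf("expected x = %llx\n", (unsigned long long)(sum * 0x100000ULL));
    return 0;
}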

I made an all-assembly version to better control code location. The "if 1" is changed to "if 0" for the loop version. The loop version takes about 1.465 to 2.000 seconds depending on the nop padding used to put key locations on even or odd 16-byte boundaries (see the comments below). The computed jump version takes about 1.04 seconds, and boundaries make less than 1% difference in timing.

        includelib msvcrtd
        includelib oldnames

        .data
; multiples of 37 mod 93 + 93 at the end
a       dq      0,37,74,18,55,92,36,73,17,54
        dq     91,35,72,16,53,90,34,71,15,52
        dq     89,33,70,14,51,88,32,69,13,50
        dq     87,31,68,12,49,86,30,67,11,48
        dq     85,29,66,10,47,84,28,65, 9,46
        dq     83,27,64, 8,45,82,26,63, 7,44
        dq     81,25,62, 6,43,80,24,61, 5,42
        dq     79,23,60, 4,41,78,22,59, 3,40
        dq     77,21,58, 2,39,76,20,57, 1,38
        dq     75,19,56,93
        .data?
        .code
;       parameters      rcx,rdx,r8,r9
;       not saved       rax,rcx,rdx,r8,r9,r10,r11
;       code starts on 16 byte boundary
main    proc
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbp
        mov     rbp,rsp
        and     rsp,0fffffffffffffff0h
        sub     rsp,64
        mov     r15,offset a
        xor     r14,r14
        mov     r11,0100000h
;       nop padding effect on loop version (with 0 padding in padx below)
;        0 puts main2 on  odd 16 byte boundary  clk = 0131876622h => 1.465 seconds
;        9 puts main1 on  odd 16 byte boundary  clk = 01573FE951h => 1.645 seconds
        rept    0
        nop
        endm
        rdtsc
        mov     r12,rdx
        shl     r12,32
        or      r12,rax
main0:  xor     r10,r10
main1:  mov     rcx,[r10+r15]
        call    fib
main2:  add     r14,rax
        add     r10,8
        cmp     r10,8*94
        jne     main1
        dec     r11
        jnz     main0
        rdtsc
        mov     r13,rdx
        shl     r13,32
        or      r13,rax
        sub     r13,r12
        mov     rdx,r14
        xor     rax,rax
        mov     rsp,rbp
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
main    endp

        align   16
padx    proc
;       nop padding effect on loop version with 0 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 0131876622h => 1.465 seconds
;       16 puts fib on even 16 byte boundary    clk = 01A13C8CB8h => 2.000 seconds
;       nop padding effect on computed jump version with 9 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 00D979792Dh => 1.042 seconds
;       16 puts fib on even 16 byte boundary    clk = 00DA93E04Dh => 1.048 seconds
        rept    0
        nop
        endm
padx    endp

        if      1       ;0 = loop version, 1 = computed jump version

fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

        else

fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

        endif
        end
Stanleystanly answered 1/11, 2017 at 2:51 Comment(28)
todo: test that theory with perf counters on my SKL.Watcher
@PeterCordes - I updated my post. Apparently code alignment affected the loop version. So with forced alignment, it's back to 1.43 seconds for loop and 1.03 seconds for indexed branchStanleystanly
I guess that makes sense with mispredicts happening. For a long-running loop it wouldn't matter much at all as long as alignment doesn't break the uop cache and stop it from running from the loop buffer. I don't know exactly why it matters more with mispredicts and short iteration counts, but it's certainly going to be more complicated.Watcher
I tried it on Skylake with perf counters: only 19k branch-misses (br_misp_retired.all_branches) for the whole program with the computed jump version, so apparently indirect-branch target prediction is working fantastically well here. I didn't think a pattern this long would have any chance, but I guess it does. Could maybe defeat it with another 94 entries in a different order. It definitely should be counting indirect-branch mispredicts, not just conditional, because there's a br_misp_retired.conditional for just conditional. Will try again after my curling game.Watcher
Times on my SKL are ~0.66s for the jump, 1.02s for the loop, at ~4.3GHz turbo. (I'm getting a different result from the sum of the function results; Maybe I broke it while porting to NASM and Linux calling convention, and making minor optimizations. Do you get the same x for jump vs. loop?) BTW, the loop version mispredicts about the expected amount; I think once per loop-exit, or 51M per program or 1.88% of all branches.Watcher
@PeterCordes - My fault - there was a mistake in my code: it was reading a[93] when a[93] was not defined. I fixed this by increasing the size of a[] from 93 to 94 entries and setting a[93] == 93. My post now includes this fix in both test programs. For both cases, I'm getting the same x value: 0x812a62b1dc000000 .Stanleystanly
@PeterCordes - "19k branch misses for computed jump version" - how big is the branch target buffer ? Apparently it's large enough to handle all 94 possible targets from the computed jump, at least most of the time.Stanleystanly
BTW, in the C version gcc finds that bug automatically: fib-indexed-branch.c:35:3: warning: iteration 93 invokes undefined behavior [-Waggressive-loop-optimizations]. Apparently I copy/pasted a version from before you fixed that in the C. I didn't copy/paste the asm table.Watcher
The BTB is thousands of entries, but the surprising thing is that it can handle a pattern so large for a single indirect branch. It's indexed by branch history, so maybe it is somehow using lots of entries to be this accurate.Watcher
Modern branch prediction approaches, including for indirect branches, such as TAGE and ITAGE, potentially spread out the history for one branch across a huge amount of internal state. In prototypical designs, a branch with enough distinct histories could use a large fraction of the branch predictor for its predictions (limited by the number of ways). That's one of the secret sauces behind modern predictors: the storage per PC isn't limited to some small fraction of the BP state, but can expand if useful.Crawly
@PeterCordes - I simplified main(). The y[] array wasn't needed. Having x += fib(a[i]); along with printf(...x) was enough to prevent the compiler from optimizing out x. The all-assembly loop version was slightly slower (1.44 to 1.46 seconds), which could be luck (I take the best of several runs), or a minor code location issue.Stanleystanly
@BeeOnRope: In this case the history of taken/not-taken branches is always the same. The only thing that changes is how many add instructions were run after the last indirect branch. Does "branch history" include how many instructions separated the branches, or what the target of the last indirect branch was? I was thinking it probably wouldn't.Watcher
@rcgldr: yeah, that's what I was testing with already. The y[] stuff looked over-complicated, so I just added my own #ifdef USE_RESULT / x += / #endif / fib(a[i]).Watcher
@PeterCordes - I don't fully understand your question. Are you talking about the loop solution or the unrolled+indirect branch one? Intuitively though the branch prediction results here make sense. The indirect branch case only needs to predict a repeating pattern of 93 jumps. So 7 bits of history will already provide 128 distinct locations in the predictor tables and with 10 bits of history you can do a pretty good job. The loop-based solution, however, will generally have a long string of N (not taken) as the branch history at the moment it needs to predict loop exit.Crawly
If we analyze a looping version without the 2x unroll (no shr rcx,1) then N would be between 1 and 93, with an average in the middle. We know the branch predictors on modern Intel seem to have a history length in the 20-ish bits, so most of these cases will see "all zeros" and mispredict the exit. Essentially TAGE is bad at predicting loop exits because the branch history in that case has low entropy (mostly zeros), so it isn't storing the history in an efficient format and you quickly exceed the history size. We may see a return of the loop predictors at some point.Crawly
Furthermore, even the loops that do fit in the buffer will usually just see something like TTTTTTTTTTTNTTTTTTTTTT in the history buffer: i.e., they will usually just see back to the "last" loop exit - but that doesn't help you predict the current one. You need to see the last two to effectively localize yourself in the benchmark input array, and that's even less likely. Analyzing the 2x unroll version that appears above is harder, but it doesn't work "much" better. You compress the history by 2:1, but now you have 2x aliasing also.Crawly
@BeeOnRope: The loop version makes sense to me: typically a mispredict on loop exit. The indirect version surprised me because I don't understand where that history is coming from. I didn't know that different target addresses could turn into branch history bits. Now that you explain it, it makes sense that they can, and this seems like a good mechanism instead of needing a special target-history pattern buffer for each indirect branch or whatever else it might use. (Now that I type that out, it sounds wrong). I didn't know the mechanism by which indirect branches find patterns.Watcher
One way you can do it for both conditional and indirect branches is just to take a hash of the path history. This is basically just a hash of the targets of the last N jumps. For conditional branches that's different from, but similarly powerful to, the "1 bit taken/not per branch" approach. It can handle cases where control flow converges (e.g., two branches at different PCs jump to the same location and then there is another branch): it keeps those separate while the T/N approach would consider them the same. On the other hand ...Crawly
... it can't capture cases where two different branches have the same history based on behavior, and the hash function isn't as exact as the T/N case. For indirect branches, this approach is even more natural since no obvious T/N approach presents itself. So for indirect branches you can expect it works at least something like that: hash information from some previous N branches "outcomes" (where "outcome" for an indirect branch is where it went) and then look that up in the prediction tables. To catch a repeating loop of period 94 you don't need a lot of hardware.Crawly
@PeterCordes - in the same case for conditional branches, with random input, I have to make a loop have a period of many thousands before the predictor starts failing. A periodic loop of 1000 random conditional branches is predicted very successfully, for example. History length doesn't need to be close to 1000 bits: it just needs to be long enough to identify uniquely the 1000 positions in the period, which is probably something like lg(1000) + 1 = ~11 bits, for a reasonable rate. Loops exits predictions don't get close to 1000 because the history is low entropy: they are a worst case.Crawly
FWIW, you observed a miss rate of about 1.8% in the period-94 loop. To get a "spurious aliasing" rate of 1.8% needs a lookup table of about ~5000 elements, which means a minimum history size of just over 12 bits. Even in the aliasing case, you have perhaps a 50% chance of getting the target right, since you'll usually alias with 1 other branch and they implement an "anti ping-pong" algorithm, so the actual table is probably half that. The total table storage is what limits these totally random cases though, and having ~2500 i-branch entries seems reasonable on Skylake.Crawly
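To make the mechanism described in the last few comments concrete, here is a toy C model (purely illustrative; real predictors are not built this way, and the history length and table size are arbitrary assumptions): hash the targets of the last few indirect jumps, use the hash to index a small table of predicted targets, and the period-94 target sequence from this benchmark becomes almost perfectly predictable once the table is trained:

#include <stdio.h>
#include <stdint.h>

#define HIST_LEN   8                     /* previous targets hashed (assumed) */
#define TABLE_BITS 12                    /* 4096-entry prediction table (assumed) */

int main(void)
{
    static uint32_t table[1 << TABLE_BITS];   /* predicted target per history hash */
    uint32_t hist[HIST_LEN] = {0};
    uint64_t hits = 0, total = 0;

    for (int pass = 0; pass < 1000; pass++) {
        for (uint32_t i = 0; i < 94; i++) {
            /* same argument order as the benchmark: multiples of 37 mod 93, then 93 */
            uint32_t target = (i < 93) ? (i * 37) % 93 : 93;

            uint32_t h = 0;                    /* hash the path history */
            for (int k = 0; k < HIST_LEN; k++)
                h = h * 31 + hist[k];
            h &= (1u << TABLE_BITS) - 1;

            hits  += (table[h] == target);     /* did the table predict it? */
            total += 1;
            table[h] = target;                 /* train on the actual target */

            for (int k = HIST_LEN - 1; k > 0; k--)   /* shift in the new target */
                hist[k] = hist[k - 1];
            hist[0] = target;
        }
    }
    printf("predicted %llu of %llu targets (%.1f%%)\n",
           (unsigned long long)hits, (unsigned long long)total,
           100.0 * (double)hits / (double)total);
    return 0;
}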
@PeterCordes - I updated the all-assembly code in my post and was able to reproduce the 1.4 to 2.0 second increase by putting the return from fib() at an odd 16-byte boundary and the fib() entry at an even 16-byte boundary. I'm wondering if this is an issue specific to the Intel 3770K processor.Stanleystanly
@PeterCordes - I repeated my initial test that only runs the inner loop for fib(0) to fib(93). The loop version remained about 40% slower than the computed jump version (1.901 microseconds vs 1.324 microseconds). Since the computed jump version only jumps to each of the 94 target addresses one time in this shortened test, it would seem the branch target history should not be a factor. This would get back to your point about the loop overhead being about 40%. I'm wondering if you can repeat the inner-loop-only test with SKL performance counters.Stanleystanly
Interesting, good idea. My usual test method can't perf count very short intervals like that. I use perf stat to get performance counters for the entire executable (statically linked so there's no CRT startup/cleanup, but still moderate overhead). There are libraries for using perf counters from user-space, but I'd have to set that up.Watcher
@PeterCordes You can use uarch-bench for those kinds of tests: it's among the reasons I wrote it. It should be easy to copy-paste in a new test (assembly or C) and get the number of cycles for a "single pass" (although currently it does still execute that pass 33 times, collecting the result each time: after all, you won't want to count icache misses, right?). By default it just uses the cycles counter, but you can add any others you want with --extra-events.Crawly
@PeterCordes - late request - I was wondering if you could test the all assembly code on your system (hopefully it's not an Intel 3770K), and comment if you see the same 35% difference on the loop version depending on location as noted in the assembly code comments.Stanleystanly
mov rdx,1 => mov edx,1, xor rax, rax => xor eax, eax. Those save a REX prefixMcauley
@Mcauley - thanks. The main issue that popped up was the fact that code location for the loop version resulted in the run time changing from 1.465 seconds to 2.000 seconds, a 35+% increase.Stanleystanly

This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
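Something like this in the caller is what I mean (a sketch only; the discard name and run_pass wrapper are just for illustration, not code from the question):

#include <stddef.h>
#include <stdint.h>

extern uint64_t fib(uint64_t);        /* the asm fib from the question */

volatile uint64_t discard;            /* forces a real store of each result */

void run_pass(const uint64_t a[94])   /* a[] = the shuffled argument table */
{
    for (size_t i = 0; i < 94; i++)
        discard = fib(a[i]);          /* the result is "used" (stored), but the
                                         next call doesn't depend on it */
}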


Old answer: Why the computed jump is 1.4x faster than the loop with the result unused

You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does after has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.

OOO execution also hides the cost of the branch-mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)

It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the dec/jnz into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)


I think in most real use-cases, latency might well be relevant. Maybe throughput will still be most important, but anything other than this will make latency matter more, because this doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but this will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.

I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run at least once (e.g. sub rcx,8 / jb .cleanup).
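In C terms, the shape I'm suggesting looks something like this (a sketch under the same odd/even setup as the question's loop version, unrolled by 4; the asm would use something like sub rcx,8 / jb .cleanup for the pre-check):

#include <stdint.h>

uint64_t fib_unrolled(uint64_t n)
{
    uint64_t a = n & 1;            /* same parity trick as the asm version */
    uint64_t d = 1 - (n & 1);
    uint64_t pairs = n >> 1;       /* each "pair" is two Fibonacci steps */

    while (pairs >= 4) {           /* unrolled body: 4 pairs per trip */
        d += a; a += d;
        d += a; a += d;
        d += a; a += d;
        d += a; a += d;
        pairs -= 4;
    }
    while (pairs--) {              /* cleanup: 0..3 leftover pairs */
        d += a;
        a += d;
    }
    return a;                      /* for n < 2 the loops run zero times
                                      and a is already n */
}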


Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
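One possible shape of that idea (just a sketch, not my tested code): start the pair with constants that don't depend on n, run floor(n/2) double-steps, and select the final answer based on the low bit of n, which a compiler can turn into a cmov:

#include <stdint.h>

uint64_t fib_fixup_at_end(uint64_t n)
{
    uint64_t a = 0, d = 1;              /* no data dependency on n here */
    for (uint64_t k = n >> 1; k != 0; k--) {
        d += a;                         /* two Fibonacci steps per iteration */
        a += d;
    }
    uint64_t odd = a + d;               /* result if n is odd (the lea part) */
    return (n & 1) ? odd : a;           /* selectable with cmov, no extra branch */
}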

Watcher answered 1/11, 2017 at 3:55 Comment(0)
