As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. Agner Fog), but I'm doing it for fun and self-education.
My testing code is pretty simple (shown here as pseudo code for simplicity; in reality it is written in assembler):
```
for(outer_loop=0; outer_loop<NO; outer_loop++){
    operation #1
    operation #2
    ...
    operation #NI
}
```
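For concreteness, a loop like the one above can be produced by a small generator script. This is a hypothetical helper (not the script actually used for the measurements); it emits GAS-syntax assembly with the inner operation unrolled `NI` times:

```python
def make_test_asm(op, ni, no=1000):
    """Emit a GAS-syntax test loop: `no` outer iterations, each
    containing `ni` copies of the instruction string `op`.
    (Hypothetical helper, not the author's actual benchmark script.)"""
    body = "\n".join("    " + op for _ in range(ni))
    # %r15 is used as the counter so the measured ops can clobber %rax
    return (f"    movq ${no}, %r15          # outer loop counter\n"
            ".Louter:\n"
            f"{body}\n"
            "    decq %r15\n"
            "    jnz .Louter\n")

asm = make_test_asm("xorl %eax, %eax", ni=4)
```

The generated text can then be fed to `as`/`gcc` and timed with `perf stat` or similar.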
Yet some things should be considered:
- If the inner part of the loop is large (`NI > 10^7`), the whole content of the loop does not fit into the instruction cache and thus must be loaded over and over again, so the speed of RAM defines the execution time. For example, for large inner parts, `xorl %eax, %eax` (2 bytes) is 33% faster than `xorq %rax, %rax` (3 bytes).
- If `NI` is small and the whole loop fits easily into the instruction cache, then `xorl %eax, %eax` and `xorq %rax, %rax` are equally fast and can be executed 4 times per clock cycle.
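The size difference behind that 33% figure comes from the REX.W prefix: as I understand the encodings, `xorl %eax, %eax` assembles to `31 c0` and `xorq %rax, %rax` to `48 31 c0`. A quick sketch of the arithmetic:

```python
# Machine-code encodings (from the Intel manuals; verify with your assembler):
XORL_EAX = bytes.fromhex("31c0")    # xorl %eax, %eax
XORQ_RAX = bytes.fromhex("4831c0")  # xorq %rax, %rax (REX.W prefix + same opcode)

# xorl needs 2/3 of the bytes of xorq, i.e. 33% fewer bytes per instruction,
# matching the measured speedup when fetch bandwidth is the bottleneck.
savings = 1 - len(XORL_EAX) / len(XORQ_RAX)
```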
However, this simple model does not hold water for the `jmp` instruction. For the `jmp` instruction my test code looks as follows:
```
for(outer_loop=0; outer_loop<NO; outer_loop++){
     jmp .L0
.L0: jmp .L1
.L1: jmp .L2
    ....
}
```
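Such a chain can be generated mechanically. A hypothetical sketch: each `jmp` targets the immediately following label, so every instruction is a taken branch with its own distinct address:

```python
def make_jmp_chain(ni):
    """Generate a chain of `ni` jmp instructions, each jumping to the
    very next label (hypothetical generator for the loop body above)."""
    lines = [f".L{i}: jmp .L{i + 1}" for i in range(ni)]
    lines.append(f".L{ni}:")  # final label closes the chain
    return "\n".join(lines)

chain = make_jmp_chain(3)
```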
And the results are:
- For "large" loop sizes (already for `NI > 10^4`) I measure 4.2 ns per `jmp` instruction (which would equate to 42 bytes loaded from RAM, or ca. 12 clock cycles on my machine).
- For small loop sizes (`NI < 10^3`) I measure 1 ns per `jmp` instruction (around 3 clock cycles, which sounds plausible; Agner Fog's tables show a cost of 2 clock cycles).
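The ns-to-cycles conversion above assumes a clock of roughly 3 GHz (my inference from the "1 ns ≈ 3 cycles" figure, not a stated spec); the arithmetic is just:

```python
def ns_to_cycles(ns_per_insn, clock_ghz=3.0):
    # cycles = seconds * Hz = (ns * 1e-9) * (GHz * 1e9) = ns * GHz
    return ns_per_insn * clock_ghz

small_loop = ns_to_cycles(1.0)  # ~3 cycles per jmp
large_loop = ns_to_cycles(4.2)  # ~12.6 cycles per jmp
```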
The instruction `jmp .LX` uses the 2-byte `eb 00` encoding.
Thus, my question: What could be the explanation for the high cost of the `jmp` instruction in the "large" loops?
PS: If you would like to try it out on your machine, you can download the scripts from here; just run `sh jmp_test.sh` in the src folder.
Edit: Experimental results confirming Peter's BTB size theory.
The following table shows cycles per instruction for different `NI` values (relative to `NI` = 1000):
| operations \ NI      | 1000 | 2000 | 3000 | 4000 | 5000 | 10000 |
|----------------------|------|------|------|------|------|-------|
| jmp                  | 1.0  | 1.0  | 1.0  | 1.2  | 1.9  | 3.8   |
| jmp+xor              | 1.0  | 1.2  | 1.3  | 1.6  | 2.8  | 5.3   |
| jmp+cmp+je (jump)    | 1.0  | 1.5  | 4.0  | 4.4  | 5.5  | 5.5   |
| jmp+cmp+je (no jump) | 1.0  | 1.2  | 1.3  | 1.5  | 3.8  | 7.6   |
It can be seen:
- For the `jmp` instruction, a (yet unknown) resource becomes scarce, which leads to performance degradation for `NI` larger than about 4000.
- This resource is not shared with instructions such as `xor`: the performance degradation still kicks in at `NI` of about 4000 when `jmp` and `xor` are executed after each other.
- But this resource is shared with `je` if the jump is taken: for `jmp` + `je` after each other, the resource becomes scarce at `NI` of about 2000.
- However, if `je` does not jump at all, the resource becomes scarce again only at `NI` of about 4000 (4th line of the table).
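These observations fit one simple accounting rule, assuming a branch target buffer of 4096 entries in which only taken branches occupy a slot (my reading of the data, not a verified microarchitectural model):

```python
BTB_ENTRIES = 4096  # capacity suggested by the reverse-engineering work cited below

def btb_overflows(taken_branches_in_loop, btb=BTB_ENTRIES):
    """True once the distinct taken-branch addresses in the inner loop
    no longer fit in the BTB (simplified accounting: one entry per
    taken branch, none for not-taken branches or non-branch insns)."""
    return taken_branches_in_loop > btb

# jmp-only loop of size NI: NI taken branches -> overflow just above 4096.
# Pairing each jmp with a taken je doubles the taken branches, roughly
#   halving the NI at which overflow starts (~2000, as measured).
# Not-taken je's and xor's add no entries, so those cases behave like jmp alone.
```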
Matt Godbolt's branch-prediction reverse-engineering articles establish that the branch target buffer capacity is 4096 entries. That is very strong evidence that BTB misses are the reason for the observed throughput difference between small and large `jmp` loops.
`xorq %rax,%rax` does exactly the same thing as `xorl %eax,%eax`, so there is almost never a reason to use the former (except perhaps to avoid having to insert a `nop` for alignment somewhere). – Schoening

For `mov` and `xor` I need to go as far as 10^7 instructions in the loop to see the "RAM speed". However, `jmp` becomes 4 times slower from 10^3 to 10^4. I'm not saying it is because of RAM; it is something different, but I don't quite know what it is. – Broomstick

The `jmp+cmp+je (no jump)` case doesn't hit resource scarcity until around 4,000 jumps because jumps that aren't taken don't consume a BTB entry (indeed, there would be nothing to put in the BTB!). – Aftmost