Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?

Asked 15/1, 2013 at 7:20 Answered 24/8, 2017 at 12:8

The latest Intel software developer's manual describes two opcode prefixes:

Group 2 > Branch Hints

    0x2E: Branch Not Taken
    0x3E: Branch Taken

These allow for explicit branch prediction of jump instructions (opcodes like Jxx).

I remember reading a couple of years ago that on x86 explicit branch prediction was essentially a no-op in the context of gcc's branch prediction intrinsics.

I am now unclear if these x86 branch hints are a new feature or whether they are essentially no-ops in practice.

Can anyone clear this up?

(That is: do gcc's branch prediction functions generate these x86 branch hints? Do current Intel CPUs honor them rather than ignore them? And if so, when did this change?)

Update:

I created a quick test program:

int main(int argc, char** argv)
{
    if (__builtin_expect(argc,0))
        return 1;

    if (__builtin_expect(argc == 2, 1))
        return 2;

    return 3;
}

Disassembles to the following:

00000000004004cc <main>:
  4004cc:   55                      push   %rbp
  4004cd:   48 89 e5                mov    %rsp,%rbp
  4004d0:   89 7d fc                mov    %edi,-0x4(%rbp)
  4004d3:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
  4004d7:   8b 45 fc                mov    -0x4(%rbp),%eax
  4004da:   48 98                   cltq   
  4004dc:   48 85 c0                test   %rax,%rax
  4004df:   74 07                   je     4004e8 <main+0x1c>
  4004e1:   b8 01 00 00 00          mov    $0x1,%eax
  4004e6:   eb 1b                   jmp    400503 <main+0x37>
  4004e8:   83 7d fc 02             cmpl   $0x2,-0x4(%rbp)
  4004ec:   0f 94 c0                sete   %al
  4004ef:   0f b6 c0                movzbl %al,%eax
  4004f2:   48 85 c0                test   %rax,%rax
  4004f5:   74 07                   je     4004fe <main+0x32>
  4004f7:   b8 02 00 00 00          mov    $0x2,%eax
  4004fc:   eb 05                   jmp    400503 <main+0x37>
  4004fe:   b8 03 00 00 00          mov    $0x3,%eax
  400503:   5d                      pop    %rbp
  400504:   c3                      retq   
  400505:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40050c:   00 00 00 
  40050f:   90                      nop

I don't see 2E or 3E in front of either branch. Maybe gcc has elided them for some reason?
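One way to check a listing like this mechanically is to scan the raw bytes for a 0x2E or 0x3E directly in front of a conditional jump opcode. The following is only a rough sketch: the helper names `is_jcc` and `count_hinted_jcc` are made up for illustration, and the scan is a naive byte search rather than a real disassembler, so bytes inside immediates or displacements could in principle produce false hits.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the bytes at p (n bytes available) start a conditional jump:
 * short Jcc is 0x70..0x7F, near Jcc is 0x0F followed by 0x80..0x8F. */
static int is_jcc(const uint8_t *p, size_t n)
{
    if (n >= 1 && p[0] >= 0x70 && p[0] <= 0x7F)
        return 1;
    if (n >= 2 && p[0] == 0x0F && p[1] >= 0x80 && p[1] <= 0x8F)
        return 1;
    return 0;
}

/* Count 0x2E/0x3E bytes that immediately precede a Jcc opcode.
 * Naive scan: instruction boundaries are not decoded (assumption:
 * the input is small enough to eyeball any hit against a disassembly). */
static size_t count_hinted_jcc(const uint8_t *code, size_t len)
{
    size_t hits = 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if ((code[i] == 0x2E || code[i] == 0x3E) &&
            is_jcc(code + i + 1, len - i - 1))
            hits++;
    }
    return hits;
}
```

Note that the listing does contain one 2e byte, in the `nopw %cs:0x0(%rax,%rax,1)` padding after `retq`. There it acts as a CS segment override inside a multi-byte NOP, not as a hint on a branch, and a scan like the one above would not count it.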

Clincher answered 15/1, 2013 at 7:20 Comment(6)

Does gcc not have an option to make it spit out assembly? Could you not write a short program using these intrinsics and see whether it produces these? (I know that doesn't answer the other half of the question) – Segura 15/1, 2013 at 7:25

@Damien_The_Unbeliever: Added as update. – Clincher 15/1, 2013 at 7:34

Ordinarily, the __builtin_expect construction just affects the GCC optimizer. (The effects are pretty subtle.) Have you tried specifying a -march or -mcpu flag to let GCC know that you have a CPU which supports these prefixes? – Antibiotic 15/1, 2013 at 7:41

@duskwuff: Tried with -march=corei7 and gives same output – Clincher 15/1, 2013 at 7:52

OK, in that case I suspect that GCC simply doesn't generate the 2E/3E prefixes. – Antibiotic 15/1, 2013 at 7:53

See also: Is it possible to tell the branch predictor how likely it is to follow the branch? – Gastrotomy 24/7, 2016 at 8:22

These instruction prefixes have no effect on modern processors (anything newer than Pentium 4). They just cost one byte of code space, and thus, not generating them is the right thing.

For details, see Agner Fog's optimization manuals, in particular 3. Microarchitecture: http://www.agner.org/optimize/

The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" no longer mentions them in the section about optimizing branches (section 3.4.1): http://www.intel.de/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

These prefixes are a (harmless) relic of the NetBurst architecture. In all-out optimization, you can use them to align code, but that's all they're good for nowadays.

Enzymology answered 15/1, 2013 at 9:35 Comment(0)

gcc is right not to generate the prefixes, as they have no effect on any processor since the Pentium 4.

But __builtin_expect has other effects, such as moving an unexpected code path away from the cache-hot locations in the code and influencing inlining decisions, so it is still useful.
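A minimal sketch of that kind of use, assuming GCC or Clang: the `likely`/`unlikely` macros are a common convention rather than anything the compiler provides, and `report_failure` and `process` are hypothetical names chosen for illustration. The `cold` and `noinline` attributes are an extra assumption on top of `__builtin_expect`, added to make the intent explicit.

```c
#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical error handler; noinline and cold ask the compiler to keep
 * this code out of line and away from the hot path. */
__attribute__((noinline, cold))
static int report_failure(int code)
{
    fprintf(stderr, "failure %d\n", code);
    return -1;
}

/* Hypothetical hot function: negative input is declared the rare case. */
static int process(int value)
{
    if (unlikely(value < 0))
        return report_failure(value);
    return value * 2;
}
```

Any effect shows up in code layout, not in prefixes, so it is best checked by compiling with optimization enabled (for example `gcc -O2 -S`) and reading the generated assembly.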

Reis answered 15/1, 2013 at 9:45 Comment(0)

While Pentium 4 is the only generation which actually respects the branch-hint instructions, most CPUs do have some form of static branch prediction, which can be used to achieve the same effect. This answer is a bit tangential to the original question, but I think this would be valuable information to anyone who comes to this page.

The Intel optimisation guide and Agner Fog's guide (which have been mentioned here already) both have excellent descriptions of this feature.

Intel has this to say about generations newer than Core 2:

Make the fall-through code following a conditional branch be the likely target for a branch with a forward target

So conditional branches which jump forward in the code are predicted to be not-taken, by the static prediction algorithm.

This is consistent with what GCC seems to have generated using __builtin_expect: the 'expected' return 1 / return 2 code is placed in the not-taken paths from the conditional branches, which will be statically predicted as not-taken.

Additionally:

Branches that do not have a history in the Branch Target Buffer are predicted using a static prediction algorithm:

Predict unconditional branches to be taken.

Predict indirect branches to be NOT taken.

So in the 'expected' not-taken paths where GCC has placed unconditional jmps to the end of the function, those jumps will be statically predicted as taken (i.e. not skipped).

Intel also says:

make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target

So conditional branches which jump backwards in the code are predicted to be taken, by the static prediction algorithm.

According to Agner Fog, most Pentiums also follow this algorithm:

On PPro, P2, P3, P4 and P4E, a control transfer instruction which has not been seen before, or which is not in the Branch Target Buffer, is predicted to fall through if it goes forwards, and to be taken if it goes backwards (e.g. a loop). Static prediction takes longer time than dynamic prediction on these processors.
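The forward/backward rule quoted above can be written down as a one-line model. This is only a sketch of the rule as described, not of any real predictor: the function name is made up, and comparing the branch address with the target address is a simplification of checking the sign of the jump displacement.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static rule for a conditional branch with no BTB history:
 * backward target (e.g. a loop) -> predicted taken,
 * forward target -> predicted to fall through.
 * Assumption: a branch to its own address counts as backward. */
static bool static_predict_taken(uint64_t branch_addr, uint64_t target_addr)
{
    return target_addr <= branch_addr;
}
```

For the first je in the question's listing (at 0x4004df, target 0x4004e8) this returns false, i.e. the branch would be statically predicted not taken.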

However, the Core 2 family (and Pentium M) has a completely different policy:

These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch. There is simply a 50% chance of making the right prediction of jump or no jump, but the predicted target is correct.

AMD processors apparently also depart from the forward/backward rule:

A branch is predicted not taken the first time it is seen. A branch is predicted always taken after the first time it has been taken. Dynamic prediction is used only after a branch has been taken and then not taken. Branch hint prefixes have no effect.
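Read literally, that description is a small three-state machine per branch. The sketch below models only the transitions stated in the quote; the enum and function names are invented for illustration, and the dynamic predictor itself is deliberately left unmodeled because the quote says nothing about how it works.

```c
#include <stdbool.h>

/* Modes taken from the quoted description; the names are made up here. */
enum branch_mode {
    MODE_NEVER_TAKEN,   /* not taken so far: predict not taken */
    MODE_ALWAYS_TAKEN,  /* taken, and not fallen through since: predict taken */
    MODE_DYNAMIC        /* taken and then not taken: dynamic prediction */
};

/* Advance the mode after a branch resolves. */
static enum branch_mode update_mode(enum branch_mode mode, bool taken)
{
    switch (mode) {
    case MODE_NEVER_TAKEN:
        return taken ? MODE_ALWAYS_TAKEN : MODE_NEVER_TAKEN;
    case MODE_ALWAYS_TAKEN:
        return taken ? MODE_ALWAYS_TAKEN : MODE_DYNAMIC;
    default:
        return MODE_DYNAMIC;
    }
}

/* Prediction for the two simple modes: 0 = not taken, 1 = taken.
 * Returns -1 for MODE_DYNAMIC, which this sketch does not model. */
static int predict_taken(enum branch_mode mode)
{
    if (mode == MODE_NEVER_TAKEN)
        return 0;
    if (mode == MODE_ALWAYS_TAKEN)
        return 1;
    return -1;
}
```

In this model a branch that is never taken stays in the first mode forever, which matches the quote's "predicted not taken the first time it is seen" without needing any hint prefix.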

There is one additional factor to consider: CPUs generally like to execute in a linear fashion, so even correctly-predicted taken branches are often more expensive than correctly-predicted not-taken branches.

Lands answered 25/10, 2015 at 7:20 Comment(2)

As usual, things are complicated, and modern Intel may still use some static prediction, according to Matt Godbolt's experiments and research. – Gastrotomy 24/7, 2016 at 8:25

You said "This is consistent with what GCC seems to have generated using" which seems wrong AFAICT when we look at the OP's assembly. But the issue here, I think, is that the OP hasn't turned on any optimizations. With optimizations turned on, it will generate a branch-free version (even better). But if we change the example in that it calls some function depending on a parameter (which can't be optimized into a math trick) we start to see GCC respecting the hints via the instruction ordering. So, starting at -O1 this hinting makes a difference. – Electrolytic 18/7, 2019 at 12:2

Intel® 64 and IA-32 Architectures Software Developer’s Manual -> Volume 2: Instruction Set Reference, A-Z -> Chapter 2: Instruction Format -> 2.1 Instruction Format for Protected Mode, real-address Mode, and virtual-8086 mode -> 2.1.1 Instruction Prefixes

Some earlier microarchitectures used these as branch hints, but recent generations have not and they are reserved for future hint usage.

Influx answered 24/8, 2017 at 12:8 Comment(1)

This is more of a comment than an answer, since the existing answers already say this. Interesting that Intel does officially reserve them for future use. – Gastrotomy 24/8, 2017 at 15:56
