Performance optimisations of x86-64 assembly - Alignment and branch prediction
I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions.

So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.

For instance, adding or even removing a few simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance, and there's no apparent reason for it in the code itself.

So my guess is that there are some issues with code alignment, and/or with branches that get mispredicted.

I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.

But is there any general advice, when developing for high performance on x86-64, about code alignment and branch prediction?

In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?

_func:
    ; ... Some code ...
    test rax, rax
    jz   .label
    ; ... Some code ...
    ret
    .label:
        ; ... Some code ...
        ret

In the previous code, should I use an align directive before .label:, like:

align 4
.label:

If so, is it enough to align on a DWORD when using SSE-2?

And about branch prediction, is there a «preferred» way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?

EDIT

Ok, here's a concrete example - here's the start of strlen() with SSE-2:

_strlen64_sse2:
    mov         rsi,    rdi
    and         rdi,    -16
    pxor        xmm0,   xmm0
    pcmpeqb     xmm0,   [ rdi ]
    pmovmskb    rdx,    xmm0
    ; ...

Running it 10'000'000 times with a 1000-character string takes about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    ; ...

With the same test, it now runs in 0.59 seconds. But if I align the code after this check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    align      8
    ; ...

The original performance is back. I used 8 for the alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align or not align code sections?

EDIT 2

Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in specific cases like the one above.

Delitescent answered 7/8, 2013 at 21:18 Comment(11)
SSE2 has branch hint prefixes (2E and 3E). – Eaten
@KerrekSB Thanks for the comment. Are those prefixes still used by modern CPUs, or are they simply ignored? I can't find anything about them in Intel's optimisation manual for x86-64... – Delitescent
Those should still work in modern CPUs, and AFAIK they work(ed) even before SSE2. – Wagonage
Branch hints are ignored by all processors except the P4. – Dysgenic
Since you mentioned the optimization manual, I assume you have read section 3.4.1.5, Code Alignment, which says "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned." The whole of section 3.4.1 is worth reading, of course. – Portcullis
@Portcullis Thank you. I've read it before, of course. The issue is that if I align every section, I'll usually get worse performance compared to the unaligned version, except in some specific cases, like my last example. Trying to understand why... – Delitescent
Your "strategic" align could be triggering better cache management. Of course, blindly aligning everything will stuff in a lot of NOPs and make the actual code a lot more sparse in memory, reducing the chance that related pieces of code share the same cache line, i.e. fewer cache hits. Just a thought. Maybe cachegrind will be of help here... – Crutch
As far as branch prediction on modern x86 CPUs is concerned, check out section 3 of Agner Fog's microarchitecture manual. – Crutch
Also, as I recently found out, a good place to find SSE gems is glibc. Grab the latest source and grep -rl "SSE" sysdeps/x86_64/ | uniq -u for 88 different commonly used functions, SSE-optimised for x86_64. – Crutch
I wonder how useful this level of optimization will be in a more realistic setting where the entire string doesn't live in the L1 cache, which it clearly does for the benchmark you're using. The 20% performance differences you're worried about could be totally insignificant compared to memory fetch costs. – Epperson
Voting to close as too broad. – Pinto

Alignment optimisations

1. Use .p2align <abs-expr> <abs-expr> <abs-expr> instead of align.

It grants fine-grained control via its three parameters:

  • param1 - the boundary to align to, given as a power of two.
  • param2 - what to fill the padding with (zeroes or NOPs).
  • param3 - do NOT align if the padding would exceed this number of bytes.
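
For instance, this is roughly the idiom GCC itself emits before hot loop tops (GAS syntax; the specific numbers are only an illustration):

  # align to a 16-byte (2^4) boundary, but only if reaching it
  # needs 10 or fewer bytes of padding...
  .p2align 4,,10
  # ...and in any case guarantee at least 8-byte (2^3) alignment
  .p2align 3
.L_hot_loop:
  # ... loop body ...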

2. Align the start of frequently used code blocks to cache-line-size boundaries.

  • This increases the chance that the entire code block lies in a single cache line. Once loaded into the L1 instruction cache, it can then run without any further instruction fetches from RAM. This is highly beneficial for loops with a large number of iterations.
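
  For example (NASM syntax, purely illustrative), padding so a hot loop starts at a 64-byte cache-line boundary:

  align 64            ; start the hot loop on a 64-byte cache-line boundary
                      ; (assumes the section itself is at least 64-byte aligned)
  .hot_loop:
      ; ... keep the body within ~64 bytes of machine code if possible ...
      dec     rcx
      jnz     .hot_loop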

3. Use multi-byte NOPs for padding to reduce the time spent executing NOPs.

  /* nop */
  static const char nop_1[] = { 0x90 };

  /* xchg %ax,%ax */
  static const char nop_2[] = { 0x66, 0x90 };

  /* nopl (%[re]ax) */
  static const char nop_3[] = { 0x0f, 0x1f, 0x00 };

  /* nopl 0(%[re]ax) */
  static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };

  /* nopl 0(%[re]ax,%[re]ax,1) */
  static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopw 0(%[re]ax,%[re]ax,1) */
  static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopl 0L(%[re]ax) */
  static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };

  /* nopl 0L(%[re]ax,%[re]ax,1) */
  static const char nop_8[] =
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

  /* nopw 0L(%[re]ax,%[re]ax,1) */
  static const char nop_9[] =
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

  /* nopw %cs:0L(%[re]ax,%[re]ax,1) */
  static const char nop_10[] =
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

(Up to 10-byte NOPs for x86. Source: binutils-2.2.3.)


Branch-prediction optimisations

There is a lot of variation between x86_64 micro-architectures/generations. However, a common set of guidelines applicable to all of them can be summarised as follows. Reference: section 3 of Agner Fog's x86 micro-architecture manual.

1. Un-roll loops to avoid slightly-too-high iteration counts.

  • Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is due to the fact that a branch instruction is recognised as having loop behaviour if it goes one way n-1 times and then goes the other way 1 time, for any n up to 64.

    This doesn't really apply to the predictors in Haswell and later, which use a TAGE predictor and don't have dedicated loop-detection logic for specific branches. Iteration counts of ~23 can be the worst case for an inner loop inside a tight outer loop with no other branching, on Skylake: the exit from the inner loop mispredicts most times, but the trip count is so low that it happens often. Unrolling can help by shortening the pattern, but for very high trip counts the single mispredict at the end is amortised over many trips, and it would take an unreasonable amount of unrolling to do anything about it.
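
    As a hedged sketch of the idea (NASM syntax; registers and counts are illustrative, and it assumes xmm0 is zeroed, rdi is 16-byte aligned and the byte count in rcx is a multiple of 64), unrolling a simple store loop by four divides the number of iterations, and hence of taken branches, by four:

    ; rolled: one taken branch per 16 bytes stored
    .fill_rolled:
        movdqa  [rdi], xmm0
        add     rdi, 16
        sub     rcx, 16
        jnz     .fill_rolled

    ; unrolled by 4: one taken branch per 64 bytes stored,
    ; so far fewer iterations for the loop predictor to track
    .fill_unrolled:
        movdqa  [rdi],      xmm0
        movdqa  [rdi + 16], xmm0
        movdqa  [rdi + 32], xmm0
        movdqa  [rdi + 48], xmm0
        add     rdi, 64
        sub     rcx, 64
        jnz     .fill_unrolled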

2. Stick to near/short jumps.

  • Far jumps are not predicted, i.e. the pipeline always stalls on a far jump to a new code segment (CS:RIP). There's basically never a reason to use a far jump anyway, so this is mostly not relevant.

    Indirect jumps with an arbitrary 64-bit absolute address are predicted normally on most CPUs.

    But Silvermont (Intel's low-power line) has some limitations in predicting indirect jumps when the target is more than 4 GB away, so avoiding that by loading/mapping executables and shared libraries into the low 32 bits of virtual address space can be a win there, e.g. on GNU/Linux by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. See Intel's optimization manual for more.

Crutch answered 16/8, 2013 at 18:10 Comment(4)
Thank you for the answer, especially for the multi-byte NOPs. I'll add further details in another answer, as it may also help people. In the meantime, I'm awarding the bounty to you, to thank you for taking the time to write a detailed answer, even if it doesn't answer everything :) – Delitescent
Thank you. :-) Looking forward to your answer with the details that you came across in your research. – Crutch
In x86, a FAR jump is one to a different code segment, i.e. it changes CS. This is pretty much only relevant for 16-bit. There's no need to even mention it for optimizing normal user-space code. Short (rel8) and near (rel32) jumps are both predicted and speculatively executed. IDK if you thought far meant rel32 or something. – Scorper
@Crutch re: "Iteration counts of ~23 can be the worst-case for an inner loop inside a tight outer loop" - this is not because of the branch predictor. It's because the inner loop will start running out of the LSD at around 23 iterations, and the only stop condition for the LSD is a branch miss. – Reedy

To extend TheCodeArtist's answer, which makes some good points, here are a few additional details, as I was actually able to solve the problem.

1 - Code alignment

Intel recommends aligning code and branch targets on 16-byte boundaries:

3.4.1.5 - Assembly/Compiler Coding Rule 12. (M impact, H generality)
All branch targets should be 16-byte aligned.

While this is usually good advice, it should be applied carefully.
Blindly 16-byte-aligning everything may lead to a performance loss, so it should be tested on each branch target before being applied.

As TheCodeArtist pointed out, using multi-byte NOPs may help here, as simply using standard one-byte NOPs may not bring the expected performance gain from code alignment.

As a side note, the .p2align directive is not available in NASM or YASM.
But they do support padding with instructions other than NOPs via the standard align directive:

align 16, xor rax, rax
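
If I'm not mistaken, recent NASM versions also ship a smartalign macro package that makes the plain align directive pad with multi-byte NOPs instead of a run of single-byte 90h NOPs, and without clobbering a register the way an instruction filler does. A minimal sketch, assuming that package is available:

%use smartalign          ; enable NASM's smart ALIGN macro package
alignmode p6             ; pad with long (multi-byte) NOP encodings

align 16                 ; now emits one or two long NOPs instead of up to 15 x 90h
.label:
    ; ... code ...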

2 - Branch prediction

This turned out to be the most important part.
While it's true that every generation of x86-64 CPU has a different branch prediction algorithm, some simple rules can be applied generally to help the CPU predict which branch will likely be taken.

The CPU tries to keep a branching history in the BTB (Branch Target Buffer).
But when branch information is not available in the BTB, the CPU falls back to what is called static prediction, which obeys simple rules, as mentioned in Intel's manuals:

  1. Predict forward conditional branches to be not taken.
  2. Predict backward conditional branches to be taken.

Here's an example for the first case:

test rax, rax
jz   .label

; Fallthrough - Most likely

.label:

    ; Forward branch - Most unlikely

The instructions under .label are on the unlikely path, because .label is declared after the actual branch.

For the second case:

.label:

    ; Backward branch - Most likely

test rax, rax
jz   .label

; Fallthrough - Most unlikely

Here, the instructions under .label are on the likely path, as .label is declared before the actual branch.

So each conditional branch should always follow this simple pattern.
And of course, this is also suitable for loops.
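
For a loop, this simply means placing the body before the conditional branch, so the backward branch is the one taken on every iteration except the last (a purely illustrative sketch):

mov  rcx, 1000      ; iteration count - illustrative only

.loop:

    ; ... Loop body - Most likely ...

    dec  rcx
    jnz  .loop      ; Backward branch - predicted taken, correct until the last iteration

; Fallthrough on loop exit - Most unlikely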

As I mentioned before, this was the most important part.

I was experiencing unpredictable performance gains or losses while adding simple tests that should logically improve the overall performance.
Sticking blindly to these rules solved the issues.
Otherwise, adding a branch for optimisation purposes may have the opposite result.

TheCodeArtist also mentions loop unrolling in his answer.
While this wasn't the issue, as my loops were already unrolled, I mention it here as it's indeed extremely important, and brings substantial performance gains.

And as a last note for the readers, while this may seem obvious and wasn't the problem here, don't branch when it's unnecessary.

Starting with the Pentium Pro, x86 processors have conditional move instructions, which may help eliminate branching and remove the risk of misprediction:

test   rax, rax
cmovz  rbx, rcx
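
For instance, a branchless maximum of two values (a minimal sketch, with an arbitrary register choice), compared with its branchy equivalent:

; branchless: rax = max(rax, rbx), nothing to mispredict
cmp    rax, rbx
cmovl  rax, rbx       ; if rax < rbx (signed), take rbx instead

; branchy equivalent, at the mercy of the predictor:
cmp    rax, rbx
jge    .keep
mov    rax, rbx
.keep: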

So it's a nice thing to keep in mind, just in case.

Delitescent answered 17/8, 2013 at 10:41 Comment(5)
While your and TCA's answers are good general principles, the deeper question is when these rules actually apply. In general, this can't be answered without (lots of) reference to the details of the target CPU. While avoiding branch misprediction is critical, this loop should be predicted correctly on every iteration but the exit, regardless of which way you jump. I think your real issue with alignment is with instruction decoding and the micro-op loop buffer. Are you possibly testing this on an older processor? Could you post your full code? I think more exploration might be interesting. – Arnoldarnoldo
"All branch targets should be 16-byte aligned." This coding rule seems to have been removed in the May 2020 Intel® 64 and IA-32 Architectures Optimization Reference Manual, and perhaps earlier. – Gaptoothed
Anyone have any idea why? – Gaptoothed
@Olsonist: Because modern CPUs with a uop cache care about 32-byte boundaries, but that's too wide to be worth padding for. Better to just go for density within functions, often including the tops of loops. And definitely for branches implementing "if"/"else" logic that are only jumped to once per call to the function. – Scorper
BTW, aligning code and aligning branch targets on 16-byte boundaries are two different things. I remember Intel used to recommend NOT letting instructions overlap 16-byte boundaries. Maybe that's faulty memory, but they now say "The front end can fetch 16 bytes of instructions per cycle." NB that isn't 16 aligned bytes. So Intel is noticeably relaxing their recommendations. As for LCPs, they mention they're not a problem in the LSD because "No LCP penalties, as the pre-decode stage has already been passed." So for loops, they're not a problem. Are they an advantage? Only testing will tell. – Gaptoothed

To get a better understanding of why and how alignment matters, check out Agner Fog's microarchitecture doc, especially the section about the instruction-fetch front end of various CPU designs. Sandybridge introduced the uop cache, which makes a huge difference to throughput, especially in SSE code where instruction lengths are often too long for a 16B-per-cycle fetch to cover 4 instructions.

The rules for filling uop cache lines are complicated, but a new block of 32B of instructions always starts a new cache line, IIRC. So aligning hot function entry points to 32B is a good idea. That much padding in other cases might hurt I$ density more than it helps. (The L1 I$ still has 64B cache lines, though, so some things might hurt L1 I$ density while helping uop cache density.)

The loop buffer helps too, but taken branches disrupt the 4 uops per cycle, especially before Haswell. e.g. a loop of 3 uops executes like abc, abc, not abca, bcda on SnB/IvB. So a 5-uop loop goes at one iteration per 2 cycles, not one per 1.25. This makes unrolling even more valuable. (Haswell and later seem to unroll tiny loops in the LSD, making a 5-uop loop a lot less bad: Is performance reduced when executing loops whose uop count is not a multiple of processor width?)

Scorper answered 27/7, 2015 at 0:18 Comment(1)
I am having issues with this now. It's more complicated than I thought. I will have to ask a question about it. – Draggletailed

The "branch targets should be 16 byte aligned rule" is not an absolute. The reason for the rule is that with 16 byte alignment, 16 bytes of instructions can be read in one cycle, and then another 16 bytes in the next cycle. If your target is at offset 16n + 2, then the processor can still read 14 bytes of instructions (the remainder of the cache line) in one cycle, and that is often good enough. Starting a loop at offset 16n + 15 however is a bad idea, since only one instruction byte can be read at a time. More useful is to keep the whole loop in the smallest number of cache lines possible.

On some processors, branch prediction has the odd behaviour that all branches within the same 4- or 8-byte block share the same branch predictor entry. Move branches apart so that each conditional branch uses its own predictor entry.

What both of these have in common is that inserting some bits of code can change the behaviour and make it faster or slower.

Ellanellard answered 26/7, 2015 at 23:28 Comment(0)
