I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE-2 instructions.
So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.
For instance, adding or even removing a few simple instructions, or simply reorganising some local labels used as jump targets, can completely degrade the overall performance, even though nothing in the code itself explains it.
So my guess is that there are some issues with code alignment, and/or with branches that get mispredicted.
I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.
But is there any general advice, when developing for high performance on x86-64, about code alignment and branch prediction?
In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?
_func:
; ... Some code ...
test rax, rax
jz .label
; ... Some code ...
ret
.label:
; ... Some code ...
ret
In the previous code, should I use an align directive before .label, like this:
align 4
.label:
If so, is it enough to align on a DWORD when using SSE-2?
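To make the question concrete, here is a small sketch that should assemble as-is, with the branch target padded to 16 bytes instead of 4. It assumes NASM syntax; `_func_sketch` and its body are hypothetical placeholders, and 16 is only a value to experiment with, not a known answer. In a code section NASM fills the gap with NOPs, so the alignment costs code size even though the padding here sits after a ret and is never executed.

```nasm
; Hypothetical example: names and bodies are placeholders, not real library code.
        global  _func_sketch
        section .text

_func_sketch:
        mov     rax, rdi
        test    rax, rax
        jz      .label              ; forward jump to the rarely taken path
        mov     eax, 1              ; common path falls through
        ret

        align   16                  ; pad with NOPs up to the next 16-byte boundary
.label:
        xor     eax, eax
        ret
```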
And about branch prediction, is there a «preferred» way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?
EDIT
Ok, here's a concrete example - here's the start of strlen() with SSE-2:
_strlen64_sse2:
mov rsi, rdi
and rdi, -16
pxor xmm0, xmm0
pcmpeqb xmm0, [ rdi ]
pmovmskb rdx, xmm0
; ...
Running it 10'000'000 times with a 1000 character string gives about 0.48 seconds, which is fine.
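For anyone wanting to reproduce this, a complete routine in the same style could look like the sketch below. It is a minimal, assumed version (NASM syntax, System V AMD64 calling convention, hypothetical name `_strlen64_sse2_sketch`), not necessarily the exact code timed above. The elided part is filled in with one common approach: the first aligned block is masked so that bytes before the start of the string are ignored, and the main loop then scans 16 bytes at a time.

```nasm
; Sketch only: one common way to complete the routine, not the code timed above.
        global  _strlen64_sse2_sketch
        section .text

_strlen64_sse2_sketch:
        mov     rsi, rdi            ; keep the original pointer
        mov     ecx, edi
        and     ecx, 15             ; misalignment of the string start (0..15)
        and     rdi, -16            ; round down to a 16-byte boundary
        pxor    xmm0, xmm0          ; xmm0 = 16 zero bytes
        pcmpeqb xmm0, [rdi]         ; 0xFF in every byte lane that holds a NUL
        pmovmskb edx, xmm0          ; one mask bit per byte lane
        shr     edx, cl             ; clear the bits for bytes that
        shl     edx, cl             ; precede the start of the string
        test    edx, edx            ; shifts by 0 leave flags unchanged, so test
        jnz     .found

        align   16                  ; assumed alignment for the loop head
.loop:
        add     rdi, 16             ; next aligned 16-byte block
        pxor    xmm0, xmm0
        pcmpeqb xmm0, [rdi]
        pmovmskb edx, xmm0
        test    edx, edx
        jz      .loop               ; no NUL in this block, keep scanning

.found:
        bsf     edx, edx            ; index of the first NUL within the block
        lea     rax, [rdi + rdx]    ; address of the terminating NUL
        sub     rax, rsi            ; length = NUL address - string start
        ret
```

Because every load is 16-byte aligned, it never crosses a page boundary, which is why reading a few bytes before the string is safe here.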
But it does not check for a NULL string input. So obviously, I'll add a simple check:
_strlen64_sse2:
test rdi, rdi
jz .null
; ...
With the same test, it now runs in 0.59 seconds. But if I align the code after this check:
_strlen64_sse2:
test rdi, rdi
jz .null
align 8
; ...
The original performance is back. I used 8 for alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align, or not to align, code sections?
EDIT 2
Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in some specific cases like the one above.
2E and 3E). – Eaten
3.4.1.5 Code Alignment which says "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned." The whole section 3.4.1 is worth reading, of course. – Portcullis
aligning everything will stuff a lot of NOPs and make the actual code a lot more sparse in memory; reducing the possibility that local objects are on the same cache-line i.e. lesser cache-hits? Just a thought. Maybe cachegrind will be of help here?... – Crutch
grep -rl "SSE" sysdeps/x86_64/ | uniq -u for 88 different commonly used functions, SSE optimised for x86_64. – Crutch