I decided to rerun the test on my own machine using Lik32 code. I had to change it due to my windows or compiler thinking high resolution is 1ms, using
mingw32-g++.exe -O3 -Wall -std=c++11 -fexceptions -g
vector<int> rand_vec(10000000);
GCC has made the same transformation on both original codes.
Note that only the two first conditions are tested as the third must always be true, GCC is a kind of a Sherlock here.
Reverse
.L233:
mov DWORD PTR [rsp+104], 0
mov DWORD PTR [rsp+100], 0
mov DWORD PTR [rsp+96], 0
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov rax, QWORD PTR [rsp+8]
jmp .L219
.L293:
mov edx, DWORD PTR [rsp+104]
add edx, 1
mov DWORD PTR [rsp+104], edx
.L217:
add rax, 4
cmp r14, rax
je .L292
.L219:
mov edx, DWORD PTR [rax]
cmp edx, 94
jg .L293 // >= 95
cmp edx, 19
jg .L218 // >= 20
mov edx, DWORD PTR [rsp+96]
add rax, 4
add edx, 1 // < 20 Sherlock
mov DWORD PTR [rsp+96], edx
cmp r14, rax
jne .L219
.L292:
call std::chrono::_V2::system_clock::now()
.L218: // further down
mov edx, DWORD PTR [rsp+100]
add edx, 1
mov DWORD PTR [rsp+100], edx
jmp .L217
And sorted
mov DWORD PTR [rsp+104], 0
mov DWORD PTR [rsp+100], 0
mov DWORD PTR [rsp+96], 0
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov rax, QWORD PTR [rsp+8]
jmp .L226
.L296:
mov edx, DWORD PTR [rsp+100]
add edx, 1
mov DWORD PTR [rsp+100], edx
.L224:
add rax, 4
cmp r14, rax
je .L295
.L226:
mov edx, DWORD PTR [rax]
lea ecx, [rdx-20]
cmp ecx, 74
jbe .L296
cmp edx, 19
jle .L297
mov edx, DWORD PTR [rsp+104]
add rax, 4
add edx, 1
mov DWORD PTR [rsp+104], edx
cmp r14, rax
jne .L226
.L295:
call std::chrono::_V2::system_clock::now()
.L297: // further down
mov edx, DWORD PTR [rsp+96]
add edx, 1
mov DWORD PTR [rsp+96], edx
jmp .L224
So this doesn't tell us much except that the last case doesn't need a branch predict.
Now I tried all 6 combinations of the if's, the top 2 are the original reverse and sorted. high is >= 95, low is < 20, mid is 20-94 with 10000000 iterations each.
high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns
high, low, mid: 44000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 45000000ns
high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns
high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 46000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 43000000ns
high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns
high, low, mid: 43000000ns
mid, low, high: 48000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 45000000ns
low, high, mid: 45000000ns
high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns
high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns
high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns
high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns
1900020, 7498968, 601012
Process returned 0 (0x0) execution time : 2.899 s
Press any key to continue.
So why is the order high, low, med then faster (marginally)
Because the most unpredictable is last and therefore is never run through a branch predictor.
if (i >= 95) ++nHigh; // most predictable with 94% taken
else if (i < 20) ++nLow; // (94-19)/94% taken ~80% taken
else if (i >= 20 && i < 95) ++nMid; // never taken as this is the remainder of the outfalls.
So the branches will be predicted taken, taken and remainder with
6%+(0.94*)20% mispredicts.
"Sorted"
if (i >= 20 && i < 95) ++nMid; // 75% not taken
else if (i < 20) ++nLow; // 19/25 76% not taken
else if (i >= 95) ++nHigh; //Least likely branch
The branches will be predicted with not taken, not taken and Sherlock.
25%+(0.75*)24% mispredicts
Giving 18-23% difference (measured difference of ~9%) but we need to calculate cycles instead of mispredicting %.
Let's assume 17 cycles mispredict penalty on my Nehalem CPU and that each check takes 1 cycle to issue (4-5 instructions) and the loop takes one cycle too. The data dependencies are the counters and the loop variables, but once the mispredicts are out of the way it shouldn't influence the timing.
So for "reverse", we get the timings (this should be the formula used in Computer Architecture: A Quantitative Approach IIRC).
mispredict*penalty+count+loop
0.06*17+1+1+ (=3.02)
(propability)*(first check+mispredict*penalty+count+loop)
(0.19)*(1+0.20*17+1+1)+ (= 0.19*6.4=1.22)
(propability)*(first check+second check+count+loop)
(0.75)*(1+1+1+1) (=3)
= 7.24 cycles per iteration
and the same for "sorted"
0.25*17+1+1+ (=6.25)
(1-0.75)*(1+0.24*17+1+1)+ (=.25*7.08=1.77)
(1-0.75-0.19)*(1+1+1+1) (= 0.06*4=0.24)
= 8.26
(8.26-7.24)/8.26 = 13.8% vs. ~9% measured (close to the measured!?!).
So the obvious of the OP is not obvious.
With these tests, other tests with more complicated code or more data dependencies will certainly be different so measure your case.
Changing the order of the test changed the results but that could be because of different alignments of the loop start which should ideally be 16 bytes aligned on all newer Intel CPUs but isn't in this case.
if
s is actually a performance improvement is very situational and target dependent, so there may be no good rule-of-thumb. Furthermore, other easier optimization techniques are likely to have a greater effect (such as PGO). I don't think anyone is saying you shouldn't ever reorderif
s to find the optimal ordering, but that you should profile the different arrangements if that little extra bit of performance actually matters to you. – Overanxious(p > 0)
check first regardless of source order: it has heuristics that positive values are more likely. – Synchronism