The problem:
I'm trying to figure out how to write a code (C preffered, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases.
So it has to be a piece of code that "is imune" to compiler optimizations related to branching and also all the HW branch prediction should not go better than 50% (tossing a coin). Even a greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio.
I managed to write a code that goes to 47% branch miss ratio on an x86 platform. I'm suspecting the missing could 3% come from:
- Program launch overhead that has branching in it (very small though)
- Profiler overhead - Basically for each counter read an interrupt is raised so this might add additional predictable branches.
- System calls running in the background that contain loops and predictable branching
I written my own random number generator to avoid calls to a rand whose implementation might have hidden predictable branches. It can use also rdrand when available. Latency does not matter for me.
The questions:
- Can I do better than my version of code? Better means getting a higher branch misspredict and same results for all CPU architectures.
- Can this code be predicated? What would that mean?
The code:
#include <stdio.h>
#include <time.h>
#define RDRAND
#define LCG_A 1103515245
#define LCG_C 22345
#define LCG_M 2147483648
#define ULL64 unsigned long long
ULL64 generated;
ULL64 rand_lcg(ULL64 seed)
{
#ifdef RDRAND
ULL64 result = 0;
asm volatile ("rdrand %0;" : "=r" (result));
return result;
#else
return (LCG_A * seed + LCG_C) % LCG_M;
#endif
}
ULL64 rand_rec1()
{
generated = rand_lcg(generated) % 1024;
if (generated < 512)
return generated;
else return rand_rec1();
}
ULL64 rand_rec2()
{
generated = rand_lcg(generated) % 1024;
if (!(generated >= 512))
return generated;
else return rand_rec2();
}
#define BROP(num, sum) \
num = rand_lcg(generated); \
asm volatile("": : :"memory"); \
if (num % 2) \
sum += rand_rec1(); \
else \
sum -= rand_rec2();
#define BROP5(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum)
#define BROP25(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum)
#define BROP100(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum)
int main()
{
int i = 0;
int iterations = 500000;
ULL64 num = 0;
ULL64 sum = 0;
generated = rand_lcg(0) % 54321;
for (i = 0; i < iterations; i++)
{
BROP100(num, sum);
// ... repeat the line above 10 times
}
printf("Sum = %llu\n", sum);
}
Update v1:
Following the suggestion of usr, I generated various patterns by varying the LCG_C parameter from the command line in a script. I was able to go to 49.67% BP miss. That is enough for my purpose and I have the methodology to produce this on various architectures.
rand
is not meant to be a good RNG. It could be so predictable that the branch predictor is actually able to predict the behaviour in a consistent way. – Ralstonrand
is not conditional and not predicted. This doesn't matter. However, ifrand()
executes a loop 10 times, the branch predictor on the loop condition will score a 90% (!) – Invertaserdrand
is insanely slow, with a throughput of one every couple of hundred cycles. It is also variable in latency. So if you're going to use it, use it to prepare a small array of randomness (small enough to fit in L1), to make sure you're not accidentally benchmarkingrdrand
(or the memory system). – Finfootsum += *arr[i]
, couldn't be optimized that way on most ISAs (except with a maskedgather). Or avolatile
store inside theif
body would also preclude branchless asm. – Hotshot