Is there a code that results in 50% branch prediction miss?
Asked Answered
S

4

26

The problem:

I'm trying to figure out how to write a code (C preffered, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases.

So it has to be a piece of code that "is imune" to compiler optimizations related to branching and also all the HW branch prediction should not go better than 50% (tossing a coin). Even a greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio.

I managed to write a code that goes to 47% branch miss ratio on an x86 platform. I'm suspecting the missing could 3% come from:

  • Program launch overhead that has branching in it (very small though)
  • Profiler overhead - Basically for each counter read an interrupt is raised so this might add additional predictable branches.
  • System calls running in the background that contain loops and predictable branching

I written my own random number generator to avoid calls to a rand whose implementation might have hidden predictable branches. It can use also rdrand when available. Latency does not matter for me.

The questions:

  1. Can I do better than my version of code? Better means getting a higher branch misspredict and same results for all CPU architectures.
  2. Can this code be predicated? What would that mean?

The code:

#include <stdio.h>
#include <time.h>

#define RDRAND
#define LCG_A   1103515245
#define LCG_C   22345
#define LCG_M   2147483648
#define ULL64   unsigned long long

ULL64 generated;

ULL64 rand_lcg(ULL64 seed)
{
#ifdef RDRAND
    ULL64 result = 0;
    asm volatile ("rdrand %0;" : "=r" (result));
    return result;
#else
    return (LCG_A * seed + LCG_C) % LCG_M;
#endif
}

ULL64 rand_rec1()
{
    generated = rand_lcg(generated) % 1024;

    if (generated < 512)
        return generated;
    else return rand_rec1();
}

ULL64 rand_rec2()
{
    generated = rand_lcg(generated) % 1024;

    if (!(generated >= 512))
        return generated;
    else return rand_rec2();
}

#define BROP(num, sum)                  \
    num = rand_lcg(generated);          \
    asm volatile("": : :"memory");      \
    if (num % 2)                        \
        sum += rand_rec1();             \
    else                                \
        sum -= rand_rec2();

#define BROP5(num, sum)     BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum)
#define BROP25(num, sum)    BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum)
#define BROP100(num, sum)   BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum)

int main()
{
    int i = 0;
    int iterations = 500000;    
    ULL64 num = 0;
    ULL64 sum = 0;

    generated = rand_lcg(0) % 54321;

    for (i = 0; i < iterations; i++)
    {
        BROP100(num, sum);
        // ... repeat the line above 10 times
    }

    printf("Sum = %llu\n", sum);
}

Update v1:

Following the suggestion of usr, I generated various patterns by varying the LCG_C parameter from the command line in a script. I was able to go to 49.67% BP miss. That is enough for my purpose and I have the methodology to produce this on various architectures.

Smyth answered 10/3, 2015 at 10:34 Comment(24)
The code at Why is processing a sorted array faster than an unsorted array? is such a micro benchmark. Unless the compiler replaces the code by a branchless equivalent.Norge
@CodesInChaos. Thank you very much. That should do it. I can use the C code and check / enforce that the compiler generated branches.Smyth
@Norge It just came into my mind that the check done to see if each loop has ended are still branches, that are much easier to predict. Also the compiler may reorder the loops so in the end the branch miss goes very low. The code needs many adjustments.Smyth
if you unroll the loop manually a few times (say you check 100 random elements per iteration), the impact of the loop branch would be very small. You can use macro tricks for doing that easily.Intervalometer
@Leeor. Yes. I was just updating the post :) I'm getting 8% branch miss.Smyth
Can you post the relevant assembly code? maybe there are other branches hidden hereIntervalometer
How do you know you're only getting an 8% branch miss? I'm curious what instrumentation tools you're using to determine that.Incantatory
another thought - do you know how rand works internally? You're calling it quite intensivelyIntervalometer
@Incantatory I used perf for linux. To Leeor, regarding the rand function, I think it's a linear congruential generator with no branches in it ... but don't take my word.Smyth
Not sure if it is related, but rand is not meant to be a good RNG. It could be so predictable that the branch predictor is actually able to predict the behaviour in a consistent way.Ralston
If you're interested in timing you also need to make sure that each branch does significant work, otherwise speculative execution might give you a false impression of the overall speed.Metrorrhagia
Inline the rand() call, the rng doesn't have to be good you just have to not be branching to and from it.Antemeridian
@jthill: The call of rand is not conditional and not predicted. This doesn't matter. However, if rand() executes a loop 10 times, the branch predictor on the loop condition will score a 90% (!)Invertase
@Invertase it appears that std::rand() had some branching in it. I replaced it with rdrand. Still, I don't get 50% :)Smyth
@MarkRansom I'm more interested into the number of branch misses. Timing is not very important to me. I tried to control the speculative execution using the barrier.Smyth
@StefanoSanfilippo Yes. I suspect this is the case. I think most compiler builders and HW vendors know how rand is implemented by different languages and tune the prediction accordingly. To avoid that, I'm using rdrand. That should work at least for machines with rdrand.Smyth
If you want to learn something enlightening, print out the first 20 outputs of your LCG, all reduced modulo 2.Eupheemia
@Hurkyl good remark! That's why the 33% branch miss when using the default LCG.Smyth
rdrand is insanely slow, with a throughput of one every couple of hundred cycles. It is also variable in latency. So if you're going to use it, use it to prepare a small array of randomness (small enough to fit in L1), to make sure you're not accidentally benchmarking rdrand (or the memory system).Finfoot
@Finfoot Well it is insanely slow but it seems to do the job. My purpose is to have 50% branch miss rate on all architectures (that's probably the real hard part). However for the code I have now, it appears that LCG does the job quite good.Smyth
@Smyth but that's the thing, it doesn't really do the job. Oh it generates randomness, but it also destroys the experiment.Finfoot
@Finfoot probably the confusing thing is that i measure time in the code. Actually i can just remove that. I measure branch prediction miss with perf. Having a true RNG is beneficial for the experiment. Why do you say it destroys it?Smyth
@Smyth OK, if you use the performance counters it should be good.Finfoot
@CodesInChaos: Just for the record, the 2021 followup question to the branch prediction Q&A, Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang? , shows that C compilers do in fact auto-vectorize into branchless asm, because they easily can. A slight variation, like an array of pointers, a random 50% of which are NULL, else sum += *arr[i], couldn't be optimized that way on most ISAs (except with a maskedgather). Or a volatile store inside the if body would also preclude branchless asm.Hotshot
W
8

If you know how the branch predictor works you can get to 100% misprediction. Just take the expected prediction of the predictor each time and do the opposite. The problem is that we don't know how it is implemented.

I have read that typical predictors are able to predict patters such as 0,1,0,1 and so on. But I'm sure there is a limit to how long the pattern can be. My suggestion would be to try each and every pattern of a given length (such as 4) and see which one comes closest to your target percentage. You should be able to target both 50% and 100% and come very close. This profiling needs to be done for each platform once or at runtime.

I doubt that 3% of the total number of branches are in system code like you said. The kernel does not take 3% overhead on purely CPU bound user code. Increase the scheduling priority to the maximum.

You can take the RNG out of the game by generating random data once and iterating over the same data many times. The branch predictor is unlikely to detect this (although it clearly could).

I would implement this by filling a bool[1 << 20] with a zero-one pattern like I described. Then, you can run the following loop over it many times:

int sum0 = 0, sum1 = 0;
for (...) {
 //unroll this a lot
 if (array[i]) sum0++;
 else sum1++;
}
//print both sums here to make sure the computation is not being optimized out

You'll need to examine the disassembly to make sure that the compiler did not do anything clever.

I don't see why the complicated setup that you have right now is necessary. The RNG can be taken out of the question and I don't see why more than this simple loop is needed. If the compiler is playing tricks you might need to mark the variables as volatile which makes the compiler (better: most compilers) treat them as if they were external function calls.

Since the RNG now no longer matters since it is almost never called you can even invoke the cryptographic RNG of your OS to get numbers that are indistinguishable (to any human) from true random numbers.

Wag answered 14/3, 2015 at 22:18 Comment(4)
Thank you very much for your response. I opted to leave the RNG in the code but I followed your advice and generated multiple patterns by varying LCG. I can now observe sweet spots and low prediction spots. Take a look at my update. 50% is all what I needed. Filling the buffer with bools and generating the patterns would have complicated the setup in order to remove all predictable branches.Smyth
One problem is that the branch predictor might start in an unpredictable random state, so a series that ends up with 100% misprediction on one run of your process or test code might have 50% or 0% in the next one. This was less common with simpler predictors, but with the more modern predictors with lots of shared state and meta-predictors which decide how to make the prediction, it sometimes becomes hard to reproduce.Orsay
Modern predictors using TAGE (e.g., recent Intel) have a history length of about ~20ish branches, so will predict most repetitive patterns of about that length perfectly. Beyond that, they will still predict repetitive random patterns of much longer lengths almost perfectly since they are effectively using the last ~20 branches as a key into the history table. That's at least ~1,000,000 unique keys, so in principle patterns with periods up to say half that amount could predicted well since most keys will be "unique".Orsay
... of course, the actual predictors don't have enough storage to actually maintain entries for 1 million unique histories, so in practice you'll see degrading performance once you start to hit the capacity of the branch predictor - but you can't really characterize that in terms of "branch history length".Orsay
R
3

Fill an array with bytes, and write a loop that checks each byte and branches depending on the value of the byte.

Now examine the architecture of your processor and its branch prediction very carefully. Fill the initial bytes of the array so that after examining them, the processor is in a predictable known state. From that known state, you can find out whether the next branch is predicted taken or not. Set the next byte so that the prediction is wrong. Again, find out whether the next branch is predicted taken or not, and set the next byte so that the prediction is wrong and so on.

If you disable interrupts as well (which could change the branch prediction), you can come close to 100% mispredicted branches.

As a simple case, on an old PowerPC processor with strong/weak prediction, after three taken branches it will always be in the state "strong taken" and one branch not taken changes it to "weak taken". If you now have a sequence of alternating not taken / taken branches, the prediction is always wrong and switches between weak not taken and weak taken.

This will of course only work with that particular processor. Most modern processors would see that sequence as almost 100% predictable. For example, they might use two separate predictors; one for the case "last branch was taken" and one for the case "last branch was not taken". But for such a processor, a different sequence of bytes will give the same 100% misprediction rate.

Reuter answered 11/3, 2015 at 11:54 Comment(2)
Hmm... The thing is that I need a generic code, one that would statistically generate 50% branch miss on all architectures. I also wonder, if I turn off the interrupts, then I cannot measure branch related counters... right?Smyth
Thank you again. Your response was also correct but usr's one was a bit more detailed and voted by the viewers.Smyth
I
0

The easiest way to avoid compiler optimizations is to have void f(void) { } and void g(void) { } dummy functions in another Translation Unit, and have link-time optimizations disabled. This will force if (*++p) f(); else g(); to be a real unpredictable branch, assuming that p points to an array of random booleans (This sidesteps the branch prediction problem inside rand() - just do that before the measurement)

If a for(;;) loop gives you problems, just throw in a goto.

Note that the "loop unrolling trick" in the comment is somewhat misleading. You're essentially creating thousands of branches. Each branch would be individually predicted, except that it's likely none of them will be predicted as the CPU simply cannot hold thousands of distinct predictions. This may or may not be a benefit for your real goal.

Invertase answered 10/3, 2015 at 23:27 Comment(5)
I believe that your example is in fact perfectly predictable. It's an alternating on/off pattern.Xyster
@ZanLynx: It fully depends on the array of random data that p points to. Even if a compiler used two conditional branches (which is a bad implementation), both branches would depend solely on the last value of p which makes both predictions equally pointless.Invertase
Thanks for your answer. So you are suggesting to have 2 functions f and g in something like a shared library, and call them randomly. This might work. I'll give it a try. Regarding goto, I still need to get out of the emulated loop so i need to check something with a branch.Smyth
One more thing. You said that manually unrolling the loop can make the cpu overflow it's branch target buffer. I'm wondering if this is the case for branches executed only once. I think that in my case, a new branch would just ocupy an entry of a branch that has been evicted because it has no history.Smyth
@VAndrei: Don't try to get out of the loop. I did mean to write an infinite loop. Call TerminateThread or whatever your OS uses from another monitoring thread.Invertase
D
0

Try this with/without RAND_BRANCH difined:

/* gcc -DRAND_BRANCH -O0 -g -o br-miss br-miss.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int number[1000000000];

int test_branch(void)
{
        int x = 0;

        for (int i = 0; i < sizeof(number)/sizeof(int); i++) {
                if (number[i] > 0) {
                        x++;
                };
        }

        return x;
}

long difftimespec_ns(const struct timespec start, const struct timespec end)
{
    return ((int64_t)end.tv_sec - (int64_t)start.tv_sec) * (int64_t)1000000000
        + ((int64_t)end.tv_nsec - (int64_t)start.tv_nsec);
}

int main(int argc, char *argv[])
{
        for (int i = 0; i < sizeof(number)/sizeof(int); i++) {
#ifdef RAND_BRANCH
                number[i] = rand() % 2 ? 0 : 1;
#else
        number[i] = i % 2 ? 0 : 1;
#endif
        }

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        test_branch();
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("test_branch takes %ld ns\n", difftimespec_ns(start, end));

        return 0;
}

Without RAND_BRANCH defined, close to ~100% hit. The TNTNTN... pattern can be predicted by modern CPUs.

$ ./br-miss
test_branch takes 467987135 ns

With RAND_BRANCH defined, close to ~50% miss. There's no pattern. The execution performance is 8.8 times slower.

$ ./br-miss
test_branch takes 4130685513 ns
Discounter answered 23/6, 2023 at 2:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.