Branch prediction and branch target prediction optimization

My code makes frequent calls to a function with multiple (unpredictable) branches. When I profiled, I found that this is a major bottleneck, with the majority of CPU time spent on the conditional JMPs.

Consider the following two functions, where the original has multiple explicit branches.

void branch_example_original(void* mem, size_t s)
{
    if(!(s & 7)) {
        /* logic in _process_mem_64 inlined */
    }
    else if(!(s & 3)) {
        /* logic in _process_mem_32 inlined */
    }
    else if(!(s & 1)) {
        /* logic in _process_mem_16 inlined */
    }
    else {
        /* logic in _process_mem_8 inlined */
    }
}

Here is the new function, where I attempted to remove branches causing the bottleneck.

typedef void (*fprocess_mem)(void*, size_t);  /* assumed signature of the _process_mem_* helpers */

void branch_example_new(void* mem, size_t s)
{
    const fprocess_mem mem_funcs[] = {
        _process_mem_8, _process_mem_16, _process_mem_32, _process_mem_64
    };
    /* magic: 0 if s is odd, 1 if s % 4 == 2, 2 if s % 8 == 4, 3 if s % 8 == 0 */
    const uint32_t magic = 3 - !!(s & 7) - !!(s & 3) - !!(s & 1);
    mem_funcs[magic](mem, s >> magic);
}

However, when I profiled the new code, the performance increased by only ~20%, and the CALL itself (to a func in the mem_funcs array) took a very long time.

Is the second variation simply a more implicit conditional, as the CPU still cannot predict the function that will be called? Am I correct in assuming that this has to do with branch target prediction?

Why does this happen, and are there other solutions to this?

Edit:

Thanks for the ideas, but I'd like an explanation of why this happens as well.

Morningglory answered 29/8, 2015 at 21:34 Comment(4)
This looks like a function that deals with aligned/unaligned memory addresses. Can you do something to guarantee alignment? Do you know which path is taken most frequently? Can you predict the alignment at the callsite (e.g. if you know your memory block is 64-byte aligned)?Egoist
It does deal with aligned/unaligned memory, but I have no way of guaranteeing size or alignment in this case.Morningglory
@nneonneo: Even if you can't guarantee the alignment or size, you can usually do a byte-at-a-time intro until you're aligned, then vectors until you're within 15B of the end, then byte-at-a-time cleanup. So you're doing big aligned chunks most of the time, with scalar setup/cleanup.Truckle
Duff's device? Or a derivative thereof.Berard

Is the second variation simply a more implicit conditional, as the CPU still cannot predict the function that will be called? Am I correct in assuming that this has to do with branch target prediction?

Yes, unconditional indirect branches require a branch-target-buffer hit for the CPU to figure out where to fetch code from next. Modern CPUs are heavily pipelined, and need to be fetching code well ahead of where they're executing if they're going to avoid bubbles in the pipe where they don't have anything to do. Having to wait until magic is calculated is far too late to avoid an instruction fetch bubble. Performance counters will show BTB misses as a branch mispredict, I think.

As I suggested in a comment, if you can, you should restructure your code to do a scalar intro and cleanup around a vectorized loop. The intro handles elements up until you reach an aligned element. The cleanup loop handles the case where there's a non-zero number of elements left to process after the last full vector. Then you're not stuck doing a scalar loop just because the size or alignment of the first element wasn't ideal.
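
A minimal sketch of that structure, assuming SSE2 and using an in-place byte-doubling loop as a stand-in for the real per-element work (the function name and the operation are placeholders, not your _process_mem_* code):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

void double_bytes(uint8_t *buf, size_t n)
{
    size_t i = 0;

    /* scalar intro: advance until buf+i is 16-byte aligned (or we run out of data) */
    for (; i < n && ((uintptr_t)(buf + i) & 0xf); i++)
        buf[i] *= 2;

    /* main loop: one aligned 16B vector per iteration */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i)); /* movdqa */
        v = _mm_add_epi8(v, v);                                  /* x*2 == x+x, per byte */
        _mm_store_si128((__m128i *)(buf + i), v);
    }

    /* scalar cleanup: the last 0..15 bytes */
    for (; i < n; i++)
        buf[i] *= 2;
}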


Depending on what you're processing, if it's OK to repeat work and overlap, then you can make a branchless startup that does an unaligned chunk, then does the rest aligned. Some libraries probably implement memset something like this:

// not shown: check that count >= 16
endp = dest + count;
unaligned_store_16B( dest );    // e.g. x86 movdqu
dest+=16;
dest &= ~0xf;  // align by 16, first aligned write overlaps by up to 15B
for ( ; dest < endp-15 ; dest+=16) {
    aligned_store_16B( dest );  // e.g. x86 movdqa
}
// handle the last up-to-15 bytes from dest to endp similarly.

This makes handling the unaligned start of the loop branchless, because you don't care how much the unaligned start overlapped.
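
As a concrete illustration of the same idea with SSE2 intrinsics (the function name is made up, and count >= 16 is still assumed, as in the pseudocode); the tail is handled the same way, with one more overlapping unaligned store ending exactly at endp:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

void fill_overlapping(uint8_t *dest, uint8_t val, size_t count)   /* requires count >= 16 */
{
    const __m128i v = _mm_set1_epi8((char)val);
    uint8_t *endp = dest + count;

    _mm_storeu_si128((__m128i *)dest, v);              /* movdqu: unaligned first 16B */
    dest = (uint8_t *)(((uintptr_t)dest + 16) & ~(uintptr_t)0xf);  /* align by 16; overlaps by up to 15B */

    for (; dest + 16 <= endp; dest += 16)
        _mm_store_si128((__m128i *)dest, v);           /* movdqa: aligned stores */

    if (dest < endp)                                   /* last 1..15 bytes */
        _mm_storeu_si128((__m128i *)(endp - 16), v);   /* overlapping store ending at endp */
}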

Note that most one-buffer functions aren't repeatable. In-place a[i] *= 2, or sum += a[i], need to avoid processing the same input twice, so you usually handle the start with a scalar loop until you get to an aligned address. Idempotent operations like a[i] &= 0x7f, or maxval = max(a[i], maxval), are the exception, since reprocessing an element gives the same result.


Functions with two independent pointers that can be misaligned by different amounts are trickier. You have to be careful not to change their relative offset with masking. memcpy is the simplest example of a function that processes data from a src to a dest buffer. memcpy has to work if (src+3) % 16 == 0 and (dest+7) % 16 == 0. Unless you can put constraints on the callers, the best you can do in general is have either every load or every store aligned in the main loop.

On x86, the unaligned move instructions (movdqu and friends) are just as fast as the alignment-required version when the address is aligned. So you don't need a separate version of the loop for the special case when src and dest have the same (mis)alignment, and the loads and stores can both be aligned. IIRC, this is true for Intel Nehalem and newer CPUs, and for recent AMD.

// check count >= 16
endp = dest + count;
unaligned_copy_16B( dest, src );  // load with movdqu, store with movdqu
// src+=16; dest+=16;  // combine this with aligning dest, below

dest_misalign = dest & 0xf;  // number of bytes the first aligned iteration will overlap
src  += 16 - dest_misalign;  // src potentially still misaligned
dest += 16 - dest_misalign;  // dest aligned

for ( ; dest <= endp-16 ; src+=16, dest+=16) {
    tmpvec = unaligned_load_16B( src ); // x86 movdqu is fast if src is aligned
    aligned_store_16B( dest, tmpvec );  // x86 movdqa
}
// handle the last dest to endp bytes.
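
The same loop written out with SSE2 intrinsics, as a rough sketch (the function name is invented, count >= 16 is assumed, and the tail handling stays a comment, as in the pseudocode):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

void copy_aligned_dest(uint8_t *dest, const uint8_t *src, size_t count)  /* requires count >= 16 */
{
    uint8_t *endp = dest + count;

    /* first 16B: unaligned load + unaligned store */
    _mm_storeu_si128((__m128i *)dest, _mm_loadu_si128((const __m128i *)src));

    /* advance both pointers by the same amount so dest becomes 16B-aligned;
       their relative offset must not change */
    size_t adjust = 16 - ((uintptr_t)dest & 0xf);    /* 1..16 */
    src  += adjust;                                  /* src may still be misaligned */
    dest += adjust;                                  /* dest now aligned */

    for (; dest + 16 <= endp; src += 16, dest += 16) {
        __m128i tmp = _mm_loadu_si128((const __m128i *)src); /* movdqu: fast even if src is aligned */
        _mm_store_si128((__m128i *)dest, tmp);                /* movdqa */
    }
    /* handle the last dest..endp bytes: scalar copy, or one overlapping
       unaligned load+store of the 16B ending at endp */
}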

An aligned dest is probably more likely than an aligned source. No overlapping repeated work happens when the pointer we align is already aligned.

If you aren't doing memcpy, it can be an advantage to have src aligned so the load can fold into another instruction as a memory operand. This saves an instruction, and in many cases also saves an Intel uop internally.

For the case where src and dest have different alignments, I haven't tested whether it's faster to do aligned loads and unaligned stores, or the other way around. I picked aligned stores because of potential store->load forwarding benefits for short buffers. If the dest buffer is aligned, and only a couple vectors long, and will be read again right away, then aligned loads from dest will stall for ~10 cycles (Intel SnB) if the load crosses a boundary between two preceding stores that haven't made it to L1 cache yet. (i.e. store forwarding fails). See http://agner.org/optimize/ for info on low-level details like this (esp. the microarch guide.)

Store forwarding from memcpy to loads in the next loop is only going to happen if the buffers are small (maybe up to 64B?), or if your next loop starts reading from the end of the buffer (which will still be in cache even if the beginning has already been evicted). Otherwise, the stores to the start of the buffer will have made it from a store buffer to L1, so store-forwarding won't come into play.

It's possible that for large buffers with different alignments, aligned loads and unaligned stores will do better. I'm just making stuff up here, but this could be true if unaligned stores can retire quickly even when they cross a cache-line or page boundary. Of course unaligned loads can't retire until the data is actually loaded. With more load/store instructions in flight, there's less chance of a cache miss stalling things. (You're potentially taking advantage of more of the CPU's load/store buffers.) Again, pure speculation. I tried to google whether unaligned stores were better or worse than unaligned loads, but just got hits about how to do them, and about misalignment penalties that apply to both.

Truckle answered 29/8, 2015 at 22:10 Comment(6)
Thanks for the explanation! I also tried implementing your solution (minus the loop as I'm not doing a copy), and it speeds it up quite a bit for small blocks, even with the overhead of initialization.Morningglory
If memcpy is what you want, then on many systems calling memcpy is the fastest method. For example, MacOS X will at boot time set up memcpy code that is optimised for the particular processor in your computer, and that does optimisations you don't even understand.Oletta
@gnasher729: I'm using memcpy as an easy-to-understand example of a re-doable operation, not as something you would actually want to implement yourself.Truckle
+=15 would be better?Cower
@user3528438: You mean dest+=15 after the unaligned copy, instead of +=16 and mask? No, you can't avoid the mask, since you need to get aligned pointers out of any of 16 possible input alignments, including actually aligned in the first place. If you did +=15 and then mask, you'd recopy the first 16B in the (hopefully common) case where the inputs are aligned.Truckle
@user3528438: Thinking about this made me realize I had a bug when src and dest had different alignments. Fixed now. I added memset as a simple example of a one-buffer function that can re-process bytes. (Most in-place functions aren't repeatable, though).Truckle

You could try something like this:

switch(s & 7) {
case 0:
    /* _process_mem_64 */
    break;
case 1:
case 3:
case 5:
case 7:
    /* _process_mem_8 */
    break;
case 2:
case 6:
    /* _process_mem_16 */
    break;
case 4:
    /* _process_mem_32 */
    break;
}

This involves only a single jump into a jump table, and does not require a call instruction.

Egoist answered 29/8, 2015 at 21:38 Comment(1)
Thanks, but it was the code that caused a branch, rather than the call instruction. In my case, removing it resolved the problem.Morningglory

A modern processor doesn't only have branch prediction; it also has jump prediction. For example, if you call a virtual function, it may predict that the actual function is the same as in the previous call and start execution before the pointer to the function is actually read - if the jump prediction was wrong, things get slow.

The same thing happens in your code. You don't rely on branch prediction anymore, but the processor uses jump prediction to predict which of the four function pointers is called, and this slows things down when the call target is unpredictable.

Oletta answered 29/8, 2015 at 22:32 Comment(1)
That's the same as what I said in my first paragraph. It's the branch-target-buffer that has to predict correctly which location the indirect branch will jump to, or you get a stall. Unconditional jumps are still called branches, in CPU terminology, because they disrupt the sequential flow of instruction bytes into the pipeline.Truckle
