Is the second variation simply a more implicit conditional, as the CPU
still cannot predict the function that will be called? Am I correct in
assuming that this has to do with branch target prediction?
Yes, unconditional indirect branches require a branch-target-buffer hit for the CPU to figure out where to fetch code from next. Modern CPUs are heavily pipelined, and need to be fetching code well ahead of where they're executing if they're going to avoid bubbles in the pipe where they don't have anything to do. Having to wait until magic is calculated is far too late to avoid an instruction-fetch bubble. Performance counters will show BTB misses as a branch mispredict, I think.
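For concreteness, the pattern in question looks something like the sketch below. The original code isn't shown here, so the names are made up (only magic follows the question's wording): the target of the call isn't known until magic has been computed, so the front-end can only keep fetching by guessing the target from the BTB.

void handler_a(int);           // hypothetical handlers
void handler_b(int);
static void (*handlers[2])(int) = { handler_a, handler_b };

void dispatch(int x, int arg)
{
    int magic = x & 1;         // target not known until this is computed
    handlers[magic](arg);      // unconditional indirect call: needs a BTB hit to predict the target
}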
As I suggested in a comment, if you can, you should restructure your code to do a scalar intro and cleanup around a vectorized loop: the intro handles elements until you reach an aligned address, and the cleanup loop handles any nonzero number of elements left over after the last full vector. That way you're not stuck doing a scalar loop for the whole thing just because the size or alignment of the first element wasn't ideal.
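A minimal sketch of that structure, using SSE2 intrinsics and the in-place a[i] *= 2 operation mentioned further down as the example (double_all is a made-up name; assumes 4-byte int with the usual alignment of an int array):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

void double_all(int *a, size_t n)
{
    size_t i = 0;

    // scalar intro: handle elements until the address is 16B-aligned
    for ( ; i < n && ((uintptr_t)(a + i) & 0xf) != 0; i++)
        a[i] *= 2;

    // vectorized main loop: 4 ints per 16B vector, aligned loads and stores
    for ( ; i + 4 <= n; i += 4) {
        __m128i v = _mm_load_si128((__m128i *)(a + i));
        _mm_store_si128((__m128i *)(a + i), _mm_add_epi32(v, v));   // v + v == v * 2
    }

    // scalar cleanup: the 0..3 elements left after the last full vector
    for ( ; i < n; i++)
        a[i] *= 2;
}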
Depending on what you're processing, if it's ok to repeat work and overlap, then you can make a branchless startup that does an unaligned chunk first, then the rest aligned. Some libraries probably implement memset something like this:
// not shown: check that count >= 16
endp = dest + count;
unaligned_store_16B( dest );    // e.g. x86 movdqu
dest += 16;
dest &= ~0xf;                   // align by 16; the first aligned write overlaps the unaligned one by up to 15B
for ( ; dest < endp-15 ; dest += 16) {
    aligned_store_16B( dest );  // e.g. x86 movdqa
}
// handle the last up-to-15 bytes, from dest to endp, similarly.
This makes handling the unaligned start of the buffer branchless, because it doesn't matter by how much the first aligned store overlaps the unaligned one.
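As a concrete (hedged) example, here is the same idea with SSE2 intrinsics, also handling the tail with one final overlapping unaligned store. fill16 is a made-up name, and like the pseudocode it assumes count >= 16:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void fill16(char *dest, char val, size_t count)
{
    __m128i v = _mm_set1_epi8(val);
    char *endp = dest + count;

    _mm_storeu_si128((__m128i *)dest, v);             // unaligned first store (movdqu)
    dest = (char *)(((uintptr_t)dest + 16) & ~(uintptr_t)0xf);  // next 16B boundary

    for ( ; dest <= endp - 16; dest += 16)
        _mm_store_si128((__m128i *)dest, v);          // aligned stores (movdqa)

    _mm_storeu_si128((__m128i *)(endp - 16), v);      // final store ends exactly at endp,
                                                      // overlapping the last aligned store by up to 15B
}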
Note that most one-buffer functions aren't repeatable, though: e.g. in-place a[i] *= 2, or sum += a[i], need to avoid processing the same input twice, usually with a scalar loop until you reach an aligned address. a[i] &= 0x7f, or maxval = max(a[i], maxval), are exceptions.
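For one of those exception cases, the overlapping trick from the memset sketch works unchanged, because re-ANDing a byte with 0x7f a second time is harmless. mask_high_bits is a made-up name; assumes n >= 16:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void mask_high_bits(uint8_t *a, size_t n)
{
    __m128i mask = _mm_set1_epi8(0x7f);
    uint8_t *endp = a + n;

    // unaligned first chunk; re-processing some of these bytes later is fine
    _mm_storeu_si128((__m128i *)a,
                     _mm_and_si128(_mm_loadu_si128((const __m128i *)a), mask));
    a = (uint8_t *)(((uintptr_t)a + 16) & ~(uintptr_t)0xf);

    for ( ; a <= endp - 16; a += 16)
        _mm_store_si128((__m128i *)a,
                        _mm_and_si128(_mm_load_si128((const __m128i *)a), mask));

    // unaligned final chunk; overlapping is again harmless
    _mm_storeu_si128((__m128i *)(endp - 16),
                     _mm_and_si128(_mm_loadu_si128((const __m128i *)(endp - 16)), mask));
}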
Functions with two independent pointers that can be misaligned by different amounts are trickier. You have to be careful not to change their relative offset with masking. memcpy is the simplest example of a function that processes data from a src buffer to a dest buffer. memcpy has to work if (src+3) % 16 == 0 and (dest+7) % 16 == 0. Unless you can put constraints on the callers, the best you can do in general is have either every load or every store aligned in the main loop.
On x86, the unaligned move instructions (movdqu and friends) are just as fast as the alignment-required versions when the address is aligned. So you don't need a separate version of the loop for the special case when src and dest have the same (mis)alignment, and the loads and stores can both be aligned. IIRC, this is true for Intel Nehalem and newer CPUs, and for recent AMD.
// not shown: check that count >= 16
endp = dest + count;
unaligned_copy_16B( dest, src );  // load with movdqu, store with movdqu
// src += 16; dest += 16;         // combine this with aligning dest, below

dest_misalign = dest & 0xf;       // number of bytes the first aligned store will overlap
src  += 16 - dest_misalign;       // src is potentially still misaligned
dest += 16 - dest_misalign;       // dest is now aligned

for ( ; dest <= endp-16 ; src += 16, dest += 16) {
    tmpvec = unaligned_load_16B( src );  // x86 movdqu is fast if src happens to be aligned
    aligned_store_16B( dest, tmpvec );   // x86 movdqa
}
// handle the last few bytes, from dest to endp, similarly.
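A runnable version of that sketch with SSE2 intrinsics, including an overlapping final copy for the tail. copy_align_dest is a made-up name; assumes count >= 16 and non-overlapping buffers:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void copy_align_dest(char *dest, const char *src, size_t count)
{
    char *endp = dest + count;

    // unaligned first 16B: movdqu load + movdqu store
    _mm_storeu_si128((__m128i *)dest, _mm_loadu_si128((const __m128i *)src));

    size_t adjust = 16 - ((uintptr_t)dest & 0xf);  // 1..16 bytes to reach alignment
    src  += adjust;            // src may still be misaligned
    dest += adjust;            // dest is now 16B-aligned

    for ( ; dest <= endp - 16; src += 16, dest += 16) {
        __m128i tmp = _mm_loadu_si128((const __m128i *)src);  // movdqu: fast if src is aligned
        _mm_store_si128((__m128i *)dest, tmp);                // movdqa
    }

    // final (possibly overlapping) unaligned 16B copy, ending exactly at endp
    const char *src_end = src + (endp - dest);     // src byte corresponding to endp
    _mm_storeu_si128((__m128i *)(endp - 16),
                     _mm_loadu_si128((const __m128i *)(src_end - 16)));
}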
An aligned dest is probably more likely than an aligned source. No overlapping repeated work happens when the pointer we align is already aligned.
If you aren't doing memcpy, it can be an advantage to have src aligned so the load can fold into another instruction as a memory operand. This saves an instruction, and in many cases also saves an Intel uop internally.
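For example, with legacy-SSE code generation a compiler can fold an aligned load into the ALU instruction's memory operand, but an unaligned load has to stay a separate movdqu. (With AVX's VEX encoding, unaligned memory operands are allowed, so the distinction mostly disappears.)

#include <emmintrin.h>

// If src is known 16B-aligned, this can compile to something like
//     paddd xmm0, [rdi]
__m128i add_chunk_aligned(__m128i acc, const __m128i *src)
{
    return _mm_add_epi32(acc, _mm_load_si128(src));
}

// With a possibly-unaligned src, legacy SSE needs a separate movdqu first.
__m128i add_chunk_unaligned(__m128i acc, const void *src)
{
    return _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)src));
}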
For the case where src and dest have different alignments, I haven't tested whether it's faster to do aligned loads and unaligned stores, or the other way around. I picked aligned stores because of potential store->load forwarding benefits for short buffers. If the dest buffer is aligned, and only a couple vectors long, and will be read again right away, then aligned loads from dest will stall for ~10 cycles (Intel SnB) if the load crosses a boundary between two preceding stores that haven't made it to L1 cache yet. (i.e. store forwarding fails). See http://agner.org/optimize/ for info on low-level details like this (esp. the microarch guide.)
Store forwarding from memcpy to loads in the next loop is only going to happen if the buffers are small (maybe up to 64B?), or if your next loop starts reading from the end of the buffer (which will still be in cache even if the beginning has already been evicted). Otherwise, the stores to the start of the buffer will have made it from a store buffer to L1, so store-forwarding won't come into play.
It's possible that for large buffers with different alignments, aligned loads and unaligned stores will do better. I'm just making stuff up here, but this could be true if unaligned stores can retire quickly even if they cross a cache-line or page boundary. Of course unaligned loads can't retire until the data is actually loaded. With more load/store instructions in flight, there's less chance of a cache miss stalling things. (You're potentially taking advantage of more of the CPU's load/store buffers.) Again, pure speculation. I tried to google whether unaligned stores were better or worse than unaligned loads, but just got hits about how to do them, and about misalignment penalties that apply to both.