Branchless internal merge slower than internal merge with branch
I recently asked a question on Code Review to review a sorting algorithm named QuickMergeSort. I won't go into the details, but at some point the algorithm performs an internal mergesort: instead of using additional memory to store the data to merge, it swaps the elements to merge with elements from another part of the original sequence, which isn't otherwise involved in the merge. Here is the part of the algorithm I'm concerned with: the function that performs the merging:

template<
    typename InputIterator1,
    typename InputIterator2,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
                        InputIterator2 first2, InputIterator2 last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    for (; first1 != last1; ++result) {
        if (first2 == last2) {
            std::swap_ranges(first1, last1, result);
            return;
        }

        if (compare(*first2, *first1)) {
            std::iter_swap(result, first2);
            ++first2;
        } else {
            std::iter_swap(result, first1);
            ++first1;
        }
    }
    // first2 through last2 are already in the right spot
}

That function was adapted from the eponymous function in the libc++ implementation of std::inplace_merge; this new version swaps elements with another part of the original array instead of moving elements from the auxiliary array.
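
To make the "internal" aspect concrete, here is a minimal usage sketch (illustrative only, not the actual call made by QuickMergeSort): the left run is first swapped into a scratch area of the same vector, and half_inplace_merge then swaps everything back into the destination. The layout and values are made up for illustration; it assumes the half_inplace_merge shown above is in scope.

#include <algorithm>
#include <cassert>
#include <vector>

int main()
{
    //                   scratch     run 1      run 2
    std::vector<int> v = { 9, 7, 8,   1, 4, 6,   2, 3, 5 };

    // Swap run 1 out of the destination and into the scratch area.
    std::swap_ranges(v.begin() + 3, v.begin() + 6, v.begin());
    // v is now { 1, 4, 6,   9, 7, 8,   2, 3, 5 }

    // Merge run 1 (now in the scratch area) with run 2, writing into v[3..9).
    half_inplace_merge(v.begin(), v.begin() + 3,   // run 1
                       v.begin() + 6, v.end(),     // run 2
                       v.begin() + 3);             // destination

    // v[3..9) is now sorted; the scratch area holds the displaced elements.
    assert(std::is_sorted(v.begin() + 3, v.end()));
}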

Since the merge is internal, I realized that I didn't actually need two separate input types: InputIterator1 and InputIterator2 are always the same. Then I came to realize that, since the operations on first1 and first2 were always the same, I could store them in a two-element array and use the result of the comparison to index the array to know which iterator to swap and to increment. With that small trick, I got rid of the branch and obtained a mostly branchless merge algorithm:

template<
    typename InputIterator,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator first1, InputIterator last1,
                        InputIterator first2, InputIterator last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    InputIterator store[] = { first1, first2 };

    for (; store[0] != last1; ++result) {
        if (store[1] == last2) {
            std::swap_ranges(store[0], last1, result);
            return;
        }

        bool cmp = compare(*store[1], *store[0]);
        std::iter_swap(result, store[cmp]);
        ++store[cmp];
    }
    // first2 through last2 are already in the right spot
}

Now, the thing is: with this new half_inplace_merge function, the overall sorting algorithm is 1.5 times slower than with the original half_inplace_merge, and I've got no idea why. I've tried several compiler optimization levels, several tricks to avoid potential aliasing problems, but it seems that the problem comes from the branchless trick itself.

So, is anybody able to explain why the branchless code is slower?


Addendum: for those who want to run the same benchmark as I did... well, it will be a bit difficult: I used the benchmarks from a personal library, which include many things; you'll need to download the library, to add this file somewhere, and to run this benchmark after having added the required line to invoke quick_merge_sort near the highlighted section (you will need to redirect the standard output of the program to a file in a profiles subdirectory). Then you'll need to run this Python script to see the results, adding quick_merge_sort to the highlighted line. Note that NumPy and matplotlib need to be installed.

Preemie answered 13/12, 2016 at 19:53 Comment(25)
Does this happen on all compilers? (I guess you checked that, but I just wanted to do a small sanity check.)Alula
@Alula I only checked the speed with MinGW GCC6, but I did try to look at the assembly (for std::vector<int>::iterator) with the Compiler Explorer, and it looked almost the same with both GCC and Clang.Preemie
can you post the complete benchmark?Roughandtumble
The more I think about it, the more I suspect the dereferencing required to access either array element is the problem. In the original code, the compiler knows what iterator is being accessed for each case, where in the second case the memory access can't be optimized.Feminism
@RichardHodges It's gonna be a bit difficult: I used the benchmarks from my own sorting library, wrapping the sorting algorithm posted in the Code Review question in a sorter object (described in the library documentation). I don't feel at ease posting a few hundred lines of code here.Preemie
Looking at the assembly output, I see a lot more complex indirect addressing modes on the second version: godbolt.org/g/yjW1Ks - and no fewer branches.Roughandtumble
@RichardHodges Also note how the result is waaaaay different for some reason with GCC 7's snapshot with -O2.Preemie
I think the moral of the story is to write expressive code and allow the optimiser to... optimise :)Roughandtumble
@RichardHodges But... but sexy branchless tricks and performance D:Preemie
@Preemie the compiler writers got there first and saw you coming. If you want to make a difference, work on improving the optimisers. Then you can make everyone's code faster at the same time.Roughandtumble
To summarize my comments here: You might be pushing the "prediction" problem into the load-store unit instead of the branch predictor. Because of the randomness of the addresses, the memory disambiguator isn't able to correctly predict the dependencies between them - thereby getting you the same penalties as mispredicted branches. Unfortunately, I have no way to test this theory. So I'm leaving it as a comment.Buehrer
@Mystical I am in the same boat. Plenty of theory, not so much time to show the proof.Feminism
Could you please put up a pastebin link with a runnable version of your code? I'd be able to get you the performance counter values for your code.Gazpacho
What CPU architecture was used for the timing tests?Neldanelia
@Neldanelia I used an x86_64.Preemie
@IwillnotexistIdonotexist I described what needs to be done in order to run the benchmarks as I ran them... but it's unfortunately not trivial.Preemie
@Preemie With all due respect that's a bit short of what I had hoped for. You shouldn't make your potential helpers jump through hoops ("put this file somewhere"? But where? Built with what flags, on what CPU? Not specified.) In any case I did manage to make your code build with GCC 6.1.1 -O3, and now I don't know which distribution(s) was (were) problematic. Is it shuffled? Because both branchy and non-branchy perform equally on my machine. We need to be able to reproduce your problem reliably to diagnose it for you.Gazpacho
@IwillnotexistIdonotexist To be honest, I thought that the reduced example from the question isolated the problem and was sufficient, I never thought anyone would jump into the whole library benchmark stuff. The 1.5 slowdown was clearly apparent for the « shuffled » distribution, but also for a few other ones (I don't remember exactly which). The working compilers are in the library's README, and the library is clearly sold as a C++14 library. As mentioned in the question, it was always that much slower, no matter the optimization flags, so mentioning them didn't feel worth it.Preemie
I used both -O1/2/3 with -march=native. The compiler and CPU are mentioned here in the comments. Sure I didn't aggregate the information and it's a bit scattered, but I really didn't think that anyone would jump into the whole thing, sorry for that.Preemie
@DouglasDaseeco I was indeed looking for a theoretical answer like yours (by the way, I'm waiting for the bounty just in case before accepting anything, to make sure that I don't influence anything). Your assumptions were correct, and the fix you're suggesting is the one I intended to go with since it makes things a bit simpler/clearer without incurring obvious performance problems (at first, I feared that there might be aliasing problems with the same type everywhere, but it seems fine) ^_^Preemie
Well, your unexpected pessimization did turn on a CPU esoteric called branch misprediction, so great detail is advisable. When I was asking which CPU, I meant not "x86-64" but "K8/K10/Piledriver/Excavator/Zen" (AMD) or "KBL/SKL/BRW/HSW/IVB/SNB/WST/NHM/Core2/..." (Intel) and maybe even cache geometry. At any rate, on my i7-4700MQ Linux machine with GCC 6.1.1 and -O3 and over all distributions the branchless code runs 2% slower, issues 28% fewer instructions, makes 72% fewer branch mispredicts but suffers 3.37x more resource stalls, including 10x the reservation station stalls. @PeterCordes?Gazpacho
@IwillnotexistIdonotexist Hum, unfortunately, I mostly don't know shit about CPUs. My computer information reads « Intel Core i7-4710MQ », but I'll probably be unable to provide any more info :/Preemie
@Preemie Glorious! The only difference between my CPU and yours is 100MHz!Gazpacho
Mysticial's theory is interesting but I don't think that's the problem. Rather, I suspect the problem is even simpler than that. In the branchy code, 50% of the time the CPU predicts correctly the comparison before it happens and executes speculatively the right codepath, and along that codepath all load/store addresses are "known". In the branchless code, 100% of the time the address of the swap target is not known until the comparison result is computed, and only then can the CPU proceed. The CPU is capable of speculating the direction of a branch but not the value of a pointer.Gazpacho
I've been following to see if discussion about branch prediction optimism and pessimism was brought up again. Previous comments about it were deleted for some reason. I was going to respond with the question, "How can branch prediction improve speed when the two legs are of equal length? How can folding two like branches into one facilitate speed optimization in either compiler or CPU?" The comment given by @DouglasDaseeco addresses this from another angle.Fancywork
Such a large difference is the product of two conditions.

The first condition is related to the original code. The in-place merge is so efficient that it would be difficult to devise anything significantly faster, even coding manually at the assembly language level. The application of generics is straightforward, so the compiler ** produced the same assembly with or without it. Because the algorithm implementation is efficient, adding only a few machine instructions to the loop is enough to produce the significant proportional change indicated in the question.

** Compilation specifics throughout this answer were obtained with g++ 6.2.1 20160916, the default Fedora 24 dnf package, along with Linux kernel 4.8.8-200.fc24.x86_64. The runtime platform was an Intel i7-2600 with 8 MB cache. Compilation was also done for an Atmel SAM3X8E ARM Cortex-M3 with arm-none-eabi-g++ 4.8.3-2014q1.

The second condition is related to the compilation of the second trick described in paragraph 3 sentence 2 of the question. The first trick, the reduction of types in the template, did not produce any significant change in the assembly language. The second trick produced performance-affecting assembly-level differences in the compiler output for the two calls.

This preprocessor hack can ease testing.

#ifdef ORIG
#define half_inplace_merge half_inplace_merge_orig
#else // ORIG
#define half_inplace_merge half_inplace_merge_slow
#endif // ORIG
...
half_inplace_merge(niInA.begin(), niInA.end(),
        niInB.begin(), niInB.end(),
        niOut.begin(), compare);

Compiling and comparing with these commands in a bash shell exploits the preprocessor hack.

g++ -DORIG -S -fverbose-asm -o /tmp/qq.orig.s /tmp/qq.cpp
g++ -DSLOW -S -fverbose-asm -o /tmp/qq.slow.s /tmp/qq.cpp
araxis.sh /tmp/qq.orig.s /tmp/qq.slow.s  # to run Araxis Merge in Wine
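
For reference, below is a minimal, self-contained sketch of what such a /tmp/qq.cpp could look like (an assumption for illustration; the actual test file is not shown here). The two function bodies are the ones from the question, renamed to match the hack above; the element count, value range, and the use of std::vector<int> are arbitrary choices.

#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

template<typename InputIterator, typename OutputIterator,
         typename Compare = std::less<>>
void half_inplace_merge_orig(InputIterator first1, InputIterator last1,
                             InputIterator first2, InputIterator last2,
                             OutputIterator result, Compare compare = {})
{
    for (; first1 != last1; ++result) {
        if (first2 == last2) { std::swap_ranges(first1, last1, result); return; }
        if (compare(*first2, *first1)) { std::iter_swap(result, first2); ++first2; }
        else                           { std::iter_swap(result, first1); ++first1; }
    }
}

template<typename InputIterator, typename OutputIterator,
         typename Compare = std::less<>>
void half_inplace_merge_slow(InputIterator first1, InputIterator last1,
                             InputIterator first2, InputIterator last2,
                             OutputIterator result, Compare compare = {})
{
    InputIterator store[] = { first1, first2 };
    for (; store[0] != last1; ++result) {
        if (store[1] == last2) { std::swap_ranges(store[0], last1, result); return; }
        bool cmp = compare(*store[1], *store[0]);
        std::iter_swap(result, store[cmp]);
        ++store[cmp];
    }
}

#ifdef ORIG
#define half_inplace_merge half_inplace_merge_orig
#else // ORIG
#define half_inplace_merge half_inplace_merge_slow
#endif // ORIG

int main()
{
    std::vector<int> niInA(1000), niInB(1000), niOut(2000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 9999);
    for (auto& value : niInA) value = dist(gen);
    for (auto& value : niInB) value = dist(gen);
    std::sort(niInA.begin(), niInA.end());
    std::sort(niInB.begin(), niInB.end());

    auto compare = std::less<>{};
    half_inplace_merge(niInA.begin(), niInA.end(),
            niInB.begin(), niInB.end(),
            niOut.begin(), compare);

    // Return something derived from the output so the merge is not optimized away.
    return std::accumulate(niOut.begin(), niOut.end(), 0) & 1;
}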

These instructions are a result of initializing the InputIterator store[ ], but that is outside the loop.

leaq    -48(%rbp), %rax #, _4
movq    -64(%rbp), %rdx # first1, tmp104
movq    %rdx, (%rax)    # tmp104, *_5
leaq    8(%rax), %rdx   #, _9
movq    -96(%rbp), %rax # first2, tmp105
movq    %rax, (%rdx)    # tmp105, *_9

The primary slowdown comes from dereferencing the two items contained in store[ ], as needed by the compare and swap, and that is within the loop. These instructions do not exist in the version without the second trick.

movb    %al, -17(%rbp)  # _27, cmp
movzbl  -17(%rbp), %eax # cmp, _29
cltq
...
movzbl  -17(%rbp), %edx # cmp, _31
leaq    -48(%rbp), %rax #, tmp121
movslq  %edx, %rdx  # _31, tmp122
salq    $3, %rdx    #, tmp123
addq    %rdx, %rax  # tmp123, _32

Although there is duplication of code in the bodies of the conditional for the version without the trick, that only impacts the compactness of the code, adding two calls, five moves, and one compare instruction. The number of CPU cycles required to perform the in-place merge is the same between branches resulting from the compare, and both lack the instructions listed above.

For each of the several syntax permutations tried, removing the redundancy in the branches to improve compactness inescapably led to additional instructions along the execution path.

The details of the instruction sequences for the various permutations discussed thus far will vary with the compiler, the optimization options selected, and even the conditions under which the functions are called.

It is theoretically possible for a compiler to employ an AST (abstract syntax tree) refactoring rule (or the equivalent) to detect and reduce both program memory and CPU cycle requirements for either version of the function. Such rules have antecedents (search patterns) that match the pattern to be optimized within the code.

Optimizing speed for the code with the second trick would require a rule antecedent that matches the atypical store[ ] abstraction both inside and outside of the loop. Detecting the branch redundancy without the second trick is a more reasonable goal.

Integrating the two statements within each branch, one can see how the two like patterns in the AST might be simple enough for a refactoring rule antecedent to match and perform the desired code size reduction. There would be very little gain in speed for this case, if any.

if (compare(*first2, *first1)) {
    std::iter_swap(result, first2 ++);
} else {
    std::iter_swap(result, first1 ++);
}
Fancywork answered 16/12, 2016 at 0:50 Comment(1)
Agree, Douglas Daseeco. The space optimization is often the enemy of speed optimization.Fancywork
The following is just a short intuitive explanation:

If we abstract everything away and assume that the iterators are normal pointers, then in the first example we can keep all of the iterators in registers.

In the branchless code we cannot easily do that, because of store[cmp] and ++store[cmp], and that implies an overhead for every use of store[0] and store[1].

Thus (in this case) it is more important to maximize the use of registers than to avoid branches.
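
As an illustration of that point, here is a sketch (not from the original post) of a variant that avoids the in-memory store[ ] array: selecting the iterator through a reference chosen with the conditional operator keeps first1 and first2 as ordinary local variables that the compiler can hold in registers, and the select itself may compile to a conditional move. Whether this is actually faster would still need to be measured. Assumes <algorithm> (iter_swap, swap_ranges) and <functional> (std::less<>) are included.

template<
    typename InputIterator,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator first1, InputIterator last1,
                        InputIterator first2, InputIterator last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    for (; first1 != last1; ++result) {
        if (first2 == last2) {
            std::swap_ranges(first1, last1, result);
            return;
        }

        // Select the iterator without indexing an in-memory array: both
        // operands are lvalues, so the conditional expression is an lvalue
        // and `it` refers to either first1 or first2 directly.
        InputIterator& it = compare(*first2, *first1) ? first2 : first1;
        std::iter_swap(result, it);
        ++it;
    }
    // first2 through last2 are already in the right spot
}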

Brynn answered 20/12, 2016 at 16:52 Comment(1)
Yes @DouglasDaseeco ... your answer and the comment at the end of it masterfully address the source of the slowdown and the previous misconceptions about what might slow it down.Fancywork
