I recently asked a question on Code Review about a sorting algorithm named QuickMergeSort. I won't go into the details, but at some point the algorithm performs an internal mergesort: instead of using additional memory to store the data to merge, it swaps the elements to merge with elements from another part of the original sequence, which isn't otherwise involved in the merge. Here is the part of the algorithm I'm concerned with: the function that performs the merging:
template<
    typename InputIterator1,
    typename InputIterator2,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
                        InputIterator2 first2, InputIterator2 last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    for (; first1 != last1; ++result) {
        if (first2 == last2) {
            std::swap_ranges(first1, last1, result);
            return;
        }
        if (compare(*first2, *first1)) {
            std::iter_swap(result, first2);
            ++first2;
        } else {
            std::iter_swap(result, first1);
            ++first1;
        }
    }
    // first2 through last2 are already in the right spot
}
That function was adapted from the eponymous function in the libc++ implementation of std::inplace_merge; this new version swaps elements with another part of the original array instead of moving elements from an auxiliary array.
Since the merge is internal, I realized that I didn't actually need two separate input types: InputIterator1 and InputIterator2 are always the same. Then I came to realize that, since the operations on first1 and first2 were always the same, I could store them in a two-element array and use the result of the comparison to index that array, selecting which iterator to swap and which to increment. With that small trick, I got rid of the branch and obtained a mostly branchless merge algorithm:
template<
    typename InputIterator,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator first1, InputIterator last1,
                        InputIterator first2, InputIterator last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    InputIterator store[] = { first1, first2 };
    for (; store[0] != last1; ++result) {
        if (store[1] == last2) {
            std::swap_ranges(store[0], last1, result);
            return;
        }
        bool cmp = compare(*store[1], *store[0]);
        std::iter_swap(result, store[cmp]);
        ++store[cmp];
    }
    // first2 through last2 are already in the right spot
}
Now, the thing is: with this new half_inplace_merge function, the overall sorting algorithm is 1.5 times slower than with the original half_inplace_merge, and I have no idea why. I've tried several compiler optimization levels and several tricks to avoid potential aliasing problems, but it seems that the problem comes from the branchless trick itself.
So, is anybody able to explain why the branchless code is slower?
Addendum: for those who want to run the same benchmark as I did... well, it will be a bit difficult: I used the benchmarks from a personal library, which include many things; you'll need to download the library, add this file somewhere, and run this benchmark after having added the required line to invoke quick_merge_sort near the highlighted section (you will need to redirect the standard output of the program to a file in a profiles subdirectory). Then you'll need to run this Python script to see the results, adding quick_merge_sort to the highlighted line. Note that NumPy and matplotlib need to be installed.
Comments:

std::vector<int>::iterator) with the Compiler Explorer, and it looked almost the same with both GCC and Clang. – Preemie

-O3, and now I don't know which distribution(s) was (were) problematic. Is it shuffled? Because both branchy and non-branchy perform equally on my machine. We need to be able to reproduce your problem reliably to diagnose it for you. – Gazpacho

-O1/2/3 with -march=native. The compiler and CPU are mentioned here in the comments. Sure, I didn't aggregate the information and it's a bit scattered, but I really didn't think that anyone would jump into the whole thing, sorry for that. – Preemie

-O3, and over all distributions the branchless code runs 2% slower, issues 28% fewer instructions, makes 72% fewer branch mispredicts, but suffers 3.37x more resource stalls, including 10x the reservation station stalls. @PeterCordes? – Gazpacho