why is c++ std::max_element so slow?
I need to find the max element in a vector, so I'm using std::max_element, but I've found that it's a very slow function. I wrote my own version and managed to get 3x better performance. Here is the code:

#include <string>
#include <iostream>
#include <vector>
#include <algorithm>

#include <sys/time.h>

double getRealTime()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}

inline int my_max_element(const std::vector<int> &vec, int size)
{
    auto it = vec.begin();
    int max = *it++;
    for (; it != vec.end(); it++)
    {
        if (*it > max)
        {
            max = *it;
        }
    }
    return max;
}

int main()
{
    const int size = 1 << 20;
    std::vector<int> vec;
    for (int i = 0; i < size; i++)
    {
        if (i == 59)
        {
            vec.push_back(1000000012);
        }
        else
        {
            vec.push_back(i);
        }
    }

    double startTime = getRealTime();
    int maxIter = *std::max_element(vec.begin(), vec.end());
    double stopTime = getRealTime();
    double totalIteratorTime = stopTime - startTime;

    startTime = getRealTime();
    int maxArray = my_max_element(vec, size);
    stopTime = getRealTime();
    double totalArrayTime = stopTime - startTime;

    std::cout << "MaxIter = " << maxIter << std::endl;
    std::cout << "MaxArray = " << maxArray << std::endl;
    std::cout << "Total CPU time iterator = " << totalIteratorTime << std::endl;
    std::cout << "Total CPU time array = " << totalArrayTime << std::endl;
    std::cout << "iter/array ratio: = " << totalIteratorTime / totalArrayTime << std::endl;
    return 0;
}

Output:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.000989199
Total CPU time array = 0.000293016
iter/array ratio: = 3.37592

On average std::max_element takes 3x more time than my_max_element. So why was I able to write a much faster version of a std function so easily? Should I stop using std and write my own functions, since std is so slow?

Note: at first I thought it was because I'm using an integer i in a for loop instead of an iterator, but that seems not to matter now.

Compiling info:

g++ (GCC) 4.8.2

g++ -O3 -Wall -c -fmessage-length=0 -std=c++0x

Neil answered 2/9, 2014 at 11:16 Comment(21)
what compiler and stl are you using?Rewarding
Are you compiling with optimizations?Deutzia
How did you compile your code? Did you enable optimizations (e.g. by compiling with g++ -Wall -O2 if using GCC...)?Thaddeusthaddus
I added all the info, @Alex where do I get stl version?Neil
Have you tried reversing the order of calls? Caching might be at work here.Conrad
this is what I got. std::max_element is so slow that even my_max_element is 70% fasterDamages
@Angew yes, it went down to 2.5.. good but not enough.Neil
@Vladp not much difference on my pc, almost the same output!Li
Most of the difference seems to come from vectorization (only one version is vectorized). File a PR in gcc's bugzilla?Dialogize
And remove -pthread makes them closerLi
Try adding -DNDEBUG to your command line. Maybe there are assertions inside the standard library.Conrad
my_max_element breaks on empty vectors whereas std::max_element is required to detect and handle that casePeterkin
@MattMcNabb, the cost of such a check is absolutely negligible in this case.Battleplane
Here's a result with NDEBUG and my_max_element before std::max_element. iter/array ratio: = 0.995192 which is a very marginal win for the standard library.Forage
With g++ 4.8 and -O2 the difference disappears (and both time align to the time of std::max_element at -O3); it's the optimizer that is pulling some strange tricks.Incisor
This might be a good explanation of why this is happeningExcrescent
Does this happen if you use an array with random elements?Ploughshare
Your timing code implements catastrophic cancellation, which makes the whole question pointless. Please do proper handling of time values (and do them on a scale where they are meaningful, not within the scheduler's variation). That is, never do any such thing as subtract two nearly identical float/double values, and never use double for time anyway. Also, do benchmarks which run at least 2-3 seconds (better 10 or 20), not a few microseconds. A single interrupt or such can easily cause what you're seeing.Chloroplast
@Damon: Not really. Run the benchmark x times and the interrupt deviation will be negligible. Also, doubles have 52 bits of precision, which makes them well suited to store the 32 bits of the tv_usec measurement time.Nigeria
@xryl669: Not at all. Floating point works with powers of two, and 1.0e-6 is not a power of two, so multiplying by it is not great to begin with as it loses precision. Also, subtracting two almost identical numbers (differing only at the 5th digit) loses precision. The measurements are therefore total bogus. There are a lot of articles on the internet about this. Every measured difference may be real, or random, or rounding / catastrophic cancellation, or scheduling jitter, or something different. You can't tell. Not having accurate measurements forfeits everything else before you even begin.Chloroplast
@Angew, libstdc++ doesn't use assert anywhere, NDEBUG won't affect itSansculotte

Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).


There seems to be an optimizer bug/artifact. Compare the times of:

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;

  while(++__first != __last)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;
  ++__first;

  for(; __first != __last; ++__first)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.


Following Maxim's proposal of changing the vector from int to int64_t, the changed version is only 1.7 (not 4) times faster than the original version (g++ 4.8.2).


The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

w/o commoning     with commoning
*                 *
**                 *
 **                 *
  **                 *
  * *                 *
  *  *                 *
  *   *                 *

Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

With the while loop (2.88743 ms; gist):

    movq    %rdi, %rax
    jmp .L49
.L51:
    movl    (%rdi), %edx
    cmpl    %edx, (%rax)
    cmovl   %rdi, %rax
.L49:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    jne .L51

With the for loop (1235.55 μs):

    leaq    4(%rdi), %rdx
    movq    %rdi, %rax
    cmpq    %rsi, %rdx
    je  .L53
    movl    (%rdi), %ecx
.L54:
    movl    (%rdx), %r8d
    cmpl    %r8d, %ecx
    cmovl   %rdx, %rax
    cmovl   %r8d, %ecx
    addq    $4, %rdx
    cmpq    %rdx, %rsi
    jne .L54
.L53:

If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

    movl    (%rdi), %ecx
    movq    %rdi, %rax
.L57:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    je  .L60
.L59:
    movl    (%rdi), %edx
    cmpl    %edx, %ecx
    jge .L57
    movq    %rdi, %rax
    addq    $4, %rdi
    movl    %edx, %ecx
    cmpq    %rsi, %rdi
    jne .L59
.L60:

The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation, since the update they perform is needed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data, e.g. [1, 0, 3, 2, 5, 4, ...].

Salvo answered 2/9, 2014 at 11:16 Comment(14)
If you can run this on a different machine / g++ version, please comment/add your results.Salvo
I used 1000*1000*1000 elements for my tests, but I don't quite trust the OP's measurement code. Each test has been conducted several times, and in normal + reversed order.Salvo
I get no such difference with g++ 4.8.2. I'll post my results in a moment.Caponize
@R.MartinhoFernandes Interesting. I haven't used nonius before; I downloaded the git repo, included nonius/main and used the nonius/nonius.h++ header. Mean is 54.12 vs 34.06 ns for a vector size of 1000*1000*1000; for both the "variance is severely inflated by outliers".Salvo
"variance is severely inflated by outliers" is usually a sign that the results might be skewed by external factors (but now I suspect I have a bug in that logic, since "found 0 outliers [...] variance is severely inflated by outliers" does not make sense ;) But that's off-topic. The point is that I can't reproduce a difference like the one you experienced. I think we need to look at the generated assembly.Caponize
@R.MartinhoFernandes "I think we need to look at the generated assembly" Agreed. I did some additional tests, since you slightly changed the function template, and got a different means. I also get a third set of different values when using a local vs namespace-scope lambda predicate. -- It's a community wiki on purpose, feel free to add some assembly (I haven't much experience with assembly, let alone optimized programs or nonius). If it's indeed an optimizer problem, it might disappear when embedding in nonius, though.Salvo
I'll look into it some more later and add whatever I find. Sorta busy now :)Caponize
@R.MartinhoFernandes Off-topic: "Harward information" (looks like a typo)Salvo
Hint to localize the difference, already mentioned in the comments: -fno-tree-vectorize. Please don't forget to report it to gcc's bugzilla at some point.Dialogize
gcc-4.9.1, Intel Core i5: mean 2.88743 ms for while, 1235.55 μs for for. I'm using function pointers to prevent inlining and volatile to prevent no-op elimination; this results in test bodies that have identical asm apart from the precise call to the max_element function.Breger
@Salvo if you're seeing ns I think the loop is being no-opped. Try using volatile to prevent this: gist.github.com/ecatmur/8c0b50dc4e30e1206a98#file-max_element-cBreger
@R.MartinhoFernandes I tried your code and the calls got no-opped; I had to add volatile: gist.github.com/ecatmur/8c0b50dc4e30e1206a98#file-max_element-cBreger
@Breger Aha! I tried your code for 1000*1000*100 elements, the results are 240 ms / 100 ms / 30 ms with g++4.8.2, Intel Xeon E3; with the original 1 << 20 elements, the results are 2.5 ms / 1 ms / 0.3 ms Thanks for your contributions.Salvo
Using GCC 9.2.0 shows no issue with either my_max_element_orig() or my_max_element_changed() (from the accepted answer) - both show the same results. However, I found a significant difference if I use pointers vs iterators - using iterators is about 2 times worse than raw pointers. See my question #58811388Serration

You are probably running your test in 64-bit mode, where sizeof(int) == 4 but sizeof(std::vector<>::iterator) == 8, so the assignment in the loop to an int (which is what my_max_element does) is faster than to a std::vector<>::iterator (which is what std::max_element does).

If you change std::vector<int> to std::vector<long>, the results change in favour of std::max_element:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.00429082
Total CPU time array = 0.00572205
iter/array ratio: = 0.749875

One important note: when benchmarking disable the CPU frequency scaling, so that the CPU does not switch gears in the middle of the benchmark.


But I think something else is at play here, since just changing the loop variable from int to long does not change the results...

Vesta answered 2/9, 2014 at 11:50 Comment(3)
Are you suggesting that 64 bit aligned writes are slower than 32 bit aligned writes on a natively 64 bit architecture? (Heck, not to mention register writes.)Battleplane
You are right, even without changing for to while, long is much faster with std.Neil
I rewrote my_max_element to use iterators internally as max_element do and it doesn't change much (coliru.stacked-crooked.com/a/367efade62603da0).Incisor

It's a simple caching issue. To wit, the first time you load memory (in this case the contents of the vector) it's always considerably slower than if it has been recently accessed. I copied and pasted your code with GCC 4.9.

When the functions are reversed, the ratio is 1. When they're in the original order, the ratio is 1.6.

This still seems like a fundamental misoptimization by GCC in the case of max_element to me. However, your function times are so low, they will be dominated by CPU noise like the above cache effects, instead of any meaningful comparison.


Etsukoetta answered 2/9, 2014 at 11:50 Comment(3)
Actually, with g++ 4.8 and -O3 running the functions in a loop ((with no changes in the original array) I get the same results as OP (std::max_element three times slower); the results hold even if I swap the order of execution of the two functions.Incisor
@Matteo: G++ 4.9 simply doesn't seem to quite reproduce the misoptimization here.Etsukoetta
Disable your CPU frequency scaling.Vesta

I performed a test for max_element for my specific usecase and did several implementations using intrinsics which can be found here: https://github.com/XapaJIaMnu/maxelem_test

My implementations improve upon vanilla std::max_element by a factor of 2 at least, provided your data is not sorted (or almost sorted) in ascending order.

These are test results on arrays of random size of up to 2000 elements, containing floats generated from a uniform distribution of `[5:5]`, run 100000 times.

                                GCC 11.2   clang 14   icc 2022.1.0
std::max_element                 2.6696s    0.4221s    0.4662s
sequential                       1.0831s    1.1924s    1.1472s
AVX512 max + max_reduce          0.2412s    0.2152s    0.2142s
AVX512 max_reduce only           0.2570s    0.2629s    0.2325s
AVX512 cmp_ps_mask               0.1884s    0.1826s    0.1833s
AVX512 ^ + vectorized overhang   0.2097s    0.2089s    0.2072s
AVX cmp_ps + movemask            0.2181s    0.1697s    0.1702s
SSE cmplt_psp + movemask         0.2692s    0.2051s    0.2221s

I'm too lazy to explore assembly at the moment for more explainable results.

Undercarriage answered 14/8, 2022 at 13:46 Comment(0)