Is there any efficiency advantage in using minmax_element over min_element and max_element together?

std::minmax_element : returns a pair consisting of an iterator to the smallest element as the first element and an iterator to the greatest element as the second.

std::min_element : returns an iterator to the smallest element in the range [first, last).

std::max_element : returns an iterator to the largest element in the range [first, last).


Does std::minmax_element use sorting of the complete list to achieve this?

Is the overhead of processing the returned pair from std::minmax_element worth it?

Cralg answered 27/10, 2016 at 11:37 Comment(0)

You do not have to worry about std::minmax_element doing any sorting; it leaves the range exactly as it found it. The reason it is more efficient is that it can find both the max and the min in a single pass, whereas looking for the max and the min separately requires two full traversals.

std::minmax_element performs at most max(floor(3/2(N−1)), 0) comparisons, whereas std::max_element and std::min_element each perform max(N−1, 0), so using std::minmax_element means about 25% fewer comparisons: roughly 1.5N versus 2N in total.

There is also a behavioural difference: std::minmax_element finds the last largest element, while std::max_element finds the first largest.

So if you need both the min and the max of a range, you should use std::minmax_element. If you only need the min or the max, then you should use the specialized version. Dealing with the return value of std::minmax_element will get even easier with the upcoming C++17 standard and structured bindings. You will be able to write

auto [min, max] = std::minmax_element(...);

and now the first element of the pair is stored in min and the second is stored in max.
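
For example, here is a minimal C++17 sketch (the sample values are made up for illustration; note that min and max here are iterators, so they must be dereferenced):

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v { 3, 9, 1, 4, 1, 5, 9, 2, 6 };
    auto [min, max] = std::minmax_element(v.begin(), v.end());
    std::cout << "min = " << *min << ", max = " << *max << '\n';
    // Prints "min = 1, max = 9"; min points at the first 1, while max
    // points at the second 9, since minmax_element finds the last largest.
}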

Point answered 27/10, 2016 at 11:40 Comment(6)
Ahem: 25% less operations. Roughly 1½N vs 2N.Duodenitis
@MartinBonner Wow. I messed that one up. ThanksPoint
One thing leaves me confused. I can see how it can require fewer comparisons in many cases, e.g. if the current element is larger than max_so_far, then we don't need to compare against min_so_far. This is because it cannot be less than min_so_far if it is larger than max_so_far. Therefore, one comparison saved. But I don't understand how it can be as little as 1.5N. (Does it compare each element to the previous element first, before comparing to min_so_far or max_so_far?)Millimeter
@AaronMcDaid If you would like to see a possible implementation see thisPoint
@AaronMcDaid Also see davmac answerPoint
@AaronMcDaid: The idea is a tournament: if m and M are the current min and max (m < M), and a, b is the next pair with a < b, then the new min is min(m, a) and the new max is max(M, b). So new elements are compared pair by pair.Overslaugh

The other answers are good. However, I wanted to add a little about how minmax_element necessarily works, which also helps to explain why it (usually) performs better than running min_element and max_element separately, and to cover some specific cases where it doesn't.

If we think about a naive implementation, you would maintain a min-value and a max-value (and their corresponding iterators) and simply iterate through the range, comparing each value with both the min-value and the max-value and adjusting either as necessary. However, this would give you a total of 2N comparisons; while it might well perform better than iterating through the list twice (due to better use of locality), the specification calls for (roughly) 3/2 N comparisons. How is that possible, then?
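
For concreteness, a minimal sketch of that naive approach (illustrative only, not how any particular library implements it; its tie-breaking also differs slightly from std::minmax_element, which returns the last largest element):

#include <iterator>
#include <utility>

template <typename It>
std::pair<It, It> naive_minmax(It first, It last)
{
    if (first == last) return {last, last};
    It minIt = first, maxIt = first;
    for (It it = std::next(first); it != last; ++it) {
        if (*it < *minIt) minIt = it;   // first comparison for this element
        if (*maxIt < *it) maxIt = it;   // second comparison for this element
    }
    return {minIt, maxIt};
}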

It works by processing pairs rather than individual items. Taking the first two items in the range (#0 and #1), we can compare them and assign the larger to max-value and the smaller to min-value. Then, we compare the next two items (#2 and #3) to decide which of them is larger; we compare the larger one to max-value and the smaller one to min-value, and update max-value/min-value as necessary. We then repeat this process with each additional pair (#4 and #5, then #6 and #7, and so on).

So, each pair requires three comparisons: with each other, then the larger with the current maximum, and the smaller with the current minimum. This cuts the number of comparisons required down to 3/2 N.
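
In code, that strategy looks roughly like the following sketch (simplified: it handles an odd leftover element at the end, but does not reproduce std::minmax_element's exact tie-breaking of first-smallest/last-largest):

#include <utility>

template <typename It>
std::pair<It, It> pairwise_minmax(It first, It last)
{
    if (first == last) return {last, last};
    It minIt = first, maxIt = first;
    ++first;
    while (first != last) {
        It a = first++;
        if (first == last) {
            // Odd element left over: compare it against both extremes.
            if (*a < *minIt) minIt = a;
            else if (*maxIt < *a) maxIt = a;
            break;
        }
        It b = first++;
        if (*b < *a) std::swap(a, b);  // comparison 1: order the pair so *a <= *b
        if (*a < *minIt) minIt = a;    // comparison 2: smaller element vs current min
        if (*maxIt < *b) maxIt = b;    // comparison 3: larger element vs current max
    }
    return {minIt, maxIt};
}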

As per the comments below, however, it should be noted that this "improved" algorithm tends to perform worse on modern processors than the naive version when the comparison is cheap for the element type (or comparator), notably with a range over a vector<int> or similar. The comparison between the two elements of each pair has an unpredictable result, which causes branch mispredictions in the processor (though this only applies if the data is more or less randomly ordered), and current compilers will not always convert such branches into conditional moves, as they potentially could. The more complex algorithm is also harder for a compiler to vectorise.

In theory, I think, a C++ library implementation could provide an overload of the minmax_element function which used the naive algorithm for primitive element types (int etc.) with the default comparator. While the standard mandates a bound on the number of comparisons, if the effect of those comparisons cannot be observed then the number actually performed is unimportant, so long as the time complexity is the same (which it is: O(N) in both cases). However, while this might give better performance with randomly ordered data, it might produce worse performance when the data is ordered.

With all the above in mind, a simple test case (below) shows an interesting result: for randomly ordered data, using min_element and max_element separately can in fact be slightly faster than using minmax_element. For sorted data, however, minmax_element is much faster than the two separate calls. On my system (Haswell processor), a sample run of the program below (compiled with gcc -O3 -std=c++11 -march=native, GCC version 5.4) shows 692 milliseconds for min/max separately and 848 for minmax combined. There is of course some variation between runs, but these values seem typical.

Note that:

  • The difference in performance is small enough that it is unlikely to be a dominating factor in a real-world program.
  • The difference depends on compiler optimisation; in the future, the results may well reverse.
  • For more complex data types (or rather, for more expensive comparators) the results will likely reverse, since performing fewer comparisons then becomes a significant improvement.
  • When the sample data is ordered rather than random (replace v.push_back(r(gen)) with v.push_back(i) in the program below), the performance is very different: separate min/max takes around 728 milliseconds, whereas combined minmax drops to around 246 milliseconds.

The code:

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

constexpr int numEls = 100000000;

void recresult(std::vector<int> *v, int min, int max)
{
    // Make sure the compiler doesn't optimize out the values:
    __asm__ volatile (
        ""
        :
        : "rm"(v), "rm"(min), "rm"(max)
    );
}

int main(int argc, char **argv)
{
    using namespace std;

    std::mt19937 gen(0);
    uniform_int_distribution<> r(0, 100000);


    vector<int> v;
    for (int i = 0; i < numEls; i++) {
        v.push_back(r(gen));
    }

    // run once for warmup
    int min = *min_element(v.begin(), v.end());
    int max = *max_element(v.begin(), v.end());
    recresult(&v, min, max);

    // min/max separately:
    {
        auto starttime = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 5; i++) {
            int min = *min_element(v.begin(), v.end());
            int max = *max_element(v.begin(), v.end());
            recresult(&v, min, max);
        }
        auto endtime = std::chrono::high_resolution_clock::now();
        auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(endtime - starttime).count();

        cout << "min/max element: " << millis << " milliseconds." << endl;
    }

    // run once for warmup
    auto minmaxi = minmax_element(v.begin(), v.end());
    recresult(&v, *(minmaxi.first), *(minmaxi.second));

    // minmax together:
    {
        auto starttime = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 5; i++) {
            minmaxi = minmax_element(v.begin(), v.end());
            recresult(&v, *(minmaxi.first), *(minmaxi.second));
        }
        auto endtime = std::chrono::high_resolution_clock::now();
        auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(endtime - starttime).count();

        cout << "minmax element: " << millis << " milliseconds." << endl;
    }

    return 0;
}
Sternlight answered 27/10, 2016 at 12:3 Comment(13)
Fun fact: on a modern processor, for a vector<int> whose elements are in random order, those 3N/2 comparisons take significantly longer than making the naive 2N comparisons, because of branch prediction.Hatbox
@MarcGlisse that is interesting. I would have thought though that on many modern processors you could implement it without actual branches (using eg conditional load instructions) - do these suffer the same penalty?Sternlight
Well-predicted branches are very cheap. You can try various tricks, but I'd be surprised if you managed to beat the naive version with scalar code using the 3n/2 trick.Hatbox
@Marc and davmac: A branchless implementation should be possible (for things like int that we can cmp/cmov). On x86, you'd use cmov between registers, since with a memory operand it's implemented as an unconditional load feeding the same 3-input uop (dst, src, and flags).Ikkela
@Marc and davmac: However, gcc's actual asm output uses a badly unpredictable branch on A[i+1] < A[i], but cmov to update the min/max. That's opposite of what I'd expect would perform well :( (updates to min/max are likely to become increasingly rare, so a branch could be good, but if you can't optimize away the a[i+1]<a[i] when the actual comparison is cheaper than swapping, cmov would be good.) With gcc6.2 on Godbolt, see .L7 inside the inner loop: the array loads and CMP/JL are there. Then (if the branch falls through), cmov updates of min/max.Ikkela
@marc: Critically, it also fails to autovectorize with -march=haswell, where it could blow through the array using SSE4.1 / AVX2 VPMAXSD and VPMINSD on 256b vectors. (And then horizontal min/max at the end). So that's horrible for large vectors. (Even std::min_element fails to autovectorize, though, on a vector<int>, instead compiling to a simple load/cmp/cmovg loop :( Is there already a missed-optimization bug open for this?Ikkela
Using -fprofile-generate/-fprofile-use already reduces the running time by 30%. Those functions are very inconvenient to optimize because they return the position, not the value. Note that for vectorization, the 2n version is easier than the 3n/2 one (which brings no benefit anyway because computing both min and max is 2 operations).Hatbox
gcc actually comes quite close to vectorizing min_element, tweaking the order of the passes should do it. First, gcc does notice that it doesn't need the position but just the value in your code. Then it tries to vectorize but fails because the loop contains a branch (ifconvert failure?). Then it recognizes the pattern for MIN. Now it would likely be able to vectorize, but it is too late. I believe a bug report would make sense.Hatbox
@MarcGlisse and PeterCordes, thanks. I find this really interesting. I've edited some of the discussion into the answer; please feel free to make further edits as you see fit.Sternlight
@PeterCordes gcc.gnu.org/bugzilla/show_bug.cgi?id=78151 for the missed vectorization of *min_element.Hatbox
@davmac: good point in your update. A template specialization for types without an overloaded operator< (or whatever other way you could detect that it's non-observable) could be helpful for getting better scalar code, and also easier to auto-vectorize for compilers that can see through a deref of the returned iterator and only keep track of min/max value (not position).Ikkela
Do remember that for the 2n version to be better than the 3n/2 version, you want some kind of random order on the elements. That is not always the case, so specializing the algorithm may make some users unhappy.Hatbox
@MarcGlisse good point. The combined minmax_element performs much better than two separate calls when the data is ordered.Sternlight

Yes. You're iterating over the range only once instead of doing it twice.

Armandoarmature answered 27/10, 2016 at 11:38 Comment(2)
Ok, do you mean that the internal implementation maintains two values, minimum_till_now and maximum_till_now, and returns them at the end of a full traversal?Cralg
Yes, it wouldn't make sense to implement this algorithm actually using the other two you mentioned. That being said, look at NathanOliver's answer - the difference is also in the number of comparisons.Armandoarmature

std::minmax_element complexity:

At most max(floor(3/2(N−1)), 0) applications of the predicate, where N = std::distance(first, last).

std::min_element complexity (same as max_element):

Exactly max(N-1,0) comparisons, where N = std::distance(first, last).

Ignoring max and floor, we get:

(N-1) * 2 vs 3/2 (N-1)

So by using minmax_element you perform only 3/4 of the comparisons you would need with max_element + min_element, or fewer.

minmax_element exploits the transitivity of the < operator by comparing two elements at a time: if a < b, then a can only affect the minimum and b can only affect the maximum, i.e. we only need to check min(a, current_min) and max(b, current_max), and vice versa.

Also worth noting:

This algorithm is different from std::make_pair(std::min_element(), std::max_element()), not only in efficiency, but also in that this algorithm finds the last biggest element while std::max_element finds the first biggest element.

Palumbo answered 27/10, 2016 at 11:43 Comment(2)
Your explanation of the improved efficiency "if it is updating the minimum it does not need to compare for the maximum" is not exact. If that were done it would only give a marginal improvement of efficiency, since most elements will not update the minimum. Instead the function gives an improvement of 25% in all cases for which it uses a different strategy (see the answer by davmac for more).Beholden
@MarcvanLeeuwen Fair enough, updating the answer.Palumbo
