indices of the k largest elements in an unsorted length n array
S

7

21

I need to find the indices of the k largest elements of an unsorted, length-n array/vector in C++, with k < n. I have seen how to use nth_element() to find the k-th order statistic, but I'm not sure it's the right choice for my problem: it seems I would need to make k calls to nth_element(), which I guess would have complexity O(kn). Is that as good as it gets, or is there a way to do this in just O(n)?

Implementing it without nth_element() seems to require iterating over the whole array once, maintaining a list of the indices of the largest elements seen so far.

Is there anything in the standard C++ library that makes this a one-liner or any clever way to implement this myself in just a couple lines? In my particular case, k = 3, and n = 6, so efficiency isn't a huge concern, but it would be nice to find a clean and efficient way to do this for arbitrary k and n.

It looks like Mark the top N elements of an unsorted array is probably the closest posting I can find on SO, but the answers there are in Python and PHP.

Surculose answered 15/2, 2013 at 20:32 Comment(4)
Can you modify the vector? nth_element will do a partial sort in place, so it modifies the vector.Bilabiate
The vector can be modified, however the end result needs to be the indices (of the original vector) of the k largest elements.Surculose
This is just a selection algorithm. Usually you'll use either heap select or quick select. See https://mcmap.net/q/127179/-throwing-the-fattest-people-off-of-an-overloaded-airplane/56778 for a similar question. There is an answer with a good C++ solution. (using priority_queue)Operative
By the way, if k=3 and n=6, then you're probably best off just sorting the array and picking the top 3 items. As you say, efficiency isn't a huge concern, and the difference between O(kn) and O(n) is insignificant with such small numbers.Operative
A
11

This is an improved version of @hazelnusse's answer: it runs in O(n log k) instead of O(n log n).

#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>
// maxindices.cc
// compile with:
// g++ -std=c++11 maxindices.cc -o maxindices
int main()
{
  std::vector<double> test = {2, 8, 7, 5, 9, 3, 6, 1, 10, 4};
  // Min-heap of (value, index) pairs: the top is always the smallest of the
  // k largest elements seen so far.
  std::priority_queue<std::pair<double, int>,
                      std::vector<std::pair<double, int>>,
                      std::greater<std::pair<double, int>>> q;
  std::size_t k = 5; // number of indices we need
  for (std::size_t i = 0; i < test.size(); ++i) {
    if (q.size() < k)
      q.push(std::make_pair(test[i], (int)i));
    else if (q.top().first < test[i]) {
      q.pop();
      q.push(std::make_pair(test[i], (int)i));
    }
  }
  // Pop the min-heap (smallest of the k comes out first), filling res back to
  // front so the index of the largest element ends up first.
  k = q.size();
  std::vector<int> res(k);
  for (std::size_t i = 0; i < k; ++i) {
    res[k - i - 1] = q.top().second;
    q.pop();
  }
  for (std::size_t i = 0; i < k; ++i)
    std::cout << res[i] << std::endl;
}

Output (one index per line; the indices of the 5 largest elements, largest value first):

8 4 1 2 6

Abelabelard answered 15/7, 2016 at 8:38 Comment(0)
S
9

Here is my implementation that does what I want and I think is reasonably efficient:

#include <iostream>
#include <queue>
#include <vector>
// maxindices.cc
// compile with:
// g++ -std=c++11 maxindices.cc -o maxindices
int main()
{
  std::vector<double> test = {0.2, 1.0, 0.01, 3.0, 0.002, -1.0, -20};
  std::priority_queue<std::pair<double, int>> q;
  for (int i = 0; i < test.size(); ++i) {
    q.push(std::pair<double, int>(test[i], i));
  }
  int k = 3; // number of indices we need
  for (int i = 0; i < k; ++i) {
    int ki = q.top().second;
    std::cout << "index[" << i << "] = " << ki << std::endl;
    q.pop();
  }
}

which gives output:

index[0] = 3
index[1] = 1
index[2] = 0
Surculose answered 15/2, 2013 at 22:29 Comment(5)
I timed an implementation using nth_element and one with partial_sort and using a custom comparator... your implementation is faster.Bilabiate
There's no need to add all the items to the priority queue. That makes the algorithm O(n log n). It can be done in O(n log k) if you don't add things that are smaller than the smallest item already on the queue. See https://mcmap.net/q/127179/-throwing-the-fattest-people-off-of-an-overloaded-airplane/56778 for discussion.Operative
@JimMischel Perhaps I'm missing something, but as far as I can see, if I only add elements which are bigger than the smallest element in the queue I may end up missing some of the k-top elements. E.g., if the first element I add into the priority queue is the maximum element, it is at the same time the smallest element in the queue and would result in the algorithm not adding any additional elements.Pineda
@BananaCode: If you look at the linked answer, you'll see that you initially populate the priority queue with the first k elements. Then you use the only-add-if-larger-than-smallest rule on the remaining elements.Operative
What about the case when there's a tie(s) with k-th largest element? Would be nice to have this extension to your method.Transformism
F
8

The question already contains a partial answer: std::nth_element puts the n-th order statistic into place, with the property that none of the elements preceding it are greater than it, and none of the elements following it are less.

Therefore, just one call to std::nth_element is enough to get the k largest elements. Time complexity is O(n) on average, which is asymptotically optimal, since you have to visit each element at least once to find the largest (or, here, the k largest) element(s). If you need these k elements ordered, you then need to sort them, which costs O(k log k), so O(n + k log k) in total.

Fluid answered 6/5, 2014 at 4:33 Comment(2)
This finds the k largest elements, whereas the OP's requirement is finding the k largest indices.Amharic
Well, you are right and (looking at the question again) I don't know why I gave this answer in the first place and why people up-voted it. But most probably, they misunderstood the question just like me, and apparently, this answer helped them in some way so I will keep it like this.Fluid
M
3

You can use the basis of the quicksort algorithm to do what you need, except that instead of recursing into both partitions, you discard the partition that cannot contain the rank you are looking for.

It's been referred to as "quick select" and here is a C++ implementation:

#include <iostream>
using std::cout;
using std::endl;

// Partition input[p..r] around the pivot input[r];
// return the pivot's final position.
int partition(int* input, int p, int r)
{
    int pivot = input[r];

    while ( p < r )
    {
        while ( input[p] < pivot )
            p++;

        while ( input[r] > pivot )
            r--;

        if ( input[p] == input[r] )
            p++;
        else if ( p < r ) {
            int tmp = input[p];
            input[p] = input[r];
            input[r] = tmp;
        }
    }

    return r;
}

// Return the k-th smallest element of input[p..r] (k is 1-based).
int quick_select(int* input, int p, int r, int k)
{
    if ( p == r ) return input[p];
    int j = partition(input, p, r);
    int length = j - p + 1;
    if ( length == k ) return input[j];
    else if ( k < length ) return quick_select(input, p, j - 1, k);
    else  return quick_select(input, j + 1, r, k - length);
}

int main()
{
    int A1[] = { 100, 400, 300, 500, 200 };
    cout << "1st order element " << quick_select(A1, 0, 4, 1) << endl;
    int A2[] = { 100, 400, 300, 500, 200 };
    cout << "2nd order element " << quick_select(A2, 0, 4, 2) << endl;
    int A3[] = { 100, 400, 300, 500, 200 };
    cout << "3rd order element " << quick_select(A3, 0, 4, 3) << endl;
    int A4[] = { 100, 400, 300, 500, 200 };
    cout << "4th order element " << quick_select(A4, 0, 4, 4) << endl;
    int A5[] = { 100, 400, 300, 500, 200 };
    cout << "5th order element " << quick_select(A5, 0, 4, 5) << endl;
}

OUTPUT:

1st order element 100
2nd order element 200
3rd order element 300
4th order element 400
5th order element 500

EDIT

That particular implementation has an O(n) average run time, but because of how the pivot is chosen it shares quicksort's O(n²) worst case. With a better pivot choice (e.g. the median-of-medians rule), the worst case also becomes O(n).

Menam answered 15/2, 2013 at 20:39 Comment(0)
W
2

The standard library won't give you a list of indices (it is designed to avoid passing around redundant data). However, if you're interested in the n largest elements themselves, you can use some kind of partitioning (both std::partition and std::nth_element are O(n)):

#include <iostream>
#include <algorithm>
#include <vector>

struct Pred {
    Pred(int nth) : nth(nth) {};
    bool operator()(int k) { return k >= nth; }
    int nth;
};

int main() {

    int n = 4;
    std::vector<int> v = {5, 12, 27, 9, 4, 7, 2, 1, 8, 13, 1};

    // Moves the nth element to the nth from the end position.
    std::nth_element(v.begin(), v.end() - n, v.end());

    // Reorders the range, so that the first n elements would be >= nth.
    std::partition(v.begin(), v.end(), Pred(*(v.end() - n)));

    for (auto it = v.begin(); it != v.end(); ++it)
        std::cout << *it << " ";
    std::cout << "\n";

    return 0;
}
Wynn answered 15/2, 2013 at 22:6 Comment(2)
I specifically need the indices.Surculose
@hazelnusse You can define a structure type for your elements, storing both value and the original index, and meanwhile define the comparator for it.Fredericfrederica
E
0

You can do this in O(n) time with a single order statistic calculation:

  • Let r be the k-th order statistic
  • Initialize two empty lists bigger and equal.
  • For each index i:
    • If array[i] > r, add i to bigger
    • If array[i] = r, add i to equal
  • Discard elements from equal until the sum of the lengths of the two lists is k
  • Return the concatenation of the two lists.

Naturally, you only need one list if all items are distinct. And if needed, you could do tricks to combine the two lists into one, although that would make the code more complicated.

Eagre answered 15/7, 2016 at 8:49 Comment(0)
H
0

Even though the following code might not meet the desired complexity bounds, it may be an interesting alternative to the aforementioned priority queue: it builds a heap over all value-index pairs and pops the top k.

#include <queue>
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

std::vector<int> largestIndices(const std::vector<double>& values, int k) {
    std::vector<int> ret;

    // Pair each value with its original index, then arrange the pairs into a
    // max-heap ordered by value.
    std::vector<std::pair<double, int>> q;
    int index = -1;
    std::transform(values.begin(), values.end(), std::back_inserter(q),
                   [&](double val) { return std::make_pair(val, ++index); });
    auto functor = [](const std::pair<double, int>& a,
                      const std::pair<double, int>& b) { return b.first > a.first; };
    std::make_heap(q.begin(), q.end(), functor);
    for (auto i = 0; i < k && i<values.size(); i++) {
        std::pop_heap(q.begin(), q.end(), functor);
        ret.push_back(q.back().second);
        q.pop_back();
    }

    return ret;
}

int main()
{
    std::vector<double> values = { 7,6,3,4,5,2,1,0 };
    auto ret=largestIndices(values, 4);
    std::copy(ret.begin(), ret.end(), std::ostream_iterator<int>(std::cout, "\n"));
}
Helios answered 26/8, 2016 at 10:55 Comment(0)
