Data structure that always keeps n-best elements
Asked Answered
N

9

11

I need a data structure that always holds the n largest items inserted so far (in no particular order).

So, if n is 3, we could have the following session where I insert a few numbers and the content of the container changes:

[]  // now insert 1
[1] // now insert 0
[1,0] // now insert 4
[1,0,4] // now insert 3
[1,4,3] // now insert 0
[1,4,3] // now insert 3
[4,3,3]

You get the idea. What's the name of the data structure? What's the best way to implement this? Or is this in some library?

I am thinking to use a container that has a priority_queue for its elements (delegation), which uses the reverse comparison, so pop will remove the smallest element. So the insert function first checks if the new element to be inserted is greater than the smallest. If so, we throw that smallest out and push the new element.

(I have a C++ implementation in mind, but the question is language-agnostic nevertheless.)

Nur answered 19/2, 2009 at 6:4 Comment(0)
B
15

The specific datastructure you want is probably the implicit heap. The raw datastructure is just an array; for convenience, say that it is N=2^n elements in size, and that you want to maintain the largest N-1 elements.

The idea is to treat the array (call it A) as a complete binary tree of depth n:

  • ignore A[0]; treat A[1] as the root node
  • for each node A[k], the children are A[2*k] and A[2*k+1]
  • nodes A[N/2..N-1] are the leaves

To maintain the tree as a "heap", you need to ensure that each node is smaller than (or equal to) its children. This is called the "heap condition":

  • A[k] <= A[2*k]
  • A[k] <= A[2*k+1]

To use the heap to maintain the largest N elements:

  • note that the root A[1] is the smallest element in the heap.
  • compare each new element (x) to the root: if it is smaller (x<A[1]), reject it.
  • otherwise, insert the new element into the heap, as follows:
    • remove the root (A[1], the smallest element) from the heap, and reject it
    • replace it with the new element (A[1]:= x)
    • now, restore the heap condition:
      • if x is less than or equal to both of its children, you're done
      • otherwise, swap x with the smallest child
      • repeat the test&swap at each new position until the heap condition is met

Basically, this will cause any replacement element to "filter up" the tree until it achieves its natural place. This will take at most n=log2(N) steps, which is as good as you can get. Also, the implicit form of the tree allows a very fast implementation; existing bounded-priority-queue libraries will most likely use an implicit heap.

Bevash answered 19/2, 2009 at 6:50 Comment(2)
As far as I can see this is the priority solution proposed by the OP with the added optimization of folding the remove_smallest and insert_new actions.Sack
This is most certainly not the most efficient you can get. See my answer for further information.Dur
G
6

In Java you can use a SortedSet implemented e.g. by a TreeSet. After each insertion check if the set is too large, if yes remove the last element.

This is reasonably efficient, I have used it successfully for solving several Project Euler problems.

Galbreath answered 19/2, 2009 at 7:30 Comment(1)
a TreeSet requires an ordering consistent with equals though, subject on wich the OP is not clear.Aurist
H
4

A priority_queue is the closest thing in C++ with STL. You could wrap it in another class to create your own implementation that trims the size automatically.

Language-agnostically (although maybe not memory-fragmentation-safely):

  1. Insert data
  2. Sort
  3. Delete everything after the nth element

std::priority_queue does step 2 for you.

Haberdashery answered 19/2, 2009 at 6:15 Comment(0)
C
2

A bounded priority queue, I think... Java has something like this in its standard library. EDIT: it's called LinkedBlockingQueue. I'm not sure if the C++ STL includes something similar.

Cheng answered 19/2, 2009 at 6:7 Comment(3)
It is not LinkedBlockingQueue, which blocks if size is exceeded instead of dropping smaller elements.Galbreath
ah well, I thought there was one in the library, but it appears not.Cheng
There are some existing implementation for BoundedPriorityQueue, e.g. alias-i.com/lingpipe/docs/api/com/aliasi/util/…Sall
C
1

Isn't it possible to just take the first n elements from a sorted collection?

Confound answered 19/2, 2009 at 7:30 Comment(1)
Yes, but that's not efficient in terms of space. I don't want to have to store all elements in this data structure. I'm be interested in a very small subset only. But for many cases that would be a fine solution.Nur
F
1

In Pyhton, use heapq. Create a small wrapper around it, something like this:

class TopN_Queue:
    def __init__(self, n):
        self.max_sz = n
        self.data = []

    def add(self, x):
        if len(self.data) == self.max_sz:
            heapq.heappushpop(self.data, x)
        else:
            heapq.heappush(self.data, x)

...

Franciskus answered 28/9, 2018 at 3:44 Comment(0)
F
0

yes you can maintain a minimal head of size N then you compare new item with the root item on each insertion pop the root and insert the item if it's "greater" than the root finally you end up with N largest items

Fiver answered 21/11, 2012 at 7:46 Comment(0)
P
0

Here's a simple implementation in C++ 11. Probably not the most optimal, but easy to use and concise.

    #include <iostream>
    #include <map>
    
    /// @brief Maintains a set of N elements with the best keys. (duplicate keys are allowed)
    template<class Key, class T, class Compare = std::less<Key> >
    class KeepNBests{
    public:
        KeepNBests(size_t N=0):N(N), is_less(Compare()){}
    
        bool add(const Key& key, const T& t){
            // If there are enough elements,
            if(bests.size()>N // and the candidate key is worse than the worst stored key...
                && is_less(key, bests.cbegin()->first)){
                return false; // nothing to do
            }
            // else insert the new candidate
            bests.insert({key, t});
            // keeps max N elts
            while(bests.size()>N){
                bests.erase(bests.cbegin());
            }
            return true;
        }
        const std::multimap<Key, T, Compare>& get()const{
            return bests;
        }
    public:
        const size_t N;
    private:
        Compare is_less;
        std::multimap<Key, T, Compare> bests;
    };
    
    
    int main() {
        
      KeepNBests<double, std::string> best(3);
      best.add(0., "bad");
      best.add(0., "bad");
      best.add(300., "the best");
      best.add(10., "ok");
      best.add(-1., "bad");
      best.add(-7., "very bad");
      best.add(-1., "bad");
      best.add(100., "good");
      best.add(2., "bof");
      best.add(200., "the second");
    
      for(const auto& b : best.get()){
        std::cout << b.first << " " << b.second << std::endl;
      }
      return 0;
    }

Output :

 100 good
 200 the second
 300 the best
Placebo answered 23/1 at 16:56 Comment(0)
D
-1

Create a min-heap, also store a counter.

Whenever the counter is reached; extract-min.

You can do this in: O(1) insert, get-min and O(log log n) extract-min.[1] Alternatively you can do this with O(log n) insert and O(1) for the other mentioned operations.[2]


[1] M. Thorup, “Integer priority queues with decrease key in constant time and the single source shortest paths problem,” in Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, New York, NY, USA, 2003, pp. 149–158.

[2] G. S. Brodal, G. Lagogiannis, C. Makris, A. Tsakalidis, and K. Tsichlas, “Optimal finger search trees in the pointer machine,” J. Comput. Syst. Sci., vol. 67, no. 2, pp. 381–418, Sep. 2003.

Dur answered 5/1, 2013 at 9:30 Comment(2)
What is the counter counting? When is it reached?Rana
Counter keeps track of how many items are currently in the data-structure. The Finger Tree is the best solution here (asymptotically speaking). Just remove-max until the counter is within the specified threshold.Dur

© 2022 - 2024 — McMap. All rights reserved.