Suppose you have a sequence of N items and want to find the Tth largest value, where storing all N items is unmanageable (say, 10 billion) but storing T items is manageable (say, 100,000). In that case you can use a sorted map data structure that maps each value in the sequence to the number of occurrences seen for that value.
Sorted maps are often implemented as balanced (binary) trees and keep their keys in sorted order. Most programming languages have some variation of this data structure in their standard library.
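For example, here's a minimal C++ sketch of the two properties the algorithm below relies on: duplicate values collapse into an occurrence count, and begin() always points at the smallest key.

#include <iostream>
#include <map>

int main() {
    // std::map keeps its keys in ascending order (typically a red-black tree)
    std::map<long long, long long> counts;
    counts[42]++;
    counts[7]++;
    counts[99]++;
    counts[7]++;  // a duplicate value just bumps its occurrence count

    // begin() is always the smallest key currently in the map
    std::cout << "smallest: " << counts.begin()->first
              << " (seen " << counts.begin()->second << " times)\n";

    // iteration visits keys in sorted order: 7, 42, 99
    for (const auto& [value, occurrences] : counts) {
        std::cout << value << " -> " << occurrences << "\n";
    }
}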
The algorithm is this:
Create an empty sorted map that maps a value to the number of
occurrences of that value seen in the sequence.

For each index from 0 to T-1:
    Generate a new random sequence value => V
    Store V as a key in the map:
        if V is not in the map already, then map[V] = 1
        if V is already in the map, then map[V] = map[V] + 1

Then assess the smallest value, S, seen so far in the map.

For each index from T to N-1:
    Generate a new sequence value => V
    if (V <= S):
        skip
    else:
        decrement the occurrence count of the smallest value, S, in the map;
        if map[S] drops to 0, remove S entirely from the map
        insert V in the same manner as above
        reassess S by inspecting the smallest (first) key in the map

After the second loop, the Tth largest element is the smallest value
remaining in the map.
Here's a C++ implementation that leverages std::map for a sorted table.
#include <climits>
#include <iostream>
#include <map>

// Declared here, defined elsewhere (see the note on the random number generator below)
void seedRandomNumberGenerator();
long long getNextRandomNumber();

int main() {
    long long N = 10'000'000'000LL;
    long long T = 100'000; // want the Tth largest number in the sequence
    long long smallest = LLONG_MAX;
    long long removals = 0;
    std::map<long long, long long> table; // sorted map from a random value to its number of occurrences
    seedRandomNumberGenerator();
    for (long long i = 0; i < N; i++) {
        if (i % 1'000'000LL == 0) {
            std::cout << i << std::endl; // progress indicator
        }
        long long value = getNextRandomNumber();
        if (i < T) {
            table[value]++; // implicitly inserts table[value]=1 if value isn't in the table, otherwise increments the count
            if (value < smallest) {
                smallest = value;
            }
        } else {
            // at this point there are T (100000) values tracked in the map;
            // any time we encounter a value greater than the smallest item
            // we push the smallest item out
            if (value > smallest) {
                removals++;
                auto itor = table.begin();
                if (--itor->second == 0) {
                    table.erase(itor); // last occurrence of the smallest value; drop it entirely
                }
                table[value]++;
                smallest = table.begin()->first;
            }
        }
    }
    long long tValue = table.begin()->first;
    std::cout << "the " << T << "th largest value in the sequence is " << tValue << "\n";
    std::cout << "there were " << removals << " removals from the table\n";
}
Each insert and lookup where table[value]++ is invoked is typically O(lg(T)), so the cost of the first T inserts is O(T*lg(T)). Removals, where table.erase is invoked with an iterator, are amortized O(1); erasing by key instead would cost an extra O(lg(T)) lookup
. When i is small, the probability of value > smallest is high and the cost of a removal from the map is paid. As i approaches N, the probability of a removal drops to roughly T/N, or 0.00001 in the case of 100K/10B. In my simulation with N=10B and T=100K, there were about 1 million removals.
table.begin() is O(1).
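That removal count lines up with a back-of-the-envelope estimate (assuming the generated values are independent and identically distributed): the i-th value exceeds the current smallest tracked value with probability roughly T/i, so the expected number of removals is about T*ln(N/T). A quick check of the arithmetic:

#include <cmath>
#include <iostream>

int main() {
    // Expected removals for i.i.d. values: sum over i from T to N of T/i  ~=  T * ln(N/T)
    double N = 10'000'000'000.0; // 10 billion
    double T = 100'000.0;        // tracking the top 100K
    std::cout << "expected removals ~ " << T * std::log(N / T) << "\n"; // about 1.15 million
}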
The above code, when compiled as a release build with optimizations, takes about 3-5 minutes to run on my new Core i9 workstation. I'm using a modified rand() algorithm as the implementation for getNextRandomNumber().
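Since the listing above only declares seedRandomNumberGenerator() and getNextRandomNumber(), here is a stand-in sketch using <random> that makes it self-contained. This is just a placeholder for illustration, not the modified rand() implementation described above:

#include <cstdint>
#include <random>

// Placeholder definitions for the helpers declared in the main listing.
// A stand-in using <random>; the actual implementation is a modified rand().
static std::mt19937_64 g_rng;

void seedRandomNumberGenerator() {
    std::random_device rd;
    g_rng.seed((static_cast<std::uint64_t>(rd()) << 32) ^ rd());
}

long long getNextRandomNumber() {
    // uniform in [0, 2^62) so the result is always a non-negative long long
    static std::uniform_int_distribution<long long> dist(0, (1LL << 62) - 1);
    return dist(g_rng);
}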
I think there are several optimizations that can be done to beat the performance of std::map at the expense of using slightly more memory. I'll have to sleep on that.
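One hypothetical direction, sketched here purely for illustration: buffer accepted candidates in a flat array of about 2*T values and prune it with std::nth_element whenever it fills up, instead of paying a balanced-tree operation per accepted value. That trades roughly double the value storage for better constants and cache behavior.

#include <algorithm>
#include <climits>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical sketch: keep candidates in a flat buffer of ~2*T values and
// prune with std::nth_element when it fills, rather than touching a tree on
// every accepted value. Assumes the sequence supplies at least T values.
long long tthLargest(const std::vector<long long>& sequence, std::size_t T) {
    std::vector<long long> top;
    top.reserve(2 * T);
    long long threshold = LLONG_MIN; // values <= threshold can never change the answer

    auto prune = [&]() {
        // Partition so the T largest values occupy top[0..T-1], then drop the rest.
        std::nth_element(top.begin(), top.begin() + (T - 1), top.end(),
                         std::greater<long long>());
        top.resize(T);
        threshold = top[T - 1]; // the Tth largest value seen so far
    };

    for (long long v : sequence) {    // in the real problem, v would come from the generator one at a time
        if (v <= threshold) continue; // cheap rejection, no tree walk
        top.push_back(v);
        if (top.size() == 2 * T) prune();
    }
    prune(); // final prune so top[T-1] is the Tth largest overall
    return top[T - 1];
}

int main() {
    std::vector<long long> demo = {5, 1, 9, 7, 3, 8, 2};
    std::cout << tthLargest(demo, 3) << "\n"; // prints 7, the 3rd largest
}

With T=100K this would run one nth_element pass over ~200K values per ~100K accepted candidates instead of a tree insert and erase per candidate; whether that actually beats std::map here is something I'd have to measure.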