Suppose you have a sequence of N items and want to find the Tth largest value, where storing all N items is unmanageable (say, 10 billion) but storing T items is manageable (say, 100,000). In that case you can use a sorted map data structure that maps each value in the sequence to the number of occurrences seen for that value.
Sorted maps are often implemented as balanced (binary) trees and keep their keys in sorted order. Most programming languages have some variation of this data structure in their standard library.
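For example, here's a minimal C++ sketch of the two properties the algorithm below relies on: duplicate values collapse into an occurrence count, and begin() always points at the smallest key.

#include <iostream>
#include <map>

int main() {
    // std::map keeps its keys in ascending order (typically a red-black tree)
    std::map<long long, long long> counts;
    counts[42]++;
    counts[7]++;
    counts[99]++;
    counts[7]++;  // a duplicate value just bumps its occurrence count

    // begin() is always the smallest key currently in the map
    std::cout << "smallest: " << counts.begin()->first
              << " (seen " << counts.begin()->second << " times)\n";

    // iteration visits keys in sorted order: 7, 42, 99
    for (const auto& [value, occurrences] : counts) {
        std::cout << value << " -> " << occurrences << "\n";
    }
}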
The algorithm is this:
Create an empty sorted map that maps a value to the number of
occurrences of that value seen in the sequence.

For each index from 0 to T-1:
    Generate a new random sequence value => V
    Store V as a key in the map:
        if V is not in the map already, then map[V] = 1
        if V is already in the map, then map[V] = map[V] + 1

Then assess the smallest value, S, seen so far in the map.

For each index from T to N-1:
    Generate a new sequence value => V
    if (V <= S):
        skip
    else:
        decrement the occurrence count of the smallest value, S, in the map;
        if map[S] drops to 0, remove S entirely from the map
        insert V in the same manner as above
        reassess S by inspecting the smallest (first) key in the map

After the second loop, the Tth largest element is the smallest value
remaining in the map.
Here's a C++ implementation that leverages std::map for a sorted table.
#include <climits>
#include <iostream>
#include <map>

// Declared here, defined elsewhere (see the note on the random number generator below)
void seedRandomNumberGenerator();
long long getNextRandomNumber();

int main() {
    long long N = 10'000'000'000LL;
    long long T = 100'000; // want the Tth largest number in the sequence
    long long smallest = LLONG_MAX;
    long long removals = 0;
    std::map<long long, long long> table; // sorted map from a random value to its number of occurrences
    seedRandomNumberGenerator();
    for (long long i = 0; i < N; i++) {
        if (i % 1'000'000LL == 0) {
            std::cout << i << std::endl; // progress indicator
        }
        long long value = getNextRandomNumber();
        if (i < T) {
            table[value]++; // implicitly inserts table[value]=1 if value isn't in the table, otherwise increments the count
            if (value < smallest) {
                smallest = value;
            }
        } else {
            // at this point there are T (100000) values tracked in the map;
            // any time we encounter a value greater than the smallest item
            // we push the smallest item out
            if (value > smallest) {
                removals++;
                auto itor = table.begin();
                if (--itor->second == 0) {
                    table.erase(itor); // last occurrence of the smallest value; drop it entirely
                }
                table[value]++;
                smallest = table.begin()->first;
            }
        }
    }
    long long tValue = table.begin()->first;
    std::cout << "the " << T << "th largest value in the sequence is " << tValue << "\n";
    std::cout << "there were " << removals << " removals from the table\n";
}
Each insert and lookup where table[value]++ is invoked is typically O(lg(T)), so the cost of the first T inserts is O(T*lg(T)). Removals, where table.erase is invoked with an iterator, are amortized O(1); erasing by key instead would cost an extra O(lg(T)) lookup
. When i is small, the probability of value > smallest is high and the cost of a removal from the map is paid. As i approaches N, the probability of a removal drops to roughly T/N, or 0.00001 in the case of 100K/10B. In my simulation with N=10B and T=100K, there were about 1 million removals.
table.begin() is O(1).
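That removal count lines up with a back-of-the-envelope estimate (assuming the generated values are independent and identically distributed): the i-th value exceeds the current smallest tracked value with probability roughly T/i, so the expected number of removals is about T*ln(N/T). A quick check of the arithmetic:

#include <cmath>
#include <iostream>

int main() {
    // Expected removals for i.i.d. values: sum over i from T to N of T/i  ~=  T * ln(N/T)
    double N = 10'000'000'000.0; // 10 billion
    double T = 100'000.0;        // tracking the top 100K
    std::cout << "expected removals ~ " << T * std::log(N / T) << "\n"; // about 1.15 million
}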
The above code, when compiled as a release build with optimizations, takes about 3-5 minutes to run on my new Core i9 workstation. I'm using a modified rand() algorithm as the implementation for getNextRandomNumber().
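Since the listing above only declares seedRandomNumberGenerator() and getNextRandomNumber(), here is a stand-in sketch using <random> that makes it self-contained. This is just a placeholder for illustration, not the modified rand() implementation described above:

#include <cstdint>
#include <random>

// Placeholder definitions for the helpers declared in the main listing.
// A stand-in using <random>; the actual implementation is a modified rand().
static std::mt19937_64 g_rng;

void seedRandomNumberGenerator() {
    std::random_device rd;
    g_rng.seed((static_cast<std::uint64_t>(rd()) << 32) ^ rd());
}

long long getNextRandomNumber() {
    // uniform in [0, 2^62) so the result is always a non-negative long long
    static std::uniform_int_distribution<long long> dist(0, (1LL << 62) - 1);
    return dist(g_rng);
}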
I think there are several optimizations that can be done to beat the performance of std::map at the expense of using slightly more memory. I'll have to sleep on that.
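One hypothetical direction, sketched here purely for illustration: buffer accepted candidates in a flat array of about 2*T values and prune it with std::nth_element whenever it fills up, instead of paying a balanced-tree operation per accepted value. That trades roughly double the value storage for better constants and cache behavior.

#include <algorithm>
#include <climits>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical sketch: keep candidates in a flat buffer of ~2*T values and
// prune with std::nth_element when it fills, rather than touching a tree on
// every accepted value. Assumes the sequence supplies at least T values.
long long tthLargest(const std::vector<long long>& sequence, std::size_t T) {
    std::vector<long long> top;
    top.reserve(2 * T);
    long long threshold = LLONG_MIN; // values <= threshold can never change the answer

    auto prune = [&]() {
        // Partition so the T largest values occupy top[0..T-1], then drop the rest.
        std::nth_element(top.begin(), top.begin() + (T - 1), top.end(),
                         std::greater<long long>());
        top.resize(T);
        threshold = top[T - 1]; // the Tth largest value seen so far
    };

    for (long long v : sequence) {    // in the real problem, v would come from the generator one at a time
        if (v <= threshold) continue; // cheap rejection, no tree walk
        top.push_back(v);
        if (top.size() == 2 * T) prune();
    }
    prune(); // final prune so top[T-1] is the Tth largest overall
    return top[T - 1];
}

int main() {
    std::vector<long long> demo = {5, 1, 9, 7, 3, 8, 2};
    std::cout << tthLargest(demo, 3) << "\n"; // prints 7, the 3rd largest
}

With T=100K this would run one nth_element pass over ~200K values per ~100K accepted candidates instead of a tree insert and erase per candidate; whether that actually beats std::map here is something I'd have to measure.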