Say I have a linked list of numbers of length N. N is very large and I don't know in advance the exact value of N. How can I most efficiently write a function that will return k completely random numbers from the list?
There's a very nice and efficient algorithm for this using a method called reservoir sampling.
Let me start by giving you its history:
Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.
McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.
Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithms predetermine the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming number.
In pseudocode the algorithm is:
    Let R be the result array of size s
    Let I be an input queue

    # Fill the reservoir array
    for j in the range [1, s]:
        R[j] = I.pop()

    elements_seen = s
    while I is not empty:
        elements_seen += 1
        j = random(1, elements_seen)   # inclusive on both ends
        if j <= s:
            R[j] = I.pop()
        else:
            I.pop()                    # skip this element
Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand, and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
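Here is a minimal Python sketch of the pseudocode above (the function name and the use of a generic iterable in place of the queue I are my own choices; any stream of values works):

    import random

    def reservoir_sample(stream, s):
        # The reservoir holds a fair sample of everything seen so far.
        reservoir = []
        for elements_seen, x in enumerate(stream, start=1):
            if elements_seen <= s:
                # Fill the reservoir with the first s elements.
                reservoir.append(x)
            else:
                # Keep x with probability s/elements_seen: draw an index in
                # [0, elements_seen) and accept if it lands in the reservoir.
                j = random.randrange(elements_seen)
                if j < s:
                    reservoir[j] = x
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))

Because the reservoir is a fair sample at every step, you can stop the loop at any point and return the current contents.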
Why does this work?
McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct here. Therefore, I've put together an alternative proof which is easier to explain.
We proceed via proof by induction.
Say we want to generate a set of s elements and that we have already seen n > s elements. Let's assume that our current s elements have each been chosen with probability s/n.
By the definition of the algorithm, we choose element n+1 with probability s/(n+1). Each element already part of our result set has a probability 1/s of being replaced.
The probability that an element from the n-seen result set is replaced in the (n+1)-seen result set is therefore (1/s) * (s/(n+1)) = 1/(n+1). Conversely, the probability that an element is not replaced is 1 - 1/(n+1) = n/(n+1).
Thus, the (n+1)-seen result set contains a given element either if it was part of the n-seen result set and was not replaced (probability (s/n) * (n/(n+1)) = s/(n+1)) or if the element was the one just chosen (probability s/(n+1)). Either way, every element seen so far appears in the (n+1)-seen result set with probability s/(n+1), which completes the inductive step.
The definition of the algorithm tells us that the first s elements are automatically included as the first n = s members of the result set. Therefore, the n-seen result set includes each element with probability s/n = 1, giving us the necessary base case for the induction.
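If you'd like to convince yourself empirically as well, a quick simulation (my own addition, not part of the proof) is to run the algorithm many times and check that every element ends up in the sample roughly s/n of the time:

    import random
    from collections import Counter

    def reservoir_sample(stream, s):
        reservoir = []
        for seen, x in enumerate(stream, start=1):
            if seen <= s:
                reservoir.append(x)
            else:
                j = random.randrange(seen)
                if j < s:
                    reservoir[j] = x
        return reservoir

    # Sample s = 3 of n = 10 elements many times; by the proof above each
    # element should appear in roughly s/n = 0.3 of the samples.
    trials = 100_000
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(10), 3))
    for element in range(10):
        print(element, counts[element] / trials)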
References
McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society, Series C (Applied Statistics) 32.2 (1983): 182-184.
Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.
I is simply a structure from which we will draw the elements to be sampled. This could be a queue, linked list, array, vector, &c. However, since N is potentially very large, the algorithm looks at each item only once and then, for all practical purposes, discards it. Since the algorithm also never looks ahead beyond the current element, it is appropriate to call this structure a queue. – Excurvature
s can be any number of elements. However, once you begin sampling the stream, you cannot change the size of s and guarantee that all elements will appear in the output with equal probability. Therefore, choose the size of s such that it is the maximum number of random elements you will need, and then draw the elements you need randomly from those s. Does that help? – Excurvature

This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.
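A rough Python sketch of that idea, keeping the k largest random keys with a min-heap (the function name and the heap bookkeeping are illustrative, not a canonical implementation):

    import heapq
    import random

    def sample_by_random_keys(stream, k):
        # Min-heap of (key, element) pairs holding the k largest random
        # keys seen so far; heap[0] is always the smallest key kept.
        heap = []
        for x in stream:
            key = random.random()
            if len(heap) < k:
                heapq.heappush(heap, (key, x))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, x))  # evict the smallest key
        return [x for _, x in heap]

This costs O(N log k) versus Algorithm R's O(N), but because the keys are drawn independently it parallelizes naturally.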
I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.
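A rough sketch of this suggestion in Python, assuming (as the paragraph above does) that the length n is known so the k positions can be drawn up front; the names are illustrative:

    import random

    def sample_at_sorted_positions(stream, n, k):
        # Draw k distinct positions in [0, n) up front, sort them, then
        # collect the elements at those positions in a single traversal.
        targets = sorted(random.sample(range(n), k))
        picks, t = [], 0
        for i, x in enumerate(stream):
            if t == k:
                break
            if i == targets[t]:
                picks.append(x)
                t += 1
        return picks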
If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r, generate a random number j in [0, r), and if j is less than k, replace the jth item of the array. (Not entirely convinced that doesn't bias...)
Other than that: "If I were you, I wouldn't be starting from here." Are you sure a linked list is right for your problem? Is there not a better data structure, such as a good old flat array list?
If you don't know the length of the list, then you will have to traverse it completely to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep k elements that form your random selection to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e. the element you are at, at that moment).
It's easy to show that this gives a random selection. After seeing m elements (m > k), we have that each of the first m elements of the list is part of your random selection with probability k/m. That this initially holds is trivial. Then for each element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. For each of them, the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. the probability that the element was in the selection times the probability that it is still there). Since k/(m+1)*(1-1/k) = (k-1)/(m+1), the factor in parentheses simplifies to (k-1)/(m+1) + 1 - k/(m+1) = m/(m+1), and the whole expression becomes (k/m) * (m/(m+1)) = k/(m+1).
Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count the elements. The simplest algorithm is then to pick a random index in [0, N), remove that item, and repeat k times. Or, if it is permissible to return repeated numbers, don't remove the item. Unless you have a VERY large N and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.
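For illustration, here is one way the count-then-remove idea might look in Python on an actual linked list (the Node class is just scaffolding, and the sketch assumes k <= N):

    import random

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def sample_by_removal(head, k):
        # Pass 1: traverse the whole list to learn N.
        n, node = 0, head
        while node is not None:
            n += 1
            node = node.next
        # Then, k times: walk to a uniformly random node, record its value,
        # and unlink it so it cannot be picked again. This mutates the list.
        dummy = Node(None, head)
        picks = []
        for _ in range(k):
            prev = dummy
            for _ in range(random.randrange(n)):
                prev = prev.next
            picks.append(prev.next.value)
            prev.next = prev.next.next
            n -= 1
        return picks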
Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.
Why can't you just do something like this?

    // Note: this samples with replacement (the same element can repeat).
    List<int> GetKRandomFromList(List<int> input, int k)
    {
        var rng = new Random();
        var ret = new List<int>();
        for (int i = 0; i < k; i++)
            ret.Add(input[rng.Next(input.Count)]);
        return ret;
    }
I'm sure that you don't mean something that simple, so can you specify further?