Efficiently selecting a set of random elements from a linked list
S

6

40

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

How can I most efficiently write a function that will return k completely random numbers from the list?

S answered 10/9, 2008 at 13:40 Comment(0)
E
42

There's a very nice and efficient algorithm for this using a method called reservoir sampling.

Let me start by giving you its history:

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

Vitter 1985 (2) reviews Algorithm R and then presents three additional algorithms which produce the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithms predetermine the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming number.

In pseudocode the algorithm is:

Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  x=I.pop()
  elements_seen+=1
  j=random(1,elements_seen)       > Uniform integer, inclusive of both endpoints
  if j<=s:
    R[j]=x                        > x replaces a current reservoir element
  > Otherwise x is discarded

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
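The pseudocode above translates almost directly into Python; here is a minimal sketch (the function name reservoir_sample is mine, and I use 0-based indices rather than the 1-based indices of the pseudocode):

```python
import random

def reservoir_sample(iterable, s, rng=random):
    """Draw s elements uniformly at random from iterable (Algorithm R)."""
    it = iter(iterable)
    reservoir = []
    # Fill the reservoir with the first s elements.
    for _ in range(s):
        reservoir.append(next(it))  # raises StopIteration if the input has fewer than s elements
    elements_seen = s
    for x in it:
        elements_seen += 1
        j = rng.randrange(elements_seen)  # uniform integer in [0, elements_seen)
        if j < s:
            reservoir[j] = x  # x replaces a current reservoir element
        # otherwise x is discarded
    return reservoir
```

Because the input is consumed through a plain iterator, this works equally well on a linked list, a file, or any other stream whose length is unknown in advance.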

Why does this work?

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be difficult to reconstruct here. Therefore, I've written an alternative proof which is easier to explain.

We proceed via proof by induction.

Say we want to generate a set of s elements and that we have already seen n>s elements.

Let's assume that our current s elements have already each been chosen with probability s/n.

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

If the new element is chosen, each element already in our result set has a probability 1/s of being the one replaced.

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).

The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n=s-seen result set includes each element with probability s/n (= 1), giving us the necessary base case for the induction.
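The induction above can also be sanity-checked empirically: run the algorithm many times and confirm that every input element lands in the sample roughly s/n of the time. A small simulation (the helper name reservoir is mine):

```python
import random
from collections import Counter

def reservoir(stream, s, rng):
    """Algorithm R, 0-indexed: after i+1 elements, each survives with probability s/(i+1)."""
    res = []
    for i, x in enumerate(stream):
        if i < s:
            res.append(x)
        else:
            j = rng.randrange(i + 1)  # uniform integer in [0, i]
            if j < s:
                res[j] = x
    return res

rng = random.Random(0)
counts = Counter()
trials = 2000
for _ in range(trials):
    counts.update(reservoir(range(10), 3, rng))
# Each of the 10 inputs should appear in roughly trials * 3/10 = 600 of the samples.
```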

References

  1. McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184. (Link)

  2. Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57. (Link)

Excurvature answered 18/1, 2014 at 7:31 Comment(5)
What does I, Input queue mean here?Habakkuk
@Anoop, I is simply a structure from which we will draw the elements to be sampled. This could be a queue, linked list, array, vector, &c. However, since N is potentially very large, the algorithm looks at each item only once and then, for all practical purposes, discards it. Since the algorithm also never looks ahead beyond the current element, it is appropriate to call this structure a queue.Excurvature
What an awesome answer! What if I want just one random element at a time? Will it be OK to set s=1 and still have the same randomness?Graffito
@tamara, s can be any number of elements. However, once you begin sampling the stream, you cannot change the size of s and guarantee that all elements will appear in the output with equal probability. Therefore, choose the size of s such that it is the maximum number of random elements you will need and then draw the elements you need randomly from s. Does that help?Excurvature
my thoughts went into that direction as well, trying to implement it at the moment. thank you!Graffito
L
33

This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.
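This random-key approach can be done in a single pass with a bounded heap; a minimal Python sketch (the function name sample_by_random_keys is mine):

```python
import heapq
import random

def sample_by_random_keys(stream, k, rng=random):
    """Tag each element with a random key and keep the k elements with the
    largest keys. heapq.nlargest holds only k pairs at a time, so this is
    one pass over the stream in O(N log k) time and O(k) space."""
    tagged = ((rng.random(), x) for x in stream)
    return [x for _, x in heapq.nlargest(k, tagged)]
```

With float keys, ties are vanishingly unlikely, which is why they have negligible impact on the randomness of the sample (as discussed in the comments below).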

Loci answered 10/9, 2008 at 13:54 Comment(5)
Any idea why this is called reservoir sampling?Dupondius
@Lazer: I'm not sure if Vitter (PDF) coined the term originally, but he phrases the problem as "selecting a random sample of n records without replacement from a pool of N records..." (emphasis mine). He gives a definition in section 2 that starts out 'The first step of any reservoir algorithm is to put the first n records of the file into a “reservoir.”...'Loci
That's elegant. However, I have a doubt: what about duplicates among the random numbers assigned to our stream? Let's say the kth and (k+1)th elements, in sorted order, are assigned the same number; then picking only the top k will "depend" on the way we break ties, right? Could you please clarify?Lutyens
@phani Yes, if you're using random integers in a range, that might be an issue. If you just use a random double, duplicates should have negligible impact on the randomness of your sample.Loci
@BilltheLizard True for arbitrary-precision real numbers, but doubles use a few bits for the exponent, so using plain 64-bit integers might be better.Stinker
T
2

I would suggest: first find your k random numbers (that is, k random indices into the list). Sort them. Then traverse the linked list and your sorted indices together in a single pass.
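Assuming the list length n is known, the sorted-indices idea can be sketched as follows (the Node class and function name are mine, for illustration):

```python
import random

class Node:
    """Minimal singly linked list node (names are illustrative)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def k_random_from_list(head, n, k, rng=random):
    """Pick k distinct random positions in [0, n), sort them, then walk the
    list once, collecting the value at each chosen position."""
    positions = sorted(rng.sample(range(n), k))
    out, idx, node = [], 0, head
    for pos in positions:
        while idx < pos:           # advance to the next chosen position
            node, idx = node.next, idx + 1
        out.append(node.value)
    return out
```

This is O(n + k log k) overall: one pass over the list plus the sort of the k indices.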

If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r (counting from one, with r > k), generate a random integer j in [0, r), and if j is less than k, replace the jth item of the array with node r. (Not entirely convinced that doesn't bias...)

Other than that: "If I were you, I wouldn't be starting from here." Are you sure a linked list is right for your problem? Is there not a better data structure, such as a good old flat array list?

Threecornered answered 10/9, 2008 at 13:45 Comment(2)
Hi Tom - Sorry, I don't quite understand - what do you mean by "First find your k random numbers"? The objective is to find the k random numbers taken from the list - or am I misunderstanding something?S
@TomHawtin, there is no bias: see my answer.Excurvature
E
2

If you don't know the length of the list, then you will have to traverse it completely to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list, you keep k elements that form your random selection up to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element of your selection with the ith element of the list (i.e. the element you are at, at that moment).

It's easy to show that this gives a random selection. After seeing m elements (m > k), we have that each of the first m elements of the list is part of your random selection with probability k/m. That this initially holds is trivial. Then for element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. We have that the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. the probability that the element was in the selection, times the probability that it is still there). With a little algebra you can straightforwardly show that this is equal to k/(m+1).
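The algebraic identity at the end can be verified exactly with rational arithmetic; a tiny check (the function name survival_prob is mine):

```python
from fractions import Fraction as F

def survival_prob(k, m):
    """P(a given earlier element is still in the selection after element m+1
    is processed): it was there (k/m), times either the new element was taken
    and ours survived the 1-in-k eviction, or the new element was skipped."""
    k, m = F(k), F(m)
    return (k / m) * ((k / (m + 1)) * (1 - 1 / k) + (1 - k / (m + 1)))
```

Exact fractions avoid any floating-point doubt: the expression reduces to k/(m+1) for every pair tried.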

Eurypterid answered 10/9, 2008 at 14:22 Comment(1)
This should be the selected answerTellurize
W
-3

Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count the elements. The simplest algorithm is to pick a uniform random index in [0, N), remove that item, and repeat k times. Or, if it is permissible to return repeated numbers, don't remove the item.

This algorithm runs with O(N*k) complexity, which should be acceptable unless you have a VERY large N or very stringent performance requirements.
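The pick-and-remove idea looks like this (a sketch only; a Python list stands in for the linked list, where each removal at a random index would itself cost O(N)):

```python
import random

def naive_sample(items, k, rng=random):
    """Pick a uniformly random element, remove it, repeat k times.
    On a linked list each pop at a random index is an O(N) walk,
    giving O(N*k) overall."""
    pool = list(items)
    out = []
    for _ in range(k):
        out.append(pool.pop(rng.randrange(len(pool))))
    return out
```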

Edit: Never mind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.

Winterize answered 10/9, 2008 at 13:45 Comment(0)
D
-3

Why can't you just do something like

List<int> GetKRandomFromList(List<int> input, int k)
{
    var rng = new Random();
    var ret = new List<int>();
    for (int i = 0; i < k; i++)
        ret.Add(input[rng.Next(input.Count)]);  // may repeat elements
    return ret;
}

I'm sure that you don't mean something that simple so can you specify further?

Digitigrade answered 10/9, 2008 at 13:49 Comment(1)
The reason that's inefficient is that getting the value of input[x] can be very expensive for a linked list, since you have to traverse all the items up to x to get there - and your solution does that k times. As the other answers here point out, you can do it with a single pass through the list.S
