Select n records at random from a set of N

I need to select n records at random from a set of N (where 0 < n < N).

A possible algorithm is:

Iterate through the list and for each element, make the probability of selection = (number needed) / (number left)

So if you needed 5 items out of 40, the first would have a 5/40 chance of being selected.

If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.
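
For concreteness, here is a rough C++ sketch of what I mean (the function name and the integer-based draw are just my illustration):

#include <random>
#include <vector>

// Walk the list once and keep each element with probability
// (items still needed) / (items left to examine).
std::vector<int> pick_n(const std::vector<int>& records, std::size_t n, std::mt19937& gen)
{
  std::vector<int> sample;
  std::size_t needed = n;               // items still to be selected
  std::size_t left = records.size();    // items not yet examined

  for (int record : records)
  {
    // A uniform draw in [0, left - 1] falls below needed with probability needed / left.
    std::uniform_int_distribution<std::size_t> dist(0, left - 1);
    if (dist(gen) < needed) { sample.push_back(record); --needed; }
    --left;
    if (needed == 0) break;             // often done before the end of the list
  }
  return sample;
}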

Assuming a good pseudo-random number generator, is this algorithm correct?


NOTE

There are many questions of this kind on Stack Overflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).

The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).

Can I have a final word about the matter?

Mundt answered 28/1, 2016 at 15:40 Comment(7)
While wiki-type questions and answering your own are perfectly fine here, it should be something not asked before. Still thinking about whether I should close or not. On one hand the answer is high quality. On the other hand, that should be an answer on the dupe.Mulcahy
Or this dupeMulcahy
Decided not to close it single handedly because I am uncertain. If it was not a close hammer, I would have voted to close though. Let the community decide if they think it is a dupe or not.Mulcahy
@Mulcahy Sorry, my problem wasn't "how to sample" (as in the other questions) but "is this algorithm right?". The algorithm is often proposed without references / details and the answers always generate a lot of questions about correctness, bias... Perhaps other users (like me) get confused about that. I tried to clarify some points.Mundt
@manilio The top answer in the dupe thread explains that exactly, with the reference to Knuth as well.Mulcahy
@Mulcahy ...that's a similar but different algorithm (algorithm S vs algorithm R) described in the same paragraph of the book.Mundt
Let us continue this discussion in chat.Mundt

The algorithm is absolutely correct.

It's not the sudden invention of a clever poster; it's a well-known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), described in detail in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.

As Knuth says:

This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.

The algorithm samples n elements from a set of size N; the (t + 1)st element is chosen with probability (n - m) / (N - t), where m is the number of elements already chosen.

It's easy to see that we never run off the end of the set before choosing n items (the probability becomes 1 as soon as we still need k elements and only k elements remain).

Also, we never pick too many elements (the probability drops to 0 as soon as m == n).
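
In symbols (just restating the rule and the two boundary cases above):

P(\text{select element } t+1 \mid m \text{ already chosen}) = \frac{n - m}{N - t},
\quad \text{which equals } 1 \text{ when } N - t = n - m
\quad \text{and } 0 \text{ when } m = n.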

It's a bit harder to demonstrate that the sample is completely unbiased, but it's

... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature

(so not just on Stack Overflow!)

The fact is we should not confuse conditional and unconditional probabilities:

For example, consider the second element: if the first element was selected in the sample (this happens with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).

The overall probability of selecting the second element is (n / N) · ((n - 1) / (N - 1)) + (1 - n / N) · (n / (N - 1)) = n / N.

TAOCP - Vol 2 - Section 3.4.2 exercise 3
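
Plugging in the numbers from the question (N = 40, n = 5) as a quick sanity check:

\frac{5}{40} \cdot \frac{4}{39} + \left(1 - \frac{5}{40}\right) \cdot \frac{5}{39}
= \frac{20 + 175}{1560} = \frac{195}{1560} = \frac{5}{40} = \frac{n}{N}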

Apart from theoretical considerations, Algorithm S (and Algorithm R / reservoir sampling) is used in many well-known libraries (e.g. SGI's original STL implementation, std::experimental::sample, random.sample in Python...).

Of course, Algorithm S is not always the best answer:

  • it's O(N) (even though we usually won't have to pass over all N records: the average number of records examined when n = 2 is about 2/3 N; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex. 5/6);
  • it cannot be used when the value of N isn't known in advance.

Anyway it works!


  1. C. T. Fan, M. E. Muller and I. Rezucha, J. Amer. Stat. Assoc. 57 (1962), pp. 387-402
  2. T. G. Jones, CACM 5 (1962), p. 343

EDIT

how do you randomly select this item, with a probability of 7/22

[CUT]

In rare cases, you might even pick 4 or 6 elements when you wanted 5

This is from N3925 (with small modifications: the common interface / tag dispatch removed, std:: qualifiers and a few comments added):

#include <algorithm>   // std::min
#include <iterator>    // std::distance
#include <random>      // std::uniform_int_distribution

template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
  using dist_t = std::uniform_int_distribution<Size>;
  using param_t = typename dist_t::param_type;

  dist_t d{};

  // Number of elements not yet examined.
  Size unsampled_sz = std::distance(first, last);

  // n counts the elements still needed; stop as soon as it reaches zero.
  for (n = std::min(n, unsampled_sz); n != 0; ++first)
  {
    // Uniform draw over the elements left to examine (the current one included).
    param_t const p{0, --unsampled_sz};

    // Keep *first with probability n / (unsampled_sz + 1), i.e. needed / left.
    if (d(g, p) < n) { *out++ = *first; --n; }
  }

  return out;
}

There are no floats involved.

  • If you need 5 elements, you get exactly 5 elements;
  • if uniform_int_distribution "works as advertised", there is no bias.
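
As a sketch of how this might be called (my own example, not part of N3925; the call is qualified with :: so it doesn't collide with C++17's std::sample through ADL):

#include <iostream>
#include <iterator>
#include <random>
#include <vector>

int main()
{
  std::vector<int> population(40);
  for (int i = 0; i < 40; ++i) population[i] = i;   // N = 40 records

  std::vector<int> chosen;
  std::mt19937 gen{std::random_device{}()};

  // Select n = 5 records; the sample preserves the original order.
  ::sample(population.begin(), population.end(), std::back_inserter(chosen), 5, gen);

  for (int x : chosen) std::cout << x << ' ';
  std::cout << '\n';
}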
Mundt answered 28/1, 2016 at 15:40 Comment(6)
Downvoted because this technique is correct in number theory, but this is not math.stackexchange.com, this is Stack Overflow, and the answer needs to be correct on computers, where you can't represent the probabilities with the infinite precision this answer requires. See my answer about fairness and infinite precision.Shipmate
Yes, it can be used when the value of N (i.e. number of items in the list) is not known. The description on the linked Wikipedia page shows exactly how it's done. The first item is kept with probability 1/1. The second is kept with probability 1/2. The third with 1/3, etc. There is never a requirement to know the value of N.Terse
@JimMischel N must be known in advance for algorithm S; algorithm R is a good alternative when you don't know N (but you always have to read the entire input set).Mundt
Your sample code is very unclear to me. I don't know what a PopIter is, or SampleIter, etc. It's not clear how the sample() method or function returns true/false with a probability of 7/22.Shipmate
@EdwardNedHarvey The function needs forward iterators (PopIter) to access the input set, but only an output iterator (SampleIter) to the resulting sample. This is a C++ technicality and it's not of general interest, but I didn't want to mess around (too much) with the original source code. The key point is already highlighted in your other comment: d(g, p) is similar to UInt32 RandomRange(). Anyway I've also seen many answers affected by the issue you describe (e.g. https://mcmap.net/q/22139/-get-random-sample-from-list-while-maintaining-ordering-of-items).Mundt

Although the algorithm described is technically correct, it depends on having an algorithm that returns a bool with an arbitrary probability determined by the ratio of two ints. For example, how do you select an item with a probability of 7/22? For the sake of discussion, let's call it the bool RandomSelect(int x, int y) method, or just the RS(x,y) method, designed to return true with probability x/y.

If you're not very concerned about accuracy, the oft-given answer is to use return Random.NextDouble() < (double)x/(double)y;, which is inaccurate because Random.NextDouble() is imprecise and not perfectly uniform, and the division (double)x/(double)y is also imprecise. The choice of < or <= should be irrelevant (but it's not), because in theory it's impossible to randomly pick the infinite-precision random number exactly equal to the specified probability.

While I'm sure an algorithm can be created or found to implement the RS(x,y) method precisely, which would then allow you to implement the described algorithm correctly, I think that simply answering "yes, the algorithm is correct" would be misleading - as it has misled so many people before, into calculating and choosing elements using double, unaware of the bias they're introducing.

Don't misunderstand me - I'm not saying everyone should avoid using the described algorithm - I'm only saying that unless you find a more precise way to implement the RS(x,y) algorithm, your selections will be subtly biased in favor of some elements over others.

If you care about fairness (equal probability of all possible outcomes) I think it is better, and easier to understand, to use a different algorithm instead, as I've described below:

If you take it as given that the only source of randomness you have available is random bits, you have to define a technique of random selection that assures equal probability given binary random data. This means that if you want to pick a random number in a range that happens to be a power of 2, you just pick random bits and return them. But if you want a random number in a range that's not a power of 2, you have to get more random bits and discard any outcome that could not map to a fair result (throw away the random number and try again). I blogged about it with pictorial representations and C# example code here: https://nedharvey.com/blog/?p=284. Repeat the random selection from your collection until you have n unique items.
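
A rough C++ sketch of that rejection idea (my own illustration, not the code from the blog post, using std::mt19937 only as a stand-in source of 32 random bits):

#include <cstdint>
#include <random>

// Draw an unbiased integer in [0, max] from uniform 32-bit words,
// discarding draws that would skew the distribution.
std::uint32_t random_range(std::uint32_t max, std::mt19937& bits)
{
  std::uint64_t const range = std::uint64_t{max} + 1;                    // outcomes wanted
  std::uint64_t const limit = (std::uint64_t{1} << 32) / range * range;  // largest fair cutoff

  std::uint32_t r;
  do {
    r = bits();              // 32 fresh random bits
  } while (r >= limit);      // reject draws that cannot map fairly

  return static_cast<std::uint32_t>(r % range);
}

// RS(x, y): true with probability exactly x / y (requires 0 <= x <= y and y > 0).
bool random_select(std::uint32_t x, std::uint32_t y, std::mt19937& bits)
{
  return random_range(y - 1, bits) < x;
}

With an exact RS(x, y) like this, the needed/left selection can be driven by integer ratios instead of doubles.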

Shipmate answered 29/1, 2016 at 12:42 Comment(10)
It seems to me that the problem you describe is an important implementation detail, but it doesn't change the fact that the algorithm is correct. A direct transposition of the algorithm could be entirely integer-based (e.g., in C++, using std::uniform_int_distribution<unsigned> for the random number distribution, unsigned for n and N) and so it'd avoid the inherent lack of precision of floats.Mundt
@Mundt std::uniform_int_distribution<unsigned> produces a random int in a range, same as UInt32 RandomRange() in my blog post. It's not an implementation of RS(x,y). Your comment made me aware that my answer didn't directly answer your question, so I revised my answer.Shipmate
Given your UInt32 RandomRange(UInt32 max) function, how about bool RandomSelect(int x, int y) { return RandomRange(y-1) < x; }? If RandomRange provides equal distribution, this shouldn't have the same problem as the double implementation.Ramtil
I find the "FP randoms are not perfect" part of this answer, while technically correct, quite misleading and wrong from a practical perspective, since following its advice to avoid any kind of approximation would lead to avoiding floating-point altogether. Whenever floating-point is used, we deviate from the perfect world of pure math and enter the imperfect world of actually applying it (dare we call it engineering?). Actual computers are engineering, and while we should be aware of floating-point pitfalls, FP is a very, very useful tool for many real-world tasks.Skirling
The linked blog post provides an integer-only approach that is strictly worse than those found in existing FP random generators in standard libraries from major languages (say, C++'s STL or Java). It is a slow, low-entropy option. It may be fair within that range, but 32 bits of entropy is much less than what is found in most libraries; and libraries are tested and coded for both speed and quality.Skirling
@Skirling You and I are approaching this from different perspectives. Using the FP approximation is great for fast, statistically random situations, but not good enough for cryptographic situations. For crypto, the exact probability of each possible outcome must be strictly equal. Also your assessment of slow low entropy is simply incorrect. The "slow" part is relying on the crypto random library to generate a crypto random number. Then, doing a mod and div operation is really damn fast, to cryptographically select a random element from a collection.Shipmate
The OP was not asking for anything crypto-related, and the blog post was certainly not suitable for practical cryptography, where the default PRNGs simply do not have a large-enough internal entropy. Importantly, other than for education, only researchers should be "rolling their own": there's way too many pitfalls to trust any "homebrew crypto".Skirling
@Skirling As an actual crypto expert, I know not to roll your own crypto. That's why I use a standard library such as RNGCryptoServiceProvider. I did not roll my own crypto; I used a standard library, with div and mod to select an element from a range. Your statement shows you don't know anything about crypto random or entropy, because calling RNGCryptoServiceProvider a "default prng" with not enough entropy is the exact opposite of conventional wisdom. Also I just found this div and mod method is exactly what python implements internally in random.py in the _randbelow fn so you're off base.Shipmate
If the "let us continue this in chat" option appears I would be willing to justify my downvote (justified as "does not answer OP" and "not a very good idea anyway"). Otherwise, I do not see any point in going back and forth.Skirling
For many purposes, the random float will be good enough, but it doesn't yield exactly the same probability of each option being selected. If you care about each possible outcome being equally probable, you can't use the random float. This technique of random int, div and mod, is the solution instead. I work in crypto, where it matters, hence I used a crypto rng and I care about this, but surely it must matter to some other people in other fields too.Shipmate
