Algorithm for sampling without replacement?
Asked Answered
W

7

16

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.

Logic or pseudocode preferred.

Womble answered 22/11, 2008 at 19:56 Comment(0)
J
6

See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.

Jowl answered 22/11, 2008 at 20:0 Comment(1)
Excellent! Sorry I did not see that answer when I searched SO (for sampling without replacement, statistics, algorithms, etc). Maybe this will serve like a meta-question to lead people like me to your original answer. Cheers!Womble
K
41

Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumeric Algorithms.

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

Kotta answered 22/11, 2008 at 20:8 Comment(5)
I don't have this book (yet), and had trouble proving the correctness of the algorithm to myself. I implemented it in java and checked the population items are sampled with uniform probability. Results are convincing. See this gistAnnabel
An uncritical implementation of Vitter's Method D in Mathematica is orders of magnitude faster than the built-in algorithm. I describe it over here: tinyurl.com/lbldlpqParsifal
@Annabel - We can view the problem of sampling n elements from a population of N by considering the first element. There is a (n/N) probability that this element is included: if it is, then the problem reduces to sampling (n-1) elements out of (N-1) remaining; if not, then the problem reduces to sampling (n) elements out of (N-1) remaining. Some variable transformation will show that this is the essence of Knuth's algorithm (by incrementing t).Collagen
Does it matter if u is in the open, half-open or closed interval, (0, 1), [0, 1) or [0, 1]? Knuth just says "uniformly distributed between zero and one."Demodulator
@Demodulator mathematically those intervals are the same almost surely. In application the probability of a random double between 0 and 1 being exactly 0 is something like 2^-53.Pancratium
F
8

A C++ working code based on the answer by John D. Cook.

#include <random>
#include <vector>

// John D. Cook, https://mcmap.net/q/129750/-algorithm-for-sampling-without-replacement
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far

    std::default_random_engine re;
    std::uniform_real_distribution<double> dist(0,1);

    while (m < n)
    {
        double u = dist(re); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}
Fields answered 7/8, 2014 at 10:15 Comment(1)
I edited your answer so it wouldn't be absurdly slow due to thread guards in GCC and other common compilers. Per my comment on John's answer, I don't know if the interval should be open, half-open or closed, however. This is currently half-open.Demodulator
J
6

See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.

Jowl answered 22/11, 2008 at 20:0 Comment(1)
Excellent! Sorry I did not see that answer when I searched SO (for sampling without replacement, statistics, algorithms, etc). Maybe this will serve like a meta-question to lead people like me to your original answer. Cheers!Womble
J
2

Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented extensively also including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i
Jube answered 13/5, 2015 at 11:51 Comment(0)
B
2

When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.

When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is poiting to; must be filled with zeros
  rightv=integer(s) #right edge is pointig to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #currend ordinal number
    v=1L #current vertice
    from=1L #how have come here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

The complexity of this algorithm is discussed in: Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

If you find the algorithm useful, please make a reference.

See also: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. DOI: 10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127–140. DOI: 10.1080/00207168208803304

In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

Bortz answered 18/10, 2017 at 9:38 Comment(0)
E
2

I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.

void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
Enclose answered 5/6, 2021 at 14:10 Comment(4)
How does it compare to std::sample and ranges::sample?Virtuoso
This would depend on how your particular C++ stdlib implements it. In both cases, the docs say "This function may implement selection sampling or reservoir sampling" so perhaps it will perform similarly to my implementation of one of these algorithms, but you'd have to test for yourself to be sure.Enclose
cppreference says the complexity of std::sample is linear in the input range, so O(n) whereas this algorithm is O(k log k).Assegai
Yes in theory the O(k) algorithms like Floyd's F2 should be faster for sufficiently large k, but in practice they never seem to be. They have big constant factors and in practice memory hierarchies make everything O(k log k).Enclose
A
0

Another algorithm for sampling without replacement is described here.

It is similar to the one described by John D. Cook in his answer and also from Knuth, but it has different hypothesis: The population size is unknown, but the sample can fit in memory. This one is called "Knuth's algorithm S".

Quoting the rosettacode article:

  1. Select the first n items as the sample as they become available;
  2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
  3. Repeat #2 for any subsequent items.
Annabel answered 16/4, 2014 at 12:42 Comment(1)
Rosettacode has the wrong name for the algorithm: it should be "Algorithm R" or "Reservoir Sampling". "Algorithm S" (aka "Selection Sampling Technique") requires to know in advance the population size. Both algorithms are described in TAOCP - Vol 2 - §3.4.2Burkitt

© 2022 - 2025 — McMap. All rights reserved.