Reservoir sampling
Asked Answered
P

4

40

To retrieve k random numbers from an array of undetermined size we use a technique called reservoir sampling. Can anybody briefly highlight how it happens with a sample code?

Paraguay answered 10/4, 2010 at 7:45 Comment(3)
This page contains a good explanation with pseudo-code. (The Wikipedia page I originally linked to is unclear, and the pseudo-code is incomplete.)Harmonize
I wrote a blog entry about the exact thing a couple of months back, which has a C# implementation: gregbeech.com/blog/sampling-very-large-sequences The best description of how it works that I've found is here: gregable.com/2007/10/reservoir-sampling.htmlFretwell
Related question #54559Charteris
I
45

I actually did not realize there was a name for this, so I proved and implemented this from scratch:

def random_subset(iterator, K):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len(result) < K:
            result.append(item)
        else:
            s = int(random.random() * N)
            if s < K:
                result[s] = item

    return result

From: http://web.archive.org/web/20141026071430/http://propersubset.com:80/2010/04/choosing-random-elements.html

With proof near the end.

Indigestible answered 10/4, 2010 at 8:59 Comment(7)
@Larry: Where is the accept it with probability s/k part in your code? [ quote from algorithm mentioned at blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx ]Transcalent
By coincidence, it seems like between that article and mine, we use the same variables, but for different things. My "K" appears to be their "S", and my "N" (in code) appears to be their "K". In other words, I accept things with K/N probability, where N is the current count of things.Indigestible
what I meant to ask was how you were implementing probability in your code. Anyways, now I understand. Thanks!Transcalent
Not quite understanding your question, but I can explain the code a little more if you are specific! =)Indigestible
Thanks. This is super easy to understand and get started with basics.Sampan
Is there a way to do this to create multiple reservoirs of varied sizes from an unbounded stream? That is to be able to iterate over a stream once and sample them into one of multiple reservoirs? Assume each reservoir could have a different size or equal sizes.Lisa
With this implementation, don't the first n elements have a higher probability to be picked because there is less of a chance of them being overwritten later on?Crain
G
23

Following Knuth's (1981) description more closely, Reservoir Sampling (Algorithm R) could be implemented as follows:

import random

def sample(iterable, n):
    """
    Returns @param n random items from @param iterable.
    """
    reservoir = []
    for t, item in enumerate(iterable):
        if t < n:
            reservoir.append(item)
        else:
            m = random.randint(0, t)
            if m < n:
                reservoir[m] = item
    return reservoir
Georgetown answered 1/3, 2017 at 13:23 Comment(5)
What's the difference between this and the accepted answer? I think this is exactly the same thing, even if this code is more elegant.Monumentalize
It can be directly related to published research (Knuth, 1981), in case someone is interested in more extended explanation or Knuth's proof.Georgetown
Where 0 <= random.randint(0,t) <= t per random.randint.Mammalian
@Georgetown Isn't randint inclusive?Ebberta
@Ebberta Correct, and it should be. Let iterable contain 11 items from which we want to sample n=10. For the 11th item, t will be 10 (because enumerate starts at 0), and we generate a random integer between 0 and 10 inclusive (i.e., 11 possibilities) such that the probability of adding the 11th item to reservoir is n/t = 10/11.Georgetown
R
2

Java

import java.util.Random;

public static void reservoir(String filename,String[] list)
{
    File f = new File(filename);
    BufferedReader b = new BufferedReader(new FileReader(f));

    String l;
    int c = 0, r;
    Random g = new Random();

    while((l = b.readLine()) != null)
    {
      if (c < list.length)
          r = c++;
      else
          r = g.nextInt(++c);

      if (r < list.length)
          list[r] = l;

      b.close();}
}
Raynard answered 1/2, 2012 at 7:42 Comment(0)
L
-1

Python solution

import random

class RESERVOIR_SAMPLING():
    def __init__(self, k=1000):
        self.reservoir = [] 
        self.k = k
        self.nb_processed = 0

    def add_to_reservoir(self, sample):
        self.nb_processed +=1
        if(self.k >= self.nb_processed):
            self.reservoir.append(sample)
        else:
            #randint(a,b) gives a<=int<=b
            j = random.randint(0,self.nb_processed-1)
            if(j < k):
                self.reservoir[j] = sample

k = 10
samples = [i for i in range(10)] * k
res = RESERVOIR_SAMPLING(k)
for sample in samples:
    res.add_to_reservoir(sample)

print(res.reservoir)

out[1]: [9, 8, 4, 8, 3, 5, 1, 7, 0, 9]
Laxative answered 21/1, 2020 at 23:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.