To retrieve k random numbers from an array of undetermined size we use a technique called reservoir sampling. Can anybody briefly highlight how it happens with a sample code?
Reservoir sampling
Asked Answered
This page contains a good explanation with pseudo-code. (The Wikipedia page I originally linked to is unclear, and the pseudo-code is incomplete.) –
Harmonize
I wrote a blog entry about the exact thing a couple of months back, which has a C# implementation: gregbeech.com/blog/sampling-very-large-sequences The best description of how it works that I've found is here: gregable.com/2007/10/reservoir-sampling.html –
Fretwell
Related question #54559 –
Charteris
I actually did not realize there was a name for this, so I proved and implemented this from scratch:
def random_subset(iterator, K):
result = []
N = 0
for item in iterator:
N += 1
if len(result) < K:
result.append(item)
else:
s = int(random.random() * N)
if s < K:
result[s] = item
return result
With proof near the end.
@Larry: Where is the
accept it with probability s/k
part in your code? [ quote from algorithm mentioned at blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx ] –
Transcalent By coincidence, it seems like between that article and mine, we use the same variables, but for different things. My "K" appears to be their "S", and my "N" (in code) appears to be their "K". In other words, I accept things with
K/N
probability, where N is the current count of things. –
Indigestible what I meant to ask was how you were implementing probability in your code. Anyways, now I understand. Thanks! –
Transcalent
Not quite understanding your question, but I can explain the code a little more if you are specific! =) –
Indigestible
Thanks. This is super easy to understand and get started with basics. –
Sampan
Is there a way to do this to create multiple reservoirs of varied sizes from an unbounded stream? That is to be able to iterate over a stream once and sample them into one of multiple reservoirs? Assume each reservoir could have a different size or equal sizes. –
Lisa
With this implementation, don't the first
n
elements have a higher probability to be picked because there is less of a chance of them being overwritten later on? –
Crain Following Knuth's (1981) description more closely, Reservoir Sampling (Algorithm R) could be implemented as follows:
import random
def sample(iterable, n):
"""
Returns @param n random items from @param iterable.
"""
reservoir = []
for t, item in enumerate(iterable):
if t < n:
reservoir.append(item)
else:
m = random.randint(0, t)
if m < n:
reservoir[m] = item
return reservoir
What's the difference between this and the accepted answer? I think this is exactly the same thing, even if this code is more elegant. –
Monumentalize
It can be directly related to published research (Knuth, 1981), in case someone is interested in more extended explanation or Knuth's proof. –
Georgetown
Where
0 <= random.randint(0,t) <= t
per random.randint. –
Mammalian @Georgetown Isn't
randint
inclusive? –
Ebberta @Ebberta Correct, and it should be. Let
iterable
contain 11 items from which we want to sample n
=10. For the 11th item, t
will be 10 (because enumerate
starts at 0), and we generate a random integer between 0 and 10 inclusive (i.e., 11 possibilities) such that the probability of adding the 11th item to reservoir
is n/t = 10/11. –
Georgetown Java
import java.util.Random;
public static void reservoir(String filename,String[] list)
{
File f = new File(filename);
BufferedReader b = new BufferedReader(new FileReader(f));
String l;
int c = 0, r;
Random g = new Random();
while((l = b.readLine()) != null)
{
if (c < list.length)
r = c++;
else
r = g.nextInt(++c);
if (r < list.length)
list[r] = l;
b.close();}
}
Python solution
import random
class RESERVOIR_SAMPLING():
def __init__(self, k=1000):
self.reservoir = []
self.k = k
self.nb_processed = 0
def add_to_reservoir(self, sample):
self.nb_processed +=1
if(self.k >= self.nb_processed):
self.reservoir.append(sample)
else:
#randint(a,b) gives a<=int<=b
j = random.randint(0,self.nb_processed-1)
if(j < k):
self.reservoir[j] = sample
k = 10
samples = [i for i in range(10)] * k
res = RESERVOIR_SAMPLING(k)
for sample in samples:
res.add_to_reservoir(sample)
print(res.reservoir)
out[1]: [9, 8, 4, 8, 3, 5, 1, 7, 0, 9]
© 2022 - 2024 — McMap. All rights reserved.