re-using random number in reservoir sampling
Asked Answered
W

3

8

It was asked in relation to another question recently: Given an unknown length list, return a random item in it by scanning it only 1 time

I know you shouldn't, I just can't put my finger on a canonical explanation of why not.

Look at the example code:

import random, sys

def rnd(): # a function that returns a random number each call
    return int(random.getrandbits(32))

class fixed: # a functor that returns the same random number each call
    def __init__(self):
        self._ret = rnd()
    def __call__(self):
        return self._ret

def sample(rnd,seq_size):
    choice = 0
    for item in xrange(1,seq_size):
        if (rnd() % (item+1)) == 0:
            choice = item
    return choice

dist = [0 for i in xrange(500)]
for i in xrange(1000):
    dist[sample(rnd,len(dist))] += 1
print "real",dist
print

dist = [0 for i in xrange(500)]
for i in xrange(1000):
    dist[sample(fixed(),len(dist))] += 1
print "reuse",dist

The choices for the proper reservoir sampling that generates a new random number per item is nicely evenly distributed as it should be:

real [1, 3, 0, 1, 2, 3, 2, 3, 1, 2, 2, 2, 2, 0, 0, 1, 3, 3, 4, 0, 2, 1, 2, 1, 1, 4, 0, 3, 1, 1, 2, 0, 0, 0, 1, 4, 6, 2, 3, 1, 1, 3, 2, 1, 3, 3, 1, 4, 1, 1, 2, 2, 5, 1, 2, 1, 0, 3, 1, 0, 2, 6, 1, 2, 2, 1, 1, 1, 1, 3, 2, 1, 5, 4, 0, 3, 3, 4, 0, 0, 2, 1, 3, 2, 3, 0, 2, 4, 6, 3, 0, 1, 3, 0, 2, 2, 4, 3, 2, 1, 2, 1, 2, 2, 1, 4, 2, 0, 0, 1, 1, 0, 1, 4, 2, 2, 2, 1, 0, 3, 1, 2, 1, 0, 2, 2, 1, 5, 1, 5, 3, 3, 1, 0, 2, 2, 0, 3, 2, 3, 0, 1, 1, 3, 0, 1, 2, 2, 0, 1, 2, 2, 3, 2, 3, 1, 1, 0, 1, 2, 2, 2, 2, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 1, 0, 1, 1, 0, 1, 3, 2, 1, 4, 3, 4, 1, 1, 1, 2, 1, 2, 0, 0, 0, 1, 1, 2, 6, 0, 1, 1, 0, 1, 0, 1, 2, 2, 3, 0, 1, 2, 2, 1, 0, 4, 2, 1, 2, 2, 0, 4, 4, 0, 3, 2, 2, 1, 2, 4, 1, 2, 1, 0, 2, 1, 1, 5, 1, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 5, 2, 3, 0, 1, 1, 1, 1, 3, 4, 2, 4, 1, 2, 3, 2, 5, 2, 1, 0, 1, 1, 2, 2, 3, 1, 1, 1, 2, 1, 2, 0, 4, 1, 1, 2, 3, 4, 3, 1, 2, 3, 3, 3, 2, 1, 2, 0, 0, 4, 3, 2, 2, 5, 5, 3, 3, 3, 1, 0, 1, 3, 1, 1, 2, 4, 3, 1, 4, 4, 2, 5, 0, 5, 4, 2, 1, 0, 4, 1, 3, 3, 2, 4, 2, 3, 3, 1, 3, 3, 4, 2, 2, 1, 1, 1, 1, 3, 3, 5, 3, 2, 4, 0, 1, 3, 2, 2, 4, 2, 2, 3, 4, 5, 3, 2, 1, 2, 3, 2, 2, 2, 4, 4, 0, 1, 3, 3, 3, 4, 1, 2, 4, 0, 4, 0, 3, 2, 1, 1, 4, 2, 1, 0, 0, 0, 4, 2, 2, 1, 4, 3, 1, 1, 3, 2, 4, 3, 4, 2, 1, 1, 2, 2, 3, 3, 1, 2, 2, 1, 1, 2, 3, 1, 9, 1, 3, 4, 2, 4, 4, 0, 1, 0, 1, 0, 2, 1, 0, 1, 2, 3, 3, 6, 2, 2, 1, 2, 4, 3, 3, 3, 2, 1, 2, 1, 2, 8, 2, 3, 1, 5, 3, 0, 2, 1, 1, 4, 2, 2, 1, 2, 3, 2, 1, 0, 4, 3, 4, 3, 1, 3, 2, 3, 2, 2, 1, 0, 1, 2, 5, 3, 0, 3, 1, 2, 2, 2, 1, 0, 1, 4]

Whereas when you re-use the same random number for all items, you get a distribution skewed to the very low numbers:

reuse [92, 50, 34, 19, 23, 16, 13, 9, 9, 9, 11, 10, 6, 7, 8, 5, 5, 6, 4, 2, 2, 3, 2, 3, 3, 6, 6, 1, 4, 3, 5, 2, 2, 1, 1, 2, 3, 4, 3, 4, 1, 3, 1, 0, 0, 1, 5, 3, 1, 2, 0, 2, 0, 1, 1, 6, 2, 0, 2, 2, 4, 2, 2, 0, 2, 2, 2, 0, 3, 0, 4, 1, 2, 1, 4, 2, 2, 0, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 2, 1, 3, 1, 0, 1, 2, 0, 4, 3, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 2, 1, 0, 1, 0, 0, 1, 1, 3, 0, 1, 1, 0, 2, 0, 1, 2, 0, 1, 1, 4, 1, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 2, 0, 1, 1, 0, 2, 0, 2, 0, 0, 2, 0, 1, 0, 2, 1, 1, 0, 0, 1, 2, 4, 1, 0, 2, 0, 1, 2, 1, 3, 0, 1, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 3, 2, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 4, 1, 0, 2, 1, 0, 0, 2, 1, 1, 3, 3, 2, 0, 1, 0, 2, 0, 1, 1, 0, 0, 3, 1, 0, 0, 1, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 4, 1, 0, 0, 2, 0, 1, 1, 0, 0, 3, 1, 3, 2, 2, 1, 3, 1, 2, 0, 1, 1, 3, 0, 3, 1, 2, 0, 2, 0, 2, 0, 3, 0, 3, 0, 3, 1, 0, 2, 3, 1, 1, 0, 1, 3, 3, 1, 1, 1, 0, 2, 1, 1, 4, 1, 1, 1, 2, 0, 3, 1, 1, 0, 4, 1, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 3, 0, 2, 4, 0, 1, 2, 1, 6, 1, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 4, 2, 0, 1, 2, 0, 1, 4, 1, 2, 0, 5, 2, 2, 0, 6, 2, 2, 1, 3, 0, 3, 1, 1, 0, 3, 1, 4, 2, 0, 1, 0, 1, 2, 3, 1, 1, 3, 0, 0, 0, 1, 1, 4, 3, 3, 0, 0, 1, 0, 1, 1, 2, 1, 0, 2, 1, 4, 5, 1, 1, 3, 0, 1, 1, 1, 3, 1, 1, 0, 3, 3, 1, 3, 0, 1, 0, 0, 1, 1, 3, 2, 1, 0, 3, 1, 1, 3, 1, 3, 1, 2, 2, 2, 0, 0, 5, 1, 3, 0, 1, 4, 1, 1, 1, 3, 2, 1, 3, 2, 1, 3, 1, 2, 2, 3, 2, 2, 1, 0, 3, 3, 1, 3, 3, 3, 2, 1, 2, 3, 3, 3, 1, 2, 2, 2, 4, 2, 1, 5, 2, 2, 0]

What's the maths here? Why can't you re-use the same random number?

Wriest answered 2/1, 2012 at 16:42 Comment(2)
Are you sure you are using the same random number here? Looks like you are initializing a different fixed every time...Planksheer
@TimGee yes, that is intentional; when using a new fixed() each call to sample, the random number is not re-generated for each and every item in the stream, however.Wriest
E
7

Edit, in response to comment:

The way reservoir sampling should work is: you want to select exactly the right proportion of samples from each of the existing bins in order to make up an additional bin with the same probability. In your sample() loop, given that you have randomly sampled one of item bins, you need to select samples from each bin with probability 1/(item + 1).

However, with fixed(), both the selection decision and the previous bin number depend on the same fixed 32-bit number. This means that the likelihood that a sample is removed from each of the bins will not be uniform.


Consider what happens during the third iteration of the sample() loop. You have three existing bins (0, 1, and 2), and you want to pick 1/4 of the samples in each and add them to a newly created bin 3.

Note that all the 32-bit fixed() numbers in bin 1 will be even (because the first pass selected all numbers divisible by 2), and all the numbers in bin 0 will be odd (because the even ones were moved to bin 1). The second pass moves all numbers divisible by three to bin 2 (which should be OK so far, and does not change the even/odd division in bins 0 and 1).

The third pass then moves all fixed() numbers divisible by 4 into bin 3. But, this will select half the numbers from bin 1 (because half of all even numbers are divisible by 4), and none of the numbers from bin 0 (because they are all odd). So, even though the expected size of the new bin should be correct, the expected sizes of the old bins are no longer the same.

That is how fixed() generates an uneven distribution: the implicit assumption, that you can select an exact fraction of each bin by choosing a random number, is violated if that number depends in a predictable way on the numbers used to choose the original bin.


The basic property of random numbers is that each sample must be independently distributed from the preceding samples, in a statistical sense. Algorithms based on random numbers depend on this property.

Pseudo-random number generators (PRNG's) are not actually random; as you know, their results are actually a fixed sequence. The PRNG results are deliberately scrambled so that they act enough like actual random numbers for most purposes. However, if the PRNG is "weak" for a particular application, the inner workings of the PRNG can interact with the details of the application in odd ways, to very non-random results.

What you're trying out here, by re-using the same random number, is building a bad PRNG. The actual results depend on the details of how the application uses the random numbers...

Even though fixed() is an intentionally broken PRNG, many commercial library PRNG's are "weak", and can end up with similar weird interactions with some applications. As a practical matter, "weakness" is relative to the application -- and, while there are statistical tests that are widely used to try to expose weak PRNG's, there is no guarantee that your application won't stumble on some odd correlation of even a "strong" PRNG.

Edmonton answered 2/1, 2012 at 20:23 Comment(1)
How does the divisor thing not affect the proper working reservoir sampling though? How does generating a different random number each item solve the divisor bias?Wriest
B
1

If you pick a random number each time, the next item from the stream has 1/CURRENTSIZE chance of beating the previous picked item.

So what is the problem with one random number per stream? Why does it skew the distribution?

I haven't found a complete answer yet, but I have an idea.

For example, lets take a stream of 100 numbers and pick a random number 0...999. Now we look at it from viewpoint of the second item.

When does it win? Well, first of all, it needs to be N%2==0. So it has to be an even number. Also, it is also beat by every other multiple of each multiple of 2 in the stream, 4...6...8....10 etc. But it wins for example on 106.

Calculating all numbers it wins with, 0..999 and we get 81 times!

Now lets take 4, it needs to be N%4==0 and it is beat by all multiples of 4 to N (8...12....16). If we calculate how many times 4 can win, we get 45 times...! So the distribution isn't fair.

This can be fixed if you make the random number maximum the size of the stream, then all have 1 chance to win, making it an even distribution again.

For example, if we have a stream size of 100, and we pick a random number of 0..199. We know the first 100 random numbers all have exacly 1 match, so they are distributed evenly. But what happens with random numbers 99...199? The distribution isn't even! For example 101 will only give 101%X==0 for 1. This is true for all the prime numbers (101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199). So item one has a LOT larger chance to win than the others.

This isn't the case if you pick a new random number for each item, in that case the chances can be added. For example when item one comes along it has a chance of winning that is:

NOT(1/2 + 1/3 + 1/4 + 1/5 + 1/6 (...etc))

Brimstone answered 15/2, 2012 at 12:7 Comment(0)
I
0

Think about this: What happens when your fixed number is really small?

Intuitionism answered 12/1, 2012 at 11:43 Comment(1)
stackoverflow is where you give canonical answers, not hintsWriest

© 2022 - 2024 — McMap. All rights reserved.