Iterative or Lazy Reservoir Sampling

I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since one has to allow items in the entirety of the sequence the opportunity to replace previously encountered items to achieve a uniform sample.
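
For concreteness, here is a minimal sketch of the kind of single-pass reservoir sampling I mean (Algorithm R); the name reservoir_sample is just illustrative. Nothing can be returned until the whole iterable has been consumed:

import random

def reservoir_sample(iterable, k):
    # Standard reservoir sampling (Algorithm R): fill the reservoir with the
    # first k items, then let each later item displace a random slot.
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)   # uniform index in [0, i]
            if j < k:
                reservoir[j] = item       # item i enters the sample with probability k/(i+1)
    return reservoir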

Is there a way to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with Python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!

Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.

[Diagram: sampling techniques categorized by whether the population and the sample are each held in memory or streamed]

There is an apparent contradiction in not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?

Alright answered 11/6, 2014 at 21:15 Comment(5)
You could do it that way, but in doing so you're losing the ability to generate some sequences. For example, if you want to select 10 items at random from a list, but you do some type of early return for one or more items, then your sample will never contain the last 10 items in the list. If you're okay with biasing the output, then you can do an early return. Otherwise, you have to wait until the entire list has been examined.Corruptible
It would make more sense to implement reservoir sampling so that it always iterates its entire iterable. If a caller wants a faster result that does not iterate over its entire iterable, it can pass in a truncated iterable itself. Yielding an iterable of reservoirs wouldn't make much sense because consecutive reservoirs are extremely correlated (they differ in 0 or 1 positions).Fontenot
@TimothyShields I agree in terms of API design, one would and should expect a reservoir sample to behave that way. What I am looking for here is some sort of analogous statistical cleverness that would allow us to return items early or some good argument why this is not possible at all.Alright
@Alright Return items early for what purpose? See the last sentence in my previous comment.Fontenot
@TimothyShields Let's say we have a population of size 100 and we want a sample of size 15. One could easily devise an algo that holds the population in memory and streams (iteratively yields) items from the sample. Alternatively, if we were to use reservoir sampling, we would hold the sample in memory and stream the population. I want a way to stream both the population and the sample, à la this diagram: littleml.files.wordpress.com/2013/01/memory.pngAlright

If you know in advance the total number of items that will be yielded by an iterable population, it is possible to yield the items of a sample of the population as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible (as the probability of any item being in the sample can't be calculated).

Here's a quick generator that does this:

import random

def sample_given_size(population, population_size, sample_size):
    # Selection sampling (Knuth's Algorithm S): keep each item with probability
    # (remaining sample slots) / (remaining population items).
    for item in population:
        # Multiplying avoids integer division if this is ever run under Python 2.
        if random.random() * population_size < sample_size:
            yield item
            sample_size -= 1
        population_size -= 1

Note that the generator yields items in the order they appear in the population (not in random order, as random.sample or most reservoir sampling implementations do), so a slice of the sample will not itself be a uniform random subsample!
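
A quick usage sketch of the generator above (the particular items chosen differ from run to run):

stream = iter(range(100))                          # pretend this is a stream of known length 100
sample = list(sample_given_size(stream, 100, 15))
print(sample)   # e.g. [2, 9, 14, ...] -- the chosen items appear in population order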

Are answered 11/6, 2014 at 23:26 Comment(9)
This is a true streaming reservoir sample, very nice. Is this possible to do without knowing in advance the size of the population?Alright
@Stankalank: No. Think about the very first element in the population: if we aren't allowed to ever "change" a decision to include or exclude it from the sample (as a "streaming output" algorithm implies), how on earth are we supposed to know with what probability to include it, unless we know the population size?Deas
@Stankalank, no, it's crucial here to know the size in advance. Simplify this and you'll see it more easily: restrict your question to picking a sample of size 1. You should be able to convince yourself easily that, without knowing the population size in advance, not only can't you cut it off early, you can't even come up with "an approximation" that's of any real value. It doesn't get any easier when the sample size exceeds 1 ;-)Dorathydorca
Won't your result simply be wrong (statistically skewed) if you assume the wrong population size in advance? (Not considering the population size means using a different formula for the probability, which is then equivalent to assuming a certain population size.)Jeter
@TimPeters I fully grasped the seeming contradiction in the problem statement when I asked the question. However, I also would have thought that doing a random sample in a single pass (reservoir sample) would be impossible before I discovered that somebody much more clever than me had invented an algorithm for it. I was wondering if these same clever people had a similar solution for stream sampling =)Alright
Note also that this could yield fewer than sample_size items. Especially if sample_size is a reasonably large percentage of population_size.Corruptible
@Blckknght, how to prove the correctness of this algorithm in a formal way? I've gone mad about the proof recently...Benildas
@old_bear: It's been a very long time since I've attempted any formal proofs for CS algorithms. You could do it by induction, I think. The base cases would be sample_size = 0 and sample_size = population_size (which should both be obviously correct). The inductive step is to assume we can correctly find a sample of M items from a population of size N, then check that the probabilities tested in the code above are correct for selecting the first element of a population of size N+1 with a sample of size M or M+1.Are
@Blckknght: Finally I've done it by means of induction on the index i. Let n be the population size and k the sample size. Starting from i = 0, the first element's probability is k/n. Now suppose that for i = i' all elements before i' have probability k/n. For i = i'+1, the probability of the (i'+1)-th element equals (k - i'*k/n)/(n - i') = k/n, where i'*k/n is the expected number of samples already chosen.Benildas

If the population size is known beforehand, can't you just generate sample_size random "indices" into the stream and use them to do a lazy yield? You won't have to read the entire stream.

For instance, if population_size was 100, and sample_size was 3, you generate a random set of integers from 1 to 100, say you get 10, 67 and 72.

Now you yield the 10th, 67th and 72nd elements of the stream and ignore the rest.
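
A rough sketch of that idea (the name sample_by_indices is just illustrative), assuming the population size really is known up front:

import random

def sample_by_indices(stream, population_size, sample_size):
    # Draw the winning positions up front, then yield them lazily in stream order.
    chosen = set(random.sample(range(population_size), sample_size))
    for index, item in enumerate(stream):
        if index in chosen:
            yield item
            chosen.discard(index)
            if not chosen:        # every chosen position seen; no need to read further
                return

Unlike the generator in the accepted answer, this one can stop reading the stream as soon as the last chosen index has been seen.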

I guess I don't understand the question.

Millennial answered 12/6, 2014 at 4:32 Comment(1)
The problem is fairly trivial if the population size is known a priori, see the accepted answer for a solution. I'm thinking of a data stream that has an unknown length, where we want to stream both the population and the sample. What I seem to be getting from the other answers is that this is not really possible, unfortunately.Alright
