Applying HyperLogLog to a sample of the population

The HyperLogLog algorithm by Flajolet et al. describes a clever way to estimate the cardinality of a set using only a tiny amount of memory. However, it takes all N elements of the original set into account in the calculation. What if we only had access to a small random sample (say, 10%) of the original N elements? Has there been any research on how HyperLogLog or similar algorithms can be adapted to this situation?

I am aware that this is essentially the problem described as distinct value estimation, for which abundant research exists (see for example this paper for an overview). However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog. Therefore, I am wondering if someone has already thought of adapting HyperLogLog to the distinct value estimation problem.

Latchstring answered 25/11, 2012 at 16:1 Comment(1)
I believe it would be better to post this on stats.stackexchange.com - Fistic

"However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog."

Yes, because they are solving a very different problem.

Suppose you just confiscated a stash of 1,000,000 counterfeit dollar bills, and you want to know the number of distinct serial numbers.

Sampling 100,000 of them (using HyperLogLog, as your antique steam-driven counting machine has only 1k of memory), you count 5,000 different serial numbers, each of which occurs around 20 times. Then you can be pretty sure that the whole stash contains only a little over 5,000 distinct serial numbers.

Now suppose instead that 1 serial number occurs 95,001 times, and 4,999 serial numbers occur only once. Apparently some bona fide banknotes found their way into your stash. Now you can be pretty confident that the stash contains around 5% honest banknotes, so that the entire stash of 1,000,000 contains around 50,000 distinct serial numbers.
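The two scenarios can be reproduced with a small simulation. This is only a sketch: the stash compositions are my own assumption about what populations would produce such samples, the function names are made up for illustration, and an exact `set` stands in for HyperLogLog to keep it short.

```python
import random


def build_stashes():
    """Two hypothetical stashes of 1,000,000 bills each (assumed compositions)."""
    # Stash A: 5,000 serial numbers, each printed 200 times.
    stash_a = [serial for serial in range(5_000) for _ in range(200)]
    # Stash B: one counterfeit serial printed 950,000 times,
    # plus 50,000 genuine notes with unique serial numbers.
    stash_b = [0] * 950_000 + list(range(1, 50_001))
    return stash_a, stash_b


def distinct_in_sample(population, fraction, rng):
    """Draw a simple random sample without replacement and count distinct values."""
    k = int(len(population) * fraction)
    return len(set(rng.sample(population, k)))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns give the same sample
    for name, stash in zip("AB", build_stashes()):
        print(
            f"stash {name}: true distinct = {len(set(stash))}, "
            f"distinct in 10% sample = {distinct_in_sample(stash, 0.10, rng)}"
        )
```

Both samples should show roughly 5,000 distinct serial numbers, even though the stashes contain 5,000 and 50,001 distinct values respectively. The distinct count of the sample alone cannot tell them apart.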

Note that the distribution of the frequencies in your sample is used to infer something about the distribution in the entire stash. This is actually mentioned as one of the "ad hoc" (your words) methods in the second paper you cite ("Sampling-based estimation of the number of distinct values(..)"):

"The idea behind a parametric estimator is to fit a probability distribution to the observed relative frequencies of the different attribute values."

Also note that the results of HyperLogLog and similar methods are completely insensitive to the distribution of the samples over their values. But your final estimate evidently depends very much on it!
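To make the insensitivity concrete, here is a toy HyperLogLog, offered as a minimal sketch rather than a tuned implementation. The class name `TinyHLL`, the SHA-1 based 64-bit hash, and the choice of 2^10 registers are all assumptions of mine. Each register keeps only a maximum, so repeating a value can never change the result.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative, not tuned)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Only the maximum is stored, so duplicates have no effect.
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)       # small-range correction
        return raw


def hll_count(stream, p=10):
    """Feed every element of the stream into a fresh sketch and estimate."""
    sketch = TinyHLL(p)
    for item in stream:
        sketch.add(item)
    return sketch.estimate()


if __name__ == "__main__":
    uniform = [s for s in range(5_000) for _ in range(20)]   # each serial 20 times
    skewed = [0] * 95_001 + list(range(1, 5_000))            # one serial dominates
    print(hll_count(uniform), hll_count(skewed))             # identical estimates
```

Both streams contain the same 5,000 distinct values, so they leave the registers in the same state and produce the same estimate, regardless of how differently the frequencies are distributed.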

My advice: use a method of your choice (like HyperLogLog) to count the number of distinct values in your sample, and then use one of the methods in "Sampling-based estimation" to estimate the number of distinct values in your entire multiset, or use your prior knowledge about the distribution of the multiset to calculate an estimate (maybe you saw the counterfeiters' printing press, and you know it could only ever print one serial number).
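As a rough illustration of the second step, the sketch below turns a sample's frequency profile into a population estimate. Note the assumptions: `naive_scale_up` is a hypothetical back-of-envelope rule that mirrors the banknote reasoning above, not one of the estimators from the cited paper, and it needs exact per-value counts from the sample, which a plain HyperLogLog sketch does not keep.

```python
from collections import Counter


def frequency_profile(sample):
    """Return {j: number of distinct values seen exactly j times in the sample}."""
    return Counter(Counter(sample).values())


def naive_scale_up(sample, sampling_fraction):
    """Hypothetical estimator for illustration only (not from the cited paper).

    Values seen exactly once are assumed to be rare in the population and are
    scaled up by 1 / sampling_fraction. Values seen more than once are assumed
    to be common enough that the sample already contains all of them.
    """
    profile = frequency_profile(sample)
    singletons = profile.get(1, 0)
    repeated = sum(count for j, count in profile.items() if j > 1)
    return repeated + singletons / sampling_fraction


if __name__ == "__main__":
    uniform = [s for s in range(5_000) for _ in range(20)]   # no singletons
    skewed = [0] * 95_001 + list(range(1, 5_000))            # 4,999 singletons
    print(naive_scale_up(uniform, 0.10))   # 5,000 distinct values estimated
    print(naive_scale_up(skewed, 0.10))    # close to 50,000
```

This rule behaves sensibly on the two banknote samples but will be badly off for many other distributions, which is why the paper's estimators, or prior knowledge about the population, are the better choice in practice.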

Stockish answered 6/12, 2012 at 23:17 Comment(0)

Citation search is a wonderful thing. I'm not super familiar with the two problems as posed, so this paper might not be exactly what you meant. At the least they certainly talk about HyperLogLog and its relationship to the problem, so maybe it will sate your curiosity.

An Optimal Algorithm for the Distinct Elements Problem

Deadlock answered 1/12, 2012 at 7:34 Comment(1)
I was familiar with that paper. I may be missing something (I only skimmed over it), but the algorithm described in the paper seems to belong to the streaming class of estimators, which base their estimate on the entire population, and not just on a sample (which is the problem I have). - Latchstring
