How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.

This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated exactly, not estimated).

So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:

function countUniqueAlgo1(arr) {
    var Table = {};               // values seen so far
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;             // already counted this value
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}

But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).

I've been reading this paper about how to count the distinct elements in a list in O(n) time using minimal memory.

It explains that, by hashing the elements and counting bits, one can estimate the number of unique items in a list to within a certain probability (assuming the list is evenly distributed).

I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.

Grewitz asked 8/9, 2012 at 0:28 Comment(2)
This paper (research.google.com/pubs/pub40671.html) also summarizes the HyperLogLog algorithm and some improvements. I think it's easier to understand than the original paper.Pearlene
Just a hint on nomenclature: Some people use the word set to describe a collection of unique items. To them, your question might make better sense if you used the term list or array instead.Slavey

The main trick behind this algorithm is that if, while observing a stream of random integers, you see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).

That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.

(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors)

Of course, if you observe only one stream, the chance that this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps, for each substream, the maximum length of the "00...1" prefix it has seen. Then it estimates the final value by taking a mean of these values across all substreams.
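
To make that concrete, here is a rough JavaScript sketch of the substream idea. The hash32 helper and the plain average at the end are my own simplifications for illustration; the paper's estimator uses a more careful combination and bias correction.

// Toy sketch of the substream idea, not the full algorithm from the paper.
function hash32(str) {
    // Simple FNV-1a hash, used here only so the sketch is runnable;
    // any well-mixed hash works.
    var h = 0x811c9dc5;
    for (var i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193);
    }
    return h >>> 0;
}

function estimateCardinality(items, b) {
    var m = 1 << b;                           // m = 2^b substreams
    var maxPrefix = new Array(m).fill(0);
    items.forEach(function (item) {
        var h = hash32(String(item));
        var bucket = h >>> (32 - b);          // first b bits choose the substream
        // Remaining bits, with a sentinel bit so the value is never all zeros.
        var rest = ((h << b) | (1 << (b - 1))) >>> 0;
        var prefixLen = Math.clz32(rest) + 1; // length of the "00...1" prefix
        if (prefixLen > maxPrefix[bucket]) maxPrefix[bucket] = prefixLen;
    });
    // Combine the substreams with a plain average here; HyperLogLog itself
    // uses a harmonic mean plus corrections.
    var avg = maxPrefix.reduce(function (a, x) { return a + x; }, 0) / m;
    return m * Math.pow(2, avg);
}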

That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.

Crane answered 4/10, 2012 at 19:19 Comment(6)
"there is a higher chance that this stream has a cardinality of 8" Can you please explain why 000 means expected number of trials 2^3. I tried to compute math expectation of number of trials assuming we have at least one run with 3 zeros and no runs with 4 zeros...Crossexamine
Didn't quite understand the paper until I read this. Now it makes sense.Cerenkov
@Crossexamine I know it's a very old comment, but it may be useful for other people. He said "That is, in a random stream of integers, (...) 12.5% start with "001"." The probable cardinality is 8 because 12.5% represents one eighth of the whole stream.Reasonable
this is the best/essential explanation of hll i've ever read.Sesame
Can you please explain how the cardinality can depend on the length of the prefix? I can choose the bitness of the numbers arbitrarily, right? I can have a stream of uint16 numbers, but I can also have a stream of the same numbers as uint64 integers, in which case the length of the prefix will dramatically increase, but obviously the cardinality will not.Cutoff
@Cutoff Notice we are talking about a stream of random numbers, typically produced by applying a hash function to the original stream, which, while not strictly random, are a good enough approximation. In that case, we are assuming each bit has a 50% chance of being either 0 or 1, so using uint16 or uint64 shouldn't much affect the expected prefix length (also assuming expected cardinality << 2^(bit length)).Crane

A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list, but in comparison to the straightforward way of doing it (keeping a set and adding elements to it), it does this in an approximate way.

Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element can be 1, another element "is this big string". So if you have a huge list (or a huge stream of elements) it will take a lot of memory.


Probabilistic Counting

How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with one zero, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point. Having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the prefix of zeros in the binary representation, and this will give you a reasonable estimate.

The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, almost never evenly distributed, and can be between any values). But using a good hashing function you can assume that the output bits will be evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements (up to a maximum cardinality of about 2^k) by storing only one number of size log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
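
As a tiny illustration, here is what that single-number estimator might look like in JavaScript, assuming the inputs have already been hashed to roughly uniform 32-bit integers:

// Single-register estimator: remember only the longest run of leading zeros
// seen so far; if it is k, roughly 2^k distinct elements have passed by.
// This needs just one small number as state, but the variance is huge.
function probabilisticCount(hashedValues) {
    var maxZeros = 0;
    hashedValues.forEach(function (h) {
        var zeros = Math.clz32(h >>> 0);  // leading zeros in the 32-bit value
        if (zeros > maxZeros) maxZeros = zeros;
    });
    return Math.pow(2, maxZeros);
}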

LogLog

Before moving further, we have to understand why our first estimate is not that great. The reason is that one random occurrence of an element with an unusually long zero-prefix can spoil everything. One way to improve it is to use many hash functions, count the max for each of the hash functions, and in the end average them out. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).

They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other part is basically treated the same way as our hash before. It was hard for me to get what was going on, so I will give an example. Assume you have two elements, and your hash function, which gives values from 0 to 2^10 - 1, produced two values: 344 and 387. You decided to have 16 buckets. So you have:

0101 011000  bucket 5 will store 1
0110 000011  bucket 6 will store 4

By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using some math they were able to quantify the error (which is about 1.3/sqrt(number of buckets)).
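
Here is a small JavaScript sketch of that splitting step, using the same 10-bit values as above. It only shows how the hash is split and what each bucket stores; the averaging of the buckets and the error analysis are in the LogLog paper.

// 10-bit hash values, 16 buckets (x = 4): the top 4 bits pick the bucket and
// each bucket keeps the maximum number of leading zeros of the remaining 6 bits.
function addToBucket(buckets, hashValue) {
    var bucket = hashValue >>> 6;                        // top 4 bits: 0..15
    var rest = hashValue & 0x3f;                         // low 6 bits
    var zeros = rest === 0 ? 6 : Math.clz32(rest) - 26;  // leading zeros within 6 bits
    if (zeros > buckets[bucket]) buckets[bucket] = zeros;
}

var buckets = new Array(16).fill(0);
addToBucket(buckets, 344);  // 0101 011000 -> bucket 5 stores 1
addToBucket(buckets, 387);  // 0110 000011 -> bucket 6 stores 4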

HyperLogLog

HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
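
For reference, the different averaging mentioned here is a harmonic mean of the per-bucket values (see also the comment below). A rough sketch of that raw estimate, with the small-range and large-range corrections from the paper left out:

// Raw HyperLogLog estimate: a harmonic mean of 2^buckets[j], scaled by the
// bias-correction constant alpha_m. The small-range (linear counting) and
// large-range corrections are omitted here.
function hllRawEstimate(buckets) {
    var m = buckets.length;
    var alpha = 0.7213 / (1 + 1.079 / m);  // good approximation for m >= 128
    var sum = 0;
    for (var j = 0; j < m; j++) {
        sum += Math.pow(2, -buckets[j]);
    }
    return alpha * m * m / sum;
}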


And I want to finish with a recent paper, which describes an improved version of the HyperLogLog algorithm (up until now I haven't had time to fully understand it, but maybe later I will improve this answer).

Quip answered 5/2, 2016 at 8:42 Comment(2)
I am assuming that, theoretically, k zeroes is not special. You could instead look for k ones and the logic would be the same, or even look for any fixed k-length string of {0, 1}, as long as you pick one such string and stick with it, because all of them have equal probability 1/2^k for such binary strings?Dorina
HyperLogLog does not remove 30% of the biggest numbers. This is the idea of the SuperLogLog algorithm also described in the LogLog paper. The main idea of the HyperLogLog algorithm is to average the power of twos using the harmonic mean instead of the geometric mean as used by SuperLogLog and LogLog.Shall

The intuition is that if your input is a large set of random numbers (e.g. hashed values), then they should distribute evenly over a range. Let's say the range is 10 bits, i.e. values up to 1024. Then observe the minimum value. Let's say it is 10. Then the cardinality will be estimated to be about 100 (since 10 × 100 ≈ 1024).
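
A toy JavaScript version of that intuition, assuming the values are already hashed and roughly uniform over [0, range):

// If n values are spread evenly over [0, range), the minimum lands near
// range / n, so range / min is a crude cardinality estimate.
// (A real estimator has to guard against min being 0 and has high variance.)
function estimateFromMin(hashedValues, range) {
    var min = hashedValues.reduce(function (a, b) { return Math.min(a, b); }, range);
    return range / min;
}

// e.g. with values spread over 0..1023 and a minimum of about 10,
// estimateFromMin(values, 1024) gives roughly 100.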

Read the paper for the real logic, of course.

Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Verlaverlee answered 11/9, 2012 at 17:45 Comment(1)
upvoted for the link to the damn cool algorithms blog post. that really helped me grasp the algorithm.Kelson
