H

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.

Two Strategies

There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (not big in C, myself) figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and to some degree errors (stdev +-2%) in the pursuit of efficiency with extremely large sets.

1. Inclusion Principle

By far the simplest of the two--essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40% which is unacceptable in this use case).

Basically:

aCardinality + bCardinality - intersectionCardinality

or, in the case of multiple sets...

aCardinality + (bCardinality x cCardinality) - intersectionCardinality

seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of high disparity in size...

2. Jaccard Index Intersection/MinHash

This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (ie, I'm not implementing my own HLL algorithm from scratch).

With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using minhash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.

Problem is, I'm not very well versed in HLL's and while I'd love to dig into the Google paper I need a viable implementation in short order. Chances are I'm overlooking some basic considerations either of Redis' existing optimizations, or else in the algorithm itself that allows for computationally-cheap intersection estimates with pretty lax confidence bounds.

thus, my question:

How do I most effectively get a computationally-cheap intersection estimate of N huge (billions) sets, using redis, if I'm willing to sacrifice space (and to a small degree, accuracy)?

Hydrotherapeutics answered 7/5, 2015 at 16:20 Comment(6)

What are the criteria for the 'best method' so we know what answers to provide? i.e. how would you decide what is the 'best answer'? You need to provide some limits - what are the resources available to you to solve this problem? – Subsidy 7/5, 2015 at 18:55

Is 'redis' the best tool to solve the 'matching' problem that you have? 'redis' requires everything to be stored in 'memory'. This could be 'interesting' for 'billions' of records. – Subsidy 7/5, 2015 at 19:5

Well, 'best' in this case is essentially the order I described @Ryan, space is irrelevant; accuracy would be the next sacrifice within given deviation bounds, and computational efficiency is my #1 priority. – Hydrotherapeutics 8/5, 2015 at 0:3

I'm also not quite sure it is, but I feel so as I don't want to venture beyond an in-memory solution given my needs for accessing this data and performing these queries in a dynamic stack--the resources to support that are essentially unlimited, however bear in mind I don't necessarily need to intersect the exact VALUES of N sets of billions, just have an accurate/cheap cardinality solution based around HLL where I can intersect at will... then again, when you have a hammer... – Hydrotherapeutics 8/5, 2015 at 0:4

Thanks for replying, to be honest this is 'way beyond my field of expertise', but it looks like a fun problem to be working on. :-) – Subsidy 8/5, 2015 at 0:11

This is great! I am trying to solve something very similar. Can you share how you implemented the minhash in conjunction with the redis HLL? This might be very useful for others trying to solve, what should be, a very common problem – Arrear 28/8, 2015 at 23:47

S

5

Read this paper some time back. Will probably answer most of your questions. Inclusion Principle inevitably compounds error margins a large number of sets. Min-Hash approach would be the way to go.

http://tech.adroll.com/media/hllminhash.pdf

Seignior answered 21/8, 2015 at 6:22 Comment(2)

Already built this, actually :). That paper definitely helped, but it really started to fly when I swapped in a custom murmurhash3 extension. Holding strong @ 4MM queries/min. – Hydrotherapeutics 21/8, 2015 at 15:53

Add an answer for your question, if you found the "right" way about it. And mark it as accepted. – Seignior 12/1, 2016 at 7:3

M

2

There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.

For more details see the paper available at http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

Morganmorgana answered 2/11, 2016 at 11:38 Comment(0)

Two Strategies

1. Inclusion Principle

2. Jaccard Index Intersection/MinHash

Recommended topics

Hot tags