I'm reading about how the count-min sketch (CMS), a probabilistic data structure, is used to find the top k elements in a data stream. But I can't seem to wrap my head around the step where we maintain a heap to get the final answer.
The problem:
We have a stream of items [B, C, A, B, C, A, C, A, A, ...], and we are asked to find the top k most frequently appearing items.
My understanding is that this can be done with micro-batching, in which we accumulate N items before doing any real work.
The hashmap+heap approach is easy enough for me to understand. We traverse the micro-batch and build a frequency map (e.g. {B: 34, D: 65, C: 9, A: 84, ...}) by counting the elements. Then we maintain a min-heap of size k by traversing the frequency map, pushing each [item]:[freq] pair and evicting the current minimum as needed. Straightforward enough and nothing fancy.
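To make sure I'm describing that approach correctly, here is a rough Python sketch (the function name is mine, just for illustration):

```python
import heapq
from collections import Counter

def top_k_exact(batch, k):
    """Exact top-k for one micro-batch: build a frequency map, then keep
    a min-heap of size k keyed on frequency."""
    freq = Counter(batch)                    # e.g. Counter({'A': 4, 'C': 3, 'B': 2})
    heap = []                                # min-heap of (freq, item) pairs
    for item, count in freq.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:             # new count beats the current minimum
            heapq.heapreplace(heap, (count, item))
    return sorted(heap, reverse=True)        # highest frequency first

# top_k_exact(['B', 'C', 'A', 'B', 'C', 'A', 'C', 'A', 'A'], 2)
# -> [(4, 'A'), (3, 'C')]
```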
Now with CMS+heap, instead of a hashmap, we have this probabilistic lossy 2D array, which we build by traversing the micro-batch. The question is: how do we maintain our min-heap of size k given this CMS?
The CMS only contains a bunch of counters, not the original items. Unless I also keep a set of the unique elements from the micro-batch, there is no way for me to know which items to query when building my heap at the end. But if I keep such a set, doesn't that defeat the purpose of using a CMS to save memory?
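For concreteness, this is roughly how I picture the CMS itself (a minimal sketch; the width, depth, and hashing scheme are placeholders I made up). It supports update and estimate, but there is no way to enumerate the items that went into it:

```python
import hashlib

class CountMinSketch:
    """Minimal CMS sketch: a depth x width grid of counters. It can update
    and estimate counts, but it never stores the items themselves."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row; the row index is mixed into the hash input
        # so the rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Upper bound on the true count: collisions can only inflate counters.
        return min(self.table[row][col] for row, col in self._buckets(item))
```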
I also considered building the heap in real time as we traverse the stream. With each incoming item, we can quickly update the CMS and get that item's cumulative frequency so far. But the fact that this frequency is cumulative does not help me much. For the example stream above, the successive estimates would be [B:1, C:1, A:1, B:2, C:2, A:2, C:3, A:3, A:4, ...]. If we use the same push-and-evict logic to update our min-heap, we get incorrect answers (the heap ends up with duplicate entries for the same item).
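Here is roughly what I had in mind, reusing the CountMinSketch sketch above (again, just a sketch of my idea, not a working solution):

```python
import heapq

def streaming_top_k_naive(stream, k, cms):
    """The streaming variant I was considering: update the CMS per item and
    push that item's current estimate into a size-k min-heap."""
    heap = []
    for item in stream:
        cms.update(item)
        est = cms.estimate(item)             # cumulative count so far
        if len(heap) < k:
            heapq.heappush(heap, (est, item))
        elif est > heap[0][0]:
            heapq.heapreplace(heap, (est, item))
    return heap

# With the stream above and k=2, the heap momentarily holds
# [(2, 'C'), (3, 'C')] -- the same item twice -- because nothing
# de-duplicates entries for an item whose estimate keeps growing.
```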
I'm definitely missing something here. Please help me understand.