I'm reading about how the count-min sketch (CMS), a probabilistic data structure, is used to find the top k elements in a data stream. But I can't seem to wrap my head around the step where we maintain a heap to get the final answer.
The problem:
We have a stream of items [B, C, A, B, C, A, C, A, A, ...], and we are asked to find the top k most frequently appearing items.
My understanding is that this can be done with micro-batching, in which we accumulate N items before doing any real work.
The hashmap+heap approach is easy enough for me to understand. We traverse the micro-batch and build a frequency map (e.g. {B: 34, D: 65, C: 9, A: 84, ...}) by counting the elements. Then we maintain a min-heap of size k by traversing the frequency map, pushing each [item]:[freq] pair and evicting the current minimum as needed. Straightforward enough and nothing fancy.
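To make sure I'm describing that approach correctly, here is a rough Python sketch (the function name is mine, just for illustration):

```python
import heapq
from collections import Counter

def top_k_exact(batch, k):
    """Exact top-k for one micro-batch: build a frequency map, then keep
    a min-heap of size k keyed on frequency."""
    freq = Counter(batch)                    # e.g. Counter({'A': 4, 'C': 3, 'B': 2})
    heap = []                                # min-heap of (freq, item) pairs
    for item, count in freq.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:             # new count beats the current minimum
            heapq.heapreplace(heap, (count, item))
    return sorted(heap, reverse=True)        # highest frequency first

# top_k_exact(['B', 'C', 'A', 'B', 'C', 'A', 'C', 'A', 'A'], 2)
# -> [(4, 'A'), (3, 'C')]
```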
Now with CMS+heap, instead of a hashmap, we have this probabilistic lossy 2D array, which we build by traversing the micro-batch. The question is: how do we maintain our min-heap of size k given this CMS?
The CMS only contains a bunch of counters, not the original items. Unless I also keep a set of the unique elements from the micro-batch, there is no way for me to know which items to query when building my heap at the end. But if I keep such a set, doesn't that defeat the purpose of using a CMS to save memory?
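For concreteness, this is roughly how I picture the CMS itself (a minimal sketch; the width, depth, and hashing scheme are placeholders I made up). It supports update and estimate, but there is no way to enumerate the items that went into it:

```python
import hashlib

class CountMinSketch:
    """Minimal CMS sketch: a depth x width grid of counters. It can update
    and estimate counts, but it never stores the items themselves."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row; the row index is mixed into the hash input
        # so the rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Upper bound on the true count: collisions can only inflate counters.
        return min(self.table[row][col] for row, col in self._buckets(item))
```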
I also considered building the heap in real time as we traverse the stream. With each incoming item, we can quickly update the CMS and get that item's cumulative frequency so far. But the fact that this frequency is cumulative does not help me much. For the example stream above, the successive estimates would be [B:1, C:1, A:1, B:2, C:2, A:2, C:3, A:3, A:4, ...]. If we use the same push-and-evict logic to update our min-heap, we get incorrect answers (the heap ends up with duplicate entries for the same item).
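Here is roughly what I had in mind, reusing the CountMinSketch sketch above (again, just a sketch of my idea, not a working solution):

```python
import heapq

def streaming_top_k_naive(stream, k, cms):
    """The streaming variant I was considering: update the CMS per item and
    push that item's current estimate into a size-k min-heap."""
    heap = []
    for item in stream:
        cms.update(item)
        est = cms.estimate(item)             # cumulative count so far
        if len(heap) < k:
            heapq.heappush(heap, (est, item))
        elif est > heap[0][0]:
            heapq.heapreplace(heap, (est, item))
    return heap

# With the stream above and k=2, the heap momentarily holds
# [(2, 'C'), (3, 'C')] -- the same item twice -- because nothing
# de-duplicates entries for an item whose estimate keeps growing.
```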
I'm definitely missing something here. Please help me understand.