Find the top k visited URLs for the last day, last hour, or last minute?

Asked 2/1, 2013 at 5:38 Answered 2/1, 2013 at 5:44

algorithm data-structures hash binary-heap streaming-algorithm

The original question: given a 5 GB file of the URLs visited during the last day, find the top k most frequent URLs. This can be solved by using a hash map to count the occurrences of each distinct URL and then finding the top k with the help of a min-heap, which takes O(n log k) time.
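For reference, the hash map plus min-heap approach described above can be sketched roughly as follows. This is only a minimal sketch: the function name `top_k_urls` is made up for illustration, and it assumes the distinct URLs and their counts fit in memory.

```python
import heapq
from collections import Counter


def top_k_urls(urls, k):
    """Return up to k (url, count) pairs, most frequent first."""
    if k <= 0:
        return []

    # Hash map: one pass over the input to count each distinct URL.
    counts = Counter(urls)

    # Min-heap of at most k (count, url) pairs; the root is the weakest
    # of the current top k, so each URL costs O(log k) at worst.
    heap = []
    for url, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, url))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, url))

    # Sort the k survivors by count, highest first.
    heap.sort(reverse=True)
    return [(url, count) for count, url in heap]


if __name__ == "__main__":
    visits = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
    print(top_k_urls(visits, 2))
```

The heap never holds more than k entries, which is where the log k factor comes from.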

Now I'm wondering: what if the input were an unbounded online data stream (instead of a static file)? How could I then know the top k URLs of the last day?

Or is there any improvement I can make to the system that would allow me to get the top k URLs for the last minute, the last hour, and the last day dynamically?

Any hint will be appreciated!!

Ideality answered 2/1, 2013 at 5:38 Comment(1)

Check out https://mcmap.net/q/1302694/-realtime-tracking-of-top-100-twitter-words-per-min-hour-day – Pal 15/11, 2015 at 10:33

If you are willing to settle for a probabilistic answer that might contain a few wrong entries, you should definitely look into the count-min sketch data structure. It was specifically designed to estimate how often elements occur in a stream using as little memory as possible, and most implementations support a very time- and space-efficient approximation of the top k elements of a stream. Moreover, the structure lets you trade accuracy for space, which makes it well suited to situations like this one. IIRC Google uses this to determine their most frequent search queries.
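To make the idea a bit more concrete, here is a rough sketch of a count-min sketch combined with a small candidate set for the top k. It is an illustration under assumptions, not a reference implementation: the class names `CountMinSketch` and `TopKTracker`, the default `width` and `depth`, and the use of salted BLAKE2 hashes are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row (assumed scheme for this sketch).
        data = item.encode("utf-8")
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


class TopKTracker:
    """Keeps the k items with the largest estimated counts seen so far."""

    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch
        self.candidates = {}  # item -> latest estimated count

    def add(self, item):
        self.sketch.add(item)
        est = self.sketch.estimate(item)
        if item in self.candidates or len(self.candidates) < self.k:
            self.candidates[item] = est
            return
        # Linear scan for the weakest candidate; fine for small k.
        weakest = min(self.candidates, key=self.candidates.get)
        if est > self.candidates[weakest]:
            del self.candidates[weakest]
            self.candidates[item] = est

    def top(self):
        return sorted(self.candidates.items(), key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    tracker = TopKTracker(2, CountMinSketch())
    for url in ["a.com"] * 50 + ["b.com"] * 30 + ["c.com"] * 5 + ["d.com"]:
        tracker.add(url)
    print(tracker.top())
```

Two sketches built with the same dimensions and hash functions can be merged by adding their tables cell by cell, so one possible way to cover the last minute, hour, and day would be to keep one sketch per time bucket and combine the buckets that fall inside the window.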

There are several implementations of this data structure available online.

Hope this helps!

Cetinje answered 2/1, 2013 at 5:44 Comment(0)
