System Design of Google Trends?

Asked 27/9, 2013 at 20:2 Answered 20/5, 2014 at 8:21

I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).

Challenges:

Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.

This is what my assumption is (I have zero practial experience of MapReduce/NoSQL technologies)

Each search item from user will maintain set of attributes that will be stored and eventually processed.

As well as maintaining list of searches by time stamp, region of search, category etc.

Example:

Searching for Kurt Cobain term:

Kurt-> (Time stamp, Region of search origin, category ,etc.)

Cobain-> (Time stamp, Region of search origin, category ,etc.)

Question:

How do they efficiently calculate frequency of search term ?
In other words, given a large data set, how do they find top 10 frequent items in distributed scale-able manner ?

Haslam answered 27/9, 2013 at 20:2 Comment(3)

Also need to consider time decay factor – Lyons 13/5, 2014 at 17:47

I think using special data-structures that are structured in a way that accelerates finding the trends, data is arranged in a way that pre-process it for all open features for millions of users online – Clinker 15/5, 2014 at 9:10

Apparently I can't vote to close a question someone else has offered a bounty on, but to me this question seems off-topic/too broad: there are many technologies and areas of research related to this topic, and there is no way an answer could encapsulate them other than by linking to some more suitable resource such as a textbook or dedicated website. To paraphrase one of the guidelines in the help centre: "if you can imagine a whole career or business plan based on finding the answer, the question is probably too broad". – Bromoform 17/5, 2014 at 14:22

Well... finding out the top K terms is not really a big problem. One of the key ideas in this fields have been the idea of "stream processing", i.e., to perform the operation in a single pass of the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:

A B K A C A B B C D F G A B F H I B A C F I U X A C

What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.

In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.

So, people came up with the idea of "count-sketches" (you can read up more here: count min sketch page on wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:

h1(x) = 0 ... n-1 with uniform probability

h2(x) = 0/1 each with probability 0.5

You then do A[h1[x]] += h2[x]. The key observation is that since each value randomly hashes to +/-1, E[ A[h1[x]] * h2[x] ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.

Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.

With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.

Montagna answered 20/5, 2014 at 8:21 Comment(0)

How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever)

But many of the tools and research is out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real-time

Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research

How to design such a system is challenging with no correct answer, but the tools and research are available to get you started

Literalism answered 17/5, 2014 at 14:15 Comment(0)

Recommended topics

Hot tags