Use of indexes for multi-word queries in full-text search (e.g. web search)

Asked 17/5, 2011 at 14:36 Answered 25/9, 2016 at 9:48

Solved algorithm indexing search-engine full-text-indexing inverted-index

I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:

some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)

To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.

But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:

set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.

Now, step three probably takes O(n log n) time to perform. For very large A and Bs that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.

One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).

But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to linearly go through a all elements x member of B, checking if they are also elements of A, to find the few that match both conditions.

Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?

Megen answered 17/5, 2011 at 14:36 Comment(3)

For now, I accepted one answer and upvoted another one due to the helpful suggestion of the book "Information Retrieval: Implementing Search Engines". But I'm still looking for a more complete and satisfying answer. – Torsk 17/5, 2011 at 21:15

i am having the exact same problem in the tiny search engine i am implementing and so far haven't been able to pinpoint the exact solution. – Compressibility 20/9, 2013 at 6:36

I imagine, also, that pretty much any search you can think of has already been made. Which means google has already computed the intersection at some point and might store it (I guess they have endless memory). – Drais 30/10, 2017 at 7:10

As you said some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending), I think the search engine may not do this, the doc list should be sorted by doc ID, each doc has a rank according to the word.
When a query comes, it contains several keywords. For each word, you can find a doc list. For all keywords, you can do merge operations, and compute the relevance of doc to query. Finally return the top ranked relevance doc to user.
And the query process can be distributed to gain better performance.

Lester answered 17/5, 2011 at 15:44 Comment(3)

Ok, this is starting to look like a good answer. But I was really looking forward to details on how performant such a solution can be and/or if there isn't more to it regarding comercial large-scale implementations. I'll accept the best answer and upvote the other relevant ones too. – Torsk 17/5, 2011 at 15:55

@Luís Marques: Lucene maybe the solution and/or implementation you're looking for. In one word, lucene is an open source search software. You can find more details in wikipedia and apache lucene. – Lester 23/5, 2011 at 10:32

Can you provide more pointers on performing the merge operation. – Compressibility 18/9, 2013 at 18:14

Even without ranking, I wonder how the intersection of two sets is computed so fast by google.

Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.

Let's try "concrete" and 位置 ("site", "location") in chinese and 極端な ("extreme") in japanese.

Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) " Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) " Google search for "極端な" About 7,590,000 results (0.25 seconds)

It is extremly improbable that all three terms would ever appear in the same document, but let's google them: Google search for "concrete 位置極端な" returns "About 174,000 results (0.13 seconds)"

Adding a russian word "игра" (game) Search игра: About 212,000,000 results (0.37 seconds)

Search for all of them: " игра concrete 位置極端な " returns About 12,600 results (0.33 seconds)

Of course the returned search results are nonsense and they do not contain all the search terms.

But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.

There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of kind "give back some high-ranked results, and stop after 0,5 sec".

How this is implemented, I don't know. Any ideas?

Timeout answered 28/10, 2013 at 12:12 Comment(2)

Found the very old paper by Sergey Brin and Larry Page describing the Google insex and search. See chapter 4.5 here: infolab.stanford.edu/~backrub/google.html At that time the search algo was pretty brute force and they admit it was not optimal. – Timeout 29/10, 2013 at 11:16

Interesting finding. Thanks Nutiu. – Torsk 29/10, 2013 at 14:56

Most systems somehow implement TF-IDF in one way or another. TF-IDF is a product of functions term frequency and inverse document frequency.

The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function says that it should give a higher value for terms that appear in few documents and lower value for terms that appear in all documents making them irrelevant.

You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.

There's a great explanation of the theory behind this in Information Retrieval: Implementing Search Engines chapter 2. Another idea to investigate further is also to look how Solr implements this.

Cordon answered 17/5, 2011 at 15:0 Comment(1)

Thanks for your answer, but it seems to me that the problem of how to rank documents is fairly independent of how to recall documents which match multiple words (or other multiple conditions). – Torsk 17/5, 2011 at 15:6

Google does not need to actually find all results, only the top ones. The index can be sorted by grade first and only then by id. Since the same ID always has the same grade this does not hurt sets intersection time.

So google starts intersection until it finds 10 results , and then does a statistical estimation to tell you how many more results it found.

A worst case is almost impossible. If all words are "common" then intersection will give the first 10 results very fast. If there is a rare word, then intersection is fast because complexity is O(N long M) where N is the smallest group.

You need to remember that google keeps it's indexes in memory and uses parallel computing.For example U can split the problem into two searches each searching only half of the web, and then marge result and take the best. Google has millions of computes

Tuckie answered 25/9, 2016 at 9:48 Comment(1)

Very interesting finding. I did some experiments and I can confirm what you said is correct. – Hippo 10/12, 2017 at 11:28

Recommended topics

Hot tags