Lucene's algorithm
I read the paper by Doug Cutting, "Space optimizations for total ranking".

Since it was written a long time ago, I wonder what algorithms Lucene uses now (regarding postings-list traversal, score calculation, and ranking).

In particular, the total ranking algorithm described there involves traversing the entire postings list for each query term, so for a query of very common terms like "yellow dog", either of the two terms may have an extremely long postings list in the case of web search. Are these lists really traversed in full in current Lucene/Solr, or are there heuristics employed to truncate them?

In the case where only the top k results are returned, I can see that distributing the postings lists across multiple machines and then combining the top k from each would work. But if we are required to return "the 100th result page", i.e. results ranked 990th to 1000th, then each partition would still have to find its own top 1000, so partitioning would not help much.

Overall, is there any up-to-date detailed documentation on the internal algorithms used by Lucene?

Farny asked 25/4, 2012 at 21:25
Additionally, does anybody know roughly (of course the details are a secret, but I guess the main ideas should be common enough these days) how Google does fast ranking for multi-term queries with AND? If their postings are sorted in PageRank order, then it's understandable that a single-term query would quickly return the top k, but for a multi-term query they would have to traverse the entire lists to find the intersection set, because the lists are not sorted by docId as in the Lucene paper's case. – Farny
I don't know how this actually works, but if you want to do early query termination, you should make index order (doc ids) match relevance order (PageRank in your case), at least on a per-segment basis. This would solve your problem for multi-term queries. – Marcelline
I am not aware of such documentation, but since Lucene is open source, I encourage you to read the source code. In particular, the current trunk version includes flexible indexing, meaning that storage and posting list traversal have been decoupled from the rest of the code, making it possible to write custom codecs.

Your assumptions about posting list traversal are correct: by default (it depends on your Scorer implementation), Lucene traverses the entire posting list for every term present in the query and puts matching documents in a heap of size k to compute the top-k docs (see TopDocsCollector). So returning results 990 to 1000 makes Lucene instantiate a heap of size 1000. And if you partition your index by document (another approach would be to split by term), every shard needs to send its top 1000 results to the server that is responsible for merging them (see Solr's QueryComponent, for example, which translates a request for results N through P, with P > N, into shard requests for results 0 through P: sreq.params.set(CommonParams.START, "0");). This is why Solr can be slower in distributed mode than in standalone mode in the case of extreme paging.
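To make the heap-of-size-k behaviour concrete, here is a minimal sketch against a recent Lucene API (the index path and field name are placeholders; searcher.search(query, n) is just the convenience wrapper around a TopDocsCollector):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DeepPagingExample {
    public static void main(String[] args) throws Exception {
        // Open an existing index (the path is only an example).
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        Query query = new QueryParser("body", new StandardAnalyzer()).parse("yellow dog");

        // To return results 990-999, Lucene still has to keep the top 1000
        // matches in a priority queue while it scores every matching document.
        int start = 990, rows = 10;
        TopDocs top = searcher.search(query, start + rows);

        for (int i = start; i < Math.min(start + rows, top.scoreDocs.length); i++) {
            ScoreDoc sd = top.scoreDocs[i];
            System.out.println(sd.doc + " " + sd.score);
        }
        reader.close();
    }
}
```

Even though only 10 documents are printed, every document matching either term was scored, and 1000 of them were kept in the collector's priority queue.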

I don't know how Google manages to score results efficiently, but Twitter published a paper on their retrieval engine Earlybird where they explain how they patched Lucene in order to do efficient reverse chronological order traversal of the posting lists, which allows them to return the most recent tweets matching a query without traversing the entire posting list for every term.
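To illustrate the idea behind both Earlybird's ordering trick and the comment above about making index order match relevance order: recent Lucene versions (6.2+) let you sort segments by a static-rank field at index time via IndexWriterConfig.setIndexSort, so a collector sorting on the same field can stop after the first k hits per segment instead of traversing the whole postings list. This API did not exist when this answer was written; the "rank" doc-values field below is a hypothetical stand-in for a PageRank-like score:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StaticRankIndexSort {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        // Sort the index so documents with the highest static score
        // get the smallest doc ids within each segment.
        Sort byStaticRank = new Sort(new SortField("rank", SortField.Type.LONG, true));

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setIndexSort(byStaticRank);

        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "yellow dog", Field.Store.NO));
            doc.add(new NumericDocValuesField("rank", 42L)); // static score used for the sort
            writer.addDocument(doc);
        }
        // A collector sorting on "rank" can then terminate after the first k hits
        // per segment, because postings are visited in decreasing static rank.
    }
}
```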

Update: I found this presentation from Googler Jeff Dean, which explains how Google built its large-scale information retrieval system. In particular, it talks about sharding strategies and posting list encoding.
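For a flavour of what "posting list encoding" means there, here is a sketch of the textbook delta + variable-byte scheme: doc ids are stored as gaps, and each gap is written in as few 7-bit groups as needed. Lucene's classic codec uses a similar VInt encoding of doc-id gaps; Google's actual format is not public.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Generic delta + variable-byte (VByte) compression of a sorted posting list.
public class VBytePostings {

    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - prev;          // store gaps, which are small numbers
            prev = docId;
            while (gap >= 0x80) {            // 7 bits per byte, high bit = "more bytes follow"
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0, shift = 0, b;
            do {
                b = bytes[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;                     // undo the delta encoding
            docIds[i] = prev;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] postings = {3, 17, 18, 250, 1_000_000};
        byte[] packed = encode(postings);
        System.out.println(packed.length + " bytes: "
                + Arrays.toString(decode(packed, postings.length)));
    }
}
```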

Marcelline answered 26/4, 2012 at 8:34
Thanks a lot for the answer, I'll dig into the Twitter link to see if I can find more references. – Farny
If the entire postings list is traversed, that seems to suggest that Lucene is not really feasible for web-scale search, since terms like "yellow dog" are bound to match billions of web pages. Even after aggressive partitioning, the time for traversing the postings on each box would be too long. – Farny
@user933882 Well, it might require some tweaking, but the Twitter example shows that Lucene can be used to handle huge amounts of rapidly changing data. – Marcelline
Thanks a lot for the references, they are very helpful. – Farny
