How does lucene index documents?
Asked Answered
S

4

120

I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?

On the above link, it says Lucene uses this algorithm for indexing:

  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=8

for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}

How does this algorithm provide optimized indexing?

Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?

Subscription answered 8/4, 2010 at 17:51 Comment(2)
Most answers here are correct that first Lucene creates inverted index, but that does not explain the key point of how that term index subsequently gets searched (and is, I believe, what the OP actually asked for). So below please find a new answer to this rather old question that hopefully provides better insight.Pericardium
Updated my answer once more, because the current answers (including mine!) are not really satisfactory to answer the OP's main two questions (how does Lucene provide optimized indexing and by which particular algorithm - a Skip-List, not a B-Tree, BTW). Hope my final updates now properly answer the actual question!Pericardium
P
79

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).

So once the inverted (term) index - which is based on a Skip-List data structure - is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly, only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologick.

Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea for how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linear). And he points to the paper that describes the particular FST algorithm Lucene uses, too.

For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, not so much make concurrent updates of the index "more amenable" (because you can decide to not re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either are "done right.")

Pericardium answered 4/4, 2017 at 9:30 Comment(1)
Afaik they are using Skip List instead of B-tree to reduce the number of disk seeks, since the part of Skip List resides in memory and very few disk IO requires when traversing indexInfarct
T
58

There's a fairly good article here: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/

Edit 12/2014: Updated to an archived version due to the original being deleted, probably the best more recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html

There's an even more recent version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description, but it seems to have less information in it than the older one.

In a nutshell, when lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.

Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the english language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/

When a query is issued it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.

Thymelaeaceous answered 9/4, 2010 at 13:22 Comment(4)
Thanks for your answer and links . But i heard that Lucene project has a special stemmer named "Snowball"? Do you heard anything about that ?Subscription
This is a different question: See lucidimagination.com/search/… Other than that, seeing your question pattern I suggest you read the 'Lucene in Action' book: manning.com/hatcher2 (First edition is a bit dated, but can be found in a dead tree version. The second edition can be bought as an e-book).Mustache
May you modify you answer, the first link which is an IBM link is not found :)Histaminase
Also, how do fields enter the whole picture? If a query is on a specific field, how and at which point does lucene know that the term that points to the document is not anywhere in the document, but inside a requested field?Lebel
R
29

It seems your question more about index merging than about indexing itself.

Indexing process is quite simple if you ignore low-level details. Lucene form what is called "inverted index" from documents. So if document with text "To be or not to be" and id=1 comes in, inverted index would look like:

[to] → 1
[be] → 1
[or] → 1
[not] → 1

This is basically it – the index from the word to the list of documents containing given word. Each line of this index (word) is called posting list. This index is persisted on long-term storage then.

In reality of course things are more complicated:

  • Lucene may skip some words based on the particular Analyzer given;
  • words can be preprocessed using stemming algorithm to reduce flexia of the language;
  • posting list can contains not only identifiers of the documents, but also offset of the given word inside document (potentially several instances) and some other additional information.

There are many more complications which are not so important for basic understanding.

It's important to understand though, that Lucene index is append only. In some point in time application decide to commit (publish) all the changes in the index. Lucene finish all service operations with index and close it, so it's available for searching. After commit index basically immutable. This index (or index part) is called segment. When Lucene execute search for a query it search in all available segments.

So the question arise – how can we change already indexed document?

New documents or new versions of already indexed documents are indexed in new segments and old versions invalidated in previous segments using so called kill list. Kill list is the only part of committed index which can change. As you might guess, index efficiency drops with time, because old indexes might contain mostly removed documents.

This is where merging comes in. Merging – is the process of combining several indexes to make more efficient index overall. What is basically happens during merge is live documents copied to the new segment and old segments removed entirely.

Using this simple process Lucene is able to maintain index in good shape in terms of a search performance.

Hope it'll helps.

Respondent answered 1/10, 2015 at 1:58 Comment(2)
So for the sake of finding the most up-to-date results first, would a search start by looking at the newest segments? So just to clarify - suppose a document is updated. The older version of the document is added to the kill list, then any matches that are found in older segments are removed from the search results if their document id matches an id in the kill list?Leisha
Yes, you are correct. The only thing to mention is the final order is defined using sorting rules (relevance index in trivial case), thus the order in which segments are searched isn't relevant.Respondent
G
13

It is inverted index, but that does not specify which structure it uses. Index format in lucene has complete information.
Start with 'Summary of File Extensions'.

You'll first notice that it talks about various different indexes. As far as I could notice none of these use strictly speaking a B-tree, but there are similarities - the above structures do resemble trees.

Greenshank answered 9/4, 2010 at 8:36 Comment(1)
Lucene's inverted index is based on a skip-list, not a B-tree. Still a tree-like structure in a very broad sense, but just to be complete - e.g., see this SO question re. Lucene's use of a skip-list and this SO question why skip-lists might be preferable over B-trees.Pericardium

© 2022 - 2024 — McMap. All rights reserved.