How to define a primary key field in a Lucene document to get the best lookup performance?
When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):

doc.add(new StringField("uid", uid, Field.Store.YES));

To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:

searcher.search(new TermQuery(new Term("uid", uid)), 1);

Being a Lucene "novice", I would like to know the following:

  1. How should I improve this approach to get the best lookup performance? Would it, for example, make a difference if I store the unique id as a byte array instead of as a string? Or are there some special codecs or filters that can be used?

  2. What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
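For reference, here is the setup above as a minimal, self-contained sketch (the index path and the example uid value are hypothetical; the analyzer choice does not matter for a non-analyzed StringField):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    public class UidLookupExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new MMapDirectory(Paths.get("/tmp/uid-index")); // hypothetical path
            // Index a single document carrying a unique id as a non-analyzed term.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("uid", "doc-42", Field.Store.YES)); // indexed as one term, stored for retrieval
                writer.addDocument(doc);
            }
            // Look the document up again by its unique id.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("uid", "doc-42")), 1);
                if (hits.totalHits > 0) {
                    System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("uid"));
                }
            }
        }
    }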

Atlas answered 1/1, 2018 at 15:23 Comment(7)
It looks like you are searching through a non-optimized index. Did you call a full optimize after each commit? Under the hood, Lucene creates a new segment on each commit.Shanghai
@Ivan: According to the documentation, the IndexWriter does not have an optimize function (lucene.apache.org/core/7_2_0/core/org/apache/lucene/index/…). Or are you referring to maybeRefresh?Atlas
@Ivan: ... I meant maybeMerge ...Atlas
I've checked your graphs: this is definitely the behavior of a non-optimized index. You can build another graph of index size over time, and it will have a shape similar to the query latency. Take a look at the commit spikes: these are index optimizations, which merge small segments into bigger ones, and at the same time the query latency drops significantly. In order to merge all segments, use the following method lucene.apache.org/core/7_2_0/core/org/apache/lucene/index/… with the integer parameter maxNumSegments=1 (see the sketch after this comment thread).Shanghai
@Ivan: I thought the spikes occurred because Lucene automatically merges all segments into ONE. However, this doesn't seem to be the case: forcing a merge into a single segment, as you suggested, seems to do the trick. I will report back when I have re-done all benchmarks. Thanks very much for your advice.Atlas
If you want to know how Lucene picks segments to merge during indexing, you can read the following post blog.mikemccandless.com/2011/02/…Shanghai
@Ivan: As always, thank you for your input. You were absolutely right: after forcing a merge into 1 segment, the lookup times stayed practically constant, even when storing a lot of text.Atlas
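A minimal sketch of the forced merge suggested in the comments above (Lucene 7.x API; the index path is hypothetical):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    public class ForceMergeExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new MMapDirectory(Paths.get("/tmp/uid-index")); // hypothetical path
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                // ... index or update documents here ...
                writer.forceMerge(1); // merge all segments into a single segment
                writer.commit();
            }
        }
    }

Note that forceMerge(1) rewrites the whole index and is expensive, so it is only worth running once after bulk indexing, on an index that is no longer (or rarely) updated.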

Theory

There is a blog post about the Lucene term index and lookup performance. It clearly lays out the details of the complexity of looking up a document by id. The post is quite old, but little has changed since then.

Here are some highlights related to your question:

  • Lucene is a search engine where the minimum element of retrieval is a text term, which means that binary, numeric, and string fields are all represented as strings in the BlockTree terms dictionary.
  • In general, lookup complexity depends on the term length: Lucene uses an in-memory prefix-trie index structure to perform a term lookup. Due to the restrictions of real-world hardware and software (in order to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure: it stores the prefix trie in small chunks on disk and loads only one chunk at a time. This is why it is so important to generate keys in an easy-to-read order. Let's arrange the factors according to the degree of their influence (a low-level lookup sketch follows this list):
    • term length: longer terms mean more chunks to load
    • term pattern: predictable, prefix-friendly keys avoid superfluous reads
    • term count: fewer terms mean fewer chunks
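To make this concrete, here is a sketch of what a single-term lookup does below the level of TermQuery, using the Lucene 7.x low-level API (field name "uid" as in the question; deletions are ignored for brevity):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // Returns the docID for the given uid, or -1 if it is absent.
    static int lookupUid(IndexReader reader, String uid) throws IOException {
        Terms terms = MultiFields.getTerms(reader, "uid"); // the field's terms dictionary (BlockTree by default)
        if (terms == null) return -1;
        TermsEnum te = terms.iterator();
        // seekExact walks the in-memory prefix index, then reads at most one on-disk block.
        if (!te.seekExact(new BytesRef(uid))) return -1;
        PostingsEnum postings = te.postings(null, PostingsEnum.NONE); // no freqs/positions needed
        int doc = postings.nextDoc();
        return doc == DocIdSetIterator.NO_MORE_DOCS ? -1 : doc;
    }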

Algorithms and Complexity

Let a term be a single string and let the term dictionary be a large set of terms. If we have a term dictionary and we need to know whether a single term is in the dictionary, a trie (and the minimal deterministic acyclic finite state automaton (DAFSA), which can be seen as a minimized trie) is the data structure that can help us. To your question, “Why use tries if a hash lookup can do the same?”, here are a few reasons:

  • A trie can find a string in O(L) time, where L is the length of the term. This is slightly better than a hash table in the worst case (a hash table requires a linear scan in case of hash collisions, plus a sophisticated hashing algorithm like MurmurHash3), and comparable to a hash table in the ideal case.
  • A hash table can only find dictionary terms that exactly match the single term we are looking for, whereas a trie also lets us find terms that differ by a single character, share a common prefix, miss a character, etc.
  • The trie can provide an alphabetical ordering of the entries by key, so we can enumerate all terms in alphabetical order.
  • The trie (and especially DAFSA) provides a very compact representation of terms with deduplication.

Here is an example of a DAFSA for 3 terms (bath, bat and batch): [figure: example of a DAFSA data structure]

In the case of a key lookup, notice that descending a single level in the automaton (or trie) is done in constant time, and every time the algorithm descends a level, a single character is consumed from the term, so we can conclude that finding a term in an automaton (trie) takes O(L) time.
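To make the O(L) argument concrete, here is a minimal trie sketch in plain Java. This is an illustration only, not Lucene's actual implementation (Lucene uses the FST-based BlockTree index described above):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal trie: each loop iteration descends one level and consumes one character.
    class Trie {
        private final Map<Character, Trie> children = new HashMap<>();
        private boolean terminal; // true if a term ends at this node

        void insert(String term) {
            Trie node = this;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Trie());
            }
            node.terminal = true;
        }

        boolean contains(String term) {
            Trie node = this;
            for (char c : term.toCharArray()) { // exactly L iterations
                node = node.children.get(c);    // constant-time descent
                if (node == null) return false;
            }
            return node.terminal;
        }
    }

For example, after inserting bath, bat and batch, a call to contains("bath") descends exactly four levels (b, a, t, h), independent of how many other terms the trie holds.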

Shanghai answered 1/1, 2018 at 21:50 Comment(9)
Thanks for your answer. I had read the article before, but due to my limited theoretical knowledge in that field I obviously didn't understand all parts of it. Also I was hoping that by now there might be something like a hash-based index implementation that would allow almost constant lookup times (O(1)). But for reasons unknown to me this does not seem to be the case.Atlas
In general, the lookup complexity is O(1) with respect to the number of documents and has performance comparable to a hash-based index. On the other hand, a trie-based index allows the index to be compressed and provides fuzzy matching, alphanumeric sorting, regular expressions, term enumeration, and other features for free. I am going to update my answer now to provide more details.Shanghai
If you find the time for it, it would be great if you could add an explanation of why the lookup complexity is O(1) to the answer.Atlas
Thanks for the great explanation - I wish I could upvote this answer more than once :)Atlas
Sorry for bothering you again: I did some benchmarking to check the performance of term searches in Lucene (ranging from 10k to 5M documents). The graphs indicate that the lookup time increases LINEARLY with the number of documents. The slope gets steeper the more data is stored within each document. Even if documents only contain their unique id as a non-stored StringField, the lookup time increases by a factor of ~2 between the 10k and 5M document benchmarks. I do believe that your theoretical assumptions are correct, but do you have an explanation for this?Atlas
@Atlas could you please share your experiment (it would be great to have a code snippet)? There are a bunch of speculative optimizations in the Lucene term index and other implementation details. For example, the term index is shared across all terms in the index. Did you try memory-mapped files? Did you see any IO?Shanghai
(1.) Yes, I'm using memory-mapped files (MMapDirectory). (2.) The IO was minimal, especially for the benchmarks where nothing was stored. (3.) What would be the best way to share all the benchmark code and data/figures? Should I add it as an update to my question? (4.) You should not feel forced to help me with this just because you answered my initial question. If you don't want to waste your time with something that could also be a benchmarking error, just say so :)Atlas
@Atlas you can use whatever online code snippet tool you like and share the link here.Shanghai
I added some information, graphs and code as an update to my question.Atlas
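For reference, a sketch of the kind of lookup micro-benchmark discussed in this thread. The "uid-<n>" naming scheme is an assumption, and a serious measurement should use a harness such as JMH with proper warm-up:

    import java.io.IOException;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    // Measures the average uid-lookup latency in microseconds,
    // assuming documents were indexed with uids "uid-0" .. "uid-<numDocs-1>".
    static double avgLookupMicros(IndexSearcher searcher, int numDocs, int iterations) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            searcher.search(new TermQuery(new Term("uid", "uid-" + (i % numDocs))), 1);
        }
        return (System.nanoTime() - start) / (iterations * 1000.0);
    }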
