Why does Lucene use arrays instead of hash tables for its inverted index?

Asked 21/7, 2017 at 4:16 Answered 22/2, 2022 at 17:52

Solved arrays indexing lucene hashmap hashtable

I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?

Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer up O(logN) access and (gasp) O(N) insertion, although merging 2 sorted arrays is the same complexity as merging 2 hash tables.

The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).

So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?

Tola answered 21/7, 2017 at 4:16 Comment(2)

excellent question! – Furculum 21/7, 2017 at 9:2

Multiple reasons why Lucene does not make use of hash tables have been provided by @Ivan in this answer: https://mcmap.net/q/1783436/-how-to-define-a-primary-key-field-in-a-lucene-document-to-get-the-best-lookup-performance – Guffaw 3/1, 2018 at 23:34

Well, I will speculate here (should probably be a comment - but it's going to be too long).

HashMap is in general a fast look-up structure that has search time O(1) - meaning it's constant. But that is the average case; since (at least in Java) a HashMap uses TreeNodes - the search is O(logn) inside that bucket. Even if we treat that their search complexity is O(1), it does not mean it's the same time wise. It just means it is constant for each separate data structure.
Memory Indeed - I will give an example here. In short storing 15_000_000 entries would require a little over 1GB of RAM; the sorted arrays are probably much more compact, especially since they can hold primitives, instead of objects.
Putting entries in a HashMap (usually) requires all the keys to re-hashed that could be a significant performance hit, since they all have to move to different locations potentially.
Probably one extra point here - searches in ranges, that would require some TreeMap probably, wheres arrays are much more suited here. I'm thinking about partitioning an index (may be they do it internally).
I have the same idea as you - arrays are usually contiguous memory, probably much easier to be pre-fetched by a CPU.
And the last point: put me into their shoes, I would start with a HashMap first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.

Furculum answered 21/7, 2017 at 9:2 Comment(4)

Thanks for the answer! I'm thinking it might also have to do with the fact that Lucene has to generalize to more than just text terms, and hashing arbitrary terms might be quite a hit. But I'll see if I can do a little bit of an experiment to see how HashMap and arrays compare for text indexing. – Tola 21/7, 2017 at 18:27

Don't forget the immutability of their setup. – Caucasus 12/10, 2017 at 6:29

@AnthonyDeMeulemeester I have no idea how lucene is set-up, like zero knowledge, thx for the feedback – Furculum 12/10, 2017 at 6:56

Lucene creates a segment for each doc you index and when there are too many segments they merge them into 1 segment. This makes it immutable as they don't update existing memory. – Caucasus 12/10, 2017 at 8:48

I was thinking of the reasoning behind it. Just thought of one use-case that was important in the context of text search. I could be totally wrong :)

Why sorted array and not Dictionary?

Yes, it performs well on range queries, but IMO Lucene was mainly built for text searches. Now imagine you were to do a search for prefix-based queries Eg: country:Ind*, you will need to scan the whole HashMap/Dictionary. Whereas this becomes log(n) if you have a sorted array.

Since we have a sorted array, it would be inefficient to update the array. Hence, in Lucene segments(inverted index resides in segments) are immutable.

Sard answered 22/2, 2022 at 17:52 Comment(0)

Recommended topics

Hot tags