C++ Most efficient way for storing, loading and looking up a lexicon

Asked 21/5, 2013 at 15:34 Answered 21/5, 2013 at 16:48

I have a dictionary that consists of words and their phonetic transcriptions. The words are all lower case, so there is not case-sensitive search involved.

The lexicon is really huge, and I need to load it quickly when my application starts. I would prefer reading it without having to read each entry separately.

I guess the way I stored and load it also affects how I keep the lexicon in memory

Thank you for any ideas.

Lailalain answered 21/5, 2013 at 15:34 Comment(4)

How huge is "really huge"? Do you plan on loading the whole lexicon in your application's memory, or reading it from a file or database? Also, what types of operations will the structure need to do efficiently? Mainly lookup, or enumeration as well? – Tensile 21/5, 2013 at 15:41

Really huge meaning 200.000 words. I would like to load it into memory. I only need to look up words, no writing or displaying. – Lailalain 21/5, 2013 at 17:3

Do you need to search with "typos" and wild-chars ? – Oraleeoralia 21/5, 2013 at 17:13

@MartinPerry No, just a 1:1 lookup. – Lailalain 22/5, 2013 at 8:1

You probably want to store this as a Trie

This is an efficient way of storing a dictionary. Look at the following answers for more information

http://en.wikipedia.org/wiki/Trie

https://stackoverflow.com/questions/296618/what-is-the-most-common-use-of-the-trie-data-structure

Persisting a trie to a file - C

Middlebrooks answered 21/5, 2013 at 15:40 Comment(2)

Note that unless special care is taken, a trie will have fairly significant memory requirements. – Sigridsigsmond 21/5, 2013 at 15:49

Though when done properly, a trie is the probably most efficient way to store a dictionary, thanks to prefix compression. – Gillan 21/5, 2013 at 17:0

A few options come to mind:

You could use sqlite, which uses mmap to map the file to memory, to store the lexicon so only what is accessed gets read. This is probably reasonable fast and reliable as well as the easiest to implement.
You can mmap the file yourself
Use seek operations to move the file pointer through the file without reading the whole thing. This will only help if the lexicon is structured in some way so you can find the right position without reading everything, i.e. it has to be a data structure that allows better than O(n) searching (a Trie usually being a good choice, as suggested by Salgar).

Tamandua answered 21/5, 2013 at 16:48 Comment(6)

Let's say I memory-map the file, and I know at which positions words start (for example: "a" words start at pos 1, "b" words start at pos 93229), how would I structure my file? Do I have to work with fixed lengths or what did you mean by mmapping the file? – Lailalain 21/5, 2013 at 16:57

My application is plain C++ code without any third party libraries, and although I love SQLite, I would opt for not using it in this case. – Lailalain 21/5, 2013 at 17:1

Combine the two answers, and mmap a trie. – Bair 21/5, 2013 at 17:6

If you know exactly where to look: great, that should work in terms of structure (or you can use a trie). mmap maps part of the file to memory, so you can access it like an array. Check out the manpage. It also has a good example of how to do it. Be aware that this low-level approach is not for the faint of heart though, because you have to page-align the offset among other things. – Tamandua 21/5, 2013 at 17:9

@Tamandua I am not sure how mmap'ing a dictionary would work. Even though I can access the bytes quickly, how would I search for an entry? I would have to iterate through the entire map, I think. That would not really make sense in my opinion. – Lailalain 29/5, 2013 at 17:29

@tmighty, in your first comment you said you already know where to start looking, so why would you need to iterate through the entire structure? – Tamandua 30/5, 2013 at 17:52

Recommended topics

Hot tags