Optimal data structure for a special dictionary

Which data structure is best in terms of computational complexity for implementing a dictionary of (key,val) items that must support only the following operations:

  1. Insert(key) - appends an item (key,val) with val=1
  2. Increment(key) - increments the val of an existing (key,val)
  3. Find(key) - returns the val of (key,val)
  4. Select(part_of_key) - returns all items (key,val) for which strstr(key,part_of_key)!=NULL, in the form of a new dictionary of the same type (without allocating new memory if possible); for example, if the dictionary is {(red,3), (blue,4), (green,1)}, then Select(re)={(red,3), (green,1)}
  5. Max(i) - returns the item with the i-th maximal value among all items; for example, if the dictionary is {(red,3), (blue,4), (green,1)}, then Max(1)=blue, Max(2)=red, Max(3)=green

The keys are strings and the values are positive integers. The number of items in the dictionary is expected to be very large.
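
(For concreteness, one possible C++ rendering of this interface; the exact signatures below are just my assumption of what the operations could look like.)

    #include <cstddef>
    #include <string>
    #include <utility>

    // Hypothetical interface sketch of the five operations, not a fixed API.
    struct SpecialDict {
        void Insert(const std::string& key);                   // append (key, 1)
        void Increment(const std::string& key);                // ++val of existing key
        int Find(const std::string& key) const;                // val of key
        SpecialDict Select(const std::string& part) const;     // strstr(key, part) != NULL
        std::pair<std::string, int> Max(std::size_t i) const;  // i-th maximal value
    };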

I think it must be a synthesis of two different data structures. But should it be a hash table + a binary tree, a trie + a sorted array, or something else?

Upheld answered 23/9, 2011 at 11:14 Comment(3)
This question seems to have nothing in common with C... I recommend removing that tag and adding data-structure or something.Bikales
Notice that 4 can never be faster than the number of items it has to return, so if a select will often return all, half or a tenth of the items, you could just as well run through all the keys.Winthrop
Can you give more info about how frequently you will be performing the various operations? I.e. will you be finding much more frequently than inserting, do you know how many elements you will need in advance, ...Provender

A combination of a balanced tree (such as a red-black tree) and a suffix tree (or suffix array).

  • Balanced tree: operations 1, 2 (implemented as remove + insert), 3, and 5.
  • Suffix tree: operation 4.

NOTE: A hash table will not be able to support operation 5 efficiently.

I think you'll have a hard time implementing the suffix tree. You could possibly use Mark Nelson's C++ implementation of Ukkonen's algorithm, but it has memory leaks and is essentially a singleton, so you'll need to clean it up before it's ready for production use. Even after you fix it, you'll need to adjust it to work with your "other" data structure (the balanced tree in my proposal) instead of one big plain string.

If you do operation 1 more frequently than operation 4, and/or you can live with a linear-time operation 4, I recommend skipping the whole suffix-tree complication and just traversing your data structure linearly.
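
To make the balanced-tree half concrete, here is a minimal sketch, assuming std::map/std::multimap as the red-black trees (the layout and names are mine, not a fixed design): one tree keyed by string serves operations 1-3, a second tree ordered by value serves operation 5, and Increment does the remove + insert mentioned above. Operation 4 would be delegated to the suffix structure, which is not shown.

    #include <cstddef>
    #include <functional>
    #include <iterator>
    #include <map>
    #include <string>
    #include <utility>

    class Dict {
        std::map<std::string, int> byKey;                          // ops 1-3
        std::multimap<int, std::string, std::greater<int>> byVal;  // op 5
    public:
        void Insert(const std::string& key) {        // assumes key is new
            byKey.emplace(key, 1);
            byVal.emplace(1, key);
        }
        void Increment(const std::string& key) {     // remove + insert in byVal
            int& v = byKey.at(key);                  // throws if key is absent
            auto range = byVal.equal_range(v);
            for (auto it = range.first; it != range.second; ++it)
                if (it->second == key) { byVal.erase(it); break; }
            ++v;
            byVal.emplace(v, key);
        }
        int Find(const std::string& key) const { return byKey.at(key); }
        // 1-based; the O(i) walk could be O(log n) with an order-statistic tree.
        std::pair<std::string, int> Max(std::size_t i) const {
            auto it = byVal.begin();
            std::advance(it, i - 1);
            return {it->second, it->first};
        }
    };

The equal_range scan makes Increment O(log n + t) where t is the number of keys tied at the old value; storing each key's byVal iterator alongside its value would remove that scan.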

Eustacia answered 23/9, 2011 at 12:22 Comment(3)
I think that property 1 is very important, because it allows O(1) insertions if a list is used to keep all values in sorted order (a new item always starts with the minimal value 1). Property 2 should also allow fast updates (probably O(log n) with binary search). I need operation 4 followed by operation 5 (performed on the obtained subset) to be the fastest.Upheld
@Upheld If operation 4 is really that important, you won't escape some form of full-text indexing, such as the suffix tree.Eustacia
@psihodelia: O(1) insert is not available even with hashing; you cannot read the input string in O(1). Perhaps O(S), where S is the string's length?Unsuspecting

For the first three operations, a hash table might be a good idea.

For the 4th operation (Select with part_of_key), the hash function you used to compute the hash value of a whole key won't work; you would have to handle this differently. Since you want to support partial matches and your keys are strings, you may want to use a suffix tree or a trie.

For the 5th operation (i-th max element), you may want to maintain a heap or a sorted linked list (or skip list) that interacts with the hash table.

You will also have to look at the various use cases and find which operations should be optimized. For example: if you have lots of queries on the part_of_key operation, you should use a suffix-tree/LC-trie kind of structure, and that will give good results. However, your Find operation may then not be as fast, as it will take O(log N) instead of a constant-time look-up.

To summarize, you need to integrate a hash table, a heap, and a suffix tree to achieve all the operations.
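
To ground the hash-table part, here is a minimal sketch of operations 1-3 alone, assuming std::unordered_map (Select and Max would come from the suffix structure and heap discussed above):

    #include <string>
    #include <unordered_map>

    // Operations 1-3 on a plain hash table: expected O(S) per call, where S
    // is the key length (the hash must read the whole string).
    class HashPart {
        std::unordered_map<std::string, int> table;
    public:
        void Insert(const std::string& key)     { table.emplace(key, 1); }
        void Increment(const std::string& key)  { ++table.at(key); }
        int  Find(const std::string& key) const { return table.at(key); }
    };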

Ingate answered 23/9, 2011 at 11:23 Comment(6)
Find a string in constant time? How would you do that? Constant time means you cannot even read the entire input string.Unsuspecting
@amit: What's your question? Are you asking how a hash table finds a key-value pair in constant time?Ingate
No, I'm pointing out that it is impossible to implement a data structure with an accurate find() for strings in O(1), since at the very least you need to read the string, which is O(S), where S is the string's length.Unsuspecting
In a dictionary with N words (each of length S), O(S) << O(N). Typically, that is considered constant. Yes, your point is valid that we will at least need to look at the S characters of the string. That is the lower bound.Ingate
Your claim is wrong. In the literature, a string look-up is always regarded as O(S). It takes more time to look up a 200-character string than a 20-character string, regardless of how many strings are in the dictionary.Unsuspecting
If O(S) is a big concern, one can use an LC-trie, which considerably reduces the string look-up time on average. See academypublisher.com/jnw/vol02/no03/jnw02031827.pdf for details.Ingate

While the exact answer depends on the frequency of your operations, your list of options should include a suffix array.
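
As a sketch of how a suffix array would serve operation 4 (my own naive construction, for illustration only; a production build would sort integer offsets rather than copied suffixes): every suffix of every key is recorded with the index of its owning key, and Select(part_of_key) becomes a binary search for the range of suffixes that start with part_of_key.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    struct SuffixArray {
        std::vector<std::string> keys;
        // (suffix, index of owning key); copying suffixes costs quadratic
        // memory -- real suffix arrays store (key, offset) pairs instead.
        std::vector<std::pair<std::string, std::size_t>> suffixes;

        void build(std::vector<std::string> ks) {
            keys = std::move(ks);
            for (std::size_t k = 0; k < keys.size(); ++k)
                for (std::size_t i = 0; i < keys[k].size(); ++i)
                    suffixes.emplace_back(keys[k].substr(i), k);
            std::sort(suffixes.begin(), suffixes.end());
        }

        // Indices of keys containing part as a substring (duplicates possible
        // when a key contains part more than once).
        std::vector<std::size_t> select(const std::string& part) const {
            std::vector<std::size_t> out;
            auto it = std::lower_bound(suffixes.begin(), suffixes.end(),
                                       std::make_pair(part, std::size_t{0}));
            for (; it != suffixes.end() &&
                   it->first.compare(0, part.size(), part) == 0; ++it)
                out.push_back(it->second);
            return out;
        }
    };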

Levantine answered 23/9, 2011 at 11:21 Comment(0)

I think a trie with several shortcuts would be best.

Operations 1-3 are already supported by a trie as fast as possible (proportional to the length of the key).

For 4 I would add an additional 256-entry table that, for each character, lists the trie nodes that can be entered with that character. This way I can do fast lookup of parts of the string, without ever explicitly constructing or storing the suffixes as in a suffix tree.

For 5 I would add a linked list over the leaves that gets updated to stay sorted when data is inserted. You would maybe need another reference to the head of the list, and back-links in the trie, to find the right value and get the key.

The memory complexity does not change from the plain trie with these additions, since it is bounded by the number of trie nodes. However, insertion time might change because of the inefficient sorting of the linked list.

The final structure should probably be called a hash-leaf-trie. But I would not want to be implementing such a beast.
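
For what it's worth, here is a structural sketch of such a node (all field names are mine and hypothetical; no logic is implemented):

    #include <cstdint>
    #include <vector>

    struct TrieNode {
        TrieNode* children[256] = {};   // ordinary trie edges, one per byte
        TrieNode* parent = nullptr;     // back-link, to reconstruct the key
        std::uint8_t edgeChar = 0;      // character on the edge from parent
        bool terminal = false;          // true if a key ends here
        int value = 0;                  // val, meaningful only if terminal
        // Op 5: doubly linked list threading the terminal nodes, kept
        // sorted by value on every Insert/Increment.
        TrieNode* prevLeaf = nullptr;
        TrieNode* nextLeaf = nullptr;
    };

    // Op 4: for each character c, the trie nodes that can be entered with c,
    // so a partial match can start in the middle of any key.
    std::vector<TrieNode*> enterableWith[256];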

Jeopardy answered 23/9, 2011 at 14:5 Comment(0)

I think an efficient data structure here would be a modified trie with bi-directional links [so you can go from a leaf to the root and reconstruct the word], where each terminating node has an additional 'value' field.
You will also need a multimap [i.e. a map in which each value is a set of addresses].
The keys will be arranged in a tree, and each set of values will be hash based. [In Java, it would be something like TreeMap<Integer,HashSet<Node>>.]

Pseudo-code: [well, very pseudo... it just shows the general ideas for each op].

Insert(key):
1. Insert the word into the trie.
2. Set the value `1` at the terminating node.
3. Add the terminating node to the multimap [i.e. map.add(1,terminating_node_address)].
Increment(key):
1. Walk the word in the trie and store the value from the terminating node as temp.
2. Delete the entry (temp,terminating_node_address) from the multimap.
3. Insert the entry (temp+1,terminating_node_address) into the multimap.
4. Increase the value at the terminating node by 1.
Find(key):
1. Find the terminating node for the key in the trie; return its value.
Select(part_of_key):
1. Go to the last node you can reach with part_of_key.
2. From this node, do a BFS and retrieve all possible words [store them in an empty set you later return].
Max(i):
1. Find the i-th biggest key in the multimap.
2. Choose an arbitrary address from its set and get the relevant terminating node.
3. Rebuild the string by following the up-links to the root, and return it.

Complexities:
Let n be the number of strings, k the maximum value, and S a string's length.
Insert: O(S) [trie insertion is O(S)]. The smallest element (1) in the map can be cached, and thus accessed in O(1).
Increment: O(S+logk): finding the string in the trie is O(S), finding the relevant key in the map is O(logk), deleting an element from a hash set is O(1), finding the next element in the tree is also O(logk) [it can actually be O(1) for a skip-list-based map], and inserting a value is O(1). Note that by linking each terminating node to the relevant key in the map, the complexity can even be improved to O(S), since no search is actually required.
Find: O(S): just find the key in the trie and return its value.
Select: O(S*n): in the worst case you need to retrieve all elements, which forces you to iterate over the whole trie to find all words.
Max: O(logk+S): find the key in the sorted map, then reconstruct the word.

EDIT: Note that here I assumed that for Max(i) you want the i-th distinct biggest value. If that is not the case, you will need to modify the map a bit and use a multimap that allows duplicate keys [instead of a set as the value]. This will make all the tree-based ops O(logn) instead of O(logk), still efficient in my opinion...
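
To illustrate the multimap half in isolation (a sketch under the assumption that the trie side hands us a terminating-node pointer; all names are mine), the C++ analogue of TreeMap<Integer,HashSet<Node>> could look like:

    #include <cstddef>
    #include <iterator>
    #include <map>
    #include <unordered_set>

    struct Node;  // terminating trie node, provided by the trie side

    // For each value, the set of terminating nodes currently holding it.
    using ValueIndex = std::map<int, std::unordered_set<Node*>>;

    // Increment, steps 2-4: move a node from bucket temp to bucket temp+1.
    inline void moveUp(ValueIndex& index, Node* node, int temp) {
        auto bucket = index.find(temp);   // O(logk)
        bucket->second.erase(node);       // O(1) expected
        if (bucket->second.empty())
            index.erase(bucket);          // drop empty buckets
        index[temp + 1].insert(node);     // O(logk) + O(1) expected
    }

    // Max(i), step 1: the bucket of the i-th distinct biggest value (1-based).
    inline const std::unordered_set<Node*>& ithBucket(const ValueIndex& index,
                                                      std::size_t i) {
        auto it = index.rbegin();         // biggest value first
        std::advance(it, i - 1);          // O(i) walk, as discussed above
        return it->second;
    }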

Unsuspecting answered 26/9, 2011 at 13:40 Comment(0)