Data structure to search name of files and get its path

I will be inserting names of files in a dynamically way, approximately till 1 billion of names. Besides, I do also want to store the path where the files are located in order to do the following queries:

Searching wheter the name of a file is stored in order to get its path.
Searching the name of all the files which matches with a substring, a kinda of a like-query (e.g. If a search *o*, it will return me joel, hola, ola, oso, osea, algo, if a search aa*, it will return me aaab and if I search *so, It will return oso).
Delete the name of a file.

So, I am trying to make a kind of trie data structure in the following way:

I got 26 nodes (the english alphabet a-z, I am not going to put all the nodes in the image because space) such that if I insert the word "hola" then I create an edge from node with the letter 'h' to node with the letter 'o' and whose edge has a data 1 since this number represent the level of the depth. Furthermore, in the node where 'a' is stored, I will have a map structure in order to store the path of the file, this is because I will surely have a lot of paths stored in the node which contains the letter 'a'.

Having said that, I inserted the following words: joel, hola, ola, oso, osea, algo, aaab.

I have done so because I do not want to have many nodes with the sama lettres (e.g. a, b, etc) but the problem is that I got a lot of edges and the sctructure needs

formula

bytes of memory (I am programming in C++) where w is a string of size formula .

As you can see, if I search for the name of file "jola" (which is not inserted) no path will returned and this tell us that such file is not stored.

How can I improve this? Is it any way to reduce the number of edges? or there exists a better structure and way to do this? I am very open to hear of any suggestion.

I have solved a somewhat similar problem in the past, for storing wordlists for crossword puzzles, and finding words very fast. I called it a "superindex". My main goal was speed, not storage size, but the original question does not state what the author considers as an "improvement": maybe it is size, maybe it is speed, maybe it is algorithm complexity. My approach yielded tremendous speed at relatively small complexity, but with rather moderate savings in storage size. The approach was as follows:

Use a tree, not a graph. Each node in the tree has up to 26 "entries". Each entry stands for a letter of the alphabet and contains a link to a child node, or, if the entry belongs to a leaf node, then a link to the "payload", which in your case is a "path". So, when a node contains an entry for a given letter, this signifies that there exists a "name" with that letter in that position. (The position being the depth of the node in the tree.)
Separate all names by name-length, because that's very easy to determine. Use an entirely separate tree for each name-length. This means that in each tree all leaf nodes are at exactly the same depth, and the only nodes in the tree that contain additional data (a path in your case) are the leaf nodes. This keeps things very simple.

So, the search algorithm is as follows:

Among all different trees for different name lengths, use the one which corresponds to the length of the "name" you are searching for.
Begin with the first letter of the "name" you are looking for, and at the root node of the tree.
Find the entry for the current letter in the current node; if no such entry exists, then the name was not found, done. Otherwise:
If we have reached a leaf node, then the name has been found, return the payload, which in your case is a "path". Otherwise:
Move on to the next letter of the word you are looking for; follow the link from the found entry in the current node to the next child node; go to step 3.

As you can see, this is a fairly simple algorithm, and it is guaranteed to be very fast. Without any further optimizations, the storage requirements will be of the order of one "entry" per name stored, and an "entry" will consist of a character plus a pointer.

Then, there is a multitude of optimizations possible. For example, an "entry", might be useful concept when reasoning about your data structure, but it can be completely annihilated in the actual implementation: in each node you can have a single "summary" 32-bit machine word where each of the first 26 bits indicates whether the corresponding letter of the alphabet is present or not in the node, followed by an array of pointers to child nodes (or payloads) which contains as many elements as there are set bits in the summary word.

Recommended topics

Hot tags