Let's consider the tradeoffs before making any suggestions.
You say you need to store "millions" of paths. I'll assume one million, because it makes the calculations easier (and even on servers, I haven't seen more than a million directories).
How long are those paths? You've shown an example with very short paths, so we're looking at maybe a hundred megabytes to store those million paths. I don't have a reference for maximum path lengths, but 256 characters sticks in my mind. At roughly 2 bytes per character for a Java `String`, your paths will take a maximum of about 512 MB of memory. Do you have that much memory?
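As a rough back-of-the-envelope (the path count, path length, and bytes-per-char figures here are assumptions, not measurements):

```java
// Worst-case upper bound, assuming 1,000,000 paths, 256 chars per path,
// and ~2 bytes per char for a Java String (ignoring per-object overhead,
// which only adds to this).
long worstCaseBytes = 1_000_000L * 256 * 2;   // 512,000,000 bytes ≈ 512 MB
```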
How evenly distributed are the pathnames? In other words, do you follow an 80:20 rule where 80% of the paths are found in 20% of the directories? The reason that I ask is because a trie structure needs some form of index between levels. If you have a lot of directories with only a few paths under them, you're going to have a lot of overhead to maintain a trie.
The recommendations: if I had enough memory, I'd use a `HashSet<String>` and be done with it.
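A minimal sketch of that straightforward approach (the example paths are just placeholders):

```java
import java.util.HashSet;
import java.util.Set;

// Store every path verbatim; HashSet gives O(1) add and lookup.
Set<String> paths = new HashSet<>();
paths.add("/usr/bin/javac");
paths.add("/usr/bin/java");

// Membership check is a single probe.
boolean seen = paths.contains("/usr/bin/javac");   // true
```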
If I didn't have a lot of memory, and a directory structure that followed the 80:20 rule (or, more likely, 95:5), I'd think of a `HashMap<String,Set<String>>`. The key of this map would be the longest leading path string that had a "reasonable" amount of duplication, and the values would be the remaining strings. You'd probe this map with progressively shorter leading components until you found a match, then probe the set for the remainder.
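Here's one way that two-level lookup might look; this is a sketch under the assumption that keys are directory prefixes ending in `/`, and the class and method names are purely illustrative:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PrefixPathSet {
    // Key: a leading directory prefix (ending in '/').
    // Value: the path remainders stored under that prefix.
    private final Map<String, Set<String>> byPrefix = new HashMap<>();

    void add(String prefix, String suffix) {
        byPrefix.computeIfAbsent(prefix, k -> new HashSet<>()).add(suffix);
    }

    boolean contains(String path) {
        // Probe with progressively shorter leading components:
        // strip one trailing path segment at a time, look up the prefix,
        // and if it exists, check the set for the remainder.
        int cut = path.length();
        while ((cut = path.lastIndexOf('/', cut - 1)) >= 0) {
            String prefix = path.substring(0, cut + 1);   // keep trailing '/'
            Set<String> suffixes = byPrefix.get(prefix);
            if (suffixes != null) {
                return suffixes.contains(path.substring(cut + 1));
            }
        }
        return false;
    }
}
```

Whether the first matching prefix should end the search, or shorter prefixes should also be tried when the remainder isn't found, depends on how you split the paths when building the map.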
That leaves open the question of "reasonable" duplication. It's the amount of duplication where the overhead of a two-piece data structure is outweighed by the savings from removing the duplicated prefixes. For example, `/usr/bin/` might be a valid key (it holds thousands of files, and you save 9 characters, or 18 bytes, from each), but `/usr/local/bin/` probably wouldn't be (at least on my system, it only holds a single file).