Space efficient collection for strings with common prefixes - Java implementation
Asked Answered
P

6

5

I need to store millions of string with common prefixes (they don't correspond to file system paths) in a Set like structure in memory, and query the Collection to see if a path exists.

e.g.

/path
/path/1
/path/2
/path/1/a
/path/1/b

I want to store these as efficiently as possible (they will be in memory) , given that there will be many common prefixes for all the strings involved would a Trie be a reasonable candidate?

I'm looking for a recommendation for an implementation of a suitable data structure in Java.

Proteus answered 8/4, 2011 at 13:30 Comment(1)
Similar #2219405Proteus
G
4

A Trie looks like the structure you need. Also similar structures are Radix Tries, that unlike tries, use sequence of characters to label edges. In plain tries edges are labeled with single characters, I am sure they behave better in your case where strings share quite a lot of prefixes.

see also ...

http://code.google.com/p/trie/

http://code.google.com/p/radixtree/

Ginoginsberg answered 8/4, 2011 at 13:35 Comment(7)
The authors comment doesn't inspire much confidence! "Note to any visitors, this is fine SAMPLE code, but not production code. It was written by an inexperienced programmer (me at the time) in one evening."Proteus
yes, you're right ... go for Radix Tries (second link) they are actually better for this case.Ginoginsberg
Any reason the prefer the second implementation of a radix trie to this one? code.google.com/p/patricia-trieProteus
But both are radix tries (aka Patricia-Trie), just different implementations....Proteus
no, they are different. the first one is simply 'trie' and the second one is a 'radix trie'. See the wikipedia explanations and you will easily spot the diferences.Ginoginsberg
Yes, but code.google.com/p/patricia-trie and code.google.com/p/radixtree are both Radfix Tries (Patricia Tries)Proteus
yes, radix trie and patricia trees are the same thing. Sorry.Ginoginsberg
T
3

This looks like a good candidate implementation: https://github.com/rkapsi/patricia-trie

Tribromoethanol answered 8/4, 2011 at 13:36 Comment(0)
C
1

Let's consider the tradeoffs before any suggestions.

You say you need to store "millions" of paths. I'll assume one million, because it makes the calculations easier (and even on servers, I haven't seen more than a million directories).

How long are those paths? You've shown an example with very short paths, so we're looking at maybe a hundred megabytes to store those million paths. I don't have a reference for maximum path lengths, but 256 characters sticks in my mind. So your paths will take a maximum of 512 Mb of memory. Do you have that much memory?

How evenly distributed are the pathnames? In other words, do you follow an 80:20 rule where 80% of the paths are found in 20% of the directories? The reason that I ask is because a trie structure needs some form of index between levels. If you have a lot of directories with only a few paths under them, you're going to have a lot of overhead to maintain a trie.

The recommendations: if I had enough memory, I'd use a HashSet<String> and be done with it.

If I didn't have a lot of memory, and a directory structure that followed the 80:20 rule (or, more likely, 95:5), I'd think of a HashMap<String,Set<String>>. The key of this map would be the longest leading path string that had a "reasonable" amount of duplication, and the values would be the remaining strings. You'd probe this map with progressively shorter leading components until you found a match, then probe the set for the remainder.

That leaves open the question of "reasonable" duplication. It's the amount of duplication where the overhead of a two-piece data structure is overcome by the reduction in duplication. For example, /usr/bin/ might be valid (because it holds thousands of files and you save 9 characters or 18 bytes from each), but /usr/local/bin/ probably wouldn't be (at least on my system, it only holds a single file).

Cuyp answered 8/4, 2011 at 14:31 Comment(0)
V
0

You could use a Tree structure, just like it would on disk. However, you need to remember that tree structures can use as much or more memory in overhead as they save. i.e. they are not really designed to save memory.

Perhaps you could use the cache of the disk subsystem if these files exist. It could be faster.

I would check you really need to do this as you can store a million entries in a JVM quite comfortably. ;)

If you want to minimise memory consumption, you could compress the data in memory. This could be much smaller than any other option, but more complicated to make as efficient.

Vergara answered 8/4, 2011 at 13:36 Comment(10)
I'll possibly have tens of millions of entries, and reasonably limited resources.Proteus
I would consider using compression, you won't get smaller than that.Vergara
I need to avoid too much processing as I'll be creating & destroying these collections frequently.Proteus
Something work knowing is that the latest version of Oracle's JVM can use byte[] instead of char[] for Strings. Can you use the last JVM?Vergara
Also its important to have the right tools for the job. A server from major vendor with 8 GB of memory costs around £430. ;)Vergara
So in summary, you can't use the latest JVM which would halve the consumption of Strings, have very restricted hardware, but want to update a large dataset with signifciant performance demands. ;)Vergara
By the way, by latest version of JVM do you mean the current version 1.6, or beta version 1.7?Proteus
You can buy a 1 TB server for £40K (that is 1 TB of main memory ;)Vergara
Java 6 update 21 or later. -XX:+UseCompressedStrings (Which is on by default)Vergara
It is hard to imagine anyone needing that much, however 8 - 32 GB servers are available for a surprising low prices even compared to a year or so ago. ;)Vergara
E
0

What I would use:

  1. a multi-level map that resembles the directory structure.
  2. A balanced tree with single characters as keys and further trees as values.
Emie answered 8/4, 2011 at 13:38 Comment(0)
F
0

I would recommend you to store the paths as they are, as Strings. I believe the overhead trying to save memory will result in the reverse.

Of course, it's simple enough to test if it is, by benchmarking against the Tries data structures mentioned above.

Fisherman answered 8/4, 2011 at 13:46 Comment(2)
" I believe the overhead trying to save memory will result in the reverse." why?Proteus
The pointers to the children might be larger than the children themselves.Fisherman

© 2022 - 2024 — McMap. All rights reserved.