I recall seeing a couple of string-intensive programs that do a lot of string comparison but relatively little string manipulation, and that use a separate table to map strings to identifiers for efficient equality checks and a lower memory footprint, e.g.:
    import java.util.HashMap;
    import java.util.Map;

    public class Name {
        // "SomeMap" in the original sketch; HashMap is an assumed stand-in.
        public static Map<String, Name> names = new HashMap<String, Name>();

        // Returns the single canonical Name for s, creating it on first use.
        public static Name from(String s) {
            Name n = names.get(s);
            if (n == null) {
                n = new Name(s);
                names.put(s, n);
            }
            return n;
        }

        private final String str;

        private Name(String str) { this.str = str; }

        @Override public String toString() { return str; }

        // equals() and hashCode() are not overridden!
    }
I'm pretty sure one of these programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (I also think it implemented CharSequence), but you get the idea: the entire program was littered with `Name` in any location where you would expect `String`, and in the rare cases where string manipulation was needed, it converted to strings and then cached them again, conceptually like:
    Name newName = Name.from(name.toString().substring(5));
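To make the identity guarantee concrete, here is a minimal sketch of how such a class gets used. It is only an illustration under my own assumptions, not javac's actual code: `HashMap` stands in for the unspecified map, and `NameDemo` and `dropPrefix` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Trimmed-down variant of the Name class above (HashMap is an assumed stand-in).
final class Name {
    private static final Map<String, Name> NAMES = new HashMap<>();

    static Name from(String s) {
        // One canonical Name per distinct string value.
        return NAMES.computeIfAbsent(s, Name::new);
    }

    private final String str;
    private Name(String str) { this.str = str; }
    @Override public String toString() { return str; }
}

class NameDemo {
    // Hypothetical helper: manipulate as a String, then re-canonicalize.
    static Name dropPrefix(Name name, int prefixLength) {
        return Name.from(name.toString().substring(prefixLength));
    }

    public static void main(String[] args) {
        Name a = Name.from("java.lang.String");
        Name b = Name.from(new String("java.lang.String")); // distinct String object, equal value
        System.out.println(a == b);                          // true: same canonical instance
        System.out.println(dropPrefix(a, 5) == Name.from("lang.String")); // true
    }
}
```

Because every `Name` comes out of the table, `==` is enough for equality, which is why `equals()` and `hashCode()` can be left alone.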
I think I understand the point of this, especially when there are a lot of identical strings all around and a lot of comparisons, but couldn't the same be achieved by just using regular strings and `intern`ing them? The documentation for `String.intern()` explicitly says:
...
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned. It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...
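As a rough sketch of what that guarantee looks like in practice (the class name `InternDemo` and the helper `sameInterned` are made up for illustration):

```java
class InternDemo {
    // True when both strings resolve to the same pooled instance.
    static boolean sameInterned(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String literal = "hello";            // string literals are interned automatically
        String built = new String("hello");  // a fresh object with equal contents

        System.out.println(literal == built);               // false: different objects
        System.out.println(literal == built.intern());      // true: intern() returns the pooled instance
        System.out.println(sameInterned(built, literal));   // true: contents are equal
        System.out.println(sameInterned(literal, "world")); // false: contents differ
    }
}
```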
So, what are the advantages and disadvantages of manually managing a `Name`-like class vs. using `intern()`?
What I've thought about so far was:
- Manually managing the map means using the regular heap, while `intern()` uses the permgen (at least on HotSpot JVMs before Java 7; later versions keep interned strings on the main heap).
- When manually managing the map you get type checking that can verify something is a `Name`, whereas an interned string and a non-interned string share the same type, so it's possible to forget interning in some places.
- Relying on `intern()` means reusing an existing, optimized, tried-and-tested mechanism without coding any extra classes.
- Manually managing the map results in code that is more confusing to new users, and string operations become more cumbersome.
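The type-checking point is the one I find easiest to show in code. In this sketch (all names are hypothetical), a method that relies on `==` silently gives the wrong answer when a caller forgets to intern, and the compiler cannot help because both values are plain `String`s:

```java
class ForgottenInternDemo {
    static final String KEYWORD = "class"; // a literal, so already in the pool

    // Identity comparison: only correct if every caller remembered to intern.
    static boolean isKeywordFast(String s) {
        return s == KEYWORD;
    }

    public static void main(String[] args) {
        // Built at run time, so this is a new object rather than the pooled literal.
        String fromInput = new StringBuilder("cla").append("ss").toString();

        System.out.println(isKeywordFast(fromInput));          // false, although the contents match
        System.out.println(isKeywordFast(fromInput.intern())); // true once interned
    }
}
```

With a `Name`-style class the equivalent method would take a `Name` parameter, so passing a raw `String` would simply not compile.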
... but I feel like I'm missing something else here.