String.intern() vs manual string-to-identifier mapping?
Asked Answered
T

5

4

I recall seeing a couple of string-intensive programs that do a lot of string comparison but relatively few string manipulation, and that have used a separate table to map strings to identifiers for efficient equality and lower memory footprint, e.g.:

public class Name {
    public static Map<String, Name> names = new SomeMap<String, Name>();
    public static Name from(String s) {
        Name n = names.get(s);
        if (n == null) {
            n = new Name(s);
            names.put(s, n);
        }
        return n;
    }
    private final String str;
    private Name(String str) { this.str = str; }
    @Override public String toString() { return str; }
    // equals() and hashCode() are not overridden!
}

I'm pretty sure one of these programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (and also I think it implemented CharSequence), but you get the idea - the entire program was littered with Name in any location you would expect String, and on the rare cases where string manipulation was needed, it converted to strings and then cached them again, conceptually like:

Name newName = Name.from(name.toString().substring(5));

I think I understand the point of this - especially when there are a lot of identical strings all around and a lot of comparisons - but couldn't the same be achieved by just using regular strings and interning them? The documentation for String.intern() explicitly says:

...
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...

So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?

What I've thought about so far was:

  • Manually managing the map means using regular heap, intern() uses the permgen.
  • When manually managing the map you enjoy type-checking that can verify something is a Name, while an interned string and a non-interned string share the same type so it's possible to forget interning in some places.
  • Relying on intern() means reusing an existing, optimized, tried-and-tested mechanism without coding any extra classes.
  • Manually managing the map results in a code more confusing to new users, and strign operations become more cumbersome.

... but I feel like I'm missing something else here.

Tepefy answered 13/1, 2012 at 16:6 Comment(0)
C
2

Unfortunately, String.intern() can be slower than a simple synchronized HashMap. It doesn't need to be so slow, but as of today in Oracle's JDK, it is slow (probably due to JNI)

Another thing to consider: you are writing a parser; you collected some chars in a char[], and you need to make a String out of them. Since the string is probably common and can be shared, we'd like to use a pool.

String.intern() uses such a pool; yet to look up, you'll need a String to begin with. So we need to new String(char[],offset,length) first.

We can avoid that overhead in a custom pool, where lookup can be done directly based on a char[],offset,length. For example, the pool is a trie. The string most likely is in the pool, so we'll get the String without any memory allocation.

If we don't want to write our own pool, but use the good old HashMap, we'll still need to create a key object that wraps char[],offset,length (something like CharSequence). This is still cheaper than a new String, since we don't copy chars.

Catwalk answered 13/1, 2012 at 16:30 Comment(0)
C
1

what are the advantages and disadvantages of manually managing a Name-like class vs using intern()

Type checking is a major concern, but invariant preservation is also a significant concern.

Adding a simple check to the Name constructor

Name(String s) {
  if (!isValidName(s)) { throw new IllegalArgumentException(s); }
  ...
}

can ensure* that there exist no Name instances corresponding to invalid names like "12#blue,," which means that methods that take Names as arguments and that consume Names returned by other methods don't need to worry about where invalid Names might creep in.

To generalize this argument, imagine your code is a castle with walls designed to protect it from invalid inputs. You want some inputs to get through so you install gates with guards that check inputs as they come through. The Name constructor is an example of a guard.

The difference between String and Name is that Strings can't be guarded against. Any piece of code, malicious or naive, inside or outside the perimeter, can create any string value. Buggy String manipulation code is analogous to a zombie outbreak inside the castle. The guards can't protect the invariants because the zombies don't need to get past them. The zombies just spread and corrupt data as they go.

That a value "is a" String satisfies fewer useful invariants than that a value "is a" Name.

See stringly typed for another way to look at the same topic.

* - usual caveat re deserializing of Serializable allowing bypass of constructor.

Clique answered 13/1, 2012 at 16:13 Comment(0)
M
1

I would always go with the Map because intern() has to do a (probably linear) search inside the internal String's pool of strings. If you do that quite often it is not as efficient as Map - Map is made for fast search.

Microwave answered 13/1, 2012 at 16:14 Comment(2)
I think one can be quite sure that intern() itself uses a Map-like data structure to decide quickly if this is a new string.Rainier
String.intern() uses a weak hash map. It's a point of concurrency contention, so the cost of lookup grows with the number of threads that intern strings, but it does not grow with the number of strings interned.Clique
A
1

String.intern() in Java 5.0 & 6 uses the perm gen space which usually has a low maximum size. It can mean you run out of space even though there is plenty of free heap.

Java 7 uses its the regular heap to store intern()ed Strings.

String comparison it pretty fast and I don't imagine there is much advantage in cutting comparison times when you consider the overhead.

Another reason this might be done is if there are many duplicate strings. If there is enough duplication, this can save a lot of memory.

A simpler way to cache Strings is to use a LRU cache like LinkedHashMap

private static final int MAX_SIZE = 10000;
private static final Map<String, String> STRING_CACHE = new LinkedHashMap<String, String>(MAX_SIZE*10/7, 0.70f, true) {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > 10000;
    }
};

public static String intern(String s) {
    // s2 is a String equals to s, or null if its not there.
    String s2 = STRING_CACHE.get(s);
    if (s2 == null) {
        // put the string in the map if its not there already.
        s2 = s;
        STRING_CACHE.put(s2,s2);
    }
    return s2;
}

Here is an example of how it works.

public static void main(String... args) {
    String lo = "lo";
    for (int i = 0; i < 10; i++) {
        String a = "hel" + lo + " " + (i & 1);
        String b = intern(a);
        System.out.println("String \"" + a + "\" has an id of "
                + Integer.toHexString(System.identityHashCode(a))
                + " after interning is has an id of "
                + Integer.toHexString(System.identityHashCode(b))
        );
    }
    System.out.println("The cache contains "+STRING_CACHE);
}

prints

String "hello 0" has an id of 237360be after interning is has an id of 237360be
String "hello 1" has an id of 5736ab79 after interning is has an id of 5736ab79
String "hello 0" has an id of 38b72ce1 after interning is has an id of 237360be
String "hello 1" has an id of 64a06824 after interning is has an id of 5736ab79
String "hello 0" has an id of 115d533d after interning is has an id of 237360be
String "hello 1" has an id of 603d2b3 after interning is has an id of 5736ab79
String "hello 0" has an id of 64fde8da after interning is has an id of 237360be
String "hello 1" has an id of 59c27402 after interning is has an id of 5736ab79
String "hello 0" has an id of 6d4e5d57 after interning is has an id of 237360be
String "hello 1" has an id of 2a36bb87 after interning is has an id of 5736ab79
The cache contains {hello 0=hello 0, hello 1=hello 1}

This ensure the cache of intern()ed Strings will be limited in number.

A faster but less effective way is to use a fixed array.

private static final int MAX_SIZE = 10191;
private static final String[] STRING_CACHE = new String[MAX_SIZE];

public static String intern(String s) {
    int hash = (s.hashCode() & 0x7FFFFFFF) % MAX_SIZE;
    String s2 = STRING_CACHE[hash];
    if (!s.equals(s2))
        STRING_CACHE[hash] = s2 = s;
    return s2;
}

The test above works the same except you need

System.out.println("The cache contains "+ new HashSet<String>(Arrays.asList(STRING_CACHE)));

to print out the contents which shows the following include on null for the empty entries.

The cache contains [null, hello 1, hello 0]

The advantage of this approach is speed and that it can be safely used by multiple thread without locking. i.e. it doesn't matter if different threads have different view of STRING_CACHE.

Animatism answered 13/1, 2012 at 16:54 Comment(13)
The LRU cache doesn't guarantee that a.equals(b) -> intern(a) == intern(b) which is the property of String.intern that many find useful, and it uses a static mutable map in a non-thread-safe way.Clique
And shouldn't if (!s.equals(s2)) be if (s != s2)?Clique
@MikeSamuel if a.equals(b) is almost always as fast as a == b because they are the same object almost always, there may not be much point using ==.Animatism
true. I'm just pointing out that your intern is not a substitute for String.intern because it does not preserve the property that most users of String.intern want. I understand that you are trying to solve the caching problem, but then calling it intern is just going to lead to confusion.Clique
Most user want improved performance and efficiency. I doubt most users are doing this so they can use the == syntax when it doesn't work for other classes.Animatism
Peter, regarding your first comment - I wouldn't say a.equals(b) is almost always as fast as a == b. It is that fast if they are indeed the same instance, since String.equals() starts with if (this == obj) return true (like many other equals() implementations); but if they aren't the same then a character-by-character comparison is still done, as far as I know. I agree that in a hash table the strings will almost always be the same, just sayin'.Tepefy
@Peter Lawrey, Ancedotally, I have run into more than a few novice devs who intern strings all over the place so they can use == on Strings because otherwise they get NullPointerExceptions when comparing against null.Clique
@Oak, It will only do a character-by-character comparison if all the characters are the same (all but the last) If the length is different it doesn't look at the characters and if first characters are different, it stops there.Animatism
@PeterLawrey. I hope you don't mind, but I fixed the if (!s.equals(s2)) /* insert into intern table */ since as written, no strings would ever be put into the intern table.Clique
@Oak, This is why you only need to remove almost all duplicates strings, but if you use equals() you don't need a guarantee. If it works only 99.99% of the time it means that 0.1% if the time it will be slightly slower. However, a collection of Strings which grows without limit is more likely to slow down your system or cause an OutOfMemoryError.Animatism
@MikeSamuel Good catch, it should be if (!s.equals(s2))Animatism
@PeterLawrey, the only puts are of .put(s, s), so the key always .equals the value. That means that s.equals(s2) will always be true. So as written, the cache only consumes memory, never saves anything.Clique
@MikeSamuel You are right that s.equals(s2) is always true unless s2 is null (So I simplified the first example) The second example need to check equals as there could be collision of mod'ed hash values. It saves memory every time the s.equals(s2) which you suggests happens all the time. (Not entirely true either)Animatism
R
0

So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?

One advantage is:

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.

In a program where many many small strings must be compared often, this may pay off. Also, it saves space in the end. Consider a source program that uses names like AbstractSyntaxTreeNodeItemFactorySerializer quite often. With intern(), this string will be stored once and that is it. Everything else if just references to that, but the references you have anyway.

Rainier answered 13/1, 2012 at 16:20 Comment(2)
Haven't you just described an advantage of both approaches? I'm looking for comparison between those two... unless I misunderstood you.Tepefy
Well, you see, .... its just easier to say intern() than to set up and maintain a separate map. That being said, it may nevertheless be better for spezialized types like Name.Rainier

© 2022 - 2024 — McMap. All rights reserved.