Does String.intern() really increase performance?
Asked Answered
I

2

16

I did a little investigation to find out how the String.intern() method is implemented in java.

I looked at C++ implementation of Intern pool from Open JDK 6 and there I saw a simple HashSet. For me it meant that when someone are trying to intern a String the next steps should be done:

  1. finding hash code associated with the given String
  2. finding an appropriate bucket
  3. comparing the given String to all another Strings in the bucket. Before this step there may be 0 Strings, one String or a LOT OF Strings in the bucket. So if the given String has been previously put in the bucket we will get at least one comparison (that's the best case. Of course there might have been a lot of collisions and now many other Strings are in the bucket)
  4. If the String has been found in the bucket then it should be returned by intern() method
  5. If the String has not been found in the bucket then it should be put in the bucket and returned by intern() method

So many people say that str1.intern() == str2.intern() would be faster than str1.equals(str2).

But I cannot see the reason it should be faster.

As I can see in case of str1.equals(str2) we always have two strings comparing char by char in String.equals() method.

In case of str1.intern() == str2.intern(), how many comparisons we would have to get or to put the String to/from the pool (right, it can be a lot of comparisons and they are simple char by char comparisons too)?

So in case of str1.intern() == str2.intern() even if we use == to compare Strings we also will have many additional actions such as comparisons described previously.

When I understood it I decided to make some benchmark testing.

The first results shewed me that str1.intern() == str2.intern() was faster than str1.equals(str2).

This behaviour was caused by the fact that String.intern() method is native so it shouldn't be interpreted every time and String.equals() is a java method.

So then I decided to use -Xcomp option to make JVM compile all the code on start.

After that equals shewed a better speed than intern.

I tested it on Java 6 and 7.

So my question is have you ever seen a situation when interning increased speed of String comparison? I yes how can it be?

Or maybe intern() can only help to save more free memory?

Intercross answered 7/4, 2014 at 15:32 Comment(6)
str1.intern() == str2.intern() - no! You're supposed to have the strings already interned. Interning them at the comparison site is pure overhead. (Whether interning is useful when you're using it properly is still debatable, but interning like this is just useless.)Clausius
I don't think this fully answers the question nor do I have a reference handy, but I recall reading a long time ago that the String.hashCode method has been optimized for very good distribution, such that in a hash table, you will get very few collisions.Hufuf
"People say" is never a good reason to do anything. +1 for doing your own research.Laurencelaurene
+1 for actually testing in order to answer a "is X faster than Y" question.Hufuf
+1 Nice research, and an interesting question!Motherland
@Hufuf In Java 6 the default string intern pool size is 1009 and can't be changed. And the HashSet can't change its size by itself. So if you have more than 1009 interned strings you will obviously have collisions. It was fixed in Java 7 where you can set the default pool size by -XX:StringTableSize=N optionIntercross
T
7

String.intern() is meant to decrease memory use.

Only use interned Strings (if ever) when you have many, many multiple copies of the SAME String in memory. by interning the Strings, all those copies will use the same reference.

I have only seen interning Strings being helpful when I have millions of copies of the same String.

As with any kind of optimization, only do it after there is a performance or memory problem and you have profiled it so that you have detected that this is the bottleneck.

See this Blog post for more details on String interning.

Tillfourd answered 7/4, 2014 at 15:37 Comment(3)
For example, I once looked at a heap dump from a production server that often had heap space issues. I found 8 MB of strings holding the value "0" and "1". This was a great opportunity to use intern.Hufuf
@Hufuf That sounds more like bad choice of data type. In a situation like that, wouldn't boolean be a better choice, using String.valueOf whenever a string representation is needed?Dionnadionne
Agreed - looking back at the source, I can't find where I used intern for that example, but I see a bunch of Boolean.valueOf, Integer.valueOf, etc. I think what I ended up interning was the property names. The "0" and "1" etc. were property values loaded from the database and there was a small fixed set of properties loaded with different values frequently.Hufuf
L
3

To your question about why str1.intern() == str2.intern() may be faster:

This is the String.equals() implementation - as you can see it may be very inefficient depending on the strings compared.

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                        return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

Your steps may be a lot faster:
1) hashCode() is calculated once for any String due to it's immutability and is fairly fast
2) find the bucket is O(1)
3) comparing your String with others in the same bucket - there maybe a few but still should be faster than equals()
4) and 5) are fast

And don't forget that the above operations will have to be done only once for any String, once it is intern()ed the result is returned from first comparison.

Literally answered 7/4, 2014 at 16:1 Comment(5)
"And don't forget that the above operations will have to be done only once for any String, once it is intern()ed the result is returned from first comparison." I might have missed something, let's assume that we have str1.intern() and then str1.intern() again. I thought that the first line would put the string to the pool (making all the steps) and then the second line would do the same steps (finding a bucket, comparing the String with others in the bucket) to find this String in the pool and to return it. As far as I remmember this is what I saw in C++ code.Intercross
Or you just mean that we can do s1 = str1.intern(); s2 = str2.intern() and then we will be able to do s1==s2 a million times and the performance will be better than in case of million equals comparisons? Because I thought that if we had a million str1.intern()==str2.intern() we would have a lot of interaction with HashSet every iteration.Intercross
Yes, your seconds comment is what I meant. When you invoke intern() it makes sense to keep a copy and use it everywhere else...Literally
If you're using intern(), it means that at least one of the strings is coming from "somewhere else," such as being read from a file. In which case you will pay the cost of calculating a hashcode for each of those strings, which takes the same amount of time is comparing the two strings byte-by-byte.Laurencelaurene
@Laurencelaurene There is nothing to consider and/or improve if you only use all these Strings once. If on the other hand you need to compare more than 2 Strings to each other you will suddenly find big improvement.Literally

© 2022 - 2024 — McMap. All rights reserved.