Is string interning really useful?
Asked Answered
B

7

21

I was having a conversation about strings and various languages a while back, and the topic of string interning came up. Apparently Java and the .NET framework do this automatically with all strings, as well as several scripting languages. Theoretically, it saves memory because you don't end up with multiple copies of the same string, and it saves time because string equality comparisons are a simple pointer comparison instead of an O(N) run through each character of the string.

But the more I think about it, the more skeptical I grow of the concept's benefits. It seems to me that the advantages are mostly theoretical:

  • First off, to use automatic string interning, all strings must be immutable, which makes a lot of string processing tasks harder than they need to be. (And yes, I've heard all the arguments for immutability in general. That's not the point.)
  • Every time a new string is created, it has to be checked against the string interning table, which is at least a O(N) operation. (EDIT: Where N is the size of the string, not the size of the table, since this was confusing people.) So unless the ratio of string equality comparisons to new string creation is pretty high, it's unlikely that the net time saved is a positive value.
  • If the string equality table uses strong references, the strings will never get garbage collected when they're no longer needed, thus wasting memory. On the other hand, if the table uses weak references, then the string class requires some sort of finalizer to remove the string from the table, thus slowing down the GC process. (Which could be pretty significant, depending on how the string intern table is implemented. Worst case, deleting an item from a hash table can require an O(N) rebuild of the entire table under certain circumstances.)

This is just the result of me thinking about implementation details. Is there something I've missed? Does string interning actually provide any significant benefits in the general case?

EDIT 2: All right, apparently I was operating from a mistaken premise. The person I was talking to never pointed out that string interning was optional for newly-created strings, and in fact gave the strong impression that the opposite was true. Thanks to Jon for setting the matter straight. Another accepted answer for him.

Brassbound answered 11/7, 2011 at 17:25 Comment(10)
Why do you think that checking a new string against the string interning table is an O(N) operation?Brighton
This article is interesting too: codeinstructions.com/2009/01/…Wyly
Interesting question. I don't agree on O(N) because intern table can be dictionary.Appomattox
@Paul: O(N) where N is the length of the string, not the size of the table. Should have been more specific.Brassbound
Java doesn't do it for all strings - just all string literals, which can be determined at compile time and set up as part of class loading, so there's little run time cost. New String objects are not interned; code must explicitly call the intern() method on them to do so. So your code can decide whether interning is appropriate for its usage patterns, and choose to use it or not. The pool of interned strings does not count as a strong reference, so doesn't preclude GC.Nilsanilsen
I have a feeling that it is hard to say about interning & immutability which is chicken and which is egg. There were reasons to make strings immutable, and one of the useful benefit from such implementation could be interning but it might have not been the main reason.Appomattox
"O(N) operation. (EDIT: Where N is the size of the string, not the size of the table, since this was confusing people.)". There's a reason why it's confusing. Length of string rarely applies to interning strings, since the hashes are computed exactly once. The size doesn't matter.Cetus
Constructing a string of length n is O(n). Two O(n)s are still an O(n). Now the actual performance may be interesting, but not in a OMG(n) kind of way.Euphoria
Why did you tag it Ruby? Are you interested in what goes on in it, or are you just tagging it with any popular language?Pressor
@Andrew: Because that's one of the languages that implements it, and I wanted to get replies from different perspectives.Brassbound
F
27

No, Java and .NET don't do it "automatically with all strings". They (well, Java and C#) do it with constant string expressions expressed in bytecode/IL, and on demand via the String.intern and String.Intern (.NET) methods. The exact situation in .NET is interesting, but basically the C# compiler will guarantee that every reference to an equal string constant within an assembly ends up referring to the same string object. That can be done efficiently at type initialization time, and can save a bunch of memory.

It doesn't happen every time a new string is created.

(On the string immutability front, I for one am extremely glad that strings are immutable. I don't want to have to take a copy every time I receive a parameter etc, thank you very much. I haven't seen it make string processing tasks harder, either...)

And as others have pointed out, looking up a string in a hash table isn't generally an O(n) operation, unless you're incredibly unlucky with hash collisions...

Personally I don't use string interning in user-land code; if I want some sort of cache of strings I'll create a HashSet<string> or something similar. That can be useful in various situations where you expect to come across the same strings several times (e.g. XML element names) but with a simple collection you don't pollute a system-wide cache.

Further answered 11/7, 2011 at 17:31 Comment(18)
To provide some perspective, I'm used to Delphi, where strings are a reference type with reference counting and copy-on-write semantics guaranteed by the compiler. There's no need to make a copy when passing it as a parameter; it only makes a copy when you're modifying the string. You can even skip the reference-counting overhead if you pass it as a const parameter.Brassbound
@Mason: Reference counting has its own headaches of course, such as cycles... Regardless, most of the assertions in your question are simply incorrect.Further
@Mason Wheeler I coded in Delphi for several years, and I don't remember such behavior there. As far as I remember strings were just arrays + length counter.Appomattox
Refcount cycle problems only exist in data structures that can contain cyclic references, which doesn't apply to strings. But I think the "not all strings are interned automatically" bit was the part I was missing. The person I was talking to didn't clarify that, and gave the distinct impression that all strings are interned automatically in .NET. Thanks for clarifying.Brassbound
@Andrey: You're describing old-school Pascal strings. The Delphi string type, which has been around at least since Delphi 2 (I don't remember if D1 had it or not) works the way I described. The compiler just takes care of it under the hood for you so you don't have to worry about it.Brassbound
@Mason Wheeler Well, this just means that Delphi creates immutable strings to you, but covering it with fat layer of syntax sugar.Appomattox
@Andrey: Not exactly. If the ref count is 1, then changing a string will change that string. If it was immutable, it would have to make a copy. What Delphi creates is safe mutable strings.Brassbound
@Mason: Even if reference counting doesn't affect the strings themselves, you need deterministic deallocation throughout the system to make it effective. I suspect in a mark-and-sweep GC system like .NET's that would mean either excessive copying or every string object having a finalizer or possibly both, none of which is ideal.Further
Ref counts or copy-on-write for mutable string objects in Java would be difficult - I think the reference count (effectively, all String assignments) and mutation operations would have to be synchronized to avoid corruption from multithreaded access. That's high overhead for a basic type. Immutability means references can be shared across threads without locking.Nilsanilsen
@Andrew: Absolutely. Basically Java and .NET don't have ref-counting as part of the design, and bolting it on for just some types doesn't really work.Further
Found via Google: Looks like Delphi strings synchronize refcount access, and incur overhead for it. TANSTAAFL. efg2.com/Lab/Library/UseNet/1999/1119c.txt.Nilsanilsen
@Andrew: That's a pretty short snippet, and it's 12 years old. Not sure how accurate it was back then, but in modern versions, the synchronization is done in an ASM routine that uses a single atomic LOCK INC or LOCK DEC instruction. Unless other cores are currently holding a cache line that includes the string header, the "overhead" is negligible.Brassbound
Looking up a string in a hash table is definitely O(n). I think you meant to say "isn't generally Omega(n)."Janettajanette
Why not do your string interning with HashSet<int64> instead of HashSet<string>? It will be faster for long strings since it doesn't have to store or do a string comparison, and the probability of a collision in int64 space is basically 0.Janettajanette
@Neil: Yes, it's the typical "difference between the CompSci O(N) and common parlance O(N)". In my experience it's actually more useful to just keep using O(N).Further
@Neil: I disagree with the idea that the probability of a collision is 0. Do you really want people to be able to come up with different strings which happen to have the same hash (and given that these are non-crypto hash algorithms, that would be easy to do) and end up interning them together?Further
@Jon: I guess it depends on what you're using the strings for. You're right that if people are going to try to find two strings with the same hash, then you should be extra careful, but in many cases, that's not going to happen (say, the names of people in a phonebook.)Janettajanette
@Neil: Bear in mind that interning usually happens due to string constants in code. You really, really don't want that to get screwed up. Personally even in user-land code I'd usually use a HashSet<string> as it expresses what I'm interested in better than a HashSet<long>. I'd only start going into optimization if I found that this was really the bottleneck.Further
T
6

First off, to use automatic string interning, all strings must be immutable, which makes a lot of string processing tasks harder than they need to be. (And yes, I've heard all the arguments for immutability in general. That's not the point.)

This is true and string are immutable in Java. I am not sure if this a bad thing. Without going in to immutable vs mutable, I like to think this is a great design because of caching and so much more simplicity that I won't get in to.

Every time a new string is created, it has to be checked against the string interning table, which is at least a O(N) operation. So unless the ratio of string equality comparisons to new string creation is pretty high, it's unlikely that the net time saved is a positive value.

Not exactly O(n). You can do hashmaps and/or other data structures that will bring this to near constant look up.

If the string equality table uses strong references, the strings will never get garbage collected when they're no longer needed, thus wasting memory. On the other hand, if the table uses weak references, then the string class requires some sort of finalizer to remove the string from the table, thus slowing down the GC process. (Which could be pretty significant, depending on how the string intern table is implemented. Worst case, deleting an item from a hash table can require an O(N) rebuild of the entire table under certain circumstances.)

You are right about this and I would agree with you. Except I feel that the GC processing and negligible. The benefits in the long run are much more useful than having a garbage collector doing an extra check. I am not sure what you mean about O(n) for deleting from hashtable. Most operations on hashtables are O(1)

So in summary, I think your assumption that most operation are linear. But looking up strings is closer to a constant time. Therefore, this approach will have negligible performance loss but a huge memory gain. Which I'd argue is worth it.

Here is a nice quote on what is actually happening and how it saves memory.

To save memory (and speed up testing for equality), Java supports “interning” of Strings. When the intern() method is invoked on a String, a lookup is performed on a table of interned Strings. If a String object with the same content is already in the table, a reference to the String in the table is returned. Otherwise, the String is added to the table and a reference to it is returned.

Tomlin answered 11/7, 2011 at 17:30 Comment(5)
Question was "Is string interning really useful?". You answer doesn't really answer the question and looks like extended comment.Appomattox
I was still editing. But there is my answer. neglige cpu loss vs big memory gain. Vote goes it is useful.Tomlin
don't think that there is real memory gain. Only string literals get to intern table. If I have duplicated string values in code I promote them to constants which is basically the same. Immutability of strings pollute heap with short leaving objects, so I don't think that there are real benefits in terms of performance.Appomattox
Not sure what you mean because if you do a lot of string manipulation and there is only one copy of that string in the virtual machine then I feel there would be a memory gain. Also the wiki quotes "Interning strings makes some string processing tasks more time- or space-efficient". Are you just saying the gain isn't as much as one would think?Tomlin
"I am not sure what you mean about O(n) for deleting from hashtable. Most operations on hashtables are O(1)" Most operations, yes. But if you have two keys that hash to the same location in the table, and collision resolution involves putting one of those two somewhere else, and then the one that went in the right place gets removed, lookup is now broken for the other one unless you rehash it. This usually involves rebuilding the entire table.Brassbound
E
3

Here's the python documentation's take on it:

sys.intern(string)

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.

Extinction answered 11/7, 2011 at 17:36 Comment(0)
F
3

The a.equals(b) is very fast for random strings. Its is only slow for Strings which are long and the same (or almost the same)

Random rand = new Random(1);
String[] list = new String[2000];
for(int i=0;i<list.length;i++)
    list[i] = "1234567"+Long.toString(rand.nextInt(36*37), 36); // semi random
int count = 0;
long start = System.nanoTime();
for(int i=0;i<list.length;i++)
    for(int j=0;j<list.length;j++)
        if (list[i].equals(list[j]))
            count++;
long time = System.nanoTime() - start;
System.out.printf("The average time for equals() was %,d ns.%n", time/list.length/list.length);

on a 2.3 GHz laptop prints

The average time for equals() was 19 ns.

If you intern() the first value and have to intern() one value to do the comparison

       if (list[i] == list[j].intern())

prints

The average time for equals() was 258 ns.

This is a common case as you often have one value you know is interned and a second which is input and is not intern'ed.

if you only use intern'ed Strings and == it, and don't count the cost, prints

The average time for equals() was 4 ns.

Which is many times faster if you are doing millions of comparison. However for a small number of comparisons, you are saving 8 ns but could be costing 250 ns more.

It may just be simpler to avoid intern() and use equals().

Fibrosis answered 11/7, 2011 at 18:14 Comment(2)
Good point. Having to do the intern per get to "save" on an equals check is a no-go. Interning is only wise if you need a mapping which has a lot of reads, and you fully control the keys... In which case, you can probably just == on them anyway without filling the intern table.Boyce
...Or if your bottleneck is memory and you have a lot of repeated strings. In which case, spending more cpu to preserve working memory would pay out in terms of user experience... but, that's a corner case which should not affect general usage.Boyce
F
-1

The points you listed are all valid to a certain extent. But there are important counter-arguments.

  1. Immutability is very important, especially if you're using hash maps, and they are used a lot.
  2. String composition operations are very slow anyway, because you have to constantly reallocate the array containing the characters.
  3. On the other hand, subString() operations are very fast.
  4. String equality is indeed used a lot, and you're not losing anything there. The reason being that strings aren't interned automatically. In fact in Java if the references are different, equals() falls back to a character by character comparison.
  5. Clearly, using strong references for the intern table isn't a good idea. You have to live with the GC overhead.
  6. Java string handling was designed to be space-efficient, especially on constant strings and substring operations.

On balance I'd say it is worth it in most cases and fits well with the VM-managed heap concept. I could imagine some special scenarios where it could be a real pain though.

Fantinlatour answered 11/7, 2011 at 17:42 Comment(1)
substring is less fast on java 7... java6 and lower returns a string object pointing to the char[] of the original string (and thus leaking memory). 7 makes immutable array copies for substring now too; it's a little more runtime data, but it cuts down on memory. Intern() is the same thing; getting the == to pay off is hard (both strings must be interned), but if you have 2^20 strings, interning will save your heap, and will have higher performance in demanding situations.Boyce
C
-1

Does string interning actually provide any significant benefits in the general case?

Yes. It's huge. Try it in java.

Write simple tests that compare 1,000's of semi-random strings for equality with and without interning.

a.equals( b )  is slow

a == b is fast.
Cetus answered 11/7, 2011 at 17:46 Comment(8)
Yes, but that was my point. There are several string operations, of which equality comparison is the only one that benefits from this. How often do you use string equality comparison?Brassbound
@Mason Wheeler: Constantly. Indeed, I rarely use anything else. "Sorting" is relatively rare, and I try to design things to avoid it as much as possible.Cetus
a.equals(b) is very fast for random strings, The first thing it compares the length and then the first chars. For random Strings this is as far as it needs to look if they are different.Fibrosis
@Peter Lawrey: Hence the advice to use "semi-random" strings. We did a comparison using 20,000 financial accounts which were 8 or 9 characters with lots of repeated patterns of various lengths. "random" isn't realistic data to use for testing anything.Cetus
@S. Lott, I did a performance test and comparing semi random Strings of 8-9 chars, using == saved 15 ns compared with equals, but using intern() cost 250 us.Fibrosis
@Peter Lawrey: How many cycles? 20,000? A million? And how often were they interned? Our measurements interned once for repeated processing. The intern() does have a non-zero cost, but should amortize well over numerous uses. If you don't think intern() has a performance benefit, please post your own question with your own benchmark code. Comments aren't the place to post benchmark results.Cetus
@S Lott, that should read 250 ns! not 250 us. In any case its expensive. IMHO it is fairly rare that you want to repeatedly compare String which have already been intern()ed. If you have this situation you can use usually refactor the code to perform incremental checks rather than check everything every time. I can't post a question if I don't have a question in mind.Fibrosis
@Peter Lawrey: "is fairly rare that you want to repeatedly compare String which have already been intern()ed". Interestingly, that wasn't our experience. I rewrote existing Java apps to intern() and got a significant speedup. It appears you do have a question in mind. The question appears to be some kind of benchmarking question related to interns. Since you keep repeating your claim, that sounds like the basis for a question.Cetus
T
-1

String interning is useful when you need to compare strings (1) from a finite set (2) a number of times.

Then the overhead of interning a string is outweighed by the benefit of being able to do a fast == instead of equals().

Doing this can sometimes be faster than using a HashMap, which relies on hashCode() and equals() calls.

Tomy answered 25/4, 2015 at 11:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.