Alternatives to Java string interning
Asked Answered
G

1

8

Since Java's default string interning has got a lot of bad press, I am looking for an alternative.

Can you suggest an API which is a good alternative to Java string interning? My application uses Java 6. My requirement is mainly to avoid duplicate strings via interning.

Regarding the bad press:

  • String intern is implemented via a native method. And the C implementation uses a fixed size of some 1k entries and scales very poorly for large number of strings.
  • Java 6 stores interned strings in Perm gen. And therefore are not GC'd and possibly lead to perm gen errors. I know this is fixed in java 7 but I can't upgrade to java 7.

Why do I need to use intering?

  • My application is a server app with heap size of 10-20G for different deployments.
  • During profiling we have figured that hundrends of thousands of string are duplicates and we can significantly improve the memory usage by avoiding storing duplicate strings.
  • Memory has been a bottleneck for us and therefore we are targetting it rather than doing any premature optimization.
Goon answered 9/10, 2012 at 4:44 Comment(9)
Part of me respects the requirements you post, but if "bad press" is enough for you to avoid them, then I really do have to ask how you've profiled your application (if at all) to determine Java strings are not suitable.Deadandalive
Have you noticed a problem in your application regarding these issues? If not, I wouldn't worry about it.Kowtko
@Kowtko my application has hundreds of thousands of duplicate Strings. So interning is a must have for me.Goon
@pst hope I have answered your question. I assume you are referring to Map rather than Set. I would need something that is thread safe and will GC the strings once they are no longer being referenced. something like concurrent weak hash map.Goon
@Chandra Please refrain from making unnecessary edits.Colewort
@pst if I use Set, I will be able to check if that string is duplicate but will not be able to get reference to original string so that I do not use the duplicate string.Goon
@ManojGumber #8854015 (impl with Map) , #3973341 (mentions Guava Interner)Resinate
@pst i appreciate the idea of prefix trie but I want to avoid doing all heavy weight on my own and will want to reuse an api that is available.Goon
@pst Guava interner looks promising.Goon
S
12

String intern is implemented via a native method. And the C implementation uses a fixed size of some 1k entries and scales very poorly for large number of strings.

It scales poorly for many thousand Strings.

Java 6 stores interned strings in Perm gen. And therefore are not GC'd

It will be cleaned up when the perm gen is cleaned up which is not often but it can mean you reach the maximum of this space if you don't increase it.

My application is a server app with heap size of 10-20G for different deployments.

I suggest you consider using off heap memory. I have 500 GB in off heap memory and about 1 GB in heap in one application. It isn't useful in all cases but worth considering.

During profiling we have figured that hundrends of thousands of string are duplicates and we can significantly improve the memory usage by avoiding storing duplicate strings.

For this I have used a simple array of String. This is very light weight and you can control the upper bound of Strings stored easily.


Here is an example of generic interner.

class Interner<T> {
    private final T[] cache;

    @SuppressWarnings("unchecked")
    public Interner(int primeSize) {
        cache = (T[]) new Object[primeSize];
    }

    public T intern(T t) {
        int hash = Math.abs(t.hashCode() % cache.length);
        T t2 = cache[hash];
        if (t2 != null && t.equals(t2))
            return t2;
        cache[hash] = t;
        return t;
    }
}

An interest property of this cache is it doesn't matter that its not thread safe.

For extra speed you can use a power of 2 size and a bit mask, but its more complicated and may not work very well depending on how your hashCodes are calculated.

Sidras answered 9/10, 2012 at 6:12 Comment(3)
For the array-of-Strings approach, was it just an unordered collection?Resinate
@peter Lawrey how will it handle collision. i.e. when two strings with different hashcode point to same cache index? Is there an assumption that you make Interner of size that you expect number of different strings?Goon
If there is a collision it replaces the value there. The size needs to be 2-3x larger than the number of strings you might consider optimal as it doesn't try to very smart about handling collisions. BTW even HashMap will be 1.4 to 2.8 times the number of entries. You can use primes.utm.edu/curios to find "interesting" primes of any size.Sidras

© 2022 - 2024 — McMap. All rights reserved.