The best alternative for String flyweight implementation in Java
Asked Answered
S

5

10

My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).

I have looked at Java Constant Pool and String.intern, but it seems that it can provoke some PermGen problems.

What would be the best alternative for implementing application-wide, multithreaded pool of Strings in java?

EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?

Scholium answered 26/5, 2010 at 18:1 Comment(6)
what are your heap size settings?Dunkirk
@scot we fiddled with those.. our solution is not efficient so it doesn't scale well.Scholium
you realize you just stated "I know for sure something without having any data to prove it" you need to re-evaluate what you think you know. thinking something without proof is called speculating.Lenette
@fuzzy, I probably didn't express myself well. I know that many strings are duplicated from looking at the code, I didn't perform any measurements to obtain runtime data that would give me the exact proportion. I think I can still make valid assumption from static code analysis if you wish. But that is beside the point. Even if I don't actually need flyweight implementation, I'm sure many other do.Scholium
you miss the point, you don't know unless you actually profile and measure, the JRE and JIT do lots of non-obvious things, most of which are temporal. So guessing at what you think it should be doing by reading code you wrote, is a really bad practice. My point is without profiling and measuring you don't really know what you need. And the behaviors you think you understand, they probably change from one version of JRE to the next, even in minor point releases.Lenette
I'm using a version of WeakHashMap for this which is possible the fastest way (3 times faster than String.intern in my tests). Example program. Benchmark.Bonneau
O
8

Note: This answer uses examples that might not be relevant in modern runtime JVM libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory.

Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase the memory by preventing garbage collection of unused parts of those arrays.

For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:

//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";

Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
    String id = matcher.group(1);
        //...do something with id...
}

But look at the data actually being stored:

    //...
    String id = matcher.group(1);
    Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);

    char[] data = ((char[])valueField.get(id));
    System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );

It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.

Overside answered 26/5, 2010 at 19:5 Comment(3)
Heed this, if you're doing lots of string manipulation on large strings. String s = hugeString.substring(0,1); hugeString = null; s is not "1 character", it's as large as hugeString was (since it IS hugeString).Sylphid
Will, I don't think I follow, substring will create new String instance whose size is 1.Scholium
@Dan: the "size" (as returned by length()) is 1. But both Strings reference the same underlying character array, simply recording different start and end indexes of the array that the string refers to. The character array is not copied to the new String; only a reference to the array is.Overside
S
3

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings everytime, either explicitly or implicitly.

I.e. don't do:

String s1 = new String("foo");
String s2 = new String("foo");

This would create two instances in the heap. Rather do so:

String s1 = "foo";
String s2 = "foo";

This will create one instance in the heap and both will refer the same (as evidence, s1 == s2 will return true here).

Also don't use += to concatenate strings (in a loop):

String s = "";
for (/* some loop condition */) {
    s += "new";
}

The += implicitly creates a new String in the heap everytime. Rather do so

StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
    sb.append("new");
}
String s = sb.toString();

If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.

Striate answered 26/5, 2010 at 18:10 Comment(4)
I know. I am wondering if that is the best way to do it? Interning uses PermGen memory and I am told that if it is exhausted, it can be difficult to debug...Scholium
Well, if your problem can't be solved with writing code more efficiently, then you have 2 options: 1) Don't store large data in JVM's memory, but on disk file system or a database and only query it whenever really needed and then garbage it after use. 2) Give the JVM more memory.Striate
I think problem can be greatly reduced by writing more efficient code, but there are more than one alternative: 1. I can rely on Java Constant Pool (as you suggest) 2. I can roll-your-own flyweight, for starters. I am trying to understand and compare alternatives.Scholium
I don't see how a homegrown flyweight String can solve this. It would only add an extra abstraction layer (and extra overhead). It's already solved with the String pool in JVM. Rather review the code and use StringBuilder where applicable.Striate
L
2

Java 7/8

If you are doing what the accepted answer says and using Java 7 or newer you are not doing what it says you are.

The implementation of subString() has changed.

Never write code that relies on an implementation that can change drastically and might make things worse if you are relying on the old behavior.

1950    public String substring(int beginIndex, int endIndex) {
1951        if (beginIndex < 0) {
1952            throw new StringIndexOutOfBoundsException(beginIndex);
1953        }
1954        if (endIndex > count) {
1955            throw new StringIndexOutOfBoundsException(endIndex);
1956        }
1957        if (beginIndex > endIndex) {
1958            throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
1959        }
1960        return ((beginIndex == 0) && (endIndex == count)) ? this :
1961            new String(offset + beginIndex, endIndex - beginIndex, value);
1962    }

So if you use the accepted answer with Java 7 or newer you are creating twice as much memory usage and garbage that needs to be collected.

Lenette answered 30/5, 2015 at 16:18 Comment(0)
A
1

Effeciently pack Strings in memory! I once wrote a hyper memory efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal to store a large dictionary.

And don't forget that Strings are often the largest part in memory in nearly every app I profiled, so don't care for them if you need them.

Illustration:

You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:

B
+-e
  +-er
  +-ans
+-lood

Very efficient for e.g. a list of street names, this is obviously most reasonable with a fixed dictionary, because insert cannot be done efficiently. In fact the structure should be created once, then serialized and afterwards just loaded.

Anteversion answered 26/5, 2010 at 20:23 Comment(4)
sounds interesting, but I am not sure I quite follow. Would you mind illustrating it?Scholium
As I understand your description of it, this is known as a Patricia trie: en.wikipedia.org/wiki/Patricia_trieOverside
+1 Thanks for the name! Now I know how to look for better implementations than mine :).Anteversion
this is called a TrieLenette
S
0

First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.

Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure for another.

There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.

Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.

Stationery answered 26/5, 2010 at 18:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.