String Deduplication feature of Java 8
Asked Answered
E

4

80

Since String in Java (like other languages) consumes a lot of memory because each character consumes two bytes, Java 8 has introduced a new feature called String Deduplication which takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

I have read this example so far but since I am not a pro java coder, I am having a hard time grasping the concept.

Here is what it says,

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My First question,

There is still a lack of resources on this topic since it is recently added in Java 8 update 20, could anyone here share some practical examples on how it help in reducing the memory consumed by String in Java ?

Edit:

The above link says,

As soon as it finds another String which has the same hash code it compares them char by char

My 2nd question,

If hash code of two String are same then the Strings are already the same, then why compare them char by char once it is found that the two String have same hash code ?

Enchantment answered 14/1, 2015 at 17:45 Comment(16)
While not related to Java 8 per-se, reading up on how modern JS engines (eg. V8 and *Monkey) handle strings internally is interesting - there must be about a dozen different abstract categories. (JavaScript engines arguably benefit a lot from this sort of "de-duplication" and "sharing" optimization work.)Messieurs
Did you every hear of “hash collisions”? There are only 2³² == 4294967296 different hash codes but 65536²¹⁴⁷⁴⁸³⁶⁴⁸ == practically infinite different possible Strings. In other words, having the same hash code does not guaranty that the String are equal. You have to check that. Only the opposite is true, having different hash codes implies that the Strings are not equal.Erine
@Erine could you please provide me any link where i can read about total String combinations i.e 65536²¹⁴⁷⁴⁸³⁶⁴⁸Enchantment
I don’t have a link, as it’s simple to find out: one char is a 16 Bit value, so it allows 2¹⁶ == 65536 combinations. A String is a sequence that has an int length, so it may have up to 2³¹ characters (2³¹ not 2³² because int is signed in Java but a String has a positive size) so the maximum String length is 2³¹ == 2147483648 (theoretically, the practical limit is a bit smaller). So a String can combine up to 2147483648 chars which can have 65536 possible combinations, which makes 65536²¹⁴⁷⁴⁸³⁶⁴⁸ combinations (actually a bit larger as a String could also be shorter)Erine
@mbomb007: it’s like having a number with n digit positions when there are m different digits which allows mⁿ combinations, e.g. the decimal numbers from 000 to 999 allow 10³ combinations. For a String there are 65536 different “digits” (aka chars) at 2147483648 digit positions, so its 65536²¹⁴⁷⁴⁸³⁶⁴⁸. It’s only “slightly” more as \0 and “end-of-String” are distinct in Java. Not that it matters, as it’s too large to imagine anyway.Erine
It should equal (2¹⁶)^(∑ n=0_31(2^n)) if you include a String that can be shorter. That's what I'm talking about. That's not really slightly more.Trippet
This is because (2^16)^(2^31) permutations at max length, (2^16)^(2^30) permutations with 1 less, and so on.Trippet
Equal hash codes does not mean equal strings. See #28081Burmeister
@mbomb007: You have the sum operator in the wrong place, though. With the correct formula, it is only "slightly" more. With yours, it is insanely more.Independent
@mbomb007: Actually forget that, your formula doesn't cover different lengths at all.Independent
Oh my bad, was looking at my first attempt when I copied it over. I meant to write: ∑n=0_31((2¹⁶)^(2^n))Trippet
Here is a list of String which have the hashCode of 0. #2310998Dissociation
How does this affect concurrency of existing programs? Has the char array field of String been made a volatile field now?Merri
@Adrian Pronk: the char array field stays final and thus, may get locally cached/treated as constant by threads as before. The change of the reference during the string de-duplication is a special action that lives outside the ordinary access rules. This can’t create race conditions as it doesn’t matter whether threads are using the old array or the new array, both have identical contents. But keep in mind that the de-duplication is done by the garbage collector anyway. The garbage collector has to know which objects (including arrays) are referenced by live threads in the first place.Erine
No specific practical example other than to say, on the web app we support, we had to write our own String Deduplication feature because of the huge number of Strings the app used, and it made a huge difference (in Gigabytes) in memory use, and made a big difference in overall performance as you can imagine - less GC etc.Monteux
FYI, related Question: Is String Deduplication feature of the G1 garbage collector enabled by default?Naughton
C
71

Imagine you have a phone book, which contains people, which have a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file those strings are not interned so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory usage from 2MB to 20 bytes.

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Colossian answered 14/1, 2015 at 18:6 Comment(14)
what if there are no same char arrays at all ? but 100s of non-similar strings and char arrays ? what is the use of this feature then ?Enchantment
Actually, that’s what I do for decades now. And somehow I expected other developers to be as smart as well to canonicalize Strings when reading a larger batch from a database or other external resource.Erine
@Enchantment in most application you will have a lot of similar strings, for example database queries, names, countries, webservice queries etc. It will probably not improve every single application but I'm sure they have tested the costs vs benefits on many typical use cases before rolling out that feature.Colossian
@assylias, i get it but if that is all deduplication is about then why java didn't introduce it earlier ? why took so many years ?Enchantment
@Enchantment You would have to ask the JVM designers - string interning has been there for a long time and I suspect that as the performance of the JVM/garbage collector get better and as the number of CPUs per device increases, they can improve things that would have introduced too much overhead in the past.Colossian
@assylias, thanks this make more sense now. just one more question, why can't this be done for substring instead of String ? I am asking this question in general and not about this deduplication feature, this is unrelated question but i am curious.Enchantment
In older versions, a String could refer to a range within an array, using an int offset and length. In that case, de-duplication would be much more complex, but on the other hand it wasn’t required for results of String.substring as these substrings referred to the original array. This has changed in Java 7 raising the demand for a de-duplication feature.Erine
Finding a substring in a string is much slower (O(n^2) ?) whereas finding if two strings are equal is O(n) worst case and O(1) when the two strings have different (cached) hashcodes, i.e. most of the time it is a simple int comparison.Colossian
@Enchantment I have added a link which probably better answers your questions.Colossian
I wouldn’t expect the JVM to find substrings. But if such an older JVM detected that two Strings are equal, it shouldn’t make a larger-then-necessary array shared, so it would have to create and fill a new array then, and it had to update value, offset and count of a String atomically. In contrast, given the way it works today, only a single reference has to be changed.Erine
thank you very much :). one question, in my mentioned link it says "As soon as it finds another String which has the same hash code it compares them char by char".. If Strings have the same hash code then they are already the same then why compare them char by char ?Enchantment
@Enchantment No two strings can be different but have the same hashcode (pigeon hole demonstration: there are only 2^32-1 possible hashcode and you can create (almost) as many strings as you want). See for example this post for a list of strings which all have a hashcode of 0.Colossian
Did you mean: No , two strings ...? I read that very differently.Annamariaannamarie
@SoririosDelimanolis yes indeed - sorry for the missing punctuation.Colossian
E
73

@assylias answer basiclly tells you how it work and is very good answer. I have tested a production application with String Deduplication and have some results. The web app heavily uses Strings so i think the advantage is pretty clear.

To enable String Deduplication you have to add these JVM params (you need at least Java 8u20):

-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

The last one is optional but like the name says it shows you the String Deduplication statistics. Here are mine:

[GC concurrent-string-deduplication, 2893.3K->2672.0B(2890.7K), avg 97.3%, 0.0175148 secs]
   [Last Exec: 0.0175148 secs, Idle: 3.2029081 secs, Blocked: 0/0.0000000 secs]
      [Inspected:           96613]
         [Skipped:              0(  0.0%)]
         [Hashed:           96598(100.0%)]
         [Known:                2(  0.0%)]
         [New:              96611(100.0%)   2893.3K]
      [Deduplicated:        96536( 99.9%)   2890.7K( 99.9%)]
         [Young:                0(  0.0%)      0.0B(  0.0%)]
         [Old:              96536(100.0%)   2890.7K(100.0%)]
   [Total Exec: 452/7.6109490 secs, Idle: 452/776.3032184 secs, Blocked: 11/0.0258406 secs]
      [Inspected:        27108398]
         [Skipped:              0(  0.0%)]
         [Hashed:        26828486( 99.0%)]
         [Known:            19025(  0.1%)]
         [New:           27089373( 99.9%)    823.9M]
      [Deduplicated:     26853964( 99.1%)    801.6M( 97.3%)]
         [Young:             4732(  0.0%)    171.3K(  0.0%)]
         [Old:           26849232(100.0%)    801.4M(100.0%)]
   [Table]
      [Memory Usage: 2834.7K]
      [Size: 65536, Min: 1024, Max: 16777216]
      [Entries: 98687, Load: 150.6%, Cached: 415, Added: 252375, Removed: 153688]
      [Resize Count: 6, Shrink Threshold: 43690(66.7%), Grow Threshold: 131072(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

These are the results after running the app for 10 minutes. As you can see String Deduplication was executed 452 times and "deduplicated" 801.6 MB Strings. String Deduplication inspected 27 000 000 Strings. When i compared my memory consumption from Java 7 with the standard Parallel GC to Java 8u20 with the G1 GC and enabled String Deduplication the heap dropped approximatley 50%:

Java 7 Parallel GC

Java 7 Parallel GC

Java 8 G1 GC with String Deduplication

Java 8 G1 GC with String Deduplication

Eyetooth answered 14/1, 2015 at 22:40 Comment(7)
Thanks for this great answer. But can you tell me which tool you used to measure the memory consumption and how to do it? Any pointers to oracle/java website detailing this would be very helpful. I would like to do such analysis for my web application. Thanks in advance :)Blunger
The charts are from the NetBeans IDE - from the bulit-in profiler. Look at the Netbeans website and on Google for tutorial. Alternatively you could get the same charts from jVisualVM.Eyetooth
@RobertNiestroj, as per this article cubrid.org/blog/dev-platform/… We are not supposed/recommended to use G1GC. So how do we solve the problem?Pleochroism
What problem? If you cannot use G1GC then you cannot use String Deduplication. There's no workaround for this.Eyetooth
@Reddy, the article indicates G1 should not be considered because it is so new. This was 'newly official' in JDK7. I'm not sure how 'new' something needs to be to be too-new, but JDK7 came out in 2011.Este
Why is it not on by default, and need to turned on?Diapophysis
@Blunger you can use jprofiler. ej-technologies.com/products/jprofiler/overview.html . i used intellij idea and jprofiler to observe my JVM stats.Coarctate
C
71

Imagine you have a phone book, which contains people, which have a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file those strings are not interned so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory usage from 2MB to 20 bytes.

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Colossian answered 14/1, 2015 at 18:6 Comment(14)
what if there are no same char arrays at all ? but 100s of non-similar strings and char arrays ? what is the use of this feature then ?Enchantment
Actually, that’s what I do for decades now. And somehow I expected other developers to be as smart as well to canonicalize Strings when reading a larger batch from a database or other external resource.Erine
@Enchantment in most application you will have a lot of similar strings, for example database queries, names, countries, webservice queries etc. It will probably not improve every single application but I'm sure they have tested the costs vs benefits on many typical use cases before rolling out that feature.Colossian
@assylias, i get it but if that is all deduplication is about then why java didn't introduce it earlier ? why took so many years ?Enchantment
@Enchantment You would have to ask the JVM designers - string interning has been there for a long time and I suspect that as the performance of the JVM/garbage collector get better and as the number of CPUs per device increases, they can improve things that would have introduced too much overhead in the past.Colossian
@assylias, thanks this make more sense now. just one more question, why can't this be done for substring instead of String ? I am asking this question in general and not about this deduplication feature, this is unrelated question but i am curious.Enchantment
In older versions, a String could refer to a range within an array, using an int offset and length. In that case, de-duplication would be much more complex, but on the other hand it wasn’t required for results of String.substring as these substrings referred to the original array. This has changed in Java 7 raising the demand for a de-duplication feature.Erine
Finding a substring in a string is much slower (O(n^2) ?) whereas finding if two strings are equal is O(n) worst case and O(1) when the two strings have different (cached) hashcodes, i.e. most of the time it is a simple int comparison.Colossian
@Enchantment I have added a link which probably better answers your questions.Colossian
I wouldn’t expect the JVM to find substrings. But if such an older JVM detected that two Strings are equal, it shouldn’t make a larger-then-necessary array shared, so it would have to create and fill a new array then, and it had to update value, offset and count of a String atomically. In contrast, given the way it works today, only a single reference has to be changed.Erine
thank you very much :). one question, in my mentioned link it says "As soon as it finds another String which has the same hash code it compares them char by char".. If Strings have the same hash code then they are already the same then why compare them char by char ?Enchantment
@Enchantment No two strings can be different but have the same hashcode (pigeon hole demonstration: there are only 2^32-1 possible hashcode and you can create (almost) as many strings as you want). See for example this post for a list of strings which all have a hashcode of 0.Colossian
Did you mean: No , two strings ...? I read that very differently.Annamariaannamarie
@SoririosDelimanolis yes indeed - sorry for the missing punctuation.Colossian
T
6

Since your first question has already been answered, I'll answer your second question.

The String objects must be compared character by character, because though equal Objects implies equal hashes, the inverse is not necessarily true.

As Holger said in his comment, this represents a hash collision.

The applicable specifications for the hashcode() method are as follows:

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. ...

This means that in order for them to guarantee equality, the comparison of each character is necessary in order for them to confirm the equality of the two objects. They start by comparing hashCodes rather than using equals since they are using a hash table for the references, and this improves performance.

Trippet answered 14/1, 2015 at 19:12 Comment(10)
this doesn't answer the original (main) question.Enchantment
actually i didnt edit the main question...it was always there as you can see other answers.Enchantment
You did edit it. You changed it to say "2nd" in the last question. But thanks for making it more clear.Trippet
And he didn't answer your 2nd question, so I have added information that will hopefully be helpful and informative to someone reading it.Trippet
oh, I am sorry for that but i meant that i never really edit the main question...also, it is not a good approach to answer a question just by reading the last paragraph.Enchantment
I read it all, but I already had seen the accepted answer, so I wanted to only provide new information.Trippet
thanks for that...can you please edit your answer and add a statement that this is the answer to the 2nd question...i think that would be helpful for future readers. :) P.S, +1Enchantment
Let us continue this discussion in chat.Trippet
Also a good way to explain hash codes is the fact that the hash value is an integer of limited size. And each of these integers can be represented as a string. It follows that there are a lot more strings than hash values. Since there are many times more strings than there are possible hash values, it's obvious that many strings must share the same hash value.Jevon
@Peter, yes but that has already been discussed in detail in the comments under the question, hence why I left it at "hash collisions". Click the link to the comment in the answer to see.Trippet
C
1

The strategy they describe is to simply reuse the internal character array of one String in possibly many equal Strings. There's no need for each String to have its own copy if they are equal.

In order to more quickly determine if 2 strings are equal, the hash code is used as a first step, as it is a fast way to determine if Strings may be equal. Hence their statement:

As soon as it finds another String which has the same hash code it compares them char by char

This is to make a certain (but slower) comparison for equality once possible equality has been determined using the hash code.

In the end, equal Strings will share a single underlying char array.

Java has had String.intern() for a long time, to do more or less the same (i.e. save memory by deduplicating equal Strings). What's novel about this is that it happens during garbage collection time and can be externally controlled.

Conga answered 14/1, 2015 at 17:54 Comment(3)
you just copied what is written in that example link.Enchantment
@Enchantment I'm quoting part of their statement while trying to explain why the hash code is relevant in all this. Editing the answer in an attempt to make that more obvious.Conga
If you're quoting chunks of some other document, you should use the blockquote style and provide some form of attribution.Maryannamaryanne

© 2022 - 2024 — McMap. All rights reserved.