Java 8 String deduplication vs. String.intern()
A

3

20

I am reading about the feature in Java 8 update 20 for String deduplication (more info) but I am not sure if this basically makes String.intern() obsolete.

I know that this JVM feature needs the G1 garbage collector, which might not be an option for many, but assuming one is using G1GC, is there any difference/advantage/disadvantage of the automatic deduplication done by the JVM vs manually having to intern your strings (one obvious one is the advantage of not having to pollute your code with calls to intern())?

This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.

Accuse answered 29/9, 2015 at 22:46 Comment(9)
Suggested video -- but anyway, the conclusion is always the same: You. Should. Not. Care.Gruelling
Sorry, should not care about what? About which one to use (meaning they are equivalent) or about the new feature (meaning it's not that useful)?Accuse
Meaning: just use the String class without a second thought.Gruelling
Good question. That these features are added into the JVM is a hint towards developers to focus on coding instead of memory management. You shouldn't use String.intern() nor System.gc() -- just let the VM do its work.Discrepancy
Are you just interning strings to save memory in your application, or is it so you can treat them as unique symbols? How many strings? How much memory are you using (or saving) with your current interning approach? Most applications shouldn't ever have to worry about this, as other commenters have noted.Kenyakenyatta
@DavidConrad I'm not doing either in my applications. I am just trying to understand how this feature works. Any insights into why I shouldn't worry about it? Everyone is just saying don't worry about it, without an explanation. Are you implying that String deduplication is not very useful/effective? Previous comments about only using String without thinking about all this have nothing to do with enabling a feature in the JVM (it's not a development-time change). It is more related to tweaking the runtime.Accuse
I'm sure it's useful and effective, but since the G1GC does it automatically and 99% of applications aren't under any memory pressure from having too many duplicate strings, you don't need to worry about it. As Knuth said, "We should forget about small efficiencies," so unless you're trying to fix a specific problem with an application that is using too much memory and you think the problem is duplicate strings, there's no need to consider it.Kenyakenyatta
The relevant part of the video @Gruelling referenced runs 29m-39m. My take was very different from "do not care": rather: use your own Java code to 'intern' (pool) Strings! (Aleksey is of course a performance maniac.)Prowl
@DavidConrad In the full quote Knuth actually quantifies his suggestion: "We should forget about small efficiencies, say about 97% of the time. Premature optimisation is the root of all evil". So Knuth does care about low-level performance (very much so - see the detail in Art of Programming) and the 3% of code that will benefit (the whole application) from some attention and tuning. But yes, point taken: prioritise your own time over CPU time any day of the week. :-)Prowl
B
13

With this feature, if you have 1000 distinct String objects, all with the same content "abc", the JVM could make them share the same char[] internally. However, you still have 1000 distinct String objects.

With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It'll save space, as well as GC time.
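To make the "1000 objects vs. one object" point concrete, here is a minimal sketch that counts String objects by identity before and after interning. The class and method names (`InternIdentityDemo`, `distinctObjects`, etc.) are made up for illustration. It only shows object identity; it does not measure what G1 deduplication does to the backing arrays, which is not observable from plain Java code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical demo class, not part of any library.
final class InternIdentityDemo {

    /** Counts distinct String objects by identity (==), not by equals(). */
    static int distinctObjects(String[] strings) {
        Set<String> identities = Collections.newSetFromMap(new IdentityHashMap<>());
        Collections.addAll(identities, strings);
        return identities.size();
    }

    /** Builds n separate String objects that all contain "abc". */
    static String[] copies(int n) {
        String[] result = new String[n];
        for (int i = 0; i < n; i++) {
            result[i] = new String("abc"); // deliberately forces a fresh object
        }
        return result;
    }

    /** Returns a new array in which every element is replaced by its interned form. */
    static String[] interned(String[] strings) {
        String[] result = new String[strings.length];
        for (int i = 0; i < strings.length; i++) {
            result[i] = strings[i].intern();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] raw = copies(1000);
        System.out.println("before intern: " + distinctObjects(raw));           // 1000
        System.out.println("after intern:  " + distinctObjects(interned(raw))); // 1
    }
}
```

With deduplication alone, the first number stays at 1000 even if all those objects end up sharing one backing array.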

However, the performance of intern() isn't that great, last time I heard. You might be better off by having your own string cache, even using a ConcurrentHashMap ... but you need to benchmark it to make sure.
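A hand-rolled cache along those lines could look roughly like this. It is a sketch only: `StringCache` and `canonical` are hypothetical names, and the map holds strong references, so entries are never evicted (which may or may not be acceptable for your workload). No performance claim is implied; benchmark it against intern() on your own data.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper class: a simple, strongly-referenced string cache.
final class StringCache {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held by the cache. */
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(a == b);                                   // false
        System.out.println(cache.canonical(a) == cache.canonical(b)); // true
    }
}
```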

Bred answered 30/9, 2015 at 0:6 Comment(2)
are you aware of any other differences?Accuse
Actually, performance with String.intern is comparable to manual string pooling. Mikhail Vorontsov ran some performance benchmarks and showed that, with the -XX:StringTableSize parameter set to a sufficiently large prime, String.intern performed comparably to pooling the strings yourself. http://java-performance.info/string-intern-in-java-6-7-8/Ezara
P
6

As a comment references, do see: http://java-performance.info/string-intern-in-java-6-7-8/. It is a very insightful reference and I learned a lot; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application, so taking measurements on realistic input data is highly recommended!

The main factor probably depends on what you are in control over:

  • Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case to be made for using Serial GC (far lower total memory footprint for the process, think 400 MB vs ~1 GB for a moderately complex app, and much more willing to release memory, e.g. after a transient spike in usage). So you might pick that, or give your users the option. (If the heap remains small, the pauses should not be a big deal.)

  • Do you have full control over the code? The G1GC option is great for 3rd party libraries (and applications!) which you can't edit.

The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC can necessarily only de-duplicate their private char[] field.

A third consideration may be CPU usage, say if the impact on laptop battery life is a concern for your users. G1GC will run an extra thread dedicated to de-duplicating the heap. For example, I tried this when running Eclipse and found it caused an initial period of increased CPU activity after start-up (think 1-2 minutes), but it then settled on a smaller "in-use" heap with no obvious CPU overhead or slow-down thereafter (just eye-balling the task manager). So I imagine a certain % of a CPU core will be taken up by de-duplication (during? after?) periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run in serial, but then...)

You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:

  • really impact long-term heap usage, and
  • create a high proportion of duplicate strings

By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.
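As an illustration of what "selectively" might mean, the sketch below interns only the one field that is both long-lived and heavily duplicated, and leaves the mostly-unique field alone. The `LogEntry` class and its "LEVEL message" line format are invented for this example.

```java
// Hypothetical value class: only the low-cardinality field is interned.
final class LogEntry {
    final String level;   // few distinct values (INFO, WARN, ...): interned
    final String message; // mostly unique: not worth interning

    private LogEntry(String level, String message) {
        this.level = level;
        this.message = message;
    }

    /** Parses a line of the assumed form "LEVEL message text". */
    static LogEntry parse(String line) {
        int space = line.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected 'LEVEL message': " + line);
        }
        // substring() creates a new String per line; intern() collapses them
        // to one shared object per distinct level.
        String level = line.substring(0, space).intern();
        String message = line.substring(space + 1);
        return new LogEntry(level, message);
    }

    public static void main(String[] args) {
        LogEntry first = parse("WARN disk almost full");
        LogEntry second = parse("WARN cpu temperature high");
        System.out.println(first.level == second.level); // true: same object
    }
}
```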

And finally, a quick plug for the Guava utility: Interner, which:

Provides equivalent behavior to String.intern() for other immutable types

You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often. However, when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions do run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the JVM options. (And bonus: you don't need to tune the JVM options to scale to different input.)
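For reference, with Guava on the classpath this would typically go through `Interners.newWeakInterner()`. If you'd rather avoid the dependency, a weak-reference pool can be sketched in plain Java as below. This is an illustration only (the `WeakStringPool` name is made up, it is not Guava's implementation, and it uses coarse synchronization), so benchmark before relying on it.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical helper class: a weakly-referenced string pool.
final class WeakStringPool {
    // WeakHashMap holds keys weakly. The value must be weak too, because a
    // strong value pointing back at the key would keep every entry alive.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the pooled instance equal to s, adding s if none is alive. */
    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical == null) {
            pool.put(s, new WeakReference<>(s));
            canonical = s;
        }
        return canonical;
    }

    public static void main(String[] args) {
        WeakStringPool pool = new WeakStringPool();
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(pool.intern(a) == pool.intern(b)); // true while a is reachable
    }
}
```

Unlike a plain ConcurrentHashMap cache, entries here can be reclaimed once nothing else references the pooled string.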

Prowl answered 15/6, 2017 at 12:28 Comment(0)
B
3

I want to introduce another decision factor regarding the targeted audience:

  • For a system integrator whose system is composed of many different libraries/frameworks, with little ability to influence those libraries' internal development, string deduplication (-XX:+UseStringDeduplication) could be a quick win if memory is a problem. It will affect all the Strings in the JVM, but G1 will use only spare time to do it. You can even tune when a String becomes a deduplication candidate with another parameter (-XX:StringDeduplicationAgeThreshold).
  • For a developer profiling their own code, String.intern could be more interesting. A thoughtful review of the domain model is necessary to decide whether to call intern, and when. As a rule of thumb, you may use intern when you know the String will contain a limited set of values, like a kind of enumerated set (e.g. country name, month, day of week...).
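For the integrator case, the JVM would typically be started with something like `java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:StringDeduplicationAgeThreshold=3 ...` (3 being the default threshold). If you want to confirm at runtime what a deployment was actually launched with, a rough sketch is below. `DedupFlagCheck` and its methods are hypothetical names, and it only inspects the arguments reported by the RuntimeMXBean, so it tells you what was requested, not whether deduplication is actually saving memory.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper class for inspecting JVM launch arguments.
final class DedupFlagCheck {
    private static final String AGE_PREFIX = "-XX:StringDeduplicationAgeThreshold=";

    /** True if deduplication was requested; the last +/- occurrence wins. */
    static boolean dedupRequested(List<String> jvmArgs) {
        boolean requested = false; // the feature is off unless enabled
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+UseStringDeduplication")) {
                requested = true;
            } else if (arg.equals("-XX:-UseStringDeduplication")) {
                requested = false;
            }
        }
        return requested;
    }

    /** Returns the explicitly configured age threshold, if any. */
    static OptionalInt ageThreshold(List<String> jvmArgs) {
        OptionalInt result = OptionalInt.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(AGE_PREFIX)) {
                try {
                    result = OptionalInt.of(Integer.parseInt(arg.substring(AGE_PREFIX.length())));
                } catch (NumberFormatException ignored) {
                    // malformed value: keep whatever was seen before
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("dedup requested: " + dedupRequested(jvmArgs));
        System.out.println("age threshold:   " + ageThreshold(jvmArgs));
    }
}
```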
Bearce answered 18/3, 2016 at 11:33 Comment(0)
