Since String in Java (like in other languages) consumes a lot of memory because each character takes two bytes, Java 8 has introduced a new feature called String Deduplication. It takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.
I have read this example so far, but since I am not a pro Java coder, I am having a hard time grasping the concept.
Here is what it says:
Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.
This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.
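To make the quoted description more concrete, here is a toy simulation in plain Java. It is only a sketch under assumptions, not the real HotSpot code: FakeString, DedupSketch and visit are hypothetical names, the weak references and the "stop checking after a while" limit from the quote are left out, and the real collector works on the private final array inside java.lang.String rather than on a wrapper class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the deduplication pass described above; NOT the real JVM code. */
final class DedupSketch {

    /** Hypothetical stand-in for java.lang.String: a wrapper around a char array. */
    static final class FakeString {
        char[] value; // in the real String this field is private and final

        FakeString(String s) {
            value = s.toCharArray();
        }
    }

    // Hypothetical lookup table: hash of the contents -> arrays seen so far.
    // The real implementation keeps weak references; plain references keep the sketch short.
    final Map<Integer, List<char[]>> table = new HashMap<>();

    /** Simulates the collector visiting one string; returns true if its array was replaced. */
    boolean visit(FakeString s) {
        int hash = Arrays.hashCode(s.value);
        List<char[]> bucket = table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (char[] candidate : bucket) {
            if (candidate == s.value) {
                return false; // already points at the shared array
            }
            // Same hash is not enough, so compare char by char.
            if (Arrays.equals(candidate, s.value)) {
                s.value = candidate; // the old array is now unreferenced and can be collected
                return true;
            }
        }
        bucket.add(s.value); // first time these contents are seen
        return false;
    }

    public static void main(String[] args) {
        DedupSketch gc = new DedupSketch();
        FakeString a = new FakeString("hello");
        FakeString b = new FakeString("hello");
        System.out.println(a.value == b.value); // false: two separate arrays
        gc.visit(a);
        gc.visit(b);
        System.out.println(a.value == b.value); // true: both share one array
    }
}
```

On a real JVM nothing has to be coded by hand. From Java 8 update 20 on, the feature is switched on with the flags -XX:+UseG1GC -XX:+UseStringDeduplication, and on Java 8 the flag -XX:+PrintStringDeduplicationStatistics prints how much memory was saved.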
My first question:
There is still a lack of resources on this topic, since it was only recently added in Java 8 update 20. Could anyone here share some practical examples of how it helps in reducing the memory consumed by String in Java?
Edit:
The above link says,
As soon as it finds another String which has the same hash code it compares them char by char
My second question:
If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once it is found that the two Strings have the same hash code?
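A small check may help with the second question. The sketch below uses the well-known pair "Aa" and "BB", which have the same hash code under the algorithm documented for String.hashCode() but different contents. The class name HashCollisionDemo and the helper collides are made up for this illustration.

```java
/** Shows that equal hash codes do not imply equal strings. */
final class HashCollisionDemo {

    /** True if the two strings share a hash code although their contents differ. */
    static boolean collides(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        System.out.println(a.hashCode()); // 2112  (65 * 31 + 97)
        System.out.println(b.hashCode()); // 2112  (66 * 31 + 66)
        System.out.println(a.equals(b));  // false
        System.out.println(collides(a, b)); // true
    }
}
```

So the hash code only works as a cheap filter: different hash codes prove that two strings differ, while equal hash codes still need the char by char comparison.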
There are 2³² == 4294967296 different hash codes but 65536²¹⁴⁷⁴⁸³⁶⁴⁸ == practically infinite different possible Strings. In other words, having the same hash code does not guarantee that the Strings are equal. You have to check that. Only the opposite is true: having different hash codes implies that the Strings are not equal. – Erine

[…] String combinations, i.e. 65536²¹⁴⁷⁴⁸³⁶⁴⁸ – Enchantment

A char is a 16-bit value, so it allows 2¹⁶ == 65536 combinations. A String is a sequence that has an int length, so it may have up to 2³¹ characters (2³¹, not 2³², because int is signed in Java but a String has a positive size), so the maximum String length is 2³¹ == 2147483648 (theoretically; the practical limit is a bit smaller). So a String can combine up to 2147483648 chars, each of which can have 65536 possible values, which makes 65536²¹⁴⁷⁴⁸³⁶⁴⁸ combinations (actually a bit more, as a String could also be shorter). – Erine

[…] n digit positions when there are m different digits, which allows mⁿ combinations, e.g. the decimal numbers from 000 to 999 allow 10³ combinations. For a String there are 65536 different “digits” (aka chars) at 2147483648 digit positions, so it’s 65536²¹⁴⁷⁴⁸³⁶⁴⁸. It’s only “slightly” more, as \0 and “end-of-String” are distinct in Java. Not that it matters, as it’s too large to imagine anyway. – Erine

[…] String that can be shorter. That's what I'm talking about. That's not really slightly more. – Trippet

∑n=0_31((2¹⁶)^(2^n)) – Trippet

[…] final and thus may get locally cached/treated as constant by threads as before. The change of the reference during the string de-duplication is a special action that lives outside the ordinary access rules. This can’t create race conditions, as it doesn’t matter whether threads are using the old array or the new array; both have identical contents. But keep in mind that the de-duplication is done by the garbage collector anyway. The garbage collector has to know which objects (including arrays) are referenced by live threads in the first place. – Erine
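The counting argument from the comments above can be tried on a small scale. The following is only an illustrative sketch with made-up names (StringCount, exactLength, upToLength): it counts sequences over an alphabet of m symbols, once for exactly n positions (mⁿ) and once for every length from 0 to n, which by the geometric series equals (mⁿ⁺¹ − 1) / (m − 1) for m > 1.

```java
import java.math.BigInteger;

/** Counts sequences over an alphabet of m symbols, for small m and n. */
final class StringCount {

    /** Number of sequences with exactly n positions: m^n. */
    static BigInteger exactLength(int m, int n) {
        return BigInteger.valueOf(m).pow(n);
    }

    /** Number of sequences of every length from 0 to n: m^0 + m^1 + ... + m^n. */
    static BigInteger upToLength(int m, int n) {
        BigInteger total = BigInteger.ZERO;
        BigInteger power = BigInteger.ONE; // m^0
        for (int k = 0; k <= n; k++) {
            total = total.add(power);
            power = power.multiply(BigInteger.valueOf(m));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(exactLength(10, 3));    // 1000 (000 to 999)
        System.out.println(upToLength(10, 3));     // 1111 (adds lengths 0, 1 and 2)
        System.out.println(exactLength(65536, 2)); // 4294967296
        System.out.println(upToLength(65536, 2));  // 4295032833
    }
}
```

For large n the total exceeds mⁿ by a factor of roughly m / (m − 1), which is about 1.000015 for m = 65536. Either way, the number of possible Strings is vastly larger than the 2³² possible hash codes.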