Zero-garbage large String deserialization in Java, Humongous object issue

I am looking for a way to deserialize a String from a byte[] in Java with as little garbage produced as possible. Because I am creating my own serializer and de-serializer, I have complete freedom to implement any solution on the server-side (i.e. when serializing data), and on the client-side (i.e. when de-serializing data).

I have managed to efficiently serialize a String without incurring any garbage overhead by iterating through the String's chars (String.charAt(i)) and converting each char (a 16-bit value) into two 8-bit values. There is a nice debate regarding this here. An alternative is to use Reflection to access String's underlying char[] directly, but this is outside the scope of the problem.
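A minimal sketch of the charAt-based serialization described above. The class name, method name, and the high-byte-first order are assumptions made for illustration; the post does not say which byte order is used. Nothing is allocated as long as the caller reuses the destination buffer.

```java
final class StringBytes {
    private StringBytes() {}

    /**
     * Writes every char of s into dst as two bytes, high byte first (assumed order),
     * starting at offset. Returns the index just past the last byte written.
     */
    static int writeChars(String s, byte[] dst, int offset) {
        int length = s.length();
        if (offset < 0 || (long) dst.length - offset < 2L * length) {
            throw new IllegalArgumentException("destination buffer too small");
        }
        int pos = offset;
        for (int i = 0; i < length; i++) {
            char c = s.charAt(i);
            dst[pos++] = (byte) (c >>> 8); // upper 8 bits
            dst[pos++] = (byte) c;         // lower 8 bits
        }
        return pos;
    }
}
```

The returned index lets several strings be written back to back into the same preallocated buffer.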

However, it seems impossible for me to deserialize the byte[] without creating the char[] twice, which seems, well, weird.

The procedure:

  1. Create char[]
  2. Iterate through byte[] and fill-in the char[]
  3. Create String with String(char[]) constructor
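The three steps above could look roughly like this. It is a sketch only: the class and method names are made up, and the high-byte-first order is an assumption. The point is the last line, where the public constructor makes its own copy of the data, so the temporary char[] becomes garbage immediately.

```java
final class NaiveStringDecoder {
    private NaiveStringDecoder() {}

    /** Decodes charCount chars (two bytes each, high byte first) starting at offset. */
    static String readChars(byte[] src, int offset, int charCount) {
        if (offset < 0 || charCount < 0 || (long) src.length - offset < 2L * charCount) {
            throw new IllegalArgumentException("source buffer too small");
        }
        char[] chars = new char[charCount];          // step 1: temporary array
        int pos = offset;
        for (int i = 0; i < charCount; i++) {        // step 2: fill it from the bytes
            int hi = src[pos++] & 0xFF;
            int lo = src[pos++] & 0xFF;
            chars[i] = (char) ((hi << 8) | lo);
        }
        return new String(chars);                    // step 3: constructor copies the content
    }
}
```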

Because of Java's String immutability rules, the constructor copies the char[], creating 2x GC overhead. I can always use mechanisms to circumvent this (Unsafe String allocation + Reflection to set the char[] instance), but I just wanted to ask if there are any consequences to this other than me breaking every convention on String's immutability.

Of course, the wisest response to this would be "come on, stop doing this and have trust in GC, the original char[] will be extremely short-lived and G1 will get rid of it momentarily", which actually makes sense if the char[] is smaller than 1/2 of G1's region size. If it is larger, the char[] will be allocated directly as a humongous object (i.e. placed in its own contiguous regions instead of going through the normal young-generation allocation path). Such objects are extremely hard to be efficiently garbage collected in G1. That's why each allocation matters.
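To get a feel for when a char[] crosses that limit, here is a rough back-of-the-envelope check. The 16-byte array header and 8-byte alignment are assumptions (typical for 64-bit HotSpot, but not guaranteed), the names are hypothetical, and how the exact boundary case is treated is a guess. The region size is whatever G1 chose or was given via -XX:G1HeapRegionSize.

```java
final class HumongousCheck {
    private HumongousCheck() {}

    /** Assumed array header size in bytes; depends on the JVM and its settings. */
    static final long ASSUMED_ARRAY_HEADER_BYTES = 16;

    /** Estimated heap footprint of a char[] of the given length, rounded up to 8 bytes. */
    static long charArrayBytes(int length) {
        long raw = ASSUMED_ARRAY_HEADER_BYTES + 2L * length;
        return (raw + 7) & ~7L;
    }

    /** True if the estimated footprint exceeds half of the given G1 region size. */
    static boolean isLikelyHumongous(int charCount, long regionSizeBytes) {
        return charArrayBytes(charCount) > regionSizeBytes / 2;
    }
}
```

With a 1 MB region, for example, a string of one million chars needs roughly 2 MB for its char[] and is far above the half-region limit.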

Any ideas on how to tackle the issue?

Many thanks.

Coniah answered 21/1, 2015 at 9:57 Comment(3)
Have you considered simply not working with Strings and just serializing the raw byte data and doing character set conversions on subsections when it's absolutely necessary? - Epiphora
I have. My idea was to create a new class MutableString, and implement a lot of traditionally garbage-heavy operations over it (fastpath String split, for instance), and then have a method toString(from, to) which creates a "view" instance which is of type String. I could do that. But this would require us to completely refactor our application and to use MutableStrings everywhere possible. It is a nice idea, but I wanted to explore alternatives first. - Coniah
Are you aware that all these things already exist? There are CharBuffer and StringBuilder, both being a kind of mutable String (unless you've created an immutable view), there are methods for creating lightweight subsequences of them and they all implement CharSequence, the interface on which the regex package, which actually implements the split operation, works. And while it looks like character content is copied all the time when converting between Strings, CharBuffers and StringBuilders when looking at the source code, HotSpot has special optimizations for them… - Brod
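As an illustration of the CharBuffer idea from the last comment, the serialized bytes can be viewed as a CharSequence without copying them into a char[] at all. This is only a sketch with a hypothetical helper name, and it assumes the bytes were written high byte first (the default byte order of a ByteBuffer is big-endian). It still allocates two small wrapper objects, but their size does not grow with the text.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class CharBufferView {
    private CharBufferView() {}

    /** Views 2 * charCount bytes starting at offset as chars, sharing the byte[] content. */
    static CharBuffer view(byte[] src, int offset, int charCount) {
        return ByteBuffer.wrap(src, offset, 2 * charCount).asCharBuffer();
    }
}
```

The result can be passed to anything that accepts a CharSequence, for example Pattern.matcher(...), and subSequence(...) on it returns another view rather than a copy.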
C
1

I have found a solution, which is useless, if you have an unmanaged environment.

The java.lang.String class has a package-private constructor String(char[] value, boolean share).

Source:

/*
 * Package private constructor which shares value array for speed.
 * this constructor is always expected to be called with share==true.
 * a separate constructor is needed because we already have a public
 * String(char[]) constructor that makes a copy of the given char[].
 */
String(char[] value, boolean share) {
    // assert share : "unshared not supported";
    this.value = value;
}

This constructor is used extensively within the JDK itself, e.g. in Integer.toString(), Long.toString(), String.concat(String), String.replace(char, char), String.valueOf(char).

The solution (or hack, whatever you want to call it) is to move your own class into the java.lang package so that it can access the package-private constructor. This will not sit well with the security manager, but this can be circumvented.

Coniah answered 22/1, 2015 at 10:34 Comment(1)
Instead of moving a class into the package you could probably just access the constructor via reflection and then build a method handle/lambda for the constructor method to avoid calling overheads. - Epiphora
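A sketch of what the reflection plus method handle approach from the comment might look like. The class and method names are hypothetical. It assumes the String(char[], boolean) constructor quoted above is present and can be made accessible; a security manager may refuse that, and on newer JDKs, where String is backed by a byte[], the lookup is expected to fail. In those cases this sketch simply falls back to the copying constructor.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Constructor;

final class SharedStringFactory {
    private SharedStringFactory() {}

    /** Handle to String(char[], boolean), or null if it is missing or inaccessible. */
    private static final MethodHandle SHARING_CTOR = lookupSharingConstructor();

    private static MethodHandle lookupSharingConstructor() {
        try {
            Constructor<String> ctor =
                    String.class.getDeclaredConstructor(char[].class, boolean.class);
            ctor.setAccessible(true); // can be rejected at runtime
            return MethodHandles.lookup().unreflectConstructor(ctor);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return null; // constructor not available on this runtime
        }
    }

    static boolean isSharingAvailable() {
        return SHARING_CTOR != null;
    }

    /** Wraps chars without copying when possible; otherwise copies like new String(chars). */
    static String wrap(char[] chars) {
        if (SHARING_CTOR == null) {
            return new String(chars);
        }
        try {
            return (String) SHARING_CTOR.invokeExact(chars, true);
        } catch (RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

When sharing is in effect, the caller must never touch the array again, because any later write would change the contents of a supposedly immutable String.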
E
4

"Such objects are extremely hard to be efficiently garbage collected in G1."

This may not be true any longer, but you will have to evaluate it for your own application. JDK Bugs 8027959 and 8048179 introduce new mechanisms for collecting humongous, short-lived objects. According to the bug flags you might have to run with JDK versions ≥8u40 and ≥8u60 respectively to reap their benefits.

Experimental option of interest:

-XX:+G1ReclaimDeadHumongousObjectsAtYoungGC

Tracing:

-XX:+G1TraceReclaimDeadHumongousObjectsAtYoungGC

For further advice and questions regarding those features I would recommend hitting the hotspot-gc-use mailing list.
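One way to evaluate this for your own setup is a small allocation loop that is run once with and once without the options above, comparing the GC counts and times. This is only a rough sketch with made-up names and sizes, not a proper benchmark: it allocates short-lived char[]s that are large enough to be humongous for common region sizes and then prints what the collectors report.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class HumongousChurn {
    private HumongousChurn() {}

    /** Allocates short-lived char[]s (charCount >= 1) and returns a checksum of the work done. */
    static long churn(int iterations, int charCount) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            char[] chars = new char[charCount]; // garbage right after this iteration
            chars[charCount - 1] = (char) i;
            checksum += chars[charCount - 1] + chars.length;
        }
        return checksum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = churn(2_000, 4 * 1024 * 1024); // about 8 MB per array
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + " elapsedMs=" + elapsedMs);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```

The array size and iteration count should be replaced with values that resemble the strings your application actually deserializes.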

Epiphora answered 21/1, 2015 at 10:55 Comment(0)
G
0

Found a working solution using a simple "secret" class that ships with the JDK:

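// Assumed imports, not stated in the answer: Apache Commons Lang and the JDK 8 location of SharedSecrets.
import org.apache.commons.lang3.StringUtils;
import sun.misc.SharedSecrets;
String longString = StringUtils.repeat("bla", 1000000);
char[] longArray = longString.toCharArray();
String fastCopiedString = SharedSecrets.getJavaLangAccess().newStringUnsafe(longArray);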
Georgenegeorges answered 15/7, 2018 at 19:3 Comment(0)
