How do I truncate a Java String
so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?
Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:
public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}
This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
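The single-4-byte-sequence claim is easy to check directly; here is a minimal sketch (the class name is mine) using U+1D11E, the musical G clef, which is one code point but two Java chars:

```java
import java.nio.charset.StandardCharsets;

public class SurrogateCheck {
    public static void main(String[] args) {
        // U+1D11E is represented in Java as the surrogate pair D834 DD1E
        String clef = "\uD834\uDD1E";
        byte[] utf8 = clef.getBytes(StandardCharsets.UTF_8);
        // Java's encoder emits one 4-byte sequence, not two 3-byte ones
        System.out.println(clef.length() + " chars -> " + utf8.length + " bytes");
        // prints "2 chars -> 4 bytes"
    }
}
```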
I haven't done a lot of testing on that code, but here are some preliminary tests:
private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}
public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);
    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);
    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);
    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}
Update: modified the code example; it now handles surrogate pairs.

A char in Java is a 16-bit code unit between U+0000 and U+FFFF. Surrogate pairs give you up to U+10FFFF. – Interspace

You should use CharsetEncoder; the simple getBytes() plus copy-as-many-bytes-as-fit approach can cut UTF-8 characters in half.
Something like this:
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length()
            + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
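A quick sketch of how the encoder approach behaves at the limit (the class name and test string are mine; the method body mirrors the one above, minus the println): when the output buffer fills up, the encoder stops on a character boundary rather than emitting a partial sequence.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderTrimDemo {
    // Let the encoder stop when the output buffer is full,
    // so no character is cut in half.
    static int truncateUtf8(String input, byte[] output) {
        ByteBuffer outBuf = ByteBuffer.wrap(output);
        CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
        CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
        utf8Enc.encode(inBuf, outBuf, true);
        return outBuf.position();
    }

    public static void main(String[] args) {
        byte[] out = new byte[2];              // room for at most 2 bytes
        // "aé" encodes as 'a' (1 byte) + 'é' (2 bytes); 'é' does not fit
        int n = truncateUtf8("a\u00e9", out);
        String truncated = new String(out, 0, n, StandardCharsets.UTF_8);
        System.out.println(n + " byte(s): " + truncated); // prints "1 byte(s): a"
    }
}
```

Note that decoding the result uses the returned byte count, not the buffer length: new String(out, 0, n, StandardCharsets.UTF_8).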
new String(output, 0, output.length - returnValue, CHARSET) – Lapsus

Can UTF-8 be replaced with whatever encoding the target column is defined with? – Obtect

Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string is fewer bytes than maxBytes.
/**
 * Truncates a string to the number of characters that fit in X bytes, avoiding multi-byte
 * characters being cut in half at the cut-off point. Also handles surrogate pairs, where
 * two characters in the string are actually one literal character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}
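To see the decoder trick in action, here is a self-contained sketch (class name and test strings are mine; the method body is copied from the answer). The key step is CodingErrorAction.IGNORE, which silently drops the trailing incomplete sequence instead of replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderTrimDemo {
    static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
        if (s == null) {
            return null;
        }
        Charset charset = Charset.forName("UTF-8");
        CharsetDecoder decoder = charset.newDecoder();
        byte[] sba = s.getBytes(charset);
        if (sba.length <= maxBytes) {
            return s;
        }
        ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
        CharBuffer cb = CharBuffer.allocate(maxBytes);
        // Drop the trailing incomplete character instead of replacing it
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        decoder.decode(bb, cb, true);
        decoder.flush(cb);
        return new String(cb.array(), 0, cb.position());
    }

    public static void main(String[] args) {
        // "aé" is 'a' (1 byte) + 'é' (2 bytes); a 2-byte limit must not split 'é'
        System.out.println(truncateToFitUtf8ByteLength("a\u00e9", 2)); // prints "a"
        System.out.println(truncateToFitUtf8ByteLength("a\u00e9", 3)); // prints "aé"
    }
}
```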
CharBuffer.allocate(maxBytes) allocates too much. Could it be CharBuffer.allocate(s.length())? – Codicodices

UTF-8 encoding has a neat trait that lets you see where in a byte sequence you are. Check the stream at the byte limit you want:
- If its high bit is 0, it's a single-byte char: just replace it with 0 and you're fine.
- If its high bit is 1 and so is the next bit, you're at the start of a multi-byte char: set that byte to 0 and you're good.
- If the high bit is 1 but the next bit is 0, you're in the middle of a character: travel back along the buffer until you hit a byte with two or more 1s in the high bits, and replace that byte with 0.
Example: If your stream is: 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C1, which is the start of a multi-byte char.
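The boundary rules above can be sketched as a small helper (the names are mine, not from the answer); instead of writing a 0 terminator into the buffer, it returns the largest safe cut length. The key observation is that continuation bytes all match the bit pattern 10xxxxxx:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Largest cut <= maxBytes that falls on a UTF-8 character boundary:
    // back up while the byte at the cut is a continuation byte (10xxxxxx).
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = maxBytes;
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--; // landed inside a multi-byte char; move back
        }
        return cut; // now at a lead byte or an ASCII byte, a valid boundary
    }

    public static void main(String[] args) {
        byte[] bytes = "ab\u00e912".getBytes(StandardCharsets.UTF_8); // 61 62 C3 A9 31 32
        System.out.println(safeCut(bytes, 3)); // would split C3 A9, backs up: prints 2
        System.out.println(safeCut(bytes, 4)); // boundary after A9: prints 4
    }
}
```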
You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
You can calculate the number of bytes without doing any conversion:

foreach character in the Java string
    if 0 <= character <= 0x7f
        count += 1
    else if 0x80 <= character <= 0x7ff
        count += 2
    else if 0x800 <= character <= 0xd7ff   // excluding the surrogate area
        count += 3
    else if 0xdc00 <= character <= 0xffff
        count += 3
    else {   // surrogate, a bit more complicated
        count += 4
        skip one extra character in the input stream
    }
You would have to detect surrogate pairs (U+D800–U+DBFF and U+DC00–U+DFFF) and count 4 bytes for each valid pair. If the first value falls in the first range and the second in the second range, all is well: skip them both and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
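The counting rules above can also be sketched with Java's code-point API, which does the surrogate pairing for you (this is my sketch, not the answerer's code; note it assumes a well-formed string, since getBytes() replaces unpaired surrogates rather than counting them):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // codePoints() combines valid surrogate pairs into single values
    // >= 0x10000, so the 4-byte case needs no explicit skip.
    static int utf8Length(String s) {
        return s.codePoints().map(cp ->
                cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4
        ).sum();
    }

    public static void main(String[] args) {
        String s = "a\u0080\u0800\uD834\uDD1E"; // 1 + 2 + 3 + 4 = 10 bytes
        System.out.println(utf8Length(s));      // prints 10
        // agrees with actually encoding the string
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // prints 10
    }
}
```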
Scanning from the tail end of the string is far more efficient than scanning from the beginning, especially on very long strings. So walen was on the right path; unfortunately that answer does not produce the correct truncation.
If you would like a solution that scans backwards over only a few bytes, this is the best option.
Using the data in billjamesdev's answer, we can scan backwards and correctly find the truncation point on a character boundary.
public static String utf8ByteTrim(String s, int requestedTrimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    int maxTrimSize = Integer.min(requestedTrimSize, bytes.length);
    int trimSize = maxTrimSize;
    if ((bytes[trimSize - 1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize - 1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--; // Get to the start of the UTF-8 character

        // Now see if that final UTF-8 character fits.
        // Assume the UTF-8 starts with binary 110xxxxx and is 2 bytes
        int numBytes = 2;
        if ((bytes[trimSize] & 0xF0) == 0xE0) {
            // If the UTF-8 starts with binary 1110xxxx it is 3 bytes
            numBytes = 3;
        } else if ((bytes[trimSize] & 0xF8) == 0xF0) {
            // If the UTF-8 starts with binary 11110xxx it is 4 bytes
            numBytes = 4;
        }
        if ((trimSize + numBytes) == maxTrimSize) {
            // The entire last UTF-8 character fits
            trimSize = maxTrimSize;
        }
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}
There is only one while loop, and it executes at most three iterations as it walks backward. A few if statements then determine whether the final character fits.
Some testing:
String test = "Aæ😂尝试"; // sizes: 1 + 2 + 4 + 3 + 3 = 13 bytes
IntStream.range(1, 16).forEachOrdered(i ->
        System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);
---
Size 1: A
Size 2: A
Size 3: Aæ
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ😂
Size 8: Aæ😂
Size 9: Aæ😂
Size 10: Aæ😂尝
Size 11: Aæ😂尝
Size 12: Aæ😂尝
Size 13: Aæ😂尝试
Size 14: Aæ😂尝试
Size 15: Aæ😂尝试