How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?

7

49

How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Foofaraw answered 23/9, 2008 at 6:3 Comment(0)
35

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.

I haven't done a lot of testing on that code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: modified the code example; it now handles surrogate pairs.
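
To illustrate the point about surrogate pairs, here is a small sketch (the class and helper names are made up for this example) showing that one supplementary code point occupies two Java chars but encodes to a single 4-byte UTF-8 sequence:

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {
    /** Hypothetical helper: number of bytes the standard UTF-8 encoder produces for s. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two Java chars
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        System.out.println(utf8Length(clef));                      // 4
    }
}
```

This is why the loop above adds 4 bytes and skips one extra char when it sees a surrogate, rather than counting 3 bytes for each half.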

Interspace answered 23/9, 2008 at 7:30 Comment(9)
UTF-8 can encode any UCS2 character in 3 bytes or less. Check that page you reference. However, if you want to comply with UCS4 or UTF16 (which can both reference the entire charset), you'll need to allow for up to 6-byte characters in UTF8.Eleen
Bill: see the CESU-8 discussion on the wikipedia page. My understanding is UTF-8 is supposed to encode surrogate pairs as a single 4-byte sequence, not two 3-byte sequences.Interspace
It's not 2 three-byte, it's up to 1 6-byte sequence to store UCS4, which is a full 31-bit character, not 2 16-bit "pairs" (that's UTF16). A 6-byte seq = 1111110C 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC where the C's are data bits. Right now, only enough chars are in use to need 4 bytes.Eleen
But 8 years ago, more than 16-bits wasn't even necessary. Expect to see 5-byte chars in the next decade as more dialects and "Klingon"-type language planes are added.Eleen
Bill: you are correct, my code does not handle code points above U+10FFFF -- which is where more than 4 UTF-8 bytes are required. But Java can't encode characters past U+10FFFF anyway. Each char in Java is a 16 bit codepoint between U+0000 and U+FFFF. Surrogate pairs give you up to U+10FFFF.Interspace
Well, then, it would seem my solution is in excess. Didn't know that about Java's character (my I18n work was done for EQ in C++). Nice chat. :)Eleen
That won’t work for graphemes. It’s just as bad to truncate a partial grapheme as it is to truncate a partial character.Mayday
@Mayday well, it actually isn't quite as bad because software won't choke on trying to decode themAnalphabetic
Does this really need to be O(N)? Why not truncate the bytes, then look at the last 4 to figure out if you cut off in the middle of a unicode character, so that it's O(1).Kingofarms
26

You should use CharsetEncoder; the simple approach of getBytes() plus copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

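import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Encodes as much of input as fits in output and returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input);

    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    // The encoder stops before any character that does not fit completely
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}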
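
If you want the truncated String back rather than a byte count, the same idea can be sketched like this. The class name and the (String, int) signature are my own, not part of the answer above, and it assumes the input contains no unpaired surrogates (the encoder stops at malformed input):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    /** Hypothetical helper: longest prefix of input whose UTF-8 form fits in maxBytes. */
    static String truncateUtf8(String input, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // Encoding stops (overflow) before any character that does not fit whole
        enc.encode(in, out, true);
        // in.position() is the number of chars that were actually encoded
        return input.substring(0, in.position());
    }
}
```

For example, truncateUtf8("a\u0800b", 3) gives "a", because the 3-byte character would need 4 bytes in total together with the leading "a".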
Queenhood answered 23/9, 2008 at 6:11 Comment(3)
This worked great for me -- probably less efficient, but much harder to get wrong, and it works for any character set. Works nicely with a quick new String(output, 0, returnValue, CHARSET)Lapsus
@sigget's solution is similar and in addition returns the actual truncated string, instead of just the lengthCodicodices
If this was, let's say for Oracle, shouldn't the UTF-8 be replaced with whatever encoding is the target column defined with?Obtect
25

Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string is fewer bytes than maxBytes.

/**
 * Truncates a string to the number of characters that fit in X bytes, avoiding multi-byte characters being cut in
 * half at the cut-off point. Also handles surrogate pairs, where 2 chars in the string are actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}
Collen answered 2/2, 2016 at 9:4 Comment(2)
CharBuffer.allocate(maxBytes) allocates too much. Could it be CharBuffer.allocate(s.length())?Codicodices
s.length() is not in bytes.Bonaparte
10

UTF-8 encoding has a neat trait that allows you to see where in a byte-set you are.

Check the byte in the stream at the byte limit you want.

  • If its high bit is 0, it's a single-byte char, just replace it with 0 and you're fine.
  • If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
  • If the high bit is 1 but the next bit is 0, then you're in the middle of a character, travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.

Example: If your stream is: 31 33 31 C2 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C2, which is the start of a multi-byte char.
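
The three rules above could be sketched in Java roughly as follows. The class and method names are hypothetical, the sketch returns the cut index instead of writing a 0 terminator, and it assumes the buffer holds well-formed UTF-8. It uses the bytes 31 33 31 C2 A3 32 33, where C2 A3 is a two-byte character:

```java
class Utf8Boundary {
    /**
     * Hypothetical helper: largest cut index <= limit that does not split a character.
     * Assumes buf holds well-formed UTF-8.
     */
    static int safeCut(byte[] buf, int limit) {
        if (limit >= buf.length) {
            return buf.length; // everything fits
        }
        if (limit <= 0) {
            return 0;
        }
        int cut = limit;
        // Continuation bytes look like 10xxxxxx; back up until buf[cut] starts a character
        while (cut > 0 && (buf[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    public static void main(String[] args) {
        byte[] buf = {0x31, 0x33, 0x31, (byte) 0xC2, (byte) 0xA3, 0x32, 0x33};
        System.out.println(safeCut(buf, 4)); // 3: a limit of 4 would split C2 A3
        System.out.println(safeCut(buf, 5)); // 5: the two-byte character fits
    }
}
```

The loop backs up at most 3 bytes on well-formed input, since a UTF-8 character has at most 3 continuation bytes.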

Eleen answered 23/9, 2008 at 6:7 Comment(7)
java.sun.com/j2se/1.5.0/docs/api/java/io/… explains the modified UTF-8 encoding used by Java and demonstrates why this answer is correct.Impiety
BTW, this solution (the one by @Bill James) is much more efficient than the currently accepted answer by @Matt Quail, because the former requires you to test 3 bytes at the most, whereas the latter requires you to test all characters in the text.Impiety
Alexander: the former requires you to first convert the string to UTF8, which requires iterating over all the characters in the text.Interspace
True, but the question does state "Once it is UTF-8 encoded". Presumably that price has been paid.Eleen
@Alexander: That’s because they screwed up. That’s just trying to paper over the blunder. Surrogate pairs HAVE NO BUSINESS IN UTF-8!Mayday
There's a special case that I think should be considered: We might actually be at the last byte of a multi-byte character (I guess we would have to look at the next byte to find out whether this is the case). In that case we should not go back (and thereby trim 1 character too many), but just stay where we are.Frantz
@Frantz that case would be solved by the 2nd rule above when you process the next byte... you still need to do so in order to place the termination byte (0).Eleen
8

You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
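
A hedged sketch of that one-liner with the length clamped, so it does not throw when the string is shorter than the limit. The class and method names are invented here. Note that it can still cut a multi-byte character in half, in which case the decoder substitutes U+FFFD for the partial bytes:

```java
import java.nio.charset.StandardCharsets;

class NaiveCut {
    /** Hypothetical helper: the one-liner with the length clamped. May still split a character. */
    static String naiveTruncate(String data, int maxLen) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        // Clamp so that new String(...) never reads past the end of the array
        int len = Math.max(0, Math.min(maxLen, bytes.length));
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(naiveTruncate("abc", 300));                    // abc
        System.out.println(naiveTruncate("a\u00e9", 2).equals("a\uFFFD")); // true
    }
}
```

In the second case the result re-encodes to 4 bytes (U+FFFD takes 3), which is more than the 2-byte limit, so this approach only suits input where the cut cannot land inside a character.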

Tampere answered 24/10, 2018 at 6:53 Comment(3)
Although your solution looked the best this code gives me a StringIndexOutOfBoundsException: String index out of range: 300: String str = "kt on ivp (day 3) - part 2 - 19 haziran 2018 sal_ 11.36.21.mp4."; System.out.println("Len is " + str.getBytes(StandardCharsets.UTF_8.name()).length); String finalTitle = new String(str.getBytes(StandardCharsets.UTF_8.name()), 0, Constants.MAX_TITLE_LENGTH, StandardCharsets.UTF_8.name()); Constants.MAX_TITLE_LENGTH is 300. @Suresh Gupta Do you know why?Endearment
First, check your string's encoded length: if it is less than your max limit, this will throw an exception. Make sure the encoded length is greater than your max limit before truncating.Tampere
the original question was literally how to perform that check optimally.Eleen
3

You can calculate the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

You would have to detect surrogate pairs (a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF) and count 4 bytes for each valid pair. If the first value is in the first range and the second is in the second range, all is well: skip them both and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
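
A possible Java rendering of the pseudocode above, as a sketch only. The class and method names are made up. For the invalid case it assumes you want to match String.getBytes, which replaces an unpaired surrogate with a single '?' byte:

```java
import java.nio.charset.StandardCharsets;

class Utf8Count {
    /** Hypothetical helper: UTF-8 length of s, computed without encoding it. */
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // assumption: unpaired surrogate becomes '?' as in String.getBytes
            } else {
                count += 3; // rest of the BMP
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\u0080\u0800\uD834\uDD1E";
        System.out.println(utf8Length(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 10
    }
}
```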

Tart answered 23/9, 2008 at 7:47 Comment(0)
0

Scanning from the tail end of the string is far more efficient than scanning from the beginning, especially on very long strings. So walen was on the right path; unfortunately, that answer does not provide the correct truncation.

If you would like a solution that scans backwards only a few characters, this is the best option.

Using the data in billjamesdev's answer we can effectively scan backwards and correctly get the truncation on a character boundary.

public static String utf8ByteTrim(String s, int requestedTrimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
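    // Clamp to [0, bytes.length] so empty input and non-positive sizes are safe
    int maxTrimSize = Math.max(0, Integer.min(requestedTrimSize, bytes.length));
    int trimSize = maxTrimSize;
    if (trimSize > 0 && (bytes[trimSize-1] & 0x80) != 0) { // inside a multibyte sequence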
        while ((bytes[trimSize - 1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--;  // Get to the start of the UTF-8
        // Now see if that final UTF-8 character fits.
        // Assume the UTF-8 starts with binary 110xxxxx and is 2 bytes
        int numBytes = 2;  
        if ((bytes[trimSize] & 0xF0) == 0xE0) {
            // If the UTF-8 starts with binary 1110xxxx it is 3 bytes
            numBytes = 3;
        } else if ((bytes[trimSize] & 0xF8) == 0xF0) {
            // If the UTF-8 starts with binary 11110xxx it is 4 bytes
            numBytes = 4;
        }
        if( (trimSize + numBytes) == maxTrimSize)  {
            // The entire last UTF-8 character fits
            trimSize = maxTrimSize; 
        }
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}

There is only one while loop that will execute at most 3 iterations as it walks backward. Then a few if statements will determine which character to truncate.

Some testing:

String test = "Aæ😂尝试"; // Sizes: (1,2,4,3,3) = 13 bytes
IntStream.range(1, 16).forEachOrdered(i ->
        System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);

---

Size 1: A
Size 2: A
Size 3: Aæ
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ😂
Size 8: Aæ😂
Size 9: Aæ😂
Size 10: Aæ😂尝
Size 11: Aæ😂尝
Size 12: Aæ😂尝
Size 13: Aæ😂尝试
Size 14: Aæ😂尝试
Size 15: Aæ😂尝试
Kif answered 21/5, 2023 at 0:56 Comment(0)
