Easy and fast solution
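In UTF-8, the first byte of each encoded character determines how many bytes that character occupies (see "UTF-8" on Wikipedia), so the encoded array can be walked one whole character at a time.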
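import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public static String substring(String text, int bytes) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    int length = 0; // byte offset of the last complete character that still fits
    for (int i = 0; i < utf8.length; ) {
        // Advance i past one whole character, based on its lead byte.
        if ((utf8[i] & 0b1111_1000) == 0b1111_0000) {
            i += 4; // 11110xxx: 4-byte sequence
        } else if ((utf8[i] & 0b1111_0000) == 0b1110_0000) {
            i += 3; // 1110xxxx: 3-byte sequence
        } else if ((utf8[i] & 0b1110_0000) == 0b1100_0000) {
            i += 2; // 110xxxxx: 2-byte sequence
        } else {
            i += 1; // 0xxxxxxx: ASCII
        }
        if (bytes < i) {
            break; // this character would exceed the byte limit
        }
        length = i;
    }
    return new String(Arrays.copyOfRange(utf8, 0, length), StandardCharsets.UTF_8);
}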
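The lead-byte rule that the loop relies on can be isolated into a small helper. This is only a sketch: the class name `Utf8LeadByte` and the method `sequenceLength` are hypothetical, and it assumes it is handed a valid lead byte (a continuation byte would simply fall through to 1).

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByte {
    // Hypothetical helper: length in bytes of the UTF-8 sequence starting with `lead`.
    // Assumes `lead` really is a lead byte; continuation bytes (10xxxxxx) return 1.
    public static int sequenceLength(byte lead) {
        if ((lead & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        if ((lead & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((lead & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        return 1;                                          // 0xxxxxxx (ASCII)
    }

    public static void main(String[] args) {
        // U+0052, U+00FC, U+2705, U+1F389 (the last one written as a surrogate pair)
        String[] samples = {"R", "\u00fc", "\u2705", "\uD83C\uDF89"};
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            // The length predicted from the first byte should match the encoded length.
            System.out.println(sequenceLength(utf8[0]) + " predicted, " + utf8.length + " encoded");
        }
    }
}
```

For well-formed input, the value predicted from the first byte should equal the full encoded length of the character, which is why the loop can skip the continuation bytes without inspecting them.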
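Alternatively, the UTF-8 byte count can be computed directly from the string's UTF-16 chars, without encoding the string first.
(Be careful with surrogate pairs: the two chars of a pair together represent one character that takes 4 bytes in UTF-8.)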
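public static String substring(String text, int bytes) {
    int endIndex = 0; // char index just past the last character that still fits
    int utf8 = 0;     // running UTF-8 byte count
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (Character.isHighSurrogate(c)) {
            continue; // first half of a pair; counted when the low surrogate is seen
        } else if (Character.isLowSurrogate(c)) {
            utf8 += 4; // a surrogate pair encodes as 4 bytes
        } else if (c <= 0x007F) {
            utf8 += 1;
        } else if (c <= 0x07FF) {
            utf8 += 2;
        } else {
            utf8 += 3;
        }
        if (bytes < utf8) {
            break; // this character would exceed the byte limit
        }
        endIndex = i + 1;
    }
    return text.substring(0, endIndex);
}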
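One way to sidestep the surrogate-pair bookkeeping is to iterate over code points instead of chars. The following is a minimal sketch, not a drop-in replacement: the class name `Utf8CodePointTruncate`, the method name `truncate`, and the parameter `maxBytes` are hypothetical, while `String.codePointAt` and `Character.charCount` are standard Java APIs. It assumes well-formed input; a lone surrogate is counted as 3 bytes here, which may overestimate what Java's encoder actually emits, so the result should still stay within the limit.

```java
public class Utf8CodePointTruncate {
    // Hypothetical variant: same contract as substring(text, bytes), iterating by code point.
    public static String truncate(String text, int maxBytes) {
        int endIndex = 0; // char index just past the last code point that fits
        int utf8 = 0;     // running UTF-8 byte count
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // reads a full surrogate pair if present
            if (cp <= 0x7F) {
                utf8 += 1;
            } else if (cp <= 0x7FF) {
                utf8 += 2;
            } else if (cp <= 0xFFFF) {
                utf8 += 3;
            } else {
                utf8 += 4; // supplementary code point (two chars in UTF-16)
            }
            if (utf8 > maxBytes) {
                break;
            }
            i += Character.charCount(cp); // 1 or 2 chars, depending on the code point
            endIndex = i;
        }
        return text.substring(0, endIndex);
    }
}
```

Because `codePointAt` returns the combined code point for a valid pair, the high and low surrogate are never seen separately, so no `continue` branch is needed.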
For example, truncating to at most 10 bytes:
Input : "Rückruf ins Ausland"
Output: "Rückruf i" (10 bytes)
Input : "Rückruf"
Output: "Rückruf" (8 bytes)
Input : "123456789ü"
Output: "123456789" (9 bytes)
Input : "Rüüüüüü"
Output: "Rüüüü" (9 bytes)
Input : "✅✅✅✅✅"
Output: "✅✅✅" (9 bytes)
Input : "🎉🎉🎉🎉🎉"
Output: "🎉🎉" (8 bytes)
Note: ü is 2 bytes, ✅ is 3 bytes, and 🎉 is 4 bytes in UTF-8.
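The examples above can be reproduced with a small harness like the one below. It is a sketch under a few assumptions: the class name `TruncateExamples` is made up, the byte-based method is repeated (in compact form) only so the block is self-contained, and the non-ASCII inputs are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncateExamples {
    // Byte-based approach from this answer, repeated here for self-containment.
    public static String substring(String text, int bytes) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int length = 0;
        for (int i = 0; i < utf8.length; ) {
            if ((utf8[i] & 0b1111_1000) == 0b1111_0000) i += 4;
            else if ((utf8[i] & 0b1111_0000) == 0b1110_0000) i += 3;
            else if ((utf8[i] & 0b1110_0000) == 0b1100_0000) i += 2;
            else i += 1;
            if (bytes < i) break;
            length = i;
        }
        return new String(Arrays.copyOfRange(utf8, 0, length), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "R\u00fcckruf ins Ausland",                                     // U+00FC, 2 bytes
            "R\u00fcckruf",
            "123456789\u00fc",
            "R\u00fc\u00fc\u00fc\u00fc\u00fc\u00fc",
            "\u2705\u2705\u2705\u2705\u2705",                               // U+2705, 3 bytes each
            "\uD83C\uDF89\uD83C\uDF89\uD83C\uDF89\uD83C\uDF89\uD83C\uDF89"  // U+1F389, 4 bytes each
        };
        for (String in : inputs) {
            String out = substring(in, 10);
            // Re-encode the result to report how many bytes were actually kept.
            int kept = out.getBytes(StandardCharsets.UTF_8).length;
            System.out.println("Input : \"" + in + "\"");
            System.out.println("Output: \"" + out + "\" (" + kept + " bytes)");
        }
    }
}
```

Whether the non-ASCII characters display correctly depends on the console's encoding; the reported byte counts do not.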