How to do substring for UTF8 string in java?
A

9

7

Suppose I have the following string: Rückruf ins Ausland. I need to insert it into a database column with a maximum size of 10. I did a normal substring in Java, which extracted the string "Rückruf in" (10 characters). When I try to insert this value, I get the following Oracle error:

java.sql.SQLException: ORA-12899: value too large for column "WAEL"."TESTTBL"."DESC" (actual: 11, maximum: 10)

The reason for this is that the database uses the AL32UTF8 character set, so the ü takes 2 bytes.

I need to write a function in Java that does this substring while taking into account that the ü takes 2 bytes, so the returned substring in this case should be "Rückruf i" (9 chars, 10 bytes). Any suggestions?

Amorphous answered 16/7, 2015 at 13:32 Comment(1)
Perhaps using character length semantics for defining column length could be an option.Lecithinase
M
1

If you want to trim the data in Java, you must write a function that trims the string according to the database charset, something like this test case:

package test;

import java.io.UnsupportedEncodingException;

public class TrimField {

    public static void main(String[] args) {
        //UTF-8 is the db charset
        System.out.println(trim("Rückruf ins Ausland",10,"UTF-8"));
        System.out.println(trim("Rüückruf ins Ausland",10,"UTF-8"));
    }

    public static String trim(String value, int numBytes, String charset) {
        do {
            byte[] valueInBytes = null;
            try {
                valueInBytes = value.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
            if (valueInBytes.length > numBytes) {
                value = value.substring(0, value.length() - 1);
            } else {
                return value;
            }
        } while (value.length() > 0);
        return "";

    }

}
Mars answered 16/7, 2015 at 13:46 Comment(2)
This might cut between Unicode code points that together form a single grapheme cluster. For example, if the ü was represented in the decomposed form U+0075 U+0308 (i.e. "u" and the combining diaeresis), then this could keep one "half" of the character and drop the other, turning ü into u. It could also cut between the two halves of a surrogate pair, which would be equally problematic.Norbertonorbie
If the bound is a surrogate pair like 🎉, this will not give the correct result.Cenesthesia
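The grapheme-cluster concern raised in the comments above could be handled roughly as follows. This is only a sketch, not part of the original answer: the class name `GraphemeTrim` and the method `trimToBytes` are made up for illustration, and it assumes that the boundaries reported by `java.text.BreakIterator` are good enough for your data (how complex emoji sequences are grouped depends on the JDK version).

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: trims on user-perceived character boundaries.
final class GraphemeTrim {

    static String trimToBytes(String s, int maxBytes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);

        int end = 0;    // char index up to which the string still fits
        int total = 0;  // UTF-8 bytes used so far
        int start = it.first();
        for (int next = it.next(); next != BreakIterator.DONE; start = next, next = it.next()) {
            // Size of one whole cluster (base character plus combining marks, or a surrogate pair).
            int size = s.substring(start, next).getBytes(StandardCharsets.UTF_8).length;
            if (total + size > maxBytes) {
                break; // the whole cluster does not fit, so drop it entirely
            }
            total += size;
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "u" followed by U+0308 (combining diaeresis) is kept or dropped as a unit.
        System.out.println(trimToBytes("Ru\u0308ckruf ins Ausland", 10));
    }
}
```

With a decomposed ü, a 3-byte limit on "Rü" keeps only "R" instead of leaving a bare "u" behind.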
K
2

You can get the encoded length of a String in Java by converting the string to a byte array.

As an example see the code below:

System.out.println("Rückruf i".length()); // prints 9
System.out.println("Rückruf i".getBytes().length); // prints 10 if the default charset is UTF-8

If the default charset is not UTF-8, replace the code with:

System.out.println("Rückruf i".length()); // prints 9
System.out.println("Rückruf i".getBytes("UTF-8").length); // prints 10

If needed, you can replace UTF-8 with any other charset to get the length of the string in that charset.
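To illustrate how the byte count depends on the charset, here is a small sketch. The class `ByteLengths` and the helper `byteLength` are names invented for this example; the string is written with a `\u00FC` escape so that the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ByteLengths {

    // Number of bytes the string occupies when encoded with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "R\u00FCckruf i"; // "Rückruf i"
        System.out.println(s.length());                                 // 9 chars
        System.out.println(byteLength(s, StandardCharsets.UTF_8));      // 10 bytes
        System.out.println(byteLength(s, StandardCharsets.ISO_8859_1)); // 9 bytes
        System.out.println(byteLength(s, StandardCharsets.UTF_16BE));   // 18 bytes
    }
}
```

Passing the `Charset` explicitly also avoids the dependency on the platform default charset that the commenters ran into.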

Kaput answered 16/7, 2015 at 13:39 Comment(6)
Yes, but what if the underlying database character set will change to something else ?Disestablish
The number of bytes may change depending on the encoding used. So this isn't universalPennipennie
This is the number of bytes used in java. If a character is present in UTF8 it is represented as 1 byte, if it a character of UTF16 not present in UTF8 it is represented by 2 bytes.Kaput
I see what you're saying that its consistent in Java, but i mean, it wont necessarily always match the database bytes. it would depend on the encoding in the database as well. In this specific case it match though.Pennipennie
I am sorry i ran this code and i am getting 9 in both case.Gratify
@Singh Perhaps your default charset is not UTF-8? If yes you have to replace with System.out.println("Rückruf i".getBytes("UTF-8").length); I added the solution for non UTF-8 charsetsKaput
V
2

If it has to be Java, you could convert the string to bytes and truncate the array.

        String s = "Rückruf ins Ausland";
        byte[] bytes = s.getBytes("UTF-8");
        byte[] bytes2 = new byte[10];
        System.arraycopy(bytes, 0, bytes2, 0, 10);
        String trim = new String(bytes2, "UTF-8");
Variorum answered 16/7, 2015 at 13:57 Comment(5)
This worked and the nice thing about it is that it doesn't have any loop. it is straight-forwardAmorphous
I'm pretty sure it will truncate multi-byte characters if they are at the boundary of the trim. I based my solution on this, with the exception that I loop though and check that a new character won't cross the boundary.Pennipennie
yes you are right, I tried an example like 123456789ü and the trimmed string was 123456789? with a ? at the endAmorphous
@CarlosBribiescas You are right, didn't think about that!Variorum
I based my solution on this, and for preventing truncation of the last character, it checks to see if the last character of result is the same as in the input string, and if it's not the same removes it. https://mcmap.net/q/1447336/-how-to-do-substring-for-utf8-string-in-javaPhillida
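One way to keep the byte-array approach while avoiding the "?" at the end is to move the cut position back until it no longer falls inside a multi-byte sequence. The following is a sketch under the assumptions that the target encoding is UTF-8 and that `maxBytes` is not negative; `ContinuationTrim` and `trimToBytes` are names chosen for this example only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ContinuationTrim {

    static String trimToBytes(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return s; // already fits
        }
        int end = maxBytes;
        // utf8[end] is the first byte that would be dropped. If it is a
        // continuation byte (bit pattern 10xxxxxx), the cut falls inside a
        // multi-byte sequence, so step back to the lead byte of that sequence.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(trimToBytes("123456789\u00FC", 10));      // 123456789
        System.out.println(trimToBytes("R\u00FCckruf ins Ausland", 10)); // Rückruf i
    }
}
```

The loop steps back at most three bytes, because a UTF-8 sequence is never longer than four bytes.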
U
2

The following walks through the entire string, rather laboriously, one full Unicode code point at a time, so it also handles char pairs (surrogate pairs).

public String trim(String s, int length) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= length) {
        return s;
    }
    int totalByteCount = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        int n = Character.charCount(cp);
        int byteCount = s.substring(i, i + n)
                .getBytes(StandardCharsets.UTF_8).length;
        if (totalByteCount + byteCount > length) {
            break;
        }
        totalByteCount += byteCount;
        i += n;
    }
    return new String(bytes, 0, totalByteCount, StandardCharsets.UTF_8);
}

It can still be optimized a bit.
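One possible optimization, sketched here as an alternative and not as the answer author's own method, is to let a `CharsetEncoder` do the counting: when the output buffer is limited to the maximum byte count, the encoder stops before the first character that no longer fits, and the position of the input buffer tells how many chars were consumed. The names `EncoderTrim` and `trimToBytes` are invented for this example, and the sketch assumes well-formed input (no unpaired surrogates) and a non-negative `maxBytes`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncoderTrim {

    static String trimToBytes(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // room for at most maxBytes bytes
        CharBuffer in = CharBuffer.wrap(s);

        // Encoding stops when the output is full; a character (including a
        // surrogate pair) is either encoded completely or not consumed at all.
        encoder.encode(in, out, true);

        // in.position() is the number of chars that fit into maxBytes bytes.
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(trimToBytes("R\u00FCckruf ins Ausland", 10)); // Rückruf i
    }
}
```

This avoids creating a substring per code point, and the same idea works for other charsets by swapping the encoder.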

Undulatory answered 16/7, 2015 at 14:9 Comment(0)
P
2

Here is a fast solution that does not use any loops.

    /**
     * This function trims the text by requested max byte size
     *
     * @param text   text string
     * @param length maximum byte size
     * @return trimmed text
     */
    public static String trim(String text, int length) {
        byte[] inputBytes = text.getBytes(StandardCharsets.UTF_8);
        if (inputBytes.length <= length) {
            // already fits; also avoids copying past the end of inputBytes
            return text;
        }
        byte[] outputBytes = new byte[length];

        System.arraycopy(inputBytes, 0, outputBytes, 0, length);
        String result = new String(outputBytes, StandardCharsets.UTF_8);

        // check if last character is truncated
        int lastIndex = result.length() - 1;

        if (lastIndex >= 0 && result.charAt(lastIndex) != text.charAt(lastIndex)) {
            // last character is truncated so remove the last character
            return result.substring(0, lastIndex);
        }

        return result;
    }
Phillida answered 23/10, 2022 at 12:17 Comment(2)
trim("🎉", 4) becomes .Cenesthesia
@Cenesthesia Yes, you're right. I edited my answer to cover 1 character Strings.Phillida
C
2

Easy and fast solution

For UTF-8, the length of each encoded character is determined by its first byte. UTF-8 - Wikipedia

public static String substring(String text, int bytes) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

    int length = 0;
    for (int i = 0; i < utf8.length; ) {
        if ((utf8[i] & 0b1111_1000) == 0b1111_0000) {
            i += 4;
        } else if ((utf8[i] & 0b1111_0000) == 0b1110_0000) {
            i += 3;
        } else if ((utf8[i] & 0b1110_0000) == 0b1100_0000) {
            i += 2;
        } else {
            i += 1;
        }

        if (bytes < i) {
            break;
        }
        length = i;
    }

    return new String(Arrays.copyOfRange(utf8, 0, length), StandardCharsets.UTF_8);
}

Alternatively, the UTF-8 byte count can be derived from the UTF-16 chars. (Be careful when handling surrogate pairs.)

public static String substring(String text, int bytes) {
    int endIndex = 0;
    int utf8 = 0;
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (Character.isHighSurrogate(c)) {
            continue;
        } else if (Character.isLowSurrogate(c)) {
            utf8 += 4;
        } else if (c <= 0x007F) {
            utf8 += 1;
        } else if (c <= 0x07FF) {
            utf8 += 2;
        } else {
            utf8 += 3;
        }

        if (bytes < utf8) {
            break;
        }
        endIndex = i + 1;
    }

    return text.substring(0, endIndex);
}

For example

Example of cutting within 10 bytes.

Input : "Rückruf ins Ausland"
Output: "Rückruf i"  (10 bytes)

Input : "Rückruf"
Output: "Rückruf" (8 bytes)

Input : "123456789ü"
Output: "123456789" (9 bytes)

Input : "Rüüüüüü"
Output: "Rüüüü"  (9 bytes)

Input : "✅✅✅✅✅"
Output: "✅✅✅"  (9 bytes)

Input : "🎉🎉🎉🎉🎉"
Output: "🎉🎉"  (8 bytes)

Note: ü is 2 bytes, ✅ is 3 bytes, 🎉 is 4 bytes

Cenesthesia answered 3/12, 2023 at 11:49 Comment(0)
D
1

I think the best bet in this case would be to take the substring at the database level, using the Oracle SUBSTR function directly in the SQL query.

For example :

INSERT INTO ttable (colname) VALUES (SUBSTR( ?, 1, 10 ))

Where the question mark stands for the SQL parameter sent through JDBC.

Disestablish answered 16/7, 2015 at 13:39 Comment(1)
If you have a 2 byte character starting right before the truncation, wont this truncate in the middle of a 2 byte character then store the data incorrectly? Or what happens in that case?Pennipennie
P
1

You need to have the encoding in the database match the encoding for java strings. Alternatively, you can convert the string using something like this and get the length that matches the encoding in the database. This will give you an accurate byte count. Otherwise, you're still just hoping that the encodings match.

    String string = "Rückruf ins Ausland";
    int maxBytes = 10; // column size in bytes

    // Sum the UTF-8 size of each char until the next one would exceed maxBytes.
    int curByteCount = 0;
    String nextChar;
    for (int index = 0; curByteCount +
         (nextChar = string.substring(index, index + 1)).getBytes("UTF-8").length <= maxBytes; index++) {
        curByteCount += nextChar.getBytes("UTF-8").length;
    }
    // Copy only the bytes that fit, so no trailing zero bytes end up in the result.
    byte[] subStringBytes = new byte[curByteCount];
    System.arraycopy(string.getBytes("UTF-8"), 0, subStringBytes, 0, curByteCount);
    String trimmed = new String(subStringBytes, "UTF-8");

This should do it. It also shouldn't truncate a multi-byte character in the process. The assumption here is that the database uses UTF-8 encoding. Another assumption is that the string actually needs to be trimmed.

Pennipennie answered 16/7, 2015 at 13:42 Comment(0)
G
-1

All ASCII characters have code points below 128. You can use the code below.

public class Test {
    public static void main(String[] args) {
        String s= "Rückruf ins Ausland";
        int length =10;
        for(int i=0;i<s.length();i++){
            if(!(((int)s.charAt(i))<128)){
                length--;                   
            }
        }
        System.out.println(s.substring(0,length));
    }
}

You can copy and paste it and check whether it fulfills your need or breaks anywhere.

Gratify answered 16/7, 2015 at 13:49 Comment(4)
Shouldn't length=9; be length--; and without the break? What if you have two "üü" in the String?Pulliam
Yeah right.. My bad.. Let me edit it and i have breaked the loop as well.Gratify
This solution works only for strings of 10 characters with at most one "two bytes" char. Any other strings returns always 9 or 10 or throw an indexOutOfBoundsExceptionKaput
Yeah i changed the solution as per the above comment. Can you check it now? Now it's working dynamically for any solution.Gratify
