How do I convert between ISO-8859-1 and UTF-8 in Java?
Asked Answered
S

8

85

Does anyone know how to convert a string from ISO-8859-1 to UTF-8 and back in Java?

I'm getting a string from the web and saving it in the RMS (J2ME), but I want to preserve the special chars and get the string from the RMS but with the ISO-8859-1 encoding. How do I do this?

Suberin answered 16/3, 2009 at 21:9 Comment(1)
possible duplicate of Encoding conversion in javaJanayjanaya
H
120

In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found.

To transcode text:

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.

Hufnagel answered 16/3, 2009 at 22:21 Comment(3)
For more information on character encoding and why it rightfully doesn't make much sense to go from UTF-8 to ISO-8859 (or ASCII or ANSI for that matter), see this explanation: joelonsoftware.com/articles/Unicode.htmlCapreolate
Here's a good summary from said link: There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters [or special chars] in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.Capreolate
It might be worth mentioning that Windows-1252 (Windows Latin 1) extends ISO-8859-1 (official Latin 1) by filling in some of the "Unicode control" characters 0x80 - 0xbf. Even browsers on Mac and Linux respect that. So at some spots use Windows-1252 instead.Irisation
E
20

Which worked for me: ("üzüm bağları" is the correct written in Turkish)

Convert ISO-8859-1 to UTF-8:

String encodedWithISO88591 = "üzüm baÄları";
String decodedToUTF8 = new String(encodedWithISO88591.getBytes("ISO-8859-1"), "UTF-8");
//Result, decodedToUTF8 --> "üzüm bağları"

Convert UTF-8 to ISO-8859-1

String encodedWithUTF8 = "üzüm bağları";
String decodedToISO88591 = new String(encodedWithUTF8.getBytes("UTF-8"), "ISO-8859-1");
//Result, decodedToISO88591 --> "üzüm baÄları"
Enchiridion answered 12/8, 2016 at 8:45 Comment(4)
What would happen if you write the following code: String a=new String(encodedWithUTF8.getBytes("ISO88591"), "ISO-8859-1") and String b=new String(encodedWithUTF8.getBytes("ISO88591"), "UTF-8")? If the string is in one encoding and we get bytes using the other, what's going on under the hood?Iselaisenberg
You can try them and see the results on your IDE, and if you follow this URL docs.oracle.com/javase/7/docs/api/java/lang/… you will see the method definition. I don't know the exact detail of the process.Enchiridion
If someone needs this - I think the above commands would do the following: a would take UTF-8's bytes, convert them into ISO bytes and then use a table bytes->chars of ISO encoding to print the string out. In case of string b it would use a table bytes->chars of UTF-8 therefore mapping essencially ISO bytes according to UTF rules. a will be printed out OK even though it's ISO, because Java doesn't mess up it's inner storing of bytes. b may be damaged because some of the ISO's chars are going to be printed out as if they belonged to UTF encoding.Iselaisenberg
Is there any third party tool that can convert all the files in a repository to UTF-8?Alonzoaloof
P
7

Here is an easy way with String output (I created a method to do this):

public static String (String input){
    String output = "";
    try {
        /* From ISO-8859-1 to UTF-8 */
        output = new String(input.getBytes("ISO-8859-1"), "UTF-8");
        /* From UTF-8 to ISO-8859-1 */
        output = new String(input.getBytes("UTF-8"), "ISO-8859-1");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return output;
}
// Example
input = "Música";
output = "Música";
Pantywaist answered 13/6, 2016 at 17:24 Comment(0)
A
6

If you have a String, you can do that:

String s = "test";
try {
    s.getBytes("UTF-8");
} catch(UnsupportedEncodingException uee) {
    uee.printStackTrace();
}

If you have a 'broken' String, you did something wrong, converting a String to a String in another encoding is defenetely not the way to go! You can convert a String to a byte[] and vice-versa (given an encoding). In Java Strings are AFAIK encoded with UTF-16 but that's an implementation detail.

Say you have a InputStream, you can read in a byte[] and then convert that to a String using

byte[] bs = ...;
String s;
try {
    s = new String(bs, encoding);
} catch(UnsupportedEncodingException uee) {
    uee.printStackTrace();
}

or even better (thanks to erickson) use InputStreamReader like that:

InputStreamReader isr;
try {
     isr = new InputStreamReader(inputStream, encoding);
} catch(UnsupportedEncodingException uee) {
    uee.printStackTrace();
}
Amidase answered 16/3, 2009 at 21:30 Comment(1)
If you have an InputStream, you should wrap it with an InputStreamReader.Hufnagel
C
2

The easiest way to convert an ISO-8859-1 string to UTF-8 string.

private static String convertIsoToUTF8(String example) throws UnsupportedEncodingException {
    return new String(example.getBytes("ISO-8859-1"), "utf-8");
}

If we want to convert an UTF-8 string to ISO-8859-1 string.

private static String convertUTF8ToISO(String example) throws UnsupportedEncodingException {
    return new String(example.getBytes("utf-8"), "ISO-8859-1");
}

Moreover, a method that converts an ISO-8859-1 string to UTF-8 string without using the constructor of class String.

public static String convertISO_to_UTF8_personal(String strISO_8859_1) {
    String res = "";
    int i = 0;
    for (i = 0; i < strISO_8859_1.length() - 1; i++) {
        char ch = strISO_8859_1.charAt(i);
        char chNext = strISO_8859_1.charAt(i + 1);
        if (ch <= 127) {
            res += ch;
        } else if (ch == 194 && chNext >= 128 && chNext <= 191) {
            res += chNext;
        } else if(ch == 195 && chNext >= 128 && chNext <= 191){
            int resNum = chNext + 64;
            res += (char) resNum;
        } else if(ch == 194){
            res += (char) 173;
        } else if(ch == 195){
            res += (char) 224;
        }
    }
    char ch = strISO_8859_1.charAt(i);
    if (ch <= 127 ){
        res += ch;
    }
    return res;
}

}

That method is based on enconding utf-8 to iso-8859-1 of this website. Encoding utf-8 to iso-8859-1

Caine answered 9/11, 2020 at 23:9 Comment(0)
C
1

Apache Commons IO Charsets class can come in handy:

String utf8String = new String(org.apache.commons.io.Charsets.ISO_8859_1.encode(latinString).array())
Cannonade answered 6/4, 2017 at 13:3 Comment(0)
F
1

Here is a function to convert UNICODE (ISO_8859_1) to UTF-8

public static String String_ISO_8859_1To_UTF_8(String strISO_8859_1) {
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < strISO_8859_1.length(); i++) {
  final char ch = strISO_8859_1.charAt(i);
  if (ch <= 127) 
  {
      stringBuilder.append(ch);
  }
  else 
  {
      stringBuilder.append(String.format("%02x", (int)ch));
  }
}
String s = stringBuilder.toString();
int len = s.length();
byte[] data = new byte[len / 2];
for (int i = 0; i < len; i += 2) {
    data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                         + Character.digit(s.charAt(i+1), 16));
}
String strUTF_8 =new String(data, StandardCharsets.UTF_8);
return strUTF_8;
}

TEST

String strA_ISO_8859_1_i = new String("الغلاف".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

System.out.println("ISO_8859_1 strA est = "+ strA_ISO_8859_1_i + "\n String_ISO_8859_1To_UTF_8 = " + String_ISO_8859_1To_UTF_8(strA_ISO_8859_1_i));

RESULT

ISO_8859_1 strA est = اÙغÙا٠String_ISO_8859_1To_UTF_8 = الغلاف

Faxan answered 30/10, 2018 at 14:52 Comment(0)
M
1

Regex can also be good and be used effectively (Replaces all UTF-8 characters not covered in ISO-8859-1 with space):

String input = "€Tes¶ti©ng [§] al€l o€f i¶t _ - À ÆÑ with some 9umbers as"
            + " w2921**#$%!@# well Ü, or ü, is a chaŒracte⚽";
String output = input.replaceAll("[^\\u0020-\\u007e\\u00a0-\\u00ff]", " ");
System.out.println("Input = " + input);
System.out.println("Output = " + output);
Mihe answered 21/11, 2018 at 17:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.