Converting String from One Charset to Another

I am working on converting a string from one charset to another. I have read many examples and finally found the code below, which looks fine to me. As a newcomer to charset encoding, I want to know whether it is the right way to do it.

public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
} 

To convert String from ASCII to EBCDIC, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));

And to convert from EBCDIC to ASCII, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
Bowlds answered 16/4, 2015 at 7:22 Comment(5)
Did you run your code? Did it work as expected? - Lydalyddite
Please edit your question with this information. - Lydalyddite
What I need is to convert EBCDIC (HP) to ASCII, and what I am getting is not what I expected. - Bowlds
That's not what I need. - Bowlds
This question is important because it asks for validation of a widely shared algorithm. However, to prevent people from thinking it is correct, please edit the question to make it clear how wrong it is, and consider accepting @Kayaman's answer. - Inhibitor

The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies two conditions:

  1. Your input data is bytes in one encoding
  2. Your output data needs to be bytes in another encoding

In that case, it's straightforward:

byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
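
As a quick sanity check, here is a minimal sketch that applies the method to a single accented character and inspects the result as bytes. The class name TranscodeDemo, the hex helper, and the sample data are assumptions made for illustration; only transcodeField comes from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Same method as in the question: decode the bytes, then re-encode them.
    public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
        return new String(source, from).getBytes(to);
    }

    // Hypothetical helper: render bytes as hex so the change is visible.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "é" is the two bytes C3 A9 in UTF-8 and the single byte E9 in ISO-8859-1.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};
        byte[] latin1 = transcodeField(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(hex(utf8) + " -> " + hex(latin1)); // C3 A9 -> E9
    }
}
```

Note that the result is inspected as bytes. Wrapping it in new String(...) without a charset, as the println calls in the question do, would decode the transcoded bytes with the platform default encoding rather than the target one.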

If the input data contains characters that can't be represented in the output encoding (for example, converting UTF-8 text with non-ASCII characters to US-ASCII), those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
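
If silent replacement is not acceptable, one option is to encode through a CharsetEncoder that reports unmappable characters instead of substituting them. The following is only a sketch: the class StrictEncode and the helper encodeStrict are hypothetical names, not part of the question's code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncode {
    // Hypothetical helper: encode text, failing instead of substituting '?'.
    public static byte[] encodeStrict(String text, Charset to) throws CharacterCodingException {
        ByteBuffer buf = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String text = "風水";
        // Lenient path: each unmappable character silently becomes '?'.
        byte[] lenient = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(lenient, StandardCharsets.US_ASCII)); // ??
        // Strict path: the encoder throws, so the data loss cannot go unnoticed.
        try {
            encodeStrict(text, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode as US-ASCII: " + e);
        }
    }
}
```

Whether to fail or to substitute depends on the use case; the point is only that the lossy behaviour of getBytes is a default, not a necessity.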

However, a lot of people ask "How do I convert a String from one encoding to another?", to which a lot of people answer with the following snippet:

String s = new String(source.getBytes(inputEncoding), outputEncoding);

This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, unmappable characters are converted to ?). The String constructor with the second parameter creates a new String from a byte array, assuming the bytes are in the specified encoding. But since you just used source.getBytes(inputEncoding) to produce those bytes, they're encoded in inputEncoding, not outputEncoding, so the constructor decodes them wrongly (unless the two encodings happen to use the same byte values, which is common for "normal" characters like abcd but not for accented characters like éêäöñ).
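
The parenthetical above can be checked directly. The sketch below compares the bytes two encodings produce for plain and accented text; the class SameBytesDemo and its helpers are hypothetical names, and UTF-8 and ISO-8859-1 are picked only as an example pair.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameBytesDemo {
    // Hypothetical helper: true if the text encodes to identical bytes
    // in UTF-8 and ISO-8859-1.
    public static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    // The broken "conversion" from the snippet above, with fixed encodings.
    public static String brokenConvert(String source) {
        return new String(source.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("abcd"));   // true
        System.out.println(sameBytes("éêäöñ"));  // false
        System.out.println(brokenConvert("abcd").equals("abcd"));    // true: it appears to "work"
        System.out.println(brokenConvert("éêäöñ").equals("éêäöñ"));  // false: the text is garbled
    }
}
```

This is why the broken snippet survives casual testing: with ASCII-only test data it returns its input unchanged.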

So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing an encoding that covers all of Unicode, such as UTF-8 or UTF-16, is great: your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only the characters supported by that encoding, or it will result in corrupted bytes.

Now finally some examples of good and bad usage.

String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8");  // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)

String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8");  // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"

The last example demonstrates that even though both encodings support the Nordic characters, they use different bytes to represent them, and using the wrong encoding when decoding results in mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.

Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
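
One way to follow this advice is to pass the StandardCharsets constants (available since Java 7) to both calls, which also avoids the checked UnsupportedEncodingException declared by the overloads that take an encoding name. A minimal sketch, where the class ExplicitCharset and the helper roundTrip are hypothetical names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: encode and decode with the same, explicitly named charset.
    public static String roundTrip(String text, Charset cs) {
        byte[] bytes = text.getBytes(cs);   // never text.getBytes()
        return new String(bytes, cs);       // never new String(bytes)
    }

    public static void main(String[] args) {
        String nordic = "Här är några merkkejä";
        System.out.println(roundTrip(nordic, StandardCharsets.UTF_8).equals(nordic));      // true
        System.out.println(roundTrip(nordic, StandardCharsets.ISO_8859_1).equals(nordic)); // true
        System.out.println(roundTrip(nordic, StandardCharsets.US_ASCII).equals(nordic));   // false
        // What the no-argument overloads would have used; it varies by JVM and platform.
        System.out.println(Charset.defaultCharset());
    }
}
```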

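As a last point, a character set and an encoding aren't the same thing, but they're very much related: a character set is a collection of characters, while an encoding defines how those characters are mapped to bytes. Java's Charset class represents the latter.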

¹ Technically, a String is stored internally in the JVM as UTF-16 up to Java 8, and as either Latin-1 or UTF-16 from Java 9 onwards (compact strings), but the developer doesn't need to care about that.


NOTE

It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.

// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8

If no characters were corrupted in input, the string would now be "fixed". However, the proper approach is to use the correct encoding when reading the input, not to fix it afterwards, especially if there's a chance of the data becoming corrupted along the way.
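
As a sketch of that proper approach, the reader below is given the charset at the point where bytes become characters, so no repair is needed later. The class ReadWithCharset and the helper readLine are hypothetical, and an in-memory stream stands in for the network or file source.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Hypothetical helper: read the first line of a stream, decoding with the given charset.
    public static String readLine(InputStream in, Charset cs) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "Här är några merkkejä";
        // Stand-in for bytes arriving from a network or file, encoded as UTF-8.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = readLine(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = readLine(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true: decoded with the encoding it was written in
        System.out.println(wrong.equals(original)); // false: the misconfigured case from the note
    }
}
```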

Bigamous answered 3/9, 2016 at 16:6 Comment(5)
Pedantically, a java.lang.String is only UTF-16, so you can only convert a String from UTF-16 to UTF-16. Other conversions are either to or from a byte array, as you point out. - Inhibitor
The fact that a String is a counted sequence of UTF-16 code units is extremely important in Java when indexing and iterating (and performing char arithmetic!). That's why there are many methods that deal with Unicode code points instead. - Inhibitor
You're right, I didn't even get into code points :) This issue is extremely complex and I wanted to provide a decent answer so I can link this to all the "string conversion" questions. If you have any suggestions for improvement, do tell. I'll try and improve this answer and shrink it down a bit. - Bigamous
My only critique is the one point about UTF-16. I think it is a good answer to an important question. The question is important because it brings up something that is so wrong. - Inhibitor
"It's possible to have a corrupted String and be able to uncorrupt it by..." but this approach is so much less work. How could you corrupt the string using ISO-8859-1 to read it? - Crepuscular
