Encoding conversion in Java
Is there any free Java library I can use to convert a string from one encoding to another, something like iconv? I'm using Java version 1.3.

Thy answered 23/10, 2008 at 8:54 Comment(1)
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)

EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
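For example, a minimal sketch of that round trip, assuming the input bytes are ISO-8859-1 and the target is UTF-8 (the Charset overloads of the String constructor and getBytes arrived in Java 6; on 1.3 you would pass the charset name as a String instead):

import java.nio.charset.Charset;

// Hypothetical input: bytes known to be ISO-8859-1 encoded ("héllo").
byte[] latin1Bytes = { 'h', (byte) 0xE9, 'l', 'l', 'o' };

// Decode with the source charset, then re-encode with the target one.
Charset latin1 = Charset.forName("ISO-8859-1");
Charset utf8 = Charset.forName("UTF-8");
String text = new String(latin1Bytes, latin1);
byte[] utf8Bytes = text.getBytes(utf8); // 6 bytes: 'é' becomes two bytes in UTF-8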

See "URL Encoding (or: 'What are those "%20" codes in URLs?')".

Sweeting answered 23/10, 2008 at 8:57 Comment(3)
I prefer new String(byte[], encoding) and String.getBytes(encoding) in most cases, because they are simple one-liners as opposed to the more powerful but more complicated API of Charset (which, BTW, is only available in Java 1.4+).Taciturn
Yes, it's a shame that the Charset API is so complicated. The .NET System.Encoding class does this really well, IMO - and keeps the functionality out of String.Sweeting
Links fixed. See free-scripts.net/html_tutorial/html/topics/urlencoding.htmScrotum
CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).

See Charset. That doesn't mean UTF16 is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
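You can check that default at runtime; Charset.defaultCharset() exists since Java 5, and on older JVMs the file.encoding system property carries the same information:

System.out.println(Charset.defaultCharset());            // e.g. UTF-8
System.out.println(System.getProperty("file.encoding")); // typically the same value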

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

try {
    // Convert a string to ISO-LATIN-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

    // Convert the ISO-LATIN-1 bytes in the ByteBuffer back to a CharBuffer
    // and then to a string. The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown on malformed input or an unmappable character; don't swallow it.
    e.printStackTrace();
}
Scrotum answered 23/10, 2008 at 8:57 Comment(2)
Unicode is not an encoding! UTF-8, UTF-16 etc. are. See joelonsoftware.com/articles/Unicode.htmlUnreflective
@SealedSun: very true. I have fixed that "java native encoding" section in my answer.Scrotum
I would just like to add that if a String was originally decoded with the wrong encoding, it may be impossible to convert it to another encoding without errors. The question does not say the conversion here is from a wrong encoding to a correct one, but I personally stumbled onto this question precisely because of that situation, so it's a heads-up for others as well.

This answer to another question explains why the conversion does not always yield correct results: https://mcmap.net/q/20949/-quot-fix-quot-string-encoding-in-java
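A minimal sketch of that failure mode, assuming UTF-8 bytes were mistakenly decoded as US-ASCII (StandardCharsets is Java 7+, used here only for brevity):

import java.nio.charset.StandardCharsets;

byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8); // 6 bytes

// US-ASCII cannot represent 'é'; each of its two UTF-8 bytes is
// replaced with U+FFFD during decoding, giving "h��llo".
String garbled = new String(utf8, StandardCharsets.US_ASCII);

// The original bytes are gone for good: no re-encoding of 'garbled'
// can recover "héllo".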

Tervalent answered 2/9, 2015 at 10:31 Comment(0)
It is a whole lot easier if you think of Unicode as a character set (which it actually is: at its most basic, the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character, depending on the code point) or maybe UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).

Back in the mists of time Java used UCS-2 to encode the Unicode character set. That could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.
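A small sketch of the character-set versus encoding distinction, and of surrogate pairs (StandardCharsets is Java 7+, used only for brevity):

import java.nio.charset.StandardCharsets;

String s = "A€\uD834\uDD1E"; // 'A', the euro sign, and U+1D11E MUSICAL SYMBOL G CLEF

int chars = s.codePointCount(0, s.length());              // 3 characters
int units = s.length();                                   // 4 — the clef needs a surrogate pair
int utf8 = s.getBytes(StandardCharsets.UTF_8).length;     // 8 = 1 + 3 + 4 bytes
int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length; // 10 = 2 + 2 + 4 bytes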

A lot of people think Java should have used UTF-8 in the first place; Unicode soon grew far beyond 65,535 characters anyway...

Sumerlin answered 29/8, 2009 at 17:34 Comment(0)
