Encode String to UTF-8
B

12

219

I have a String with a "ñ" character and I have some problems with it. I need to encode this String to UTF-8 encoding. I have tried it by this way, but it doesn't work:

byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");

How do I encode that string to utf-8?

Babbler answered 20/4, 2011 at 11:55 Comment(9)
It's unclear what exactly you're trying to do. Does myString correctly contain the ñ character and you have problems converting it to a byte array (in that case see answers from Peter and Amir), or is myString corrupted and you're trying to fix it (in that case, see answers from Joachim and me)?Heteronomous
I need to send myString to a server with utf-8 encoding and I need to convert the "ñ" character to utf-8 encoding.Babbler
Well, if that server expects UTF-8 then what you need to send it are bytes, not a String. So as per Peter's answer, specify the encoding in the first line and drop the second line.Heteronomous
@Michael: I agree that it isn’t clear what the real intent is here. There seem to be a lot of questions where people are trying to do explicit conversions between Strings and bytes rather than letting the {In,Out}putStream{Read,Writ}ers do it for them. I wonder why?Illative
@tchrist: my guess is that those questions are asked by people whose previous experience is with languages like C or PHP where a string is basically the same thing as a byte array and you have to track its encoding separately (and converting a string from one encoding to another one has meaning).Heteronomous
@Michael: Thanks, I suppose that makes sense. But it also makes it harder than it needs to be, doesn’t it? I am not very fond of languages that work that way, and so try to avoid working with them. I think Java’s model of Strings of characters instead of bytes makes things a whole lot easier. Perl and Python also share the “everything is Unicode strings” model. Yes, in all three you can still get at bytes if you work at it, but in practice it seems rare that you truly need to: that’s quite low-level. Plus it feels kinda like brushing a cat the wrong direction, if you know what I mean. :)Illative
@tchrist: I completely agree that a strong string abstraction is a very good thing. But C is from a time long before Unicode existed, when there was no single encoding that could represent all characters, and when any kind of abstraction over pure bytes would have been an intolerable performance penalty. Java programmers are lucky that it adopted Unicode relatively well from the beginning. Perl and Python are older and had Unicode support retrofitted, which makes it much less clean (explicit str/unicode duality in Python, nasty implicit UTF-8 flag in Perl).Heteronomous
@Michael: The Python duality is pretty annoying; I am always forgetting /u in Python; same problem with PHP. With Perl 5.14, now in RC1 testing, you can finally get all Unicode strings. Perl regexes are still a lot more Unicode-friendly than Java’s, but I’ve been working with the JDK7 people to fix that.Illative
possible duplicate of How to convert Strings to and from UTF8 byte arrays in JavaElisha
P
161

String objects in Java use the UTF-16 encoding that can't be modified*.

The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).

* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
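To make the String-versus-byte[] distinction concrete, here is a minimal sketch that prints the bytes produced for "ñ" under three charsets. The class and method names (Utf8BytesSketch, toUtf8, hex) are made up for illustration and are not from the question.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesSketch {
    // Encodes the text as UTF-8. The result is a byte[], not a String.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated hex pairs so they can be inspected.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ñ" written as an escape so the source file's own encoding does not matter.
        String myString = "\u00F1";
        System.out.println(hex(toUtf8(myString)));                               // C3 B1
        System.out.println(hex(myString.getBytes(StandardCharsets.ISO_8859_1))); // F1
        System.out.println(hex(myString.getBytes(StandardCharsets.UTF_16BE)));   // 00 F1
    }
}
```

The same String yields different byte sequences depending on the charset you ask for, which is why the encoding only becomes a question at the point where bytes are produced.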

Pyuria answered 20/4, 2011 at 11:58 Comment(5)
Technically speaking, byte[] doesn't have any encoding. Byte array PLUS encoding can give you string though.Oklahoma
@Peter: true. But attaching an encoding to it only makes sense for byte[], it doesn't make sense for String (unless the encoding is UTF-16, in which case it makes sense but is still unnecessary information).Pyuria
String objects in Java use the UTF-16 encoding that can't be modified. Do you have an official source for this quote?Trinhtrini
@AhmadHajjar docs.oracle.com/javase/10/docs/api/java/lang/… : "The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes."Sunder
Thanks to you and rzymek for your helpful answers! You both saved me time! You with the theoretical part and rzymek with the practical part.Despiteful
H
190

How about using

ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString);
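If you go the ByteBuffer route but need a plain byte[], one way is to copy out only the bytes between the buffer's position and limit instead of taking its backing array. This is a rough sketch; ByteBufferSketch and encodeViaBuffer are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteBufferSketch {
    // Copies only the encoded bytes out of the buffer, ignoring any spare capacity.
    static byte[] encodeViaBuffer(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] exact = new byte[buffer.remaining()];
        buffer.get(exact);
        return exact;
    }

    public static void main(String[] args) {
        String myString = "ma\u00F1ana"; // "mañana"
        byte[] viaBuffer = encodeViaBuffer(myString);
        byte[] direct = myString.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(viaBuffer, direct)); // true
    }
}
```

When a byte[] is all you want, calling getBytes(StandardCharsets.UTF_8) directly is shorter and gives the same bytes.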
Hyponitrite answered 20/4, 2011 at 11:57 Comment(7)
See my discussion with Peter. But if his assumption about the question is right, your solution would still not be ideal since it returns a ByteBuffer.Heteronomous
But how do I obtain an encoded String? It returns a ByteBufferBabbler
@Alex: it's not possible to have a UTF-8 encoded Java String. You want bytes, so either use the ByteBuffer directly (could even be the best solution if your goal is to send it via a network connection) or call array() on it to get a byte[]Heteronomous
Good one, short and to the point... Of course, it needs some additional steps: new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array())Dumortierite
Something else that may be helpful is to use Guava's Charsets.UTF_8 enum instead of a String that may throw an UnsupportedEncodingException. String -> bytes: myString.getBytes(Charsets.UTF_8), and bytes -> String: new String(myByteArray, Charsets.UTF_8).Toluate
Even better, use StandardCharsets.UTF_8. Available in Java 1.7+.Sharleensharlene
The array returned by array() will most likely be bigger than needed and padded, as it is the ByteBuffer's internal array. Better to use string.getBytes(StandardCharsets.UTF_8), which will return a new array with the correct size.Gillam
I
93

In Java7 you can use:

import static java.nio.charset.StandardCharsets.*;

byte[] ptext = myString.getBytes(ISO_8859_1); 
String value = new String(ptext, UTF_8); 

This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.

If you're using an older Java version you can declare the charset constants yourself:

import java.nio.charset.Charset;

public class StandardCharsets {
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");
    //....
}
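Whether the ISO-8859-1 then UTF-8 round trip helps or hurts seems to depend on what the String held to begin with. The sketch below (RoundTripSketch and its method names are made up for illustration) compares a correct "ñ" with a String that was previously mis-decoded, assuming the mis-decoding was UTF-8 bytes read as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {
    // The round trip from the snippet above: encode as ISO-8859-1, decode as UTF-8.
    static String latin1ThenUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Encode and decode with the same charset, which never changes the text.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "\u00F1"; // a correct "ñ"
        // Simulate an earlier bug: UTF-8 bytes decoded as ISO-8859-1 ("Ã±").
        String mojibake = new String(good.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println(utf8RoundTrip(good).equals(good));      // true
        System.out.println(latin1ThenUtf8(good).equals(good));     // false: a correct String gets damaged
        System.out.println(latin1ThenUtf8(mojibake).equals(good)); // true: undoes that specific earlier bug
    }
}
```

So the trick only reverses one particular decoding mistake made earlier. Applied to a String that is already correct, it corrupts every non-ASCII character.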
Ironsides answered 27/11, 2013 at 12:52 Comment(4)
This is the right answer. If someone wants to use a string datatype, he can use it in the right format. Rest of the answers are pointing to the byte formatted type.Gopherwood
Works in 6. Thanks.Swanee
Correct answer for me too. One thing though, when I used as above, German character changed to ?. So, I used this: byte[] ptext = myString.getBytes(UTF_8); String value = new String(ptext, UTF_8); This worked fine.Munson
The code sample doesn't make sense. If you first convert to ISO-8859-1, then that array of byte is not UTF-8, so the next line is totally incorrect. It will work for ASCII strings, of course, but then you could as well make a simple copy: String value = new String(myString);.Nevus
C
77

Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes(). The no-argument getBytes() uses the platform's so-called "default encoding", which may not be UTF-8.
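If the goal is to send the text to a server, another option is to let a Writer do the encoding for you. A minimal sketch, assuming the server simply reads raw UTF-8 bytes from a stream; SendUtf8Sketch and send are hypothetical names, and a ByteArrayOutputStream stands in for the real connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class SendUtf8Sketch {
    // Writes the text to the stream as UTF-8. The caller owns and closes the stream.
    static void send(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush(); // push buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a socket or HTTP connection output stream.
        ByteArrayOutputStream fakeServer = new ByteArrayOutputStream();
        send("\u00F1", fakeServer);
        System.out.println(fakeServer.size()); // 2
    }
}
```

Naming the charset explicitly keeps the output independent of the platform default.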

Condolent answered 20/4, 2011 at 11:57 Comment(4)
@Michael: he is clearly having trouble getting bytes from string. How is getBytes(encoding) missing the point? I think second line is there just to check if he can convert it back.Oklahoma
I interpret it as having a broken String and trying to "fix" it by converting to bytes and back (common misunderstanding). There's no actual indication that the second line is just checking the result.Heteronomous
@Michael, no there isn't, it's just my interpretation. Yours is simply different.Oklahoma
@Peter: you're right, we'd need clarification from Alex what he really means. Can't rescind the downvote though unless the answer is edited...Heteronomous
H
34

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.

So if you have an encoding problem, by the time you have a String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.
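As a sketch of what "fix the place where you create that String" can look like, the example below decodes the same two bytes with the right and the wrong charset. DecodeAtSourceSketch and readFirstLine are made-up names, and the byte array stands in for data arriving from a file or socket.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeAtSourceSketch {
    // Decodes the stream with the given charset and returns its first line.
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = {(byte) 0xC3, (byte) 0xB1}; // "ñ" encoded as UTF-8

        String right = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("\u00F1"));       // true
        System.out.println(wrong.equals("\u00C3\u00B1")); // true: two wrong characters ("Ã±")
    }
}
```

The bytes are identical in both calls. Only the charset handed to the reader decides whether the resulting String is correct.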

Heteronomous answered 20/4, 2011 at 11:58 Comment(6)
It's a common mistake to believe that strings are internally encoded as UTF-16. Usually they are, but if so, it is only an implementation-specific detail of the String class. Since the internal storage of the character data is not accessible through the public API, a specific String implementation may decide to use any other encoding.Toon
@jarnbjo: The API explicitly states "A String represents a string in the UTF-16 format". Using anything else as internal format would be highly inefficient, and all actual implementations I know do use UTF-16 internally. So unless you can cite one that doesn't, you're engaging in pretty absurd hairsplitting.Heteronomous
Is it absurd to distinguish between public access and internal representation of data structures?Toon
@jarnbjo: so can you give an example for a JVM that does not internally represent Strings as UTF-16?Heteronomous
The JVM (as far as it is relevant to the VM at all) uses UTF-8 for string encoding, e.g. in the class files. The implementation of java.lang.String is decoupled from the JVM and I could easily implement the class for you using any other encoding for the internal representation if that is really necessary for you to realize that your answer is incorrect. Using UTF-16 as the internal format is in most cases highly inefficient as well when it comes to memory consumption and I don't see why e.g. Java implementations for embedded hardware wouldn't optimize for memory instead of performance.Toon
@jarnbjo: And once more: as long as you cannot give a concrete example of a JVM whose standard API implementation does internally use something other than UTF-16 to implement Strings, my statement is correct. And no, the String class is not really decoupled from the JVM, due to things like intern() and the constant pool.Heteronomous
W
25

You can try this way.

byte ptext[] = myString.getBytes("ISO-8859-1"); 
String value = new String(ptext, "UTF-8"); 
War answered 20/4, 2011 at 12:24 Comment(2)
I was going crazy. Thank you, getting the bytes in "ISO-8859-1" first was the solution.Limousine
This is wrong. If your string includes Unicode characters, converting it to 8859-1 is going to throw an exception or worse give you an invalid string (maybe the string without those characters with code point 0x100 and over).Nevus
S
18

I ran into this problem a while ago and managed to solve it in the following way.

First I needed to import:

import java.nio.charset.Charset;

Then I had to declare constants for UTF-8 and ISO-8859-1:

private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");

Then I could use it in the following way:

String textwithaccent="Thís ís a text with accent";
String textwithletter="Ñandú";

String text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
String text2 = new String(textwithletter.getBytes(ISO), UTF_8);
Signore answered 8/4, 2018 at 22:16 Comment(0)
D
9
String value = new String(myString.getBytes("UTF-8"));

and, if you want to read from a text file encoded in "ISO-8859-1":

String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
    BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
    while ((line = br.readLine()) != null) {
        System.out.println(new String(line.getBytes("UTF-8")));
    }
} catch (IOException ex) {
    //...
}
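If what you actually need is a UTF-8 copy of an ISO-8859-1 file, a reader and a writer with explicit charsets can do the conversion between them. This is only a sketch: TranscodeFileSketch and latin1ToUtf8 are hypothetical names, and temp files are used in place of a real path.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TranscodeFileSketch {
    // Reads source as ISO-8859-1 and writes the same text to target as UTF-8.
    static void latin1ToUtf8(Path source, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("latin1", ".txt");
        Path target = Files.createTempFile("utf8", ".txt");
        Files.write(source, new byte[] {(byte) 0xF1}); // "ñ" in ISO-8859-1

        latin1ToUtf8(source, target);
        System.out.println(Files.readAllBytes(target).length); // 2

        Files.delete(source);
        Files.delete(target);
    }
}
```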
Dibucaine answered 19/2, 2015 at 19:34 Comment(0)
W
3

I used the code below to encode the special character by specifying the encoding.

String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");
Wyon answered 4/5, 2016 at 7:49 Comment(0)
G
2

A quick step-by-step guide on how to configure the NetBeans default encoding as UTF-8. As a result, NetBeans will create all new files in UTF-8 encoding.

NetBeans default encoding UTF-8 step-by-step guide

  • Go to etc folder in NetBeans installation directory

  • Edit netbeans.conf file

  • Find netbeans_default_options line

  • Add -J-Dfile.encoding=UTF-8 inside quotation marks inside that line

    (example: netbeans_default_options="-J-Dfile.encoding=UTF-8")

  • Restart NetBeans

You have now set the NetBeans default encoding to UTF-8.

Your netbeans_default_options may contain additional parameters inside the quotation marks. In such case, add -J-Dfile.encoding=UTF-8 at the end of the string. Separate it with space from other parameters.

Example:

netbeans_default_options="-J-client -J-Xss128m -J-Xms256m -J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"
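To see which default a given JVM ends up with, a small check like the one below may help. DefaultCharsetSketch and describe are made-up names, and the output depends entirely on your platform and JVM options.

```java
import java.nio.charset.Charset;

final class DefaultCharsetSketch {
    // Reports the file.encoding system property and the JVM's default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Code that passes a charset explicitly, for example getBytes(StandardCharsets.UTF_8), does not depend on this setting at all.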

here is link for Further Details

Galengalena answered 9/10, 2019 at 6:36 Comment(0)
C
0

This solved my problem

    String inputText = "some text with escaped chars";
    InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));
Cerate answered 9/12, 2014 at 7:48 Comment(0)
W
0

The correct solution is also:

String myUTF8String = new String(sourceISOString.getBytes(Charsets.ISO_8859_1), Charsets.UTF_8);
Whoosh answered 12/10, 2023 at 9:19 Comment(0)
