Convert International String to \u Codes in Java

D

12

54

How can I convert an international (e.g. Russian) String to \u escapes (Unicode code numbers),
e.g. \u041e\u041a for OK?

Disquisition answered 3/6, 2011 at 16:56 Comment(0)

M

6

If you need this to write a .properties file, you can just add the Strings to a Properties object and then save it to a file. It will take care of the conversion.
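As a minimal sketch of what this looks like in practice (the class and method names here are made up for illustration), the following stores one entry through `Properties.store(OutputStream, String)` and returns the text that was written. It assumes you only want to see the escaped output in memory rather than in a real file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper class, not part of any library.
final class PropertiesEscapeSketch {

    /** Stores a single key/value pair and returns the text that Properties wrote. */
    static String storeToString(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // The OutputStream variant writes ISO-8859-1 and escapes non-ASCII characters.
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The source literal spells two Cyrillic letters with compiler-level escapes.
        System.out.println(storeToString("greeting", "\u041e\u041a"));
    }
}
```

The output starts with a `#` comment line holding the current date, followed by the `greeting=...` line with the value escaped.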

Microvolt answered 3/6, 2011 at 17:0 Comment(7)

Well you need to make sure that you save the file in UTF-8 format (perhaps UTF-16 or UCS-2/4 will work) or you will have problems. – Anaphylaxis 3/6, 2011 at 17:18

@ArtB: No, Properties always interprets input files as ISO-8859-1 (the first Unicode page) and also saves in that encoding. This is why it needs the \uXXXX escapes and creates them on saving. Since Java 1.6, however, Properties can also read its input from a Reader object, so you could define your own UTF-8 based properties file format. – Microvolt 3/6, 2011 at 17:26

Oh... doesn't that cause problems with non-first page languages? – Anaphylaxis 3/6, 2011 at 17:50

Yes, it results in comparatively large files for languages that mostly use characters outside ISO-8859-1, because the \uXXXX encoding is less space-efficient than UTF-8 or UTF-16. It also makes it impossible to edit these files in any editor that is not aware of this special encoding. But at least it allows saving and loading all Unicode text to the extent that the Java VM supports it in general. – Microvolt 3/6, 2011 at 18:0

@Microvolt I am not sure that \u notation will support Unicode characters outside Unicode BMP. – Gans 3/6, 2011 at 23:8

That's why I wrote "to the extent that is supported by the Java VM in general". Actually it supports characters outside the BMP, since Java treats these characters as surrogate pairs and thus they can be encoded as a pair of \u escapes as well. But the level of support for surrogates varies a lot in Java, from mostly nonexistent to somewhat supported in XML parsers or some Swing components. Also, many of the basic String manipulation routines in java.lang seem to be surrogate-aware by now (except for regexp, as far as I know), but you can still cut a string in the middle of a pair if you like. – Microvolt 3/6, 2011 at 23:25

This seems like a really round-about solution. From the question, I assumed we were looking for some kind of method call String->String. – Gasify 5/8, 2016 at 13:26

I

62

There is a JDK tool, run from the command line as follows:

native2ascii -encoding utf8 src.txt output.txt

Example :

src.txt

بسم الله الرحمن الرحيم

output.txt

\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645

If you want to use it in your Java application, you can wrap this command line like this:

String pathSrc = "./tmp/src.txt";
String pathOut = "./tmp/output.txt";
String cmdLine = "native2ascii -encoding utf8 " + new File(pathSrc).getAbsolutePath() + " " + new File(pathOut).getAbsolutePath();
Process process = Runtime.getRuntime().exec(cmdLine);
process.waitFor(); // wait until native2ascii has finished writing output.txt
System.out.println("THE END");

Then read content of the new file.
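Since the comments below note that the tool was removed in Java 9, here is a rough pure-Java sketch of the same file-to-file conversion. It is not the original tool's implementation; the class name `Native2AsciiSketch` and the method `convert` are hypothetical, and it assumes the input is UTF-8 and that every non-ASCII UTF-16 code unit should be escaped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for the removed command-line tool.
final class Native2AsciiSketch {

    /** Reads src as UTF-8 and writes out as ASCII, escaping each non-ASCII UTF-16 unit. */
    static void convert(Path src, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(src, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.US_ASCII)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c > 0x7f) {
                    // Four lowercase hex digits, zero-padded.
                    writer.write(String.format("\\u%04x", c));
                } else {
                    writer.write(c);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Native2AsciiSketch src.txt output.txt
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Reading character by character keeps line breaks and ASCII text untouched, so the output differs from the input only where escapes were needed.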

Ible answered 24/9, 2013 at 10:11 Comment(4)

You can do it without starting a subprocess, see https://mcmap.net/q/339673/-how-to-parse-unicode-that-is-read-from-a-file-in-java-duplicate – Bookstore 8/12, 2014 at 13:47

This gist wraps the command line example above in a Bash script so it's easier to use. – Xuanxunit 31/7, 2017 at 17:58

This tool was removed in Java 9: #39400523 – Acarus 5/11, 2018 at 3:15

So what's the alternative now that native2ascii is gone? – Marchellemarcher 23/10, 2022 at 10:7

G

24

You could use escapeJavaStyleString from org.apache.commons.lang.StringEscapeUtils.

Gans answered 3/6, 2011 at 16:59 Comment(4)

It appears this method has been renamed escapeJava in the 3.x versions – Grievance 24/6, 2013 at 23:19

and doesn't escape to \uXXXX – Hair 19/12, 2013 at 20:49

You better not use it ;) See the answer at: https://mcmap.net/q/25861/-how-to-unescape-a-java-string-literal-in-java – Bookstore 8/12, 2014 at 13:45

This method also escapes other special symbols, eg. quote ("). This may be an unwanted behaviour. – Diaphanous 12/12, 2016 at 14:34

K

16

I also had this problem. I had some Portuguese text with some special characters, but these characters were already in Unicode escape format (e.g. \u00e3).

So I want to convert S\u00e3o to São.

I did it using the Apache Commons StringEscapeUtils, as @sorin-sbarnea said. It can be downloaded here.

Use the method unescapeJava, like this:

String text = "S\\u00e3o"; // escaped backslash, so the string really contains the escape sequence
text = StringEscapeUtils.unescapeJava(text);
System.out.println("text " + text);

(There is also the method escapeJava, but that one goes the other way: it puts the \u escapes into the string.)

If anyone knows a solution in pure Java, please tell us.
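One possible pure-Java approach, offered only as a minimal sketch: match each backslash-u sequence with a regular expression and replace it with the decoded character. The class name `UnicodeUnescapeSketch` is invented for this example. It assumes exactly four hex digits per escape and does not handle an escaped backslash in front of the `u`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class UnicodeUnescapeSketch {

    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded '$' or backslash.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape sequence inside the string.
        System.out.println(unescape("S\\u00e3o"));
    }
}
```

Because each escape decodes to one UTF-16 code unit, two adjacent escapes that form a surrogate pair end up as one supplementary character in the result.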

Kirov answered 14/5, 2012 at 16:8 Comment(1)

You're doing it the other way round, that's not what OP asked for. – Bookstore 8/12, 2014 at 13:49

B

16

Here's an improved version of ArtB's answer:

    StringBuilder b = new StringBuilder();

    for (char c : input.toCharArray()) {
        if (c >= 128)
            b.append("\\u").append(String.format("%04X", (int) c));
        else
            b.append(c);
    }

    return b.toString();

This version escapes all non-ASCII chars and works correctly for low Unicode code points like Ä.
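For completeness, here is the same loop wrapped in a self-contained class so it can be run directly; the names `NonAsciiEscaperSketch` and `escapeNonAscii` are made up for this sketch. The second call in `main` shows what happens to a character outside the BMP: each half of the surrogate pair gets its own escape.

```java
// Hypothetical wrapper around the loop shown above.
final class NonAsciiEscaperSketch {

    static String escapeNonAscii(String input) {
        StringBuilder b = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c >= 128) {
                // Zero-padded to four uppercase hex digits.
                b.append("\\u").append(String.format("%04X", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("\u041e\u041a"));  // two Cyrillic letters
        System.out.println(escapeNonAscii("\uD83D\uDE4C"));  // one emoji, two UTF-16 units
    }
}
```

The first line printed is `\u041E\u041A` and the second is `\uD83D\uDE4C`. Whether a pair of escapes is acceptable for a supplementary character depends on the consumer; Java source and .properties files read it back as one character, as far as I know.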

Bookstore answered 8/12, 2014 at 13:42 Comment(2)

does it work for multibyte characters, e.g. when 4-6-8 bytes (2, 3, 4 java char values) in a row represent only one symbol? – Sickener 12/6, 2017 at 12:50

It doesn't, because it's iterating using a single char. – Bookstore 21/11, 2018 at 10:5

A

12

There are three parts to the answer:

Get the Unicode for each character
Determine if it is in the Cyrillic Page
Convert to Hexadecimal.

To get each character you can iterate through the String using the charAt() or toCharArray() methods.

for( char c : s.toCharArray() )

The value of the char is the Unicode value.

The Cyrillic Unicode characters are any character in the following ranges:

Cyrillic:            U+0400–U+04FF ( 1024 -  1279)
Cyrillic Supplement: U+0500–U+052F ( 1280 -  1327)
Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)

If it is in this range it is Cyrillic. Just perform an if check. If it is in the range use Integer.toHexString() and prepend the "\\u". Put together it should look something like this:

final int[][] ranges = new int[][]{ 
        {  1024,  1279 }, 
        {  1280,  1327 }, 
        { 11744, 11775 }, 
        { 42560, 42655 },
    };
StringBuilder b = new StringBuilder();

for( char c : s.toCharArray() ){
    int[] insideRange = null;
    for( int[] range : ranges ){
        if( range[0] <= c && c <= range[1] ){
            insideRange = range;
            break;
        }
    }

    if( insideRange != null ){
        b.append( "\\u" ).append( Integer.toHexString(c) );
    }else{
        b.append( c );
    }
}

return b.toString();

Edit: probably should make the check c < 128 and reverse the if and the else bodies; you probably should escape everything that isn't ASCII. I was probably too literal in my reading of your question.
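As an alternative to hard-coding the numeric ranges, the same Cyrillic-only check can be sketched with `Character.UnicodeBlock`. This is a swapped-in technique, not the code above; the class name `CyrillicEscaperSketch` is hypothetical, and the sketch pads the hex value to four digits with `String.format`.

```java
// Hypothetical variant that asks the JDK which Unicode block a char belongs to.
final class CyrillicEscaperSketch {

    static boolean isCyrillic(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CYRILLIC
            || block == Character.UnicodeBlock.CYRILLIC_SUPPLEMENTARY
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_A
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_B;
    }

    static String escapeCyrillic(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (isCyrillic(c)) {
                // Zero-padded to four lowercase hex digits.
                b.append(String.format("\\u%04x", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Latin "OK", a space, then the two Cyrillic letters from the question.
        System.out.println(escapeCyrillic("OK \u041e\u041a"));
    }
}
```

Running `main` prints the Latin letters unchanged followed by `\u041e\u041a`.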

Anaphylaxis answered 3/6, 2011 at 17:14 Comment(6)

This is the correct answer in my context. However, I believe "getCharArray()" should be "toCharArray". – Worldbeater 10/2, 2014 at 10:26

@JenS. Thank you, indeed, the method is in fact toCharArray(). – Anaphylaxis 10/2, 2014 at 19:53

This isn't correct for all Unicode characters! e.g. for German Ä it returns \uC4, not \u00c4. – Bookstore 8/12, 2014 at 13:13

@m01 I believe the original form of the question was specifically about Russian characters. – Anaphylaxis 8/12, 2014 at 15:40

Russian was given just as an example. Your example is ok though; the range checks in the if guard against this case. See also my answer for a generic approach. – Bookstore 8/12, 2014 at 16:13

"The value of the char is the Unicode value." Yes but more specifically, it is the UTF-16 code-unit value, with one or two UTF-16 code-units per Unicode codepoint. UTF-16 code-units are what you need to construct Java source code character escapes (whether used in literal strings or not). – Swansdown 8/12, 2014 at 23:58

I

7

There's a command-line tool that ships with Java called native2ascii. It converts Unicode files to ASCII-escaped files. I've found that this is a necessary step for generating .properties files for localization.

Insentient answered 3/6, 2011 at 17:27 Comment(0)

H

6

Apache commons StringEscapeUtils.escapeEcmaScript(String) returns a string with unicode characters escaped using the \u notation.

"Art of Beer 🎨 🍺" -> "Art of Beer \u1F3A8 \u1F37A"

Huihuie answered 18/7, 2016 at 19:9 Comment(0)

F

4

There is an open-source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

Here is the Javadoc for the class StringUnicodeEncoderDecoder.

Flaw answered 27/12, 2018 at 13:40 Comment(2)

This is a very useful library. It solved my problem of converting from Cyrillic to Unicode escapes. Thank you Michael. – Dockage 20/6, 2020 at 15:22

@Dockage I am glad the library helped you. Can I ask you a small favor? Could you please go to the article about my library and leave a comment? Here are 2 links: linkedin.com/pulse/…, community.oracle.com/blogs/michaelgantman/2016/01/26/… – Flaw 20/6, 2020 at 17:56

B

3

Just some basic methods for that (inspired by the native2ascii tool):

/**
 * Encode a String like äöü to \u00e4\u00f6\u00fc
 * 
 * @param text
 * @return
 */
public String native2ascii(String text) {
    if (text == null)
        return text;
    StringBuilder sb = new StringBuilder();
    for (char ch : text.toCharArray()) {
        sb.append(native2ascii(ch));
    }
    return sb.toString();
}

/**
 * Encode a Character like ä to \u00e4
 * 
 * @param ch
 * @return
 */
public String native2ascii(char ch) {
    if (ch > '\u007f') {
        StringBuilder sb = new StringBuilder();
        // write \udddd
        sb.append("\\u");
        StringBuffer hex = new StringBuffer(Integer.toHexString(ch));
        hex.reverse();
        int length = 4 - hex.length();
        for (int j = 0; j < length; j++) {
            hex.append('0');
        }
        for (int j = 0; j < 4; j++) {
            sb.append(hex.charAt(3 - j));
        }
        return sb.toString();
    } else {
        return Character.toString(ch);
    }
}

Biometry answered 9/2, 2018 at 12:50 Comment(0)

J

0

You could probably hack it from this JavaScript code:

/* convert 🙌 to \uD83D\uDE4C */
function text_to_unicode(string) {
  'use strict';

  function is_whitespace(c) { return 9 === c || 10 === c || 13 === c || 32 === c;  }
  function left_pad(string) { return Array(4).concat(string).join('0').slice(-1 * Math.max(4, string.length)); }

  string = string.split('').map(function(c){ return "\\u" + left_pad(c.charCodeAt(0).toString(16).toUpperCase()); }).join('');

  return string;
}


/* convert \uD83D\uDE4C to 🙌 */
function unicode_to_text(string) {
  var  prefix = "\\\\u"
     , regex  = new RegExp(prefix + "([\\da-f]{4})","ig")
     ; 

  string = string.replace(regex, function(match, backtrace1){
    return String.fromCharCode( parseInt(backtrace1, 16) )
  });

  return string;
}

source: iCompile - Yet Another JavaScript Unicode Encode/Decode

Jolt answered 3/1, 2016 at 3:15 Comment(0)

S

0

This operation is called decoding/unescaping Unicode. This site links to an online converter.

Soundboard answered 10/6, 2020 at 11:20 Comment(0)
