Convert escaped Unicode character back to actual character

Asked 4/12, 2012 at 10:4 Answered 23/1 at 22:14

I have the following value in a string variable in Java which has UTF-8 characters encoded like below

Dodd\u2013Frank

instead of

Dodd–Frank

(Assume that I don't have control over how this value is assigned to this string variable)

Now how do I convert (encode) it properly and store it back in a String variable?

I found the following code

Charset.forName("UTF-8").encode(str);

This returns a ByteBuffer, but I want a String back.

Edit:

Some more additional information.

When I use System.out.println(str); I get

Dodd\u2013Frank

I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.

Bacciferous answered 4/12, 2012 at 10:4 Comment(7)

the question is unclear to me. When you System.out.println(yourString); do you see (1) Dodd\u2013Frank or (2) Dodd–Frank ? – Churchgoer 4/12, 2012 at 10:6

Wrong, \u2013 is not a UTF-8 character, it is an escaped Unicode character. UTF-8 is a way of encoding Unicode characters. – Bagasse 4/12, 2012 at 10:6

@Churchgoer and SirDarius I have updated the question with details. – Bacciferous 4/12, 2012 at 10:8

Have a look at StringEscapeUtils.unescapeJava() – Churchgoer 4/12, 2012 at 10:13

Check the Apache Doc: commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/… – Shipentine 19/10, 2015 at 10:54

Just wanted to understand, why not "Dodd\u2013Frank".chars().forEach(a -> System.out.print((char) a)); ? – Mascon 10/7, 2018 at 16:56

org.apache.commons.lang3.StringEscapeUtils is deprecated, but moved to commons-text as import org.apache.commons.text.StringEscapeUtils which is not deprecated. – Sprouse 5/4, 2023 at 23:23

try

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

from Apache Commons Lang

Churchgoer answered 4/12, 2012 at 10:16 Comment(8)

If Java itself provides the functionality of parsing the value then why should we use any third party tool ? – Upi 4/12, 2012 at 10:17

@BhavikAmbani Then please explain how, because your answer definitely does not. – Bagasse 4/12, 2012 at 10:19

@BhavikAmbani in your own example, try System.out.println(string); before calling getBytes(); and see what happens ;) – Churchgoer 4/12, 2012 at 10:19

How come ? My answer solves the problem which is specified in the question asked, that convert unicode into readable string format. – Upi 4/12, 2012 at 10:20

@Churchgoer I have pasted that as well; you can check that it prints the correct output, which I took from the console – Upi 4/12, 2012 at 10:21

@BhavikAmbani nope, when he prints out his string, he sees Dodd\u2013Frank; when we print your string, we see Dodd–Frank (before any conversion). His String is "Dodd\\u2013Frank", your String is "Dodd\u2013Frank" – Churchgoer 4/12, 2012 at 10:21

This might solve your issue in a simple case, but be careful. If you are trying to use this solution, for example, on a JSON encoded string with UTF8 chars that you want unescaped, it will unescape things that you DONT want touched: For example, if this String is inside a piece of JSON "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e" – Zo 29/6, 2016 at 3:25

str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str); as commons.lang3 is deprecated. – Unipod 11/6, 2021 at 6:39

`java.util.Properties`

You can take advantage of the fact that java.util.Properties supports strings with \uXXXX escape sequences and do something like this:

Properties p = new Properties();
p.load(new StringReader("key = " + yourInputString));
System.out.println("Unescaped value: " + p.getProperty("key"));

Inelegant, but functional.

To handle the possible IOException, you may want a try-catch.

Properties p = new Properties();
try { 
   p.load(new StringReader("key = " + input)); 
} catch (IOException e) { 
   e.printStackTrace();
}
System.out.println("Unescaped value: " + p.getProperty("key"));
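
As noted in the comments on this answer, a single load call does not cope with newlines in the input. A minimal sketch of the line-at-a-time idea follows, assuming the input uses \n or \r\n line breaks. The class and method names (PropertiesLineUnescape, unescapeLines, unescapeLine) are made up for illustration. Keep in mind that Properties applies its own rules as well: leading whitespace of the value is dropped and other backslash sequences such as \t are also interpreted, so this is only suitable when that behaviour is acceptable.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: applies the Properties trick to each line separately.
final class PropertiesLineUnescape {

    static String unescapeLines(String input) {
        // Limit -1 keeps trailing empty lines; line breaks are normalised to '\n'.
        String[] lines = input.split("\r?\n", -1);
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(unescapeLine(lines[i]));
        }
        return out.toString();
    }

    static String unescapeLine(String line) {
        Properties p = new Properties();
        try {
            // "key" is an arbitrary property name used only to carry the value.
            p.load(new StringReader("key=" + line));
        } catch (IOException e) {
            // Reading from a StringReader should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return p.getProperty("key");
    }

    public static void main(String[] args) {
        System.out.println(unescapeLines("Dodd\\u2013Frank\nA\\u00B0"));
    }
}
```

A malformed escape (for example a backslash-u followed by fewer than four hex digits) makes Properties.load throw an IllegalArgumentException, so callers may want to catch that too.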

Duleba answered 4/6, 2014 at 20:27 Comment(4)

won't handle newlines – Relief 28/9, 2018 at 18:41

As written, true, though this solution could be applied to one line at a time. – Duleba 2/10, 2018 at 19:49

Yeah, I am just warning people as I faced that. I actually replaced new lines with some special string, converted and converted back, worked like a charm, but not perfect for production code. – Relief 3/10, 2018 at 9:52

Works. Another approach is to read the input one line at a time using a BufferedReader, BufferedInputStream, or similar, and apply this algorithm to each line. – Duleba 13/1, 2021 at 22:15

try

str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);

as org.apache.commons.lang3.StringEscapeUtils is deprecated.

Unipod answered 11/6, 2021 at 6:40 Comment(0)

Suppose you have a Unicode value, such as 00B0 (degree symbol, or superscript 'o', as in Spanish abbreviation for 'primero')

Here is a function that does just what you want:

public static String unicodeToString( char charValue )
{
    // String.valueOf avoids the Character constructor, which is deprecated since Java 9
    return String.valueOf( charValue );
}
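
If the value arrives as hex text (such as "00B0") rather than as a char, a small sketch along the following lines could do the conversion. It goes through an int code point so that values above U+FFFF also work. The class and method names (CodePointToString, fromHex) are hypothetical.

```java
// Hypothetical helper: turns a hex code point such as "00B0" into a String.
final class CodePointToString {

    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // Character.toChars yields one char, or a surrogate pair for code points above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("00B0"));
        System.out.println(fromHex("1F309"));
    }
}
```

Character.toChars throws an IllegalArgumentException for values that are not valid Unicode code points, and Integer.parseInt throws a NumberFormatException for non-hex text.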

Gavan answered 30/6, 2016 at 18:31 Comment(0)

I used StringEscapeUtils.unescapeXml to unescape a string loaded from an API that returns XML.

Nastassia answered 26/10, 2016 at 14:42 Comment(0)

UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.

new UnicodeUnescaper().translate("Dodd\\u2013Frank")

Tanah answered 4/11, 2020 at 19:51 Comment(1)

UnicodeUnescaper().translate(...) needs a writer presumably a StringWriter - you may as well just use import org.apache.commons.text.StringEscapeUtils.unescapeJava from commons-text. – Sprouse 5/4, 2023 at 23:27

In case you cannot add a dependency to your project, or you simply don't want to, here is a relatively simple implementation using a regular expression.

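import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UnicodeUnescape {

    // Matches a backslash-u escape with four hex digits that is not itself preceded by a backslash.
    private static final Pattern UNICODE_ESCAPE_PATTERN = 
            Pattern.compile("(?<!\\\\)\\\\u(\\p{XDigit}{4})");

    public static String unescape(String input) {
        return UNICODE_ESCAPE_PATTERN.matcher(input).replaceAll(match -> {
            char c = (char) Integer.parseInt(match.group(1), 16);
            // Quote the result so that a decoded '\' or '$' is not read as replacement syntax.
            return Matcher.quoteReplacement(Character.toString(c));
        });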
    }

    private UnicodeUnescape() {}
}

This is not the most efficient implementation, and it only handles Unicode escape sequences, unlike StringEscapeUtils#unescapeJava(String) from Apache Commons Text, which also handles the other Java escapes.

Note that Matcher#replaceAll(Function) was added in Java 9.

Remove (?<!\\\\) from the regex if literals like \\u2013 should still be "unescaped" into \–.
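
For projects still on Java 8, the same replacement can be sketched with Matcher#appendReplacement and Matcher#appendTail, which have been available since Java 1.4. This is an illustrative variant rather than a drop-in from any library; the class name UnicodeUnescapeJava8 is made up, and the pattern is copied from the class above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java 8 compatible variant of the regex approach.
final class UnicodeUnescapeJava8 {

    // A backslash-u escape with four hex digits, not preceded by another backslash.
    private static final Pattern UNICODE_ESCAPE_PATTERN =
            Pattern.compile("(?<!\\\\)\\\\u(\\p{XDigit}{4})");

    static String unescape(String input) {
        Matcher matcher = UNICODE_ESCAPE_PATTERN.matcher(input);
        // The StringBuffer overloads are the ones that exist on Java 8.
        StringBuffer out = new StringBuffer(input.length());
        while (matcher.find()) {
            char c = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being treated as replacement syntax.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Dodd\\u2013Frank"));
    }
}
```

Like the original, the lookbehind only checks for a single preceding backslash, so it does not distinguish an even from an odd run of backslashes before the u.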

Here are some very basic, non-exhaustive unit tests:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

class UnicodeUnescapeTests {

    @Test
    @DisplayName("Unicode sequence is unescaped")
    void testUnescape() {
        var unescaped = UnicodeUnescape.unescape("Dodd\\u2013Frank");
        assertEquals("Dodd–Frank", unescaped);
    }

    @Test
    @DisplayName("surrogate pair is unescaped")
    void testUnescapeSurrogatePair() {
        var unescaped = UnicodeUnescape.unescape("Dodd Frank \\uD83C\\uDF09");
        assertEquals("Dodd Frank 🌉", unescaped);
    }

    @Test
    @DisplayName("escaped Unicode sequence is unchanged")
    void testEscapedUnicodeSequence() {
        var unescaped = UnicodeUnescape.unescape("Dodd\\\\u2013Frank");
        assertEquals("Dodd\\\\u2013Frank", unescaped);
    }
}

Output (from Gradle):

UnicodeUnescapeTests > escaped Unicode sequence is unchanged PASSED
UnicodeUnescapeTests > surrogate pair is unescaped PASSED
UnicodeUnescapeTests > Unicode sequence is unescaped PASSED

Tremendous answered 23/1 at 22:14 Comment(0)

-3

You can convert that byte buffer to String like this :

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public final class ByteBuffers {

    public static String byteBufferToString(ByteBuffer buffer)
    {
        // A CharsetDecoder is stateful and not thread-safe, so create one per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        int oldPosition = buffer.position();
        try
        {
            return decoder.decode(buffer).toString();
        }
        catch (CharacterCodingException e)
        {
            e.printStackTrace();
            return "";
        }
        finally
        {
            // reset the buffer's position to its original so it is not altered
            buffer.position(oldPosition);
        }
    }
}

Payoff answered 4/12, 2012 at 10:8 Comment(1)

decoder is an object of the CharsetDecoder class in the java.nio.charset package. Sorry for not including that earlier; see the edited answer. Thanks for reminding me. :) – Payoff 4/12, 2012 at 10:16

-3

Perhaps the following solution which decodes the string correctly without any additional dependencies.

This works in a Scala REPL, and should work just as well in a Java-only solution.

import java.nio.charset.StandardCharsets
import java.nio.charset.Charset

> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank
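
The point raised in the comments on this answer can be checked with a small Java sketch. It compares a literal where the compiler has already processed the escape with one that really contains a backslash followed by u2013, and shows that the encode/decode round trip leaves both unchanged. The class and method names (RoundTripDemo, roundTrip) are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a UTF-8 encode/decode round trip does not unescape anything.
final class RoundTripDemo {

    static String roundTrip(String s) {
        return StandardCharsets.UTF_8.decode(StandardCharsets.UTF_8.encode(s)).toString();
    }

    public static void main(String[] args) {
        // The compiler turns this escape into an en dash before the program runs.
        String compiled = "Dodd\u2013Frank";
        // This one holds six separate characters: backslash, u, 2, 0, 1, 3.
        String literal = "Dodd\\u2013Frank";

        System.out.println(compiled.length() + " -> " + roundTrip(compiled).length());
        System.out.println(literal.length() + " -> " + roundTrip(literal).length());
        System.out.println(roundTrip(literal).equals(literal));
    }
}
```

The second string is the situation described in the question, and it comes back from the round trip with its backslash sequence intact.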

Azarria answered 24/10, 2018 at 20:42 Comment(3)

Tried this, but what is actually decoding the UTF-8 character is the fact that it is given directly in the String. What your example does, is to take a UTF-8 String, encode that, decode that, and - luckily - we get the same output as the input. – Kelantan 6/3, 2019 at 11:47

curious, what is a string example would fail to convert for this solution? – Azarria 7/3, 2019 at 15:24

In the source "\u2013" is already converted to the UTF-8 character. What would be a correct representation of the problem is "\\u2013" as the text to be converted contains the backslash and each character individually. – Kelantan 10/3, 2019 at 11:46
