Convert escaped Unicode character back to actual character
B

9

36

I have the following value in a string variable in Java which has UTF-8 characters encoded like below

Dodd\u2013Frank

instead of

Dodd–Frank

(Assume that I don't have control over how this value is assigned to this string variable)

Now how do I convert (encode) it properly and store it back in a String variable?

I found the following code

Charset.forName("UTF-8").encode(str);

This returns a ByteBuffer, but I want a String back.

Edit:

Some more additional information.

When I use System.out.println(str); I get

Dodd\u2013Frank

I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.

Bacciferous answered 4/12, 2012 at 10:4 Comment(7)
the question is unclear to me. When you System.out.println(yourString); do you see (1) Dodd\u2013Frank or (2) Dodd–Frank ?Churchgoer
Wrong, \u2013 is not a UTF-8 character, it is an escaped Unicode character. UTF-8 is a way of encoding Unicode characters.Bagasse
@Churchgoer and SirDarius I have updated the question with details.Bacciferous
Have a look at StringEscapeUtils.unescapeJava()Churchgoer
Check the Apache Doc: commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/…Shipentine
Just wanted to understand, why not "Dodd\u2013Frank".chars().forEach(a -> System.out.print((char) a)); ?Mascon
org.apache.commons.lang3.StringEscapeUtils is deprecated, but moved to commons-text as import org.apache.commons.text.StringEscapeUtils which is not deprecated.Sprouse
C
63

try

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

from Apache Commons Lang

Churchgoer answered 4/12, 2012 at 10:16 Comment(8)
If Java itself provides the functionality of parsing the value then why should we use any third party tool ?Upi
@BhavikAmbani Then please explain how, because your answer definitely does not.Bagasse
@BhavikAmbani in your own example, try System.out.println(string); before calling getBytes(); and see what happens ;)Churchgoer
How come ? My answer solves the problem which is specified in the question asked, that convert unicode into readable string format.Upi
@Churchgoer I have pasted that as well; you can check that it prints the correct output, which I have taken from the console.Upi
@BhavikAmbani nope, when he prints out his string, he sees Dodd\u2013Frank, when we print your string we see Dodd-Frank. (before any conversion), his String is "Dodd\\u2013Frank", your String is "Dodd\u2013Frank"Churchgoer
This might solve your issue in a simple case, but be careful. If you are trying to use this solution, for example, on a JSON encoded string with UTF8 chars that you want unescaped, it will unescape things that you DONT want touched: For example, if this String is inside a piece of JSON "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e"Zo
str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str); as commons.lang3 is deprecated.Unipod
D
17

java.util.Properties

You can take advantage of the fact that java.util.Properties supports strings with \uXXXX escape sequences and do something like this:

Properties p = new Properties();
p.load(new StringReader("key = " + yourInputString));
System.out.println("Unescaped value: " + p.getProperty("key"));

Inelegant, but functional.

To handle the possible IOException, you may want a try-catch.

Properties p = new Properties();
try { 
   p.load(new StringReader("key = " + input)); 
} catch (IOException e) { 
   e.printStackTrace();
}
System.out.println("Unescaped value: " + p.getProperty("key"));
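
As a rough sketch of the line-at-a-time idea raised in the comments, the Properties trick could be wrapped in a helper like the one below. The class and method names (PropertiesUnescape, unescapeLine, unescape) are hypothetical. It assumes the input uses plain line feeds and that each line is safe to treat as a property value: Properties also interprets other backslash sequences, drops leading whitespace in the value, and throws IllegalArgumentException on a malformed \uXXXX sequence.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesUnescape {

    // Unescapes one line by loading it as the value of a throwaway property.
    static String unescapeLine(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("key=" + line));
        } catch (IOException e) {
            // A StringReader is not expected to fail on I/O; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return p.getProperty("key");
    }

    // Splits on line feeds, unescapes each line separately, and joins the results again.
    static String unescape(String input) {
        String[] lines = input.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(unescapeLine(lines[i]));
        }
        return out.toString();
    }
}
```

Calling PropertiesUnescape.unescape(str) on the string from the question should then give back Dodd–Frank, with any line feeds in the input kept in place.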
Duleba answered 4/6, 2014 at 20:27 Comment(4)
won't handle newlinesRelief
As written, true, though this solution could be applied to one line at a time.Duleba
Yeah, I am just warning people as I faced that. I actually replaced new lines with some special string, converted and converted back, worked like a charm, but not perfect for production code.Relief
Works. Another approach is to read in one line at a time using a BufferedReader, BufferedInputStream, or similar, and apply this algorithm to each line.Duleba
U
2

try

str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);

as org.apache.commons.lang3.StringEscapeUtils is deprecated.

Unipod answered 11/6, 2021 at 6:40 Comment(0)
G
0

Suppose you have a Unicode value, such as 00B0 (degree symbol, or superscript 'o', as in Spanish abbreviation for 'primero')

Here is a function that does just what you want:

public static String unicodeToString(char charValue)
{
    // String.valueOf avoids the Character constructor, which is deprecated since Java 9.
    return String.valueOf(charValue);
}
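
If the value arrives as hexadecimal text such as "00B0" rather than as a char, a small variation along these lines might help. This is only a sketch: the class and method names (CodePoints, fromHex) are made up for illustration, and it assumes the text is a valid code point in hex with no \u prefix.

```java
final class CodePoints {

    // Converts a hexadecimal code point such as "00B0" or "1F309" to a String.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // Character.toChars yields one char for BMP code points and a surrogate
        // pair for supplementary ones; it throws IllegalArgumentException for invalid values.
        return new String(Character.toChars(codePoint));
    }
}
```

Unlike a plain cast to char, this also covers code points above U+FFFF, which need two chars in a Java String.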
Gavan answered 30/6, 2016 at 18:31 Comment(0)
N
0

I used StringEscapeUtils.unescapeXml to unescape the string loaded from an API that gives XML result.

Nastassia answered 26/10, 2016 at 14:42 Comment(0)
T
0

UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.

new UnicodeUnescaper().translate("Dodd\\u2013Frank")

Tanah answered 4/11, 2020 at 19:51 Comment(1)
UnicodeUnescaper().translate(...) needs a writer presumably a StringWriter - you may as well just use import org.apache.commons.text.StringEscapeUtils.unescapeJava from commons-text.Sprouse
T
0

In case you cannot add a dependency to your project, or you simply don't want to, here is a relatively simple implementation using a regular expression.

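import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UnicodeUnescape {

    // Matches a backslash-u escape with four hex digits, unless a backslash precedes it.
    private static final Pattern UNICODE_ESCAPE_PATTERN = 
            Pattern.compile("(?<!\\\\)\\\\u(\\p{XDigit}{4})");

    public static String unescape(String input) {
        return UNICODE_ESCAPE_PATTERN.matcher(input).replaceAll(match -> {
            char c = (char) Integer.parseInt(match.group(1), 16);
            // Quote the result so a decoded '$' or backslash is not read as replacement syntax.
            return Matcher.quoteReplacement(Character.toString(c));
        });
    }

    private UnicodeUnescape() {}
}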

This is not the most efficient implementation, and it handles only Unicode escape sequences, unlike StringEscapeUtils#unescapeJava(String) from Apache Commons Text.

Note that Matcher#replaceAll(Function) was added in Java 9.
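
On Java 8, where that overload is not available, roughly the same thing can be done with the older appendReplacement/appendTail loop. A minimal sketch, assuming the same regex as above; the class name UnicodeUnescapeJava8 is just for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescapeJava8 {

    // Same pattern as above: four hex digits after a backslash-u that is not itself escaped.
    private static final Pattern UNICODE_ESCAPE_PATTERN =
            Pattern.compile("(?<!\\\\)\\\\u(\\p{XDigit}{4})");

    static String unescape(String input) {
        Matcher matcher = UNICODE_ESCAPE_PATTERN.matcher(input);
        // appendReplacement only accepts a StringBuffer before Java 9.
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            char c = (char) Integer.parseInt(matcher.group(1), 16);
            // Quote the replacement so a decoded '$' or backslash is taken literally.
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(c)));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    private UnicodeUnescapeJava8() {}
}
```

The quoteReplacement call matters here because appendReplacement treats $ and \ in the replacement string specially.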

Remove (?<!\\\\) from the regex if literals like \\u2013 should still be "unescaped" into \–.


Here are some very basic, non-exhaustive unit tests:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

class UnicodeUnescapeTests {

    @Test
    @DisplayName("Unicode sequence is unescaped")
    void testUnescape() {
        var unescaped = UnicodeUnescape.unescape("Dodd\\u2013Frank");
        assertEquals("Dodd–Frank", unescaped);
    }

    @Test
    @DisplayName("surrogate pair is unescaped")
    void testUnescapeSurrogatePair() {
        var unescaped = UnicodeUnescape.unescape("Dodd Frank \\uD83C\\uDF09");
        assertEquals("Dodd Frank 🌉", unescaped);
    }

    @Test
    @DisplayName("escaped Unicode sequence is unchanged")
    void testEscapedUnicodeSequence() {
        var unescaped = UnicodeUnescape.unescape("Dodd\\\\u2013Frank");
        assertEquals("Dodd\\\\u2013Frank", unescaped);
    }
}

Output (from Gradle):

UnicodeUnescapeTests > escaped Unicode sequence is unchanged PASSED
UnicodeUnescapeTests > surrogate pair is unescaped PASSED
UnicodeUnescapeTests > Unicode sequence is unescaped PASSED
Tremendous answered 23/1 at 22:14 Comment(0)
P
-3

You can convert that ByteBuffer to a String like this:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public final class ByteBuffers {

    public static String byteBufferToString(ByteBuffer buffer)
    {
        // CharsetDecoder has no static factory method; obtain a decoder from a Charset.
        // Decoders are stateful, so a fresh one is created for each call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // position() is inherited from java.nio.Buffer; remember it so it can be restored.
        int oldPosition = buffer.position();
        try
        {
            return decoder.decode(buffer).toString();
        }
        catch (CharacterCodingException e)
        {
            e.printStackTrace();
            return "";
        }
        finally
        {
            // Reset the buffer's position to its original value so it is not altered.
            buffer.position(oldPosition);
        }
    }

    private ByteBuffers() {}
}
Payoff answered 4/12, 2012 at 10:8 Comment(1)
decoder is an object of the CharsetDecoder class in the java.nio.charset package. Sorry to update that. See the edited answer. Thanks for reminding me. :)Payoff
A
-3

Perhaps try the following solution, which decodes the string correctly without any additional dependencies.

This works in a Scala REPL, though it should work just as well in plain Java.

import java.nio.charset.StandardCharsets
import java.nio.charset.Charset

> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank
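
For comparison, here is a small Java check of the same round trip, assuming the input really holds a literal backslash followed by u2013, as in the question. The class name RoundTripCheck is hypothetical. Under that assumption the encode/decode pair appears to be a no-op, and the escape comes back unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RoundTripCheck {

    // Encodes to UTF-8 bytes and decodes them again; valid text comes back as it went in.
    static String roundTrip(String s) {
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(s);
        return StandardCharsets.UTF_8.decode(bytes).toString();
    }

    public static void main(String[] args) {
        // Double backslash in source: the runtime string holds the six escape characters.
        String fromQuestion = "Dodd\\u2013Frank";
        System.out.println(roundTrip(fromQuestion));
        System.out.println(roundTrip(fromQuestion).equals(fromQuestion));
    }
}
```

With a single backslash in the source, the compiler has already turned the escape into the en dash before any of this code runs, which is why the REPL example looks like it works.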
Azarria answered 24/10, 2018 at 20:42 Comment(3)
Tried this, but what is actually decoding the UTF-8 character is the fact that it is given directly in the String. What your example does, is to take a UTF-8 String, encode that, decode that, and - luckily - we get the same output as the input.Kelantan
Curious: what is an example of a string that would fail to convert with this solution?Azarria
In the source "\u2013" is already converted to the UTF-8 character. What would be a correct representation of the problem is "\\u2013" as the text to be converted contains the backslash and each character individually.Kelantan
