StringEscapeUtils not handling utf-8

I have a string like this

String incoming = "<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";

and I'm escaping it using the StringEscapeUtils

import org.apache.commons.text.StringEscapeUtils;
String escaped = StringEscapeUtils.escapeJava(incoming);

The result is

<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>

I've tried converting it to UTF-8 by getting the bytes, but that doesn't work. Is there any way to fix this?

Here's what I tried:

String s = new String(escaped.getBytes("UTF-8"), "UTF-8");

I've also tried a different library to escape the text, but it still doesn't work.

Cuisse answered 11/12, 2019 at 7:19 Comment(0)

I'm assuming that you want the characters such as single quote, double quote and backslash in your input String to be escaped, but you want the Greek characters to remain unchanged.

Unfortunately, StringEscapeUtils.escapeJava() translates every character with a Unicode value > 0x7f to its Unicode escape equivalent. For example, your sample data shows that the Greek letter tau (τ) is escaped to \u03C4 in the String returned by StringEscapeUtils.escapeJava(). I don't know why escapeJava() does this. Its Javadoc states "Escapes the characters in a String using Java String rules." but I couldn't find a formal definition of "Java String rules".
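This also explains why the getBytes() round trip from the question has no effect: after escapeJava(), the string holds only ASCII characters (a literal backslash, 'u' and four hex digits per Greek letter), and encoding then decoding with the same charset simply gives back the same string. A minimal JDK-only sketch of that point; the class and method names are made up for illustration, and the escaped value is a hand-written stand-in for escapeJava() output rather than a call to the library:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons Text.
final class RoundTripDemo {

    // True when every char in s is 7-bit ASCII.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // Encodes to UTF-8 bytes and decodes them again with the same charset.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for escapeJava() output: each Greek letter is six ASCII chars
        // (backslash, 'u', four hex digits), not one Greek char.
        String escaped = "\\u0395\\u03C5\\u03C7";

        System.out.println("all ASCII:  " + isAscii(escaped));
        System.out.println("unchanged:  " + utf8RoundTrip(escaped).equals(escaped));
    }
}
```

So the charset conversion is not at fault; the escapes have to be translated back explicitly.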

A simple way to remove the Unicode escapes in the string returned by StringEscapeUtils.escapeJava() is to call the translate() method of the UnicodeUnescaper class. Its Javadoc describes the class as follows:

"Translates escaped Unicode values of the form \u+\d\d\d\d back to Unicode. It supports multiple 'u' characters and will work with or without the +."

So calling UnicodeUnescaper.translate() will return a String that:

  • Leaves the other escaped characters in the string, such as the double quote, untouched.
  • Replaces the Unicode escapes with their Greek character equivalents. For example, \u03C4 will be changed to τ.

The code is straightforward. Using your data:

import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.UnicodeUnescaper;

void convert() {
    String incoming = "<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";
    String escaped = StringEscapeUtils.escapeJava(incoming); 
    String greekChars = new UnicodeUnescaper().translate(escaped);

    System.out.println("incoming:   " + incoming); 
    System.out.println("escaped:    " + escaped);    // Quotes are escaped, and Greek characters are converted to Unicode escapes.
    System.out.println("greekChars: " + greekChars); // Quotes remain escaped, but Unicode escapes are converted back to Greek characters.
}

This is the output from the println() calls:

run:
incoming:   <html> <head></head> <body>  <p><span style="font-family: Arial;">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
escaped:    <html> <head></head> <body>  <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>
greekChars: <html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
BUILD SUCCESSFUL (total time: 0 seconds)
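If it helps to see roughly what the unescaping step does, here is a JDK-only sketch based on the Javadoc description quoted above (a backslash, one or more 'u', an optional '+', four hex digits). It is not the Commons Text implementation, and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; use UnicodeUnescaper in real code.
final class UnicodeUnescapeSketch {

    // A backslash, one or more 'u', an optional '+', then exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u+\\+?([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text before the escape unchanged (this keeps escaped quotes as they are).
            out.append(input, last, m.start());
            // Turn the four hex digits into the UTF-16 code unit they name.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a fragment of escapeJava() output.
        String escaped = "style=\\\"font-family: Arial;\\\" T\\u03B9";
        System.out.println(unescape(escaped));
    }
}
```

One caveat with this sketch: it only looks for a backslash followed by 'u', so an escaped backslash that happens to be followed by 'u' and four hex digits is rewritten as well.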

Notes:

  • Be sure to use package org.apache.commons.text.translate for UnicodeUnescaper. Older, deprecated versions of the class exist in org.apache.commons.lang3.text.translate. Apache Commons Text is available from its download page and was at version 1.8 when this answer was written.
  • This is not an ideal solution, because it is calling UnicodeUnescaper.translate() to fix the mess created by StringEscapeUtils.escapeJava(). There may be other approaches that are cleaner (by using an alternative to StringEscapeUtils.escapeJava()), but this way seems to work fine for your data.
Foyer answered 14/12, 2019 at 5:14 Comment(3)
Thanks, this works like a charm for my use case. It fixes the mess caused by escapeJava(). - Cuisse
@orcluser [1] As quoted in my answer, UnicodeUnescaper.translate() will translate "escaped Unicode values of the form \u+\d\d\d\d back to Unicode". But the string you are passing to translate() (i.e. "f&uuml;r") is not of that form. - Foyer
@orcluser ...[2] Also note that the Javadoc for escapeHtml() states that it "Escapes the characters in a String using HTML entities", and that is exactly what has happened in your example: "ü" was correctly escaped to "&uuml;". So your issue is unrelated to using German characters. Perhaps try replacing the call to escapeHtml() with a call to escapeJava() to resolve your issue if that is feasible? - Foyer

I faced the same problem and didn't want to call escapeJava() plus UnicodeUnescaper.translate(), so I copied some code from inside StringEscapeUtils and came up with this:

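import java.util.Map; // Map.of() needs Java 9 or later

import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.LookupTranslator;

public class NamingUtil {
    // Escapes only double quotes, backslashes and the control characters listed in
    // JAVA_CTRL_CHARS_ESCAPE; everything else, including non-ASCII text, is left as-is.
    private static final CharSequenceTranslator javaStringEscapeTranslator;

    static {
        javaStringEscapeTranslator = new AggregateTranslator(
            new LookupTranslator(Map.of("\"", "\\\"", "\\", "\\\\")),
            new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE));
    }

    public static String escapeJavaString(String value) {
        return javaStringEscapeTranslator.translate(value);
    }
}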

It's more code, but it should be faster and less prone to breaking silently than converting back and forth.
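For comparison, a rough JDK-only sketch of the same idea: escape double quotes, backslashes and a handful of control characters, and copy everything else (including Greek text) unchanged. The class name is hypothetical, and the set of control characters handled here is an assumption about what is usually needed, not a copy of the Commons Text tables:

```java
// Hypothetical helper, shown only to illustrate the approach without Commons Text.
final class MinimalJavaEscaper {

    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break; // double quote
                case '\\': out.append("\\\\"); break; // backslash
                case '\b': out.append("\\b");  break; // backspace
                case '\n': out.append("\\n");  break; // line feed
                case '\t': out.append("\\t");  break; // tab
                case '\f': out.append("\\f");  break; // form feed
                case '\r': out.append("\\r");  break; // carriage return
                default:   out.append(c);             // non-ASCII chars pass through untouched
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<span style=\"font-family: Arial;\">\u03C4</span>"));
    }
}
```

Because nothing is ever turned into a Unicode escape, there is no second pass that has to undo it.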

Bara answered 24/9, 2024 at 12:53 Comment(0)
