I'm assuming that you want characters such as single quote, double quote and backslash in your input String to be escaped, but you want the Greek characters to remain unchanged.
Unfortunately, StringEscapeUtils.escapeJava() translates any character with a Unicode value > 0x7f to its Unicode escape equivalent. For example, your sample data shows that the Greek letter tau (τ) is escaped to \u03C4 in the String returned by StringEscapeUtils.escapeJava(). I don't know why escapeJava() does this. Its Javadoc states "Escapes the characters in a String using Java String rules." but I couldn't find a formal definition of "Java String rules".
A simple way to remove the Unicode escapes from the string returned by StringEscapeUtils.escapeJava() is to call the translate() method of the UnicodeUnescaper class, whose Javadoc describes it as follows:
Translates escaped Unicode values of the form \u+\d\d\d\d back to
Unicode. It supports multiple 'u' characters and will work with or
without the +.
So calling translate() on a UnicodeUnescaper instance will return a String that:
- Leaves the other escaped characters in the string, such as double quote, untouched.
- Replaces the Unicode escapes with their Greek character equivalents. For example, \u03C4 will be changed to τ.
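To make that behaviour concrete, here is a minimal JDK-only sketch of the same idea. It is not the Commons Text implementation: the class name UnicodeUnescapeSketch, the method unescapeUnicode() and the regular expression are illustrative assumptions based on the Javadoc description above (a backslash, one or more 'u' characters, an optional +, then four hex digits).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, written only to illustrate the idea; not part of Commons Text.
final class UnicodeUnescapeSketch {

    // Assumed pattern: a backslash, one or more 'u', an optional '+', then four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+\\+?([0-9a-fA-F]{4})");

    static String unescapeUnicode(String text) {
        Matcher m = UNICODE_ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            // Copy everything between the previous match and this one unchanged,
            // so escaped quotes and other escape sequences are left alone.
            out.append(text, last, m.start());
            // Replace the escape with the character whose code unit it names.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // An escaped quote followed by an escaped Greek iota.
        String escaped = "style=\\\"Arial\\\">T\\u03B9";
        System.out.println(unescapeUnicode(escaped));
    }
}
```

The escaped quotes pass through untouched because the pattern only matches a backslash that is immediately followed by 'u'.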
The code using Commons Text is straightforward. Using your data:
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.UnicodeUnescaper;

// The class name is arbitrary; it is only here so that the snippet compiles and runs.
public class GreekEscapeDemo {
    public static void main(String[] args) {
        String incoming = "<html> <head></head> <body> <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";
        String escaped = StringEscapeUtils.escapeJava(incoming);
        String greekChars = new UnicodeUnescaper().translate(escaped);
        System.out.println("incoming: " + incoming);
        System.out.println("escaped: " + escaped); // Quotes are escaped, and Greek characters are converted to Unicode escapes.
        System.out.println("greekChars: " + greekChars); // Quotes remain escaped, but Unicode escapes are converted back to Greek characters.
    }
}
This is the output from the println() calls:
run:
incoming: <html> <head></head> <body> <p><span style="font-family: Arial;">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
escaped: <html> <head></head> <body> <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>
greekChars: <html> <head></head> <body> <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
BUILD SUCCESSFUL (total time: 0 seconds)
Notes:
- Be sure to use package org.apache.commons.text.translate for UnicodeUnescaper. Older deprecated versions exist in org.apache.commons.lang3.text.translate. Apache Commons Text is available from its download page, and was at version 1.8 at the time of writing.
- This is not an ideal solution, because it calls translate() on a UnicodeUnescaper to fix the mess created by StringEscapeUtils.escapeJava(). There may be cleaner approaches (using an alternative to StringEscapeUtils.escapeJava()), but this way seems to work fine for your data.
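As one illustration of what such an alternative might look like, here is a rough JDK-only sketch that escapes only double quotes, backslashes and a few common control characters, and never touches the Greek text in the first place. The class name QuoteEscaperSketch and the method escapeKeepingUnicode() are hypothetical, and the set of characters handled is my assumption about what you need. In particular it leaves single quotes alone (as far as I can tell escapeJava() does too) and passes any other control characters through unchanged.

```java
// Hypothetical helper, shown only as a sketch; not part of Commons Text.
final class QuoteEscaperSketch {

    static String escapeKeepingUnicode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break; // backslash
                case '"':  out.append("\\\""); break; // double quote
                case '\n': out.append("\\n");  break; // newline
                case '\r': out.append("\\r");  break; // carriage return
                case '\t': out.append("\\t");  break; // tab
                case '\b': out.append("\\b");  break; // backspace
                case '\f': out.append("\\f");  break; // form feed
                default:   out.append(c);             // everything else, including Greek, is copied as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String incoming = "<span style=\"font-family: Arial;\">Ευχαριστώ</span>";
        System.out.println(escapeKeepingUnicode(incoming));
    }
}
```

If you also need single quotes escaped, adding one more case to the switch should be enough.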