How do I convert special characters using java?

I have strings like:

Avery® Laser &amp; Inkjet Self-Adhesive

I need to convert them to:

Avery Laser & Inkjet Self-Adhesive.

That is, remove special characters and convert HTML entities to regular characters.

Appointee answered 18/2, 2010 at 9:22 Comment(3)
I'm interested in why you are getting HTML-encoded strings in the first place. In my "ideal" app the programmer should never have to deal with them: you HTML-encode the result on output, but you never receive encoded input. Galah
It's legacy code which saves data in this raw format; I need to read and convert it. Appointee
Oh. As for the strange chars: it looks like the text was originally UTF-8 and was decoded (read) as ISO-8859-1 (Western ISO). For example, Ñ takes 2 bytes in UTF-8, so if you read it as ISO-8859-1 you get two strange chars instead. If that is the case and you know the encodings, you could use new String(byte[], encodingName) and someString.getBytes(encodingName) to recover the correct chars. Galah
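The re-encoding trick from the last comment can be sketched as below. This is only an illustration: the class and method names are made up, and it assumes the text really was UTF-8 that got decoded as ISO-8859-1 (if it was decoded as windows-1252 instead, some bytes will not round-trip).

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    /** Re-interprets text that was decoded as ISO-8859-1 but was really UTF-8. */
    static String repair(String misdecoded) {
        // ISO-8859-1 maps chars 0-255 one-to-one onto bytes, so this recovers the raw bytes
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+0091 is what "Ñ" (UTF-8 bytes C3 91) turns into when read as ISO-8859-1
        String broken = "\u00C3\u0091";
        System.out.println(repair(broken).equals("\u00D1")); // prints: true
    }
}
```

This should only be applied to strings known to be damaged in this way; running it on correctly decoded text will corrupt any non-ASCII characters.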
T
20
Avery® Laser &amp; Inkjet Self-Adhesive

First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of any characters outside the printable ASCII range.

Summarized:

String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");

...which produces

Avery Laser & Inkjet Self-Adhesive

(without the trailing dot as in your example, but that wasn't present in the original ;) )
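For trying the two steps without adding a dependency, here is a rough standard-library sketch. The unescapeAmpOnly() method is a hypothetical stand-in for StringEscapeUtils.unescapeHtml4() and decodes nothing but &amp;; the class and method names are invented for illustration.

```java
class AsciiCleaner {
    /** Hypothetical stand-in for StringEscapeUtils.unescapeHtml4(); handles only &amp;. */
    static String unescapeAmpOnly(String s) {
        return s.replace("&amp;", "&");
    }

    /** Drops every character outside the printable ASCII range 0x20-0x7E. */
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign from the question's example
        String dirty = "Avery\u00AE Laser &amp; Inkjet Self-Adhesive";
        System.out.println(stripNonPrintableAscii(unescapeAmpOnly(dirty)));
        // prints: Avery Laser & Inkjet Self-Adhesive
    }
}
```

Note that the regex also removes legitimate non-ASCII letters such as é or ñ, as well as tabs and newlines, so it only fits data that is expected to be plain single-line ASCII.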

That said, this looks more like a request for a workaround than for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® appears to be caused by reading the string with the wrong encoding, and the &amp; by reading the string with a text-based parser instead of a full-fledged HTML parser.

Tacet answered 18/2, 2010 at 15:16 Comment(1)
Yep, the trailing dot is my typo. You're right that these strings are the result of a text-based parser reading HTML. Appointee
H
6

You can use the StringEscapeUtils class from the Apache Commons Text project.

Homograft answered 18/2, 2010 at 9:27 Comment(0)
D
1

Maybe you can use something like:

    yourTxt = yourTxt.replace("&amp;", "&");

In one project I did something like:

    public String replaceAcutesHTML(String str) {
        // replace() matches literally, so no regex is involved
        str = str.replace("&aacute;", "á");
        str = str.replace("&eacute;", "é");
        str = str.replace("&iacute;", "í");
        str = str.replace("&oacute;", "ó");
        str = str.replace("&uacute;", "ú");
        str = str.replace("&Aacute;", "Á");
        str = str.replace("&Eacute;", "É");
        str = str.replace("&Iacute;", "Í");
        str = str.replace("&Oacute;", "Ó");
        str = str.replace("&Uacute;", "Ú");
        str = str.replace("&ntilde;", "ñ");
        str = str.replace("&Ntilde;", "Ñ");
        return str;
    }

Dykstra answered 18/2, 2010 at 15:5 Comment(2)
That means you need to unescape every occurrence of every HTML entity yourself, which is a pain, especially when someone has already written it for you. Intertype
That would work, but it's not an ideal approach. You'd have to build (and maintain) a set of all special characters to replace. Where possible, it's better to use an existing library or decoder than to do manual replacements. It also happens to be easier and less tedious to implement! Rakia
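If a hand-maintained table is unavoidable, a single regex pass over the string tends to be easier to extend than a chain of replace calls, and it avoids decoding the same text twice. The following is only a sketch: the class name, the method name, and the tiny entity subset are all illustrative, and a library such as Commons Text remains the better choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityTable {
    // Illustrative subset only; a real table would be much longer.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("aacute", "\u00E1"); // a with acute accent
        ENTITIES.put("ntilde", "\u00F1"); // lowercase n with tilde
        ENTITIES.put("Ntilde", "\u00D1"); // uppercase N with tilde
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    /** Decodes known named entities in one pass; unknown ones are left untouched. */
    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.getOrDefault(m.group(1), m.group());
            // quoteReplacement() keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Espa&ntilde;a &amp;amp; &unknown;"));
        // prints: España &amp; &unknown;
    }
}
```

Because each entity is matched once and the output is never rescanned, "&amp;amp;" decodes to "&amp;" rather than collapsing all the way to "&".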
F
1

In case you want to mimic what the PHP function htmlspecialchars_decode() does, use the PHP function get_html_translation_table() to dump the table and then use Java code like:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Insertion-ordered so that "&amp;" is decoded last; decoding it first
    // would turn "&amp;lt;" into "&lt;" and then wrongly into "<".
    static final Map<String, String> html_specialchars_table = new LinkedHashMap<>();
    static {
            html_specialchars_table.put("&lt;", "<");
            html_specialchars_table.put("&gt;", ">");
            html_specialchars_table.put("&amp;", "&");
    }
    static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
            for (Map.Entry<String, String> e : html_specialchars_table.entrySet()) {
                    // replace() matches the key literally instead of as a regex
                    s = s.replace(e.getKey(), e.getValue());
            }
            return s;
    }
Fizgig answered 18/4, 2012 at 6:43 Comment(0)
