Replace Unicode Control Characters

I need to replace all special control characters in a string in Java.

I want to query the Google Maps API v3, and Google doesn't seem to like these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm

So I receive some data, and I need to geocode it. I know some characters will not pass the geocoding, but I don't know the exact list.

I was not able to find any documentation about this issue, so I think the list of characters that Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function to get rid of these characters, or do I have to build a new one that replaces them one by one?

Or is there a good regexp to get the job done?

And does anybody know exactly which characters Google doesn't like?

Edit: Google has created a web page for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs

Round answered 9/8, 2010 at 9:48 Comment(2)
Can you manually remove the %C2%8F part of your URL to see if that URL is valid? – Phonation
I can manually replace all the characters that are not valid. The problem is that I don't know the whole list (and I don't want to test them one by one), and I don't want to do a replaceAll for each invalid character either. – Round

If you want to delete all characters in the Other/Control Unicode category (Cc), you can do something like this:

    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes (among others) the actual '\u008f' character from the string, not its percent-encoded form "%C2%8F" in a URL.
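If the data only arrives percent-encoded, as in the question's URL, one option is to decode it first, strip the control characters, and encode it again. The following is a minimal sketch under that assumption; the class and method names are hypothetical and not part of the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodedControlCleaner {
    // Hypothetical helper: decode the value, drop Cc characters, re-encode it.
    static String cleanEncoded(String encoded) {
        try {
            // "%C2%8F" becomes the single character U+008F after UTF-8 decoding
            String decoded = URLDecoder.decode(encoded, "UTF-8");
            String cleaned = decoded.replaceAll("\\p{Cc}", "");
            return URLEncoder.encode(cleaned, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanEncoded("NEW%20YORK%C2%8F")); // NEW+YORK
    }
}
```

Note that URLEncoder produces form encoding, so the space comes back as "+" rather than "%20".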

If the blacklist is not nicely captured by one Unicode block/category, Java has powerful character class arithmetic (intersection, subtraction, etc.) that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else becomes illegal.

Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character other than the lowercase vowels.

[a-z&&[^aeiou]] intersects [a-z] with [^aeiou], which amounts to [a-z] minus [aeiou], i.e. all lowercase consonants.
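Applied to the question's problem, the same subtraction could remove control characters while keeping a few that are often wanted. This is only a sketch: the choice to keep tab, line feed, and carriage return is an assumption, and the class and method names are hypothetical.

```java
class ControlSubtraction {
    // Hypothetical helper: removes every Cc character except \t, \n and \r.
    static String stripControlsExceptWhitespace(String s) {
        // \p{Cc} intersected with "anything but tab, LF, CR"
        return s.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String cleaned = stripControlsExceptWhitespace("a\u0000b\tc\u008fd");
        System.out.println(cleaned.equals("ab\tcd")); // true
    }
}
```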

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
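For international text, the whitelist can be written with Unicode categories instead of a-z. The sketch below assumes that letters, combining marks, numbers, punctuation, and space separators are the legal set; that set is a guess for illustration, not a documented Google requirement, and it also drops symbols such as "+" or "$". The names are hypothetical.

```java
class UnicodeWhitelist {
    // Hypothetical helper: keeps L (letters), M (marks), N (numbers),
    // P (punctuation) and Zs (space separators); removes everything else.
    static String keepTextOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Zs}]", "");
    }

    public static void main(String[] args) {
        // \u5317\u4eac is a two-character Chinese place name ("Beijing")
        String input = "\u5317\u4eac \u008fNY, 10001\u0007";
        System.out.println(keepTextOnly(input).equals("\u5317\u4eac NY, 10001")); // true
    }
}
```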

Rajasthani answered 9/8, 2010 at 10:39 Comment(5)
The problem is that I am going to use Chinese, Arabic, every possible UTF-8 character :) I will try with \p{Cc}! – Round
@Scorpi0: the above are just examples. Find whatever Unicode category/block you want to black/white-list and compose the regex as you wish using elements shown here. – Rajasthani
Oh, \p{Cc}, one more undocumented pattern expression. Nice one. Good to know. – Oza
@BalusC: I'm no Unicode expert, but I think it is documented: "Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters." Replace L with Cc, or any other category name. – Rajasthani
With Oracle Java 1.6.0_29 on Linux, "\\p{Cc}" didn't work for me but "\\p{C}" did (without the lowercase "c"). I have no idea why. – Geophilous
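
One possible explanation for the last comment, which the commenter does not confirm, is that the offending character was not in Cc at all. \p{C} is the broader Other category, which also covers format characters (Cf) among others. A small sketch of the difference, using U+FEFF (a format character) as an assumed example and hypothetical names:

```java
class CategoryScope {
    // Removes only control characters (category Cc)
    static String stripCc(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Removes the whole "Other" category, which includes Cc and Cf
    static String stripC(String s) {
        return s.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args) {
        String input = "a\uFEFFb\u008fc";
        System.out.println(stripCc(input).length()); // 4, U+FEFF is still there
        System.out.println(stripC(input));           // abc
    }
}
```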
