removing characters of a specific unicode range from a string
L

5

17

I have a program that is parsing tweets in real time from the Twitter stream API. Before storing them, I am encoding them as UTF-8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective Unicode codes and cause problems. Upon further investigation, I found that the problematic characters are from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols And Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just my desired Unicode range.

String utf8tweet = "";
        try {
            byte[] utf8Bytes = status.getText().getBytes("UTF-8");

            utf8tweet = new String(utf8Bytes, "UTF-8");

        } 
        catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");

What can I do to remove these characters?

Luftwaffe answered 17/8, 2012 at 21:21 Comment(3)
When you say it doesn't work, exactly what behavior do you see? Instead of using the range [\\u1f300-\\u1f64f], did you try using a single character and see if that works? I suspect that the regex range syntax would have problems with unicode characters.Privateer
If you see ? instead of a Unicode character when displaying a Unicode-encoded string in a GUI component or in IDE console output, don't worry: it is not due to the Unicode encoding, it is due to a display font that doesn't support those Unicode code points, like Latin-1 fonts (255 code points only). Try to use a font with broad Unicode coverage, like Arial Unicode MSPrank
Sorry for not being specific! By "not work" I meant the character was not found by the matcher, or at least the replaceAll function wasn't applied to it. Thanks, eee! That is a good point. However, I am noticing unicodes in my output (i.e. "u20A2") while the characters in question remain as ??Luftwaffe
L
36

Add the negation operator ^ to the character class in the regex pattern. To keep only ASCII characters you could use the expression [^\\x00-\\x7F], which matches (and so replaces) every character outside that range, and you should get the desired result.

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");

            utf8tweet = new String(utf8Bytes, "UTF-8");

        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);

        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}

Results in the following output:

Before: #Hello twitter  How are you?
After: #Hello twitter   How are you?

EDIT

To explain further, you could also keep expressing the range with the \u form in the following way [^\\u0000-\\u007F], which will match all the characters which are not the first 128 UNICODE characters (the same as before). If you want to extend the range to support extra characters, you can do so using the UNICODE character list here.

For example if you want to include vowels with accent (used in Spanish) you should extend the range to \u00FF, so you have [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:

Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter   How are you? á é í ó ú
Lyn answered 17/8, 2012 at 21:33 Comment(5)
The problematic characters were removed! :) (? represents one of the problematic characters in this case) But so were all characters... including # ! . BEFORE: #MentionSomeoneYouDontWannaLose@OG_RiiSky ! or i'd be ? . AFTER: MentionSomeoneYouDontWannaLose@OG_RiiSky or i d be Was the problematic character removed because the regex thought it was actually a question mark or was it actually able to pull it from that range?Luftwaffe
You're right. I edited the answer changing the used regex, it will match only printable characters.Lyn
Thanks! That is working so much better :) Out of curiosity, how did you get that new pattern from the unicode character range? It seems to be eliminating certain characters outside the range BEFORE: RT @JulianSerrano01: #ContraseñasQueTuve "notelavoyadecir" le puse esa contraseña a la unica PC de la casa en ese momento, se las decia ... AFTER: RT @JulianSerrano01: #Contrase asQueTuve "notelavoyadecir" le puse esa contrase a a la unica PC de la casa en ese momento, se las decia ...Luftwaffe
I got it from another SO question I answered a little while ago :) (see the link at the end of the comment). I didn't initially think of it, but then it seemed a proper solution. The proposed regex looks for the characters that are NOT printable, that is, that are not in the specified range. #11811801Lyn
Thanks for your edit!! I have altered the unicode range in the pattern to specify all the characters I want to allow. It is working perfectly :) For anyone curious, the pattern I ended up using is [^\\u0000-\\uFFEF], which allows pretty much all characters before specials and emoji emoticons that would break my program.Luftwaffe
C
24

First of all, the Unicode block concerned is specified in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex:

s = s.replaceAll("\\p{So}+", "");
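
One caveat worth checking before using this: \p{So} is the general category "Symbol, other", not a block, so it also matches symbols such as © and ° in addition to emoji. Here is a minimal sketch of the difference, assuming Java 7 or later (where the InMiscellaneousSymbolsAndPictographs and InEmoticons block names are recognized). The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class SymbolStripper {
    // General category "Symbol, other": emoji, but also ©, ®, ° and many more.
    private static final Pattern OTHER_SYMBOLS = Pattern.compile("\\p{So}+");

    // Only the two blocks named in the question. No '|' between them:
    // inside a character class a '|' would be a literal pipe.
    private static final Pattern EMOJI_BLOCKS = Pattern.compile(
            "[\\p{InMiscellaneousSymbolsAndPictographs}\\p{InEmoticons}]+");

    static String stripOtherSymbols(String s) {
        return OTHER_SYMBOLS.matcher(s).replaceAll("");
    }

    static String stripEmojiBlocks(String s) {
        return EMOJI_BLOCKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00A9 is the copyright sign, U+1F601 is an emoticon.
        String tweet = "\u00A9 2012 \uD83D\uDE01 ok";
        System.out.println(stripOtherSymbols(tweet)); // drops both symbols
        System.out.println(stripEmojiBlocks(tweet));  // drops only the emoticon
    }
}
```

If the copyright sign and similar symbols should survive, the block-based pattern is probably the safer of the two.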
Carabao answered 18/8, 2012 at 0:7 Comment(5)
you can use s.replaceAll("\\p{So}+", "") in correct Java (declared as OTHER_SYMBOL)Lange
How do you find out that "So" corresponds to Miscellaneous? I'm using the verbose form of the block at the moment: [\\p{InMiscellaneousSymbolsAndPictographs}|\\p{InEmoticons}]+Arthritis
@Arthritis yes that was the reason I originally used the long name, to be found in the javadoc. Though definitely too long, that at least is self-documenting.Carabao
@Arthritis found this link on the java Pattern javadoc. See categories.Carabao
@Arthritis okay, "So" can be found in the javadoc: docs.oracle.com/javase/7/docs/api/java/lang/…Carabao
O
7

I tried this. The Unicode ranges in the pattern cover the emoji blocks.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Strings; // Guava, needed for isNullOrEmpty

class EmojiEraser {

    // Each surrogate pair in the literal is read by Pattern as one code point,
    // so the first three classes are ranges above U+FFFF.
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string potentially containing emojis (comes as unicode stringified)
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Replace every match with the empty string.
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Outmoded answered 6/9, 2015 at 4:53 Comment(1)
This regular expression is not working for me. Do you have another solution? When I test it in an online tool against my string, it does not report a match. My Unicode string is \u263A\uD83D\uDE0A\uD83D\uDE22\uD83D\uDC4DConoscenti
P
0

Assuming status.getText() returns a java.lang.String...

byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");

The above transcoding operation produces the same results as:

utf8tweet = status.getText();

Java strings are implicitly UTF-16. UTF-16 and UTF-8 share the same character set (Unicode) so transforming from one to the other and back results in the original data.
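
As a quick check of that claim, here is a small sketch of the round trip. The class and method names are only illustrative, and it assumes a well-formed string (an unpaired surrogate would not survive encoding).

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encode to UTF-8 bytes and decode straight back, as the question's code does.
    static String viaUtf8(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) plus U+1F601, a supplementary character.
        String original = "caf\u00E9 \uD83D\uDE01";
        System.out.println(original.equals(viaUtf8(original)));
    }
}
```

Using StandardCharsets (Java 7+) also avoids the checked UnsupportedEncodingException from the string-named overloads.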

Java regular expressions support the supplementary range using surrogate pairs. You can match them as described in the answers to this question.
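
This also seems to explain the behavior in the question. In a Java regex, \u takes exactly four hex digits, so [\\u1f300-\\u1f64f] is read as the character U+1F30, the range from '0' to U+1F64, and the letter 'f'. That range covers most letters and digits. A minimal sketch of the misread pattern next to a \x{...} version, assuming Java 7 or later; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SupplementaryRange {
    // \x{h...h} (Java 7+) names a code point above U+FFFF directly.
    private static final Pattern EMOJI =
            Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // The pattern from the question, kept only to show what it really matches:
    // U+1F30, then '0' through U+1F64, then 'f'.
    private static final Pattern MISREAD =
            Pattern.compile("[\\u1f300-\\u1f64f]");

    static String stripEmoji(String s) {
        return EMOJI.matcher(s).replaceAll(" ");
    }

    static String stripMisread(String s) {
        return MISREAD.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String tweet = "Hi #1 \uD83D\uDE01"; // ends with U+1F601
        System.out.println("[" + stripEmoji(tweet) + "]");   // only the emoji is replaced
        System.out.println("[" + stripMisread(tweet) + "]"); // letters and digits are replaced instead
    }
}
```

The Pattern flags from the question are left out here, since none of them is needed to match a plain code point range.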

As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.

Pigheaded answered 17/8, 2012 at 21:55 Comment(2)
I understand that the font may not be rendering the character, however the problem is I am sending these strings to my node.js server via socket.io. When node runs into that character on the server, it reads it as transport end (undefined) and drops my connection. So the characters have to be removed somehow :)Luftwaffe
@Luftwaffe - sounds like a problem with the transport protocol.Pigheaded
O
0

If you don't want to mess with regular expressions, then you can just test the unicode blocks instead:

private static final Set<Character.UnicodeBlock> BLACKLIST=Set.of(
    Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS,
    Character.UnicodeBlock.EMOTICONS);

public String sanitize(String verbatim) {
    int[] cps=verbatim.codePoints()
        .filter(cp -> !BLACKLIST.contains(Character.UnicodeBlock.of(cp)))
        .toArray();
    return new String(cps, 0, cps.length);
}

Also, emoji processing libraries in Java are pretty good these days, and some handle pictographs too, like sigpwned/emoji4j. With that library, you could write the following code:

public String sanitize(String verbatim) {
    return new GraphemeMatcher(verbatim).replaceAll(mr -> "");
}

Disclaimer: I wrote that library, so I may be biased about its utility and simplicity. :)

Orpington answered 31/7, 2023 at 14:59 Comment(0)
