How to remove invalid Unicode characters from strings in Java

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which, according to fileformat.info, are not valid Unicode characters or are Unicode replacement characters, for example U+D83D or U+FFFD. If those characters are in the file, CoreNLP responds with error messages like this one:

Nov 15, 2015 5:15:38 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)

Based on this answer, I tried document.replaceAll("\\p{C}", ""); to just remove those characters. document here is just the document as a string. But that didn't help.

How can I remove those characters from the string before passing it to CoreNLP?

UPDATE (Nov 16th):

For the sake of completeness I should mention that I asked this question only in order to avoid the huge number of error messages by preprocessing the file. CoreNLP just ignores characters it can't handle, so that is not the problem.

Fellowman answered 15/11, 2015 at 16:30 Comment(2)
The replaceAll method creates a new String; it doesn't modify document. Did you do document = document.replaceAll(...) (or something else to capture the return value)?Stepheniestephens
I used it in the instantiation of the DocumentProcessor class in this line: DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document.replaceAll("\\p{C}", "")));.Fellowman

In a way, the answers provided by Mukesh Kumar and GsusRecovery are both helpful, but not fully correct.

document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");

seems to replace all invalid characters. But there are even more characters that CoreNLP does not support. I identified them manually by running the parser on my whole corpus, which led to this:

document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");

So right now I am running two replaceAll() commands before handing the document to the parser. The complete code snippet is:

// remove invalid Unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
// remove other Unicode characters CoreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
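For context, the tagger and parser used in the loop above are assumed to be created along the lines of the standard CoreNLP dependency parser demo; the model paths below are the demo defaults and may differ for your setup:

import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// assumed setup, following the standard CoreNLP demo
String taggerPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
MaxentTagger tagger = new MaxentTagger(taggerPath);
DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);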

This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on GitHub.

Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.

UPDATE (Nov 27th):

Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example to tokenize a document:

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = null;
factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // do something with the sentence
}

You can replace noneDelete in line 4 with other options. I am citing Manning:

"(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."

That means that, to keep the characters without getting all those error messages, the best option is noneKeep. This approach is far more elegant than any attempt to remove those characters.
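If you tokenize through the full StanfordCoreNLP pipeline rather than a DocumentPreprocessor, the same untokenizable option can, as far as I understand it, be passed via the tokenizer properties; a minimal sketch:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
// "noneKeep": keep untokenizable characters as single-character tokens without logging warnings
props.setProperty("tokenize.options", "untokenizable=noneKeep");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(new Annotation(document));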

Fellowman answered 15/11, 2015 at 19:53 Comment(1)
Good work, I've updated my answer to optimize the process using a single "not in one of the allowed Unicode groups" approach. Try it and read the associated documentation. Waiting for an official response to optionally refine it; I think it may be the best approach.Cuisine

Remove specific unwanted chars with:

document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");

If you find other unwanted chars, simply add them to the list using the same schema.

UPDATE:

The Unicode chars are split by the regex engine into 7 macro-groups (and several sub-groups), identified by one letter (macro-group) or two letters (sub-group).

Basing my reasoning on your examples and the Unicode classes indicated in the always useful Regular Expressions Site, I think you can try a single whitelist-only pass such as this:

document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]","")

This regex removes anything that is not:

  • \p{L}: a letter in any language
  • \p{N}: a number
  • \p{Z}: any kind of whitespace or invisible separator
  • \p{Sm}\p{Sc}\p{Sk}: math symbols, currency signs, or modifier symbols (marks used as standalone characters)
  • \p{Mc}*: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  • \p{Pi}\p{Pf}\p{Pc}*: opening quotes, closing quotes, word connectors (e.g. the underscore)

*: I think these groups may be eligible for removal as well for the purposes of CoreNLP.

This way you need only a single regex filter and can handle groups of chars (with the same purpose) instead of single cases.
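A minimal sketch of this whitelist approach with the pattern compiled once and reused across documents (the field name is just for illustration):

import java.util.regex.Pattern;

// compile the whitelist pattern once and reuse it for every document
private static final Pattern NOT_ALLOWED =
        Pattern.compile("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]");

String cleaned = NOT_ALLOWED.matcher(document).replaceAll("");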

Cuisine answered 15/11, 2015 at 17:38 Comment(3)
Thanks for the update. I think this might be too much, though. For example, one problem was U+3010 (fileformat.info/info/unicode/char/3010/index.htm), which belongs to the group Ps (any kind of opening bracket). But wouldn't (, [ or { also be removed, unnecessarily in my case? Before I start to remove stuff I don't want to, I'd rather live with the error messages and let CoreNLP do the job itself.Fellowman
Test if there are differences in the output provided by CoreNLP when using the filter (maybe this is the case, maybe not). Being a white-list, you can always simply add the chars you want to keep to the list, as in "[^\\p{L}..\\(\\)\\[\\]\\{\\}]".Cuisine
Yeah you are right. Probably the best solution to my problem. Thanks!Fellowman

Assuming you have a String such as:

String xml = "...."; xml = xml.replaceAll("[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]", "");

This will solve your problem.

Lorelle answered 15/11, 2015 at 16:55 Comment(3)
It says String literal is not properly closed by a double-quote.Fellowman
All the \u escapes need to be doubled -> \\u.Cuisine
Hm, ok, that did the trick. The U+D83D errors seem to be gone, maybe also others (I have a huge corpus, so I am not sure). What I still get are U+FFFD, U+FE0F, U+203C and U+3010. At least I don't see anything else at first glance. How can I get rid of those? Another thing, could you specify what exactly is removed? I want to be sure that nothing I don't want to be removed is removed.Fellowman
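For reference, a sketch of the corrected form with doubled escapes, as pointed out in the comments above:

String xml = "....";  // your document
// doubled \\u so the regex engine, not the Java compiler, interprets the escapes
xml = xml.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");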

I observed a negative impact in other places when using replaceAll. So I propose to replace a character only if it is a non-BMP character, like below:

private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    input.codePoints().forEach((i) -> {
        if (Character.isSupplementaryCodePoint(i)) {
            // code point outside the Basic Multilingual Plane (BMP): replace with "?"
            strBuilder.append("?");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}
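A quick usage sketch (the emoji input is just a hypothetical example):

// U+1F600 is a supplementary code point, so it is replaced with "?"
String cleaned = removeNonBMPCharacters("hello \uD83D\uDE00 world");
System.out.println(cleaned);  // prints: hello ? world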
Antiphrasis answered 24/1, 2019 at 8:35 Comment(0)
