How can I match characters from outside the Unicode Basic Multilingual Plane in Java, with the intention of removing them?
Java regex match characters outside Basic Multilingual Plane
To remove all non-BMP characters, the following should work:
String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
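A minimal, self-contained sketch of the one-liner above, for anyone who wants to try it. The class name RegexNonBmpDemo and the helper stripNonBmp are made up for illustration; the sample character U+1F600 is just an arbitrary supplementary code point.

```java
public class RegexNonBmpDemo {
    // Hypothetical helper wrapping the regex from the answer:
    // removes every code point above U+FFFF.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (stored as a surrogate pair), then "b"
        String input = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println(input.length());                          // 4 UTF-16 code units
        System.out.println(input.codePointCount(0, input.length())); // 3 code points
        System.out.println(stripNonBmp(input));                      // ab
    }
}
```

The length and code point count differ because the supplementary character occupies two char values, while the regex engine treats it as a single code point.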
Have you actually tested this? Because your character range includes the surrogate range used to construct non-BMP codepoints. –
Sech
@Anon: As you pointed out in your own answer, regexps are evaluated at the level of codepoints, not codeunits, so it doesn't see surrogates. –
Doriedorin
Yes, this has been tested with non-BMP characters. –
Stagnate
@Doriedorin - actually, I assumed that regex was evaluated at the character level, and that the non-BMP codepoint was simply translated into surrogates. –
Sech
@Sech - More info on supplementary characters in java: java.sun.com/developer/technicalArticles/Intl/Supplementary –
Stagnate
WARNING: this substitution may introduce new astral characters by pairing previously unpaired surrogates, which may or may not be acceptable for the original question: try String inputString = "\uD800\uD800\uDC00\uDC00"; –
Weinberger
Side note #1: for the \uD800\uD800\uDC00\uDC00 example, @Anon's StringBuilder solution produces exactly the same output as the regex solution. Side note #2: applying this filtering twice (or more) may be required to get rid of non-BMP chars completely. –
Christadelphian
Are you looking for specific characters or all characters outside the BMP?
If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected:
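import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "test", then U+10300 (a supplementary code point), then "test"
String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
Matcher matcher = regex.matcher(test);
matcher.find();
System.out.println(matcher.start()); // 4, the index where U+10300 starts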
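Since the question is about removing rather than just finding, here is a rough sketch of how the same idea might be turned into a removal. The class SpecificCodePointDemo and the method removeCodePoint are hypothetical names; Pattern.quote is used so the code point is always treated as a literal.

```java
import java.util.regex.Pattern;

public class SpecificCodePointDemo {
    // Hypothetical helper: removes every occurrence of a single code point.
    // Works the same way for BMP and supplementary code points.
    static String removeCodePoint(String input, int codePoint) {
        String literal = new String(Character.toChars(codePoint));
        return input.replaceAll(Pattern.quote(literal), "");
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test")
                .appendCodePoint(0x10300)
                .append("test")
                .toString();
        System.out.println(removeCodePoint(test, 0x10300)); // testtest
    }
}
```

For a single fixed character, plain String.replace would do as well; the regex form is shown only because it matches the approach in this answer.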
If you're looking to remove all non-BMP characters from a string, then I'd use StringBuilder directly rather than regex:
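StringBuilder sb = new StringBuilder(test.length());
for (int ii = 0 ; ii < test.length() ; )
{
    int codePoint = test.codePointAt(ii);
    if (codePoint > 0xFFFF)
    {
        // Supplementary code point: skip both of its UTF-16 code units.
        ii += Character.charCount(codePoint);
    }
    else
    {
        // BMP code point (one code unit): keep it.
        sb.appendCodePoint(codePoint);
        ii++;
    }
}
String result = sb.toString(); // test with all non-BMP code points removed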
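The comments above warn that a single pass can leave a new surrogate pair behind when the input contains unpaired surrogates. A possible way to handle that, sketched under the assumption that repeating the pass until the string stops changing is acceptable: the class NonBmpFilter and its methods stripOnce and stripUntilStable are invented names, and the loop body is the same logic as in the answer.

```java
public class NonBmpFilter {
    // One pass of the loop from the answer: keeps BMP code points,
    // skips supplementary ones.
    static String stripOnce(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (Character.isBmpCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    // Repeats the pass until nothing changes, so that pairs formed by
    // previously unpaired surrogates are removed as well.
    static String stripUntilStable(String input) {
        String current = input;
        String next = stripOnce(current);
        while (!next.equals(current)) {
            current = next;
            next = stripOnce(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Lone high surrogate, a valid pair, lone low surrogate
        String input = "\uD800\uD800\uDC00\uDC00";

        String once = stripOnce(input);
        System.out.println(once.length());                     // 2
        System.out.println(once.codePointAt(0) > 0xFFFF);      // true
        System.out.println(stripUntilStable(input).isEmpty()); // true
    }
}
```

After the first pass the two leftover surrogates sit next to each other and read as U+10000, which is why a second pass is needed for this input. Whether lone surrogates should be dropped at all depends on the use case.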