Java regex match characters outside Basic Multilingual Plane

How can I match characters from outside the Unicode Basic Multilingual Plane (with the intention of removing them) in Java?

Bettyannbettye answered 27/10, 2010 at 16:43 Comment(0)
27

To remove all non-BMP characters, the following should work:

String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
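
For anyone who wants to try the one-liner above, here is a minimal self-contained sketch. The class name, method name and sample input are made up for illustration; only the replaceAll call comes from the answer.

```java
// Minimal sketch; NonBmpRegexDemo and stripNonBmp are illustrative names.
class NonBmpRegexDemo {

    // Removes every code point above U+FFFF, i.e. every valid surrogate pair.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair D83D DE00 (two chars).
        String input = "a\uD83D\uDE00b";
        String sanitized = stripNonBmp(input);

        System.out.println(input.length());     // 4 (UTF-16 code units)
        System.out.println(sanitized);          // ab
        System.out.println(sanitized.length()); // 2
    }
}
```

The pattern is matched per code point rather than per char, which is why the range up to U+FFFF does not swallow the two halves of a valid surrogate pair.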
Stagnate answered 27/10, 2010 at 17:19 Comment(7)
Have you actually tested this? Because your character range includes the surrogate range used to construct non-BMP codepoints. - Sech
@Anon: As you pointed out in your own answer, regexps are evaluated at the level of codepoints, not code units, so it doesn't see surrogates. - Doriedorin
Yes, this has been tested with non-BMP characters. - Stagnate
@Doriedorin - actually, I assumed that regex was evaluated at the character level, and that the non-BMP codepoint was simply translated into surrogates. - Sech
@Sech - More info on supplementary characters in Java: java.sun.com/developer/technicalArticles/Intl/Supplementary - Stagnate
WARNING: this substitution may introduce new astral characters by pairing previously unpaired surrogates, which may or may not be acceptable for the original question. For example, try String inputString = "\uD800\uD800\uDC00\uDC00"; - Weinberger
Side note #1: for the \uD800\uD800\uDC00\uDC00 example, @Anon's StringBuilder solution produces exactly the same output as the regex solution. Side note #2: applying this filtering twice (or more) may be required to get rid of non-BMP chars completely. - Christadelphian
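
The warning and side notes in these comments can be reproduced with a small sketch. The class and method names here are hypothetical, and looping until the string stops changing is just one way to read "applying this filtering twice (or more)".

```java
// Minimal sketch; SurrogateRepairDemo, stripOnce and stripUntilStable are illustrative names.
class SurrogateRepairDemo {

    // One pass of the regex from the answer.
    static String stripOnce(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    // Repeats the pass until nothing changes, so no valid surrogate pair remains.
    static String stripUntilStable(String input) {
        String current = input;
        String next = stripOnce(current);
        while (!next.equals(current)) {
            current = next;
            next = stripOnce(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Lone high surrogate, a valid pair (U+10000), then a lone low surrogate.
        String input = "\uD800\uD800\uDC00\uDC00";

        String once = stripOnce(input);
        // Removing the middle pair leaves the two lone surrogates adjacent,
        // and together they form a new supplementary character.
        System.out.println(once.length());                            // 2
        System.out.println(Integer.toHexString(once.codePointAt(0))); // 10000

        System.out.println(stripUntilStable(input).isEmpty());        // true
    }
}
```

Note that a surrogate that never ends up next to a matching partner is left in place by either variant, since on its own it is a code point inside the U+0000 to U+FFFF range.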
4

Are you looking for specific characters or all characters outside the BMP?

If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
  Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());

  Matcher matcher = regex.matcher(test);
  if (matcher.find())
  {
     System.out.println(matcher.start());   // prints 4, the index right after the first "test"
  }

If you're looking to remove all non-BMP characters from a string, then I'd use StringBuilder directly rather than regex:

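  StringBuilder sb = new StringBuilder(test.length());
  for (int ii = 0 ; ii < test.length() ; )
  {
     int codePoint = test.codePointAt(ii);
     if (codePoint > 0xFFFF)
     {
        // Supplementary code point: skip both of its surrogate chars.
        ii += Character.charCount(codePoint);
     }
     else
     {
        // BMP code point: keep it and advance by one char.
        sb.appendCodePoint(codePoint);
        ii++;
     }
  }
  String result = sb.toString();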
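
On Java 8 or later, the same filtering could probably be written with String.codePoints() instead of the explicit loop. This is a different technique from the one above, shown only as a sketch; the class and method names are made up.

```java
// Minimal sketch; NonBmpStreamDemo and stripNonBmp are illustrative names.
class NonBmpStreamDemo {

    // Keeps only code points in the BMP (U+0000 to U+FFFF).
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+10300 is the supplementary code point used in the example above.
        String test = new StringBuilder()
                .append("test").appendCodePoint(0x10300).append("test")
                .toString();

        System.out.println(test.length());     // 10 (the code point takes two chars)
        System.out.println(stripNonBmp(test)); // testtest
    }
}
```

It visits the string code point by code point just like the loop, so it should give the same output, including for unpaired surrogates.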
Sech answered 27/10, 2010 at 17:10 Comment(0)
