How can I iterate through the unicode codepoints of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

But my concerns are

  • I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.
Sst answered 6/10, 2009 at 20:13 Comment(0)

Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, that is, as two char values each.
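To make the storage question concrete, here is a minimal sketch showing that a code point outside the BMP takes up two char values while a BMP code point takes up one. The class name SurrogateDemo, its helper methods, and the sample string are illustrative choices, not part of the original answer.

```java
class SurrogateDemo {
    // U+1D53C (a mathematical double-struck capital E) lies outside the BMP,
    // so it is written here as the surrogate pair D835 DD3C.
    static final String SAMPLE = "a\uD835\uDD3Cb";

    /** Number of UTF-16 char values in the string. */
    static int charLength(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charLength(SAMPLE));      // 4
        System.out.println(codePointLength(SAMPLE)); // 3
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(SAMPLE.charAt(2)));  // true
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));  // 1d53c
    }
}
```

The gap between the two counts is why indexing by char offset and indexing by code point offset diverge as soon as a supplementary character appears.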

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
Winkler answered 6/10, 2009 at 20:21 Comment(14)
As for whether or not it's "expensive", well... there is no other way built into Java. But if you're dealing only with Latin/European/Cyrillic/Greek/Hebrew/Arabic scripts, then you can just use s.charAt() to your heart's content. :) - Winkler
But you shouldn't. For instance, if your program outputs XML and someone gives it some obscure mathematical operator, suddenly your XML may be invalid. - Lavonia
@Jonathan Feinberg That's what I thought. But then came that special mathematical E. UTF-16 works 99% of the time, but then it gets really painful, especially when the problems stay hidden for a long time. - Involved
I would have used offset = s.offsetByCodePoints(offset, 1);. Is there some benefit in using offset += Character.charCount(codepoint); instead? - Oeuvre
@PaulGroke Yes, there is. The function offsetByCodePoints (it redirects to Character.offsetByCodePoints) is about 50 lines long with loops and so on, whereas charCount is a one-liner with a single numeric comparison, so I guess offsetByCodePoints costs noticeably more. - Godevil
@Mechanicalsnail I don't understand your comment. Why would outputting XML cause this answer to misbehave? - Hanks
@Hanks The answer is fine. He was referring to @Jonathan Feinberg's comment, which advocates using charAt(), and that is a bad idea. - Frictional
"If you know you'll be dealing with characters outside the BMP": this is a bad omen. - Unthankful
Small modification to make it more continue-friendly: final int length = s.length(); for (int codepoint, offset = 0; offset < length; offset += Character.charCount(codepoint)) { codepoint = s.codePointAt(offset); // do something with the codepoint } - Dufour
The proposed approach breaks the rule of not changing the value of the loop counter from within the body of the loop itself. - Uncommercial
@Uncommercial What rule? - Winkler
@JonathanFeinberg I'm referring to the SonarQube rule rules.sonarsource.com/java/RSPEC-1994, labeled as critical. It is raised when the counter is incremented in the body of the loop instead of in its dedicated increment clause. - Uncommercial
@Uncommercial That's a silly rule of thumb that only makes sense for very basic loops. This is a more complex case which justifies an exception. - Hanks
As a supplement, String rune = new String(new int[]{codepoint}, 0, 1); can be used to turn a codepoint into a readable String containing just that code point (one or two chars in UTF-16). - Meacham
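The variations raised in these comments can be put side by side. The following is only a sketch: the class CodePointLoops and the method names viaCharCount, viaOffsetByCodePoints, and toText are hypothetical, chosen for illustration. It makes no claim about relative performance; it only shows that both ways of advancing the offset visit the same code points.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoops {
    /** Advances with Character.charCount, in the continue-friendly form. */
    static List<Integer> viaCharCount(String s) {
        List<Integer> out = new ArrayList<>();
        final int length = s.length();
        for (int codepoint, offset = 0; offset < length;
                offset += Character.charCount(codepoint)) {
            codepoint = s.codePointAt(offset);
            out.add(codepoint);
        }
        return out;
    }

    /** Advances with String#offsetByCodePoints instead. */
    static List<Integer> viaOffsetByCodePoints(String s) {
        List<Integer> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length;
                offset = s.offsetByCodePoints(offset, 1)) {
            out.add(s.codePointAt(offset));
        }
        return out;
    }

    /** Turns a single code point back into a String of one or two chars. */
    static String toText(int codepoint) {
        return new String(Character.toChars(codepoint));
    }
}
```

Character.toChars is used in toText as an alternative to the int[] constructor mentioned in the last comment; both produce the same String.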

Java 8 added CharSequence#codePoints which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.
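As a slightly fuller illustration of both styles, here is a sketch that uses the stream directly in one method and the toArray() form in another. The class name CodePointStreams and the two method names are made up for this example.

```java
import java.util.stream.Collectors;

class CodePointStreams {
    /** Formats each code point as U+XXXX, separated by spaces. */
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /** Reverses a string code point by code point, keeping surrogate pairs intact. */
    static String reverseByCodePoint(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }
}
```

For the input "a\uD835\uDD3Cb", describe yields "U+0061 U+1D53C U+0062", which shows the surrogate pair being reported as a single code point.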

Journalist answered 6/1, 2015 at 11:46 Comment(2)
for (int c : (Iterable<Integer>) () -> string.codePoints().iterator()) also works. - Galloping
Slightly shorter version of @saka1029's code: for (int c : (Iterable<Integer>) string.codePoints()::iterator) ... - Metric

Thought I'd add a workaround method that works with foreach loops, and that you can easily swap for Java 8's String#codePoints method once you move to Java 8:

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the method:

import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0; // char offset of the next code point
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          if (!hasNext()) {
            throw new NoSuchElementException(); // required by the Iterator contract
          }
          int result = string.codePointAt(nextIndex);
          nextIndex += Character.charCount(result); // advance by 1 or 2 chars
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}

Or, alternatively, if you just want to convert a string to a list of Integer code points (if your code could use such a list more easily; this might use more RAM than the above approach):

import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);
    offset += Character.charCount(codepoint); // advance by 1 or 2 chars
  }
  return out;
}

Thankfully, this uses codePointAt, which safely handles UTF-16 surrogate pairs (UTF-16 being the encoding Java exposes for strings).
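Once Java 8 is available, the list conversion can probably be reduced to a single stream pipeline. The sketch below assumes Java 8 or later; the class name CodePointLists is hypothetical, and the method keeps the stringToCodePoints name and null check only to make the comparison easy.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointLists {
    /** Java 8+ counterpart of stringToCodePoints, built on String#codePoints. */
    static List<Integer> stringToCodePoints(String in) {
        if (in == null) {
            throw new NullPointerException("got null");
        }
        // boxed() converts the IntStream into a Stream<Integer> so it can be collected.
        return in.codePoints().boxed().collect(Collectors.toList());
    }
}
```

Call sites that use the foreach-friendly codePoints(myString) helper could likewise iterate over myString.codePoints().toArray() after the move.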

Sightread answered 14/2, 2014 at 23:4 Comment(0)

Iterating over code points is filed as a feature request at Sun.

See Bug Report

There is also an example of how to iterate over the code points of a String there.

Piers answered 6/10, 2009 at 20:22 Comment(1)
Java 8 now has a codePoints() method built in to String: docs.oracle.com/javase/8/docs/api/java/lang/… - Quantity
