Iterating through Unicode codepoints character by character

Asked 26/11, 2011 at 22:5 Answered 26/11, 2011 at 22:11

I've got a series of Unicode codepoints. What I really need to do is iterate through them as a series of characters, not a series of codepoints, and determine properties of each individual character, e.g. whether it is a letter.

Imagine that I was writing a Unicode-aware textbox, and the user entered a character that spans more than one codepoint, such as "e with diacritic". I know that this specific character can be represented as one codepoint as well, and can be normalized to that form, but I don't think that's possible in the general case. How could I implement backspace? It obviously can't just erase the last codepoint, because the user might have just entered more than one codepoint.

How can I iterate over a bunch of Unicode codepoints as characters?

Edit: The Break Iterators offered by ICU appear to be pretty much what I need. However, I'm not using ICU, so any references on how to implement my own equivalent functionality would be an accepted answer.

Another edit: It turns out that the Windows API does indeed offer this functionality. MSDN just isn't very good about putting all the string functions in one place. CharNext is the function I'm looking for.
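
To make the backspace problem above concrete, here is a minimal sketch. It is in Python rather than C++ only because Python's standard library ships Unicode character data. It assumes that a "character" is a base codepoint followed by any mark codepoints (general categories Mn, Mc, Me), which is only an approximation of the grapheme cluster rules in Unicode Standard Annex #29; it does not handle Hangul jamo, emoji sequences, or CR LF. The names is_extender, clusters and backspace are made up for this illustration.

```python
import unicodedata


def is_extender(ch):
    """Rough stand-in for 'this codepoint attaches to the previous one'.

    Assumption: nonspacing (Mn), spacing (Mc) and enclosing (Me) marks
    extend the preceding base. The real UAX #29 rules cover more cases.
    """
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def clusters(text):
    """Split text into base-plus-marks groups (simplified grapheme clusters)."""
    out = []
    for ch in text:  # a Python str iterates codepoint by codepoint
        if out and is_extender(ch):
            out[-1] += ch  # attach the mark to the previous base
        else:
            out.append(ch)  # start a new cluster
    return out


def backspace(text):
    """Remove the last simplified cluster instead of the last codepoint."""
    return "".join(clusters(text)[:-1])
```

With this sketch, backspace("abe\u0301") removes both the "e" and the combining acute accent and leaves "ab", whereas dropping only the last codepoint would leave a bare "e" behind.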

Stile answered 26/11, 2011 at 22:5 Comment(9)

How do you define "character" in this context? Something that translates to a single visual grapheme? – Vehement 26/11, 2011 at 22:31

@NicolBolas: Something like that. Ideally, what I'd mean is something that is entered by one key combination on the keyboard. – Stile 26/11, 2011 at 22:45

Unless and until you define character in terms of code points, no answer is possible. Unicode defines only two things: code points and extended grapheme clusters. It does not define character. Please rephrase your question in terms of code points and/or extended grapheme clusters, or else define your terms with sufficient precision as to make possible a programmatic solution, which you have not yet bothered to do. – Overtop 26/11, 2011 at 23:49

@tchrist: Did you really have to go and post the same comment on every answer? I got it by reading it once. – Stile 27/11, 2011 at 0:12

@tchrist: You will also note that ICU calls theirs a CharacterInstance. Whilst I didn't define their relationship to codepoints, since if I knew that I wouldn't have a problem, I certainly did define how I expected them to behave, which should be enough. ICU gives the exact example that I used to define the behaviour of their "CharacterIterator". – Stile 27/11, 2011 at 0:32

"However, I'm not using ICU" Really, you should. This is after all what it is for. In order to do what BreakIterator does, you will need to be able to query the properties of unicode points to know if one can break between them or not. And that requires basically downloading the Unicode specification and building a table of codepoint ranges for different properties. Or you can just use ICU, which does it for you. – Vehement 27/11, 2011 at 0:45

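The "table of codepoint ranges" idea in the comment above could look roughly like the following sketch: a sorted list of ranges searched with binary search. The table here is hypothetical and heavily truncated (five entries picked by hand for illustration); a real one would be generated from the Unicode Character Database data files. The names RANGES, STARTS and break_property are invented for this example.

```python
import bisect

# Hypothetical, heavily truncated table of (first, last, property) entries,
# sorted by first codepoint. A real table has many more ranges.
RANGES = [
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0300, 0x036F, "Extend"),  # Combining Diacritical Marks block
    (0x200D, 0x200D, "ZWJ"),
    (0x1F1E6, 0x1F1FF, "Regional_Indicator"),
]
STARTS = [first for first, _, _ in RANGES]


def break_property(cp):
    """Return the break property of codepoint cp, or "Other" if not listed."""
    # Index of the last range whose start is not greater than cp.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, last, prop = RANGES[i]
        if last >= cp:
            return prop
    return "Other"
```

A break iterator would then look up the property on each side of a candidate boundary and apply the segmentation rules to that pair.
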
related: Grapheme Cluster Boundaries see also my comment – Norikonorina 27/11, 2011 at 0:58

@NicolBolas: I could use ICU, which can do Unicode but has absolutely no idea how to be a C++ library: it's half excessive inheritance and heap allocation Java, and half error codes everywhere C. Or I could use the Windows API, which will very neatly back my own Unicode string class, which was actually written to exist in C++. – Stile 27/11, 2011 at 1:49

@tchrist: This is a bit late, but according to the Unicode Standard version 6.2, page 11: "Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation." The document then goes on to provide a table illustrating the difference between a glyph and a character. If this is not a definition, then I don't know what is. – Textual 2/1, 2013 at 19:38

Use the ICU library.

http://site.icu-project.org/

for example:

http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90

is the function that returns the character at a particular character offset in a string.

Bearskin answered 26/11, 2011 at 22:7 Comment(7)

Because I checked their interface documentation and none of it deals with what I need? None of it will recognize "e with diacritic" as one entity, not two. – Stile 26/11, 2011 at 22:18

is it worth the effort? how common are those "e with diacritic" cases? Note that I have no idea, I'm just asking. – Mockery 26/11, 2011 at 23:29

What is a character? Unicode has code points, whose API you have just described, and it has extended grapheme clusters, which are accessible through break iterators or the \X regex metacharacter sequence. – Overtop 26/11, 2011 at 23:51

@tchrist: The break iterator seems about right. However, I am not using ICU, so an answer would need to discuss an implementation of it. Admittedly, "Break Iterator" is not the name under which I would choose to look for such functionality and so I didn't see it. – Stile 27/11, 2011 at 0:29

And it has characters that correspond, if you like, to UTF-32 code points. Which is what I thought the original question was originally about. – Bearskin 27/11, 2011 at 2:46

@bmargulies: even in UTF-32, a codepoint is not a character. UTF-32 does away with surrogate pairs, but there are other issues such as diacritics. Check out grapheme clusters in the Unicode text segmentation annex. – Parrott 27/11, 2011 at 13:47
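
A small sketch of the distinction drawn in this comment, counting code units in each encoding. The helper names utf16_units and utf32_units are invented for the illustration; the sample strings are a codepoint outside the Basic Multilingual Plane (U+1D11E) and an "e" followed by a combining acute accent (U+0301).

```python
def utf16_units(s):
    """Number of 16-bit code units needed to encode s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s):
    """Number of 32-bit code units needed to encode s in UTF-32."""
    return len(s.encode("utf-32-le")) // 4


clef = "\U0001D11E"   # one codepoint, needs a surrogate pair in UTF-16
e_acute = "e\u0301"   # two codepoints that display as one "character"

# UTF-32 removes the surrogate pair: 2 units in UTF-16, 1 unit in UTF-32.
clef_counts = (utf16_units(clef), utf32_units(clef))

# The combining sequence stays at 2 units in both encodings.
e_acute_counts = (utf16_units(e_acute), utf32_units(e_acute))
```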

@AndréCaron I'm a member of the UTC. I do know this stuff, I just judged that the OP wasn't trying to go all the way there. – Bearskin 27/11, 2011 at 22:57

The UTF8-CPP project has a bunch of clean, easy to read, STL-like algorithms to iterate over Unicode strings codepoint by codepoint, character by character, etc. You can look into that for inspiration.

Note that the "character by character" approach might not be obvious. One easy way to do it is to iterate over a UTF-32 string in normalization form C, which guarantees fixed-length encoding.

Express answered 26/11, 2011 at 22:11 Comment(8)

It was my understanding that not all characters could be represented in form C. – Stile 26/11, 2011 at 22:20

Not aware of that ever causing a problem, but I could imagine that languages such as Vietnamese, with an astounding number of diacritics, cause a very large number of combinations for NFC. However, you'd have to read the Unicode spec to figure out whether NFC can represent everything or not. – Parrott 26/11, 2011 at 22:23

A "character by character" approach is not possible, because "characters" is not defined. – Overtop 26/11, 2011 at 23:49

@tchrist: Indeed, the entire Unicode standard does not define the term "character". However, there is a commonly accepted definition of 1 unit of display, which typically renders to a single glyph. This basically corresponds to a full precomposed set of codepoints in NFC. UTF-32 in NFC gives you what is closest to a single "character" per codepoint. – Parrott 27/11, 2011 at 1:20

@Andre: This was pretty close, but apparently, there are both ICU and WinAPI functions that serve the whole purpose. – Stile 27/11, 2011 at 1:48

Being in Normal Form C doesn't mean that there are no decomposed combining characters in a string. It only means that the sequences that can be combined, are. You'll still find combining characters in an NFC string if you've got obscure combinations that haven't been given their own combined code point. For what it's worth, there are combined code points covering the possible letters and tone marks for Vietnamese. – Aphorism 27/11, 2011 at 12:1
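
The point made in this comment can be checked with a short sketch using Python's standard unicodedata module. The helper name nfc_length is invented for the illustration, and "q with acute" is chosen as an example of a combination that has no precomposed codepoint.

```python
import unicodedata


def nfc_length(s):
    """Number of codepoints in s after normalization to NFC."""
    return len(unicodedata.normalize("NFC", s))


# "e" + combining acute has a precomposed form (U+00E9), so NFC yields 1.
e_acute = nfc_length("e\u0301")

# Vietnamese "e" + circumflex + acute also composes fully (U+1EBF).
viet_e = nfc_length("e\u0302\u0301")

# "q" + combining acute has no precomposed codepoint, so NFC keeps both.
q_acute = nfc_length("q\u0301")
```

So iterating an NFC string codepoint by codepoint still splits sequences like the last one, which is why grapheme cluster segmentation is needed in the general case.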

@AndréCaron Are you talking about an extended grapheme cluster then? That's what vim considers a character, for example. – Overtop 27/11, 2011 at 13:17

@tchrist: I was thinking of "user-perceived characters", which are indeed approximated by grapheme clusters. – Parrott 27/11, 2011 at 13:41
