You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).
That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
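As a rough sketch of that counting rule, something like the following would do. The name utf8_strlen is just one I've picked for illustration, and the code assumes the input is already valid, NUL-terminated UTF-8 (it does no validation at all).

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by counting
   every byte that is NOT a continuation byte (binary 10xxxxxx).
   Assumption: the input is valid UTF-8; nothing is validated here. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Mask off everything but the top two bits; a result of 0x80
           (binary 10) marks a continuation byte, which we skip. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With this, a string such as "na\xc3\xafve" (the word naïve) occupies six bytes but is counted as five code points.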
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
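To make that boundary rule concrete, here is a minimal sketch of stepping through a string one code point at a time. Both utf8_is_start and utf8_advance are hypothetical names of my own, and once again the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Nonzero if byte b begins a code point, i.e. it has the form
   0xxxxxxx or 11xxxxxx rather than 10xxxxxx. */
static int utf8_is_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Return a pointer to the start of the n-th code point (0-based) in s,
   or to the terminating NUL if s holds n or fewer code points.
   Assumption: s is valid, NUL-terminated UTF-8. */
const char *utf8_advance(const char *s, size_t n)
{
    while (*s != '\0' && n > 0) {
        s++;                                  /* step over the lead byte */
        while (*s != '\0' && !utf8_is_start((unsigned char)*s))
            s++;                              /* skip its continuation bytes */
        n--;
    }
    return s;
}
```

Because the returned pointer always lands on a start byte (or the NUL), any slice that begins and ends at such pointers never cuts a code point in half.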
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
void utf8left (char *destbuff, char *srcbuff, size_t sz);
void utf8mid (char *destbuff, char *srcbuff, size_t pos, size_t sz);
void utf8rest (char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
- the left sz UTF-8 characters of a string.
- the sz UTF-8 characters of a string, starting at pos.
- the rest of the UTF-8 characters of a string, starting at pos.
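One possible way to flesh those three functions out is sketched below. It rests on several assumptions of mine: sz and pos count code points (not bytes) with pos starting at zero, the source is valid UTF-8, and destbuff has room for at least strlen(srcbuff) + 1 bytes. I've also made srcbuff a const pointer and added a helper, skip_code_points, which is purely illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pointer to the start of the n-th code point
   (0-based) in s, or to the terminating NUL if s is shorter. */
static const char *skip_code_points(const char *s, size_t n)
{
    while (*s != '\0' && n > 0) {
        s++;                                      /* lead byte */
        while (((unsigned char)*s & 0xC0) == 0x80)
            s++;                                  /* continuation bytes */
        n--;
    }
    return s;
}

/* Copy up to sz code points of srcbuff, starting at code point pos.
   Assumption: destbuff holds at least strlen(srcbuff) + 1 bytes. */
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz)
{
    const char *start = skip_code_points(srcbuff, pos);
    const char *end = skip_code_points(start, sz);
    size_t nbytes = (size_t)(end - start);

    memcpy(destbuff, start, nbytes);
    destbuff[nbytes] = '\0';
}

/* Copy the leftmost sz code points of srcbuff. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    utf8mid(destbuff, srcbuff, 0, sz);
}

/* Copy everything from code point pos to the end of srcbuff. */
void utf8rest(char *destbuff, const char *srcbuff, size_t pos)
{
    strcpy(destbuff, skip_code_points(srcbuff, pos));
}
```

Asking for more code points than the string holds simply yields a shorter (possibly empty) result. In real code you would probably also want to pass the destination size so the functions can refuse to overflow it.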
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
However, you may need to tighten up your definition of what a character is, and hence how to calculate the size of a string.
If you consider a character to be a Unicode code point, the information above is perfectly adequate.
But you may prefer a different approach. The Annex 29 documentation detailing grapheme cluster boundaries has this snippet:
It is important to recognize that what the user thinks of as a "character" - a basic unit of a writing system for a language - may not be just a single Unicode code point.
One simple example is g̈, which can be thought of as a single character but consists of the two Unicode code points:
- 0067 (g) LATIN SMALL LETTER G; and
- 0308 (◌̈) COMBINING DIAERESIS.
That would show up as two distinct Unicode characters were you to use the rule "any byte not of the binary form 10xxxxxx is the start of a new character".
Annex 29 also calls these grapheme clusters by a more user-friendly name, user-perceived characters. If it's those you wish to count, that annex gives further details.
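To see the two code points in the g̈ example for yourself, a bare-bones decoder along these lines is enough. The function name utf8_decode is my own invention, and the sketch assumes valid UTF-8: it makes no attempt to reject overlong forms, surrogates or truncated sequences, and it knows nothing about grapheme clusters.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode at most max code points from the NUL-terminated string s
   into out, returning how many were stored.
   Assumption: s is valid UTF-8; malformed input gives garbage values. */
size_t utf8_decode(const char *s, uint32_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < max) {
        uint32_t cp;
        int extra;   /* number of continuation bytes expected */

        if (*p < 0x80)      { cp = *p;        extra = 0; }  /* 0xxxxxxx */
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }  /* 110xxxxx */
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }  /* 1110xxxx */
        else                { cp = *p & 0x07; extra = 3; }  /* 11110xxx */
        p++;

        /* Each continuation byte contributes its low six bits. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);

        out[n++] = cp;
    }
    return n;
}
```

Feeding it the three bytes "g\xcc\x88" yields two values, 0x0067 and 0x0308, even though a reader sees one character on screen. Grouping those into a single user-perceived character needs the Annex 29 rules (or a library that implements them).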