How to count characters in a unicode string in C
Asked Answered
S

11

59

Let's say I have a string:

char theString[] = "你们好āa";

Given that my encoding is utf-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the latin character with the macron is two bytes, and the 'a' is one byte):

strlen(theString) == 12

How can I count the number of characters? How can I do the equivalent of subscripting so that:

theString[3] == "好"

How can I slice, and cat such strings?

Sightseeing answered 4/9, 2011 at 8:15 Comment(0)
A
34

You only count the characters whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all the characters with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or with the bits 11 is the start of a UTF-8 code point; all others are continuation bytes.

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

  • the left sz UTF-8 bytes of a string.
  • the sz UTF-8 bytes of a string, starting at pos.
  • the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
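One possible sketch of utf8left under that continuation-byte rule (the name comes from the hypothetical signatures above; buffer sizing and UTF-8 validation are left to the caller, and the other two functions follow the same pattern):

```c
#include <stddef.h>
#include <string.h>

// Sketch of utf8left: copy the first sz UTF-8 code points of srcbuff
// into destbuff, NUL-terminated. Assumes destbuff is large enough and
// srcbuff holds valid UTF-8.
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    size_t i = 0;
    while (srcbuff[i]) {
        // A byte not of the form 10xxxxxx begins a new code point.
        if ((srcbuff[i] & 0xC0) != 0x80) {
            if (sz == 0)
                break;
            --sz;
        }
        destbuff[i] = srcbuff[i];
        ++i;
    }
    destbuff[i] = '\0';
}
```

Because the function stops only at the start of a code point, it can never cut a multi-byte sequence in half.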


However, you may need to tighten up your definition of what a character is, and hence how to calculate the size of a string.

If you consider a character to be a Unicode code point, the information above is perfectly adequate.

But you may prefer a different approach. The Annex 29 documentation detailing grapheme cluster boundaries has this snippet:

It is important to recognize that what the user thinks of as a "character" - a basic unit of a writing system for a language - may not be just a single Unicode code point.

One simple example is g̈, which can be thought of as a single character but consists of the two Unicode code points:

  • 0067 (g) LATIN SMALL LETTER G; and
  • 0308 (◌̈ ) COMBINING DIAERESIS.

That would show up as two distinct Unicode characters were you to use the rule "any character not of the binary form 10xxxxxx is the start of a new character".
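To make that concrete, here is a sketch of the byte rule applied to that example (U+0308 COMBINING DIAERESIS encodes as the bytes CC 88 in UTF-8; the function name is mine):

```c
#include <stddef.h>

// Sketch: count code points with the "not 10xxxxxx" byte rule.
// "g" followed by U+0308 (bytes CC 88) counts as two code points
// even though it renders as one user-perceived character.
size_t codepoint_count(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if ((*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```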

Annex 29 also calls these grapheme clusters by a more user-friendly name, user-perceived characters. If it's those you wish to count, that annex gives further details.

Auriculate answered 4/9, 2011 at 8:45 Comment(5)
Yes it seems I have to implement a lot of this myself.. I have managed to implement a u_strlen and u_charAt in the last hour. Should be able to cut slices based on that.Sightseeing
Accepted because I did end up writing my own functions.Sightseeing
Note: this ignores grapheme clusters described in UAX#29, i.e. "नि" is supposed to be seen as a single unit of text, but will give a length of 2 with the method in this answer.Bipartite
If program's locale is UTF-8, then we could just use standard mbrlen() function instead.Holston
@AliciaBytes: good point, though I realise I've taken a long time to respond :-) I've added extra information detailing grapheme clusters.Auriculate
C
20

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{    
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops 
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: theString[2] == "好"

Coda answered 4/9, 2011 at 10:4 Comment(5)
by any chance do you know of any implementation of strlen() for combining characters? Like 'a' with an accent, for example, it should return 1, not 2Dzungaria
@Nulik: That sounds like utf8len; utf8len("ā") should return 1.Coda
Are you sure the example in the question has an off by one error? 好 is two bytes long, but defining a string like that always adds a null character at the end, so 3 is correct, I believe.Analgesic
Does this code cover all valid UTF8 or just a subset??Ahmed
@RichardMcFriendOluwamuyiwa i believe it should work on all utf8Coda
B
17

The easiest way is to use a library like ICU

Beauty answered 4/9, 2011 at 8:27 Comment(3)
@Mark.. I asked a couple of questions about ICU. People mostly replied that it was unnecessary for simple operations. #7294947Sightseeing
@trideceth12: in many cases, you actually want to access grapheme clusters, not characters; and implementing that from scratch is far more involved than just decoding UTF-8, so using a library might be a good ideaMemoir
@Christoph: Indeed so! And the ICU regex library support full Unicode extended grapheme clusters via the \X, making these things easy. That said, there are chunks of C code that do it all for themselves, like vim — however, that seems to use something more like \PM\pM*, and also is stuck working only on the BMP. Sigh.Felecia
L
9

Depending on your notion of "character", this question can get more or less involved.

First off, you should transform your byte string into a string of unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.

Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.

However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with an accent ^ can be expressed as two unicode codepoints, or as a combined legacy codepoint â - both are valid, and both are required by the unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.

That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.

Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.
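A minimal sketch of the iconv() route (the function name is mine; "UTF-32LE" is the glibc spelling of the encoding name, and other platforms may differ; the fixed output buffer and trimmed error handling are for brevity only):

```c
#include <iconv.h>
#include <stdint.h>
#include <string.h>

// Sketch: transcode UTF-8 to UTF-32 with POSIX iconv() and count the
// resulting 32-bit elements, i.e. the number of code points.
long count_codepoints(const char *utf8)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)utf8;
    size_t inleft = strlen(utf8);
    uint32_t out[256];                 // enough for this demo only
    char *outp = (char *)out;
    size_t outleft = sizeof out;

    size_t rc = iconv(cd, &in, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    return (long)((sizeof out - outleft) / sizeof(uint32_t));
}
```

Real code would loop on E2BIG and grow the buffer instead of assuming the input fits.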

Littles answered 4/9, 2011 at 10:27 Comment(0)
W
3

In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.

Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.

If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).

Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.

Whitethorn answered 4/9, 2011 at 12:53 Comment(4)
@R The use is converting pinyin in numeral form (ni2hao3ma5) into pinyin with accents.. I have written my own functions now, based on the inherent meaning in the first byte of a unicode charpoint. It's a bit clunky but it does the job without the need to include a heavy library.Sightseeing
@trideceth12: I did that same thing myself once. It was just a couple of lines of Perl. Really.Felecia
I would argue that you almost never want to know how much "storage" there is, and what you really want when you're talking length is "characters", not bytes. Look at string processing: your code would be broken on UTF-8/UTF-16 if you cannot answer queries like length in terms of graphemes. If you do not care about Unicode, and encode things in ASCII or UTF-32, then yes, maybe it's irrelevant for you.Superpower
Graphemes or characters are only relevant to visual display (and sometimes, editing). That's 1% of what you do with strings, and usually isolated to GUI toolkit libraries. Everything else done with strings is completely agnostic and only cares (on C, where storage is explicit) about the storage requirements for the string. In other languages where storage is not explicit, you shouldn't even care about that.Whitethorn
S
1
size_t utf8_strlen(const char *s)
{
    size_t i = 0, j = 0;

    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return (j);
}

This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)

However I'm still stumped on slicing and concatenating?!?

Sightseeing answered 4/9, 2011 at 8:27 Comment(3)
You really, really do want to use a wide string type. This is simply not an application where you can put a premium on conserving memory. We're talking about bytes on systems that have gigabytes to go around, anyway. You don't have random-access to characters in a UTF-8 encoding. UTF-8 is better suited as a storage/serialization format. But just FWIW, concatenation works "directly", as long as you don't have to worry about BOMs; treat the bytes as bytes. "slicing" needs to be better defined.Efflux
Slicing and concatenating would then be just a search operation, surely? Linear search in the most obvious implementation. I'm with those that don't see any real benefit in avoiding wchar_t though, to be honest.Girlish
@Karl: taking grapheme clusters into account, even UTF-32 often has to be treated as a variable-length coding...Memoir
P
1

In general we should use a different data type for unicode characters.

For example, you can use the wide char data type

wchar_t theString[] = L"你们好āa";

Note the L modifier that tells that the string is composed of wide chars.

The length of that string can be calculated using the wcslen function, which behaves like strlen.
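For example (a sketch; it assumes the source file is UTF-8 so the compiler can translate the L"" literal, and the demo function name is mine):

```c
#include <stddef.h>
#include <wchar.h>

// Sketch: wcslen counts wchar_t elements, not bytes. Every character
// in this string is in the BMP, so the count is 5 whether wchar_t is
// 16-bit (Windows) or 32-bit (glibc).
size_t wide_length_demo(void)
{
    wchar_t theString[] = L"你们好āa";
    return wcslen(theString);
}
```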

Persevere answered 4/9, 2011 at 8:35 Comment(6)
Except that wide chars are all 4 bytes each.. so "hello world" is 44 bytes instead of 11 bytes, and "大家,你们好" is 24 bytes instead of 18 bytes.Sightseeing
Well, that is generally left to the implementation (in some cases they can be 2 byte long), but I can see your point here.Persevere
@abahgat: that wchar_t doesn't necessarily use UTF-32 (ie the 2-byte case) makes this solution unportable...Memoir
summary: wchar_t is NOT Unicode, because sizeof(wchar_t) is compiler-dependentAkkad
@user411312, it can be used for storing unicode characters, but the encoding is an implementation detail, note that the unicode character set is not fixed to any encodingCollectivity
@user411312 wchar_t is UTF-32 for GCC (at least on unixoid systems) and UTF-16 on windows/msvc - so for the most popular systems wchar_t is (some) UnicodeKemerovo
E
1

One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).

This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.

Entopic answered 4/9, 2011 at 14:9 Comment(0)
D
0

I did similar implementation years back. But I do not have code with me.

For each unicode character, the first byte describes the number of bytes that follow it to construct the character. Based on the first byte you can determine the length of each unicode character.
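That first-byte classification can be sketched like this (the function name is mine):

```c
// Sketch: derive a UTF-8 sequence length from its lead byte by
// inspecting the high bits, as described above.
int utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return -1;                        // continuation or invalid byte
}
```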

I think it's a good UTF-8 library. enter link description here

Declass answered 6/9, 2011 at 17:36 Comment(0)
H
0

If your program is running in a UTF-8 locale, then the standard mbrlen() function does exactly what you are looking for here.

Note that it will count the number of codepoints, so combining characters such as accents may be counted separately. If that's undesirable, you need a character handling library such as ICU.
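A sketch of that approach, looping mbrlen() over the string (the function name and the locale fallback are mine; "C.UTF-8" is a glibc locale name, with "en_US.UTF-8" as a common alternative):

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

// Sketch: count code points by iterating mbrlen() over the string.
// Returns (size_t)-1 on invalid input or if no UTF-8 locale could
// be selected.
size_t mb_count(const char *s)
{
    if (!setlocale(LC_CTYPE, "C.UTF-8") &&
        !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return (size_t)-1;

    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t n = 0, left = strlen(s);
    while (left > 0) {
        size_t k = mbrlen(s, left, &st);
        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;   // invalid or truncated sequence
        if (k == 0)              // embedded NUL; stop
            break;
        s += k;
        left -= k;
        ++n;
    }
    return n;
}
```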

Holston answered 21/12, 2023 at 10:18 Comment(0)
Z
-1

A sequence of code points constitutes a single syllable / letter / character in many non-Western-European languages (e.g., all Indic languages).

So, when you are counting the length OR finding a substring (there are definitely use cases for finding substrings, say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.

So the definition of the character/syllable and where you actually break the string into "chunks of syllables" depends upon the nature of the language you are dealing with. For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following

V  (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V

You need to parse the string and look for the above patterns to break the string and to find the substrings.

I do not think it is possible to have a general-purpose method which can magically break strings in the above fashion for any unicode string (or sequence of code points), as the pattern that works for one language may not be applicable for another.

I guess there may be some methods / libraries that can take some definition / configuration parameters as input to break unicode strings into such syllable chunks. Not sure though! I'd appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.

Zigmund answered 20/10, 2012 at 2:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.