In a C program, I want to sort a list of valid UTF-8-encoded strings in Unicode code point order. No collation, no locale-awareness.
So I need a compare function. It's easy enough to write such a function that iterates over the Unicode characters. (I happen to be using GLib, so I'd step through each string with g_utf8_next_char and compare the code points returned by g_utf8_get_char.)
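Something like this is the comparator I have in mind (just a rough sketch; utf8_codepoint_cmp is a name I made up, and it assumes both arguments are valid, NUL-terminated UTF-8):

    #include <glib.h>

    /* Compare two valid UTF-8 strings by Unicode code point. */
    static int
    utf8_codepoint_cmp (const char *a, const char *b)
    {
        while (*a && *b)
        {
            gunichar ca = g_utf8_get_char (a);   /* decode the current code point */
            gunichar cb = g_utf8_get_char (b);

            if (ca != cb)
                return (ca < cb) ? -1 : 1;

            a = g_utf8_next_char (a);            /* advance to the next character */
            b = g_utf8_next_char (b);
        }

        /* One string ran out first: the shorter (prefix) sorts first. */
        if (*a) return 1;
        if (*b) return -1;
        return 0;
    }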
But what I'm wondering, out of curiosity and possibly for simplicity and efficiency, is: will a simple byte-for-byte strcmp (or GLib's g_strcmp0) actually do the same job? I'm thinking it should, since UTF-8 encodes the most significant bits of a code point first, and the lead byte of an (N+1)-byte sequence is always larger than the lead byte of an N-byte sequence, so higher code points should always compare as larger byte sequences.
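If the byte-for-byte approach really is equivalent, then the whole thing collapses to an ordinary strcmp wrapper for qsort, which is what I'd like to end up with (again only a sketch; cmp_utf8_bytes is a made-up name):

    #include <stdlib.h>
    #include <string.h>

    /* qsort-compatible comparator: each array element is a char * to UTF-8 text.
     * strcmp compares bytes as unsigned char, so lead bytes >= 0x80 are
     * ordered correctly even on platforms where plain char is signed. */
    static int
    cmp_utf8_bytes (const void *pa, const void *pb)
    {
        const char *a = *(const char * const *) pa;
        const char *b = *(const char * const *) pb;
        return strcmp (a, b);
    }

    /* usage: qsort (strings, n_strings, sizeof (char *), cmp_utf8_bytes); */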
But maybe I'm missing something? Thanks in advance.
:-)
– Lactalbumin