Will strcmp compare utf-8 strings in code point order?

In a C program, I want to sort a list of valid UTF-8-encoded strings in Unicode code point order. No collation, no locale-awareness.

So I need a compare function. It's easy enough to write such a function that iterates over the Unicode characters. (I happen to be using GLib, so I'd iterate with g_utf8_next_char and compare the return values of g_utf8_get_char.)
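For reference, here is a rough sketch of what such a code point comparison could look like in plain C, without GLib. The names `utf8_decode_next` and `utf8_codepoint_cmp` are made up for this illustration, and the decoder assumes the input is already valid UTF-8 (it does no validation at all).

```c
#include <stdint.h>
#include <string.h>

/* Decode one code point from VALID UTF-8 at *s and advance *s past it.
   Hypothetical helper: no validation, invalid input gives garbage. */
static uint32_t utf8_decode_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra; /* number of continuation bytes that follow the lead byte */

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }

    p++;
    while (extra-- > 0)
        cp = (cp << 6) | (*p++ & 0x3F); /* append 6 payload bits per byte */

    *s = p;
    return cp;
}

/* Compare two valid, NUL-terminated UTF-8 strings code point by code point.
   Returns a negative value, zero, or a positive value, like strcmp. */
static int utf8_codepoint_cmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = utf8_decode_next(&pa);
        uint32_t cb = utf8_decode_next(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0) /* both strings ended at the same point */
            return 0;
    }
}
```

On valid input, the sign of `utf8_codepoint_cmp(a, b)` should match the sign of `strcmp(a, b)`, which is a convenient way to test either one against the other.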

But what I'm wondering, out of curiosity and possibly simplicity and efficiency, is: will a simple byte-for-byte strcmp (or g_strcmp0) actually do the same job? I'm thinking that it should, since UTF-8 encodes the most significant bits first, and a code point that needs encoding in N+1 bytes will have a larger initial byte than a code point that needs to be encoded in N bytes.

But maybe I'm missing something? Thanks in advance.

Lactalbumin answered 20/8, 2013 at 7:57 Comment(0)

Yes, UTF-8 preserves codepoint order, so you can just use strcmp. That's one of the (many) beautiful points of UTF-8.
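As a minimal sketch of how that might be used for sorting: the comparator below just forwards to strcmp, which compares bytes as unsigned char. That is what makes it work here, because in valid UTF-8 the lead bytes fall into increasing ranges by sequence length (00-7F, C2-DF, E0-EF, F0-F4) and continuation bytes (80-BF) carry the remaining bits most significant first. The names `cmp_utf8_bytes` and `sort_utf8_strings` are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of `const char *` elements.
   qsort passes pointers to the elements, hence the extra dereference. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb); /* byte order == code point order for valid UTF-8 */
}

/* Sort n valid UTF-8 strings in code point order, in place. */
static void sort_utf8_strings(const char **strings, size_t n)
{
    qsort(strings, n, sizeof strings[0], cmp_utf8_bytes);
}
```

Note that this only holds for valid UTF-8 without embedded NUL bytes; strcmp stops at the first NUL.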

One caveat is that codepoints in Unicode are UTF-32 values, and some people who talk about collating Unicode strings in "codepoint" order are actually using the word "codepoint" incorrectly to mean "UTF-16 code unit". If you want the order to match UTF-16 code unit collation, a good bit more work is involved.
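To make the caveat concrete, here is a small sketch with one pair where the two orders disagree. U+FF5E is a single UTF-16 code unit (0xFF5E), while U+10000 is encoded as the surrogate pair 0xD800 0xDC00, so in UTF-16 code unit order U+10000 sorts first even though it is the larger code point. The same happens for any code point in U+E000..U+FFFF compared against one above U+FFFF. The helper names are made up for the illustration.

```c
#include <stdint.h>
#include <string.h>

/* Compare two 0-terminated UTF-16 strings code unit by code unit. */
static int utf16_unit_cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* U+FF5E and U+10000 in both encodings. */
static const char     utf8_ff5e[]   = "\xEF\xBD\x9E";
static const char     utf8_10000[]  = "\xF0\x90\x80\x80";
static const uint16_t utf16_ff5e[]  = { 0xFF5E, 0 };
static const uint16_t utf16_10000[] = { 0xD800, 0xDC00, 0 };

/* Returns 1 if byte order (UTF-8) and code unit order (UTF-16)
   disagree for this pair, 0 otherwise. */
static int orders_disagree(void)
{
    int by_utf8  = strcmp(utf8_ff5e, utf8_10000);           /* code point order */
    int by_utf16 = utf16_unit_cmp(utf16_ff5e, utf16_10000); /* code unit order */
    return (by_utf8 < 0) != (by_utf16 < 0);
}
```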

Kalpak answered 20/8, 2013 at 8:8 Comment(1)
Thanks a lot! I was about to follow up on my use case and how I don't think the caveat applies, and then saw that this information is right there in the standard I'm trying to implement: "Lexicographic comparison, which orders strings from least to greatest alphabetically, is based on the UCS codepoint values, which is equivalent to lexicographic ordering based on UTF-8." :-) - Lactalbumin
