Can I use memcmp to compare multibyte character strings?

Asked 27/2, 2012 at 6:18 Answered 27/2, 2012 at 6:33

I am trying to write code to compare two strings. On Windows I can use strcmp, but I want to write it for multibyte character strings so that it is compatible with all other platforms. Can I use memcmp? If not, is there another API I can use, or do I need to write my own?

Cress answered 27/2, 2012 at 6:18 Comment(1)

It depends on whether the two strings are using the same encoding. – Harken 27/2, 2012 at 6:24

You have to be careful. I'm not an expert on Unicode/multi byte encodings, but I know that with diacritics sometimes two strings can be considered equal when their bytes are not exactly the same. It's recommended to use pre-tested APIs, because string encodings can get pretty messy.

See The Old New Thing on case mapping. I can't think of a reference for the diacritics, but if I do I'll post it.

Janejanean answered 27/2, 2012 at 6:33 Comment(5)

This is correct. For some cases, a memcmp will work. For 100% correctness, and especially if Unicode in any form is involved, memcmp will not work. Even simple characters like é can be represented in more than one way: either as é (one Unicode character), or as e followed by a combining acute accent (two Unicode characters). Most of the time, these don't get mixed and matched, so you might not see any problems at first, but eventually it will bite you. – Striation 27/2, 2012 at 6:38
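To make the é example concrete, here is a minimal C sketch, assuming both strings are UTF-8. The helper name `bytes_equal` and the two constants are hypothetical, made up for illustration; the byte values are the UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Precomposed form: U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in UTF-8. */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";

/* Decomposed form: U+0065 ('e') followed by U+0301 (COMBINING ACUTE ACCENT). */
static const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Hypothetical helper: true only if both buffers hold identical bytes. */
static bool bytes_equal(const char *a, size_t a_len,
                        const char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Calling `bytes_equal` on the two constants (with their `strlen` lengths, 2 and 3 bytes) reports them as different even though both display as é. Treating them as equivalent would need normalization from a Unicode-aware library first.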

Another way in which strings could be 'considered' equal, but not byte-equal is if your comparison is case invariant. In this case you need to perform what is termed case folding, which allows comparison of upper case, lower case, title case, and case invariant glyphs (which, as stated above could be in memory represented as multiple code points... or not). – Loess 27/2, 2012 at 6:44

Equal after normalization is not the same thing as equal. That's the whole point of normalization. OP was asking whether two strings are equal, not whether they are equivalent. – Harken 27/2, 2012 at 6:57

@Bingo: Case handling is worse. In Turkish the upper case of i is not I, it's İ (I with a dot above it) and the lower case of I isn't i, it's ı (dotless i), in which case you need to know the language in which a word is written. :) – Intercom 27/2, 2012 at 7:39

Here's a reference on the various Unicode normalization types (various ways that a character can be encoded). unicode.org/reports/tr15/#Introduction Note that UTF8 specifically requires the shortest-possible encoding for characters, but this is specific to UTF8, AFAIK--other types of Unicode are more lenient. – Striation 27/2, 2012 at 14:41

If the two strings are using the same encoding, you can use memcmp. If they are using UTF-8 and your strings don't contain the NULL character (U+0000), you could even use strcmp, since, in the absence of NULL itself, 0 does not appear in UTF-8 encoded strings. Another option is to convert your strings to wide characters using mbstowcs.
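A rough sketch of the mbstowcs route, assuming both strings are in the encoding of the current locale (the program would normally call `setlocale(LC_ALL, "")` beforehand). The helper names `to_wide` and `mb_strings_equal` are hypothetical, not part of any library.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to a newly allocated
 * wide string using the current locale. Returns NULL on an invalid
 * sequence or allocation failure; the caller frees the result. */
static wchar_t *to_wide(const char *s)
{
    /* Each wide element consumes at least one byte of input, so
     * strlen(s) + 1 elements always leaves room for the terminator. */
    size_t cap = strlen(s) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return NULL;
    if (mbstowcs(w, s, cap) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}

/* Returns 1 if equal, 0 if different, -1 if either conversion failed. */
static int mb_strings_equal(const char *a, const char *b)
{
    wchar_t *wa = to_wide(a);
    wchar_t *wb = to_wide(b);
    int result = (wa != NULL && wb != NULL) ? (wcscmp(wa, wb) == 0) : -1;
    free(wa);
    free(wb);
    return result;
}
```

This still compares code unit by code unit, so it answers "equal", not "equivalent after normalization". Its main benefit over memcmp is that an invalid sequence is reported as an error instead of being compared silently.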

Harken answered 27/2, 2012 at 6:30 Comment(7)

This will have false negatives--two identical strings can be encoded into different byte patterns. You need to compare with a Unicode savvy function. – Striation 27/2, 2012 at 6:40

@Striation - Can you provide an example of how identical strings can have different UTF-8 encodings? Or, for that matter, how this could happen with any other single encoding (like ISO 8859-1)? I did make the point that the strings needed to be using the same encoding. – Harken 27/2, 2012 at 6:56

@Ted Hopp : With UTF-8, you may encode a character in overlong form (a sequence that decodes to a value that should use a shorter sequence; this wording is from Wikipedia). In this case, memcmp returns the wrong answer but a UTF-8-aware compare function returns the right answer... – Trews 27/2, 2012 at 7:50

@Trews - As of Unicode version 3.0, the standard forbids the generation of non-shortest form UTF-8 sequences. (It's conformance clause C12 in the standard.) A string encoded with an overlong form is not using legal UTF-8 encoding. (The same Wikipedia page lists "overlong form" under the section Invalid byte sequences.) – Harken 27/2, 2012 at 8:6

@Ted Hopp : If you use memcmp/strcmp on ill-formed UTF-8 strings, they will return a result as if the strings were valid sequences. If you use a UTF-8-aware compare function, it will/must return an error if either of the strings is ill-formed. This was my point; I am against ill-formed UTF-8 too... – Trews 27/2, 2012 at 9:21
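For illustration, a small C sketch of the overlong case discussed in these comments. It uses U+002F ('/'), whose shortest form is the single byte 0x2F and whose ill-formed two-byte form is 0xC0 0xAF. The function `has_overlong_two_byte_lead` is a hypothetical, deliberately partial check: it relies only on the fact that the bytes 0xC0 and 0xC1 never occur in well-formed UTF-8, and it is not a full validator.

```c
#include <stdbool.h>
#include <stddef.h>

/* Well-formed: U+002F ('/') encoded as a single byte. */
static const unsigned char SLASH_SHORTEST[] = { 0x2F };

/* Ill-formed: the same code point in an overlong two-byte form. */
static const unsigned char SLASH_OVERLONG[] = { 0xC0, 0xAF };

/* Hypothetical, partial check: 0xC0 and 0xC1 could only start an overlong
 * two-byte sequence, so finding either byte means the input is ill-formed.
 * Longer overlong forms and other errors are NOT detected here. */
static bool has_overlong_two_byte_lead(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] == 0xC0 || s[i] == 0xC1)
            return true;
    }
    return false;
}
```

A plain memcmp sees two different buffers and simply says "not equal". A check like this one lets the caller reject the second buffer as invalid before any comparison happens.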

"0 does not appear in UTF-8 encoded strings." This is wrong. The UTF-8 encoding of the code point 0 is 0x00 (one byte). – Require 18/6, 2018 at 11:39

@SebastianUllrich - Good point. I had overlooked that. I'll update my answer. – Harken 18/6, 2018 at 13:4
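A short C sketch of why the embedded zero byte matters, assuming UTF-8 data that contains U+0000. The constants and both helper names are hypothetical. strcmp stops at the first zero byte, while memcmp compares the full length it is given.

```c
#include <stddef.h>
#include <string.h>

/* Two 5-byte buffers that differ only after an embedded U+0000 (byte 0x00). */
static const char WITH_NUL_A[] = "ab\0cd";
static const char WITH_NUL_B[] = "ab\0zz";

/* Hypothetical helper: compares only up to the first zero byte. */
static int equal_by_strcmp(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: compares every byte of explicitly sized buffers. */
static int equal_by_memcmp(const char *a, size_t a_len,
                           const char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

With lengths taken as `sizeof WITH_NUL_A - 1` and `sizeof WITH_NUL_B - 1`, `equal_by_strcmp` reports the buffers as equal because it only sees "ab", while `equal_by_memcmp` reports them as different.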

If the strings both use the same encoding, memcmp will work fine. Keep in mind that wide characters are different sizes on different platforms, however.
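A minimal sketch of what the wide-character caveat means for memcmp, assuming both strings are already `wchar_t` arrays in the same encoding. The helper name `wide_equal` is hypothetical. memcmp counts bytes, so the element count has to be scaled by `sizeof(wchar_t)`, which is commonly 2 on Windows and 4 on Linux.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: a_len and b_len are element counts, not bytes. */
static bool wide_equal(const wchar_t *a, size_t a_len,
                       const wchar_t *b, size_t b_len)
{
    /* Scale by the platform's wchar_t size; never hard-code 2 or 4. */
    return a_len == b_len && memcmp(a, b, a_len * sizeof *a) == 0;
}
```

For example, `wide_equal(L"abc", wcslen(L"abc"), L"abd", wcslen(L"abd"))` is false. For null-terminated wide strings, `wcscmp` does the same job without the length bookkeeping.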

If the strings use different encodings, you will need a library such as ICU to deal with it.

Bick answered 27/2, 2012 at 6:21 Comment(0)
