How can I compare UTF-8 strings, such as Persian words, in C++?


I want to compare strings in Persian (UTF-8). I know I must use something like L"گل", stored in a wchar_t * or a wstring. The problem is that when I compare the strings with the compare() function, I don't get the right result.

Inglebert answered 21/8, 2011 at 21:55 Comment(4)

Do you have C++11 (e.g. GCC 4.6)? – Opalina 21/8, 2011 at 21:58

Do you mean compare for equality, or compare for the purpose of sorting, or just what? – Rudolph 21/8, 2011 at 21:58

compare for equality actually – Inglebert 21/8, 2011 at 22:3

And I am working on Windows XP with Visual Studio 2008. – Inglebert 21/8, 2011 at 22:4


If the strings that you want to compare are in a specific, definite encoding already, then don't use wchar_t and don't use L"" literals -- those are not for Unicode, but for implementation-defined, opaque encodings only.

If your strings are in UTF-8, use a string of chars. If you want to convert them to raw Unicode codepoints (UCS-4/UTF-32), or if you already have them in that form, store them in a string of uint32_ts, or char32_ts if you have a modern compiler.
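To make the conversion to raw code points concrete, here is a minimal hand-written sketch. The helper name `decode_utf8` and the `CodePoints` typedef are hypothetical, not part of any library. It stores code points as `unsigned long` so that it also compiles on older compilers without `char32_t`, and it deliberately skips some validation (overlong encodings and surrogate code points are not rejected), so treat it as an illustration rather than a production decoder.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container type: one element per Unicode code point.
// unsigned long is at least 32 bits wide on every conforming compiler.
typedef std::vector<unsigned long> CodePoints;

// Hypothetical helper: decode a UTF-8 byte string into code points.
// Throws std::runtime_error on truncated or malformed sequences.
// Overlong encodings and surrogate code points are NOT rejected.
CodePoints decode_utf8(const std::string& bytes)
{
    CodePoints out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling `decode_utf8("\xDA\xAF\xD9\x84")` on the UTF-8 bytes of "گل" yields the two code points U+06AF and U+0644. Two decoded sequences can then be compared with `==` on the vectors.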

If you have C++11, your literal can be char str8[] = u8"گل"; or char32_t str32[] = U"گل";. See this topic for some more on this.
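As a small sketch of those literals, assuming a C++11 or later compiler: the names `gol_utf8` and `gol_utf32` are made up for illustration, and the word is written with universal character names so the example does not depend on the source-file encoding. One caveat that postdates this answer: since C++20 a `u8""` literal has element type `char8_t`, so the `char str8[] = u8"..."` form only compiles up to C++17. The sketch uses `auto` to stay valid either way.

```cpp
#include <string>

// u8"..." is encoded as UTF-8 regardless of the execution character set.
// Element type: char in C++11 to C++17, char8_t since C++20, hence `auto`.
// "گل" is U+06AF U+0644: four UTF-8 code units plus the terminating NUL.
const auto& gol_utf8 = u8"\u06AF\u0644";

// U"..." is UTF-32: exactly one char32_t per code point.
const std::u32string gol_utf32 = U"\u06AF\u0644";

// Equality on std::u32string compares code point by code point.
bool same_code_points(const std::u32string& a, const std::u32string& b)
{
    return a == b;
}
```

Here `sizeof(gol_utf8)` is 5 and `gol_utf32.size()` is 2, which shows the difference between counting UTF-8 code units and counting code points.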

If you want to interact with command line arguments or the environment, use iconv() to convert from WCHAR to UTF-32 or UTF-8.

Opalina answered 23/8, 2011 at 20:52 Comment(0)


wchar_t is not for UTF-8; depending on the platform, it typically holds either UTF-16 or UTF-32 (UCS-4). If you want to work with UTF-8, use plain old char * or string, and their comparison functions for equality. If you want human-meaningful sorting, it gets much more involved (no matter which encoding you use).
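A minimal sketch of that byte-wise equality check, assuming both strings really are UTF-8. The Persian words are spelled as hex-escaped bytes so the result does not depend on how the compiler interprets the source file (which matters on older compilers such as Visual Studio 2008). The function and constant names are illustrative only.

```cpp
#include <cstring>
#include <string>

// "گل" (U+06AF U+0644) as raw UTF-8 bytes.
const char kGol[] = "\xDA\xAF\xD9\x84";
// "گلی" (the same word followed by U+06CC).
const char kGoli[] = "\xDA\xAF\xD9\x84\xDB\x8C";

// Equality for std::string: compares lengths and bytes.
bool utf8_bytes_equal(const std::string& a, const std::string& b)
{
    return a == b;  // equivalent to a.compare(b) == 0
}

// Equality for NUL-terminated char* strings.
bool utf8_bytes_equal_c(const char* a, const char* b)
{
    return std::strcmp(a, b) == 0;
}
```

With these definitions, `utf8_bytes_equal(kGol, kGol)` is true and `utf8_bytes_equal(kGol, kGoli)` is false. This only tells you whether the byte sequences are identical; it says nothing about ordering or about text that looks the same but is encoded with different code points.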

Sake answered 21/8, 2011 at 22:22 Comment(3)

String.Compare operates on two Strings, and String does not have a constructor from wchar, so most likely you are constructing from your wchar as a char in error and hitting a null terminator early, which is why your compare fails. If you operate with UTF-8 you can store everything as char and everything should work fine, EXCEPT that "greater than" and "less than" will give you problems, but you may have had problems with those in wchar as well... – Adhesion 21/8, 2011 at 22:36

Note that any Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, multi-codepoint display modifiers, and similar used in non-English languages such as Persian) will not be. – Okoka 21/8, 2011 at 22:53

@Yann Ramin: That's why the Unicode collation algorithm handles normalization and default ignorables. I often get myself a collator object with the right strength levels set and then call its equality method so I don't have to worry about Unicode's funny ideas of equal inequalities or inequal equalities or such. – Unity 22/8, 2011 at 3:54


Unicode is notoriously difficult to compare.

Note that any Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, surrogate pairs, display modifiers, and similar used in non-English languages such as Persian) will not be.

Generally, you need to normalize Unicode before you can make a realistic comparison if the meaning of the text has any significance:

http://userguide.icu-project.org/transforms/normalization
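To illustrate why normalization matters, here is a small sketch using a letter that occurs in Persian: "آ" can be encoded as the single code point U+0622, or as U+0627 followed by the combining mark U+0653, and Unicode treats the two as canonically equivalent. The function `compose_alef_madda` is a toy stand-in invented for this example; it handles only this one pair and is not a substitute for a real normalizer such as the one ICU provides.

```cpp
#include <string>

// "آ" as one precomposed code point: U+0622 (UTF-8: D8 A2).
const std::string kAlefMaddaComposed = "\xD8\xA2";
// The same letter as U+0627 + U+0653 (UTF-8: D8 A7 D9 93).
const std::string kAlefMaddaDecomposed = "\xD8\xA7\xD9\x93";

// Toy normalizer (hypothetical): rewrites the decomposed form of this one
// letter to the composed form. Real normalization covers all of Unicode.
std::string compose_alef_madda(std::string s)
{
    std::string::size_type pos = 0;
    while ((pos = s.find(kAlefMaddaDecomposed, pos)) != std::string::npos) {
        s.replace(pos, kAlefMaddaDecomposed.size(), kAlefMaddaComposed);
        pos += kAlefMaddaComposed.size();
    }
    return s;
}

// Compare two UTF-8 strings after applying the toy normalization to both.
bool equal_after_toy_normalization(const std::string& a, const std::string& b)
{
    return compose_alef_madda(a) == compose_alef_madda(b);
}
```

Here `kAlefMaddaComposed == kAlefMaddaDecomposed` is false even though both render as the same letter, while `equal_after_toy_normalization` reports them as equal. A full normalizer applies the same idea to every canonically equivalent sequence.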

Okoka answered 21/8, 2011 at 22:57 Comment(1)

Text is notoriously difficult to compare. ASCII cheats by ignoring 95% of all text in the world. – Prefecture 22/8, 2011 at 8:46

