I want to compare strings in Persian (UTF-8). I know I must use something like L"گل", and that it must be stored in a wchar_t * or a wstring. The problem is that when I compare the strings with compare(), I don't get the right result.
If the strings that you want to compare are already in a specific, definite encoding, then don't use wchar_t and don't use L"" literals. Those are not for Unicode, but for implementation-defined, opaque encodings only.
If your strings are in UTF-8, use a string of chars. If you want to convert them to raw Unicode code points (UCS-4/UTF-32), or if you already have them in that form, store them in a string of uint32_t values, or char32_t values if you have a modern compiler.
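If you have C++11, your literal can be char str8[] = u8"گل"; or char32_t str32[] = U"گل";. (From C++20 on, u8 literals have type const char8_t[], so the char form no longer compiles there unchanged.) See this topic for some more on this.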
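As a rough illustration of moving between the two forms mentioned above, here is a minimal sketch that decodes a UTF-8 std::string into a std::u32string. The helper name decode_utf8 is made up for this example, and it assumes the input is already well-formed UTF-8 (it does no validation). The bytes are written as escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// ASSUMPTION: the input is well-formed UTF-8; malformed input is not detected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence uses.
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage sketch: "گل" is 4 bytes in UTF-8 and 2 code points (U+06AF, U+0644).
inline void decode_utf8_example() {
    assert(decode_utf8("\xDA\xAF\xD9\x84") == U"\u06AF\u0644");
}
```

In real code you would probably prefer a tested library over a hand-rolled decoder; this only shows that the two representations carry the same information.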
If you want to interact with command line arguments or the environment, use iconv() to convert from the platform's wide-character encoding (which iconv calls WCHAR_T) to UTF-32 or UTF-8.
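The iconv() step could look roughly like the sketch below. It assumes a POSIX system whose iconv accepts the encoding name "WCHAR_T" (glibc and GNU libiconv do; other implementations may not) and whose iconv() takes char** for the input buffer. The function name wide_to_utf8 is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wide string to UTF-8 with iconv.
// ASSUMPTION: the iconv implementation knows the name "WCHAR_T".
std::string wide_to_utf8(const std::wstring& in) {
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");  // (to, from)
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    // Every wchar_t unit expands to at most 4 UTF-8 bytes.
    std::string out(in.size() * 4, '\0');
    char* src = const_cast<char*>(reinterpret_cast<const char*>(in.data()));
    std::size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}

// Usage sketch: the wide form of "گل" becomes the UTF-8 bytes DA AF D9 84.
inline void wide_to_utf8_example() {
    assert(wide_to_utf8(L"\u06AF\u0644") == "\xDA\xAF\xD9\x84");
}
```

On platforms where iconv lives in a separate library you may need to link it explicitly (for example with -liconv).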
wchar_t is not for UTF-8, but (depending on the platform) typically either UTF-16 or UTF-32. If you want to work with UTF-8, use plain old char * or string, and their comparison functions for equality. If you want human-meaningful sorting, it gets much more involved (no matter which encoding you use).
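To make the difference between equality and human-meaningful ordering concrete, here is a small sketch. The function names are invented for this example. Equality on UTF-8 strings is plain byte equality; compare() orders by byte value, which for UTF-8 matches code point order but not the Persian alphabet.

```cpp
#include <cassert>
#include <string>

// Equality: two UTF-8 strings encode the same code point sequence
// exactly when their bytes are identical.
bool utf8_equal(const std::string& a, const std::string& b) {
    return a == b;
}

// Ordering: std::string::compare orders by (unsigned) byte value, which
// for UTF-8 is the same as code point order. This is NOT alphabetical order.
int utf8_codepoint_order(const std::string& a, const std::string& b) {
    return a.compare(b);
}

// Usage sketch. In the Persian alphabet, پ (U+067E) comes before ت (U+062A),
// but its code point is larger, so compare() puts it after.
inline void utf8_compare_example() {
    const std::string gol = "\xDA\xAF\xD9\x84";  // "گل"
    const std::string pe  = "\xD9\xBE";          // "پ"
    const std::string te  = "\xD8\xAA";          // "ت"
    assert(utf8_equal(gol, "\xDA\xAF\xD9\x84"));
    assert(utf8_codepoint_order(pe, te) > 0);
}
```

For ordering that follows the alphabet, you would need locale-aware collation (for instance from a library such as ICU) rather than compare().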
Unicode is notoriously difficult to compare.
Note that text in any Unicode encoding, including UTF-8, UTF-16, or UTF-32, cannot be compared byte-wise for anything other than byte equality. Two strings may display identically while their bytes differ, because of invisible or alternative characters (such as right-to-left marks, display modifiers, and similar characters used in languages such as Persian).
Generally, you need to normalize Unicode before you can make a realistic comparison if the meaning of the text has any significance.
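As a very rough illustration of the idea, the sketch below folds a few characters that commonly make Persian strings look the same while comparing unequal. It is not Unicode normalization (NFC/NFD); for that you would need a library such as ICU. The function name fold_for_compare and the choice of characters are assumptions made for this example: it drops the left-to-right and right-to-left marks (U+200E, U+200F) and maps Arabic Yeh (U+064A) and Kaf (U+0643) to their Persian counterparts (U+06CC, U+06A9).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: a tiny, application-specific folding step.
// ASSUMPTION: this character list is illustrative only, not complete,
// and this is NOT a substitute for real Unicode normalization.
std::u32string fold_for_compare(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        switch (c) {
            case 0x200E:  // LEFT-TO-RIGHT MARK: invisible, drop it
            case 0x200F:  // RIGHT-TO-LEFT MARK: invisible, drop it
                continue;
            case 0x064A:  // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
                out.push_back(0x06CC);
                break;
            case 0x0643:  // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
                out.push_back(0x06A9);
                break;
            default:
                out.push_back(c);
        }
    }
    return out;
}

// Usage sketch: "علي" typed with Arabic Yeh and "علی" typed with Farsi Yeh
// differ as code points but compare equal after folding.
inline void fold_for_compare_example() {
    assert(std::u32string(U"\u0639\u0644\u064A") != U"\u0639\u0644\u06CC");
    assert(fold_for_compare(U"\u0639\u0644\u064A") ==
           fold_for_compare(U"\u0639\u0644\u06CC"));
}
```

Whether such folding is appropriate depends on the application, since it deliberately discards distinctions that are present in the stored text.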