How to deal with Unicode strings in C/C++ in a cross-platform friendly way?

On platforms other than Windows you can simply use char * strings and treat them as UTF-8.

The problem is that on Windows you are required to accept and send messages using wchar_t * strings (the W functions). If you use the ANSI functions (A), you will not support Unicode.

So if you want to write a truly portable application, you need to compile it as Unicode on Windows.

Now, in order to keep the code clean, I would like to know the recommended way of dealing with strings, one that minimizes ugliness in the code.

Types of strings you may need: std::string, std::wstring, std::tstring, char *, wchar_t *, TCHAR *, CString (the ATL one).

Issues you may encounter:

  • cout/cerr/cin and their wide variants wcout, wcerr, wcin
  • all the renamed wide-string functions and their TCHAR macros, such as strcmp, wcscmp and _tcscmp
  • constant strings inside code: with TCHAR you have to fill your code with _T() macros

Which approach do you see as the best? (Examples are welcome.)

Personally I would go for a std::tstring approach, but I would like to see how you would do the conversions where they are necessary.

Earthworm answered 27/4, 2010 at 16:19 Comment(1)
utf8everywhere.org explains it all. - Jara

I can only suggest that you check out this library: http://cppcms.sourceforge.net/boost_locale/docs/
It might help. It is a Boost candidate for now, but I believe it will make it.

Papillote answered 27/4, 2010 at 16:24 Comment(3)
The newer documentation is at cppcms.sourceforge.net/boost_locale/html/tutorial.html - Bohun
It works fine. I am just waiting for some fixes in boost-build so that bjam/boost-build can find the ICU library correctly and build boost-locale. - Bohun
Those who can, do. Those who can't, boost. - Oshiro

You can keep all your strings UTF-8 encoded and just convert them to UTF-16 before interacting with the Win32 API. Take a look at the UTF8-CPP library for some easy-to-use conversion functions.
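
To make the conversion step concrete, here is a minimal hand-written sketch of decoding UTF-8 into UTF-16 code units. The function name utf8_to_utf16 is hypothetical and is not part of UTF8-CPP or Win32. The sketch rejects truncated or malformed sequences but does not check for overlong encodings or encoded surrogates, so in real code a tested library routine is probably the better choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from any library): decode a UTF-8 byte string
// into UTF-16 code units. Minimal sketch: overlong encodings and encoded
// surrogate code points are NOT rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, wchar_t is 16 bits wide, so the resulting code units can be copied into a std::wstring before calling a W function.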

Cobaltite answered 27/4, 2010 at 19:1 Comment(0)

If you are writing portable code:

First, never use wchar_t: it is not portable, and its encoding is not well defined across platforms (UTF-16 on Windows, UTF-32 on most others).

Never use TCHAR; use plain std::string encoded as UTF-8.

When dealing with the brain-damaged Win32 API, just convert the UTF-8 string to UTF-16 before calling it.

See also https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful on how Windows projects adopt UTF-8 as their main encoding.
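
One way to read the advice above is to keep UTF-8 std::string everywhere and convert only at the platform boundary. The sketch below assumes that layout. The names native_string and to_native are hypothetical and chosen only for illustration. The Windows branch uses the Win32 MultiByteToWideChar function with CP_UTF8, and the other branch passes the UTF-8 bytes through unchanged.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <stdexcept>

// Hypothetical alias: the string type the platform API expects.
using native_string = std::wstring;

// Convert UTF-8 to UTF-16 just before calling a W function.
native_string to_native(const std::string& utf8) {
    if (utf8.empty()) return native_string();
    const int bytes = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units are needed.
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8.data(), bytes, nullptr, 0);
    if (len <= 0) throw std::runtime_error("invalid UTF-8 input");
    native_string wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), bytes, &wide[0], len);
    return wide;
}
#else
// Elsewhere the narrow APIs are assumed to take UTF-8 directly.
using native_string = std::string;

native_string to_native(const std::string& utf8) { return utf8; }
#endif
```

With this shape, application code never mentions wchar_t or TCHAR; a call site would look like SomeFunctionW(to_native(text).c_str()) on Windows, where SomeFunctionW stands in for whichever W function is being called.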

Bohun answered 28/4, 2010 at 15:44 Comment(3)
In Visual Studio, when I write std::string msg = "महसुस";, I cannot view it; everything is replaced by question marks. Any idea? - Floaty
The SO post referenced in this answer is now a dead link. It seems like it was an important post. - Laubin
There is nothing brain-damaged about Windows using UTF-16. Windows began Unicode support with Windows NT (released 1993). UTF-8 was only invented as a concept in September 1992 (cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt), so there was no way Windows could have adopted it. In fact, Windows was a highly progressive early adopter of Unicode; as it turns out, perhaps too early. - Sinless
