What is the optimal multiplatform way of dealing with Unicode strings under C++?
I know that there are already several questions on StackOverflow about std::string versus std::wstring or similar but none of them proposed a full solution.

In order to obtain a good answer I should define the requirements:

  • multiplatform usage, must work on Windows, OS X and Linux
  • minimal effort for conversion to/from platform-specific Unicode strings like CFStringRef, wchar_t *, or char * as UTF-8, or other types as required by the OS APIs. Remark: I don't need code-page conversion support, because I expect to use only Unicode-compatible functions on all supported operating systems.
  • if an external library is required, it should be open-source and under a very liberal license like BSD, but not LGPL.
  • be able to use a printf-style format syntax or similar.
  • easy way of string allocation/deallocation
  • performance is not very important because I assume that the Unicode strings are used only for the application UI.
  • some example code would be appreciated

I would really appreciate only one proposed solution per answer; that way, people can vote for their preferred alternative. If you have more than one alternative, just add another answer.

Please indicate something that actually worked for you.

Signore answered 10/1, 2010 at 17:13 Comment(3)
What do you mean by "dealing with Unicode strings"? Do you simply want something that can store sequences of Unicode code points? Something that correctly handles culture-specific collation? Something which is able to deal with canonical and noncanonical forms of a string?Alonso
@jalf Good point! I forgot to mention that I deliberately excluded advanced string usage like string modification, normalization forms or sorting. I assumed that, for simplicity, I would use these strings only for display (though I may need printf-style formatting or string concatenation, but no more). Anything beyond this would require ICU or other libraries.Signore
Qt is a great multi-platform UI framework with great Unicode support. Only the license doesn't meet your requirement (LGPL), but in Qt's case this merely forces you to link the library dynamically, nothing else.Wildlife
Same as Adam Rosenfield's answer (+1), but I use UTFCPP instead.

Longplaying answered 10/1, 2010 at 18:19 Comment(1)
Which works just as well with std::wstring for internal representation. Take your pick.Baumbaugh
I would strongly recommend using UTF-8 internally in your application, using regular old char* or std::string for data storage. For interfacing with APIs that use a different encoding (ASCII, UTF-16, etc.), I'd recommend using libiconv, which is licensed under the LGPL.

Example usage:

#include <iconv.h>
#include <cassert>
#include <cstring>

class TempWstring
{
public:
  TempWstring(const char *str)
  {
    assert(sUTF8toUTF16 != (iconv_t)-1);
    size_t inBytesLeft = strlen(str);
    size_t outBytesLeft = 2 * (inBytesLeft + 1);  // worst case, incl. terminator
    mStr = new char[outBytesLeft];
    char *inBuf = const_cast<char *>(str);  // iconv takes char ** (const char ** on some platforms)
    char *outBuf = mStr;
    size_t result = iconv(sUTF8toUTF16, &inBuf, &inBytesLeft, &outBuf, &outBytesLeft);
    assert(result != (size_t)-1 && inBytesLeft == 0);
    outBuf[0] = outBuf[1] = '\0';  // iconv does not write a terminating NUL
  }

  ~TempWstring()
  {
    delete [] mStr;
  }

  // Only valid where wchar_t is 16 bits (e.g. Windows); on Linux/OS X
  // wchar_t is 32 bits, so open a converter to "UTF-32LE" there instead.
  const wchar_t *Str() const { return (const wchar_t *)mStr; }

  static void Init()
  {
    sUTF8toUTF16 = iconv_open("UTF-16LE", "UTF-8");
    assert(sUTF8toUTF16 != (iconv_t)-1);
  }

  static void Shutdown()
  {
    int err = iconv_close(sUTF8toUTF16);
    assert(err == 0);
  }

private:
  char *mStr;

  static iconv_t sUTF8toUTF16;
};

iconv_t TempWstring::sUTF8toUTF16 = (iconv_t)-1;

// At program startup:
TempWstring::Init();

// At program termination:
TempWstring::Shutdown();

// Now, to convert a UTF-8 string to a UTF-16 string, just do this:
TempWstring x("Entr\xc3\xa9""e");  // "Entrée"
const wchar_t *ws = x.Str();  // valid until x goes out of scope

// A less contrived example:
HWND hwnd = CreateWindowW(L"class name",
                          TempWstring("UTF-8 window title").Str(),
                          dwStyle, x, y, width, height, parent, menu, hInstance, lpParam);
Noach answered 10/1, 2010 at 17:19 Comment(5)
So every trivial string operation requires a conversion?Grandmamma
Your recommendation goes the EXACT opposite way of every OS. Internally, Windows and Mac use UTF-16 (because it is fixed size (not really, but for most practical purposes) (really it's UCS-2, but don't tell anybody)), while storage is done in UTF-8.Petey
Almost all programs on modern UNIX systems use UTF-8 as internal representations for Unicode strings. (Yes yes, Cocoa likes its UCS-2 but it's not really UNIX.)Kamerman
@Martin York No, it really is UTF-16, not UCS-2. Windows started as UCS-2, but today most of it is surrogate-aware (I know of one thing that is not, and there might be more, but those are bugs; overall it is UTF-16)Souza
I think it does not go well with the concept of the char type in C++, since in your solution "char" no longer stores a single character. Usually UTF-8 (and other variable-size encodings) are used as external encodings, while internally code should use a fixed-size encoding.Audry
I was recently on a project that decided to use std::wstring for a cross-platform project because "wide strings are Unicode, right?" This led to a number of headaches:

  • How big is the scalar value in a wstring? Answer: It's up to the compiler implementation. In Visual Studio (Win), it is 16 bits. But in Xcode (Mac), it is 32 bits.
  • This led to an unfortunate decision to use UTF-16 for communication over the wire. But which UTF-16? There are two: UTF-16BE (big-endian) and UTF-16LE (little-endian). Not being clear on this led to even more bugs.

When you are in platform-specific code, it makes sense to use the platform's native representation to communicate with its APIs. But for any code that is shared across platforms, or communicates between platforms, avoid all ambiguity and use UTF-8.

Zeena answered 10/1, 2010 at 18:24 Comment(4)
Which UTF-16 is coming over the wire is easy: you just make sure the BOM is sent as the first character. The receiving layer (the one above transport) then re-arranges the message as required. But I agree UTF-8 for transport is easier and usually more compact (and transcoding UTF-16 -> UTF-8 is trivial).Petey
As with transport on the wire, storage is easier if you use UTF-8.Petey
I think that if you are using UTF-16 over the wire you should stick with network endianness, which is big-endian. No need to make any protocol more complex.Signore
@Martin, good point -- except they wouldn't have known a BOM if it came up and bit them.Zeena
Rule of thumb: use the native platform Unicode form for processing (UTF-16 or UTF-32), and UTF-8 for data interchange (communication, storage).

If all the native APIs use UTF-16 (for instance on Windows), having your strings as UTF-8 means you will have to convert all input to UTF-16, call the Win API, then convert the result back to UTF-8. Quite a pain.

But if the main problem is the UI, the strings are the simple part. The more difficult one is the UI framework, and for that I would recommend wxWidgets (http://www.wxWidgets.org). It supports many platforms, is mature (17 years and still very active), offers native widgets and Unicode support, and has a liberal license.

Souza answered 11/1, 2010 at 7:13 Comment(0)
I'd go for UTF-16 representation in memory, and UTF-8 or UTF-16 on the hard disk or wire. The main reason: UTF-16 has a fixed size for each "letter". This simplifies a lot of tasks when working with the string (searching, replacing parts, ...).

The only reason for UTF-8 is the reduced memory usage for "Western/Latin" letters. You can use this representation for disk storage or transport over the network. It also has the benefit that you need not worry about byte order when loading/saving to disk or wire.

With these reasons in mind, I'd go for std::wstring internally, or, if your GUI library offers a wide-string type, use that (like QString from Qt). For disk storage, I'd write a small platform-independent wrapper around the platform API, or I'd check unicode.org for platform-independent code for this conversion.


For clarification: Korean/Japanese letters are NOT Western/Latin. Japanese, for example, uses Kanji. That's why I mentioned the Latin character set.


Regarding UTF-16 not being 1 character / 2 bytes: this assumption is only true for characters on the Basic Multilingual Plane (see: http://en.wikipedia.org/wiki/UTF16). Still, most users of UTF-16 assume that all characters are on the BMP. If this can't be guaranteed for your application, you can switch to UTF-32 or to UTF-8.

Still, UTF-16 is used, for the reasons mentioned above, in a lot of APIs (e.g. Windows, Qt, Java, .NET, wxWidgets).

Impassion answered 11/1, 2010 at 8:2 Comment(6)
UTF16 does not have a fixed size for each letter.Gardening
UTF-8 has other benefits, such as being able to be processed by the standard C string functions.Gardening
A propos "reduced memory usage for western/latin letters": things are trickier than they seem. Wikipedia says: "For example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version".Mutz
@Carl Seleborg Yes, things are indeed trickier. The HTML in Wikipedia has a lot of markup that is plain ASCII. For other formats it might be different. The only way to say what takes more memory is to actually measure. If some browser takes the HTML from Wikipedia and converts it in memory to UTF-16, because that's how the browser does its job, then the original encoding is irrelevant.Souza
@Craig McQueen: "able to be processed by the standard C string functions" is only true in the Unix/Linux/Mac world, and only if you don't forget to set the locale to foo_bar.UTF-8. The Windows C runtime does not handle UTF-8.Souza
I used to run with UTF-16 before (UTF-32 on *nix). 'Twas a perfectly unbalanced choice: it does not cope with all cases, and it isn't easy to port.Baumbaugh
You can store UTF-16 inside std::string. So in principle you could use std::string for all platforms, and store inside the encoding preferred by the platform (UTF-8 for Linux, UTF-16 for Windows, etc.). This will leave you with something simple at the C++ types level, but having to track the encoding of strings. This may be simple if the application is self-contained, and less simple if it has to interoperate (cf. storage, wire format).

The risk of storing UTF-16 inside std::string is that sooner or later you will call .c_str() and the result will be interpreted as ending at the first 0 byte, which for std::string s = reinterpret_cast<const char *>(L"hello") will be at s[1] (on a little-endian machine).

Shreve answered 24/9, 2021 at 14:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.