Reading/writing/printing UTF-8 in C++11

I have been exploring C++11's new Unicode functionality, and while other C++11 encoding questions have been very helpful, I have a question about the following code snippet from cppreference. The code writes and then immediately reads a text file saved with UTF-8 encoding.

#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    // Write
    std::ofstream("text.txt") << u8"z\u6c34\U0001d10b";

    // Read
    std::wifstream file1("text.txt");
    file1.imbue(std::locale("en_US.UTF8"));
    std::cout << "Normal read from file (using default UTF-8/UTF-32 codecvt)\n";
    for (wchar_t c; file1 >> c; ) // ?
        std::cout << std::hex << std::showbase << c << '\n';
}

My question is, quite simply: why is a wchar_t needed in the for loop? A u8 string literal can be declared using a simple char *, and the bit layout of the UTF-8 encoding should tell the system each character's width. It appears there is some automatic conversion from UTF-8 to UTF-32 (hence the wchar_t), but if that is the case, why is the conversion necessary?
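To make concrete what I mean by "a simple char *": the u8 literal really is just a sequence of bytes. A quick check of my own (separate from the cppreference example):

#include <iostream>

int main()
{
    const char* s = u8"z\u6c34\U0001d10b";    // same literal as in the snippet above
    for (const char* p = s; *p; ++p)          // walk the raw UTF-8 bytes
        std::cout << std::hex << std::showbase
                  << static_cast<int>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << '\n';    // 0x7a 0xe6 0xb0 0xb4 0xf0 0x9d 0x84 0x8b
}

The three characters occupy 1, 3 and 4 bytes respectively, and the lead byte of each sequence encodes its length, which is the "bit layout" I mean.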

Undulate asked 18/3, 2013 at 9:10 Comment(2)
It depends on a lot of things. Notably, correct UTF-8 behaviour is extremely hard, if not impossible, to achieve on Windows in a console application (requiring at least a good number of non-standard API calls, IIRC).Minstrel
wchar_t is used because wifstream is used, and wifstream performs that "some automatic conversion" you mention. My point was to show the difference between that automatic conversion (as implemented for one particular platform) and the explicit, portable, locale-independent, Unicode conversion provided by codecvt_utf8_utf16.Gradient

You use wchar_t because you're reading the file with a wifstream; if you were reading with an ifstream you'd use char, and likewise char16_t and char32_t for basic_ifstream<char16_t> and basic_ifstream<char32_t>.

Assuming (as the example does) that wchar_t is 32-bit and that the native wide character set it represents is UTF-32 (UCS-4), this is the simplest way to read a file as UTF-32; it is presented as such in the example for contrast with reading a file as UTF-16. A more portable method would be to use basic_ifstream<char32_t> and std::codecvt_utf8<char32_t> explicitly, as that combination is guaranteed to convert from a UTF-8 input stream to UTF-32 elements.
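Roughly, that more portable version looks like the untested sketch below (my illustration, not part of the cppreference example). Note the use of get() rather than >>: the formatted extractor would require a ctype<char32_t> facet, which locales are not required to provide.

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    std::basic_ifstream<char32_t> fin;
    // Install the explicit UTF-8 <-> UTF-32 conversion before opening the file.
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<char32_t>));
    fin.open("text.txt");                     // the file written by your snippet
    for (char32_t c; fin.get(c); )            // each element is one code point
        std::cout << std::hex << std::showbase
                  << static_cast<unsigned long>(c) << '\n';
}                                             // prints 0x7a 0x6c34 0x1d10b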

Visitor answered 18/3, 2013 at 10:53 Comment(3)
+1, I wrote that example and contrast was what I was going for.Gradient
Ah I see! So is it therefore better practice to always explicitly convert UTF-8 to a wider wchar_t or is it still acceptable to just extract the raw UTF-8 bytes into a native char array using an ifstream? I'm not sure whether to infer from @Cubbi's example that the latter is bad practice, or whether it is just outside the scope of the example.Undulate
@PLPiper Yes, you can always read whatever multibyte encoding the file has into a char array, without engaging any of the conversions. There isn't a lot that can be done with such an array within standard C++ (other than converting it to wide first), but plenty of libraries take UTF-8 input.Gradient
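A minimal sketch of that second route, reading the raw UTF-8 bytes into a std::string and converting to code points only when they are actually needed (the file name matches the snippet in the question; wstring_convert is one option among several):

#include <codecvt>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>

int main()
{
    // Grab the file's bytes as-is; no codecvt facet is involved here.
    std::ifstream in("text.txt", std::ios::binary);
    std::string utf8((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Convert explicitly only if individual code points are needed.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string utf32 = conv.from_bytes(utf8);

    std::cout << utf8.size() << " UTF-8 bytes, "
              << utf32.size() << " code points\n";   // 8 bytes, 3 code points
}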

The idea of the cppreference code snippet you used is to show how to read a UTF-8 file into a wide string (UTF-16 or UTF-32, depending on the platform's wchar_t); that's why they write the file with an ofstream but read it back with a wifstream (hence the wchar_t).
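If you specifically want UTF-16 regardless of platform, the codecvt_utf8_utf16 facet mentioned in the comments above does that conversion explicitly. A rough sketch of mine (not a quote from the cppreference page):

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    std::wifstream fin("text.txt");
    // Replace the locale's codecvt with an explicit UTF-8 -> UTF-16 conversion.
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    for (wchar_t c; fin >> c; )
        std::cout << std::hex << std::showbase << c << '\n';
}   // prints 0x7a 0x6c34 0xd834 0xdd0b (the last two are a surrogate pair)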

Adventurism answered 18/3, 2013 at 9:23 Comment(0)
