Seeking istreambuf_iterator <wchar_t> clarifications: reading a complete text file of Unicode characters
In the book “Effective STL” by Scott Meyers, there is a nice example of reading an entire text file into a std::string object:

std::string sData; 

/*** Open the file for reading, binary mode ***/
std::ifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode

/*** Read in all the data from the file into one string object ***/
sData.assign (std::istreambuf_iterator <char> (ifFile),
              std::istreambuf_iterator <char> ());

Note that it reads the file in as 8-bit characters. This works very well. Recently, though, I have had a need to read a file containing Unicode text (i.e., UTF-16, two bytes per character). However, when I try to (naively) change it to read the data from a Unicode text file into a std::wstring object like so:

std::wstring wsData; 

/*** Open the file for reading, binary mode ***/
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode

/*** Read in all the data from the file into one string object ***/
wsData.assign (std::istreambuf_iterator <wchar_t> (ifFile),
               std::istreambuf_iterator <wchar_t> ());

The string that I get back, while being of wide characters, still has the alternating nulls. For example, if the file contains the Unicode string "ABC", the bytes of the file (ignoring the Unicode lead bytes 0xFF, 0xFE) are: <'A'> <0> <'B'> <0> <'C'> <0>

The first code fragment above would correctly result in the following contents of the (char) string:
sData [0] = 'A'
sData [1] = 0x00
sData [2] = 'B'
sData [3] = 0x00
sData [4] = 'C'
sData [5] = 0x00

However, when the second code fragment is run, it undesirably results in the following contents of the (wchar_t) string:
wsData [0] = L'A'
wsData [1] = 0x0000
wsData [2] = L'B'
wsData [3] = 0x0000
wsData [4] = L'C'
wsData [5] = 0x0000

It's as if the file were still being read byte by byte and then simply translated into individual wchar_t characters.

I would have thought that the std::istreambuf_iterator, being specialized to wchar_t, should have resulted in the file being read two bytes at a time, shouldn't it? If not, what's its purpose then?

I have traced into the templates (no easy feat ;-), and the iterator does indeed still seem to read the file byte by byte, passing each byte on to its internal convert routine, which dutifully reports that conversion is complete after every byte (not only after receiving two bytes).

I have searched a number of sites on the web (including this one) for this seemingly trivial task, but have not found an explanation of this behavior or a good alternative that does not involve more code than I feel should be necessary (e.g., a Google search turns up that same second code fragment presented as viable code).

The only thing that I have found that works is the following, and I consider it a cheat, as it needs direct access to the wstring's internal buffer and type-coerces it at that.

std::wstring wsData;

/*** Open the file for reading, binary mode. Note that this must be a
     narrow (char) stream, as the aim is a raw byte read ***/
std::ifstream ifFile ("MyFile.txt", std::ios_base::binary);

/*** Determine the size of the file in bytes ***/
ifFile.seekg (0, std::ios_base::end);
std::streamsize nBytes = ifFile.tellg ();
ifFile.seekg (0, std::ios_base::beg);

/*** Read the raw bytes directly into the wstring's internal buffer ***/
wsData.resize (nBytes / sizeof (wchar_t));
ifFile.read ((char *) &wsData [0], nBytes);

Oh, and to forestall the inevitable "Why open the file in binary mode, why not in text mode?" question: that open mode is intentional. If the file were opened in text mode (the default), CR/LF sequences ("\r\n", i.e. 0x0D 0x0A) would be converted into lone LF ("\n", 0x0A) sequences, whereas a pure byte read of the file preserves them. Regardless, for the diehards: changing that had, unsurprisingly, no effect.
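To illustrate, a minimal sketch of the two open modes (the newline translation described above applies to the text-mode stream on Windows):

/*** Text mode (the default): "\r\n" in the file arrives as '\n' ***/
std::ifstream ifText ("MyFile.txt");

/*** Binary mode: every byte, including the 0x0D of a CR/LF pair, is delivered untouched ***/
std::ifstream ifBinary ("MyFile.txt", std::ios_base::binary);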

So, two questions here: why does the second case not work as one might expect (i.e., what is going on with those iterators), and what would be your favorite "kosher STL" way of loading a file of Unicode characters into a wstring?

What am I missing here; it has to be something silly.

Chris

Sprawl answered 5/1, 2013 at 1:34 Comment(1)
Welcome to Stack Overflow. Thank you for writing such a precise and detailed first question! – Nebulous

You must be disappointed with SO to have received no answers to your first question after four and a half months. It is a good question, and most good questions are answered (well or badly) within minutes. Two likely reasons for the neglect of yours are:

  • You did not tag it "C++", so many C++ programmers who might have been able to help will never have noticed it. (I have now tagged it "C++".)

  • Your question is about Unicode stream-handling, which is no one's idea of cool coding.

The misconception that has thwarted your investigations seems to be this: you appear to believe that a wide-character stream, std::wfstream, and a wide-character string, std::wstring, are respectively the same as a "Unicode stream" and a "Unicode string", and specifically that they are respectively the same as a UTF-16 stream and a UTF-16 string. Neither of these things is true.

An std::wifstream (std::basic_ifstream<wchar_t>) is an input stream that converts an external sequence of bytes to an internal sequence of wchar_t, according to a specified or default encoding of the external sequence.

Likewise an std::wofstream (std::basic_ofstream<wchar_t>) is an output stream that converts an internal sequence of wchar_t to an external sequence of bytes, according to a specified or default encoding of the external sequence.

And an std::wstring (std::basic_string<wchar_t>) is a string type that simply stores a sequence of wchar_t, without knowledge of the encoding - if any - from which they resulted.

Unicode is a family of byte-sequence encodings - UTF-8, UTF-16 and UTF-32, plus some more obscure ones - related by the principle that UTF-N encodes alphabets using a sequence of one or more N-bit units per symbol. UTF-16 is apparently the encoding you are trying to read into a std::wstring. You say:

I would have thought that the std::istreambuf_iterator, being specialized to wchar_t, should have resulted in the file being read two bytes at a time, shouldn't it? If not, what's its purpose then?

But once you know that wchar_t is not necessarily 2 bytes wide (it is in Microsoft's C libraries, both 32- and 64-bit, but in GCC's it is 4 bytes wide), and also that a UTF-16 code-point (character) need not fit into 2 bytes (it can require 4), you will see that specifying an extraction unit of wchar_t cannot be all there is to decoding a UTF-16 stream.
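Both points are easy to check for yourself; a minimal sketch (the sizeof result is platform-dependent, and the surrogate-pair arithmetic uses U+1D11E, the musical G-clef, as the example code-point):

#include <cstdint>
#include <iostream>

int main ()
{
    // Platform-dependent: prints 2 with MS VC++, 4 with GCC on Linux
    std::cout << "sizeof(wchar_t) = " << sizeof (wchar_t) << '\n';

    // A code-point above 0xFFFF needs two 16-bit units (a surrogate pair)
    std::uint32_t v  = 0x1D11E - 0x10000;      // 20-bit remainder
    std::uint16_t hi = 0xD800 + (v >> 10);     // lead surrogate: 0xD834
    std::uint16_t lo = 0xDC00 + (v & 0x3FF);   // trail surrogate: 0xDD1E
    std::cout << std::hex << hi << ' ' << lo << '\n';
}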

When you construct and open your input stream with:

std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);

it is prepared to extract characters (of some alphabet) from "MyFile.txt" into values of type wchar_t, and it will extract those characters from the byte sequence in the file according to the encoding specified by the std::locale that is operative on the stream when it does the extracting.

Your code does not specify an std::locale for your stream, so the library's default takes effect. That default is the global C++ locale, which in turn is by default the "C" locale; and the "C" locale assumes the "identity encoding" of I/O byte sequences, i.e. 1 byte = 1 character (setting aside the newline exception for text-mode I/O).
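You can observe this default directly; a minimal sketch:

#include <fstream>
#include <iostream>
#include <locale>

int main ()
{
    std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);

    // No imbue has been done, so the stream carries the global locale,
    // which is by default the classic "C" locale
    std::cout << ifFile.getloc ().name () << '\n';   // prints "C"
}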

Thus, when you employ your std::istreambuf_iterator<wchar_t> to extract the characters, the extraction proceeds by converting each byte in the file to a wchar_t, which it appends to the std::wstring wsData. The bytes in the file are, as you say:

0xFF, 0xFE, 'A', 0x00, 'B', 0x00, 'C', 0x00

The first two, which you discount as "Unicode lead bytes", are indeed a UTF-16 byte-order mark (BOM), but in the default encoding they just are what they are.

Accordingly the wide-characters assigned to wsData are, as you observed:

0x00FF, 0x00FE, L'A', 0x0000, L'B', 0x0000, L'C', 0x0000

It's as if the file were still being read byte by byte and then just simply translated into individual wchar_t characters.

because that is precisely what is happening.

To stop this happening, you need to do something before you start extracting characters from the stream to tell it that it is supposed to decode a UTF-16 character sequence. The way to do that is conceptually rather tortuous: you need to imbue the stream with an std::locale that possesses an std::locale::facet that is an instantiation of std::codecvt<InternT, ExternT, StateT> (or is derived from one), which will provide the stream with the correct methods for decoding UTF-16 into wchar_t.

But the gist of this is that you need to plug the right UTF-16 encoder/decoder into the stream, and in practice that is (or should be) simple enough. I am guessing that your compiler is a recent MS VC++. If that's right, then you can fix your code by:

  • Adding #include <locale> and #include <codecvt> to your headers
  • Adding the line:

    ifFile.imbue(std::locale(ifFile.getloc(),new std::codecvt_utf16<wchar_t,0x10ffff,std::little_endian>));

right after:

std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);

The effect of this new line is to "imbue" ifFile with a new locale that is the same as the one it already had - ifFile.getloc() - but with a modified encoder/decoder facet - std::codecvt_utf16<wchar_t,0x10ffff,std::little_endian>. This codecvt facet is one that will decode little-endian UTF-16 byte sequences into wchar_t values, for code points up to a maximum of 0x10ffff (0x10ffff being the maximum value of UTF-16 code-points).

When you debug into the code thus amended, you will now find that wsData is only 4 wide characters long and that those characters are:

0xFEFF, L'A', L'B', L'C'

as you expect them to be, with the first one being the UTF-16 little-endian BOM.
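For reference, here is the whole amended read assembled into one self-contained sketch (assuming MS VC++ and its <codecvt> header, as above):

#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

int main ()
{
    std::wstring wsData;

    /*** Open the file for reading, binary mode ***/
    std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);

    /*** Plug a UTF-16 little-endian decoder into the stream's locale ***/
    ifFile.imbue (std::locale (ifFile.getloc (),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    /*** Read in all the data from the file into one string object ***/
    wsData.assign (std::istreambuf_iterator<wchar_t> (ifFile),
                   std::istreambuf_iterator<wchar_t> ());
}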

Notice that the order FE,FF is the reverse of what it was before application of the codecvt facet, showing us that little-endian decoding was done as requested. And it needed to be. Just edit the new line by removing std::little_endian, debug again, and you will find that the first element of wsData becomes 0xFFFE and that the other three wide characters become pictograms of the IICore pictographic character set (if your debugger can display them). (Now, whenever a colleague complains in amazement that their code is turning English Unicode into "Chinese", you will know a likely explanation.)

Should you want to populate wsData without the leading BOM, you can do that by amending the new line again, replacing std::little_endian with std::codecvt_mode(std::little_endian|std::consume_header).
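For concreteness, the amended line then reads:

ifFile.imbue (std::locale (ifFile.getloc (),
    new std::codecvt_utf16<wchar_t, 0x10ffff,
        std::codecvt_mode (std::little_endian | std::consume_header)>));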

Finally, you may well have noted a bug in the new code, namely that a 2-byte wchar_t is insufficiently wide to represent the UTF-16 code-points between 0x10000 and 0x10ffff that could be read.

You will get away with this as long as all the code-points you have to read lie in the Basic Multilingual Plane, which spans [0,0xffff], and you might know that all inputs will forever obey that constraint. Otherwise, a 16-bit wchar_t is not fit for purpose. Replace:

  • wchar_t with char32_t
  • std::wstring with std::basic_string<char32_t>
  • std::wifstream with std::basic_ifstream<char32_t>

and the code is fully fit to read an arbitrary UTF-16 encoded file into a string.
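A sketch of that char32_t version, under the same MS VC++ / <codecvt> assumptions:

#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

int main ()
{
    std::basic_string<char32_t> sData;

    std::basic_ifstream<char32_t> ifFile ("MyFile.txt", std::ios_base::binary);

    /*** Decode UTF-16LE, consuming the BOM, into 32-bit code-points ***/
    ifFile.imbue (std::locale (ifFile.getloc (),
        new std::codecvt_utf16<char32_t, 0x10ffff,
            std::codecvt_mode (std::little_endian | std::consume_header)>));

    sData.assign (std::istreambuf_iterator<char32_t> (ifFile),
                  std::istreambuf_iterator<char32_t> ());
}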

(Readers who are working with the GNU C++ library will find that as of v4.7.2 it does not yet provide the <codecvt> standard header. The header <bits/codecvt.h> exists and presumably will some day graduate to being <codecvt>, but at present it only exports the specializations class codecvt<char, char, mbstate_t> and class codecvt<wchar_t, char, mbstate_t>, which are respectively the identity conversion and the conversion between ASCII/UTF-8 and wchar_t. To solve the OP's problem you need to subclass std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> yourself, as per this answer; a rough sketch of the shape such a subclass takes follows.)
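Here is that rough, decode-only sketch, assuming little-endian UTF-16 input and ignoring the BOM and surrogate pairs (so BMP-only; an illustration of the facet protocol, not a substitute for the linked answer):

#include <algorithm>
#include <cstddef>
#include <locale>

class utf16le_codecvt
    : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    // Convert external bytes (2 per character, little-endian) to wchar_t
    result do_in (state_type&,
                  const char* from, const char* from_end, const char*& from_next,
                  wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_end - from_next >= 2 && to_next != to_end) {
            unsigned char lo = static_cast<unsigned char> (from_next[0]);
            unsigned char hi = static_cast<unsigned char> (from_next[1]);
            *to_next++ = static_cast<wchar_t> (lo | (hi << 8));
            from_next += 2;
        }
        return from_next == from_end ? ok : partial;
    }

    // This sketch only decodes; writing is not supported
    result do_out (state_type&,
                   const wchar_t* from, const wchar_t*, const wchar_t*& from_next,
                   char* to, char*, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        return error;
    }

    // The encoding is stateless, so there is nothing to unshift
    result do_unshift (state_type&, char* to, char*, char*& to_next) const override
    {
        to_next = to;
        return ok;
    }

    bool do_always_noconv () const noexcept override { return false; }
    int  do_encoding () const noexcept override { return 2; }   // 2 bytes per wchar_t
    int  do_max_length () const noexcept override { return 2; }

    // Bytes that would be consumed to produce at most max wchar_t
    int do_length (state_type&, const char* from, const char* from_end,
                   std::size_t max) const override
    {
        std::size_t units = std::min<std::size_t> ((from_end - from) / 2, max);
        return static_cast<int> (units * 2);
    }
};

It would be imbued just like the codecvt_utf16 facet above:

ifFile.imbue (std::locale (ifFile.getloc (), new utf16le_codecvt));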

Leak answered 20/5, 2013 at 17:10 Comment(8)
A 16-bit wchar_t is fine for UTF-16; you just have to remember that each codepoint may take one or more wchar_t. But that's not so big a deal if you also recall that each "glyph" on the screen can be composed of many codepoints anyway. I'd hardly consider using wchar_t in this way a "bug". Simply nonportable. – Perfumer
I've experimented with this and still have to disagree. Take the 2-unit codepoint U+1D11E (the G-clef). This will be UTF-16LE encoded in a file containing FF FE 34 D8 1E DD. Read this file in the manner of my fix (with std::consume_header, built with MS VC++ 2012) but keep the stream and string instantiated for wchar_t. Then wsData ends up with length 1 and wsData[0] is d11e. The first code unit has been lost. Replace wchar_t with char32_t and we get 1d11e, correctly. I don't know whether the MS library is buggy here, but either way we corrupt the data with wchar_t. – Leak
Er, yeah, that would be a bug in the MS library. I am pretty sure it should not be doing that. – Perfumer
Wait, according to this page that behavior is per spec! WTF! – Perfumer
Right, we're both the wiser for Chris Wiesner's first question :) – Leak
Ouch! I really must apologize for not checking back here. The question remained unanswered for so long that I let it fade from memory. It was only my chance finding of this article (on another site, sorry) link that caused me to remember this question and look back. (Surprisingly, I never got a notification that an answer was even posted :-( ) Thanks Mike and Mooing for your information, and for a level of detail above and beyond the call of duty! – Sprawl
@ChrisWiesner You're welcome. It was most unusual for your question to be ignored for so long. Should you wish to mark my answer as accepted, how to do so is here. – Leak
+n This answer absolutely rocks, and I wish I could up-vote it a dozen more times. – Alexei
