In the book “Effective STL” by Scott Meyers, there is a nice example of reading an entire text file into a std::string object:
std::string sData;
/*** Open the file for reading, binary mode ***/
std::ifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode
/*** Read in all the data from the file into one string object ***/
sData.assign (std::istreambuf_iterator <char> (ifFile),
std::istreambuf_iterator <char> ());
Note that it reads the file in as 8-bit characters. This works very well. Recently, though, I have needed to read a file containing Unicode text (i.e., UTF-16, two bytes per character). However, when I (naively) change the code to read the data from a Unicode text file into a std::wstring object like so:
std::wstring wsData;
/*** Open the file for reading, binary mode ***/
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode
/*** Read in all the data from the file into one string object ***/
wsData.assign (std::istreambuf_iterator <wchar_t> (ifFile),
std::istreambuf_iterator <wchar_t> ());
The string that I get back, while made up of wide characters, still contains the alternating nulls. For example, if the file contains the Unicode string “ABC”, the bytes of the file (ignoring the leading byte-order mark bytes 0xFF, 0xFE) are: <'A'> <0> <'B'> <0> <'C'> <0>
The first code fragment above would correctly result in the following contents of the (char) string:
sData [0] = 'A'
sData [1] = 0x00
sData [2] = 'B'
sData [3] = 0x00
sData [4] = 'C'
sData [5] = 0x00
However, when the second code fragment is run, it undesirably results in the following contents of the (wchar_t) string:
wsData [0] = L'A'
wsData [1] = 0x0000
wsData [2] = L'B'
wsData [3] = 0x0000
wsData [4] = L'C'
wsData [5] = 0x0000
It’s as if the file were still being read byte by byte and then just simply translated into individual wchar_t characters.
I would have thought that std::istreambuf_iterator, being specialized for wchar_t, would cause the file to be read two bytes at a time. If not, what is its purpose?
I have traced into the templates (no easy feat ;-), and the iterator does indeed still seem to be reading the file byte by byte and passing each byte on to its internal conversion routine, which dutifully reports that conversion is done after every single byte (rather than only after receiving two bytes).
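The behavior traced above can be reproduced without a file by calling the conversion facet directly. The sketch below is illustrative only: the helper name widen_with_classic_locale is made up, and it assumes the stream is using the default "C" locale, whose std::codecvt<wchar_t, char, std::mbstate_t> facet treats the input as a narrow single-byte encoding rather than as UTF-16.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper (not from the original post): pushes raw bytes through
// the classic "C" locale's char -> wchar_t conversion facet, which is the
// facet a freshly constructed std::wifstream uses unless another is imbued.
std::wstring widen_with_classic_locale(const std::string& bytes)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    std::wstring out(bytes.size(), L'\0');
    if (bytes.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    // Each input byte is converted on its own; bytes are never paired up.
    facet.in(state,
             bytes.data(), bytes.data() + bytes.size(), from_next,
             &out[0], &out[0] + out.size(), to_next);

    // Keep only the wide characters that were actually produced.
    out.resize(static_cast<std::wstring::size_type>(to_next - &out[0]));
    return out;
}
```

Feeding this helper the six bytes 'A' 0 'B' 0 'C' 0 should give back six wide characters rather than three, which matches the wsData contents listed earlier.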
I have searched a number of sites on the web (including this one) for a solution to this seemingly trivial task, but have not found an explanation of this behavior or a good alternative that does not involve more code than I feel should be necessary (e.g., a Google search turns up that same second code fragment, presented as a viable piece of code).
The only thing that I have found that works is the following, and I consider that to be a cheat as it needs direct access to the wstring’s internal buffer and then type-coerces it at that.
std::wstring wsData;
/*** Open the file for reading, binary mode ***/
std::ifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode (narrow stream, since read () below takes a char *)
wsData.resize (<Size of file in bytes> / sizeof (wchar_t));
ifFile.read ((char *) &wsData [0], <Size of file in bytes>);
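For comparison, here is a sketch of a variant that avoids the cast into the wstring's buffer: read the raw bytes exactly as in the first fragment, then assemble each pair of bytes into one wide character by hand. The function names decode_utf16le and read_utf16le_file are made up for illustration, and the code assumes the file is little-endian UTF-16 containing only characters from the Basic Multilingual Plane (surrogate pairs are copied through as two separate code units, not combined).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: turns raw UTF-16LE bytes into a wstring, one wide
// character per 16-bit code unit. A leading 0xFF 0xFE byte-order mark is
// skipped. Assumption: no surrogate pairs need to be combined.
std::wstring decode_utf16le(const std::string& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
    {
        i = 2; // skip the byte-order mark
    }

    std::wstring out;
    out.reserve((bytes.size() - i) / 2);
    for (; i + 1 < bytes.size(); i += 2)
    {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>((hi << 8) | lo)); // low byte first
    }
    return out; // a trailing odd byte, if any, is ignored
}

// Hypothetical wrapper: reads the whole file as bytes, then decodes it.
std::wstring read_utf16le_file(const char* path)
{
    std::ifstream in(path, std::ios_base::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16le(bytes);
}
```

Unlike the resize-and-read version, this does not depend on sizeof (wchar_t) being 2 or on the machine's byte order, at the cost of a second pass over the data.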
Oh, and to forestall the inevitable “Why open the file in binary mode, why not in text mode” question: that is intentional. If the file were opened in text mode (the default), CR/LF ("\r\n" or 0x0D0A) sequences would be converted into just LF ("\n" or 0x0A) sequences, whereas a pure byte read of the file preserves them. Regardless, for those diehards, changing that had, unsurprisingly, no effect.
So, two questions here: why does the second case not work as one might expect (i.e., what is going on with those iterators), and what would be your favorite “kosher STL-way” of loading a file of Unicode characters into a wstring?
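One candidate for such a library-only approach, offered as a sketch and not as a definitive answer, is to imbue the wide stream with a UTF-16 conversion facet before reading, so that the iterator version of the code sees decoded characters. The example assumes a C++11 or later standard library that still ships <codecvt> (std::codecvt_utf16 was deprecated in C++17 and removed in C++26); the function name read_utf16_file_with_facet is made up for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a UTF-16 file into a wstring by letting the
// stream's locale facet do the decoding. The facet assumes little-endian
// data by default and consumes a leading byte-order mark if one is present.
std::wstring read_utf16_file_with_facet(const char* path)
{
    typedef std::codecvt_utf16<
        wchar_t, 0x10FFFF,
        static_cast<std::codecvt_mode>(std::little_endian |
                                       std::consume_header)> Utf16Facet;

    std::wifstream in(path, std::ios_base::binary);

    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new Utf16Facet));

    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

The file still has to be opened in binary mode here, for the reason given above: a text-mode translation of individual 0x0D/0x0A bytes would corrupt the two-byte code units before the facet sees them.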
What am I missing here? It has to be something silly.
Chris