Writing utf16 to file in binary mode

Asked 16/10, 2008 at 7:17 Answered 20/9, 2012 at 7:45

I'm trying to write a wstring to file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

Opening test.txt in for example Firefox with encoding set to UTF16 it will show as:

h�e�l�l�o�

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor I get:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00

Looks like I get two extra bytes in between every character for some reason?

Towe answered 16/10, 2008 at 7:17 Comment(1)

Add a facet to the local associated with the stream to do the conversion from wchar_t to the correct output. See below. – Cheerly 16/10, 2008 at 13:1

I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.

Ment answered 16/10, 2008 at 7:47 Comment(2)

Yeah, you are correct wchar_t is of size 4, I'm in a mac. So that explains a lot :) I'm aware of the surrogate pairs in UTF-16, will have to look into that a bit more. – Towe 16/10, 2008 at 8:1

From the output you can not tell it it is UTF-16 or UTF-32 all it shows is that wchar_t is 4 bytes wide. The encoding of the string is not defined by the language (though it is most likely to be UCS-4). – Cheerly 16/10, 2008 at 13:10

Here we run into the little used locale properties. If you output your string as a string (rather than raw data) you can get the locale to do the appropriate conversion auto-magically.

N.B.This code does not take into account edianness of the wchar_t character.

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // construct a custom unicode facet and add it to a local.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB If you open the file stream first. Any attempt to imbue it with a local will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}

The File: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output array/
       for(;(from < from_end) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object
            * Initialise it so that zero out all its bit's. This
            * is important on systems with 32bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       return((from > from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       for(;(from < from_end) && (to < to_limit);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       return((to > to_limit)?partial:ok);
   }
};

Cheerly answered 16/10, 2008 at 12:56 Comment(1)

Note, that your Facet implements the conversion to/from UCS-2, not UTF-16. UTF-16 is a variable length encoding that instruments so called surrogate pairs. UCS-2 is a subset of Unicode, which is the reason UTF-16 has been invented. – Keegan 4/5, 2017 at 21:12

I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.

Ment answered 16/10, 2008 at 7:47 Comment(2)

Yeah, you are correct wchar_t is of size 4, I'm in a mac. So that explains a lot :) I'm aware of the surrogate pairs in UTF-16, will have to look into that a bit more. – Towe 16/10, 2008 at 8:1

It is easy if you use the C++11 standard (because there are a lot of additional includes like "utf8" which solves this problems forever).

But if you want to use multi-platform code with older standards, you can use this method to write with streams:

Read the article about UTF converter for streams
Add stxutif.h to your project from sources above

Open the file in ANSI mode and add the BOM to the start of a file, like this:

std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);

unsigned char smarker[3];
smarker[0] = 0xEF;
smarker[1] = 0xBB;
smarker[2] = 0xBF;

fs << smarker;
fs.close();

Then open the file as UTF and write your content there:

std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);

std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale); 

fs << .. // Write anything you want...

Mihe answered 20/9, 2012 at 7:45 Comment(0)

On windows using wofstream and the utf16 facet defined above fails becuase the wofstream converts all bytes with the value 0A to 2 bytes 0D 0A, this is irrespective of how you pass the 0A byte in, '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On windows you have to open the file with an ofstream (not a wofsteam) in binary mode and write the output just like it is done in the original post.

Encephaloma answered 22/5, 2009 at 11:16 Comment(0)

The provided Utf16Facet didn't work in gcc for big strings, here is the version that worked for me... This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0]

#include <locale>
#include <bits/codecvt.h>


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

       for(;from < from_end;from += 2,++to)
       {
           if(to<=to_limit){
               (*to)                               = L'\0';

               reinterpret_cast<char*>(to)[0]  = from[0];
               reinterpret_cast<char*>(to)[1]  = from[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

       for(;(from < from_end);++from, to += 2)
       {
           if(to <= to_limit){

               to[0]     = reinterpret_cast<const char*>(from)[0];
               to[1]     = reinterpret_cast<const char*>(from)[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }
};

Anniceannie answered 9/6, 2012 at 2:59 Comment(0)

You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes, to verify that the output is actually UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character. But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally just have 1 byte per character.

When you say "Firefox with encoding set to UTF16" what do you mean? I'm skeptical that that work work.

Permissive answered 16/10, 2008 at 7:30 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags