    void SomeRandomClass::readUtf16LeFile( const std::string& theFileName )
    {
        boost::locale::generator gen;
        std::ifstream file( theFileName );
        auto utf8Locale = gen.generate( "UTF-8" );
        std::locale cvtLocale( utf8Locale, new std::codecvt_utf8_utf16<char>() );
        file.imbue( utf8Locale );
        std::string line;
        std::cout.imbue( utf8Locale );
        for ( int i = 0; i < 3; i++ )
        {
            std::getline( file, line );
            std::cout << line << std::endl;
        }
    }

Reading UTF-16 writing UTF-8

The first thing to clarify is which variant of UTF-16 you are reading:

Is it UTF-16LE (e.g. generated under Windows)?
Is it UTF-16BE (the default byte order assumed by std::codecvt_utf16)?
Is it UTF-16 with a BOM?

The next question is whether you can really output your UTF-8 or UTF-16 on the console, knowing that the default Windows console can cause real headaches here.

Step 1: Make sure that the problem is not related to the Windows console

Here is a small piece of code that reads a UTF-16LE file and checks the content with a native Windows function (you just have to include <windows.h> in your console app):

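    // Needs <windows.h>, <codecvt>, <fstream>, <locale>, <string> and "using namespace std;".
    // Binary mode keeps the runtime from altering the raw UTF-16 bytes.
    wifstream is16(filename, ios::binary);
    // The locale takes ownership of the facet, so there is no matching delete.
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wtext, wline;
    while (getline(is16, wline))       // collect every line of the file
        wtext += wline + L"\n";
    MessageBoxW(NULL, wtext.c_str(), L"UTF16-Little Endian", MB_OK);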

If your file is UTF-16 with a BOM, just replace little_endian with consume_header.

Step 2: Convert your UTF-16 string back into a UTF-8 string

You have to use a string converter:

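    // Converts between UTF-16 held in wchar_t (as on Windows) and UTF-8 bytes.
    wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;

    wifstream is16(filename, ios::binary);   // binary: leave the raw UTF-16 bytes untouched
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wline;
    string u8line;
    // Only the first 10 lines are printed here.
    for (int i = 0; i < 10 && getline(is16, wline); i++) {
        u8line = converter.to_bytes(wline);   // UTF-16 -> UTF-8
        cout << u8line << endl;
    }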

This will display the ASCII characters correctly on the Windows console. However, every non-ASCII character (a multi-byte sequence in UTF-8) will appear as garbage, unless you are more successful than I was at setting up the console to display Unicode.
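Since std::wstring_convert and the <codecvt> facets are deprecated as of C++17, the same conversion can also be done by hand on the raw bytes. The following is only a minimal sketch, not the method used above: the helper names appendUtf8 and utf16LeToUtf8 are made up for illustration, the input is assumed to be the raw UTF-16LE content of the file read in binary mode, and a leading BOM is not stripped (it is simply converted like any other character).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: appends one Unicode code point to out, encoded as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts raw UTF-16LE bytes to a UTF-8 string.
// Throws std::runtime_error on an odd byte count or a broken surrogate pair.
std::string utf16LeToUtf8(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd number of bytes");

    // Reads the 16-bit code unit starting at byte i (low byte first).
    auto unit = [&bytes](std::size_t i) -> std::uint32_t {
        const std::uint32_t lo = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t hi = static_cast<unsigned char>(bytes[i + 1]);
        return lo | (hi << 8);
    };

    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 3 >= bytes.size())
                throw std::runtime_error("truncated surrogate pair");
            const std::uint32_t low = unit(i + 2);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::runtime_error("high surrogate without low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 2;                                     // the pair used two code units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("stray low surrogate");
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Note that this works on the whole byte buffer rather than line by line, so line endings (including any '\r') are carried over unchanged.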

Step 3: Check the UTF-8 encoding using a file

As the Windows console is pretty bad at this, the best approach is to write the text you produced directly into a file and open that file with a text editor (like Notepad++) which can show you the encoding.
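Writing the converted text to a file could be sketched as follows. The helper name writeUtf8File and its withBom option are made up for illustration; the lines are assumed to already hold UTF-8 bytes, for example the u8line strings produced in step 2.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes already-encoded UTF-8 lines to a file, byte for byte.
// withBom is an illustrative option; a UTF-8 BOM is optional but helps some editors.
bool writeUtf8File(const std::string& path,
                   const std::vector<std::string>& utf8Lines,
                   bool withBom = false)
{
    // Binary mode keeps the runtime from touching the bytes (no newline translation).
    std::ofstream out(path, std::ios::binary);
    if (!out)
        return false;
    if (withBom)
        out.write("\xEF\xBB\xBF", 3);
    for (const std::string& line : utf8Lines)
        out << line << '\n';
    return static_cast<bool>(out);   // false if any write failed
}
```

Opening the resulting file in an editor that displays the detected encoding then shows whether the conversion produced valid UTF-8, independently of the console.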

Nota bene: all this was done using only the standard library and its locale facilities (except for the intermediary MessageBoxW()).

Further steps

If you want to detect the encoding, the first thing to check is whether there is a BOM at the very beginning of your file (opened for binary input, with the default "C" locale):

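    // unsigned char: values such as 0xEF do not fit a signed char in brace initialization.
    // Check the UTF-32 signatures before UTF-16: FF FE 00 00 starts with FF FE.
    unsigned char bom_utf8[]    { 0xEF, 0xBB, 0xBF };
    unsigned char bom_utf16be[] { 0xFE, 0xFF };
    unsigned char bom_utf16le[] { 0xFF, 0xFE };
    unsigned char bom_utf32be[] { 0x00, 0x00, 0xFE, 0xFF };
    unsigned char bom_utf32le[] { 0xFF, 0xFE, 0x00, 0x00 };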

Just load the first few bytes, and compare with this data.
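The comparison could look roughly like this. It is a sketch only: the Bom enum and the functions detectBom and detectBomInFile are illustrative names, not part of any library. The UTF-32 signatures are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Illustrative result type for the BOM check.
enum class Bom { None, Utf8, Utf16Be, Utf16Le, Utf32Be, Utf32Le };

// Compares the first n bytes at p against the known BOM signatures.
// Caveat: a UTF-16LE file whose first character is U+0000 would also match
// the UTF-32LE signature; that case is treated as UTF-32LE here.
Bom detectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32Be;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32Le;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16Be;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16Le;
    return Bom::None;
}

// Loads up to four bytes from the start of the file (binary mode) and checks them.
// A file that cannot be opened yields Bom::None.
Bom detectBomInFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char head[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(head), sizeof head);
    return detectBom(head, static_cast<std::size_t>(in.gcount()));
}
```

A result of Bom::Utf16Le, for instance, would suggest the little_endian facet from step 1.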

If you find a BOM, you know the encoding. If not, you'll have to iterate through the file and examine its content.

A quick approximation, if you expect Western languages, is the following: if you find a lot of null bytes (between 25% and 50%), it's probably UTF-16. If you find more than 50% null bytes, it's probably UTF-32.
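A rough sketch of this heuristic is shown below. The Guess enum and the function guessByNullRatio are illustrative names. The thresholds come from the rule of thumb above, with one assumption added: a ratio of exactly 50% (which is what pure ASCII text stored as UTF-16 without a BOM produces) is counted as UTF-16.

```cpp
#include <cstddef>
#include <string>

// Illustrative result type for the null-byte heuristic.
enum class Guess { NoWideEncodingSuspected, ProbablyUtf16, ProbablyUtf32 };

// Guesses the encoding family from the share of null bytes in the buffer.
// This is a heuristic for mostly Western text, not a proof of anything.
Guess guessByNullRatio(const std::string& bytes)
{
    if (bytes.empty())
        return Guess::NoWideEncodingSuspected;

    std::size_t nulls = 0;
    for (char c : bytes)
        if (c == '\0')
            ++nulls;

    const double ratio = static_cast<double>(nulls) / static_cast<double>(bytes.size());
    if (ratio > 0.50)
        return Guess::ProbablyUtf32;   // e.g. ASCII in UTF-32 is 75% nulls
    if (ratio > 0.25)
        return Guess::ProbablyUtf16;   // assumption: exactly 50% also lands here
    return Guess::NoWideEncodingSuspected;
}
```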

But a more precise approach could make sense. For instance, to verify that the file is UTF-16, you just have to implement a small state machine that checks that whenever a 16-bit word has a high byte between 0xD8 and 0xDB (a high surrogate), the next word has its high byte between 0xDC and 0xDF (a low surrogate). Which byte of each word is the high one depends, of course, on whether the file is little or big endian.
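Such a state machine could be sketched as follows. The function name looksLikeUtf16 is made up, and the sketch additionally rejects a low surrogate that appears without a preceding high surrogate, as well as an odd byte count. Passing the check does not prove the data is UTF-16; it only means no surrogate rule was violated.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: returns false as soon as the surrogate pairing rule is broken.
// littleEndian selects which byte of each 16-bit word is the high byte.
bool looksLikeUtf16(const std::string& bytes, bool littleEndian)
{
    if (bytes.size() % 2 != 0)
        return false;                      // UTF-16 always has an even byte count

    bool expectLow = false;                // state: previous word was a high surrogate
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned char high =
            static_cast<unsigned char>(bytes[littleEndian ? i + 1 : i]);
        const bool isHighSurrogate = high >= 0xD8 && high <= 0xDB;
        const bool isLowSurrogate  = high >= 0xDC && high <= 0xDF;

        if (expectLow) {
            if (!isLowSurrogate)
                return false;              // high surrogate not followed by a low one
            expectLow = false;
        } else if (isLowSurrogate) {
            return false;                  // low surrogate without a high one before it
        } else if (isHighSurrogate) {
            expectLow = true;
        }
    }
    return !expectLow;                     // data must not end in the middle of a pair
}
```

Running the check once with each byte order and seeing which one passes can hint at the endianness, although plain text without surrogates will pass both.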

For UTF-8 the approach is similar, but the state machine is a little more complex because the bit pattern of the first byte defines how many bytes must follow, and each of the following bytes must have the bit pattern (c & 0xC0) == 0x80.
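A minimal sketch of the UTF-8 check, assuming the whole content is available in memory; the function name looksLikeUtf8 is illustrative. It verifies only the lead-byte and continuation-byte structure described above and does not reject overlong encodings or encoded surrogates, which a strict validator would.

```cpp
#include <string>

// Hypothetical structural check: lead bytes announce how many continuation
// bytes follow, and each continuation byte must satisfy (c & 0xC0) == 0x80.
bool looksLikeUtf8(const std::string& bytes)
{
    int pending = 0;                        // continuation bytes still expected
    for (char ch : bytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (pending > 0) {
            if ((c & 0xC0) != 0x80)
                return false;               // expected 10xxxxxx
            --pending;
        } else if ((c & 0x80) == 0x00) {
            // 0xxxxxxx: plain ASCII, nothing follows
        } else if ((c & 0xE0) == 0xC0) {
            pending = 1;                    // 110xxxxx: one byte follows
        } else if ((c & 0xF0) == 0xE0) {
            pending = 2;                    // 1110xxxx: two bytes follow
        } else if ((c & 0xF8) == 0xF0) {
            pending = 3;                    // 11110xxx: three bytes follow
        } else {
            return false;                   // stray continuation byte or invalid lead
        }
    }
    return pending == 0;                    // must not end inside a sequence
}
```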
