I'm working with code that expects UTF-8-encoded std::string variables. I want to handle a user-supplied file that may be UTF-16 encoded (I don't know the encoding at design time, but eventually want to handle UTF-8/16/32), read it line by line, and forward each line to the rest of the code as a UTF-8-encoded std::string.
I have C++11 (really, the current MSVC subset of C++11) and Boost 1.55.0 to work with. The code will eventually need to work on both Linux and Windows. For now, I'm just prototyping with Visual Studio 2013 Update 4 on Windows 7. I'm open to additional dependencies, but they need an established cross-platform (meaning Windows and *nix) track record, and must not be GPL/LGPL.
I've been making assumptions that I can't find a way to validate, and my code is not working.
One assumption is that, since I ultimately want each line from these files in a std::string variable, I should be working with std::ifstream imbued with a properly-constructed codecvt such that the incoming utf16 stream can be converted to utf8.
Is this assumption realistic? The alternative, I thought, would be that I'd have to do some encoding checks on the text file, and then choose wifstream/wstring or ifstream/string based on the results, which seemed more unappealing than I'd like to start with. Of course, if that's the right (or the only realistic) path, I'm open to it.
I realize that I will likely need to do some encoding detection anyway, but for now I'm not concerned with that part. I'm just focusing on getting UTF-16 file contents into a UTF-8 std::string.
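To make the goal concrete, here is a minimal sketch of the result I'm after, done by hand rather than through a codecvt facet. It is not the approach I'm asking about, just a statement of the expected behavior. The names utf16LeToUtf8 and readUtf16LeLines are hypothetical helpers of my own, and the sketch assumes the input is little-endian UTF-16 (optionally with a BOM), that the stream was opened in binary mode, and that the whole file fits in memory. Invalid surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Append one Unicode code point to `out`, encoded as UTF-8.
static void appendUtf8( std::string& out, std::uint32_t cp )
{
    if ( cp < 0x80 ) {
        out += static_cast<char>( cp );
    } else if ( cp < 0x800 ) {
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else if ( cp < 0x10000 ) {
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else {
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
}

// Read the little-endian 16-bit code unit stored at bytes[pos], bytes[pos + 1].
static std::uint32_t unitAt( const std::string& bytes, std::size_t pos )
{
    const std::uint32_t lo = static_cast<unsigned char>( bytes[pos] );
    const std::uint32_t hi = static_cast<unsigned char>( bytes[pos + 1] );
    return lo | ( hi << 8 );
}

// Hypothetical helper: convert raw UTF-16LE bytes to a UTF-8 std::string.
std::string utf16LeToUtf8( const std::string& bytes )
{
    std::string out;
    // Ignore a trailing odd byte, if any.
    const std::size_t n = bytes.size() & ~static_cast<std::size_t>( 1 );
    std::size_t i = 0;
    if ( n >= 2 && unitAt( bytes, 0 ) == 0xFEFF )
        i = 2;  // skip the byte-order mark

    while ( i < n ) {
        std::uint32_t cp = unitAt( bytes, i );
        i += 2;
        if ( cp >= 0xD800 && cp <= 0xDBFF ) {
            // High surrogate: it must be followed by a low surrogate.
            const std::uint32_t low = ( i < n ) ? unitAt( bytes, i ) : 0;
            if ( low >= 0xDC00 && low <= 0xDFFF ) {
                cp = 0x10000 + ( ( cp - 0xD800 ) << 10 ) + ( low - 0xDC00 );
                i += 2;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if ( cp >= 0xDC00 && cp <= 0xDFFF ) {
            cp = 0xFFFD;  // unpaired low surrogate
        }
        appendUtf8( out, cp );
    }
    return out;
}

// Hypothetical helper: read a whole UTF-16LE stream (opened in binary mode)
// and return its lines as UTF-8 strings, without line terminators.
std::vector<std::string> readUtf16LeLines( std::istream& in )
{
    const std::string raw( ( std::istreambuf_iterator<char>( in ) ),
                           std::istreambuf_iterator<char>() );
    std::istringstream utf8( utf16LeToUtf8( raw ) );
    std::vector<std::string> lines;
    std::string line;
    while ( std::getline( utf8, line ) ) {
        if ( !line.empty() && line[line.size() - 1] == '\r' )
            line.erase( line.size() - 1 );  // strip the CR of a CRLF pair
        lines.push_back( line );
    }
    return lines;
}
```

With this, opening the file as std::ifstream file( theFileName, std::ios::binary ) and calling readUtf16LeLines( file ) would give me the lines I want. What I'm hoping for is the same result from a stream imbued with the right facet, so that std::getline() can be used directly.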
I have tried a variety of different combinations of locale and codecvt, none of which have worked. Below is the latest incarnation of what I thought might work, but doesn't:
    void
    SomeRandomClass::readUtf16LeFile( const std::string& theFileName )
    {
        boost::locale::generator gen;
        std::ifstream file( theFileName );
        auto utf8Locale = gen.generate( "UTF-8" );
        std::locale cvtLocale( utf8Locale,
                               new std::codecvt_utf8_utf16<char>() );
        file.imbue( utf8Locale );
        std::string line;
        std::cout.imbue( utf8Locale );
        for ( int i = 0; i < 3; i++ )
        {
            std::getline( file, line );
            std::cout << line << std::endl;
        }
    }
The behavior I see with this code is that the result of each call to getline() is an empty string, regardless of the file contents.
This same code works fine (meaning, each getline() call returns a correctly-encoded non-empty string) on a utf8-encoded version of the same file if I omit lines 3 and 5 of the above method.
I could not find any examples here on SO, on http://en.cppreference.com/, or elsewhere in the wild of anyone trying to do this same thing.
All ideas and suggestions that conform to the requirements above are welcome.