Converting C++ std::wstring to utf8 with std::codecvt_xxx

C++11 has tools to convert wide-character strings (std::wstring) to and from their UTF-8 representation: std::codecvt, std::codecvt_utf8, std::codecvt_utf8_utf16, etc.

Which one can a Windows app use to convert regular Windows wide-character strings (std::wstring) to a UTF-8 std::string? Does it always work without configuring locales?

Pneumoencephalogram answered 29/5, 2016 at 20:12 Comment(4)
Possible duplicate of Convert wstring to string encoded in UTF-8 - Levison
@Levison I posted this question after reading the page you mentioned ))) I do not see a clear answer to my question on that page - Pneumoencephalogram
Does this not answer your question? According to a comment "[t]his works for Windows if you use VS2012 or later". - Levison
Thank you! It works like a charm. - Pneumoencephalogram

It depends on how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.

Now, usually when one says "Unicode", one means UTF16, which is what Microsoft Windows uses, and that is usually what wstring contains.

So, the right way to convert from UTF8 to UTF16:

     std::string utf8String = "blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::wstring utf16String = convert.from_bytes( utf8String );

And the other way around:

     std::wstring utf16String = L"blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::string utf8String = convert.to_bytes( utf16String );
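
The two snippets above can be wrapped into a pair of helpers. This is only a minimal sketch: the function names and the type alias are hypothetical, and it assumes the wide string holds UTF-16 code units. std::wstring_convert owns its own conversion facet, so it does not consult the global locale. Also note that <codecvt> and std::wstring_convert are deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical alias and helper names, introduced for illustration only.
// codecvt_utf8_utf16 treats every wchar_t element as one UTF-16 code unit.
using Utf8Utf16Converter =
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>;

// UTF-8 bytes -> UTF-16 code units stored in a std::wstring.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    Utf8Utf16Converter convert;
    return convert.from_bytes(utf8);  // throws std::range_error on bad input
}

// UTF-16 code units stored in a std::wstring -> UTF-8 bytes.
std::string utf16_to_utf8(const std::wstring& utf16)
{
    Utf8Utf16Converter convert;
    return convert.to_bytes(utf16);   // throws std::range_error on bad input
}
```

A converter object keeps state between calls, which is why each helper creates a fresh one instead of sharing a global instance.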

And to add to the confusion:
When you use std::string on a windows platform (like when you use a multibyte compilation), It's NOT UTF8. They use ANSI.
More specifically, the default ANSI code page your Windows is configured to use.

Also, note that wstring is not exactly the same as UTF-16.
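
To make that distinction concrete, here is a small sketch. Both helper names are hypothetical. The mapping from wchar_t width to encoding reflects the usual convention on common platforms, not a guarantee of the C++ standard.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units needed to store one Unicode code point.
// Code points above U+FFFF do not fit in 16 bits and need a surrogate pair.
constexpr std::size_t utf16_units(char32_t code_point)
{
    return code_point > 0xFFFF ? 2 : 1;
}

// Hypothetical helper: the encoding that a given wchar_t width usually implies.
// This is a platform convention, not something the language promises.
inline const char* likely_wchar_encoding(std::size_t wchar_size = sizeof(wchar_t))
{
    switch (wchar_size) {
        case 2:  return "UTF-16";   // e.g. Windows
        case 4:  return "UTF-32";   // e.g. typical Linux and macOS toolchains
        default: return "unknown";
    }
}
```

So the same std::wstring type holds 16-bit elements on Windows and 32-bit elements on many other platforms, and even on Windows one character may occupy two elements.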

The Windows API functions come in two variants, and each expects its own format (compiling in Unicode only selects which variant the unsuffixed name maps to):

CommandA - multibyte - ANSI
CommandW - Unicode - UTF16

Selry answered 31/5, 2016 at 7:26 Comment(1)
"usually when one says "Unicode", one means UTF16" - Uhm... When one says "Unicode" I would hope that one knows Unicode, and doesn't confuse the standard with an arbitrary encoding. "When you use std::string on a windows platform [...], It's NOT UTF8. They use ANSI." - The character encoding used for std::string is determined by the implementation (i.e. compiler), not the target platform. You can write a compiler that uses UTF-8 encoding for std::string on Windows. - Levison

It seems that std::codecvt_utf8 works well for converting std::wstring -> UTF-8. It passed all my tests (Windows app, Visual Studio 2015, Windows 8 with EN locale).

I needed a way to convert filenames to UTF8. Therefore my test is about filenames.

In my app I use boost::filesystem::path (Boost 1.60.0) to deal with file paths. It works well, but it is not able to convert filenames to UTF-8 properly. Internally, the Windows version of boost::filesystem::path uses std::wstring to store the file path. Unfortunately, the built-in conversion to std::string works poorly.

Test case:

  • create a file with mixed symbols c:\test\皀皁皂皃的 (some random Asian symbols)
  • scan the directory with boost::filesystem::directory_iterator and get the boost::filesystem::path for the file
  • convert it to std::string via the built-in conversion filenamePath.string()
  • you get c:\test\?????. The Asian symbols are converted to '?'. Not good.

boost::filesystem uses std::codecvt internally, and in this setup that conversion does not work for std::wstring -> std::string.

Instead of the built-in boost::filesystem::path conversion, you can define a conversion function like this (original snippet):

#include <codecvt>
#include <locale>
#include <string>

// Converts a wide string to its UTF-8 representation.
std::string wstring_to_utf8(const std::wstring & str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}

Then you can convert a file path to UTF-8 easily: wstring_to_utf8(filenamePath.wstring()). It works perfectly.

It works for every file path I tried. I tested ASCII strings c:\test\test_file, Asian strings c:\test\皀皁皂皃的, Russian strings c:\test\абвгд, and mixed strings c:\test\test_皀皁皂皃的, c:\test\test_абвгд, c:\test\test_皀皁皂皃的_абвгд. For every string I received a valid UTF-8 representation.
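
The answer does not say how validity was checked. One way to automate such a check is a small well-formedness test like the sketch below. is_valid_utf8 is a hypothetical helper, not something from the original answer or from Boost. It rejects truncated sequences, overlong forms, encoded surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;

        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;              // stray continuation or invalid lead byte

        if (i + len > n) return false;  // truncated sequence

        // Every following byte must look like 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        // Smallest code point each length may encode (rejects overlong forms).
        static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // encoded surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range

        i += len;
    }
    return true;
}
```

It can be applied to the std::string returned by the conversion function above. A case the test list does not cover is characters outside the Basic Multilingual Plane: with a 16-bit wchar_t, std::codecvt_utf8<wchar_t> is specified to convert UCS-2 rather than UTF-16, so it may be worth checking a file name containing such a character as well.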

Pneumoencephalogram answered 30/5, 2016 at 17:39 Comment(0)
