how can I convert wstring to u16string?
I want to convert wstring to u16string in C++.

I can convert wstring to string and back, but I don't know how to convert to u16string.

u16string CTextConverter::convertWstring2U16(wstring str)
{
        int iSize;
        u16string szDest[256] = {};
        memset(szDest, 0, 256);
        iSize = WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, NULL, 0, 0, 0);
        WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0);
        u16string s16 = szDest;
        return s16;
}

The error is on szDest in the second WideCharToMultiByte call, because a u16string array cannot be passed where an LPSTR is expected.

How can I fix this code?

Sailmaker answered 11/3, 2017 at 11:38 Comment(4)
A little suggestion: change the parameter of your function to const wstring& str to avoid the unnecessary copy.Uela
You may consider accepting the answer of @Davislor instead. Although my answer may be good enough for you, Davislor's answer is much more comprehensive and may help other people looking for a platform-independent solution. Thanks!Uela
@Uela That’s very kind and professional of you.Lilybelle
@Uela I can’t believe I went back to this, but there is in fact a standard method to do this, and it works in clang++ -std=c++14 -stdlib=libc++, and in Visual C++ 19 with some workarounds. The libstdc++ library is just bugged (as of March ’17). Updated my answer.Lilybelle

For a platform-independent solution see this answer.

If you need a solution only for the Windows platform, the following code will be sufficient:

std::wstring wstr( L"foo" );
std::u16string u16str( wstr.begin(), wstr.end() );

On the Windows platform, a std::wstring is interchangeable with std::u16string because sizeof(wstring::value_type) == sizeof(u16string::value_type) and both are UTF-16 (little endian) encoded.

wstring::value_type = wchar_t
u16string::value_type = char16_t

The two are nevertheless distinct types (and whether wchar_t is signed is implementation-defined), so an element-wise conversion is still required. It can be performed with the u16string constructor that takes an iterator pair as arguments; this constructor implicitly converts each wchar_t to char16_t.

Full example console application:

#include <windows.h>
#include <string>

int main()
{
    static_assert( sizeof(std::wstring::value_type) == sizeof(std::u16string::value_type),
        "std::wstring and std::u16string are expected to have the same character size" );
   
    std::wstring wstr( L"foo" );
    std::u16string u16str( wstr.begin(), wstr.end() );
   
    // The u16string constructor performs an implicit conversion like:
    wchar_t wch = L'A';
    char16_t ch16 = wch;
   
    // Need to reinterpret_cast because char16_t const* is not implicitly convertible
    // to LPCWSTR (aka wchar_t const*).
    ::MessageBoxW( 0, reinterpret_cast<LPCWSTR>( u16str.c_str() ), L"test", 0 );
   
    return 0;
}
Uela answered 11/3, 2017 at 11:56 Comment(5)
Here is where I was going to say, this is not portable code because, on many implementations, a wstring is UCS-4, not UTF-16, and the conversion fails for characters from U+10000 to U+10FFFF. However, after writing a bunch of test code, what I discovered is that the standard library does not implement conversion from UCS-4 or wchar_t to UTF-16 directly (although the hook is there), but only round-trip conversion through UTF-8. Nobody would actually do that. Even so, this code will fail on some implementations.Lilybelle
@Lilybelle Thanks for pointing that out! I was aware of that, that's why I've put the static_assert in there. As OP tagged question with "winapi" I thought the trivial conversion was sufficient. But it's useful to have an answer with platform-independent code too.Uela
There is a better answer for non-Windows platforms.Gearard
For others: the byte order (endianness) of char16_t data depends on the system's endianness. The statement "on the Windows platform, std::wstring and u16string are both UTF-16 (little endian) encoded" reflects the fact that Windows overwhelmingly runs on little-endian hardware such as x86.Angloindian
Why don't just u16string ustr=(char16_t*)a_wstring.c_str() on Windows?Mumps

Update

I had thought the standard version did not work, but in fact this was simply due to bugs in the Visual C++ and libstdc++ 3.4.21 runtime libraries. It does work with clang++ -std=c++14 -stdlib=libc++. Here is a version that tests whether the standard method works on your compiler:

#include <codecvt>
#include <cstdlib>
#include <cstring>
#include <cwctype>
#include <iostream>
#include <locale>
#include <clocale>
#include <vector>

using std::cout;
using std::endl;
using std::exit;
using std::memcmp;
using std::size_t;

using std::wcout;

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif

void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
  // Windows needs a little non-standard magic.
  constexpr char cp_utf16le[] = ".1200";
  setlocale( LC_ALL, cp_utf16le );
  _setmode( _fileno(stdout), _O_U16TEXT );
#else
  // The correct locale name may vary by OS, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  std::locale::global(std::locale(locale_name));
  std::wcout.imbue(std::locale());
#endif
}

int main(void)
{
  constexpr char16_t msg_utf16[] = u"¡Hola, mundo! \U0001F600"; // Shouldn't assume endianness.
  constexpr wchar_t msg_w[] = L"¡Hola, mundo! \U0001F600";
  constexpr char32_t msg_utf32[] = U"¡Hola, mundo! \U0001F600";
  constexpr char msg_utf8[] = u8"¡Hola, mundo! \U0001F600";

  init_locale();

  const std::codecvt_utf16<wchar_t, 0x1FFFF, std::little_endian> converter_w;
  const size_t max_len = sizeof(msg_utf16);
  std::vector<char> out(max_len);
  std::mbstate_t state{}; // Conversion state must start zero-initialized.
  const wchar_t* from_w = nullptr;
  char* to_next = nullptr;

  converter_w.out( state, msg_w, msg_w+sizeof(msg_w)/sizeof(wchar_t), from_w, out.data(), out.data() + out.size(), to_next );


  if (memcmp( msg_utf8, out.data(), sizeof(msg_utf8) ) == 0 ) {
    wcout << L"std::codecvt_utf16<wchar_t> converts to UTF-8, not UTF-16!" << endl;
  } else if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
    wcout << L"std::codecvt_utf16<wchar_t> conversion not equal!" << endl;
  } else {
    wcout << L"std::codecvt_utf16<wchar_t> conversion is correct." << endl;
  }
  out.clear();
  out.resize(max_len);

  const std::codecvt_utf16<char32_t, 0x1FFFF, std::little_endian> converter_u32;
  const char32_t* from_u32 = nullptr;
  state = std::mbstate_t{}; // Reset the conversion state before reuse.
  converter_u32.out( state, msg_utf32, msg_utf32+sizeof(msg_utf32)/sizeof(char32_t), from_u32, out.data(), out.data() + out.size(), to_next );

  if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
    wcout << L"std::codecvt_utf16<char32_t> conversion not equal!" << endl;
  } else {
    wcout << L"std::codecvt_utf16<char32_t> conversion is correct." << endl;
  }

  wcout << msg_w << endl;
  return EXIT_SUCCESS;
}

Previous

A bit late to the game, but here’s a version that additionally checks whether wchar_t is 32 bits wide (as it is on Linux) and, if so, performs surrogate-pair conversion. I recommend saving this source as UTF-8 with a BOM. Here is a link to it on ideone.

#include <cassert>
#include <cwctype>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif

using std::size_t;

void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
  // Windows needs a little non-standard magic.
  constexpr char cp_utf16le[] = ".1200";
  setlocale( LC_ALL, cp_utf16le );
  _setmode( _fileno(stdout), _O_U16TEXT );
#else
  // The correct locale name may vary by OS, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  std::locale::global(std::locale(locale_name));
  std::wcout.imbue(std::locale());
#endif
}

std::u16string make_u16string( const std::wstring& ws )
/* Creates a UTF-16 string from a wide-character string.  Any wide characters
 * outside the allowed range of UTF-16 are mapped to the sentinel value U+FFFD,
 * per the Unicode documentation. (http://www.unicode.org/faq/private_use.html
 * retrieved 12 March 2017.) Unpaired surrogates in ws are also converted to
 * sentinel values.  Noncharacters, however, are left intact.  As a fallback,
 * if wide characters are the same size as char16_t, this does a more trivial
 * construction using that implicit conversion.
 */
{
  /* We assume that, if this test passes, a wide-character string is already
   * UTF-16, or at least converts to it implicitly without needing surrogate
   * pairs.
   */
  if ( sizeof(wchar_t) == sizeof(char16_t) ) {
    return std::u16string( ws.begin(), ws.end() );
  } else {
    /* The conversion from UTF-32 to UTF-16 might possibly require surrogates.
     * A surrogate pair suffices to represent all wide characters, because all
     * characters outside the range will be mapped to the sentinel value
     * U+FFFD.  Add one character for the terminating NUL.
     */
    const size_t max_len = 2 * ws.length() + 1;
    // Our temporary UTF-16 string.
    std::u16string result;

    result.reserve(max_len);

    for ( const wchar_t& wc : ws ) {
      const std::wint_t chr = wc;

      if ( chr < 0 || chr > 0x10FFFF || (chr >= 0xD800 && chr <= 0xDFFF) ) {
        // Invalid code point.  Replace with sentinel, per Unicode standard:
        constexpr char16_t sentinel = u'\uFFFD';
        result.push_back(sentinel);
      } else if ( chr < 0x10000UL ) { // In the BMP.
        result.push_back(static_cast<char16_t>(wc));
      } else {
        const char16_t leading = static_cast<char16_t>( 
          ((chr-0x10000UL) / 0x400U) + 0xD800U );
        const char16_t trailing = static_cast<char16_t>( 
          ((chr-0x10000UL) % 0x400U) + 0xDC00U );

        result.append({leading, trailing});
      } // end if
    } // end for

   /* The returned string is shrunken to fit, which might not be the Right
    * Thing if there is more to be added to the string.
    */
    result.shrink_to_fit();

    // We depend here on the compiler to optimize the move constructor.
    return result;
  } // end if
  // Not reached.
}

int main(void)
{
  static const std::wstring wtest(L"☪☮∈✡℩☯✝ \U0001F644");
  static const std::u16string u16test(u"☪☮∈✡℩☯✝ \U0001F644");
  const std::u16string converted = make_u16string(wtest);

  init_locale();

  std::wcout << L"sizeof(wchar_t) == " << sizeof(wchar_t) << L".\n";

  for( size_t i = 0; i <= u16test.length(); ++i ) {
    if ( u16test[i] != converted[i] ) {
      std::wcout << std::hex << std::showbase
                 << std::right << std::setfill(L'0')
                 << std::setw(4) << (unsigned)converted[i] << L" ≠ "
                 << std::setw(4) << (unsigned)u16test[i] << L" at "
                 << i << L'.' << std::endl;
      return EXIT_FAILURE;
    } // end if
  } // end for

  std::wcout << wtest << std::endl;

  return EXIT_SUCCESS;
}

Footnote

Since someone asked: The reason I suggest UTF-8 with BOM is that some compilers, including MSVC 2015, will assume a source file is encoded according to the current code page unless there is a BOM or you specify an encoding on the command line. No encoding works on all toolchains, unfortunately, but every tool I’ve used that’s modern enough to support C++14 also understands the BOM.

Lilybelle answered 12/3, 2017 at 4:29 Comment(4)
I suggest including the link in your post, so it won't get buried in the comments. I would probably do result.reserve(ws.length()) instead of 2 * ws.length() + 1, because surrogates are quite rare and thus we could avoid reallocation in the common case.Uela
Thanks. I’ve edited the code a lot since you posted that; among other improvements, it now handles invalid codepoints by mapping them to a sentinel value instead of throwing an assertion. New link (also added to the post): ideone.com/nrCowQLilybelle
@Uela Go to Twitter and you’ll find plenty of strings that are the worst-case scenario, entirely composed of characters outside the BMP. Kids these days and their emoji.Lilybelle
Confirmed to work with MSVC++2017 (if you select a proper font in the console, it shows most characters except for the "face with rolling eyes" \U0001F644).Uela
- To convert CString to std::wstring and string

    string CString2string(CString str)
    {
        int bufLen = WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, NULL, 0, NULL,NULL);
        char *buf = new char[bufLen];
        WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, buf, bufLen, NULL, NULL);
        string sRet(buf);
        delete[] buf;
        return sRet;
    }
    CString strFileName = _T("test.txt");  // _T() so this also compiles in Unicode builds
    wstring wFileName(strFileName.GetBuffer());
    strFileName.ReleaseBuffer();
    string sFileName = CString2string(strFileName);

- To convert string to CString

    CString string2CString(string s)
    {
        // Use CP_UTF8 to match CString2string above, so non-ASCII text round-trips.
        int bufLen = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
        WCHAR *buf = new WCHAR[bufLen];
        MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, buf, bufLen);
        CString strRet(buf);
        delete[] buf;
        return strRet;
    }
    string sFileName = "test.txt";
    CString strFileName = string2CString(sFileName);
Spitfire answered 10/1, 2022 at 5:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.