How do you properly use WideCharToMultiByte

I've read the documentation on WideCharToMultiByte, but I'm stuck on this parameter:

lpMultiByteStr
[out] Pointer to a buffer that receives the converted string.

I'm not quite sure how to properly initialize the variable and feed it into the function.

Scoliosis answered 19/10, 2008 at 3:33 Comment(1)
Is there any reason why you seem to ask questions but accept no answers? It's usually good practice on these sites to reward good answers with feedback in recognition for the time people invest in answering your question. You've got a few very good answers below... (nudge)Mhd
B
154

Here are a couple of functions (based on Brian Bondy's example) that use WideCharToMultiByte and MultiByteToWideChar to convert between std::wstring and std::string. They use UTF-8, so no data is lost.

// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if( wstr.empty() ) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo( size_needed, 0 );
    WideCharToMultiByte                  (CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    if( str.empty() ) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo( size_needed, 0 );
    MultiByteToWideChar                  (CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
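
Because the source length is passed explicitly as (int)wstr.size(), the size returned by the first call does not include a terminator, and nothing writes one into the buffer. That is fine: std::string keeps its own terminator. Here is a small portable sketch of that point. writeExactly is a made-up stand-in for an API that writes exactly n bytes and no '\0'; it is not a Windows function.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for an API that writes exactly n bytes
// into dest and never appends a '\0'.
void writeExactly(char *dest, const char *src, std::size_t n)
{
    std::memcpy(dest, src, n);
}

std::string copyWithoutTerminator(const char *src, std::size_t n)
{
    if (n == 0) return std::string();   // avoid taking &result[0] of an empty string
    std::string result(n, '\0');        // n characters; the string manages its own terminator
    writeExactly(&result[0], src, n);   // fills exactly the n characters
    return result;
}
```

The returned string has size n, and c_str() is still null-terminated, so there is no need to reserve an extra byte in the std::string.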
Bouchier answered 22/10, 2010 at 17:59 Comment(10)
It should be noted that prior to C++11 std::string and std::wstring were not guaranteed to have their memory be contiguous.Cacophony
I seriously doubt there has ever been a commercially available stl implementation that doesn't have contiguous vectors. The fact that contiguous memory wasn't required in the first C++ spec was an oversight: herbsutter.com/2008/04/07/…Bouchier
@Bouchier The previous comment was about strings, not vectors. Strings were not guaranteed to be contiguous in C++98 (not the result of the oversight referred to by Sutter), although all real-world implementations make them contiguous.Inoperative
@CHris_F wasn't c_str() guaranteed to return a contiguous one regardless of implementation?Bertiebertila
@Swift c_str() is guaranteed to return a pointer to a contiguous buffer but prior to C++11 this was not guaranteed to be the same as the internal representation of the string.Cacophony
Which is usually OK, because c_str() is supposed to return a "const" string, a one-way output. The only case where it is a problem is if we take the content of c_str(), put it in another string, and compare the two at some point. But the implementations I have dealt with took care of that.Bertiebertila
How does this handle non-English letters such as the Scandinavian ÅåÄäÖöÆæØø? From what I can see it becomes garbled. :-(Cockleshell
Things like this have become somewhat mandatory since the C++17 deprecation around <codecvt>. Example: https://mcmap.net/q/24170/-how-to-read-a-utf-16-text-file-in-c-17/3543437Abert
I wouldn't recommend using (int)wstr.size(), it will fail to put a \0 terminator on the string. Instead just pass -1 to have the function autodetect the string length.Morsel
@Morsel The string/wstring constructor only needs the characters before the terminator; it will add its own terminator. So we don't want or need the \0 to be copied.Peso
B
40

Elaborating on the answer provided by Brian R. Bondy: Here's an example that shows why you can't simply size the output buffer to the number of wide characters in the source string:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of several Asian characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main(void)
{
    /* wcslen returns size_t; cast so the value matches the %u format below */
    unsigned int wcsChars = (unsigned int)wcslen( wcsString);

    /* WideCharToMultiByte returns an int; with -1 as the source length
       the returned size includes the NUL terminator */
    int sizeRequired = WideCharToMultiByte( 950, 0, wcsString, -1,
                                            NULL, 0,  NULL, NULL);

    printf( "Wide chars in wcsString: %u\n", wcsChars);
    printf( "Bytes required for CP950 encoding (excluding NUL terminator): %d\n",
             sizeRequired-1);

    sizeRequired = WideCharToMultiByte( CP_UTF8, 0, wcsString, -1,
                                        NULL, 0,  NULL, NULL);
    printf( "Bytes required for UTF8 encoding (excluding NUL terminator): %d\n",
             sizeRequired-1);
    return 0;
}

And the output:

Wide chars in wcsString: 6
Bytes required for CP950 encoding (excluding NUL terminator): 12
Bytes required for UTF8 encoding (excluding NUL terminator): 18
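
The UTF-8 figure follows from how UTF-8 encodes each code point. As a rough, portable illustration, the sketch below counts the UTF-8 bytes that a UTF-16 string needs. The helper name utf8_bytes_needed is made up for this example, and this is not how WideCharToMultiByte is implemented. Counting an unpaired surrogate as 3 bytes is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed to encode a UTF-16
// string, terminator not included. Illustration only.
std::size_t utf8_bytes_needed(const std::u16string &s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;                 // ASCII
        } else if (c < 0x800) {
            bytes += 2;                 // e.g. Latin letters with diacritics
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                 // surrogate pair: one code point outside the BMP
            ++i;                        // skip the low surrogate
        } else {
            bytes += 3;                 // rest of the BMP, e.g. CJK ideographs
        }
    }
    return bytes;
}
```

For the six ideographs in the example, each one falls into the 3-byte branch, which gives the 18 bytes shown in the output. CP950 needs only 12 because it encodes each of these characters in two bytes.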
Bacciferous answered 19/10, 2008 at 19:52 Comment(4)
An excellent example of an important and often neglected aspect of codepage/encoding conversion!Included
-1 The OP asks for help with the lpMultiByteStr parameter. This answer isn't answering the OP, it is a tangent to another posted answer.Adamant
@Error454: They didn't have comments in 2008. Just flag it.Earing
+1 for excluding null, the returned sizeRequired includes space for a null, so proper initialization of lpMultiByteStr must take this into accountWebfoot
P
20

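You use the lpMultiByteStr [out] parameter by creating a new char array and passing it in to be filled. The array must be large enough for the converted bytes plus the null terminator. One wide character can turn into more than one byte, so ask the function for the required size first by passing 0 as the buffer size.

Here are a couple of helper functions that show the usage of all parameters.

#include <windows.h>
#include <string>

std::string wstrtostr(const std::wstring &wstr)
{
    // Convert a Unicode (UTF-16) string to a string in the ANSI code page.
    // First call: a buffer size of 0 returns the bytes needed, terminator included.
    int cbNeeded = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
    if (cbNeeded <= 0) return std::string();
    std::string strTo;
    char *szTo = new char[cbNeeded];
    // Second call: fill the buffer; only use it if the conversion succeeded.
    if (WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, szTo, cbNeeded, NULL, NULL) > 0)
        strTo = szTo;
    delete[] szTo;
    return strTo;
}

std::wstring strtowstr(const std::string &str)
{
    // Convert a string in the ANSI code page to a Unicode (UTF-16) string.
    // First call: a buffer size of 0 returns the wide chars needed, terminator included.
    int cchNeeded = MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, NULL, 0);
    if (cchNeeded <= 0) return std::wstring();
    std::wstring wstrTo;
    wchar_t *wszTo = new wchar_t[cchNeeded];
    // Second call: fill the buffer; only use it if the conversion succeeded.
    if (MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, wszTo, cchNeeded) > 0)
        wstrTo = wszTo;
    delete[] wszTo;
    return wstrTo;
}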

--

Whenever the documentation describes a parameter as a pointer to a type and marks it as an out parameter, you create a variable of that type and pass a pointer to it. The function uses that pointer to fill in your variable.

So you can understand this better:

//pX is an out parameter, it fills your variable with 10.
void fillXWith10(int *pX)
{
  *pX = 10;
}

int main(int argc, char ** argv)
{
  int X;
  fillXWith10(&X);
  return 0;
}
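
lpMultiByteStr works the same way, except that the out parameter is a whole buffer and you also have to say how large it is. Here is a minimal sketch of that calling convention. fillGreeting is a made-up function, not a Windows API; it only mimics the "pass 0 to ask for the size" behaviour of WideCharToMultiByte.

```cpp
#include <cstring>
#include <string>

// Hypothetical function (not part of any real API). It mimics the
// size-query convention of WideCharToMultiByte:
//  - outSize == 0: returns the number of bytes required, terminator included
//  - otherwise:    copies into out and returns bytes written, or 0 if too small
int fillGreeting(char *out, int outSize)
{
    static const char text[] = "hello";
    const int needed = (int)sizeof(text);   // five letters plus '\0'
    if (outSize == 0) return needed;
    if (out == NULL || outSize < needed) return 0;
    std::memcpy(out, text, needed);
    return needed;
}

std::string getGreeting()
{
    int needed = fillGreeting(NULL, 0);              // first call: how big?
    if (needed <= 0) return std::string();
    std::string buffer(needed, '\0');                // the caller owns the buffer
    int written = fillGreeting(&buffer[0], needed);  // second call: fill it
    if (written <= 0) return std::string();
    buffer.resize(written - 1);                      // drop the copied terminator
    return buffer;
}
```

The caller creates the storage, the function fills it, and the return value tells you how much was written (or, on the sizing call, how much room you need).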
Prepare answered 19/10, 2008 at 3:41 Comment(6)
The code should take into account that the number of bytes required in the multibyte char string may be more than the number of characters in the wide character string. A single wide character may result in 2 or more bytes in the multibyte char string, depending on the encodings involved.Bacciferous
Asian characters come to mind as an example, but it really depends on the code page that is used for the conversion. In your example, it would probably not be a problem, because any non-ANSI character would be replaced by a question mark.Munson
To get the size needed for the conversion, call WideCharToMultiByte with 0 as the size of target buffer. It will then return the number of bytes needed for the target buffer size.Munson
Is there a portable, i.e. POSIX, way to do this? WideCharToMultiByte is a Windows function.Magpie
Treating the number of bytes and the number of wide chars as the same breaks this code when working with something like gb2312.Forgave
Since 2008 Windows has changed! Vista -> Win 10. Brian Bondy's answer is still the clearest and most focussed on the OP's question. The documentation linked in the question has moved to hereAubreir
E
1

Here are C wrappers around both WideCharToMultiByte and MultiByteToWideChar. In both cases I make sure to append a null character to the end of the destination buffer.

MultiByteToWideChar does not null-terminate an output string if the input string length is explicitly specified without a terminating null character.

And

WideCharToMultiByte does not null-terminate an output string if the input string length is explicitly specified without a terminating null character.

Even if someone specifies -1 and passes in a null terminated string I still allocate enough space for an additional null character because for my use case this was not an issue.

#include <windows.h>
#include <stdlib.h>

/* The caller must free() the returned buffer. */
wchar_t* utf8_decode( const char* str, int nbytes ) {
    int nchars = 0;
    if ( ( nchars = MultiByteToWideChar( CP_UTF8, 
        MB_ERR_INVALID_CHARS, str, nbytes, NULL, 0 ) ) == 0 ) {
        return NULL;
    }

    wchar_t* wstr = NULL;
    if ( !( wstr = malloc( ( ( size_t )nchars + 1 ) * sizeof( wchar_t ) ) ) ) {
        return NULL;
    }

    wstr[ nchars ] = L'\0';
    if ( MultiByteToWideChar( CP_UTF8, MB_ERR_INVALID_CHARS, 
        str, nbytes, wstr, nchars ) == 0 ) {
        free( wstr );
        return NULL;
    }
    return wstr;
} 


char* utf8_encode( const wchar_t* wstr, int nchars ) {
    int nbytes = 0;
    if ( ( nbytes = WideCharToMultiByte( CP_UTF8, WC_ERR_INVALID_CHARS, 
        wstr, nchars, NULL, 0, NULL, NULL ) ) == 0 ) {
        return NULL;
    }

    char* str = NULL;
    if ( !( str = malloc( ( size_t )nbytes + 1 ) ) ) {
        return NULL;
    }

    str[ nbytes ] = '\0';
    if ( WideCharToMultiByte( CP_UTF8, WC_ERR_INVALID_CHARS, 
        wstr, nchars, str, nbytes, NULL, NULL ) == 0 ) {
        free( str );
        return NULL;
    }
    return str;
}
Estuarine answered 7/4, 2021 at 2:21 Comment(0)
