Difference between MBCS and UTF-8 on Windows

Asked 21/7, 2010 at 11:11 Answered 12/5, 2013 at 1:21

Solved windows unicode character-encoding mbcs

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:

Unicode is a 16-bit character encoding

This negates whatever I read about Unicode. I thought Unicode can be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?

Vellicate answered 21/7, 2010 at 11:11 Comment(0)

118

I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?

Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

Each of these function pairs also has a macro without the suffix, which resolves to one version or the other depending on whether the UNICODE macro is defined.

#ifdef UNICODE
   #define MessageBox MessageBoxW
#else
   #define MessageBox MessageBoxA
#endif

In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif

This, however, was a bad idea. You should always explicitly specify the character type.
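To make the contrast concrete, here is a minimal portable sketch of the same selection pattern. It does not use the real Windows headers: TextLengthA, TextLengthW, DEMO_UNICODE, DEMO_TCHAR, and DEMO_TEXT are made-up stand-ins for an "A"/"W" function pair, the UNICODE macro, TCHAR, and the TEXT() macro.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an "A"/"W" API pair (not real Windows functions).
inline std::size_t TextLengthA(const char* s)    { return std::strlen(s); }
inline std::size_t TextLengthW(const wchar_t* s) { return std::wcslen(s); }

// The same selection pattern, keyed on a made-up macro instead of UNICODE.
#ifdef DEMO_UNICODE
    typedef wchar_t DEMO_TCHAR;
    #define TextLength TextLengthW
    #define DEMO_TEXT(s) L##s
#else
    typedef char DEMO_TCHAR;
    #define TextLength TextLengthA
    #define DEMO_TEXT(s) s
#endif

// Generic style: which function and which character type are used
// depends on a build setting that is invisible at the call site.
inline std::size_t generic_call() {
    const DEMO_TCHAR* msg = DEMO_TEXT("hello");
    return TextLength(msg);
}

// Explicit style: the character type is fixed by the code itself.
inline std::size_t explicit_call() {
    return TextLengthW(L"hello");
}
```

With DEMO_UNICODE undefined, generic_call() goes through the char version, while explicit_call() always uses wchar_t regardless of build settings.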

What I am not getting is how UTF-8 is conceptually different from an MBCS encoding.

MBCS stands for "multi-byte character set". For the literal-minded, it seems that UTF-8 would qualify.

But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), ~~but NOT UTF-8.~~

To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.

This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.

Update: As of Windows 10 version 1903 (May 2019 Update), UTF-8 is supported as an "ANSI" code page. Thus, my original (2010) recommendation to always use "W" functions no longer applies, unless you need to support old versions of Windows. See UTF-8 Everywhere for text-handling advice.
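For readers who want to see what the UTF-8 to UTF-16 step involves, here is a hand-rolled portable sketch. It is not how MultiByteToWideChar is implemented and is not a replacement for it; utf8_to_utf16 is a hypothetical helper name, and the validation is deliberately minimal (for example, overlong encodings are not rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16 code units.
// Simplified sketch: it checks lead and continuation bytes but not overlong forms.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));   // one 16-bit unit
        } else {
            cp -= 0x10000;                               // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

In real Windows code you would call MultiByteToWideChar with CP_UTF8 instead; the sketch only shows that the conversion is mechanical.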

Unicode is a 16-bit character encoding

This negates whatever I read about Unicode.

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.

Turkoman answered 21/7, 2010 at 13:42 Comment(9)

"This is for historical" I wonder why they haven't fixed their documentation in the previous >15 years. – Sinistrous 23/6, 2012 at 8:50

They are Microsoft. History is false. Resistance is futile. – Welcome 18/7, 2012 at 6:32

I think there's still a difference between UTF-16 and UCS-2: UTF-16 can extend characters into a total of 32 bits, to define what doesn't fit into 16 bits. UCS-2 is only 16 bits, but using UCS-2, for example, in SQL Server, conserves the full UTF-16 encoding, because it stores extended 32bit UTF-16 chars in two 16bit characters. Correct me if I am wrong! – Incondite 20/9, 2012 at 9:19

+1 Very useful, thank you. The Wikipedia Unicode article does not mention anything about Unicode being a 21-bit coded char set. Could you please provide further information on your source? Thank you. – Rehearsal 5/11, 2013 at 16:23

@ErikHart: Older versions of Windows treated WCHAR strings as UCS-2, which is limited to the 16-bit Basic Multilingual Plane (BMP). Modern versions of Windows use the same APIs, but treat the WCHAR strings as UTF-16, which allows the encoding of code points beyond the BMP using surrogate pairs. This doesn't give you a full 32-bits, but that's okay. UTF-16 gives you code points from 0 to 0x10FFFF, which is the entire Unicode repertoire. Note that UTF-8 is also limited (by the standard) to 0-0x10FFFF, even though the scheme it employs could give you a full 32-bit range. – Plasticizer 4/12, 2013 at 17:6

This is a great read: utf8everywhere.org This issue has been brought to the forefront again because MBCS has been deprecated in VS2013 (MFC)... – Assentation 7/3, 2014 at 16:37

@Rehearsal perhaps unicode.org? unicode.org/faq/utf_bom.html, first question. – Evacuate 15/3, 2021 at 20:14

@YakovGalka Windows and Microsoft docs are changing... "New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization." learn.microsoft.com/en-us/windows/win32/intl/code-pages "Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. [...]" learn.microsoft.com/en-us/windows/apps/design/globalizing/… – Dewaynedewberry 11/5, 2022 at 18:15

@Dewaynedewberry yep; they finally introduced UTF-8 in Windows 10 in 2019; somebody in Microsoft read the manifest and did the right thing. Resistance isn't always futile. I will keep my 2012 comment for historical reasons though ;) – Sinistrous 11/5, 2022 at 19:30

_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines gets called. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros, _MBCS or _UNICODE, is defined.

_UNICODE & _MBCS Not Defined: strlen  
_MBCS Defined: _mbslen  
_UNICODE Defined: wcslen

To explain the difference between these string-length functions, consider the following example.
Suppose you have a computer running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile and run a GBK-encoded source file containing this code:

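// Cast to int because these functions return size_t, which %d does not match.
printf("%d\n", (int)_mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", (int)strlen("I爱你M"));
// Deliberately wrong cast: reinterprets the GBK bytes as wchar_t to show the effect.
printf("%d\n", (int)wcslen((const wchar_t*)"I爱你M"));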

The result would be 4 6 3.

Here is the hexadecimal representation of I爱你M in GBK.

GBK:             49 B0 AE C4 E3 4D 00

_mbslen knows this string is encoded in GBK, so it can interpret the string correctly and gets the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.

strlen only looks for the terminating 0x00 byte and counts every byte before it, so it gets 6.

wcslen assumes this byte array is encoded in UTF-16LE and counts every two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.

As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. Thus the result is not guaranteed to be 3 if the byte following the final 00 is not also 00.
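The three counting strategies can be reproduced without the Microsoft CRT. The sketch below is only an approximation of what the real functions do: gbk_char_count, byte_count, and unit16_count are hypothetical names, the GBK rule is simplified to "a byte in 0x81..0xFE is a lead byte that consumes the next byte", and kSample is given an explicit second 00 so that the 16-bit scan has a well-defined end.

```cpp
#include <cstddef>

// The GBK bytes of "I爱你M" from above, with an explicit 00 00 terminator
// so that the 16-bit scan below stops at a defined place.
const unsigned char kSample[] = {0x49, 0xB0, 0xAE, 0xC4, 0xE3, 0x4D, 0x00, 0x00};

// Roughly what a multi-byte-aware count does for GBK (simplified rule):
// a byte in 0x81..0xFE is a lead byte and is counted together with the next byte.
std::size_t gbk_char_count(const unsigned char* s) {
    std::size_t n = 0;
    while (*s != 0) {
        if (*s >= 0x81 && *s <= 0xFE && s[1] != 0) {
            s += 2;   // double-byte character
        } else {
            s += 1;   // single-byte character
        }
        ++n;
    }
    return n;
}

// Roughly what strlen does: count bytes up to the first 00.
std::size_t byte_count(const unsigned char* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Roughly what wcslen does with 16-bit wchar_t: count two-byte units
// until a unit whose two bytes are both 00.
std::size_t unit16_count(const unsigned char* s) {
    std::size_t n = 0;
    while (s[2 * n] != 0 || s[2 * n + 1] != 0) ++n;
    return n;
}
```

On kSample these return 4, 6, and 3 respectively, matching the results above. Without the second 00 byte, unit16_count would keep reading past the end of the buffer, which is exactly the caveat about the terminator.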

Stayathome answered 22/10, 2012 at 12:14 Comment(4)

Great. But in my humble opinion, the return value of wcslen((const wchar_t*)"I爱你M") is not guaranteed to be 3. If 49 B0 AE C4 E3 4D 00 is not followed by a byte 00, wcslen will return a value greater than 3. – Machmeter 11/4, 2015 at 1:30

I mean, only 00 00 is considered to be a wide nul character. – Machmeter 12/4, 2015 at 12:25

No. L"I爱你M" is guaranteed to end with 4D 00 00 00. But (const wchar_t*)"I爱你M" isn't. – Machmeter 12/4, 2015 at 13:2

Just a writing issue, _UNICODE & _MBCS Not Defined: strlen should be written Both _UNICODE & _MBCS Not Defined: strlen. Because I thought it means _UNICODE is defined and _MBCS is not defined: strlen. – Noblenobleman 3/10, 2016 at 19:12

MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.

ASCII and the single-byte "ANSI" code pages (such as Windows-1252) are not multi-byte.

UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).

However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:

UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.

It is therefore not correct that Unicode is a 16-bit character encoding. It's rather something like a 21-bit character set, as it encompasses code points U+0000 up to U+10FFFF.
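The size rules above can be written down directly. This is a small sketch rather than a library API: utf8_bytes and utf16_bytes are hypothetical helper names that return how many bytes one code point occupies in each encoding.

```cpp
#include <cstdint>
#include <stdexcept>

// Reject values outside U+0000..U+10FFFF and the surrogate range,
// which are not encodable Unicode scalar values.
inline void require_scalar_value(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
}

// Bytes needed to encode one code point in UTF-8 (1 to 4).
inline int utf8_bytes(std::uint32_t cp) {
    require_scalar_value(cp);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed to encode one code point in UTF-16 (2, or 4 for a surrogate pair).
inline int utf16_bytes(std::uint32_t cp) {
    require_scalar_value(cp);
    return cp < 0x10000 ? 2 : 4;
}
```

For example, U+0041 (A) takes 1 byte in UTF-8 and 2 in UTF-16, U+7231 (爱) takes 3 and 2, and U+1F600 takes 4 in both. The largest code point, 0x10FFFF, fits in 21 bits.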

Solubility answered 21/7, 2010 at 11:17 Comment(5)

Sure, but in the Windows API documentation they use Unicode to mean UTF-16. (I suspect the support for that is limited and it's safer to assume UCS-2.) Yes, the Unicode standard goes beyond 21-bits. – Metameric 21/7, 2010 at 11:20

That piece of documentation might make it look as if Unicode were UTF-16, however that would not be correct (if at all, it's the other way around). UTF-16 is just one encoding of Unicode. And yes, in fact they might actually mean UCS-2, not UTF-16. I'm not entirely sure about that. – Solubility 21/7, 2010 at 11:29

Windows NT only supported UCS-2. Windows has supported full UTF-16 since Windows 2000, IIRC. – Michaels 21/7, 2010 at 21:49

This answer is true, but to understand the Microsoft documentation, you need to recognize that MSDN uses some of these terms in specific ways that differ from their generally accepted literal meanings. When MSDN says Unicode, they mean UTF-16 (or, in really old versions, UCS-2). When MSDN says ANSI code pages they mean Windows code pages, most of which are single-byte, but some of which (e.g., 950) are indeed MBCS. There's sort-of a "UTF-8 code page" (65001), but it's not well supported other than for converting between UTF-8 and UTF-16. – Plasticizer 4/12, 2013 at 17:29

This answer needs more details on exactly what selecting MBCS does. – Dewaynedewberry 11/5, 2022 at 18:32

As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.

As to the phrasing "Unicode" and "Multi-Byte Character Set", people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. (That is, they mean something less general and more particular to Windows than one might expect if coming from a non-Microsoft-specific understanding of text internationalization.) Those exact phrases show up and tend to get their own separate sections or subsections in Microsoft technical documents, e.g. in Text and Strings in Visual C++.

Jeanettajeanette answered 12/5, 2013 at 1:21 Comment(1)

The link for the MSDN "Generic-Text Mappings in TCHAR.H" document is no longer valid. Here is an Internet Archive WaybackMachine link with the content... web.archive.org/web/20150519040130/https://msdn.microsoft.com/… – Dewaynedewberry 11/5, 2022 at 18:34
