Why isn't UTF-8 allowed as the "ANSI" code page?
C

4

19

The Windows _setmbcp function allows any valid code page...

(except UTF-7 and UTF-8, which are not supported)

OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.

But why not UTF-8?

As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?

Chelyuskin answered 8/6, 2010 at 6:4 Comment(1)
Did you know that CP65001 is Windows' name for UTF-8? It's not well documented, but you can use it in a surprising number of places, though there are some bugs, for instance in WriteFile().Affluence
B
7

The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.

Basically, when the Ansi code page stuff was originally designed, UTF-8 wasn't even invented and so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single byte, with the exception of some East Asian code pages which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.

Backsight answered 8/6, 2010 at 6:9 Comment(2)
I agree that all new development should be in Unicode. But I had reasons to propose using UTF-8 instead of UTF-16. (1) My team wrote a million lines of non-Unicode-aware code before anyone gave a damn about it, and now it would be a massive effort to change all those char-based strings to wchar_t-based ones. (2) We have plans to port our product to Linux, on which UTF-8 tends to be preferred.Chelyuskin
As of Windows Version 1903 (May 2019 Update), you can use the UTF-8 code page.Coorg
T
7

_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
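
The conversion the A functions perform can also be done by hand, passing CP_UTF8 instead of CP_ACP, which is roughly what the question describes doing manually. The following is only a sketch: utf8_to_wide is a hypothetical helper name, and the non-Windows branch is a stub added so the snippet compiles elsewhere.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to UTF-16.
   Returns the number of wchar_t units written, including the terminator,
   or 0 on failure (the same convention MultiByteToWideChar uses). */
int utf8_to_wide(const char *utf8, wchar_t *out, int out_units)
{
    /* -1 means "input is NUL-terminated". MB_ERR_INVALID_CHARS makes
       malformed UTF-8 fail instead of being silently replaced. */
    return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               utf8, -1, out, out_units);
}
#else
/* Stub for non-Windows builds (an assumption of this sketch):
   always reports failure. */
int utf8_to_wide(const char *utf8, wchar_t *out, int out_units)
{
    (void)utf8; (void)out; (void)out_units;
    return 0;
}
#endif
```

The resulting buffer can then be passed to the W version of whichever API is being called, so the system ANSI code page never enters the picture.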

Truncation answered 21/7, 2010 at 22:0 Comment(1)
Does Microsoft state this anywhere explicitly? If they indeed do this, then I see no reason why there shouldn't be a way to somehow tell the runtime to use CP_UTF8 when using the ANSI functions.Caphaitien
G
6

Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.

Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
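
To make the "at most two bytes" expectation concrete, here is a small sketch that reads the length of a UTF-8 sequence from its lead byte. utf8_seq_len is a hypothetical helper, not a Windows or CRT function. Code written on the DBCS assumption only ever distinguishes "single byte" from "lead byte plus one trail byte", so the 3 and 4 cases below are the ones it cannot handle.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` cannot start a well-formed sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* 0xxxxxxx: ASCII        */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx               */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx               */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx, up to U+10FFFF */
    return 0;  /* continuation byte (0x80-0xBF) or invalid lead byte */
}
```

For example, the euro sign is encoded as the bytes E2 82 AC, so utf8_seq_len(0xE2) reports a three-byte character, which a two-byte-maximum parser would split in the wrong place.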

Glomeration answered 3/2, 2014 at 9:42 Comment(10)
ANSI code pages are not limited to two bytes in Windows. The progression of char was SBCS->DBCS->MBCS and for wchar_t was UCS2->UTF16. I see no good reason for MBCS not to work with a UTF8 code page and char.Essence
@Essence What is an example of an ANSI code page supported by Windows that uses more than two bytes per code point? AFAIK, for Windows, MBCS means DBCS (and DBCS means 1- or 2-byte characters), and msdn.microsoft.com/en-us/library/cwe8bzh0.aspx apparently confirms that.Glomeration
see Code Page Identifiers I posted above: Windows XP and later: GB18030 Simplified Chinese (4 byte) The reference you posted states: "Support for a form of multibyte character set (MBCS) called double-byte character set (DBCS) on all platforms." IOW the form called DBCS is a subset of MBCS. This is for "all platforms." See further down on the page: "When run on an MBCS-enabled version of the Windows operating system [tools are] completely MBCS-enabled."Essence
Note that on Code Page Identifiers UTF-7 and UTF-8 are listed. I haven't tried this, but UTF-8 is certainly not single or double byte.Essence
@Essence Yes, CP_UTF7 and CP_UTF8 have been available for a long time. They are not, however, available to use as the default code page. They're only for use with MultiByteToWideChar/WideCharToMultiByte. And yes, DBCS is a form of MBCS, but even further down on the page I cited, it explicitly states "MBCS always means DBCS. Character sets wider than 2 bytes are not supported." On msdn.microsoft.com/en-us/library/b6ewb9fy.aspx , it states: "With MBCS, characters can be 1 or 2 bytes in size."Glomeration
@Essence And note that my answer (and Michael Kaplan's explanation) states that the ANSI versions of Windows API functions expect 1 or 2 bytes per code point. It does not mean that code pages themselves cannot use more than 2 bytes per character. It means that any code page used as the default code page (the one that controls the xxxA functions) cannot use more than 2 bytes per character. If you can find evidence to the contrary, please provide an example.Glomeration
Although the use of the distinct symbol MBCS would be unnecessary, and the APIs all allow for the possibility of any number of bytes (e.g. NextChar()), based on the unequivocal (if not contradictory) documentation, I'd say that you're right that there appear to be no actual (supported for default code page) implementations of MBCS > DBCS. Makes me want to write a layer to convert between char/UTF8 and wchar_t/UTF-16 for all text APIs so that cross plat builds no longer have to jump through hoops. Performance hit yes, but what I see is lots of bugs in cross-plat code that assumes ANSI is UTF8.Essence
@Essence Yes, it would of course be possible to support code points of any number of bytes with NextChar. The point is that Microsoft says that they have existing code that doesn't do that, and they decided that it's not worth the time or effort to find, change, and test it all.Glomeration
Not to mention all the legacy mission-critical programs that could be broken by the new functions not being bug-for-bug compatible...Galliard
They finally made it possible to use UTF-8 as a locale.Darnley
D
6

The reason is exactly what was said in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters that are longer than 2 bytes.

Microsoft said that a UTF-8 locale might break some functions as they were written to assume multibyte encodings used no more than 2 bytes per character, thus code pages with more bytes such as UTF-8 (and also GB 18030, cp54936) could not be set as the locale.

https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8

So UTF-8 was allowed in functions like read/write, but not as a locale.


However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS has even started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions it recommended before. There are new options in MSVC, /execution-charset:utf-8 and /utf-8, to set the charset, or you can set the ActiveCodePage property in the appxmanifest of a UWP app.

Before those options were introduced, Windows 10 Insider build 17035 had already added a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox for setting the locale code page to UTF-8.

[Screenshot: the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox]

To open that dialog box, open the Start menu, type "region", and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.

After enabling it, you can call setlocale() to switch to a UTF-8 locale:

Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

UTF-8 Support
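
A minimal sketch of that setlocale call follows. enable_utf8_locale is a hypothetical helper name, not part of the C runtime. The return value is worth checking, since setlocale returns NULL when the runtime does not accept the requested locale name, as an older CRT would for ".utf8".

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: asks the C runtime for a UTF-8 locale.
   Returns 1 if the runtime accepted ".utf8", 0 if it rejected it. */
int enable_utf8_locale(void)
{
    /* setlocale returns NULL if the request cannot be honored. */
    return setlocale(LC_ALL, ".utf8") != NULL;
}
```

Calling it once at startup and falling back to the wide-character path when it returns 0 keeps the program working on runtimes without UTF-8 locale support.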

You can also use this on older Windows versions:

To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.


Darnley answered 24/8, 2020 at 7:16 Comment(0)
