If I want to make the following work on Windows, what is the correct locale, and how do I detect that it is actually present? Does this code work universally, or is it just my system?
In the past, UTF-8 (and some other code pages) wasn't allowed as the system locale. Microsoft said that a UTF-8 locale might break some functions because they were written to assume that multibyte encodings use no more than 2 bytes per character, so code pages with longer sequences such as UTF-8 (and also GB 18030, cp54936) could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
However, Microsoft has gradually introduced UTF-8 locale support and started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions as before:
Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.
-A vs. -W APIs
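Since the quote says that -A APIs operate in UTF-8 only when the ANSI code page is configured for UTF-8, one way to detect this at run time is to compare the process's ANSI code page with 65001. The following is a minimal sketch, not a definitive implementation: the helper names `is_utf8_code_page` and `ansi_code_page_is_utf8` are made up for illustration, and the non-Windows branch simply reports false as an assumption so that the code compiles elsewhere.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Code page identifier that Windows uses for UTF-8 (the value of CP_UTF8).
constexpr std::uint32_t kUtf8CodePage = 65001;

// Pure helper (hypothetical name) so the comparison can be exercised anywhere.
constexpr bool is_utf8_code_page(std::uint32_t code_page) {
    return code_page == kUtf8CodePage;
}

// Returns true when the process ANSI code page is UTF-8.
// Assumption: outside Windows there is no ANSI code page, so report false.
bool ansi_code_page_is_utf8() {
#ifdef _WIN32
    return is_utf8_code_page(GetACP());
#else
    return false;
#endif
}
```

A program could call `ansi_code_page_is_utf8()` at startup and fall back to the -W APIs with explicit conversion when it returns false.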
First, they added a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox in Windows 10 Insider build 17035 for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it you can call setlocale as normal:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
You can also use this in older Windows versions
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
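To detect whether the ".utf8" locale from the quote is actually available, you can check the return value of setlocale, which is a null pointer when the request cannot be satisfied. A minimal sketch follows; the helper names `try_set_locale` and `enable_utf8_locale` are hypothetical, and whether ".utf8" is accepted depends on the C runtime in use.

```cpp
#include <clocale>

// std::setlocale returns a null pointer when the requested locale is not
// available, and leaves the current locale unchanged in that case.
bool try_set_locale(const char* name) {
    return std::setlocale(LC_ALL, name) != nullptr;
}

// Hypothetical helper: ask for the UCRT's ".utf8" locale and report success.
bool enable_utf8_locale() {
    return try_set_locale(".utf8");
}
```

If `enable_utf8_locale()` returns false, the program is still running in whatever locale it had before the call, so it can fall back to another strategy.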
Later, in 2019, they added the ability for programs to use the UTF-8 locale without even setting the UTF-8 beta flag above. You can use the /execution-charset:utf-8 or /utf-8 options when compiling with MSVC, or set the ActiveCodePage property in appxmanifest.
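As far as I know, the /utf-8 and /execution-charset:utf-8 options control how narrow string literals are encoded in the binary, so their effect can be checked at compile time. The sketch below tests whether U+00F6 is stored as the two UTF-8 bytes C3 B6; the function name `execution_charset_is_utf8` is made up for this example, and the check only covers that one character.

```cpp
// True when the narrow execution character set encodes U+00F6 as the
// UTF-8 bytes C3 B6 (plus the terminating NUL, hence a size of 3).
constexpr bool execution_charset_is_utf8() {
    return sizeof("\u00f6") == 3 &&
           static_cast<unsigned char>("\u00f6"[0]) == 0xC3 &&
           static_cast<unsigned char>("\u00f6"[1]) == 0xB6;
}
```

Because the function is constexpr, it can also be used in a `static_assert` to make a build fail when the expected compiler option is missing.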
Although there isn't good support for named locales, Visual Studio 2010 does include the UTF-8 conversion facets required by C++11: std::codecvt_utf8 for UCS2 and std::codecvt_utf8_utf16 for UTF-16:
#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

void prepare_file()
{
    // UTF-8 data
    char utf8[] = {'\x7a',                       // latin small letter 'z' U+007a
                   '\xe6','\xb0','\xb4',         // CJK ideograph "water" U+6c34
                   '\xf0','\x9d','\x84','\x8b'}; // musical sign segno U+1d10b
    std::ofstream fout("text.txt");
    fout.write(utf8, sizeof utf8);
}

void test_file_utf16()
{
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    std::cout << "Read from file using UTF-8/UTF-16 codecvt\n";
    for(wchar_t c; fin >> c; )
        std::cout << std::hex << std::showbase << c << '\n';
}

void test_file_ucs2()
{
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));
    std::cout << "Read from file using UTF-8/UCS2 codecvt\n";
    for(wchar_t c; fin >> c; )
        std::cout << std::hex << std::showbase << c << '\n';
}

int main()
{
    prepare_file();
    test_file_utf16();
    test_file_ucs2();
}
This outputs, on my Visual Studio 2010 EE SP1:
Read from file using UTF-8/UTF-16 codecvt
0x7a
0x6c34
0xd834
0xdd0b
Read from file using UTF-8/UCS2 codecvt
0x7a
0x6c34
0xd10b
Press any key to continue . . .
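The two outputs differ only for U+1D10B. The values 0xd834 and 0xdd0b are the UTF-16 surrogate pair for that code point, while 0xd10b matches its low 16 bits, which is presumably what remains when the code point is squeezed into a 16-bit wchar_t. The arithmetic can be sketched as follows; `SurrogatePair`, `to_surrogates` and `truncate_to_16_bits` are illustrative names, not part of any library.

```cpp
#include <cstdint>

struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Splits a code point above U+FFFF into its UTF-16 surrogate pair.
// Precondition: 0x10000 <= code_point <= 0x10FFFF.
constexpr SurrogatePair to_surrogates(std::uint32_t code_point) {
    return SurrogatePair{
        static_cast<std::uint16_t>(0xD800 + ((code_point - 0x10000) >> 10)),
        static_cast<std::uint16_t>(0xDC00 + ((code_point - 0x10000) & 0x3FF))};
}

// Keeps only the low 16 bits, which is all a 16-bit wchar_t can hold.
constexpr std::uint16_t truncate_to_16_bits(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFF);
}
```

For U+1D10B, `to_surrogates` gives 0xD834 and 0xDD0B, and `truncate_to_16_bits` gives 0xD10B, which lines up with the output above.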
Per MSDN, it would be named "english_us.65001". But code page 65001 is somewhat flaky on Windows.
MessageBoxA("Hellö"). However, it has explicit support: MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8input.c_str(), ... – Varietal
/utf-8 option seems to be unrelated to the checkbox though. It sets the execution and source charsets of the binary, but I might be wrong. – Nuno
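For the explicit MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...) route mentioned in the comments, a conversion helper might look like the sketch below. It uses the usual two-call pattern (first query the required length, then convert). The function name `utf8_to_wide` is hypothetical, and the non-Windows branch just reports failure so the example compiles everywhere; it is not a portable converter.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Converts UTF-8 to UTF-16 and rejects malformed input.
// Returns false on invalid UTF-8 or, in this sketch, on non-Windows builds.
bool utf8_to_wide(const std::string& utf8, std::wstring& out) {
    out.clear();
    if (utf8.empty()) return true;  // MultiByteToWideChar rejects a zero length
#ifdef _WIN32
    const int in_len = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units are needed.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), in_len, nullptr, 0);
    if (needed <= 0) return false;
    out.resize(static_cast<std::size_t>(needed));
    // Second call: perform the conversion into the sized buffer.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, &out[0], needed);
    if (written != needed) { out.clear(); return false; }
    return true;
#else
    return false;  // assumption: no Win32 API available on this platform
#endif
}
```

The resulting wide string can then be passed to a -W API such as MessageBoxW, which does not depend on the ANSI code page.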