What is the Windows equivalent for en_US.UTF-8 locale?

3

17

If I want to make the following work on Windows, what is the correct locale, and how do I detect that it is actually present? Does this code work universally, or only on my system?

Hormuz answered 1/12, 2010 at 12:52 Comment(0)
14

In the past, UTF-8 (and some other code pages) wasn't allowed as the system locale. The reason, as summarized on Wikipedia:

Microsoft said that a UTF-8 locale might break some functions as they were written to assume multibyte encodings used no more than 2 bytes per character, thus code pages with more bytes such as UTF-8 (and also GB 18030, cp54936) could not be set as the locale.

https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8

However, Microsoft has gradually introduced UTF-8 locale support and has started recommending the ANSI (-A) APIs again, rather than the Unicode (-W) versions it favored before:

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

-A vs. -W APIs


First, starting with Windows 10 Insider build 17035, they added a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox that sets the locale code page to UTF-8.


To open that dialog box, open the Start menu, type "region", and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.

After enabling it, you can call setlocale as usual:

Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

UTF-8 Support
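To detect whether a given locale name is actually accepted, one option is to check the return value of setlocale, which is a null pointer when the request cannot be honored. The following is a minimal sketch; the helper name `first_available_locale` and the idea of passing a list of candidates are illustrative assumptions, not part of the quoted documentation.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Hypothetical helper: tries each candidate name with setlocale() and returns
// the first one the C runtime accepts, or an empty string if none works.
// setlocale() returns a null pointer when the requested locale is unavailable.
// Note: on success the accepted locale stays installed for the process.
std::string first_available_locale(const std::vector<std::string>& candidates)
{
    for (const std::string& name : candidates) {
        if (std::setlocale(LC_ALL, name.c_str()) != nullptr)
            return name;
    }
    return std::string();
}
```

A caller could pass the candidates ".UTF8", "en_US.UTF-8", and "C" in that order, so that the same code picks the Windows spelling on a recent Universal C Runtime, the POSIX spelling elsewhere, and falls back to the "C" locale otherwise.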

You can also use this on older Windows versions:

To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.


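Later, in 2019, they added the ability for programs to use the UTF-8 code page without setting the UTF-8 beta flag above: set the activeCodePage property in the application manifest (appxmanifest for packaged apps). When compiling with MSVC, you can also use the /execution-charset:utf-8 or /utf-8 options so that narrow string literals are stored as UTF-8 in the binary; these compiler options affect only the source and execution character sets, not the process code page.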
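One way to check at run time whether the checkbox or the manifest setting took effect is to query the process ANSI code page with GetACP and compare it with 65001, the identifier Windows uses for UTF-8. This is only a sketch: the function names `is_utf8_code_page` and `process_acp_is_utf8` are made up for illustration, and the non-Windows branch simply returns false so the file still compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Code page identifier that Windows uses for UTF-8 (the value of CP_UTF8).
constexpr unsigned kUtf8CodePage = 65001;

// True if the given code page identifier denotes UTF-8.
bool is_utf8_code_page(unsigned code_page)
{
    return code_page == kUtf8CodePage;
}

// Hypothetical helper: true if this process has UTF-8 as its ANSI code page,
// which is the condition under which the -A APIs operate in UTF-8.
// On non-Windows builds there is no ANSI code page, so this returns false.
bool process_acp_is_utf8()
{
#ifdef _WIN32
    return is_utf8_code_page(GetACP());
#else
    return false;
#endif
}
```

If `process_acp_is_utf8()` returns false on Windows, a program can still fall back to the -W APIs with explicit conversion.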

Septivalent answered 17/8, 2020 at 15:42 Comment(1)
A nice recap of the new feature! It's amazing it took them so long to say "let's just use utf-8 in the C strings". The /utf-8 option seems to be unrelated to the checkbox, though. It sets the execution and source charsets of the binary, but I might be wrong. - Nuno
12

Although there isn't good support for named locales, Visual Studio 2010 does include the UTF-8 conversion facets required by C++11: std::codecvt_utf8 for UCS2 and std::codecvt_utf8_utf16 for UTF-16:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
void prepare_file()
{
    // UTF-8 data
    char utf8[] = {'\x7a',                       // latin small letter 'z' U+007a
                   '\xe6','\xb0','\xb4',         // CJK ideograph "water"  U+6c34
                   '\xf0','\x9d','\x84','\x8b'}; // musical sign segno U+1d10b
    std::ofstream fout("text.txt");
    fout.write(utf8, sizeof utf8);
}
void test_file_utf16()
{
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    std::cout << "Read from file using UTF-8/UTF-16 codecvt\n";
    for(wchar_t c; fin >> c; )
        // Cast so the code unit prints as a number on every standard version
        std::cout << std::hex << std::showbase << static_cast<unsigned long>(c) << '\n';
}
void test_file_ucs2()
{
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));
    std::cout << "Read from file using UTF-8/UCS2 codecvt\n";
    for(wchar_t c; fin >> c; )
        // Cast so the code unit prints as a number on every standard version
        std::cout << std::hex << std::showbase << static_cast<unsigned long>(c) << '\n';
}
int main()
{
    prepare_file();
    test_file_utf16();
    test_file_ucs2();
}

This outputs, on my Visual Studio 2010 EE SP1:

Read from file using UTF-8/UTF-16 codecvt
0x7a
0x6c34
0xd834
0xdd0b
Read from file using UTF-8/UCS2 codecvt
0x7a
0x6c34
0xd10b
Press any key to continue . . .
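The same facets can also convert a string in memory rather than a file stream. The sketch below uses std::wstring_convert with std::codecvt_utf8_utf16; the function name `utf8_to_utf16` is an illustrative choice, and using char16_t instead of wchar_t is an assumption made so that the result is UTF-16 on every platform. Note that the header <codecvt> and std::wstring_convert are deprecated as of C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts a UTF-8 byte string to UTF-16 code units.
// from_bytes() throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf8);
}
```

Feeding it the same eight bytes that prepare_file writes should yield four code units, with the musical sign represented as a surrogate pair, just as in the UTF-8/UTF-16 output above.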
Oxazine answered 26/9, 2011 at 21:41 Comment(0)
1

Per MSDN, it would be named "english_us.65001". But code page 65001 is somewhat flaky on Windows.
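To find out whether a name such as this one is actually present on a given system, one approach is to try constructing a std::locale from it: the constructor throws std::runtime_error for names the runtime does not recognize. A minimal sketch, where `locale_available` is a hypothetical helper name; whether "english_us.65001" is accepted depends on the C runtime in use and has not been verified here.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns true if the C++ runtime can construct a
// locale with this name, false if the name is rejected.
bool locale_available(const char* name)
{
    try {
        std::locale probe(name);  // throws std::runtime_error if unknown
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Unlike calling setlocale directly, this probe does not change the global locale, so it can be used purely as a check before deciding which name to install.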

Varietal answered 1/12, 2010 at 15:52 Comment(7)
Can you please comment more on the "somewhat flaky"? - Fosdick
@Let_Me_Be: I can't summarize it better than google.com/search?q=site%3Ablogs.msdn.com+65001 - Varietal
@Varietal I'm sorry but I just can't find anything both current and detailed enough. What I understand from the short blog posts I read is that Windows doesn't have UTF-8 support at all (which just doesn't make any sense). - Fosdick
@Let_Me_Be: It doesn't have implicit support. You can't call MessageBoxA("Hellö"). However, it has explicit support: MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8input.c_str(), ... - Varietal
@Varietal OK, so since my code is explicitly converting into a UTF-8 locale, it should work (damn, I really need to install Windows somewhere so I can test this). - Fosdick
@Let_Me_Be: What all these answers try to say is that there is no utf-8 locale on windows. - Gauge
@Gauge there wasn't, but now there is and MS actually recommended to use the UTF-8 locale for portability - Septivalent
