C++11 std::cout << "string literal in UTF-8" to Windows cmd console? (Visual Studio 2015)
Summary: What should I do to correctly print a string literal, defined in source code stored in UTF-8 encoding (Windows CP 65001), to a cmd console using the std::cout stream?

Motivation: I would like to modify the excellent Catch unit-testing framework (as an experiment) so that it displays my texts with accented characters. The modification should be simple and reliable, and it should also be useful for other languages and working environments so that the author could accept it as an enhancement. Or, if you know Catch and there is an alternative solution, could you post it?

Details: Let's start with the Czech version of the "quick brown fox..."

#include <iostream>
#include "windows.h"

using namespace std;

int main()
{
    cout << "\n-------------------------- default cmd encoding = 852 -------------------\n";
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    cout << "\n-------- Windows Central European (1250) set for the cmd console --------\n";
    SetConsoleOutputCP(1250);
    std::cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << std::endl;

    cout << "\n------------- Windows UTF-8 (65001) set for the cmd console -------------\n";
    SetConsoleOutputCP(CP_UTF8);
    std::cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << std::endl;
}

It prints the following (font set to Lucida Console): [screenshot of the console output]

The cmd default encoding is 852, the default Windows encoding is 1250, and the source code was saved using the 65001 encoding (UTF-8 with BOM). The call SetConsoleOutputCP(1250); changes the cmd encoding programmatically, the same way chcp 1250 does.

Observation: When setting the 1250 encoding, the UTF-8 string literal is printed correctly. I believe it can be explained, but it is really strange. Is there any decent, human, general way to solve the problem?

Update: The "narrow string literal" is stored using Windows-1250 encoding in my case (the native Windows encoding for Central European). This seems to be independent of the encoding of the source code: the compiler saves the literal in the native Windows encoding. Because of that, switching cmd to that encoding gives the desired output. It is ugly, but how can I get the native Windows encoding programmatically (to pass it to SetConsoleOutputCP(cpX))? What I need is a constant that is valid for the machine where the compilation happened. It should not be the native encoding of the machine where the executable runs.

C++11 also introduced u8"the UTF-8 string literal", but it does not seem to fit with SetConsoleOutputCP(CP_UTF8);
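To check what a u8 literal actually contains, independent of the console, the bytes can be dumped as hex. The following is only a minimal sketch: hexBytes and dumpU8Literal are hypothetical helper names, and the character ř (U+0159) is written as an escape so that the encoding of the source file plays no role.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes in [data, data + size) as hex, e.g. "c5 99".
std::string hexBytes(const void* data, std::size_t size)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string result;
    char buffer[4];
    for (std::size_t i = 0; i < size; ++i)
    {
        std::snprintf(buffer, sizeof buffer, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) result += ' ';
        result += buffer;
    }
    return result;
}

// Hypothetical helper: dumps the bytes of a u8 literal holding U+0159.
// The escape sequence keeps the result independent of the source file encoding.
std::string dumpU8Literal()
{
    const auto& text = u8"\u0159";            // array of char (char8_t since C++20)
    return hexBytes(text, sizeof(text) - 1);  // -1 drops the terminating zero
}
```

Calling dumpU8Literal() should give the two UTF-8 bytes of ř. If the bytes are right but the console output is still wrong, the problem lies on the output path rather than in the literal.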

Nefertiti answered 1/9, 2015 at 11:55 Comment(4)
Possibly related: #18904581Farny
@luk32: Thanks for the references. I will look at it.Nefertiti
When compiling a UTF-8 source in MSVC, it will translate the string literals into native encoding if the file starts with UTF-8 BOM. When you remove it, your test string should be printed correctly in the third case.Shulock
@Melebius: Thanks for pointing it out. I have just found it in parallel, and I am going to write a partial answer. However, I will be more than happy to accept any more elaborate answer.Nefertiti

This is a partial answer, found by following the link from luk32 and confirming Melebius's comment (see below the question). It is not a complete answer, and I will be happy to accept your follow-up comment.

I have just found the UTF-8 Everywhere Manifesto, which touches on the problem. Point 17, "Q: How do I write UTF-8 string literal in my C++ code?", says (explicitly also for the Microsoft C++ compiler):

However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

                                "∃y ∀x ¬(x ≺ y)"

Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).

I really like the manifesto. To make it short, using rude words, and possibly oversimplified, it says:

Ignore the wstring, wchar_t, and the like things. Ignore the codepages. Ignore the string literal prefixes like L, u, U, u8. Use UTF-8 everywhere. Write all literals "naturally". Ensure it is also stored in the compiled binary.

If the following code is stored with UTF-8 without BOM...

#include <iomanip>
#include <iostream>
#include "windows.h"

using namespace std;

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    int cnt = 0;
    for (unsigned int c : "Příšerně žluťoučký kůň úpěl ďábelské ódy!") 
    {
        cout << hex << setw(2) << setfill('0') << (c & 0xff);
        ++cnt;
        if (cnt % 16 == 0)      cout << endl;
        else if (cnt % 8 == 0)  cout << " | ";
        else if (cnt % 4 == 0)  cout << "  ";
        else                    cout << ' ';
    }
    cout << endl;
}

It prints (should be UTF-8 encoded)...

[screenshot of the console output]

When saving the source as UTF-8 with BOM, it prints a different result...

[screenshot of the console output]

However, the problem remains -- how to set the console encoding programmatically so that the UTF-8 string is printed correctly.
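One workaround that is sometimes suggested bypasses std::cout entirely: convert the UTF-8 bytes to UTF-16 and hand them to the console with WriteConsoleW. The sketch below assumes that approach; it is not a fix for Catch itself, and printUtf8 is a hypothetical helper name. When the output is redirected, or on a platform other than Windows, it falls back to writing the raw bytes to stdout.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to standard output.
// Returns true if the whole text was handed to the console or the stream.
bool printUtf8(const std::string& utf8)
{
    if (utf8.empty()) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode))  // succeeds only for a real console handle
    {
        const int byteCount = static_cast<int>(utf8.size());
        // First call: ask how many UTF-16 code units are needed.
        const int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), byteCount, nullptr, 0);
        if (wideLen <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(wideLen), L'\0');
        // Second call: do the actual conversion.
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), byteCount, &wide[0], wideLen);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wide.size()), &written, nullptr) != 0;
    }
#endif
    // Redirected output or non-Windows platform: pass the bytes through unchanged.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

A call such as printUtf8("P\xC5\x99\xC3\xAD\xC5\xA1" "ern\xC4\x9B\n") then does not depend on the console code page. If std::cout is used in the same program, it should be flushed before calling the helper so that the output order is preserved.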

I gave up. The cmd console is simply crippled, and it is not worth fixing from the outside. I am accepting my own answer only to close the question. If anyone finds a decent solution related to the Catch unit test framework (it could be completely different), I will be glad to accept his/her comment as the answer.

Nefertiti answered 1/9, 2015 at 14:45 Comment(2)
I also use UTF-8 this way for outputting Swedish texts, it works fine with MSVC2015 as long as there's no BOM in the .cpp file. Note: never edit the file using Notepad, it will create a BOM. Use Wordpad.Juttajutty
@HenrySkoglund: Thanks for the hint. (I am using Notepad++ for simple things. That editor also lets you choose between UTF-8 with and without BOM.) Do you send the UTF-8 text to the cmd console through std::cout?Nefertiti

The MSVC compiler tries to encode the const strings in your code with your local encoding. In your case, it uses code page 852. So even though your cmd output tries to read and output the string with code page 1250, the string is in fact stored with code page 852. This mismatch between how the string is stored and how it is read produces the wrong output.
One way to solve this is to store the string in a file encoded with code page 1250. Visual Studio Code provides such functionality. You can read the file as a binary file (i.e., byte by byte) into a char buffer, and then output the buffer.

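#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // Open at the end (ios::ate) so that tellg() reports the file size.
    std::ifstream file("src.txt", std::ios::in | std::ios::binary | std::ios::ate);
    if (!file.is_open())
    {
        std::cout << "File not opened." << std::endl;
        return 1;
    }

    const std::streamsize size = file.tellg();
    // One extra, zero-initialized byte serves as the string terminator.
    std::vector<char> memblock(static_cast<std::size_t>(size) + 1, '\0');
    file.seekg(0, std::ios::beg);
    file.read(memblock.data(), size);
    file.close();

    // The bytes are written unchanged; the console code page decides how they are shown.
    std::cout << memblock.data() << std::endl;
}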


Amuse answered 18/2, 2018 at 4:21 Comment(1)
Thanks, Fawkes. The problem is I need to store the file in UTF-8 for other reasons.Nefertiti
