C++, cout and UTF-8
S

4

8

Hopefully a simple question: cout seems to die when handling strings that end with a multibyte UTF-8 character. Am I doing something wrong? This is with GCC (MinGW) on Win7 x64.

Edit: Sorry if I wasn't clear enough. I'm not concerned about the missing glyphs or how the bytes are interpreted, only that nothing at all is shown right after the call to cout << s4 (the final BAR is missing). Any further cout output after that displays no text whatsoever!

#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string s1("abc");
    std::string s2("…");  // … = 0xE2 80 A6
    std::string s3("…abc");
    std::string s4("abc…");

    //In C
    fwrite(s1.c_str(), s1.size(), 1, stdout);
    printf(" FOO ");
    fwrite(s2.c_str(), s2.size(), 1, stdout);
    printf(" BAR ");
    fwrite(s3.c_str(), s3.size(), 1, stdout);
    printf(" FOO ");
    fwrite(s4.c_str(), s4.size(), 1, stdout);
    printf(" BAR\n\n"); 

    //C++
    std::cout << s1 << " FOO " << s2 << " BAR " << s3 << " FOO " << s4 << " BAR ";
}

// results:

// abc FOO ��� BAR ���abc FOO abc… BAR

// abc FOO ��� BAR ���abc FOO abc…
Singband answered 5/8, 2011 at 9:3 Comment(10)
Where are you running your program? The Windows command prompt really doesn't like Unicode much, so while your program might output text just fine, the console doesn't know what to do with it.Shooin
@jalf: The Windows console subsystem doesn't have real issues. WriteConsoleW works reasonably well given correct fonts. Windows doesn't like UTF-8, though, which means that WriteConsoleA is going to choke here.Excavation
Works for me on Ubuntu/gnome-terminal/GCC. I suspect getting this right requires both C++ correctness and taking platform specifics into account.Retract
@MSalters: Oh true, I should've been more specific.Shooin
Pipe the output into a file and open that file in notepad. What happens?Drava
Calling SetConsoleOutputCP(65001) is required to switch console output to UTF-8 (SetConsoleCP(65001) does the same for input). Finding a fixed-pitch font that is capable of displaying Unicode glyphs is going to be the hard problem.Picasso
@Hans Passant: Lucida Console TrueType should do the trick. See support.microsoft.com/kb/99795Excavation
@Excavation - it doesn't, it has very few glyphs. Check it out with charmap.exePicasso
The next problem you're battling is that the CRT code doesn't handle a Unicode code page properly. This is fixed in the next version of VS; until then, fall back to WriteConsole(). If you get the impression you are trying to do something that isn't well supported then you're right.Picasso
@MSalters: Not being able to handle UTF-8 is not a real issue??? It’s a deathblow.Caskey
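The WriteConsoleW route mentioned in the comments above could look roughly like the sketch below. The function name write_utf8 is hypothetical and does not come from the thread. It assumes the input is valid UTF-8, converts it to UTF-16 with MultiByteToWideChar, and hands it to WriteConsoleW when stdout really is a console. In every other case (redirected output, non-Windows builds) it writes the raw bytes with fwrite.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: write UTF-8 text to stdout, using the wide console
// API on Windows when stdout is attached to a console.
bool write_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for real console handles, not files or pipes.
    if (out != INVALID_HANDLE_VALUE && GetConsoleMode(out, &mode)) {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      static_cast<int>(utf8.size()), nullptr, 0);
        if (len <= 0) return false;  // conversion failed
        std::wstring wide(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            static_cast<int>(utf8.size()), &wide[0], len);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wide.size()),
                             &written, nullptr) != 0;
    }
#endif
    // Redirected output or non-Windows platform: pass the bytes through unchanged.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

Calling write_utf8(s4) in place of std::cout << s4 sidesteps the narrow-character CRT path entirely. Whether the glyph then appears still depends on the console font.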
C
2

This is really no surprise. Unless your terminal is set to UTF-8, how is it supposed to know that s2 isn't meant to be "(Latin small letter a with circumflex)(Euro sign)(Broken bar)"? That is how the three bytes read in Windows-1252; in ISO-8859-1 the middle byte, 0x80, is a control character. See http://www.ascii-code.com/
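To see exactly which bytes the terminal is being asked to interpret, a small hex dump helps. The helper name to_hex is hypothetical; this is only a sketch for inspecting a std::string byte by byte.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a string as two uppercase hex
// digits, separated by single spaces.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For the question's s2, to_hex("\xE2\x80\xA6") yields "E2 80 A6": three separate bytes that only form one character if the reader on the other end decodes them as UTF-8.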

By the way, cout is not "dying" as it clearly continues to produce output after your test string.

Curtiscurtiss answered 5/8, 2011 at 9:52 Comment(2)
Good point. std::cout only echoes a stream of bytes to the outside world. How they are interpreted is between you and the program which will ultimately read those bytes.Belicia
@Singband - Yep, cout outputs nothing, but if you use printf then you get what you expect (provided you have selected a suitable console font and run chcp 65001).Voluptuary
F
4

If you want your program to use your current locale, call setlocale(LC_ALL, "") as the first thing in your program. Otherwise the program's locale is C and what it will do to non-ASCII characters is not knowable by us mere humans.
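A minimal sketch of that advice follows. The helper name use_environment_locale is hypothetical. It only adopts whatever locale the environment specifies and reports the resulting name; it does not guarantee that the locale is a UTF-8 one.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: switch from the default "C" locale to the one named
// by the environment and return the name of the locale now in effect.
std::string use_environment_locale() {
    const char* name = std::setlocale(LC_ALL, "");
    if (name == nullptr) {
        // The environment names a locale the C library does not know;
        // the program stays in its previous locale, so query that instead.
        name = std::setlocale(LC_ALL, nullptr);
    }
    return name ? name : "C";
}
```

Call it at the very start of main, before any output, and print the returned name once to confirm which locale the program actually ended up with.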

Formic answered 5/8, 2011 at 10:1 Comment(1)
+1 to n.m. On Windows, calling setlocale(LC_ALL, "") and doing chcp 65001 was the trick for Unicode in the consoleEtte
B
0

The Windows console does not handle non-local-codepage characters by default.

You'll need to make sure you have a Unicode-capable font set in the console window, and that the code page is set to UTF-8 with chcp 65001. This is not a guaranteed success though. Note that wcout changes nothing if the console can't show the fancy characters because its font lacks the glyphs.

On most modern Linux distros, the terminal is set to UTF-8 by default and this should work out of the box.

Bushy answered 5/8, 2011 at 9:29 Comment(0)
S
0

As others have pointed out, std::cout is agnostic about this, at least in "C" locale (the default). On the other hand, your console window must be set up to display UTF-8: code page 65001. Try invoking chcp 65001 before executing your program. (This has worked for me in the past.)
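The same switch can be made from inside the program rather than by typing chcp 65001 first. The sketch below uses the Win32 call SetConsoleOutputCP. The wrapper name enable_utf8_console_output is hypothetical, and on non-Windows builds it does nothing and reports success, on the assumption that the terminal is already UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: ask the Windows console to interpret program output
// as UTF-8 (code page 65001). Returns false if the call fails.
bool enable_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // assumption: the terminal already expects UTF-8
#endif
}
```

Call it once at the start of main. The code page set this way can remain in effect for the console after the program exits, so a careful program might save the old value with GetConsoleOutputCP and restore it on exit.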

Salvadorsalvadore answered 5/8, 2011 at 9:41 Comment(0)
