UTF-8 output on Windows console

The following code shows unexpected behaviour on my machine (tested with Visual C++ 2008 SP1 on Windows XP and VS 2012 on Windows 7):

#include <cstdio>    // fputc, fputs, stdout
#include <iostream>
#include <windows.h>

int main() {
    SetConsoleOutputCP( CP_UTF8 );
    std::cout << "\xc3\xbc";
    int fail = std::cout.fail() ? '1': '0';
    fputc( fail, stdout );
    fputs( "\xc3\xbc", stdout );
}

I simply compiled with cl /EHsc test.cpp.

Windows XP: Output in a console window is ü0ü (translated to Codepage 1252, originally shows some line drawing characters in the default Codepage, perhaps 437).

When I change the settings of the console window to use the "Lucida Console" font and run my test.exe again, the output changes to 1ü, which means:

  • the character ü can be written using fputs and its UTF-8 encoding C3 BC;
  • std::cout does not work for whatever reason;
  • the stream's failbit is set after trying to write the character.

Windows 7: Output using Consolas is ��0ü. Even more interesting: the correct bytes are probably written (at least they are when redirecting the output to a file) and the stream state is OK, but the two bytes are displayed as separate characters.

I tried to raise this issue on "Microsoft Connect" (see here), but MS has not been very helpful. You might also look here, as something similar has been asked before.

Can you reproduce this problem?

What am I doing wrong? Shouldn't the std::cout and the fputs have the same effect?

Soviet answered 2/11, 2009 at 10:33 Comment(4)
I've had trouble with C++ iostreams before. There's lots of hidden nastiness that causes problems. This isn't worthy of an answer, but when iostreams gives you trouble, use C's stdio. I've had to many times before with issues just like this. - Braca
Yes, using iostreams is more complicated than stdio; there are even full-length textbooks about this. But iostreams give you a great deal of flexibility, which I am using gladly. - Soviet
Isn't it a problem of the Windows console? I remember that it's not Unicode-aware by any means, creating lots of such problems... - Delois
As you see, I can output UTF-8 encoded strings in the Windows console (via fputs) and I can type UTF-8 encoded files with the type command (after having done chcp 65001). Thus I thought it could handle this encoding… - Soviet

It's time to close this now. Stephan T. Lavavej says the behaviour is "by design", although I cannot follow this explanation.

My current knowledge is: Windows XP console in UTF-8 codepage does not work with C++ iostreams.

Windows XP is getting out of fashion now, and so is VS 2008. I'd be interested to hear if the problem still exists on newer Windows systems.

On Windows 7 the effect is probably due to the way the C++ streams output characters. As seen in an answer to Properly print utf8 characters in windows console, UTF-8 output also fails with C stdio when printing one byte after another, like putc('\xc3'); putc('\xbc');. Perhaps this is what C++ streams do here.
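If the console really does mis-render a multi-byte sequence that arrives split across separate writes, one mitigation is to hand it only whole characters. The following is a minimal sketch of that idea, not something taken from the linked answer; the function name utf8_complete_prefix is hypothetical, and the input is assumed to be otherwise well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns the length of the longest prefix of `bytes` that does not end in
// the middle of a UTF-8 multi-byte sequence. (Hypothetical helper.)
std::size_t utf8_complete_prefix(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    // A UTF-8 sequence is at most 4 bytes long, so only the tail matters.
    for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char b = static_cast<unsigned char>(bytes[n - back]);
        if ((b & 0xC0) == 0x80) {
            continue;                        // continuation byte, keep looking
        }
        // Found the last lead byte: how many bytes does its sequence need?
        const std::size_t need = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        return need > back ? n - back : n;   // cut before an unfinished sequence
    }
    return n;  // no lead byte near the end: malformed input, pass it through
}
```

A writer could keep the bytes after the returned length in its buffer and send them together with the next chunk, so that C3 and BC always reach the console in the same call.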

Soviet answered 19/6, 2013 at 10:55 Comment(1)
It still exists :( I'm trying to find a workaround in stackoverflow.com/questions/23584160/… Your help would be welcome :) - Rexrexana

I understand the question is quite old, but in case someone is still interested, below is my solution. I've implemented a fairly simple std::streambuf descendant and passed it to each of the standard streams at the very beginning of program execution.

This allows you to use UTF-8 everywhere in your program. On input, data is taken from the console in Unicode (UTF-16), converted, and returned to you in UTF-8. On output the opposite is done: data is taken from you in UTF-8, converted to Unicode (UTF-16), and sent to the console. No issues found so far.

Also note that this solution doesn't require any codepage modification, whether with SetConsoleCP, SetConsoleOutputCP, chcp, or something else.

That's the stream buffer:

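#include <streambuf>
#include <string>
#include <windows.h>

// Conversion helpers; their implementation is left out here.
std::wstring utf8_to_wstring(const std::string& str);
std::string wstring_to_utf8(const std::wstring& str);

class ConsoleStreamBufWin32 : public std::streambuf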
{
public:
    ConsoleStreamBufWin32(DWORD handleId, bool isInput);

protected:
    // std::basic_streambuf
    virtual std::streambuf* setbuf(char_type* s, std::streamsize n);
    virtual int sync();
    virtual int_type underflow();
    virtual int_type overflow(int_type c = traits_type::eof());

private:
    HANDLE const m_handle;
    bool const m_isInput;
    std::string m_buffer;
};

ConsoleStreamBufWin32::ConsoleStreamBufWin32(DWORD handleId, bool isInput) :
    m_handle(::GetStdHandle(handleId)),
    m_isInput(isInput),
    m_buffer()
{
    if (m_isInput)
    {
        setg(0, 0, 0);
    }
}

std::streambuf* ConsoleStreamBufWin32::setbuf(char_type* /*s*/, std::streamsize /*n*/)
{
    return 0;
}

int ConsoleStreamBufWin32::sync()
{
    if (m_isInput)
    {
        ::FlushConsoleInputBuffer(m_handle);
        setg(0, 0, 0);
    }
    else
    {
        if (m_buffer.empty())
        {
            return 0;
        }

        std::wstring const wideBuffer = utf8_to_wstring(m_buffer);
        DWORD writtenSize;
        ::WriteConsoleW(m_handle, wideBuffer.c_str(), static_cast<DWORD>(wideBuffer.size()), &writtenSize, NULL);
    }

    m_buffer.clear();

    return 0;
}

ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::underflow()
{
    if (!m_isInput)
    {
        return traits_type::eof();
    }

    if (gptr() >= egptr())
    {
        wchar_t wideBuffer[128];
        DWORD readSize;
        if (!::ReadConsoleW(m_handle, wideBuffer, ARRAYSIZE(wideBuffer) - 1, &readSize, NULL))
        {
            return traits_type::eof();
        }

        wideBuffer[readSize] = L'\0';
        m_buffer = wstring_to_utf8(wideBuffer);

        setg(&m_buffer[0], &m_buffer[0], &m_buffer[0] + m_buffer.size());

        if (gptr() >= egptr())
        {
            return traits_type::eof();
        }
    }

    return sgetc();
}

ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::overflow(int_type c)
{
    if (m_isInput)
    {
        return traits_type::eof();
    }

    m_buffer += traits_type::to_char_type(c);
    return traits_type::not_eof(c);
}

The usage then is as follows:

template<typename StreamT>
inline void FixStdStream(DWORD handleId, bool isInput, StreamT& stream)
{
    if (::GetFileType(::GetStdHandle(handleId)) == FILE_TYPE_CHAR)
    {
        stream.rdbuf(new ConsoleStreamBufWin32(handleId, isInput));
    }
}

// ...

int main()
{
    FixStdStream(STD_INPUT_HANDLE, true, std::cin);
    FixStdStream(STD_OUTPUT_HANDLE, false, std::cout);
    FixStdStream(STD_ERROR_HANDLE, false, std::cerr);

    // ...

    std::cout << "\xc3\xbc" << std::endl;

    // ...
}

The omitted helpers wstring_to_utf8 and utf8_to_wstring can easily be implemented with the WideCharToMultiByte and MultiByteToWideChar WinAPI functions.
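For illustration, the two helpers might look roughly like this. It is a sketch rather than the code the answer's author actually used: the Windows branch follows the usual two-call pattern (first ask for the required size, then convert), while the #else branch is purely an assumption added so the sketch also compiles elsewhere; it expects well-formed input and a 32-bit wchar_t.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// UTF-8 -> wide string (UTF-16 code units on Windows).
std::wstring utf8_to_wstring(const std::string& utf8)
{
#ifdef _WIN32
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units the result needs.
    const int dstLen = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, NULL, 0);
    if (dstLen <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(dstLen), L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, &wide[0], dstLen);
    return wide;
#else
    // Portability fallback (assumption): decode to one code point per wchar_t.
    std::wstring wide;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        unsigned long cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        wide += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return wide;
#endif
}

// Wide string -> UTF-8.
std::string wstring_to_utf8(const std::wstring& wide)
{
#ifdef _WIN32
    if (wide.empty()) return std::string();
    const int srcLen = static_cast<int>(wide.size());
    // For CP_UTF8 the last two parameters must be NULL.
    const int dstLen = ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen, NULL, 0, NULL, NULL);
    if (dstLen <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(dstLen), '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen, &utf8[0], dstLen, NULL, NULL);
    return utf8;
#else
    // Portability fallback (assumption): encode each code point by hand.
    std::string utf8;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(wide[i]);
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
#endif
}
```

Passing explicit lengths instead of -1 avoids an embedded terminator in the result and lets strings containing '\0' pass through unchanged.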

Fuze answered 12/2, 2014 at 13:8 Comment(5)
That was a helpful idea. For output I ended up with a class derived from std::stringbuf (so I don't have to do the buffering myself) and just implemented sync() to do the conversion. Instead of hard-wiring the output sink in the code, my sync() inserts the converted string into the stream's original streambuf. - Soviet
Nice solution! On my Windows 7 system, I found that calling SetConsoleOutputCP does not work. mike.dld's answer works! I found an implementation of wstring_to_utf8 and utf8_to_wstring here: Convert wstring to string encoded in UTF-8. I hope that helps others. - Kai
Hi, I found that with this method std::cout works correctly, but when I tried printf("\xc3\xbc"); it does not work. Can you help to solve the printf() issue? Thanks. - Kai
OK, I have a method to solve the printf() issue; I just added my solution as an answer to this question. - Kai
@mike.dld: This solution is great, but unfortunately the use case std::string line; std::getline( std::cin, line ); will not work 100%. Although the input is read correctly into line, the echoed output is garbage if the UTF-16 representation of the input contains a surrogate pair. All is fine if only one UTF-16 code unit is needed. This can easily be reproduced using either Command Prompt or PowerShell inside "Windows Terminal" by pasting e.g. 🚀 🍀 🔥 into the program while it waits in std::getline. ReadConsoleW (with ENABLE_LINE_INPUT) fails on UTF-16 surrogate pairs. - Poff
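The std::stringbuf variant described in the first comment above could look roughly like the following. This is only a sketch of the idea as stated there: the commenter's actual conversion is not shown in the thread, so it is passed in as a function, and the names ConvertingBuf, ConvertFn and mark_chunk are hypothetical.

```cpp
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical signature for the encoding conversion applied on flush.
typedef std::string (*ConvertFn)(const std::string&);

// Placeholder conversion for illustration only: it marks each flushed chunk.
std::string mark_chunk(const std::string& s) { return "[" + s + "]"; }

// std::stringbuf does the buffering; sync() converts and forwards the data.
class ConvertingBuf : public std::stringbuf
{
public:
    ConvertingBuf(std::streambuf* sink, ConvertFn convert)
        : std::stringbuf(std::ios_base::out), m_sink(sink), m_convert(convert) {}

protected:
    // Called on flush: convert everything buffered so far and pass it on
    // to the stream's original buffer in a single write.
    virtual int sync()
    {
        const std::string pending = str();
        str(std::string());                       // reset the internal buffer
        if (pending.empty()) return 0;
        const std::string converted = m_convert(pending);
        const std::streamsize n = static_cast<std::streamsize>(converted.size());
        if (m_sink->sputn(converted.data(), n) != n) return -1;
        return m_sink->pubsync();
    }

private:
    std::streambuf* m_sink;    // the stream's original streambuf (not owned)
    ConvertFn m_convert;
};
```

To install it, one would keep the pointer returned by std::cout.rdbuf(), construct a ConvertingBuf around it, and pass the new buffer to std::cout.rdbuf(...). Because the data is forwarded only in sync(), bytes inserted one at a time still reach the sink as one chunk per flush.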

I'm guessing the C++ default locale is getting involved. By default, it will use the code page provided by GetThreadLocale() to determine the text encoding of non-wstring data.

This generally defaults to CP1252. You could try using SetThreadLocale() to switch to UTF-8 (if it even does that, I can't recall), in the hope that std::locale then defaults to something that can handle your UTF-8 encoding.

Propman answered 4/11, 2009 at 3:56 Comment(1)
I looked at this again, but SetThreadLocale does not deal with encoding, or I don't understand the documentation (msdn.microsoft.com/en-us/library/dd374051(VS.85).aspx). I tried a little with std::cout.imbue, but to no avail. This issue remains unsolved... - Soviet

I followed mike.dld's answer to this question and added printf support for UTF-8 strings.

As mkluwe mentioned in his answer, printf by default writes to the console one byte at a time, and the console cannot handle the individual bytes of a multi-byte sequence correctly. My method is quite simple: I use the snprintf family to format the whole content into an internal string buffer and then send that buffer to std::cout.
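The buffering step on its own, independent of the console stream buffer, can be sketched as follows. This is a hedged variant, not the code used in this answer: the helper name format_to_string is hypothetical, and it returns a std::string instead of writing to std::cout directly. It uses vsnprintf, the member of the snprintf family that accepts a va_list.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats like printf, but returns the complete result
// as one string so it can be written to the console in a single piece.
std::string format_to_string(const char* fmt, ...)
{
    va_list args;
    va_start(args, fmt);

    // Measuring consumes the argument list, so measure on a copy.
    va_list measure;
    va_copy(measure, args);
    const int len = std::vsnprintf(NULL, 0, fmt, measure);
    va_end(measure);

    std::string out;
    if (len > 0) {
        out.resize(static_cast<std::size_t>(len) + 1);   // room for the '\0'
        std::vsnprintf(&out[0], out.size(), fmt, args);
        out.resize(static_cast<std::size_t>(len));       // drop the '\0'
    }
    va_end(args);
    return out;
}
```

With the replaced stream buffer installed, std::cout << format_to_string("%s\n", "\xc3\xbc"); would then pass the whole formatted text through the UTF-8 to UTF-16 conversion at once.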

Here is the full testing code:

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

// https://mcmap.net/q/15777/-convert-wstring-to-string-encoded-in-utf-8
#include <codecvt>
#include <string>

// convert UTF-8 string to wstring
// (on Windows wchar_t holds UTF-16 code units, so codecvt_utf8_utf16 is used;
// codecvt_utf8<wchar_t> would only cover UCS-2 there and fail on e.g. emoji)
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.to_bytes(str);
}

// https://mcmap.net/q/15587/-utf-8-output-on-windows-console
// mike.dld's answer
class ConsoleStreamBufWin32 : public std::streambuf
{
public:
    ConsoleStreamBufWin32(DWORD handleId, bool isInput);

protected:
    // std::basic_streambuf
    virtual std::streambuf* setbuf(char_type* s, std::streamsize n);
    virtual int sync();
    virtual int_type underflow();
    virtual int_type overflow(int_type c = traits_type::eof());

private:
    HANDLE const m_handle;
    bool const m_isInput;
    std::string m_buffer;
};

ConsoleStreamBufWin32::ConsoleStreamBufWin32(DWORD handleId, bool isInput) :
    m_handle(::GetStdHandle(handleId)),
    m_isInput(isInput),
    m_buffer()
{
    if (m_isInput)
    {
        setg(0, 0, 0);
    }
}

std::streambuf* ConsoleStreamBufWin32::setbuf(char_type* /*s*/, std::streamsize /*n*/)
{
    return 0;
}

int ConsoleStreamBufWin32::sync()
{
    if (m_isInput)
    {
        ::FlushConsoleInputBuffer(m_handle);
        setg(0, 0, 0);
    }
    else
    {
        if (m_buffer.empty())
        {
            return 0;
        }

        std::wstring const wideBuffer = utf8_to_wstring(m_buffer);
        DWORD writtenSize;
        ::WriteConsoleW(m_handle, wideBuffer.c_str(), wideBuffer.size(), &writtenSize, NULL);
    }

    m_buffer.clear();

    return 0;
}

ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::underflow()
{
    if (!m_isInput)
    {
        return traits_type::eof();
    }

    if (gptr() >= egptr())
    {
        wchar_t wideBuffer[128];
        DWORD readSize;
        if (!::ReadConsoleW(m_handle, wideBuffer, ARRAYSIZE(wideBuffer) - 1, &readSize, NULL))
        {
            return traits_type::eof();
        }

        wideBuffer[readSize] = L'\0';
        m_buffer = wstring_to_utf8(wideBuffer);

        setg(&m_buffer[0], &m_buffer[0], &m_buffer[0] + m_buffer.size());

        if (gptr() >= egptr())
        {
            return traits_type::eof();
        }
    }

    return sgetc();
}

ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::overflow(int_type c)
{
    if (m_isInput)
    {
        return traits_type::eof();
    }

    m_buffer += traits_type::to_char_type(c);
    return traits_type::not_eof(c);
}

template<typename StreamT>
inline void FixStdStream(DWORD handleId, bool isInput, StreamT& stream)
{
    if (::GetFileType(::GetStdHandle(handleId)) == FILE_TYPE_CHAR)
    {
        stream.rdbuf(new ConsoleStreamBufWin32(handleId, isInput));
    }
}

// some code are from this blog
// https://blog.csdn.net/witton/article/details/108087135

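#include <cstdarg>
#include <cstdio>

// Route printf through std::cout so the output reaches the console in one piece.
#define printf(fmt, ...) utf8_fprint(stdout, fmt, ##__VA_ARGS__ )

// Names starting with a double underscore are reserved, hence utf8_*.
int utf8_vfprint(FILE * /*fp*/, const char *fmt, va_list va)
{
    // https://mcmap.net/q/15780/-which-of-sprintf-snprintf-is-more-secure
    // vsnprintf (not snprintf) is the variant that takes a va_list.
    // Measuring consumes the list, so measure on a copy.
    va_list vaCopy;
    va_copy(vaCopy, va);
    int len = vsnprintf(NULL, 0, fmt, vaCopy);
    va_end(vaCopy);
    if (len < 0)
    {
        return len;
    }

    size_t nbytes = (size_t)len + 1; /* +1 for the '\0' */
    char *str = (char*)malloc(nbytes);
    if (str == NULL)
    {
        return -1;
    }
    vsnprintf(str, nbytes, fmt, va);
    std::cout << str;
    free(str);
    return len; // number of characters written, as printf reports it
}

int utf8_fprint(FILE *fp, const char *fmt, ...)
{
    va_list va;
    va_start(va, fmt);
    int n = utf8_vfprint(fp, fmt, va);
    va_end(va);
    return n;
}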

int main()
{
    FixStdStream(STD_INPUT_HANDLE, true, std::cin);
    FixStdStream(STD_OUTPUT_HANDLE, false, std::cout);
    FixStdStream(STD_ERROR_HANDLE, false, std::cerr);

    // ...

    std::cout << "\xc3\xbc" << std::endl;

    printf("\xc3\xbc");

    // ...
    return 0;
}

The source code is saved in UTF-8 format, built with Msys2's GCC, and run under 64-bit Windows 7. Here is the result:

ü
ü
Kai answered 2/8, 2022 at 7:8 Comment(0)

I am adding this answer because this really was a head-scratcher, and the solution is much simpler than it seems.

#include <iostream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

int main() {

#ifdef _WIN32
  SetConsoleOutputCP(CP_UTF8);
#endif

  std::string content = "Hello Word 😃";    
  std::cout << content << std::endl;

  return 0;
}

Tested in Windows 11:

clang main.cpp -o main.exe -std=c++20

Tested in Windows Subsystem for Linux:

g++ main.cpp -o test.elf --std=c++20
Karlik answered 8/10, 2023 at 0:29 Comment(0)
