C++: output contents of a Unicode file to console in Windows
P

6

6

I've read a bunch of articles and forum posts discussing this problem, and all of the solutions seem way too complicated for such a simple task.

Here's some sample code straight from cplusplus.com:

// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt");
  if (myfile.is_open())
  {
    while ( getline (myfile,line) )
    {
      cout << line << endl;
    }
    myfile.close();
  }

  else cout << "Unable to open file"; 

  return 0;
}

It works fine as long as example.txt has only ASCII characters. Things get messy if I try to add, say, something in Russian.

In GNU/Linux it's as simple as saving the file as UTF-8.

In Windows, that doesn't work. Converting the file into UCS-2 Little Endian (what Windows seems to use by default) and changing all the functions into their wchar_t counterparts doesn't do the trick either.

Isn't there some kind of a "correct" way to get this done without doing all kinds of magic encoding conversions?

Peltate answered 5/2, 2011 at 19:35 Comment(5)
You can do this but it does take a little work. You should be able to find the information you need with a web search. Also, Windows uses UTF-16 rather than UCS-2.Madera
Duplicate: #4882531Mukluk
Give up; it's too complicated on Windows. I tried once and lost a lot of time.Uncommitted
@Adam Rosenfield: That doesn't answer the question. chcp 65001 doesn't do the trick.Peltate
How do you deal with the different UCS-2 endianness between Windows and Linux?Altonaltona
T
6

The Windows console supports Unicode, sort of. It does not support right-to-left and "complex scripts". To print a UTF-16 file with Visual C++, use the following:

   _setmode(_fileno(stdout), _O_U16TEXT);   

And use wcout instead of cout.

There is no support for a "UTF8" code page, so for UTF-8 you will have to use MultiByteToWideChar.

More on console support for Unicode can be found in this blog.
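For the UTF-8 case, the MultiByteToWideChar route could look roughly like the sketch below. The helper names (`strip_utf8_bom`, `utf8_to_wide`, `write_to_console`) are made up for illustration. The Windows-specific parts are guarded so the file still compiles elsewhere, and the sketch assumes stdout is a real console, since WriteConsoleW fails when output is redirected to a file or pipe.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Portable helper (hypothetical name): drop a leading UTF-8 byte order
// mark (EF BB BF) if the file was saved with one.
std::string strip_utf8_bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        return bytes.substr(3);
    }
    return bytes;
}

#ifdef _WIN32
// Convert UTF-8 bytes to a UTF-16 wide string (hypothetical name).
// Returns an empty string if the conversion fails.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units the result needs.
    const int needed = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, nullptr, 0);
    if (needed <= 0) return std::wstring();
    std::wstring wide(static_cast<size_t>(needed), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, &wide[0], needed);
    return wide;
}

// Write a wide string straight to the console, bypassing code pages
// (hypothetical name). Returns false if stdout is not a console.
bool write_to_console(const std::wstring& text) {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    return WriteConsoleW(out, text.data(), static_cast<DWORD>(text.size()),
                         &written, nullptr) != 0;
}
#endif
```

With the file contents read in binary mode into a std::string called, say, `bytes`, the call would be `write_to_console(utf8_to_wide(strip_utf8_bom(bytes)))`.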

Toothed answered 5/2, 2011 at 21:44 Comment(4)
I don't think you can use C++ objects because they always convert to some 8-bit encoding. That means you have to use wprintf, as described in this blog.Geothermal
I tried UTF-8, UCS-2 Big Endian and UCS-2 Little Endian for the file encoding. None of them produced legible output when using _setmode and wcout.Peltate
UTF-8 is not supported. You need to use UCS-2 and the correct types / functions (wstring instead of string, and L"" for string literals not "").Toothed
I confirm that the Windows console doesn't display surrogate pairs properly. For example, the console in Windows 10 has the default font Consolas, which supports the emoji 😀 (this can be checked in the VS editor with the same font). Its code point is U+1F600 and its UTF-16 surrogate pair is D83D DE00. Just try to pass it to WriteConsoleW() and it will be displayed as 2 question marks in squares. But copying with the mouse from the console puts the correct character on the clipboard. Also, if you call ReadConsoleW() and paste this character into the console, the buffer will contain the corresponding surrogate pair. So the console's internal buffer is correct.Connivent
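For anyone who wants to reproduce the surrogate pair quoted in the comment above, here is a small sketch of the standard UTF-16 encoding arithmetic. The function name `encode_utf16` is made up, and the sketch assumes the input is a valid code point (no check for values above U+10FFFF or for lone surrogates).

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units (hypothetical helper).
// Assumes cp is a valid scalar value; no validation is done.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: subtract 0x10000, then split the remaining
        // 20 bits into a high (lead) and a low (trail) surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For U+1F600 this yields the two units D83D and DE00, which is what would be handed to WriteConsoleW() as two wchar_t values.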
W
2

The right way to output to a console on Windows using cout is to first call GetConsoleOutputCP, and then convert the input you have into the console code page. Alternatively, use WriteConsoleW, passing a wchar_t*.

Waylonwayman answered 5/2, 2011 at 19:39 Comment(2)
And I get 437 which is "IBM437 OEM United States". SetConsoleOutputCP(CP_UTF8) doesn't help.Peltate
So you need to convert your input to cp437. Notice that CP_UTF8 isn't supported very well; if you want to output Cyrillic, use some of the other code pages supporting Cyrillic.Cly
G
1

For reading UTF-8 or UTF-16 strings from a file, you can use the extended mode string of _wfopen_s and fgetws. I don't think there is a C++ interface for these extensions yet. The easiest way to print to the console is described in Michael Kaplan's blog:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
    return 0;
}

Avoid GetConsoleOutputCP; it is only retained for compatibility with the 8-bit API.
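To cover the file-reading half mentioned at the top of this answer, here is a minimal sketch that combines the extended mode string with the _setmode trick. It assumes the Visual C++ runtime (hence the `_MSC_VER` guard) and that the file is UTF-8. The function name `print_unicode_file` and the 1024-character line buffer are arbitrary choices for illustration.

```cpp
#include <cstdio>
#include <cwchar>
#ifdef _MSC_VER
#include <fcntl.h>
#include <io.h>
#endif

// Print a UTF-8 text file to the console (hypothetical helper).
// Returns 0 on success, nonzero if the file could not be opened or the
// Microsoft CRT extensions used here are unavailable.
int print_unicode_file(const wchar_t* path) {
#ifdef _MSC_VER
    FILE* in = nullptr;
    // "ccs=UTF-8" asks the CRT to decode the file into wide characters
    // as it is read.
    if (_wfopen_s(&in, path, L"rt, ccs=UTF-8") != 0 || in == nullptr) {
        return 1;
    }
    // Switch stdout to UTF-16 text mode; use only wide output afterwards.
    _setmode(_fileno(stdout), _O_U16TEXT);
    wchar_t line[1024];
    while (fgetws(line, 1024, in) != nullptr) {
        fputws(line, stdout);
    }
    fclose(in);
    return 0;
#else
    (void)path;
    return 1;  // this sketch relies on the Microsoft CRT
#endif
}
```

A caller would use it as `print_unicode_file(L"example.txt")`. Once stdout is in `_O_U16TEXT` mode, stick to the wide functions (wprintf, fputws, std::wcout) for the rest of the program.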

Geothermal answered 6/2, 2011 at 20:20 Comment(2)
Michael Kaplan's blog is no longer available (Resource Not Found).Konstantin
I can confirm that, as of December 2018, this is still unchanged. Also, this causes a crash if something afterwards uses the printf family or std::cout ...Wrightson
T
0

While Windows console windows are UCS-2 based, they don't support UTF-8 properly.

You might make things work by setting the console window's active output code page to UTF-8 temporarily, using the appropriate API functions. Note that those functions distinguish between input code page and output code page. However, cmd.exe really doesn't like UTF-8 as active code page, so don't set that as a permanent code page.
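A temporary switch of the output code page could be wrapped up roughly as follows. This is only a sketch: the class name `ScopedConsoleOutputCP` is made up, it touches only the output code page (not the input one), and whether UTF-8 output then renders correctly still depends on the console and its font. Outside Windows the class compiles to a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper: switches the console output code page for the
// lifetime of the object and restores the previous one on destruction.
class ScopedConsoleOutputCP {
public:
    explicit ScopedConsoleOutputCP(unsigned int codePage) {
#ifdef _WIN32
        previous_ = GetConsoleOutputCP();  // 0 means the call failed
        active_ = previous_ != 0 && SetConsoleOutputCP(codePage) != 0;
#else
        (void)codePage;  // nothing to switch outside Windows
#endif
    }

    ~ScopedConsoleOutputCP() {
#ifdef _WIN32
        if (active_) {
            SetConsoleOutputCP(previous_);  // put the old code page back
        }
#endif
    }

    ScopedConsoleOutputCP(const ScopedConsoleOutputCP&) = delete;
    ScopedConsoleOutputCP& operator=(const ScopedConsoleOutputCP&) = delete;

    // True if the code page was actually changed.
    bool active() const { return active_; }

private:
    unsigned int previous_ = 0;
    bool active_ = false;
};
```

Typical use would be to declare `ScopedConsoleOutputCP utf8(65001);` at the top of the block that prints UTF-8 bytes (65001 is the UTF-8 code page, CP_UTF8), so the original code page comes back when the block exits.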

Otherwise, you can use the Unicode console window functions.

Cheers & hth.,

Telemeter answered 5/2, 2011 at 19:44 Comment(11)
@David: Like, 90%, since it apparently uses a very simple array to hold contents. But I haven't tried a UTF-16 surrogate pair with a console window. If it works (does it?) I'll just say, hurray, thank you, I was wrong. :-)Telemeter
@David: yes, sure, it's UCS-2. 2 bytes per character doesn't leave room for surrogate pair. Cheers, & thanks for making me check this,Telemeter
@Alf I still think it's UTF-16Madera
@David: you've been given link to the relevant documentation. you have presented no evidence that the documentation is wrong. i can't argue with such irrational denial.Telemeter
@Alf I was thinking of the WriteConsole API.Madera
@David: i understand, you don't hold a realistic conceptual picture of what goes on, and then logic isn't very convincing. just try outputting a surrogate pair via your chosen API. if it works, fine, then Microsoft's documentation is wrong, and I'm wrong, and Wikipedia is wrong, and so on, and that would be great. ;-)Telemeter
@alf I know that I don't know much about console. I'm trying to learn more. There's no need to bark at me!Madera
It's UTF-16. Here's a link: msdn.microsoft.com/en-us/library/dd374069(v=vs.85).aspx. Time for Alf to post some links after his tirade. Note you still need a font that supports >= U+10000 characters, so "just try it" doesn't prove anything.Edessa
@Mark: i have already posted link to the relevant documentation about console windows. your link is to irrelevant documentation, about Windows applications in general, which you must have understood, and so is a lie. Your "time for Alf to post some links" is a lie, since you must have seen both the link and references to it. Your "after his tirade" is a lie. your "doesn't prove anything" inverts the burden of proof, which is a fallacy. so, i count 3 lies and 1 fallacy in your response. plus, it's factually wrong.Telemeter
@Mark @Alf I believe that Alf is correct. You try writing a surrogate pair to the console and see how many glyphs appear. But Alf, there's no need to get quite so hot and bothered about it!Madera
Actually, the interface suggested has bigger problems than just surrogate pairs. UCS-2 or UTF-16, you still have to account for decomposed diacriticals. (U+0041 U+0301 is Á). This is in fact a harder problem than surrogate pairs, as you don't know how many diacriticals might follow.Selfsatisfied
S
0
#include <stdio.h>

int main (int argc, char *argv[])
{
    // do chcp 65001 in the console before running this
    printf ("γασσο γεο!\n");
}

Works perfectly if you do chcp 65001 in the console before running your program.

Caveats:

  • I'm using 64 bit Windows 7 with VC++ Express 2010
  • The code is in a file encoded as UTF-8 without BOM - I wrote it in a text editor, not using the VC++ IDE, then used VC++ to compile it.
  • The console has a TrueType font - this is important

Don't know if these things make too much difference...

Can't speak for chars off the BMP, give it a whirl and leave a comment.

Suzette answered 7/2, 2011 at 0:13 Comment(3)
chcp 65001 doesn't work, ask Microsoft why they decided to make it unsupported.Italicize
Thanks, you solved my problem. I'm French and the right code page for me was 819 (so +1).Flanigan
I finally fixed the problem in my program by changing the code page at startup using SetConsoleOutputCP(1252).Flanigan
A
-1

Just to be clear, some here have mentioned UTF8. UTF8 is a multibyte format, which in some documentation is mistakenly referred to as Unicode. Unicode is always just two bytes.

I've used this previously posted solution with Visual Studio 2008. I don't know if it works with later versions of Visual Studio.

   #include <iostream>
   #include <fcntl.h>
   #include <io.h>
   #include <tchar.h>

   <code omitted>


   _setmode(_fileno(stdout), _O_U16TEXT); 

   std::wcout << _T("This is some text to print\n");

I used macros to switch between std::wcout and std::cout, and also to remove the _setmode call for ASCII builds, thus allowing compiling for either ASCII or UNICODE. This works. I have not yet tested using std::endl, but I think that might work with wcout and Unicode (not sure), i.e.

   std::wcout << _T("This is some text to print") << std::endl;
Archi answered 30/9, 2013 at 19:46 Comment(1)
Unicode is not just two bytes, because it's not an encoding but a character set: The difference between UTF-8 and Unicode?Konstantin
