How to read a UTF-8 encoded file containing Chinese characters and output them correctly on the console?

3

6

I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8, and I need to read them to do some parsing, such as extracting the URLs and the Chinese characters. But I found that when I read a file into a std::string variable and print it to the console, the Chinese characters turn into garbage. I applied boost::regex to the std::string variable and could extract all the URLs, but not the Chinese characters.

How can I solve these problems?

P.S. My .cpp files are encoded as ANSI by default, and the operating system is the Chinese-language edition of Windows 8.

Coppock answered 25/11, 2013 at 14:13 Comment(3)
Sounds like you need to change the 'code page' from UTF-8 to whatever code page your console uses for Chinese characters. Call MultiByteToWideChar to change from UTF-8 to Unicode followed by WideCharToMultiByte to change from Unicode to your local code page.Esmeraldaesmerelda
Most probably the console's fault. Try >'ing to a file. If it turns out to be valid UTF-8 with Chinese characters, then your program is working fine and this is a Windows question. (Of course, you may still need to change your program to work around Windows, but you'll know who's at fault.)Blackcock
@Blackcock Yes, when I redirect the std::string variable into another file, the content is still valid UTF-8 with Chinese characters. My console's code page is "936(ANSI/OEM - 简体中文 GBK)".Coppock
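
Since the redirect test shows the bytes in the std::string are intact UTF-8, the Chinese characters can be picked out by decoding code points directly, without involving the console or a locale. The following is only a minimal sketch under some assumptions: the helper names decode_utf8 and extract_chinese_runs are made up for illustration, "Chinese character" is taken to mean the CJK Unified Ideographs block U+4E00..U+9FFF, and the decoder is simplified (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns the code point, or U+FFFD for a malformed or truncated sequence.
// Simplified: overlong forms and surrogate values are not rejected.
char32_t decode_utf8(const std::string& s, std::size_t& i)
{
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                       // plain ASCII

    int extra;
    char32_t cp;
    if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;                             // invalid lead byte

    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;           // input ended too early
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;      // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Collect each maximal run of CJK Unified Ideographs (U+4E00..U+9FFF)
// as a UTF-8 encoded substring of the input.
std::vector<std::string> extract_chinese_runs(const std::string& text)
{
    std::vector<std::string> runs;
    std::string current;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t start = i;
        char32_t cp = decode_utf8(text, i);
        if (cp >= 0x4E00 && cp <= 0x9FFF) {
            current.append(text, start, i - start); // keep the original bytes
        } else if (!current.empty()) {
            runs.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) runs.push_back(current);
    return runs;
}
```

The returned strings are still UTF-8, so they can be written to a file as-is; printing them on a code page 936 console still needs a conversion step.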
8

This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it will work with Chinese characters. Check the documentation for _setmode and codecvt_utf8 for more information.

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <fcntl.h>
#include <io.h>

using namespace std;    // Sorry for this!

void read_all_lines(const wchar_t *filename)
{
    wifstream wifs;
    wstring txtline;
    int c = 0;

    wifs.open(filename);
    if(!wifs.is_open())
    {
        wcerr << L"Unable to open file" << endl;
        return;
    }
    // We are going to read an UTF-8 file
    wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));
    while(getline(wifs, txtline))
        wcout << ++c << L'\t' << txtline << L'\n';
    wcout << endl;
}

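// wmain receives the arguments as wide strings, matching read_all_lines
// (_tmain/_TCHAR would need <tchar.h> and a Unicode build to compile here)
int wmain(int argc, wchar_t* argv[])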
{
    // Console output will be UTF-16 characters
    _setmode(_fileno(stdout), _O_U16TEXT);
    if(argc < 2)
    {
        wcerr << L"Filename expected!" << endl;
        return 1;
    }
    read_all_lines(argv[1]);
    return 0;
}

If Chinese characters don't look as expected, make sure the console is using a font that contains the required glyphs (i.e. don't use raster fonts).

Misvalue answered 26/11, 2013 at 4:8 Comment(1)
Did your solution work in other platform or only VC under Windows?Democritus
1

In general, use the wide variants (wstring, wfstream, wcout), set your locales to match the requirements, and put an L in front of string literals. locale::global(locale("")) sets the global locale to the environment default. Any stream that should not follow that default can then be imbued separately, e.g. wcout.imbue(locale("Chinese_China.936")), which might be Microsoft's name for your terminal's locale settings. This has always been enough to do what I want; I hope it works as well for you.

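#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
  // Use the environment's default locale for all streams
  locale::global(locale(""));
  wstring word;
  // Echo each whitespace-separated word from standard input
  while (wcin >> word)
    wcout << word << L'\n';
  wcout << L"好運\n";
}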
Angularity answered 25/11, 2013 at 15:41 Comment(0)
0

If you need to display the characters correctly, you can use GNU libiconv. If you only need to process URLs, std::string works fine. The problem is the Windows console's code page, not the string itself. How locales behave depends on the OS and on the C++ standard library implementation, so I don't encourage using them.

Windows' MultiByteToWideChar may help, but you need to check Microsoft's documentation on how these functions convert strings.
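
As a rough illustration of that conversion route, the sketch below goes from UTF-8 to UTF-16 with MultiByteToWideChar and then to the console's current output code page (936 in the asker's case) with WideCharToMultiByte. The function name to_console_encoding is hypothetical, error handling is reduced to returning an empty string, and the non-Windows branch simply assumes the terminal already accepts UTF-8.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Convert UTF-8 bytes to the encoding the console expects.
// Windows: UTF-8 -> UTF-16 -> console output code page (e.g. 936 for GBK).
// Elsewhere: returned unchanged, assuming a UTF-8 terminal.
std::string to_console_encoding(const std::string& utf8)
{
#ifdef _WIN32
    if (utf8.empty()) return std::string();
    const int in_len = static_cast<int>(utf8.size());

    // Step 1: UTF-8 -> UTF-16. The first call only reports the length needed.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, NULL, 0);
    if (wlen <= 0) return std::string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, &wide[0], wlen);

    // Step 2: UTF-16 -> the code page the console is currently using.
    UINT cp = GetConsoleOutputCP();
    int len = WideCharToMultiByte(cp, 0, wide.data(), wlen,
                                  NULL, 0, NULL, NULL);
    if (len <= 0) return std::string();
    std::string out(len, '\0');
    WideCharToMultiByte(cp, 0, wide.data(), wlen, &out[0], len, NULL, NULL);
    return out;
#else
    return utf8;  // assumption: POSIX terminals typically take UTF-8 directly
#endif
}
```

With this, something like std::cout << to_console_encoding(line) would print each line read from the file. Characters that have no mapping in the target code page are replaced by that code page's default character, so the conversion can be lossy.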

Subsistent answered 25/11, 2013 at 15:58 Comment(0)
