Read Unicode UTF-8 file into wstring
H

7

48

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

Harlow answered 23/1, 2011 at 18:4 Comment(4)
By "Unicode" do you mean UTF-8 or UTF-16? And what platform are you using?Ology
Read this article : Reading UTF-8 with C++ streamsErmina
Another good article : UTF-8 with C++ in a Portable WayErmina
On windows, you should use std::string for UTF-8 and std::wstring for UTF-16.Olette
H
45

With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:

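#include <sstream>
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension; the locale takes ownership of the new facet
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();   // copy the whole decoded stream into the string stream
    return wss.str();
}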

which can be used like this:

std::wstring wstr = readFile("a.txt");

Alternatively, you can set the global C++ locale before you work with string streams. All later calls to the std::locale default constructor then return a copy of that global locale, so you no longer need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
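Since std::locale::empty() is not part of the standard library, here is a minimal portable sketch of the same idea that starts from a default-constructed locale instead. It assumes a toolchain that still ships <codecvt> (the header is deprecated since C++17). The names Utf8Facet and readUtf8File are hypothetical, and picking std::codecvt_utf8_utf16 when wchar_t is 16 bits wide is my own assumption, made so that characters outside the BMP come out as surrogate pairs on Windows.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical alias: UTF-8 <-> UTF-16 when wchar_t is 16 bits (Windows),
// UTF-8 <-> UCS-4 when wchar_t is 32 bits (most Unix-like systems).
using Utf8Facet = std::conditional<sizeof(wchar_t) == 2,
                                   std::codecvt_utf8_utf16<wchar_t>,
                                   std::codecvt_utf8<wchar_t>>::type;

// Hypothetical helper: reads a whole UTF-8 file into a std::wstring.
// Returns an empty string if the file cannot be opened.
std::wstring readUtf8File(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale() copies the current global locale; the resulting
    // locale owns the facet, so no delete is needed.
    wif.imbue(std::locale(std::locale(), new Utf8Facet));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
```

It is called the same way as readFile above, for example readUtf8File("a.txt").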
Hardshell answered 15/5, 2012 at 19:1 Comment(4)
Does that new codecvt_utf8 require a corresponding delete?Bodrogi
No need to explicitly delete codecvt_utf8. This is done in the destructor of std::locale when the refcounter of codecvt_utf8 becomes zero (see en.cppreference.com/w/cpp/locale/locale/%7Elocale)Gunthar
For those using this answer, std::locale::empty() has a problem on clang: error: no member named 'empty' in 'std::__1::locale'.Jorgejorgensen
Sadly, all of the useful parts of codecvt have been deprecated in C++20.Affectation
B
14

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8.

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).

Bickerstaff answered 23/1, 2011 at 20:40 Comment(3)
Why don't you delete converter?Fistula
"Overload 7 is typically called with its second argument, f, obtained directly from a new-expression: the locale is responsible for calling the matching delete from its own destructor." linkDennisedennison
This works well. Curious, as I can't find a lot of info on it, and mine works fine without it, what is stream.imbue doing exactly? It seems as though it is setting some type of default type, but is this needed? Also, for first line remark, put your getline in a while(getline(stream, line)) loop to see more than the first line.Hinshaw
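The loop suggested in the last comment could be sketched as follows. Without the imbue call the stream converts bytes with the facet of the global locale, which in the default "C" locale typically does not decode UTF-8, so only plain ASCII content would come through intact. The helper name readUtf8Lines is hypothetical, and the sketch uses std::locale() rather than std::locale::empty() so that it also builds outside MSVC.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of a UTF-8 file as a wide string.
std::vector<std::wstring> readUtf8Lines(const char* filename)
{
    std::wifstream stream(filename);
    // Replace the default conversion facet with a UTF-8 one.
    stream.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));

    std::vector<std::wstring> lines;
    std::wstring line;
    // getline returns the stream, which tests false at end of file or on error.
    while (std::getline(stream, line))
        lines.push_back(line);
    return lines;
}
```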
S
12

Here's a platform-specific function for Windows only:

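#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>

size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    // _wstat returns nonzero on failure; treat that as an empty file
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}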

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

    // Read entire file contents into memory
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use it for very large files.

Sundew answered 23/1, 2011 at 18:24 Comment(6)
Might as well go the whole way: _wfopen(filename.c_str(), L"rt, ccs=UTF-8"); Conversion is now automatic.Contradistinguish
Actually, rolled it back, docs on the _wfopen say it converts to wide characters automatically, and this code doesn't take that in to account.Sundew
Only the filename. Quote: Simply using _wfopen has no effect on the coded character set used in the file stream. Contradistinguish
Are you sure? The way I interpreted the docs, specifying t in the mode as well as ccs=UTF-8 causes characters to be converted as they are read to and from the stream.Sundew
@Ashley: Yes, the quote refers to using _wfopen without the ccs= mode specifier. You need both _wfopen (according to the manual _wfopen_s is to be preferred) and ccs=UTF-8.Bickerstaff
Late edit in August: turns out @Hans Passant's way is better - edited the answer to use that instead!Sundew
A
5
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

int main()
{
    std::wifstream wif("filename.txt");
    wif.imbue(std::locale("zh_CN.UTF-8"));

    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}
Allieallied answered 3/11, 2017 at 3:49 Comment(5)
Hi. Thanks for sharing. Appreciated. Can you add a bit more context? Why this answer to a 6-year-old question? Thanks.Accord
I had the same question recently, but I have solved it now and want to share my solution to help others.Allieallied
That's nice. But how is your answer different from @LihO's answer? You just use a different locale, right?Accord
Didn't work for me. Ended up using <codecvt> from @HardshellSilkworm
Worked for me for reading and writing on Windows using VS2022 and C++20. ThanksSufism
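The mixed results reported above may come down to whether the named locale exists on the machine: the std::locale constructor throws std::runtime_error when it does not recognise the name. A minimal sketch of guarding against that, with a hypothetical helper name (localeOrClassic) and a fallback choice that is my own assumption:

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the named locale if the system provides it,
// otherwise falls back to the classic "C" locale.
std::locale localeOrClassic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        // The name is unknown on this system.
        return std::locale::classic();
    }
}
```

With this, wif.imbue(localeOrClassic("zh_CN.UTF-8")) no longer throws on systems without that locale, although the file is then not decoded as UTF-8, so checking the returned locale's name() before relying on it is advisable.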
A
1

I recently dealt with all the encodings and solved it this way. It is better to use std::u32string, as it has a stable size on all platforms, and most fonts work with the UTF-32 format. (The file should still be in UTF-8.)

std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    std::getline(fin, str, U'\0');
    return str;
}

Feel free to use standard functions other than gcount, and save the result of tellg to pos_type only. Also, be sure to pass a separator to std::getline (if you don't, the function throws std::bad_cast).

Arius answered 7/8, 2022 at 7:49 Comment(0)
C
0

This question was addressed in Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI. In sum, wstring is based upon the UCS-2 standard, which is the predecessor of UTF-16. This is a strictly two byte standard. I believe this covers Arabic.

Chrystal answered 23/1, 2011 at 18:24 Comment(31)
I think you can use wstring with UTF-16Newsom
@David: Actually you are incorrect, and this is a common misunderstanding. UTF-16 covers 1,112,064 code points from 0 to 0x10FFFF. The scheme requires variable-length storage of either one or two 16-bit words, whereas UCS-2 was strictly one 16-bit word. If you trace back the definition of wchar_t, you will find that it has as its root a primitive type of 16 bits (usually a short).Chrystal
@David: Technically, a wstring is just an array of 16-bit integers on Windows. You can store UCS-2 or UTF-16 data or whatever you like in it. Most Windows APIs do accept UTF-16 strings nowadays.Bickerstaff
@Philip I thought all Windows APIs are UTF-16 now. Which ones take UCS-2?Newsom
@Thomas I'm afraid the misunderstanding is on you. I know about variable length of UTF-16 and surrogate pairs. But that is perfectly compatible with wstring. A surrogate pair takes 2 wchar_t elements.Newsom
@Philipp: you can store a subset of UTF-16 characters in a wstring. For example, you cannot store the Balinese script characters in a wstring, but there are valid UTF-16 encodings for these characters. en.wikipedia.org/wiki/Balinese_scriptChrystal
@Thomas that's not correct. UTF-16 uses 16 bit code units, i.e. a wchar_t on Windows.Newsom
@Thomas I have to agree with David. You can store any Unicode code point in a wstring if you treat it as an UTF-16 string. Non-BMP code points will need two code units, but there's nothing wrong with that.Bickerstaff
@Philipp: scratch my previous. I meant to refer to the Brāhmī script, which is even more obscureChrystal
@David: I think (but I'm not sure, I'm not using Windows right now) that the console still doesn't handle non-BMP characters. It is debatable whether that has something to do with the API itself.Bickerstaff
@Thomas anything with a defined Unicode code point can be represented in UTF-16Newsom
@Bickerstaff the console is a whole world of pain! Even getting it to display non ANSI code points is an exercise of extreme masochism!Newsom
@David: No, it's two lines, see blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspxBickerstaff
@Bickerstaff Very interesting! I'm used to Python on Windows which has rubbish console support.Newsom
@David: We seem to be arguing about semantics. You said "I think you can use wstring with UTF-16." That means more than store. It means store and have it interpreted correctly by at least stdio. I just tried using SMP characters with wcout and a wstring on Windows 7 Pro 64-bit, and got a whole lot of gibberish.Chrystal
@Thomas That doesn't mean the problem is with wstring.Newsom
@David I think that's a Python problem, not a Windows problem. I know the Python devs try hard to get Unicode support everywhere, but I think it's hard to bring the actual Windows semantics to a model that assumes that operating system streams are always byte-based and encoding-agnostic (that is true for Unix file and console streams and for Windows file streams, but not for the Windows console). I haven't studied the Python source code, but I think that at least some time in the past they assumed this model to hold.Bickerstaff
@Bickerstaff It's just a real shame that the Windows console feels a little neglected.Newsom
@Thomas: I don't think the MSVC++ iostreams library does any kind of Unicode except allowing Unicode file names. All solutions for using Unicode in C++ are effectively pure C solutions, either using the Windows API directly or using nonstandard extensions to the C library.Bickerstaff
@Philipp, I agree. That's why I say that wstring is UCS-2 and not UTF-16.Chrystal
@David: the problem is not with wstring storage, it's with typical wstring usage and UTF-16. You can store UTF-16 in a bitset if you want, but is that using it with UTF-16? Not really.Chrystal
@thomas what would you use instead of wstring?Newsom
@Thomas: The MSVC++ standard library doesn't support UCS-2 either. Last time I checked, the C++ locales didn't support any Unicode locale, making Unicode output essentially impossible.Bickerstaff
Correction: The MSVC++ library does support UTF-16 and UTF-32 for the types char16_t and char32_t, that would essentially solve the issue for file I/O.Bickerstaff
@David: There's no good answer. What to use I guess depends on framework, platform, specific I/O requirements, etc. In general, if one must support non-BMP, char32_t and UTF-32 seems safer.Chrystal
@Thomas No the question is what you use instead of wstring for UTF-16Newsom
@David, convert it to UTF-32, then use string<char32_t>. Or, in .Net use system.text.UTF32EncodingChrystal
@David, unless, of course, you can guarantee BMP, then there's no issue.Chrystal
@thomas have you heard of surrogate pairs? UTF-16 is designed to be used with 16-bit code units. Outside BMP is fine. Are you aware that UTF-16 can encode all Unicode code points?Newsom
@David, yes I'm aware. The problem is that many APIs that use wstrings don't know the difference. They interpret surrogate pairs as two 16-bit code points. But since the surrogate pairs are in the invalid range of the BMP, they are ignored.Chrystal
@thomas that would be a criticism of the API but your original point is that wstring is no good for storing UTF-16. Anyway which APIs are you referring to. I'm curious to know which ones don't support Unicode.Newsom
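To make the surrogate-pair point in this thread concrete, here is a minimal sketch of how one code point maps to UTF-16 code units. It uses char16_t rather than wchar_t so that it behaves the same on every platform, and the helper name encodeUtf16 is hypothetical. It assumes the input is a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// BMP code points take one unit; everything above U+FFFF takes two.
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                                                 // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For U+11005, which lies in the Brāhmī block mentioned above, this yields the two code units 0xD804 and 0xDC05. On Windows each of them fits in one wchar_t, so a wstring can hold the pair. Whether a given API then interprets the pair correctly is a separate matter.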
B
-6

This is a bit raw, but how about reading the file as plain old bytes then cast the byte buffer to wchar_t* ?

Something like:

#include <iostream>
#include <fstream>
std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    std::wstring wstr;
    std::ifstream file (filepath.c_str(), std::ios::in|std::ios::binary|std::ios::ate);
    size_t size = (size_t)file.tellg();
    file.seekg (0, std::ios::beg);
    char* buffer = new char [size];
    file.read (buffer, size);
    wstr = (wchar_t*)buffer;
    file.close();
    delete[] buffer;
    return wstr;
}
Bicarbonate answered 18/10, 2012 at 20:53 Comment(1)
I think it won't work -- the file contains UTF-8 not a sequence of wchar_t.Antony
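Reading the raw bytes is workable as long as they are decoded rather than cast. Below is a minimal sketch of that variant, which also avoids the deprecated <codecvt> header. The helper names decodeUtf8 and readUtf8FileAsWide are hypothetical. The decoder is deliberately simple: it replaces malformed sequences with U+FFFD, does not reject overlong encodings, and does not strip a leading byte order mark.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into a std::wstring
// (UTF-16 when wchar_t is 16 bits, UTF-32 otherwise).
std::wstring decodeUtf8(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0xFFFD;   // stays U+FFFD for an invalid lead byte
        int extra = 0;          // number of continuation bytes expected
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        for (; extra > 0; --extra, ++i) {
            unsigned char c = i < bytes.size() ? static_cast<unsigned char>(bytes[i]) : 0;
            if ((c & 0xC0) != 0x80) { cp = 0xFFFD; break; }  // truncated or malformed
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: reads the file as plain bytes, then decodes them.
std::wstring readUtf8FileAsWide(const char* path)
{
    std::ifstream file(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());
    return decodeUtf8(bytes);
}
```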
