Read Unicode UTF-8 file into wstring

Asked 23/1, 2011 at 18:4 Answered 7/8, 2022 at 7:49

c++ file unicode utf-8 wstring

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

Harlow answered 23/1, 2011 at 18:4 Comment(4)

By "Unicode" do you mean UTF-8 or UTF-16? And what platform are you using? – Ology 23/1, 2011 at 18:7

Read this article : Reading UTF-8 with C++ streams – Ermina 23/1, 2011 at 18:25

Another good article : UTF-8 with C++ in a Portable Way – Ermina 23/1, 2011 at 18:27

On windows, you should use std::string for UTF-8 and std::wstring for UTF-16. – Olette 23/1, 2011 at 19:28
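As a rough illustration of the std::string side of that last comment, the sketch below reads a file's bytes unchanged into a std::string and drops a leading UTF-8 byte order mark if one is present. The function name readUtf8Bytes is made up for this example and is not from any library.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's raw bytes, assumed to be UTF-8.
// Returns an empty string if the file cannot be opened.
std::string readUtf8Bytes(const char* filename)
{
    std::ifstream in(filename, std::ios::binary);  // binary: no newline translation
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    // Drop the optional UTF-8 byte order mark (EF BB BF).
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);
    return bytes;
}
```

The result is still UTF-8; converting it to a wstring is a separate step.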

With C++11 support, you can use the std::codecvt_utf8 facet, which converts between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string. It can be used to read and write UTF-8 files, both text and binary.

To use a facet, you usually create a locale object, which bundles culture-specific information as a set of facets that together define a localized environment. Once you have a locale object, you can imbue your stream with it:

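#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension; the locale takes ownership of the facet.
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();   // copy the whole decoded stream
    return wss.str();
}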

which can be used like this:

std::wstring wstr = readFile("a.txt");

Alternatively, you can set the global C++ locale before you work with streams. All later calls to the std::locale default constructor then return a copy of the global locale, so you don't need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
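A comment on this answer notes that std::locale::empty() is not available outside MSVC. One possible portable variant, offered as a sketch rather than a drop-in replacement, builds the locale from a copy of the global locale instead. The name readUtf8File is invented here, and std::codecvt_utf8 has been deprecated since C++17, so compilers may warn.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical portable variant of readFile: std::locale() copies the
// current global locale, and the UTF-8 facet replaces its codecvt facet.
std::wstring readUtf8File(const char* filename)
{
    std::wifstream wif(filename);
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();   // copy the whole decoded stream
    return wss.str();
}
```

Where wchar_t is 16 bits wide, as on Windows, std::codecvt_utf8<wchar_t> produces UCS-2, so code points outside the BMP are not representable with this facet.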

Hardshell answered 15/5, 2012 at 19:1 Comment(4)

Does that new codecvt_utf8 require a corresponding delete? – Bodrogi 5/9, 2016 at 6:45

No need to explicitly delete codecvt_utf8. This is done in the destructor of std::locale when the refcounter of codecvt_utf8 becomes zero (see en.cppreference.com/w/cpp/locale/locale/%7Elocale) – Gunthar 14/10, 2016 at 16:0

For those using this answer, std::locale::empty() has a problem on clang: error: no member named 'empty' in 'std::__1::locale'. – Jorgejorgensen 21/3, 2019 at 22:55

Sadly, all of the useful parts of codecvt have been deprecated in C++20. – Affectation 19/11, 2020 at 14:10
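Given that deprecation, one way to avoid <codecvt> entirely is to decode the bytes by hand. The following is a minimal sketch under stated assumptions, not a hardened decoder: utf8ToWide is a hypothetical name, the input is expected to be the file's bytes already in memory, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into a wide string without <codecvt>.
// Produces UTF-16 where wchar_t is 16 bits and UTF-32 where it is 32 bits.
std::wstring utf8ToWide(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The bytes can come from a std::ifstream opened in binary mode; this function only handles the conversion step.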

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8.

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).

Bickerstaff answered 23/1, 2011 at 20:40 Comment(3)

Why don't you delete converter? – Fistula 28/9, 2013 at 19:34

"Overload 7 is typically called with its second argument, f, obtained directly from a new-expression: the locale is responsible for calling the matching delete from its own destructor." link – Dennisedennison 29/7, 2015 at 18:17

This works well. Curious, as I can't find a lot of info on it, and mine works fine without it, what is stream.imbue doing exactly? It seems as though it is setting some type of default type, but is this needed? Also, for first line remark, put your getline in a while(getline(stream, line)) loop to see more than the first line. – Hinshaw 25/9, 2016 at 3:55
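To make the two points in that last comment concrete: imbue attaches a locale to the stream, and the codecvt facet in that locale is what turns the file's bytes into wchar_t values as they are read. Without it the stream uses the default conversion, which is likely why plain ASCII files appear to work either way. The sketch below reads every line with a while loop; readUtf8Lines is a made-up name, and std::locale() is used here in place of the MSVC-only std::locale::empty().

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of a UTF-8 file as a wide string.
std::vector<std::wstring> readUtf8Lines(const char* filename)
{
    std::wifstream stream(filename);
    // The facet imbued here decodes UTF-8 bytes into wchar_t on each read.
    stream.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(stream, line))   // stops at end of file or on error
        lines.push_back(line);
    return lines;
}
```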

Here's a platform-specific function for Windows only:

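#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>

// Returns the file size in bytes, or 0 if the file cannot be queried.
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return fileinfo.st_size;
}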

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

    // Read entire file contents in to memory
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use this for very large files.

Sundew answered 23/1, 2011 at 18:24 Comment(6)

Might as well go the whole way: _wfopen(filename.c_str(), L"rt, ccs=UTF-8"); Conversion is now automatic. – Contradistinguish 23/1, 2011 at 18:46

Actually, rolled it back, docs on the _wfopen say it converts to wide characters automatically, and this code doesn't take that in to account. – Sundew 23/1, 2011 at 19:4

Only the filename. Quote: Simply using _wfopen has no effect on the coded character set used in the file stream. – Contradistinguish 23/1, 2011 at 20:4

Are you sure? The way I interpreted the docs, specifying t in the mode as well as ccs=UTF-8 causes characters to be converted as they are read to and from the stream. – Sundew 23/1, 2011 at 20:33

@Ashley: Yes, the quote refers to using _wfopen without the ccs= mode specifier. You need both _wfopen (according to the manual _wfopen_s is to be preferred) and ccs=UTF-8. – Bickerstaff 23/1, 2011 at 20:42

Late edit in August: turns out @Hans Passant's way is better - edited the answer to use that instead! – Sundew 11/8, 2011 at 15:42

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

int main()
{
    std::wifstream wif("filename.txt");
    wif.imbue(std::locale("zh_CN.UTF-8"));

    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}

Allieallied answered 3/11, 2017 at 3:49 Comment(5)

Hi. Thanks for sharing. Appreciated. Can you add a bit more context? Why add this answer to a six-year-old question? Thanks. – Accord 3/11, 2017 at 4:22

I had the same question recently and have solved it now; I want to share my solution to help others. – Allieallied 3/11, 2017 at 5:46

That's nice. But how is your answer different from @LihO's answer? You just use a different locale, right? – Accord 3/11, 2017 at 6:43

Didn't work for me. Ended up using <codecvt> from @Hardshell – Silkworm 15/11, 2019 at 18:29

Worked for me for reading and writing on Windows using VS2022 and C++20. Thanks – Sufism 4/7, 2023 at 8:56

Recently dealt with all the encodings, solved this way. It is better to use std::u32string as it has stable size on all platforms, and most fonts work with utf-32 format. (the file should still be in utf-8)

std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    std::getline(fin, str, U'\0');
    return str;
}

Feel free to use standard functions other than gcount, and save the result of tellg to pos_type only. Also, be sure to pass a delimiter to std::getline (if you don't, the function throws std::bad_cast).

Arius answered 7/8, 2022 at 7:49 Comment(0)

This question was addressed in Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI. In sum, wstring is based upon the UCS-2 standard, which is the predecessor of UTF-16. This is a strictly two byte standard. I believe this covers Arabic.

Chrystal answered 23/1, 2011 at 18:24 Comment(31)

I think you can use wstring with UTF-16 – Newsom 23/1, 2011 at 19:2

@David: Actually you are incorrect, and this is a common misunderstanding. UTF-16 covers 1,112,064 code points from 0 to 0x10FFFF. The scheme requires variable-length storage of either one or two 16-bit words, whereas UCS-2 was strictly one 16-bit word. If you trace back the definition of wchar_t, you will find that it has as its root a primitive type of 16 bits (usually a short). – Chrystal 23/1, 2011 at 19:59

@David: Technically, a wstring is just an array of 16-bit integers on Windows. You can store UCS-2 or UTF-16 data or whatever you like in it. Most Windows APIs do accept UTF-16 strings nowadays. – Bickerstaff 23/1, 2011 at 20:8

@Philip I thought all Windows APIs are UTF-16 now. Which ones take UCS-2? – Newsom 23/1, 2011 at 20:10

@Thomas I'm afraid the misunderstanding is on you. I know about variable length of UTF-16 and surrogate pairs. But that is perfectly compatible with wstring. A surrogate pair takes 2 wchar_t elements. – Newsom 23/1, 2011 at 20:13

@Philipp: you can store a subset of UTF-16 characters in a wstring. For example, you cannot store the Balinese script characters in a wstring, but there are valid UTF-16 encodings for these characters. en.wikipedia.org/wiki/Balinese_script – Chrystal 23/1, 2011 at 20:15

@Thomas that's not correct. UTF-16 uses 16 bit code units, i.e. a wchar_t on Windows. – Newsom 23/1, 2011 at 20:18

@Thomas I have to agree with David. You can store any Unicode code point in a wstring if you treat it as an UTF-16 string. Non-BMP code points will need two code units, but there's nothing wrong with that. – Bickerstaff 23/1, 2011 at 20:22

@Philipp: scratch my previous. I meant to refer to the Brāhmī script, which is even more obscure – Chrystal 23/1, 2011 at 20:23

@David: I think (but I'm not sure, I'm not using Windows right now) that the console still doesn't handle non-BMP characters. It is debatable whether that has something to do with the API itself. – Bickerstaff 23/1, 2011 at 20:23

@Thomas anything with a defined Unicode code point can be represented in UTF-16 – Newsom 23/1, 2011 at 20:24

@Bickerstaff the console is a whole world of pain! Even getting it to display non ANSI code points is an exercise of extreme masochism! – Newsom 23/1, 2011 at 20:25

@David: No, it's two lines, see blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx – Bickerstaff 23/1, 2011 at 20:33

@Bickerstaff Very interesting! I'm used to Python on Windows which has rubbish console support. – Newsom 23/1, 2011 at 20:36

@David: We seem to be arguing about semantics. You said "I think you can use wstring with UTF-16." That means more than store. It means store and have it interpreted correctly by at least stdio. I just tried using SMP characters with wcout and a wstring on Windows 7 Pro 64-bit, and got a whole lot of gibberish. – Chrystal 23/1, 2011 at 20:38

@Thomas That doesn't mean the problem is with wstring. – Newsom 23/1, 2011 at 20:40

@David I think that's a Python problem, not a Windows problem. I know the Python devs try hard to get Unicode support everywhere, but I think it's hard to bring the actual Windows semantics to a model that assumes that operating system streams are always byte-based and encoding-agnostic (that is true for Unix file and console streams and for Windows file streams, but not for the Windows console). I haven't studied the Python source code, but I think that at least some time in the past they assumed this model to hold. – Bickerstaff 23/1, 2011 at 20:48

@Bickerstaff It's just a real shame that the Windows console feels a little neglected. – Newsom 23/1, 2011 at 20:49

@Thomas: I don't think the MSVC++ iostreams library does any kind of Unicode except allowing Unicode file names. All solutions for using Unicode in C++ are effectively pure C solutions, either using the Windows API directly or using nonstandard extensions to the C library. – Bickerstaff 23/1, 2011 at 20:50

@Philipp, I agree. That's why I say that wstring is UCS-2 and not UTF-16. – Chrystal 23/1, 2011 at 20:53

@David: the problem is not with wstring storage, it's with typical wstring usage and UTF-16. You can store UTF-16 in a bitset if you want, but is that using it with UTF-16? Not really. – Chrystal 23/1, 2011 at 20:59

@thomas what would you use instead of wstring? – Newsom 23/1, 2011 at 21:9

@Thomas: The MSVC++ standard library doesn't support UCS-2 either. Last time I checked, the C++ locales didn't support any Unicode locale, making Unicode output essentially impossible. – Bickerstaff 23/1, 2011 at 21:20

Correction: The MSVC++ library does support UTF-16 and UTF-32 for the types char16_t and char32_t, that would essentially solve the issue for file I/O. – Bickerstaff 23/1, 2011 at 22:25

@David: There's no good answer. What to use I guess depends on framework, platform, specific I/O requirements, etc. In general, if one must support non-BMP, char32_t and UTF-32 seems safer. – Chrystal 23/1, 2011 at 22:50

@Thomas No the question is what you use instead of wstring for UTF-16 – Newsom 23/1, 2011 at 22:53

@David, convert it to UTF-32, then use string<char32_t>. Or, in .Net use system.text.UTF32Encoding – Chrystal 23/1, 2011 at 23:4

@David, unless, of course, you can guarantee BMP, then there's no issue. – Chrystal 23/1, 2011 at 23:6

@thomas have you heard of surrogate pairs? UTF-16 is designed to be used with 16-bit code units. Outside BMP is fine. Are you aware that UTF-16 can encode all Unicode code points? – Newsom 23/1, 2011 at 23:11

@David, yes I'm aware. The problem is that many APIs that use wstrings don't know the difference. They interpret surrogate pairs as two 16-bit code points. But since the surrogate pairs are in the invalid range of the BMP, they are ignored. – Chrystal 23/1, 2011 at 23:21

@thomas that would be a criticism of the API, but your original point is that wstring is no good for storing UTF-16. Anyway, which APIs are you referring to? I'm curious to know which ones don't support Unicode. – Newsom 23/1, 2011 at 23:26
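The surrogate-pair arithmetic argued over in these comments is small enough to write out. The sketch below (encodeUtf16 is a hypothetical name, not a library function) encodes one code point as UTF-16: values up to U+FFFF take one 16-bit code unit, and anything above takes two.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-16.
std::u16string encodeUtf16(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));                     // BMP: one code unit
    } else {
        cp -= 0x10000;                                                // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits, a wstring holds such a pair as two consecutive elements, which is the storage the commenters agree on; whether a given API then interprets the pair correctly is a separate question.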

-6

This is a bit raw, but how about reading the file as plain old bytes then cast the byte buffer to wchar_t* ?

Something like:

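#include <fstream>
#include <string>
#include <vector>

std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    // The wchar_t* path overload of std::ifstream is an MSVC extension.
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return std::wstring();
    const size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    file.read(buffer.data(), size);
    // Reinterpret the raw bytes as wchar_t units, passing the length explicitly
    // because the buffer is not null-terminated. No UTF-8 decoding happens here.
    return std::wstring((const wchar_t*)buffer.data(), size / sizeof(wchar_t));
}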

Bicarbonate answered 18/10, 2012 at 20:53 Comment(1)

I think it won't work -- the file contains UTF-8 not a sequence of wchar_t. – Antony 12/5, 2021 at 18:45
