How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?
With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.
To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream buffer with it:
#include <sstream>
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is a Microsoft extension; the locale takes ownership of the facet
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
which can be used like this:
std::wstring wstr = readFile("a.txt");
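Since std::locale::empty() is only available with the Microsoft library, here is a minimal sketch of the same idea that sticks to standard calls. It assumes a toolchain that still ships <codecvt> (deprecated since C++17), and the name readFileUtf8 is made up for illustration. The stream is imbued before the file is opened so that no input has been read with the old locale.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical portable variant of readFile (the name is illustrative).
// std::locale() copies the current global locale; the new facet replaces
// its codecvt facet. The locale owns the facet, so no delete is needed.
std::wstring readFileUtf8(const char* filename)
{
    std::wifstream wif;
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    wif.open(filename, std::ios::binary);
    std::wstringstream wss;
    wss << wif.rdbuf();  // leaves wss empty if the file is missing or empty
    return wss.str();
}
```

It is used the same way as before, for example std::wstring wstr = readFileUtf8("a.txt");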
Alternatively, you can set the global C++ locale before you work with string streams, which causes all future calls to the std::locale default constructor to return a copy of the global C++ locale (you don't need to explicitly imbue stream buffers with it then):
std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
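As a rough illustration of the global-locale approach, the sketch below installs a UTF-8 converting locale and then reads a file through a stream that is never imbued explicitly. The helper names installUtf8GlobalLocale and readFileWithGlobalLocale are hypothetical, and std::locale() is used in place of std::locale::empty() so the code does not depend on the Microsoft extension.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: installs a UTF-8 converting locale globally and
// returns the previous global locale so the caller can restore it later.
std::locale installUtf8GlobalLocale()
{
    return std::locale::global(
        std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
}

// Hypothetical helper: the stream is constructed after the global locale
// was replaced, so it picks up the UTF-8 facet without an imbue call.
std::wstring readFileWithGlobalLocale(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
```

Only streams constructed after the call are affected; streams that already exist keep the locale they were created with.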
[...] new codecvt_utf8 require a corresponding delete? – Bodrogi
According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8.
Here is another pure C++ solution that works at least with VC++ 2010:
#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>
int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}
Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).
[...] delete converter? – Fistula
Here's a platform-specific function for Windows only:
#include <sys/types.h>
#include <sys/stat.h>  // _wstat, struct _stat
#include <cstdio>      // FILE, _wfopen, fread, fclose
#include <string>
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    // _wstat returns 0 on success; treat a failure as an empty file
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}
std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer; // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");
    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }
    // Size in bytes, which is an upper bound on the number of wchar_t read
    size_t filesize = GetSizeOfFile(filename);
    // Read entire file contents into memory
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }
    fclose(f);
    return buffer;
}
Use like so:
std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");
Note that the entire file is loaded into memory, so you might not want to use it for very large files.
Simply using _wfopen has no effect on the coded character set used in the file stream. – Contradistinguish
[...] t in the mode as well as ccs=UTF-8 causes characters to be converted as they are read to and from the stream. – Sundew
[...] _wfopen without the ccs= mode specifier. You need both _wfopen (according to the manual _wfopen_s is to be preferred) and ccs=UTF-8. – Bickerstaff
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>
int main()
{
    std::wifstream wif("filename.txt");
    wif.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}
I recently had to deal with all these encodings and solved it this way. It is better to use std::u32string, as its element size is the same on all platforms, and most fonts work with the UTF-32 format (the file should still be in UTF-8).
#include <fstream>
#include <string>
std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    std::getline(fin, str, U'\0');
    return str;
}
Feel free to use standard functions other than gcount, and save the result of tellg to pos_type only. Also, be sure to pass a separator to std::getline (if you don't, the function throws std::bad_cast).
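For comparison, here is a minimal sketch of another way to end up with a std::u32string: read the file as plain bytes and decode them with std::wstring_convert. This is not the method used above; it assumes <codecvt> is still available (it is deprecated since C++17), and the helper names utf8ToUtf32 and readFileUtf32 are invented for the example.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error if the input cannot be converted.
std::u32string utf8ToUtf32(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}

// Hypothetical helper: reads the raw bytes of a file, then decodes them.
std::u32string readFileUtf32(const std::string& filename)
{
    std::ifstream fin(filename, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(fin)),
                      std::istreambuf_iterator<char>());
    return utf8ToUtf32(bytes);
}
```

Because the byte stream is a plain std::ifstream, this variant does not depend on which facets the locale provides for char32_t streams.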
This question was addressed in Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI. In sum, wstring is based upon the UCS-2 standard, which is the predecessor of UTF-16. This is a strictly two byte standard. I believe this covers Arabic.
wstring is just an array of 16-bit integers on Windows. You can store UCS-2 or UTF-16 data or whatever you like in it. Most Windows APIs do accept UTF-16 strings nowadays. – Bickerstaff
[...] wstring if you treat it as a UTF-16 string. Non-BMP code points will need two code units, but there's nothing wrong with that. – Bickerstaff
[...] wstring. – Newsom
[...] iostreams library does any kind of Unicode except allowing Unicode file names. All solutions for using Unicode in C++ are effectively pure C solutions, either using the Windows API directly or using nonstandard extensions to the C library. – Bickerstaff
[...] char16_t and char32_t, that would essentially solve the issue for file I/O. – Bickerstaff
This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*?
Something like:
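#include <iostream>
#include <fstream>
#include <string>
std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    std::wstring wstr;
    // Opening a stream from a wchar_t* path is a Microsoft extension
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return wstr;
    size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    // Copy the raw bytes straight into the string's storage with an explicit
    // length (the original cast relied on a terminator that was never written).
    // No decoding happens: this is only correct for files already stored as
    // UTF-16; UTF-8 bytes are not converted.
    wstr.resize(size / sizeof(wchar_t));
    if (!wstr.empty())
        file.read(reinterpret_cast<char*>(&wstr[0]), wstr.size() * sizeof(wchar_t));
    file.close();
    return wstr;
}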
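If the file really is UTF-8, the bytes have to be decoded rather than cast. The following is a minimal sketch of that step, assuming well-formed input: it does not reject overlong encodings or out-of-range values, and the names decodeUtf8 and readUtf8FileAsWstring are hypothetical. It emits UTF-16 surrogate pairs when wchar_t is 16 bits wide (as on Windows) and one element per code point otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into a std::wstring.
std::wstring decodeUtf8(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (bytes.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // Split a non-BMP code point into a UTF-16 surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}

// Hypothetical helper: reads the file as raw bytes, then decodes them.
std::wstring readUtf8FileAsWstring(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        throw std::runtime_error("cannot open file");
    std::string bytes((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());
    return decodeUtf8(bytes);
}
```

A call such as readUtf8FileAsWstring("a.txt") then returns the decoded text, and a leading byte order mark, if present, is kept as the first character.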