How to portably write std::wstring to file?
Asked Answered
V

10

24

I have a wstring declared as such:

// random wstring
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

The literal would be UTF-8 encoded, because my source file is.

[EDIT: According to Mark Ransom this is not necessarily the case, the compiler will decide what encoding to use - let us instead assume that I read this string from a file encoded in e.g. UTF-8]

I would very much like to get this into a file that reads (when the text editor is set to the correct encoding)

abcàdëefŸg€hhhhhhhµa

but ofstream is not very cooperative (refuses to take wstring parameters), and wofstream supposedly needs to know locale and encoding settings. I just want to output this set of bytes. How does one normally do this?

EDIT: It must be cross platform, and should not rely on the encoding being UTF-8. I just happen to have a set of bytes stored in a wstring, and want to output them. It could very well be UTF-16, or plain ASCII.

Vierra answered 29/10, 2010 at 16:31 Comment(5)
Win32 API provides WideCharToMultiByte for this purpose.Fledge
I need a cross platform solution, sorry.Vierra
Why not use the standard locale functionality from C++? stdcxx.apache.org/doc/stdlibref/codecvt-byname.htmlDramshop
@basilevs: see comment to your answerVierra
More information on the encoding of L"" strings: #1810843Microbiology
E
7

Why not write the file as binary? Just use ofstream with the std::ios::binary flag. The editor should be able to interpret it then. Don't forget the byte order mark 0xFEFF at the beginning. You might be better off writing with a library (see the sketch after the links below); try one of these:

http://www.codeproject.com/KB/files/EZUTF.aspx

http://www.gnu.org/software/libiconv/

http://utfcpp.sourceforge.net/
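
A minimal sketch of the raw-byte approach, assuming you simply want the wstring's bytes dumped to disk unchanged (the element size and byte order will then be whatever wchar_t is on the build platform):

#include <fstream>
#include <string>

int main()
{
    std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

    // Binary mode: no newline translation, bytes pass through untouched.
    std::ofstream out("dump.bin", std::ios::binary);

    // Optional byte order mark; only meaningful if the data really is UTF-16/UTF-32.
    // wchar_t bom = 0xFEFF;
    // out.write(reinterpret_cast<const char*>(&bom), sizeof bom);

    // Write the wstring's bytes exactly as they are stored in memory.
    out.write(reinterpret_cast<const char*>(str.data()),
              static_cast<std::streamsize>(str.size() * sizeof(wchar_t)));
}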

Ethanethane answered 29/10, 2010 at 16:57 Comment(2)
The problem is that I won't know that this is UTF-8, so I'll have to do without the BOM. But still, I'll see if I can use binary. It's a bit hairy for what I'm doing, though - I'd rather avoid it if possible.Vierra
I have decided to drop unicode support, it's not worth it in my case. Yet, I feel this answer was the closest one to a working solution, so you get the accepted status (at least for now).Vierra
S
48

For std::wstring you need std::wofstream

std::wofstream f(L"C:\\some file.txt");
f << str;
f.close();
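
As the comment below notes, a bare wofstream converts through the default locale, which on Windows typically mangles anything outside the 8-bit code page. A sketch of the usual workaround, assuming the platform actually provides a UTF-8 locale under this name (the std::locale constructor throws if it does not):

#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    std::wstring str = L"abcàdëefŸg€hhhhhhhµa";
    std::wofstream f("some file.txt");
    try {
        // Named locales are platform-specific; "en_US.UTF-8" is a common name on Linux.
        f.imbue(std::locale("en_US.UTF-8"));
    } catch (const std::runtime_error&) {
        // Locale not installed; the stream falls back to the default conversion.
    }
    f << str;
}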
Sickly answered 14/8, 2013 at 8:11 Comment(1)
This doesn't work in Windows if the string actually contains non-8-bit charactersVietcong
G
16

std::wstring is for something like UTF-16 or UTF-32, not UTF-8. For UTF-8, you probably just want to use std::string, and write out via std::cout. Just FWIW, C++0x will have Unicode literals, which should help clarify situations like this.
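
A sketch of that approach, assuming the bytes are already valid UTF-8 (for instance read from a UTF-8 encoded file); plain char-based streams pass them through untouched:

#include <fstream>
#include <string>

int main()
{
    // The bytes are assumed to already be UTF-8; no wide types involved.
    std::string utf8 = "abc\xC3\xA0" "def";   // "abcàdef" spelled out as raw UTF-8 bytes
    std::ofstream out("utf8.txt", std::ios::binary);
    out << utf8;   // written byte for byte, no locale conversion
}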

Genisia answered 29/10, 2010 at 16:39 Comment(34)
Unfortunately I very much need wstring for UTF-8. UTF-8 code points can take up several bytes, and I need to be able to manipulate the string.Vierra
In practice it's worth noting that newer versions of MinGW g++ (for Windows) support UTF-8 with a BOM, so that g++ can compile UTF-8 encoded source code that can also be compiled with Visual C++.Auntie
@oystein: what Jerry is telling you is (1) that wstring does not give you UTF-8 encoding, and (2) that string does, if your source code is UTF-8 encoded. Cheers & hth.,Auntie
@alf: Are you saying that storing a UTF-8 string in a std::wstring will mess up the encoding? That is not my experience...Vierra
@oystein: wstring simply isn't UTF-8. You can store UTF-8 in a std::string, but you must be very careful using string methods such as find.Bookplate
@roger: What do you mean it isn't UTF-8? As far as I know it's just a string class implemented with wchar_t; the encoding should not matterVierra
@oystein: uhm, disregard my last comment. With g++ you get UTF-8, with MSVC 10.0 you get translation to Windows ANSI. Thinking about it, it depends on the "execution character set", so no compiler-independent guarantee. :-( And one problem with the C++0x Unicode literals: they don't seem to be supported by MSVC 10.0, although the types are supported. Cheers, & sorry for the disinformation,Auntie
@oystein: wchar_t can't (reasonably) represent UTF-8 — its entire raison d'être is to represent wide characters instead of a multibyte encoding.Bookplate
@alf: I don't think I understand what you are talking about, UTF-8 source files compile fine for me...?Vierra
@roger: I don't see why that would be a problem?Vierra
@oystein: regarding compilation, older versions of g++ choked on a BOM (Byte Order Mark) at the start of an UTF-8 source code file. The snag was/is that Visual C++ required the BOM. Cheers,Auntie
@alf: Ah, ok. I'm using g++ 4.5.1, which seems to handle it without problemsVierra
wstring is a string represented by Unicode code points, which have a constant length (Microsoft thinks in its own way) of 6 bytes, but are usually implemented as 4 or 2. UTF-8 is a multibyte representation which encodes Unicode code points as sequences of 1-6 bytes.Dramshop
No, wstring is just a basic_string<wchar_t>. Nothing more.Vierra
@oystein: yes, but the whole point of UTF-8 is to encode a code point into 8-bit "chunks". wchar_t is specifically intended for dealing with "chunks" that are larger than 8 bits. As such, while you can store UTF-8 into a wchar_t, it's utterly pointless to do so. char is guaranteed to be (at least) 8 bits, which (in turn) guarantees that it will hold UTF-8 data without a problem.Genisia
@jerry: The problem is that many common UTF-8 characters use two (or more) "chunks", and as such create a major headache when assuming that each element (char) in a std::string is a character, which it won't be in that case. Using a wstring, there is more space in each element, and the probability of an element being a whole character increases.Vierra
@oystein: storing utf-8 in a wstring will be exactly identical to storing it in a string, except you'll always be wasting 1 or 3 bytes for every element. The wchar_t's do not magically absorb multi-byte sequences.Sacksen
@Inverse: The number of bytes wasted would depend on the platform, but yes. The advantage of using wstring is that I can more safely assume that each element contains one character, not e.g. half of one.Vierra
@oystein: That's true only if you/your editor actually encodes that character into UTF-16 or UTF-32/UCS-4. Codepoint X converted to UTF-8 will always use the same number of bytes, and they'll always be 8 bits apiece -- storing them into something larger will just waste space. For wchar_t to do any good, you need to use UTF-16 or UTF-32/UCS-4 (depending on what size of wchar_t your compiler supports -- MS => 16 bits, gcc => 32 bits).Genisia
@oystein you can't safely assume that. Microsoft uses UTF-16 for their wide strings. That means only two bytes per unit and up to six per character.Dramshop
@basilevs: That's why I said "more safely" - compared to std::stringVierra
@jerry: Not sure if I'm getting what you're trying to say here, but according to en.wikipedia.org/wiki/UTF-8, UTF-8 is a variable length encoding, which in UTF-8's case means that a character could be 1 byte (8 bits) or more.Vierra
@jerry implies that wchar_t is not supposed to store multibyte encodings. His claim is true but irrelevant as your code doesn't try to do so. You are working with wide strings only, not multibyte ones.Dramshop
@basilevs: I do not get what you are saying, my strings certainly contain multiple bytes :) And UTF-8 is a variable length encoding, which implies that it could be multibyte.Vierra
By multiple I mean codepoint of variable length. Yours are (more or less) of constant length.Dramshop
And you are not using UTF-8 at runtime in your exampleDramshop
@basilevs: No, UTF-8 codepoints are of variable length - am I misunderstanding you completely here? And could you please clarify what you mean by "using UTF-8 at runtime"?Vierra
You are using wide characters at runtime. It is UTF-16 on windows, UCS32 (might be wrong) on Linux. No UTF-8 here. UTF-8 codepoints are of variable length but you are not using it at runtime.Dramshop
Why so much misunderstanding? It seemed clear to me that the question was about working with wchar_t strings in the program, then automatically converting to UTF-8 on output. I remember a similar question from a couple of days ago with a good answer but I can't find it now.Microbiology
@Basilevs: What do you mean when you say that wide characters are UTF-16 on Windows? In VC++ a wchar_t is simply a short integer, IIRC. The fact that Windows uses UTF-16 internally does not affect the encoding of the string I store in my variableVierra
@Mark: Lots of misunderstanding here for sure. I don't really see why a conversion is needed, the string is already encoded as UTF-8, I'm just storing it in a wstring. I'm probably doing something fundamentally wrong.Vierra
@oystein, that's part of the misunderstanding - even if your .cpp is in UTF-8, the string is not. It's Unicode all right, but it's in whatever format your compiler generates for wchar_t which most certainly won't be UTF-8.Microbiology
@Mark: Ah, now we are starting to make sense. Are you sure about this? Got any references? I was told that the encoding would be determined by the document encoding. Anyway, that does not really change anything.Vierra
I was told that the document encoding is left as-is when there is no L prefix.Dramshop
D
6

C++ has means to convert wide characters to a localized narrow encoding on output or file write. Use a codecvt facet for that purpose.

You may use the standard std::codecvt_byname, or a non-standard codecvt implementation.

#include <cwchar>   // std::mbstate_t
#include <iostream> // std::wcout
#include <locale>
using namespace std;

int main()
{
    typedef codecvt_byname<wchar_t, char, mbstate_t> Cvt;
    locale utf8locale(locale(), new Cvt("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}

Beware that on some platforms codecvt_byname can only convert for locales that are installed on the system. I therefore recommend searching Stack Overflow for "utf8 codecvt" and choosing from the many custom codecvt implementations referenced there.

EDIT: As the OP states that the string is already encoded, all he should do is remove the L prefix and the "w" from every token of his code.

Dramshop answered 29/10, 2010 at 17:3 Comment(12)
Actually codecvt might be used to perform any conversion needed, but the most common use, and the one provided by the standard library, is input/output.Dramshop
Yes, but I do not want to convert anything, or am I missing something? The string is already encodedVierra
Then why are you making the compiler convert it to Unicode with the L prefix? Just output it with narrow streams.Dramshop
Encoded means stored in an external encoding. In your case you write in an external encoding. Then the compiler converts your code to Unicode, the internal encoding, and stores that in the object module. Therefore if you want to output something you should perform a backward conversion, or stop making the compiler do the unnecessary conversion.Dramshop
@basilevs: The L prefix does not magically make the compiler convert it to Unicode, it just means that the string is a wchar_t literal. A wide string.Vierra
Well you sure know better. Might as well post the output of the test program to make me blush.Dramshop
@basilevs: I'm not trying to be rude or anything. Storing the string as std::string and outputting it with ofstream obviously works. But that does not solve my problem, which is why I created this question in the first place.Vierra
BTW, here are semiworking implementation of codecvt based on winapi and iconv. They illustrate the problem of codepoint sizes: fakedetector.cvs.sourceforge.net/viewvc/fakedetector/fakebase/… fakedetector.cvs.sourceforge.net/viewvc/fakedetector/fakebase/…Dramshop
My point is that a wide literal IS stored in wide code points as a string constant at compile time. Therefore there is no way (except some dirty Microsoft hacks) to output that constant without some kind of conversion (Windows allows UTF-16 output). Conversion may be done by an explicit function call or by imbuing the needed locale into a wide output stream.Dramshop
God damn that Microsoft! It's making explanations so much harder!Dramshop
@basilevs: Well, I'll make it easy for you: take that constant and throw it into the nearest trash bin - it was just an example :) The point is that I have a string of unknown encoding (probably UTF-8) stored in a wstring.Vierra
As I mentioned in a comment to another answer, that is almost impossible to do. You can't widen an unknown encoding. Widening is a process of making a code point take more space to ease the processing of data. If you can't widen the input, you should work with it in its raw form. std::string or vector<char> are appropriate containers for that. Narrow streams should be used with unknown encodings.Dramshop
O
5

There is a (Windows-specific) solution that should work for you here. Basically, convert the wstring to the UTF-8 code page and then use ofstream.

#include <windows.h>

#include <fstream>
#include <string>

std::string to_utf8(const wchar_t* buffer, int len)
{
        // First call asks how many UTF-8 bytes are needed.
        int nChars = ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                NULL,
                0,
                NULL,
                NULL);
        if (nChars == 0) return "";

        std::string newbuffer;
        newbuffer.resize(nChars);
        // Second call performs the actual conversion into the buffer.
        ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                const_cast<char*>(newbuffer.c_str()),
                nChars,
                NULL,
                NULL);

        return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
        return to_utf8(str.c_str(), (int)str.size());
}

int main()
{
        std::ofstream testFile;

        testFile.open("demo.xml", std::ios::out | std::ios::binary);

        std::wstring text =
                L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                L"<root description=\"this is a naïve example\">\n</root>";

        std::string outtext = to_utf8(text);

        testFile << outtext;

        testFile.close();

        return 0;
}
Overspread answered 29/10, 2010 at 16:39 Comment(5)
That's all nice, but I won't know the encoding of my string, and as such this won't really help. Also I need to be cross-platformVierra
@luke - I did link to that, in the first line of the first version of the response.Overspread
aaaaahhh, I already had the link in my history, so it looked like plain text. Terribly sorry.Autoharp
@Autoharp - np at all; @oystein - I will leave this here for future reference anyway - sorry it's not useful in your scenario.Overspread
The only response which works for me ... ThanksFraise
C
1

Note that wide streams output only char * variables, so maybe you should try using the c_str() member function to convert a std::wstring and then output it to the file. Then it should probably work?

Castlereagh answered 29/10, 2010 at 16:43 Comment(2)
Did not seem to work for me, not with wofstream and not with ofstreamVierra
Aah oops. Sorry for not being helpful.Castlereagh
R
0

I had the same problem some time ago, and wrote down the solution I found on my blog. You might want to check it out to see if it might help, especially the function wstring_to_utf8.

http://pileborg.org/b2e/blog5.php/2010/06/13/unicode-utf-8-and-wchar_t

Rutter answered 29/10, 2010 at 17:8 Comment(3)
Thank you for that, but it's not quite what I'm after, since I do not know what encoding my string will be in. For this example I just picked UTF-8. Also I don't think wchar_t is guaranteed to be able to contain a 4-byte character (UCS-4)? It is on Linux, but I think Windows users will face some problems here.Vierra
The link is now broken.Knurly
That's not how you spell "had".Pharmacopoeia
A
0
#include <iostream>
#include <string>
#include <filesystem>
#include <fstream>
#include <cstdlib>    // std::system


int main() {
    std::wstring fileName = L"./WOut.txt";
    std::filesystem::path filePath = fileName;
    {
        std::wstring wstr = L"abcàdëefŸg€hhhhhhhµa";
        std::ofstream output (filePath);
        // Note: this narrows each wchar_t to a char, so anything outside
        // the 8-bit range is mangled rather than properly encoded.
        output << std::string (wstr.begin (), wstr.end ()) << std::endl;
    }

    std::system ("cat WOut.txt");   // non-portable; relies on a Unix-like shell
    std::string str;
    {
        std::ifstream in (filePath);
        in >> str;
    }

    std::cout << str << std::endl;
    return 0;
}
Apples answered 3/7 at 13:5 Comment(0)
D
-1

You should not use a UTF-8 encoded source file if you want to write portable code. Sorry.

  std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

(I am not sure if this actually violates the standard, but I think it does. Even if it doesn't, to be safe you should not.)

Yes, purely using std::ostream will not work. There are many ways to convert a wstring to UTF-8. My favorite is using the International Components for Unicode (ICU). It's a big library, but it's great. You get a lot of extras and things you might need in the future.

Duffie answered 29/10, 2010 at 17:41 Comment(4)
Sorry, I feel people don't get the point of this question, maybe I'm not clear enough. The problem is not UTF-8. This was just an example I picked. I will probably read the (w)string from a file, it could have any encoding. The problem is writing it back to a file.Vierra
I see. Then you probably just have to make sure to open the file in binary mode.Duffie
@oystein, Wow, I get your problem now. If you don't know the encoding you can't transform code points. If you can't do that, there is no point in wchar_t. The top voted answer is surely right.Dramshop
Probably, see inf.ig.sh's answer. I might end up with that. @basilevs: There is a reason I'm using wchar_t. I want to do lots of heavy manipulation on that string before I write it back, and have to rely on each element of my string being one whole character. That's not going to be the case with std::string as soon as you step outside the english-speaking world. With wide strings, it'll be likely enough that I can live with it.Vierra
R
-1

From my experience of working with different character encodings I would recommend that you only deal with UTF-8 at load and save time. You're in for a world of pain if you try to store the internal representation in UTF-8, since a single character could be anything from 1 byte to 4. So simple operations like strlen require looking at every byte to determine the length rather than using the allocated buffer size (although you can optimize by looking at the first byte in each char sequence, e.g. 00..7f is a single-byte char, c2..df indicates a 2-byte char, etc.).

People quite often refer to 'Unicode strings' when they mean UTF-16, and on Windows a wchar_t is a fixed 2 bytes. On Windows I think wchar_t is simply:

typedef SHORT wchar_t;

The full UTF-32 4-byte representation is rarely required and very wasteful; here is what the Unicode Standard (5.0) has to say about it:

"On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occassional character outside the BMP"

In short, use wchar_t as your internal representation and do conversions when loading and saving (and don't worry about full Unicode unless you know you need it).

With regard to performing the actual conversion have a look at the ICU project:

http://site.icu-project.org/
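
If pulling in ICU is too heavy, here is a minimal sketch of the same load/save-time conversion using C++11's std::wstring_convert (deprecated in C++17 but widely available); it assumes the wstring holds UTF-16 or UTF-32 code units, depending on the platform's wchar_t:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wstring wide = L"abcàdëefŸg€hhhhhhhµa";

    // Converts between the platform's wchar_t encoding and UTF-8.
    // On Windows (16-bit wchar_t), std::codecvt_utf8_utf16<wchar_t> is needed
    // instead if characters outside the BMP (surrogate pairs) must survive.
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;

    std::string utf8 = conv.to_bytes(wide);      // save-time conversion
    std::ofstream("out.txt", std::ios::binary) << utf8;

    std::wstring back = conv.from_bytes(utf8);   // load-time conversion
}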

Rotifer answered 29/10, 2010 at 17:48 Comment(6)
Some sensible words here. I was trying to avoid encodings at all, to be honest, since I really won't know what I'll get thrown at me in this case. That makes doing any conversions difficult. Storing it as a vector<char> (or similar) would mean that I have to make my own string class, and unicode support is really not worth that much coding time. It's starting to look like I'm going to drop unicode support for now, but we'll see.Vierra
(1) It's often more useful to know how many bytes are in a string (for memory allocation, disk space, etc.), than it is to know how many characters are in a string. For this purpose, strlen does work correctly for UTF-8.Jaehne
(2) It's not true that "most OSes consider a wchar_t as fixed 2 bytes" or as UTF-16. That's a Windows thing, done for backwards compatibility with UCS-2-based older versions of NT. On Linux, wchar_t is usually UTF-32. So, for cross-platform code, you either need to use UTF-8 or typedef your own UTF-16 / UTF-32 types. Fortunately, the new C++ standard will have char16_t and char32_t.Jaehne
@Jaehne To be honest I spend most of my time in the Win world so I can't argue about other OSes. The Unicode Standard (5.0) states "On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP". That's my main point. With regard to how useful it is to know character sizes rather than byte sizes... try writing any character processing code without knowing character lengths! UTF-8 is great for portability (no byte ordering issues) but not for working in.Rotifer
I've written a lot of string-handling code that doesn't care about character lengths. Consider for example, a routine to convert DOS-style line breaks to Unix-style ones. It doesn't matter if the 3 bytes "\xE2\x82\xAC" represent a single character; you're just going to output them unchanged. All you care about is '\r' and '\n' which are the same in UTF-8 as they are in ASCII.Jaehne
We'll have to agree to disagree. I work with a commercial product that deals with all sorts of encodings and I couldn't imagine trying to work with it in UTF-8 when trying to sync up character position on screen and character position in the buffer.Rotifer
