How to portably write std::wstring to file?
Asked Answered
V

10

24

I have a wstring declared as such:

// random wstring
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

The literal would be UTF-8 encoded, because my source file is.

[EDIT: According to Mark Ransom this is not necessarily the case, the compiler will decide what encoding to use - let us instead assume that I read this string from a file encoded in e.g. UTF-8]

I would very much like to get this into a file that reads (when the text editor is set to the correct encoding)

abcàdëefŸg€hhhhhhhµa

but ofstream is not very cooperative (refuses to take wstring parameters), and wofstream supposedly needs to know locale and encoding settings. I just want to output this set of bytes. How does one normally do this?

EDIT: It must be cross platform, and should not rely on the encoding being UTF-8. I just happen to have a set of bytes stored in a wstring, and want to output them. It could very well be UTF-16, or plain ASCII.

Vierra answered 29/10, 2010 at 16:31 Comment(5)
Win32 API provides WideCharToMultiByte for this purpose.Fledge
I need a cross platform solution, sorry.Vierra
Why not use the standard locale functionality from C++? stdcxx.apache.org/doc/stdlibref/codecvt-byname.htmlDramshop
@basilevs: see comment to your answerVierra
More information on the encoding of L"" strings: #1810843Microbiology
E
7

Why not write the file as binary? Just use ofstream with the std::ios::binary flag. The editor should be able to interpret it then. Don't forget the byte order mark 0xFEFF at the beginning. You might be better off writing with a library (see the sketch after the links below); try one of these:

http://www.codeproject.com/KB/files/EZUTF.aspx

http://www.gnu.org/software/libiconv/

http://utfcpp.sourceforge.net/
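
A minimal sketch of the raw-byte approach, assuming you simply want the wstring's bytes dumped to disk unchanged (the element size and byte order will then be whatever wchar_t is on the build platform):

#include <fstream>
#include <string>

int main()
{
    std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

    // Binary mode: no newline translation, bytes pass through untouched.
    std::ofstream out("dump.bin", std::ios::binary);

    // Optional byte order mark; only meaningful if the data really is UTF-16/UTF-32.
    // wchar_t bom = 0xFEFF;
    // out.write(reinterpret_cast<const char*>(&bom), sizeof bom);

    // Write the wstring's bytes exactly as they are stored in memory.
    out.write(reinterpret_cast<const char*>(str.data()),
              static_cast<std::streamsize>(str.size() * sizeof(wchar_t)));
}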

Ethanethane answered 29/10, 2010 at 16:57 Comment(2)
The problem is that I won't know that this is UTF-8, so I'll have to do without the BOM. But still, I'll see if I can use binary. It's a bit hairy for what I'm doing, though - I'd rather avoid it if possible.Vierra
I have decided to drop unicode support, it's not worth it in my case. Yet, I feel this answer was the closest one to a working solution, so you get the accepted status (at least for now).Vierra
S
48

For std::wstring you need std::wofstream

std::wofstream f(L"C:\\some file.txt");
f << str;
f.close();
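
As the comment below notes, a bare wofstream converts through the default locale, which on Windows typically mangles anything outside the 8-bit code page. A sketch of the usual workaround, assuming the platform actually provides a UTF-8 locale under this name (the std::locale constructor throws if it does not):

#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    std::wstring str = L"abcàdëefŸg€hhhhhhhµa";
    std::wofstream f("some file.txt");
    try {
        // Named locales are platform-specific; "en_US.UTF-8" is a common name on Linux.
        f.imbue(std::locale("en_US.UTF-8"));
    } catch (const std::runtime_error&) {
        // Locale not installed; the stream falls back to the default conversion.
    }
    f << str;
}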
Sickly answered 14/8, 2013 at 8:11 Comment(1)
This doesn't work in Windows if the string actually contains non-8-bit charactersVietcong
G
16

std::wstring is for something like UTF-16 or UTF-32, not UTF-8. For UTF-8, you probably just want to use std::string, and write out via std::cout. Just FWIW, C++0x will have Unicode literals, which should help clarify situations like this.
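
A sketch of that approach, assuming the bytes are already valid UTF-8 (for instance read from a UTF-8 encoded file); plain char-based streams pass them through untouched:

#include <fstream>
#include <string>

int main()
{
    // The bytes are assumed to already be UTF-8; no wide types involved.
    std::string utf8 = "abc\xC3\xA0" "def";   // "abcàdef" spelled out as raw UTF-8 bytes
    std::ofstream out("utf8.txt", std::ios::binary);
    out << utf8;   // written byte for byte, no locale conversion
}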

Genisia answered 29/10, 2010 at 16:39 Comment(34)
Unfortunately I very much need wstring for UTF-8. UTF-8 code points can take up several bytes, and I need to be able to manipulate the string.Vierra
In practice it's worth noting that newer versions of MinGW g++ (for Windows) support UTF-8 with a BOM, so that g++ can compile UTF-8 encoded source code that can also be compiled with Visual C++.Auntie
@oystein: what Jerry is telling you is (1) that wstring does not give you UTF-8 encoding, and (2) that string does, if your source code is UTF-8 encoded. Cheers & hth.,Auntie
@alf: Are you saying that storing a UTF-8 string in a std::wstring will mess up the encoding? That is not my experience...Vierra
@oystein: wstring simply isn't UTF-8. You can store UTF-8 in a std::string, but you must be very careful using string methods such as find.Bookplate
@roger: What do you mean it isn't UTF-8? As far as I know it's just a string class implemented with wchar_t; the encoding should not matterVierra
@oystein: uhm, disregard my last comment. With g++ you get UTF-8, with MSVC 10.0 you get translation to Windows ANSI. Thinking about it, it depends on the "execution character set", so no compiler-independent guarantee. :-( And one problem with the C++0x Unicode literals: they don't seem to be supported by MSVC 10.0, although the types are supported. Cheers, & sorry for the disinformation,Auntie
@oystein: wchar_t can't (reasonably) represent UTF-8 — its entire raison d'être is to represent wide characters instead of a multibyte encoding.Bookplate
@alf: I don't think I understand what you are talking about, UTF-8 source files compile fine for me...?Vierra
@roger: I don't see why that would be a problem?Vierra
@oystein: regarding compilation, older versions of g++ choked on a BOM (Byte Order Mark) at the start of an UTF-8 source code file. The snag was/is that Visual C++ required the BOM. Cheers,Auntie
@alf: Ah, ok. I'm using g++ 4.5.1, which seems to handle it without problemsVierra
wstring is a string represented by Unicode code points, which have a constant length (Microsoft thinks in its own way) of 6 bytes, but are usually implemented as 4 or 2. UTF-8 is a multibyte representation which encodes Unicode code points as sequences of 1-6 bytes.Dramshop
No, wstring is just a basic_string<wchar_t>. Nothing more.Vierra
@oystein: yes, but the whole point of UTF-8 is to encode a code point into 8-bit "chunks". wchar_t is specifically intended for dealing with "chunks" that are larger than 8 bits. As such, while you can store UTF-8 into a wchar_t, it's utterly pointless to do so. char is guaranteed to be (at least) 8 bits, which (in turn) guarantees that it will hold UTF-8 data without a problem.Genisia
@jerry: The problem is that many common UTF-8 characters use two (or more) "chunks", and as such create a major headache when assuming that each element (char) in a std::string is a character, which it won't be in that case. Using a wstring, there is more space in each element, and the probability of an element being a whole character increases.Vierra
@oystein: storing utf-8 in a wstring will be exactly identical to storing it in a string, except you'll always be wasting 1 or 3 bytes for every element. The wchar_t's do not magically absorb multi-byte sequences.Sacksen
@Inverse: The number of bytes wasted would depend on the platform, but yes. The advantage of using wstring is that I can more safely assume that each element contains one character, not e.g. half of one.Vierra
@oystein: That's true only if you/your editor actually encodes that character into UTF-16 or UTF-32/UCS-4. Codepoint X converted to UTF-8 will always use the same number of bytes, and they'll always be 8 bits apiece -- storing them into something larger will just waste space. For wchar_t to do any good, you need to use UTF-16 or UTF-32/UCS-4 (depending on what size of wchar_t your compiler supports -- MS => 16 bits, gcc => 32 bits).Genisia
@oystein you can't safely assume that. Microsoft uses UTF-16 for their wide strings. That means only two bytes per unit and up to six per character.Dramshop
@basilevs: That's why I said "more safely" - compared to std::stringVierra
@jerry: Not sure if I'm getting what you're trying to say here, but according to en.wikipedia.org/wiki/UTF-8, UTF-8 is a variable length encoding, which in UTF-8's case means that a character could be 1 byte (8 bits) or more.Vierra
@jerry implies that wchar_t is not supposed to store multibyte encodings. His claim is true but irrelevant as your code doesn't try to do so. You are working with wide strings only, not multibyte ones.Dramshop
@basilevs: I do not get what you are saying, my strings certainly contain multiple bytes :) And UTF-8 is a variable length encoding, which implies that it could be multibyte.Vierra
By multiple I mean codepoint of variable length. Yours are (more or less) of constant length.Dramshop
And you are not using UTF-8 at runtime in your exampleDramshop
@basilevs: No, UTF-8 codepoints are of variable length - am I misunderstanding you completely here? And could you please clarify what you mean by "using UTF-8 at runtime"?Vierra
You are using wide characters at runtime. It is UTF-16 on windows, UCS32 (might be wrong) on Linux. No UTF-8 here. UTF-8 codepoints are of variable length but you are not using it at runtime.Dramshop
Why so much misunderstanding? It seemed clear to me that the question was about working with wchar_t strings in the program, then automatically converting to UTF-8 on output. I remember a similar question from a couple of days ago with a good answer but I can't find it now.Microbiology
@Basilevs: What do you mean when you say that wide characters are UTF-16 on Windows? In VC++ a wchar_t is simply a short integer, IIRC. The fact that Windows uses UTF-16 internally does not affect the encoding of the string I store in my variableVierra
@Mark: Lots of misunderstanding here for sure. I don't really see why a conversion is needed, the string is already encoded as UTF-8, I'm just storing it in a wstring. I'm probably doing something fundamentally wrong.Vierra
@oystein, that's part of the misunderstanding - even if your .cpp is in UTF-8, the string is not. It's Unicode all right, but it's in whatever format your compiler generates for wchar_t which most certainly won't be UTF-8.Microbiology
@Mark: Ah, now we are starting to make sense. Are you sure about this? Got any references? I was told that the encoding would be determined by the document encoding. Anyway, that does not really change anything.Vierra
I was told that the document encoding is left as-is when there is no L prefix.Dramshop
D
6

C++ has means to convert wide characters to a localized narrow encoding on output or file write. Use a codecvt facet for that purpose.

You may use the standard std::codecvt_byname, or a non-standard codecvt implementation.

#include <cwchar>   // std::mbstate_t
#include <iostream> // std::wcout
#include <locale>
using namespace std;

int main()
{
    typedef codecvt_byname<wchar_t, char, mbstate_t> Cvt;
    locale utf8locale(locale(), new Cvt("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}

Beware that on some platforms codecvt_byname can only convert for locales that are installed on the system. I therefore recommend searching Stack Overflow for "utf8 codecvt" and choosing from the many custom codecvt implementations referenced there.

EDIT: As the OP states that the string is already encoded, all he should do is remove the L prefix and the "w" from every token of his code.

Dramshop answered 29/10, 2010 at 17:3 Comment(12)
Actually codecvt might be used to perform any conversion needed, but the most common use, and the one provided by the standard library, is input/output.Dramshop
Yes, but I do not want to convert anything, or am I missing something? The string is already encodedVierra
Then why are you making the compiler convert it to Unicode with the L prefix? Just output it with narrow streams.Dramshop
Encoded means stored in an external encoding. In your case you write in an external encoding. Then the compiler converts your code to Unicode, the internal encoding, and stores that in the object module. Therefore if you want to output something you should perform a backward conversion, or stop making the compiler do the unnecessary conversion.Dramshop
@basilevs: The L prefix does not magically make the compiler convert it to Unicode, it just means that the string is a wchar_t literal. A wide string.Vierra
Well you sure know better. Might as well post the output of the test program to make me blush.Dramshop
@basilevs: I'm not trying to be rude or anything. Storing the string as std::string and outputting it with ofstream obviously works. But that does not solve my problem, which is why I created this question in the first place.Vierra
BTW, here are semiworking implementation of codecvt based on winapi and iconv. They illustrate the problem of codepoint sizes: fakedetector.cvs.sourceforge.net/viewvc/fakedetector/fakebase/… fakedetector.cvs.sourceforge.net/viewvc/fakedetector/fakebase/…Dramshop
My point is that a wide literal IS stored in wide code points as a string constant at compile time. Therefore there is no way (except some dirty Microsoft hacks) to output that constant without some kind of conversion (Windows allows UTF-16 output). Conversion may be done by an explicit function call or by imbuing the needed locale into a wide output stream.Dramshop
God damn that Microsoft! It's making explanations so much harder!Dramshop
@basilevs: Well, I'll make it easy for you: take that constant and throw it into the nearest trash bin - it was just an example :) The point is that I have a string of unknown encoding (probably UTF-8) stored in a wstring.Vierra
As I mentioned in a comment to another answer, that is almost impossible to do. You can't widen an unknown encoding. Widening is a process of making a code point take more space to ease the processing of data. If you can't widen the input, you should work with it in its raw form. std::string or vector<char> are appropriate containers for that. Narrow streams should be used with unknown encodings.Dramshop
O
5

There is a (Windows-specific) solution that should work for you here. Basically, convert the wstring to the UTF-8 code page and then use ofstream.

#include <windows.h>

#include <fstream>
#include <string>

std::string to_utf8(const wchar_t* buffer, int len)
{
        // First call asks how many UTF-8 bytes are needed.
        int nChars = ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                NULL,
                0,
                NULL,
                NULL);
        if (nChars == 0) return "";

        std::string newbuffer;
        newbuffer.resize(nChars);
        // Second call performs the actual conversion into the buffer.
        ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                const_cast<char*>(newbuffer.c_str()),
                nChars,
                NULL,
                NULL);

        return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
        return to_utf8(str.c_str(), (int)str.size());
}

int main()
{
        std::ofstream testFile;

        testFile.open("demo.xml", std::ios::out | std::ios::binary);

        std::wstring text =
                L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                L"<root description=\"this is a naïve example\">\n</root>";

        std::string outtext = to_utf8(text);

        testFile << outtext;

        testFile.close();

        return 0;
}
Overspread answered 29/10, 2010 at 16:39 Comment(5)
That's all nice, but I won't know the encoding of my string, and as such this won't really help. Also I need to be cross-platformVierra
@luke - I did link to that, in the first line of the first version of the response.Overspread
aaaaahhh, I already had the link in my history, so it looked like plain text. Terribly sorry.Autoharp
@Autoharp - np at all; @oystein - I will leave this here for future reference anyway - sorry it's not useful in your scenario.Overspread
The only response which works for me ... ThanksFraise
C
1

Note that wide streams output only char * variables, so maybe you should try using the c_str() member function to convert a std::wstring and then output it to the file. Then it should probably work?

Castlereagh answered 29/10, 2010 at 16:43 Comment(2)
Did not seem to work for me, not with wofstream and not with ofstreamVierra
Aah oops. Sorry for not being helpful.Castlereagh
R
0

I had the same problem some time ago, and wrote down the solution I found on my blog. You might want to check it out to see if it might help, especially the function wstring_to_utf8.

http://pileborg.org/b2e/blog5.php/2010/06/13/unicode-utf-8-and-wchar_t

Rutter answered 29/10, 2010 at 17:8 Comment(3)
Thank you for that, but it's not quite what I'm after, since I do not know what encoding my string will be in. For this example I just picked UTF-8. Also I don't think wchar_t is guaranteed to be able to contain a 4-byte character (UCS-4)? It is on Linux, but I think Windows users will face some problems here.Vierra
The link is now broken.Knurly
That's not how you spell "had".Pharmacopoeia
A
0
#include <iostream>
#include <string>
#include <filesystem>
#include <fstream>
#include <cstdlib>    // std::system


int main() {
    std::wstring fileName = L"./WOut.txt";
    std::filesystem::path filePath = fileName;
    {
        std::wstring wstr = L"abcàdëefŸg€hhhhhhhµa";
        std::ofstream output (filePath);
        // Note: this narrows each wchar_t to a char, so anything outside
        // the 8-bit range is mangled rather than properly encoded.
        output << std::string (wstr.begin (), wstr.end ()) << std::endl;
    }

    std::system ("cat WOut.txt");   // non-portable; relies on a Unix-like shell
    std::string str;
    {
        std::ifstream in (filePath);
        in >> str;
    }

    std::cout << str << std::endl;
    return 0;
}
Apples answered 3/7 at 13:5 Comment(0)
D
-1

You should not use a UTF-8 encoded source file if you want to write portable code. Sorry.

  std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

(I am not sure if this actually violates the standard, but I think it does. Even if it doesn't, to be safe you should not.)

Yes, purely using std::ostream will not work. There are many ways to convert a wstring to UTF-8. My favorite is using the International Components for Unicode (ICU). It's a big library, but it's great. You get a lot of extras and things you might need in the future.

Duffie answered 29/10, 2010 at 17:41 Comment(4)
Sorry, I feel people don't get the point of this question, maybe I'm not clear enough. The problem is not UTF-8. This was just an example I picked. I will probably read the (w)string from a file, it could have any encoding. The problem is writing it back to a file.Vierra
I see. Then you probably just have to make sure to open the file in binary mode.Duffie
@oystein, Wow, I get your problem now. If you don't know the encoding you can't transform code points. If you can't do that, there is no point in wchar_t. The top voted answer is surely right.Dramshop
Probably, see inf.ig.sh's answer. I might end up with that. @basilevs: There is a reason I'm using wchar_t. I want to do lots of heavy manipulation on that string before I write it back, and have to rely on each element of my string being one whole character. That's not going to be the case with std::string as soon as you step outside the english-speaking world. With wide strings, it'll be likely enough that I can live with it.Vierra
R
-1

From my experience of working with different character encodings I would recommend that you only deal with UTF-8 at load and save time. You're in for a world of pain if you try to store the internal representation in UTF-8, since a single character could be anything from 1 byte to 4. So simple operations like strlen require looking at every byte to determine the length rather than using the allocated buffer size (although you can optimize by looking at the first byte in each char sequence, e.g. 00..7f is a single-byte char, c2..df indicates a 2-byte char, etc.).

People quite often refer to 'Unicode strings' when they mean UTF-16, and on Windows a wchar_t is a fixed 2 bytes. On Windows I think wchar_t is simply:

typedef SHORT wchar_t;

The full UTF-32 4-byte representation is rarely required and very wasteful; here is what the Unicode Standard (5.0) has to say about it:

"On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occassional character outside the BMP"

In short, use wchar_t as your internal representation and do conversions when loading and saving (and don't worry about full Unicode unless you know you need it).

With regard to performing the actual conversion have a look at the ICU project:

http://site.icu-project.org/
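
If pulling in ICU is too heavy, here is a minimal sketch of the same load/save-time conversion using C++11's std::wstring_convert (deprecated in C++17 but widely available); it assumes the wstring holds UTF-16 or UTF-32 code units, depending on the platform's wchar_t:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wstring wide = L"abcàdëefŸg€hhhhhhhµa";

    // Converts between the platform's wchar_t encoding and UTF-8.
    // On Windows (16-bit wchar_t), std::codecvt_utf8_utf16<wchar_t> is needed
    // instead if characters outside the BMP (surrogate pairs) must survive.
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;

    std::string utf8 = conv.to_bytes(wide);      // save-time conversion
    std::ofstream("out.txt", std::ios::binary) << utf8;

    std::wstring back = conv.from_bytes(utf8);   // load-time conversion
}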

Rotifer answered 29/10, 2010 at 17:48 Comment(6)
Some sensible words here. I was trying to avoid encodings at all, to be honest, since I really won't know what I'll get thrown at me in this case. That makes doing any conversions difficult. Storing it as a vector<char> (or similar) would mean that I have to make my own string class, and unicode support is really not worth that much coding time. It's starting to look like I'm going to drop unicode support for now, but we'll see.Vierra
(1) It's often more useful to know how many bytes are in a string (for memory allocation, disk space, etc.), than it is to know how many characters are in a string. For this purpose, strlen does work correctly for UTF-8.Jaehne
(2) It's not true that "most OSes consider a wchar_t as fixed 2 bytes" or as UTF-16. That's a Windows thing, done for backwards compatibility with UCS-2-based older versions of NT. On Linux, wchar_t is usually UTF-32. So, for cross-platform code, you either need to use UTF-8 or typedef your own UTF-16 / UTF-32 types. Fortunately, the new C++ standard will have char16_t and char32_t.Jaehne
@Jaehne To be honest I spend most of my time in the Win world so I can't argue about other OSes. The Unicode Standard (5.0) states "On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP". That's my main point. With regard to how useful it is to know character sizes rather than byte sizes... try writing any character processing code without knowing character lengths! UTF-8 is great for portability (no byte ordering issues) but not for working in.Rotifer
I've written a lot of string-handling code that doesn't care about character lengths. Consider for example, a routine to convert DOS-style line breaks to Unix-style ones. It doesn't matter if the 3 bytes "\xE2\x82\xAC" represent a single character; you're just going to output them unchanged. All you care about is '\r' and '\n' which are the same in UTF-8 as they are in ASCII.Jaehne
We'll have to agree to disagree. I work with a commercial product that deals with all sorts of encodings and I couldn't imagine trying to work with it in UTF-8 when trying to sync up character position on screen and character position in the buffer.Rotifer
