The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, `unsigned int`, `wchar_t`. Serializable things are things like `uint32_t` or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things, on the other hand, always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.
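For concreteness, a minimal sketch of the serializable side (the helper name `write_u32_le` is mine, not from any library; it assumes `CHAR_BIT == 8`, a caveat raised in the comments below):

#include <cstdint>
#include <cstdio>

// uint32_t is portable (the same source compiles everywhere), but only the
// explicit little-endian byte layout written here is serializable: the bytes
// on disk are identical regardless of the in-memory representation.
void write_u32_le(std::uint32_t v, std::FILE * f)
{
    unsigned char buf[4] = {
        static_cast<unsigned char>( v        & 0xFF),
        static_cast<unsigned char>((v >>  8) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 24) & 0xFF),
    };
    std::fwrite(buf, 1, sizeof buf, f);
}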
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
1. `wchar_t`, `setlocale()`, `mbsrtowcs()`/`wcsrtombs()`: The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is `main(int, char**)`; you get a type `wchar_t` which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa". (A sketch of that conversion follows this list.)

2. `iconv()` and UTF-8/16/32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception (the platform-dependent WCHAR_T pseudo-encoding, of which more below).
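A minimal sketch of the first group's char-sequence-to-wstring direction, using only the standard facilities (the helper name `widen` is my own; it assumes `setlocale(LC_CTYPE, "")` has already been called, and error handling is minimal):

#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

std::wstring widen(const char * s)
{
    std::mbstate_t state = std::mbstate_t();
    const char * p = s;
    std::size_t len = std::mbsrtowcs(NULL, &p, 0, &state); // first pass: measure
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    if (len == 0)
        return std::wstring();
    std::wstring result(len, L'\0');
    p = s;
    state = std::mbstate_t();
    std::mbsrtowcs(&result[0], &p, len, &state);           // second pass: convert
    return result;
}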
The bridge between the portable, encoding-agnostic world of C, with its portable `wchar_t` character type, and the deterministic outside world is iconv conversion between WCHAR_T and UTF.
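For concreteness, a minimal sketch of that bridge; note that the "WCHAR_T" encoding name is a GNU iconv extension, so its availability is an assumption, and error handling is omitted:

#include <iconv.h>

int main()
{
    // The two conversion descriptors that make up the bridge:
    iconv_t serialize   = iconv_open("UTF-8", "WCHAR_T"); // wchar_t[] -> UTF-8 bytes
    iconv_t deserialize = iconv_open("WCHAR_T", "UTF-8"); // UTF-8 bytes -> wchar_t[]
    // ... feed buffers through iconv(serialize, ...) etc., then:
    iconv_close(serialize);
    iconv_close(deserialize);
}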
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via `wcsrtombs()`, and use `iconv()` for serialization? Conceptually:
                         my program
    <-- wcstombs ---  /==============\  --- iconv(UTF8, WCHAR_T) -->
CRT                   |  wchar_t[]   |                               <Disk>
    --- mbstowcs -->  \==============/  <-- iconv(WCHAR_T, UTF8) ---
                              |
                              +-- iconv(WCHAR_T, UCS-4) --+
                                                          |
... <--- (adv. Unicode malarkey) ----- libicu ------------+
Practically, that means I'd write two boilerplate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc.

int wmain(const std::vector<std::wstring> & args);        // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>  // CommandLineToArgvW needs Shell32.lib at link time
extern "C" int main()
{
    std::setlocale(LC_CTYPE, "");
    int argc;
    wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    return wmain(std::vector<std::wstring>(argv, argv + argc)); // argv's LocalAlloc block is reclaimed at process exit
}
#else
extern "C" int main(int argc, char * argv[])
{
    std::setlocale(LC_CTYPE, "");
    return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <cstdint>
#include <string>
#include <iconv.h>

typedef std::basic_string<std::uint16_t> U16String;
typedef std::basic_string<std::uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */
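Continuing the snippet above (so its includes and typedefs are in scope), a possible implementation sketch of `toUTF32`. The "WCHAR_T" and "UCS-4LE" encoding names are ones GNU iconv happens to know; their availability on other platforms is an assumption, and error handling is elided:

U32String toUTF32(std::wstring s)
{
    if (s.empty()) return U32String();
    iconv_t cd = iconv_open("UCS-4LE", "WCHAR_T");
    char * in = reinterpret_cast<char *>(&s[0]);
    std::size_t inbytes = s.size() * sizeof(wchar_t);
    U32String out(s.size() + 1, 0);  // one UCS-4 unit per wchar_t always suffices
    char * outp = reinterpret_cast<char *>(&out[0]);
    std::size_t outbytes = out.size() * sizeof(std::uint32_t);
    iconv(cd, &in, &inbytes, &outp, &outbytes);
    iconv_close(cd);
    out.resize(out.size() - outbytes / sizeof(std::uint32_t));
    return out;
}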
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Updates
Following many very nice comments I'd like to add a few observations:
- If your application explicitly wants to deal with Unicode text, you should make the `iconv` conversion part of the core and use `uint32_t`/`char32_t` strings internally with UCS-4.

- Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding, and `mbstowcs` is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer drop together with `GetCommandLineW` + `CommandLineToArgvW` works (perhaps there should be a separate wrapper for Windows).

- File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. `char16_t` sequences that do not constitute valid UTF-16, such as naked surrogates, are valid NTFS filenames). The standard C `fopen` is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific `_wfopen` may be required (see the sketch after this list). As a corollary, there is in general no well-defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
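For instance, a minimal Windows-only sketch (the helper name `open_by_wide_name` is mine; `_wfopen` is the MSVC CRT function declared in <stdio.h>):

#include <stdio.h>

// Windows-only: open a file whose name is an arbitrary 16-bit string.
// A name containing, e.g., a naked surrogate is a valid NTFS filename
// but has no char-based spelling, so fopen() cannot name it; _wfopen() can.
FILE * open_by_wide_name(const wchar_t * name)
{
    return _wfopen(name, L"rb"); // NULL on failure, as with fopen
}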
Comments

`assert()` that `setlocale` did not return NULL. (The spec says it returns a string on success and NULL otherwise, but then does not define any actual errors. To me that says to assert that it did not return NULL.) Great question, by the way. – Malignancy

`wmain` should not be `extern "C"` if it takes a `std::vector`. (I do not think you are supposed to pass a C++ class to a function with C linkage.) – Malignancy

… `setlocale` and the conversions and `iconv_open()` -- this is more of a conceptual question. I had thought `wchar_t` was a useless monster for the longest time, but suddenly I feel that it's actually a really good idea... – Barnet

… `wchar_t`s & co. +1, and I would give more if I could. :) – Roobbie

`wchar_t` is broken, and cannot be made to work right. The next version of the C standard has new types `char16_t` and `char32_t` to accommodate systems that insist on using UTF-16 internally. – Backbone

If `__STDC_ISO_10646__` is defined, `wchar_t` values are Unicode codepoints. C1x has `__STDC_UTF_16__` and `__STDC_UTF_32__` for `char16_t` and `char32_t`, respectively; C++0x doesn't seem to have these last two macros. – Backbone

… `wchar_t[]` etc. Could you not have asked me before editing my question? – Barnet

… `uint32_t in; read_from_file((char*)(&in), 4);`. Sure, you could read into a `char[4]` and just use arithmetic, but type punning is often convenient and morally fitting, because the I/O byte stream simply doesn't have a type system, so manual coercion is inevitable. Type-ignorant byte-stream serialization often goes well with explicit type casting. – Barnet

… `int`. Neither will create files that can be transferred between different platforms. – Byte

Use `uint32_t myint = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);` to read from a byte stream with definite endianness. That way you don't need to cast pointers. I guess what I should have said is that serialization requires manual "typing". – Barnet

`CHAR_BIT` is not guaranteed to be 8, i.e. a byte might be larger than 8 bits. – Counterreply

… `read()`/`write()` anyway, i.e. if I cannot predict how much data `read(1)` will read, then I can't really exchange data between such platforms anyway. So I'm willing to put the stop there. (But perhaps you'll agree that pointer-casting would be a portable way to write code that can serialize among platforms of equal, yet undetermined, bit number?) – Barnet

… `char *` APIs accept UTF-8), and filenames are compared case-insensitively. NTFS filenames are 16-bit; I don't believe they do any normalization, but they also compare case-insensitively when interpreted as UTF-16. I have never bothered to find out the exact case-mapping algorithm they each use; I'd probably be horrified. – Sidestroke