The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, `unsigned int`, `wchar_t`. Serializable things are things like `uint32_t` or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things, on the other hand, always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.
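For concreteness, a minimal sketch of the serializable side (the helper name `write_u32_le` is mine, not from any library; it assumes `CHAR_BIT == 8`, a caveat raised in the comments below):

#include <cstdint>
#include <cstdio>

// uint32_t is portable (the same source compiles everywhere), but only the
// explicit little-endian byte layout written here is serializable: the bytes
// on disk are identical regardless of the in-memory representation.
void write_u32_le(std::uint32_t v, std::FILE * f)
{
    unsigned char buf[4] = {
        static_cast<unsigned char>( v        & 0xFF),
        static_cast<unsigned char>((v >>  8) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 24) & 0xFF),
    };
    std::fwrite(buf, 1, sizeof buf, f);
}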
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
1. `wchar_t`, `setlocale()`, `mbsrtowcs()`/`wcsrtombs()`: The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is `main(int, char**)`; you get a type `wchar_t` which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa". (A sketch of that conversion follows this list.)

2. `iconv()` and UTF-8/16/32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception (the platform-dependent WCHAR_T pseudo-encoding, of which more below).
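A minimal sketch of the first group's char-sequence-to-wstring direction, using only the standard facilities (the helper name `widen` is my own; it assumes `setlocale(LC_CTYPE, "")` has already been called, and error handling is minimal):

#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

std::wstring widen(const char * s)
{
    std::mbstate_t state = std::mbstate_t();
    const char * p = s;
    std::size_t len = std::mbsrtowcs(NULL, &p, 0, &state); // first pass: measure
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    if (len == 0)
        return std::wstring();
    std::wstring result(len, L'\0');
    p = s;
    state = std::mbstate_t();
    std::mbsrtowcs(&result[0], &p, len, &state);           // second pass: convert
    return result;
}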
The bridge between the portable, encoding-agnostic world of C, with its portable `wchar_t` character type, and the deterministic outside world is iconv conversion between WCHAR_T and UTF.
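For concreteness, a minimal sketch of that bridge; note that the "WCHAR_T" encoding name is a GNU iconv extension, so its availability is an assumption, and error handling is omitted:

#include <iconv.h>

int main()
{
    // The two conversion descriptors that make up the bridge:
    iconv_t serialize   = iconv_open("UTF-8", "WCHAR_T"); // wchar_t[] -> UTF-8 bytes
    iconv_t deserialize = iconv_open("WCHAR_T", "UTF-8"); // UTF-8 bytes -> wchar_t[]
    // ... feed buffers through iconv(serialize, ...) etc., then:
    iconv_close(serialize);
    iconv_close(deserialize);
}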
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via `wcsrtombs()`, and use `iconv()` for serialization? Conceptually:
                         my program
    <-- wcstombs ---  /==============\  --- iconv(UTF8, WCHAR_T) -->
CRT                   |  wchar_t[]   |                               <Disk>
    --- mbstowcs -->  \==============/  <-- iconv(WCHAR_T, UTF8) ---
                              |
                              +-- iconv(WCHAR_T, UCS-4) --+
                                                          |
... <--- (adv. Unicode malarkey) ----- libicu ------------+
Practically, that means I'd write two boilerplate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc.

int wmain(const std::vector<std::wstring> & args);        // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>  // CommandLineToArgvW needs Shell32.lib at link time
extern "C" int main()
{
    std::setlocale(LC_CTYPE, "");
    int argc;
    wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    return wmain(std::vector<std::wstring>(argv, argv + argc)); // argv's LocalAlloc block is reclaimed at process exit
}
#else
extern "C" int main(int argc, char * argv[])
{
    std::setlocale(LC_CTYPE, "");
    return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <cstdint>
#include <string>
#include <iconv.h>

typedef std::basic_string<std::uint16_t> U16String;
typedef std::basic_string<std::uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */
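Continuing the snippet above (so its includes and typedefs are in scope), a possible implementation sketch of `toUTF32`. The "WCHAR_T" and "UCS-4LE" encoding names are ones GNU iconv happens to know; their availability on other platforms is an assumption, and error handling is elided:

U32String toUTF32(std::wstring s)
{
    if (s.empty()) return U32String();
    iconv_t cd = iconv_open("UCS-4LE", "WCHAR_T");
    char * in = reinterpret_cast<char *>(&s[0]);
    std::size_t inbytes = s.size() * sizeof(wchar_t);
    U32String out(s.size() + 1, 0);  // one UCS-4 unit per wchar_t always suffices
    char * outp = reinterpret_cast<char *>(&out[0]);
    std::size_t outbytes = out.size() * sizeof(std::uint32_t);
    iconv(cd, &in, &inbytes, &outp, &outbytes);
    iconv_close(cd);
    out.resize(out.size() - outbytes / sizeof(std::uint32_t));
    return out;
}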
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Updates
Following many very nice comments I'd like to add a few observations:
- If your application explicitly wants to deal with Unicode text, you should make the `iconv` conversion part of the core and use `uint32_t`/`char32_t` strings internally with UCS-4.

- Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding, and `mbstowcs` is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer drop together with `GetCommandLineW` + `CommandLineToArgvW` works (perhaps there should be a separate wrapper for Windows).

- File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. `char16_t` sequences that do not constitute valid UTF-16, such as naked surrogates, are valid NTFS filenames). The standard C `fopen` is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific `_wfopen` may be required (see the sketch after this list). As a corollary, there is in general no well-defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
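For instance, a minimal Windows-only sketch (the helper name `open_by_wide_name` is mine; `_wfopen` is the MSVC CRT function declared in <stdio.h>):

#include <stdio.h>

// Windows-only: open a file whose name is an arbitrary 16-bit string.
// A name containing, e.g., a naked surrogate is a valid NTFS filename
// but has no char-based spelling, so fopen() cannot name it; _wfopen() can.
FILE * open_by_wide_name(const wchar_t * name)
{
    return _wfopen(name, L"rb"); // NULL on failure, as with fopen
}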
Comments

`assert()` that `setlocale` did not return NULL. (The spec says it returns a string on success and NULL otherwise, but then does not define any actual errors. To me that says to assert that it did not return NULL.) Great question, by the way. – Malignancy

`wmain` should not be `extern "C"` if it takes a `std::vector`. (I do not think you are supposed to pass a C++ class to a function with C linkage.) – Malignancy

… `setlocale` and the conversions and `iconv_open()` -- this is more of a conceptual question. I had thought `wchar_t` was a useless monster for the longest time, but suddenly I feel that it's actually a really good idea... – Barnet

… `wchar_t`s & co. +1, and I would give more if I could. :) – Roobbie

`wchar_t` is broken, and cannot be made to work right. The next version of the C standard has new types `char16_t` and `char32_t` to accommodate systems that insist on using UTF-16 internally. – Backbone

If `__STDC_ISO_10646__` is defined, `wchar_t` values are Unicode codepoints. C1x has `__STDC_UTF_16__` and `__STDC_UTF_32__` for `char16_t` and `char32_t`, respectively; C++0x doesn't seem to have these last two macros. – Backbone

… `wchar_t[]` etc. Could you not have asked me before editing my question? – Barnet

… `uint32_t in; read_from_file((char*)(&in), 4);`. Sure, you could read into a `char[4]` and just use arithmetic, but type punning is often convenient and morally fitting, because the I/O byte stream simply doesn't have a type system, so manual coercion is inevitable. Type-ignorant byte-stream serialization often goes well with explicit type casting. – Barnet

… `int`. Neither will create files that can be transferred between different platforms. – Byte

Use `uint32_t myint = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);` to read from a byte stream with definite endianness. That way you don't need to cast pointers. I guess what I should have said is that serialization requires manual "typing". – Barnet

`CHAR_BIT` is not guaranteed to be 8, i.e. a byte might be larger than 8 bits. – Counterreply

… `read()`/`write()` anyway, i.e. if I cannot predict how much data `read(1)` will read, then I can't really exchange data between such platforms anyway. So I'm willing to put the stop there. (But perhaps you'll agree that pointer-casting would be a portable way to write code that can serialize among platforms of equal, yet undetermined, bit number?) – Barnet

… `char *` APIs accept UTF-8), and filenames are compared case-insensitively. NTFS filenames are 16-bit; I don't believe they do any normalization, but they also compare case-insensitively when interpreted as UTF-16. I have never bothered to find out the exact case-mapping algorithm they each use; I'd probably be horrified. – Sidestroke