Using char16_t and char32_t in I/O

C++11 introduces char16_t and char32_t to facilitate working with UTF-16- and UTF-32-encoded text strings. But the <iostream> library still only supports the implementation-defined wchar_t for wide-character I/O.

Why has support for char16_t and char32_t not been added to the <iostream> library to complement the wchar_t support?

Triarchy asked 17/11, 2011 at 14:43. Comments (12):
Have you tried std::basic_iostream<char32_t>? Just because there's no predefined types (like std::iostream for char) doesn't mean there is no support.Sentimentalism
I've just tested basic_istringstream<char16_t> in GCC version 4.7.0. It compiles, but crashes during execution. This, of course, does not prove that support could be present in another environment, but I still find it strange that the standardization committee did not include support on an equal footing with wchar_t.Triarchy
I mean, "... does not disprove that ...".Triarchy
basic_istringstream<char16_t> and <char32_t> should work fine. If they don't in GCC, then it's just a bug or they haven't gotten to it yet.Alginate
@Alginate : The standard doesn't require support beyond char and wchar_t -- all other character types are strictly implementation-defined, so not supporting them isn't necessarily a "bug".Scandent
@Mooing : §27.2.2/2 says otherwise. This is specific to streams, not char_traits or containers.Scandent
@ClausTøndering: basic_istringstream (and similar) all default the second argument to std::char_traits<char>. You'll have to give it both template arguments.Siana
@ildjarn: Well I'll be... That's bizarre. It clearly states char, wchar_t, and any other implementation-defined character types...Siana
@Scandent I read §27.2.2/2 as saying not that support beyond char and wchar_t is implementation defined, but instead that if there are other character types that satisfy the requirements for a character on which any of the iostream components can be instantiated, then those types are supported. char16_t and char32_t seem to fit that or at least I don't see any requirements they don't fulfill for iostreams. I would be curious to find out why those types aren't listed explicitly in §27.2.2/2 though. Just an oversight?Alginate
@Alginate : It's ambiguous certainly, and I read it just the opposite way -- that support for any character types beyond char and wchar_t is implementation-defined. Also, I'm not sure that streams could be expected to work directly with char16_t in particular, because that data type implies the possibility of multi-byte character sequences (surrogate pairs in this case), and I'm not aware that streams can use multi-byte sequences without a non-default facet. That said, std iostreams are certainly not my area of expertise.Scandent
@Scandent The standard does specify that codecvt<char16_t,char,mbstate_t> does UTF-16 (§ 22.3.1.1.1, Table 81) at least. There is a footnote in § 22.4.1.4.2/3: "Informally, this means that basic_filebuf assumes that the mappings from internal to external characters is 1 to N: a codecvt facet that is used by basic_filebuf must be able to translate characters one internal character at a time." I think that requirement can be met by using a shift state, and there's a note right above it that seems to explicitly support that. Anyway, I'm still working on becoming an expert myself :)Alginate
basic_istringstream<char16_t> compiles with errors under gcc 4.6.2Tympanites
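
For reference on the template arguments discussed above: the standard declares the stream class templates with the traits parameter defaulting to std::char_traits<charT>, so naming only the character type is enough to form the type. The sketch below checks this at compile time. The alias name u16_istringstream is made up for illustration, and the check says nothing about whether a given library can actually construct or use such a stream at run time, which is where the failures reported in this thread occur.

```cpp
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical alias for illustration; only the character type is given,
// so the traits argument takes its default.
using u16_istringstream = std::basic_istringstream<char16_t>;

// The default traits type follows the character type.
static_assert(std::is_same<u16_istringstream::traits_type,
                           std::char_traits<char16_t>>::value,
              "traits default to char_traits of the stream's character type");

// The stream's character type is the one we asked for.
static_assert(std::is_same<u16_istringstream::char_type, char16_t>::value,
              "char_type is char16_t");
```

Both assertions only inspect types, so no stream object is created and no locale facet is looked up.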

In the proposal Minimal Unicode support for the standard library (revision 2) it is indicated that there was only support among the Library Working Group for supporting the new character types in strings and codecvt facets. Apparently the majority was opposed to supporting iostream, fstream, facets other than codecvt, and regex.

According to minutes from the Portland meeting in 2006, "the LWG is committed to full support of Unicode, but does not intend to duplicate the library with Unicode character variants of existing library facilities." I haven't found any details, but I would guess that the committee feels that the current library interface is inappropriate for Unicode. One possible complaint could be that it was designed with fixed-size characters in mind, which Unicode makes obsolete: while Unicode data can use fixed-size code points, it does not limit characters to single code points.
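
To make the fixed-size point concrete, here is a small sketch using the C++11 string types. The function names are made up for illustration. It shows that one code point can need two UTF-16 code units, and that one user-perceived character can consist of two code points even in UTF-32.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 encodes it
// as a surrogate pair (two char16_t code units).
inline std::u16string emoji_utf16() { return u"\U0001F600"; }

// The same code point is a single char32_t code unit in UTF-32.
inline std::u32string emoji_utf32() { return U"\U0001F600"; }

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character,
// but stored as two code points even in UTF-32.
inline std::u32string e_acute_decomposed() { return U"e\u0301"; }
```

Calling size() on these returns 2, 1, and 2 respectively, because size() counts code units, not characters.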

Personally I think there's no reason not to standardize the minimal support that's already provided on various platforms (Windows uses UTF-16 for wchar_t, most Unix platforms use UTF-32). More advanced Unicode support will require new library facilities, but supporting char16_t and char32_t in iostreams and facets won't get in the way and would enable basic Unicode I/O.

Alginate answered 17/11, 2011 at 16:20. Comments (5):
@Alginate there is no <codecvt> in the libstdc++ source tree: gcc.gnu.org/git/?p=gcc.git;a=tree;f=libstdc%2B%2B-v3/include/…Bowfin
@Bowfin yeah, libstdc++ doesn't have it yet. As far as I know only libc++ and Dinkumware have it.Alginate
But note Dinkumware does not mean MSVC... because last I checked, they didn't have any charNN_t support.Bowfin
@Bowfin I know that MSVC provided the most minimalistic possible support for charX_t types since at least 2010 (defining char16_t and char32_t as typedefs of unsigned short and unsigned int), but that didn't work properly everywhere. It's at least semi-functional, though, which is useful when trying to port code back to older versions.Lisa
On the plus side, at least they outright admitted that they didn't provide any actual support for the types. On the minus side, not documenting the typedefs likely led people to use wchar_t where they didn't actually need to, and it'd be a miracle if it didn't force people to rewrite code that might possibly have functioned as is.Lisa
