C++ When are characters widened in output stream operator<<()?

It seems to me that there is an inconsistency in the C++ standard, specifically in §30.7.5.2.4 of the C++17 draft (N4659), about when characters are widened in formatted output operations on output streams (operator<<()). Exactly the same inconsistency seems to be reflected on en.cppreference.com.

First, assume the following declarations:

std::ostream out;
std::wostream wout;
char ch;
wchar_t wch;
const char* str;
const wchar_t* wstr;

It is then stated that

  1. out << ch does not perform character widening,
  2. out << str performs character widening,
  3. wout << ch performs character widening,
  4. wout << str performs character widening,
  5. wout << wch does not perform character widening,
  6. wout << wstr performs character widening.

The first and most obvious inconsistency is that (6) cannot be true, as there is no widen() function taking a wchar_t argument, only one that takes a char argument.

The second (seeming) inconsistency is between (1) and (2). It seems strange to me that out << "x" should widen 'x', while out << 'x' should not.
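To make the six cases concrete, here is a minimal sketch using string streams in place of the bare declarations above. The helper names (narrow_insert, wide_insert, check_six_cases) are hypothetical and only serve the illustration. The sketch assumes the default "C" locale and characters from the basic source character set. It shows only the observable result of each insertion, not whether an implementation actually calls widen() internally.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: cases (1) and (2), narrow operands into a narrow stream.
std::string narrow_insert(char ch, const char* str) {
    std::ostringstream out;
    out << ch << str;
    return out.str();
}

// Hypothetical helper: cases (3) to (6), narrow and wide operands into a wide stream.
std::wstring wide_insert(char ch, const char* str,
                         wchar_t wch, const wchar_t* wstr) {
    std::wostringstream wout;
    wout << ch << str;    // (3), (4): char operands must become wchar_t
    wout << wch << wstr;  // (5), (6): operands already have the stream's char type
    return wout.str();
}

// Small self-check tying the helpers to the numbered cases.
void check_six_cases() {
    assert(narrow_insert('x', "yz") == "xyz");
    assert(wide_insert('a', "bc", L'd', L"ef") == L"abcdef");
}
```

Calling check_six_cases() from main() should pass on a conforming implementation in the default locale.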

Am I misinterpreting the standard text, or is there something wrong there? If the latter is true, do you know what the intended behavior is?

EDIT: Apparently this inconsistency (if I am right) has been present in the standard since at least C++03 (§27.6.2.5.4). The text changes a bit through the intermediate standards, but the inconsistency, as I explain it above, remains.

Downer answered 5/6, 2017 at 20:41 Comment(3)
This should be an LWG issue.Ripe
"... If c has type char and the character type of the stream is not char , then seq consists of out.widen(c) ; otherwise seq consists of c . ..." Sorry, my English (and understanding in general, why not) is not so good; Could you indicate one of the sentence where you find inconsistencies?Subsidize
@Subsidize The inconsistency is not really within any single sentence of the standard. As I describe above, (1) and (2) seem to be in conflict with each other, and Dietmar confirms that (1) is correct and (2) is wrong (it follows incorrectly from bad wording).Downer

It looks as if the standard isn't entirely correct. Most of the issue stems from the bulk specification of the respective operations: instead of handling each overload individually, similar overloads are described together, resulting in a misleading specification.

I doubt any implementer has any trouble understanding what is intended, though. Essentially, when a char is inserted into a non-char stream, the character needs to be widen()ed to obtain the corresponding character of the stream's character type. This widening is intended to map one character from the source character set to exactly one character in the stream's wide character set.
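The intended behavior described above can be sketched as follows: inserting a char into a wide stream should give the same result as widening it explicitly with the stream's own widen() member and inserting the resulting wchar_t. The helper names are hypothetical, and the sketch assumes the default "C" locale and a character from the basic source character set.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Insert a char directly; the stream is expected to widen it internally.
std::wstring insert_char(char c) {
    std::wostringstream wout;
    wout << c;
    return wout.str();
}

// Widen explicitly via basic_ios::widen(), then insert the wchar_t.
std::wstring insert_widened(char c) {
    std::wostringstream wout;
    wout << wout.widen(c);
    return wout.str();
}

// Both paths should agree, and one char yields exactly one wchar_t.
void check_widen_equivalence() {
    assert(insert_char('x') == insert_widened('x'));
    assert(insert_char('x').size() == 1);
}
```

Note that the result always has length one: widen() has no way to expand a single char into several stream characters.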

Note that the IOStreams specification assumes the original notion of characters in streams being individual entities. Since the specification was created (for C++98), the text hasn't really been updated substantially, but with the wide use of Unicode the "characters" in a stream are really bytes of an encoding. Although the streams mostly function OK in this modified environment, some flexibility which would be helpful for dealing with Unicode characters isn't really properly supported. The absence of something "widening" one character into a sequence of UTF-8 bytes is probably one of these.

If you feel the inconsistency/incorrectness in the streams section warrants addressing, file a defect report. Instructions on filing defect reports are at http://isocpp.org. When you do raise an issue, consider providing proposed wording to correct it. Since there is no lack of clarity about what is actually intended, and most implementations probably do the right thing anyway, I'd expect this issue to get fairly low priority, and without proposed wording it is unlikely to receive much attention. Of course, addressing the issue won't change the intended behavior, e.g., to "widen" chars into a UTF-8 sequence: that would effectively be a redesign of the streams library, which may be in order but won't be done as part of a defect resolution.

Restrained answered 5/6, 2017 at 22:32 Comment(12)
There are plenty of occasions where the standard specifies unconditional widen calls, though, without any indication that it may be skipped for same-character-type cases.Ripe
@T.C.: sure. If you feel it is something which should be fixed, raise a defect. If the specification ends up mandating calling widen() in all cases it would have the neat effect that hopefully all implementers should realize that the result of widen() needs to be cached so the virtual function is called just once.Super
Indeed--I think it's safe to say the committee would likely look favorably upon at least the idea of completely overhauling iostreams--with the proviso that doing so is an extremely non-trivial undertaking, so any such proposal would be subjected to a lot of scrutiny, and there are a lot of unwritten, poorly known, and most likely conflicting goals it would need to address to have any hope for success.Dictum
How do you imagine "widen"ing a char into a UTF-8 character sequence given that the type char already means UTF-8 code unit? (except in pre-Unicode locales, but we're talking about future..)Kall
@DietmarKühl If widening char -> wchar_t maps characters in the basic source character set from the source character encoding, to the locale specific wide character encoding, isn't a "widening" char -> char needed to map those same characters from the source character encoding to the locale specific non-wide (multi-byte) character encoding? Or, are the source and the locale specific non-wide character encodings assumed to be identical for all characters in the basic source character set?Downer
@KristianSpangsege: widen() is specifically intended to deal with the fact that the correct mapping of basic characters into the wide character set is not known and may depend on the language setting. It was bolted onto the streams behavior when it was thought that parameterizing the character type of streams was a Good Idea. At that time Unicode "guaranteed" that each character would consist of one 16-bit unit. The design does not consider the possibility of mapping a char into anything at all, and certainly not into a sequence of entities. So for char the identity is kind of assumed.Super
@KristianSpangsege: since then the environment has moved on in a direction which with hindsight was obvious: Unicode could not possibly keep its promises! At the time nobody wanted to see that. ... and since then IOStreams were not updated, partly because it is something non-trivial and nobody with the relevant experience created something properly covering the space. Simplistic approaches end up possibly doing one thing right but not covering the entire space people expect and/or require from a redesigned system (and often are just worse beyond naive use).Super
@DietmarKühl Right, but my understanding is that out << for char and const char* only has well-defined semantics when used with characters from the basic source character set. And those are guaranteed to be represented in one byte of the multi-byte encoding of any conforming locale. The question that remains for me is whether there is also an implied guarantee that these characters are identically represented in the multi-byte encoding of all conforming locales. Only in that case does it make sense to me that char -> char is allowed to skip the use of std::ctype<char>::widen().Downer
@DietmarKühl ... since it would imply that std::ctype<char>::widen() is an identity function. Do you see my point?Downer
@KristianSpangsege: the widen() isn't about code conversion and multi-byte sequences don't enter the picture! Widen always produces one character for each input character. Code conversion is done on the stream buffer level (using std::codecvt).Super
@DietmarKühl Ok, I get that, but since widen() doesn't enter the picture, no conversion takes place. Doesn't that only make sense under the assumption that all conforming locales must agree on representation of the characters from the basic source character set within the multi-byte character encoding (one byte each in the case of basic source characters)?Downer
@DietmarKühl Otherwise I will not be able to build my program in a way that provides interoperability with all locales at once (still assuming I'm only using characters from the basic source character set), right?Downer
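The distinction drawn in the comments, that widen() is a one-to-one mapping provided by the locale's std::ctype facet rather than a code conversion, can be sketched like this. The helper names are hypothetical. The sketch uses the classic "C" locale, and it assumes (as mainstream implementations do in practice) that std::ctype<char>::widen() is the identity for basic source characters.

```cpp
#include <cassert>
#include <locale>

// One char in, one wchar_t out, via the locale's ctype<wchar_t> facet.
wchar_t widen_to_wide(char c,
                      const std::locale& loc = std::locale::classic()) {
    return std::use_facet<std::ctype<wchar_t>>(loc).widen(c);
}

// The char-to-char counterpart, via the ctype<char> facet.
char widen_to_narrow(char c,
                     const std::locale& loc = std::locale::classic()) {
    return std::use_facet<std::ctype<char>>(loc).widen(c);
}

// In the classic locale, basic source characters map to themselves.
void check_facet_widen() {
    assert(widen_to_wide('A') == L'A');
    assert(widen_to_narrow('A') == 'A');
}
```

Neither call involves std::codecvt; any conversion to a multi-byte external encoding would happen later, in the stream buffer.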
