C++23: char now supports Unicode?
Does C++23 now provide support for Unicode characters in its basic char type, and to what degree?


On cppreference's page for character literals, the content of a character literal:

'c-char'

is defined as either:

  • a basic-c-char
  • an escape sequence, as defined in escape sequences
  • a universal character name, as defined in escape sequences

and basic-c-char is in turn defined as:

A character from the basic source character set (until C++23) translation character set (since C++23), except the single-quote ', backslash \, or new-line character

cppreference's page for character sets then defines the "translation character set" as consisting of the following:

  • each abstract character assigned a code point in the Unicode codespace, and (since C++23)
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

and states:

The translation character set is a superset of the basic character set and the basic literal character set (see below).

It seems to me that the "basic character set" (defined on that same page) is essentially a subset of ASCII. I had also always thought of char as holding ASCII (with support for ISO-8859 character sets, as described on Microsoft's page on the character types). But now, with basic-c-char referring to the translation character set, it seems char literals support Unicode to some fuller extent.

I'm aware that the actual encoding is implementation-defined (apart from the null character and the contiguous, increasing decimal digit characters, it seems). But my main question is: what characters are really supported by this "translation character set"? Is it all of Unicode? I feel as though I'm reading more into this than is actually the case.
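The two portable guarantees mentioned above can be checked directly. A minimal sketch (the helper name is my own):

```cpp
// The standard guarantees that '\0' has the value zero and that the decimal
// digit characters '0'..'9' are contiguous and increasing in every encoding,
// so digit arithmetic on char is portable even though the encoding isn't.
int digit_value(char c) { return c - '0'; }
```

No such ordering guarantee exists for the letters (EBCDIC, for instance, has gaps in the alphabet).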

Crescendo answered 6/9, 2023 at 20:46 Comment(3)
A bunch of the weird phrasing in the standard is basically saying "We want C++ implementations to support Unicode, but we don't want to declare any existing code as nonstandard just because it or its platform is non-Unicode-aware."Siusan
What do you mean by "supports Unicode"? If you phrase it more precisely, you can probably answer it yourself. In short: keep the data as an opaque string in a predefined format (e.g. UTF-8). Do the conversion to and from that format on input and output (nobody can guess the expected encoding on input and output, so neither can the C++ standard). And for processing you need good Unicode libraries (a single code point is not a good unit for handling Unicode strings).Ferraro
utf8everywhere.orgSiusan

Effectively, not much changed, apart from two important differences:

Before C++23 the first translation phase defined that any character in the source file that isn't an element of the basic source character set (which is a subset of the ASCII character set) was to be mapped to a universal-character-name, i.e. it would be replaced by a sequence of the form \UXXXXXXXX where XXXXXXXX is the number of the ISO/IEC 10646 (equivalently Unicode) code point for the character.

Then, when writing a character literal 'X', where X is a character not in the basic source character set, you would get '\UXXXXXXXX' after the first translation phase, and the c-char → universal-character-name grammar production applied.
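For instance, the two spellings below denote the same character U+00E9 (é), so they produce identical literals. This sketch uses u (UTF-16) literals because their encoding is fixed, unlike the implementation-defined ordinary literal encoding; the function names are my own:

```cpp
#include <string>

// The directly written é is mapped to the same universal-character-name in
// phase 1 (pre-C++23), or the same translation character set element (C++23),
// as the explicit \u00E9 spelling.
std::u16string via_ucn()    { return u"\u00E9"; }
std::u16string via_source() { return u"é"; }  // needs a source encoding that can express é
```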

So you could always write non-ASCII characters in a character literal, assuming the source encoding permitted writing such a character. Which source file encodings, and which source characters outside the basic source character set, were supported was implementation-defined (the source character set and its encoding). Regardless of the source character set, you could already write any Unicode scalar value directly into a character literal with a universal-character-name.

How such a character literal then behaves is a different question, because the encoding used to determine the value of the char from the universal-character-name (or any character of the basic source character set) is implementation-defined as well (the execution character set encoding in C++20, or the ordinary literal encoding in C++23). Obviously, if char is 8 bits wide, it can't represent all Unicode scalar values. If the character was not representable in char, the behavior was implementation-defined.

C++23 makes two changes. First, support for UTF-8 source file encoding became mandatory, implying support for all Unicode scalar values in the source file (although other encodings can of course also be supported). Second, the first translation phase was changed: instead of rewriting everything to the basic source character set via universal-character-names, source characters are now mapped to a sequence of translation character set elements, which is essentially a sequence of Unicode scalar values. Unicode code points that are not Unicode scalar values, i.e. the surrogate code points, are not elements of the translation character set (and cannot be produced by decoding any source file).

Therefore, in C++23 when getting to the translation phase where the character literal's value is determined, a single Unicode scalar value in the source file matches the basic-c-char grammar as you showed in your question.

The value of the character literal is still determined as before by implementation-defined encoding. However, in contrast to C++20, the literal is now ill-formed if the character is not representable in char via this encoding.

So the two differences are: UTF-8 source file encoding must now be supported, and a single source character (meaning a single Unicode scalar value) in a character literal that is not representable in the implementation-defined ordinary literal encoding now makes the literal ill-formed, instead of giving it an implementation-defined value.
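A sketch of that second difference, assuming a UTF-8 ordinary literal encoding (the common default on GCC and Clang, but implementation-defined):

```cpp
// U+00E9 needs two UTF-8 code units, so it cannot fit in one char:
//
//     char c = '\u00E9';  // C++20: implementation-defined value
//                         // C++23: ill-formed
//
// A character that fits in a single code unit behaves the same in both:
char single_code_unit() { return '*'; }  // U+002A, one code unit in common encodings
```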


Analogously to the above, string literals (rather than character literals) haven't really changed either. The encoding is still the same implementation-defined ordinary literal encoding; primarily only the internal representation during the translation phases changed. In the same way as for character literals, with C++23 the literal becomes ill-formed if a character (i.e. a translation character set element, a Unicode scalar value) is not representable in the ordinary literal encoding. However, that encoding may be e.g. UTF-8, in which case a single Unicode scalar value in the source file maps to multiple char code units in the encoded string, as has always been the case.
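A sketch of the multi-code-unit case, again assuming the implementation-defined ordinary literal encoding is UTF-8 (the function name is my own):

```cpp
#include <cstddef>
#include <string>

// One Unicode scalar value, U+00E9 (é), becomes two char code units
// (0xC3 0xA9) in an ordinary string literal under a UTF-8 ordinary
// literal encoding.
std::size_t code_units_in_e_acute() {
    return std::string("\u00E9").size();
}
```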

Dionne answered 6/9, 2023 at 22:0 Comment(0)

what characters are really supported by this "translation character set"?

As you already quoted (I'll quote from latest C++ standard draft):

[lex.charset]

The translation character set consists of the following elements:

  • each abstract character assigned a code point in the Unicode codespace, and
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

Let's look up definitions for the terms used in the rule (quote from Unicode 14):

For the first point:

Characters and Encoding

Abstract character: A unit of information used for the organization, control, or representation of textual data.

  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
  • An abstract character has no concrete form and should not be confused with a glyph.
  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
  • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences

For the second point:

Unicode Encoding Forms

Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

  • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

The C++ standard also has a clarifying note:

[Note 1: Unicode code points are integers in the range [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the range [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point that is not a surrogate code point. — end note]
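The note's definitions translate directly into code. A minimal sketch (the function names are my own):

```cpp
#include <cstdint>

// Code points occupy [0, 0x10FFFF]; surrogates occupy [0xD800, 0xDFFF];
// a scalar value is any code point that is not a surrogate.
bool is_code_point(std::uint32_t v)   { return v <= 0x10FFFF; }
bool is_surrogate(std::uint32_t v)    { return 0xD800 <= v && v <= 0xDFFF; }
bool is_scalar_value(std::uint32_t v) { return is_code_point(v) && !is_surrogate(v); }
```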


Is it all of Unicode?

TL;DR: No. For example, surrogate code points are excluded, and abstract characters that are only representable by combining character sequences are not in the translation character set as single characters.

Furthermore, this is an important rule from the C++ standard:

A character-literal with a c-char-sequence consisting of a single basic-c-char, simple-escape-sequence, or universal-character-name is the code unit value of the specified character as encoded in the literal's associated character encoding. If the specified character lacks representation in the literal's associated character encoding or if it cannot be encoded as a single code unit, then the program is ill-formed.

If your system has an 8 bit char, then it will not be able to represent all 10FFFF code points of the Unicode codespace.
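A quick sanity check of that size argument (the constant names are my own):

```cpp
#include <climits>

// An N-bit char has 2^N distinct values, while the Unicode codespace has
// 0x110000 code points, so an 8-bit (or even 16-bit) char cannot hold one
// code point per value; multi-code-unit encodings such as UTF-8 are needed.
constexpr unsigned long long distinct_char_values = 1ULL << CHAR_BIT;
constexpr unsigned long long unicode_code_points  = 0x110000ULL;
```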


P.S. Unicode in char literals has never been disallowed by the C++ standard; this change just makes Unicode support mandatory.

Craig answered 6/9, 2023 at 21:13 Comment(3)
Isn't lex.charset the set of characters you can write C++ code in, not the set of characters the C++ libraries handle?Awake
@Yakk-AdamNevraumont Yes. This question is about character literals as far as I can tell. Libraries can handle any character set they want to.Craig
"If your system has an 8 bit char, then it will not be able to represent all 10FFFF code points of the Unicode codespace" A single 8-bit char literal obviously can't do it today, and C++23 is not going to magically give it this ability. A string literal however is potentially able to represent all of Unicode, and C++23 is not going to take this ability away (the question doesn't mention string literals, but a good answer IMHO should).Randall
