Effectively not much changed (with two important differences):
Before C++23 the first translation phase defined that any character in the source file that isn't an element of the basic source character set (which is a subset of the ASCII character set) was to be mapped to a universal-character-name, i.e. it would be replaced by a sequence of the form \UXXXXXXXX, where XXXXXXXX is the number of the ISO/IEC 10646 (equivalently Unicode) code point for the character.
Then, when writing a character literal 'X' where X is replaced with a character that is not in the basic source character set, you would get '\UXXXXXXXX' after the first translation phase, and then the c-char -> universal-character-name grammar applied.
So you could always write non-ASCII characters in a character literal, assuming the source encoding permitted writing such a character. The source file encoding and the set of supported source characters outside the basic source character set (the source character set) were implementation-defined. Regardless of the source character set, you could already write any Unicode scalar value directly into a character literal with a universal-character-name.
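For example, before C++23 the following two literals were equivalent after the first translation phase (a sketch; it assumes the source encoding can represent é, i.e. U+00E9, and that the execution character set can represent it in a single char, e.g. Latin-1):

```cpp
// Pre-C++23 model: in translation phase 1 the non-basic source character é
// is rewritten to its universal-character-name, so these two literals are
// identical by the time the literal's value is determined.
char a = 'é';            // rewritten to '\U000000E9' in phase 1
char b = '\U000000E9';   // can always be written directly, regardless of
                         // the source encoding
```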
How such a character literal then behaves is a different question, because the encoding used to determine the value of the char from the universal-character-name (or from any character of the basic source character set) is implementation-defined as well (the execution character set encoding in C++20, the ordinary literal encoding in C++23). Obviously, if char is 8 bits wide it can't represent all Unicode scalar values. If the character was not representable in char, then the value of the literal was implementation-defined.
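To illustrate (a sketch, not tied to any particular compiler; assume char is 8 bits wide and the implementation-defined execution/ordinary literal encoding is Latin-1):

```cpp
// C++20 behaviour under a hypothetical Latin-1 ordinary literal encoding:
char a = 'é';   // U+00E9 is representable in Latin-1 (0xE9), so a == '\xe9'
char b = '€';   // U+20AC is not representable in Latin-1, so the value of
                // this literal is implementation-defined in C++20
```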
The changes for C++23 are that support for UTF-8 source encoding became mandatory, which implies support for all Unicode scalar values in the source file (although other encodings can of course still be supported as well), and that the first phase was changed: instead of rewriting everything to the basic source character set via universal-character-names, the source characters are now mapped to a sequence of translation character set elements, which is essentially a sequence of Unicode scalar values. Unicode code points that are not Unicode scalar values, i.e. surrogate code points, aren't elements of the translation character set (and can't be produced by decoding any source file).
Therefore, in C++23, by the time we reach the translation phase where the character literal's value is determined, a single Unicode scalar value in the source file matches the basic-c-char grammar, as you showed in your question.
The value of the character literal is still determined as before by an implementation-defined encoding. However, in contrast to C++20, the literal is now ill-formed if the character is not representable in char via this encoding.
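The same two literals under C++23 (again assuming a Latin-1 ordinary literal encoding, purely as an example):

```cpp
char a = 'é';   // still OK: U+00E9 is representable in Latin-1
char b = '€';   // ill-formed in C++23: U+20AC is not representable in the
                // ordinary literal encoding (in C++20 this literal had an
                // implementation-defined value instead)
```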
So the two differences are that UTF-8 source file encoding must be supported and that a single source character (meaning a single Unicode scalar value) in the character literal that is not representable in the implementation-defined ordinary literal encoding will now cause the literal to be ill-formed instead of having an implementation-defined value.
Analogously to the above, string literals (rather than character literals) haven't really changed either. The encoding is still implementation-defined using the same ordinary literal encoding, and primarily only the internal representation in the translation phases changed. In the same way as for character literals, with C++23 the literal becomes ill-formed if a character (i.e. a translation character set element, or Unicode scalar value) is not representable in the ordinary literal encoding. However, that encoding may be e.g. UTF-8, so that a single Unicode scalar value in the source file may map to multiple char in the encoded string, as has always been the case.
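For example, assuming the ordinary literal encoding is UTF-8 (which it is by default on many current compilers, but this is still implementation-defined):

```cpp
#include <cassert>
#include <cstring>

int main() {
    const char* s = "é";           // one Unicode scalar value (U+00E9) in the source...
    assert(std::strlen(s) == 2);   // ...but two chars, 0xC3 0xA9, under UTF-8
}
```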