I find this in the new C++ Standard:
2.11 Identifiers [lex.name]

    identifier:
        identifier-nondigit
        identifier identifier-nondigit
        identifier digit
    identifier-nondigit:
        nondigit
        universal-character-name
        other implementation-defined characters
with the additional text
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. [...]
I cannot quite comprehend what this means. From the old standard I am used to a "universal character name" being written as \u89ab, for example. But using those in an identifier...? Really?
Is the new standard more open with respect to Unicode? And I am not referring to the new string literal types like U"Hello \u89ab thing"; I think I understood those. But:
- Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16 or any (however-defined) codepage?
- Can I write an identifier with \u1234 in it, e.g. myfu\u1234ntion (for whatever purpose)?
- Or can I use the "character names" that Unicode defines, like in ICU, e.g.

      const auto x = U"German Braunb\U{LATIN SMALL LETTER A WITH DIAERESIS}r.";

  or even in an identifier in the source itself? That would be a treat... cough...
I think the answer to all these questions is no, but I cannot map this reliably to the wording in the standard... :-)
I found "2.2 Phases of translation [lex.phases]", Phase 1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
By reading this, I now think that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (by meta-information or user configuration). In Phase 1 it translates this into an ASCII form (the "basic source character set") in which the Unicode characters are replaced by their \uNNNN notation (or the compiler can choose to keep working in its own Unicode representation, but then has to make sure it handles an explicitly written \uNNNN the same way).
What do you think?