Why don't these Unicode variable names work with -fextended-identifiers? «, » and ≠ [duplicate]
I heard that GCC's -fextended-identifiers flag makes it possible to use Unicode variable names. So I wrote a test program in C++, but it does not compile.

#include <iostream>
#include <string>

#define ¬ !
#define ≠ !=
#define « <<
#define » >>

/* uniq: remove duplicate lines from stdin */
int main() {
    std::string s;
    std::string t = "";
    while (cin » s) {
        if (s ≠ t)
            cout « s;
        t = s;
    }
    return 0;
}

I get these errors:

g++ -fextended-identifiers -g3 -o a main.cpp
main.cpp:10:3: error: stray ‘\342’ in program
   if (s ≠ t)
   ^
main.cpp:10:3: error: stray ‘\211’ in program
main.cpp:10:3: error: stray ‘\240’ in program
main.cpp:11:4: error: stray ‘\302’ in program
    cout « s;
    ^
main.cpp:11:4: error: stray ‘\253’ in program

What is going on? Aren't these macro names supposed to work with -fextended-identifiers?

Lotte answered 26/9, 2015 at 16:2 Comment(9)
Does your version of g++ not include the "error: macro names must be identifiers" diagnostic before it gets to the "stray '\...' in program" errors? - Corrasion
"Ranges of characters disallowed initially [charname.disallowed] 0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F" - Gelatinate
@Lightness: not a dupe, that question asked about usage of characters from the basic set, these characters are definitely extended. - Gelatinate
universal-character-names are sequences of \uXXXX and \UXXXXXXXX. These would be allowed by -fextended-identifiers. But what you gave are the characters directly. - Cussed
@JohannesSchaub-litb: Would be allowed, provided that they meet the requirements of E.1 and E.2. - Gelatinate
@JohannesSchaub-litb: Giving the characters directly is not prohibited. "Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character." (section 2.2) - Gelatinate
@BenVoigt true, but the GCC option he uses does not promise to accept the corresponding direct encoding. - Cussed
@JohannesSchaub-litb: Indeed, it is intended to but not yet implemented: gcc.gnu.org/wiki/FAQ#utf8_identifiers - Gelatinate
@Lightness: found the real duplicate - Gelatinate
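
To illustrate the universal-character-name spelling mentioned in the comments, here is a minimal sketch. The names `caf\u00e9` and `read_cafe` are made up for illustration. U+00E9 (é) was picked because it lies inside the allowed range 00D8-00F6. This should be accepted by compilers that implement universal-character-names in identifiers; older GCC releases may need -fextended-identifiers spelled out.

```cpp
// Hypothetical example: an identifier spelled with a universal-character-name.
// \u00e9 designates U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which is inside
// the allowed range 00D8-00F6.
int caf\u00e9 = 42;

// Reads the variable back through the same \uXXXX spelling.
int read_cafe() {
    return caf\u00e9;
}
```

The same spelling would not rescue the macros in the question, because the code points for «, » and ≠ are outside the allowed ranges regardless of how they are written.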
G
4

The C++ Standard requires (section 2.10):

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in E.2. Upper- and lower-case letters are different. All characters are significant.

And E.1:

Ranges of characters allowed [charname.allowed]

  • 00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

  • 0100-167F, 1681-180D, 180F-1FFF

  • 200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F

  • 2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

  • 3004-3007, 3021-302F, 3031-303F

  • 3040-D7FF

  • F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD

  • 10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD, 60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD, B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

And E.2:

Ranges of characters disallowed initially [charname.disallowed]

  • 0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F

Your angle quotation marks « and » are U+00AB and U+00BB, which are not included. Not equal (≠) is U+2260, which is also outside the allowed ranges.
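
The range lookup can be checked mechanically. The following is a minimal sketch, assuming the E.1 ranges exactly as quoted above; `Range`, `kAllowed` and `allowed_in_identifier` are hypothetical names, not part of any compiler or library. It only tests membership in E.1 and does not model the additional E.2 restriction on the initial character.

```cpp
#include <cstdint>

// Inclusive code point range, transcribed from the E.1 list quoted above.
struct Range { std::uint32_t first, last; };

const Range kAllowed[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// Returns true if cp falls into one of the E.1 ranges.
bool allowed_in_identifier(std::uint32_t cp) {
    for (const Range& r : kAllowed) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    // Last bullet of E.1: planes 1 through 14, each from xx0000 to xxFFFD.
    const std::uint32_t plane = cp >> 16;
    return plane >= 1 && plane <= 14 && (cp & 0xFFFF) <= 0xFFFD;
}
```

With this table, `allowed_in_identifier` returns false for U+00AB («), U+00BB (»), U+00AC (¬) and U+2260 (≠), and true for a letter such as U+00E9 (é).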

Gelatinate answered 26/9, 2015 at 16:21 Comment(3)
The whole point of -f switches is to extend/break standard functionality, no? So what's the use of quoting the standard to explain why a compiler flag won't work as the OP expects? - Horntail
@LightnessRacesinOrbit: -fextended-identifiers is on by default in C++; it is -fno-extended-identifiers that breaks standard compliance. - Gelatinate
oh lol okay - should be in the answer then - Horntail
G
5

G++ doesn't support Unicode characters written directly in the source yet (see gcc.gnu.org/wiki/FAQ#utf8_identifiers).

Notably, the errors generated by your program are for the individual octets of the UTF-8 encoding, not for the Unicode characters they represent. ≠ is being seen as three bytes, \342\211\240, and « as two, \302\253.
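
The octal values in the diagnostics can be reproduced by encoding the code points as UTF-8 by hand. This is a minimal sketch with hypothetical helper names (`utf8_encode`, `octal_escapes`); it does not validate surrogates or out-of-range input.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Encode a single code point as UTF-8 (no validation of surrogates or
// values above U+10FFFF; this is only meant for illustration).
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Format each byte as a backslash followed by three octal digits, the same
// notation the "stray" errors use.
std::string octal_escapes(const std::string& bytes) {
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Calling `octal_escapes(utf8_encode(0x2260))` yields `\342\211\240`, and `octal_escapes(utf8_encode(0x00AB))` yields `\302\253`, matching the bytes named in the error messages.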

Gelatinate answered 26/9, 2015 at 16:28 Comment(1)
These are U+2260 (NOT EQUAL TO; UTF-8 342 211 240 (octal), 0xE2 0x89 0xA0 (hexadecimal)) and U+00AB (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK; UTF-8 302 253 (octal), 0xC2 0xAB (hexadecimal)), respectively. - Unblessed
