Is it possible to get GCC to compile UTF-8 with BOM source files?

I develop C++ cross platform using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.

In Visual Studio, I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).

For example:

// A = π.r²
double π = 3.14;

GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:

wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program

wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program

Which brings me to the question:

Is there a way to get GCC to compile UTF-8 files without first removing the BOM?


I'm using:

and:


As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
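The octal values in the error messages can be checked by hand. The following is a minimal sketch, not part of the original question, and the helper names `utf8_encode` and `octal_escapes` are made up for illustration. It encodes a code point as UTF-8 and prints the bytes in the octal notation GCC uses in its "stray" diagnostics. Under that encoding, π (U+03C0) comes out as `\317\200`, which matches the errors above, while the BOM (U+FEFF) would be `\357\273\277`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Surrogate code points are not rejected; this is only a sketch.
std::string utf8_encode(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // outside this range there is no UTF-8 encoding
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render each byte as a three-digit octal escape,
// the notation that appears in GCC's "stray '\NNN' in program" errors.
std::string octal_escapes(const std::string& bytes) {
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Calling `octal_escapes(utf8_encode(0x03C0))` and `octal_escapes(utf8_encode(0xFEFF))` gives the two byte sequences to compare against the compiler output.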

Topographer answered 26/10, 2011 at 7:25 Comment(6)
Works fine for me in GCC 4.4.5, using a string containing both of the Unicode characters in your question, in a file with a BOM. Also, the error you get has nothing to do with the BOM; it seems to be that the Unicode characters in question are outside any string (that's why they are called stray). Grocer
@JoachimPileborg Yes, the Unicode characters are outside of any string: the "π" I was using as a symbol name, and the "²" was just in comments. When I remove the BOM, it does eliminate the error from the console output, but I guess that's no guarantee that GCC is really handling the characters how I expect. Topographer
@JoachimPileborg, I've updated the question to include the context in which I'm using the Unicode characters. Topographer
It is an error to have a BOM in a UTF-8 stream, because it precludes catting three of them together and getting the correct result. Crepe
double π = 3.14; : typography +1, math -1. Hothouse
clang supports these symbols in identifiers; gcc only supports them in strings. To use Λ (Greek lambda) in an identifier in gcc, use a universal character name (ibm.com/support/knowledgecenter/en/ssw_ibm_i_74/rzarg/…), so a function funΛ() would be written as fun\u039B() to compile with gcc. I changed my compiler to clang, and things worked fine. gcc's -finput-charset=UTF-8 and -fextended-identifiers don't help either: -fextended-identifiers simply enables support for the universal character name format, and if it is turned off (-fno-extended-identifiers) even fun\u039B() fails. Lupulin

According to the GCC Wiki, this isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the non-ASCII characters to UCNs (universal character names). From the linked page:

perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;' 
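For readers who prefer to stay in C++, the same substitution can be sketched as a small function. This is an illustration rather than anything from the GCC Wiki: the name `to_ucn` is hypothetical, the input is assumed to be UTF-8, and malformed bytes are copied through unchanged. Like the Perl one-liner, it rewrites every non-ASCII character, including those inside strings and comments.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: replace every non-ASCII UTF-8 sequence in `src`
// with a \UXXXXXXXX universal character name; ASCII is copied as-is.
std::string to_ucn(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char lead = static_cast<unsigned char>(src[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80) { out += src[i++]; continue; }          // plain ASCII
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // Check that the expected continuation bytes are really there.
        bool ok = len != 0 && i + len <= src.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out += src[i++]; continue; }                  // malformed: pass through

        char buf[16];
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
        out += buf;
        i += len;
    }
    return out;
}
```

Applied to the line from the question, `to_ucn("double π = 3.14;")` yields `double \U000003c0 = 3.14;` when the source string is UTF-8 encoded.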

See also g++ unicode variable name and Unicode Identifiers and Source Code in C++11?

Rivers answered 26/10, 2011 at 15:44 Comment(1)
GCC caught up in version 10 (mid 2020). Amass

While Unicode identifiers are supported in GCC, UTF-8 input is not. Therefore, Unicode identifiers have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the C++ preprocessor allows gcc and g++ to process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are given in UTF-8 Identifiers in GCC.

However, the patch is so simple it can be given right here:

diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c

Output:

*** gcc-5.2.0/libcpp/charset.c  Mon Jan  5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c  Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;

Even with the patch, two command line options (-finput-charset and -fextended-identifiers) are needed to enable UTF-8 input. In particular, try something like

/usr/local/gcc-5.2/bin/gcc \
    -finput-charset=UTF-8 -fextended-identifiers \
    -o circle circle.c
Ludwigg answered 15/8, 2015 at 0:10 Comment(0)
