What is the meaning of these strange question marks?
I came across some weird-looking code. It doesn't even look like C, yet to my surprise it compiles and runs on my C compiler. Is this some non-standard extension to the C language and if so, what is the reason for it?

??=include <stdio.h>

int main()
??<
  const char arr[] = 
  ??<
    0xF0 ??! 0x0F,
    ??-0x00,
    0xAA ??' 0x55
  ??>;

  for(int i=0; i<sizeof(arr)/sizeof(*arr); i++)
  ??<
    printf("%X??/n", (unsigned char)arr??(i??));
  ??>

  return 0;
??>

Output:

FF
FF
FF
Excerpta answered 23/5, 2014 at 9:29 Comment(4)
That's my first step towards obfuscation. I don't think many C coders are aware of this.Laurentium
possible duplicate of meaning of `???-` in C++ codeRudderpost
You should refactor code that looks like this. A quick way would be to replace "??=" with "? ?=", run gcc -trigraph -E on it, and then replace "? ?=" in the output with "#" while removing the first few lines of output by gcc.Foliage
@Rudderpost It is actually not a duplicate, because this is for C. C++ has additional trigraphs, none of which was mentioned in your duplicate candidate, so it's not a complete answer. Nor does it mention the meaning of the trigraphsExcerpta
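The refactoring recipe from the comment above can be sketched as a shell session. The file names are hypothetical, and since the actual recipe depends on `gcc -trigraphs -E`, the trigraph-expansion step is shown here with a compiler-free `sed` stand-in (the real `gcc` invocation is kept as a comment):

```shell
# Hypothetical input file containing trigraphs.
cat > trigraphs.c <<'EOF'
??=include <stdio.h>
int main() ??< printf("hi??/n"); return 0; ??>
EOF

# Step 1: protect ??= so the preprocessor does not consume the directives.
sed 's/??=/? ?=/g' trigraphs.c > protected.c

# Step 2 (as in the comment): expand trigraphs with the preprocessor,
# then restore the directives:
#   gcc -trigraphs -E protected.c | sed 's/? ?=/#/g' > refactored.c
# Compiler-free stand-in: expand the remaining trigraphs with sed instead.
sed -e 's/??</{/g' -e 's/??>/}/g' -e 's|??/|\\|g' protected.c \
  | sed 's/? ?=/#/g' > refactored.c

cat refactored.c
```

The `? ?=` trick works because a trigraph must be exactly three adjacent characters; inserting a space breaks the sequence so it survives preprocessing and can be restored afterwards.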

The code fully conforms to every version of the C standard up to and including C17. The ?? sequences are called trigraphs, and they were introduced into C to provide an alternative way of writing certain symbols. The program looks like it was written as a demonstration of various trigraph sequences.

Back in the day, many computers and their keyboards were built around an old character set called ISO 646, which does not contain all of the symbols used in the C language, such as \ { } [ ]. This made it hard for programmers in some countries to write C, because their national keyboard layouts and code pages lacked (or repurposed) the necessary symbols. Instead of remaking the keyboards and character sets, the C language was changed.

This is why trigraphs were introduced. Today they are considered a completely obsolete feature and their use is discouraged.[1] GCC, for example, ignores them by default outside strict standard modes and warns when it encounters one. Still, they remained in the C standard through C17 for backwards compatibility, and all conforming compilers had to support them.

The existing trigraph sequences are (C11 5.2.1.1 Trigraph sequences):

??=  #
??(  [
??/  \
??)  ]
??'  ^
??<  {
??!  |
??>  }
??-  ~

The left column is the trigraph sequence and the right column is its meaning.


EDIT: Those interested in the historical decisions can read about them in the C rationale v5.10, chapter 5.2.1.1.


[1]: C23 removed trigraphs from the language standard entirely.

Excerpta answered 23/5, 2014 at 9:29 Comment(6)
Firstly, ISO 646 doesn’t lack “{ }” etc.; it only makes them non-portable. For example, British and Dutch ISO 646 codepages have “{ }” whereas French and German ones do not.Unkenned
Secondly, were “programmers from some countries [unable] to even write C, because their national keyboard layout lacked the necessary symbols”? In fact, they were able, because compilers (such as ISO 646-oriented) understand “{ }” as “\173 \175” (abstract characters) and don’t care about rendering, keyboard etc. So, a coder from France or Canada was able to type “é è” for C compounds, likewise “ä ü” for Germany or Switzerland… which, of course, would be inconvenient for programmers (hence trigraphs and digraphs), but indeed possible. You got this historical stuff wrong.Unkenned
@IncnisMrsi Apparently Scandinavian letters were one of the reasons. I heard this directly from a certain C programming guru who was on the C89 committee. I was having lunch with him when he visited Sweden and he joked by blaming Scandinavians for trigraphs. For example in Sweden you have ä which can be written with any keyboard, but also å which requires a special keyboard. Same with some Danish and Norwegian letters.Excerpta
Apparently some Danish member of the WG was pushing for adding trigraphs. I don't know if it's true or just a fun story that grew over the years. You can probably go dig up some minutes of meeting from ISO somewhere if it matters...Excerpta
@IncnisMrsi As for "you got this historical stuff wrong", this is from the C rationale: _"The characters in the ASCII repertoire used by C and absent from the ISO/IEC 646 invariant repertoire are: # [ ] { } \ | ~ ^". I added a link to the rationale to the answer.Excerpta
What namely derives from the rationale? It’s really cool that @Excerpta defends this writeup made nearly ten years ago, but §5.2.1 from exactly that C99RationaleV5.10.pdf states ❝No particular graphic representation for any character is prescribed❞ contrary to Lundin’s interpretation and adds an example of \134 (C’s backslash) mapped to the yen sign (¥) as legitimate. This “impossible to write C, because their national keyboard layout” innuendo demonstrates misunderstanding of problems faced by computer users in 1980s and is likely some imprint from the 8-bit era (∼1990s).Unkenned

This is obfuscated code conforming to the 1989 ANSI C standard (which formally defined the “??”-based trigraphs) and later standards. The choice of 0xF0 ??! 0x0F etc. is evidently for obfuscation or compiler testing, since there is no reason to apply bitwise operators to trivial literals in an initializer in production code.

The C89 Committee's stated reasons for introducing trigraphs were about portability. The problem with # [ ] { } \ | ~ ^ was not confined to ISO 646 encodings and should be understood in the context of the diverse computing platforms of the 1980s, including Soviet computers (some of which used KOI-7), IBM mainframes (which used various dialects of EBCDIC), Commodore computers (which used PETSCII), Atari computers (which used ATASCII), and so on. In hindsight, the decision came late.

Unkenned answered 6/4 at 5:37 Comment(7)
I wonder why you felt the need to necro a ten-year-old Q with an accepted A, with an answer that adds nothing to the existing A except some opinion on rationale and hostility.Acervate
@Acervate the pre-existing answer actually didn’t (and doesn’t yet) address parts of the question. Moreover, it presented a misleading narrative about rationale behind introduction of the trigraphs in 1989.Unkenned
And if you weren't so hostile about things, I might feel like telling you about source character set, and how writing a character that happens to have the same encoding as { isn't the same as ??<. Or argue why your assertion that the code example is about "obfuscation" is patently false. But you are, so I won't, and just let my downvote on your presentation of opinion stand.Acervate
Surely, using é is different from ??<, if only because the former would break the code with recoding to another character set. When did Ī̲ assert that such presentations of the block-beginning make no difference? We see merely a lengthy negative opinion by @Acervate (mostly about my character, not the answer) and no convincing theory about purpose of the code.Unkenned
I am criticizing the character of your answer (and your comments), which are unnecessarily hostile. You can present your own take of matters without making disparaging comments on others. Such should be reserved to genuinely bad-faith answers, which Lundin's isn't, and even then serves no particular purpose. Generally speaking, the voting system works quite nicely, as you might have noticed. On the subject, I refer to your comment on Lundin's answer as of Apr 4 10:32, which makes referrals to trigraphs as "obfuscating" somewhat questionable. Perhaps re-think your communication strategy.Acervate
Did DevSolar actually read my text, not the comment by @Abhineet? Ī̲ never made any point that ??! is a reliable indicator of obfuscation. Ī̲ stated that | on trivial literals seems incompatible with production code and suggests obfuscation.Unkenned
Your answer literally starts with "This is an obfuscated code...". This is getting tiresome.Acervate

© 2022 - 2024 — McMap. All rights reserved.