Why was the space character not chosen for C++14 digit separators?

Asked 4/1, 2015 at 16:38 Answered 26/8, 2018 at 20:21

As of C++14, thanks to n3781 (which in itself does not answer this question) we may write code like the following:

const int x = 1'234; // one thousand two hundred and thirty four

The aim is to improve on code like this:

const int y = 100000000;

and make it more readable.
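As a minimal sketch of what the C++14 syntax permits (the variable names below are made up for illustration), the separators are simply dropped when the value of the literal is computed, and no particular grouping is enforced:

```cpp
// Digit separators (C++14) do not affect the value of a literal.
constexpr int thousand_ish  = 1'234;
constexpr int hundred_mill  = 100'000'000;
constexpr int odd_grouping  = 1'0'0'0;      // grouping is not checked by the compiler
constexpr int hex_value     = 0xFF'FF;      // also allowed in hexadecimal literals
constexpr int bin_value     = 0b1111'0000;  // and in binary literals (also new in C++14)

static_assert(thousand_ish == 1234, "separators are ignored");
static_assert(hundred_mill == 100000000, "separators are ignored");
static_assert(odd_grouping == 1000, "any grouping is accepted");
static_assert(hex_value == 65535, "0xFFFF");
static_assert(bin_value == 240, "0b11110000");
```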

The underscore (_) character was already taken in C++11 by user-defined literals, and the comma (,) has localisation problems — many European countries bafflingly^† use this as the decimal separator — and conflicts with the comma operator, though I do wonder what real-world code could possibly have been broken by allowing e.g. 1,234,567.
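To illustrate the underscore conflict, here is a small sketch assuming a hypothetical literal operator whose suffix happens to be `_234`. With it in scope, `1_234` is already valid C++11 and means "the literal 1 with suffix _234", not one thousand two hundred and thirty four:

```cpp
// Hypothetical user-defined literal suffix `_234`, for illustration only.
constexpr unsigned long long operator""_234(unsigned long long v) {
    return v * 2;
}

// `1_234` is lexed as the integer literal 1 followed by the suffix _234.
constexpr unsigned long long doubled_one = 1_234;
constexpr unsigned long long doubled_ten = 10_234;

static_assert(doubled_one == 2, "1_234 calls operator\"\"_234(1)");
static_assert(doubled_ten == 20, "10_234 calls operator\"\"_234(10)");
```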

Anyway, a better solution would seem to be the space character:

const int z = 1 000 000;

These adjacent numeric literal tokens could be concatenated by the preprocessor just as are string literals:

const char x[5] = "a" "bc" "d";
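For reference, a quick check of the existing string behaviour (the array name `joined` is just for this sketch): the three adjacent literals become one array of four characters plus the terminating null.

```cpp
// Adjacent string literals are concatenated in translation phase 6.
constexpr char joined[] = "a" "bc" "d";

static_assert(sizeof(joined) == 5, "\"abcd\" plus the terminating null");
static_assert(joined[0] == 'a' && joined[1] == 'b' &&
              joined[2] == 'c' && joined[3] == 'd', "contents are \"abcd\"");
static_assert(joined[4] == '\0', "null terminator");
```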

Instead, we get the apostrophe ('), not used by any writing system I'm aware of as a digit separator.

Is there a reason that the apostrophe was chosen instead of a simple space?

^† It's baffling because all of those languages, within text, maintain the notion of a comma "breaking apart" an otherwise atomic sentence, with a period functioning to "terminate" the sentence; to me, at least, this is quite analogous to a comma "breaking apart" the integral part of a number and a period "terminating" it ready for the fractional input.

Altis answered 4/1, 2015 at 16:38 Comment(29)

Regarding the comma, isn't the issue the comma operator, rather than localization problems? – Strongwilled 4/1, 2015 at 16:49

@BenjaminLindley: As I suggest in the question, though this may be true, I can't think of any real-world code that would actually have been broken by such a conflict. Who writes 1,000,000 and might expect anything other than concatenation of those literals, in reality? The closest I can get is foo()*3, 4, 5 but I think requiring parens around the first expression is reasonable. Because it's silly code in the first place. – Altis 4/1, 2015 at 16:50

@BenjaminLindley: Ah, I forgot a few words. I did mean for that half-sentence to briefly address the comma operator. – Altis 4/1, 2015 at 16:53

I so often hear that "whitespace doesn't matter!" or "is ignored!" - would be nice if that were made a bit truer! – Steverson 4/1, 2015 at 16:58

@LightnessRacesinOrbit: I assume nobody intended to change the meaning of int a[] = {123,000,000}. As for the comma versus period distinction, note that these are fairly recently standardized - both in text and numbers. – Suppositive 4/1, 2015 at 17:20

@MSalters: Ouch, that's a good example. re "recently standardized" what do you mean? I'm not aware of any language that has changed it in recent memory, and certainly not since 1998. – Altis 4/1, 2015 at 17:28

@LightnessRacesinOrbit: I actually meant their use in written language, which happened in the 19th century. – Suppositive 4/1, 2015 at 17:33

@MSalters: Right, which is why I'm confused as to the relevance of that fact, because the 19th century somewhat predates C++. – Altis 4/1, 2015 at 17:42

@LightnessRacesinOrbit: The comment was in regard to your footnote and non-English languages. The period was already in use as group separator in dates, e.g. IV.I.MMXV is today. – Suppositive 4/1, 2015 at 17:45

@MSalters: I interpret that differently from you. Those periods are delimiting three distinct fields (as the period in English 123.45 delimits integral and fractional); this is a different function than that served by a thousands separator, which is purely aesthetic rather than semantic in use. As such, your would-be counterexample is just another example of why the modern English comma-as-thousands-separator makes sense (over the use of a period for the same thing) and has done since before the 19th century. :) – Altis 4/1, 2015 at 17:48

@LightnessRacesinOrbit There's no "sense" about it. A decimal separator is certainly not a full stop; using a comma would probably be the most "reasonable". But these are purely typesetting conventions, developed over time, different in different "locales", and as MSalters points out, only standardized very recently. – Leatrice 4/1, 2015 at 18:32

@JamesKanze: 1800s is hardly "very recently", though I concede that such things are relative. – Altis 4/1, 2015 at 19:14

Aside from the technical points, you say that the apostrophe ('), [is] not used by any writing system I'm aware of as a digit separator. There is one country using the apostrophe as digit separator: Switzerland. I’ve also seen it in instances where the author likes it more or a point/comma would cause confusion, since they are used differently internationally. – Evocative 4/1, 2015 at 20:50

If that makes you feel any better, I'm European and thanks to many products being made in the USA (calculators, etc.) using commas for decimal values is - fortunately - very slowly falling out of favor. I would say 0.99 is now more widely used than 0,99; using commas as thousand separator is unheard of though, as is using dots, we just don't separate them (probably because 1,234 and 1.234 both mean decimals nowadays) – Outward 4/1, 2015 at 22:10

Regarding using the comma as the separator, you could consider what (1,200) means -- it could mean 1200 or 200, depending on what you want to read it as, if the comma is used as the thousand separator. Again, as @AndreasBonini writes, it is not used that often in Europe. – Charcot 4/1, 2015 at 22:33

@AndreasBonini: It does somewhat ;) – Altis 5/1, 2015 at 0:54

@BenjaminLindley I'd have thought the issue with comma would be the ambiguity in the case of int foo(int);int foo(int,int); foo(1,000); – Baud 5/1, 2015 at 1:29

Well, we just need a Unicode character that means specifically digit grouping. – Novice 5/1, 2015 at 4:9

@jdlugosz: And an appropriate Alternative Token representation (digraph)! :D – Altis 5/1, 2015 at 8:8

For the record, the apostrophe is standard digit separator notation on adding machines. (Reference) – Darendaresay 7/1, 2015 at 20:2

@Eric: Not all of them. There is no "standard". – Altis 7/1, 2015 at 20:53

That's the first I've seen without apostrophes, and Google Image Search seems to show the vast majority using it. At the very least, it's accountable as common usage in such machines, if we want to avoid picky words like "standard". – Darendaresay 7/1, 2015 at 20:55

@Eric: From a cursory search on adding machine it looks more like half/half, not "the vast majority". – Altis 7/1, 2015 at 21:15

@Luc - Latvia too uses an apostrophe as thousand separator - or at least we did when I was still at school. :P It's not something you use often. – Betthezul 26/6, 2016 at 15:27

There's a great post on the UX site, taken from Wikipedia, that shows the usage of separators for several different countries. As is often the case with internationalisation, there are more variants than one would expect, and it's pretty much a mess! :-) – Overwhelm 12/7, 2016 at 13:32

Commas may be lesser separators in written English, but in algebraic expressions dots are often optionally placed solely to clarify the boundaries between tokens, whereas the comma is used to separate two distinct elements in a pair. Consider a·sin(A) which is the same as a(sin(A)), whereas v = (2,3) is very different from v = (2(3)). Anyway, logical recourses to precedent to choose between localisations never really give us the right answer. – Agnate 8/4, 2017 at 14:28

@TimMB: True enough. – Altis 8/4, 2017 at 19:23

Regarding real-world code using comma operator between numbers: Eigen::Matrix3f m; m << 1,2,3,4,5,6,7,8,9;. See Eigen's comma initializer. – Olav 28/5, 2018 at 20:16

@Ruslan: But that's actually a chained bunch of function calls (each one with an Eigen type on the LHS) and could be defined to take precedence over a "single" literal found in a subexpression on its own. Logically the two could be distinguished but, admittedly, it's otherwise ambiguous and the parsing stage may not want to have to work that out. – Altis 12/6, 2018 at 12:35

There is a previous paper, n3499, which tells us that although Bjarne himself suggested spaces as separators:

While this approach is consistent with one common typeographic style, it suffers from some compatibility problems.

It does not match the syntax for a pp-number, and would minimally require extending that syntax.

More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. The preprocessor would not know whether to perform symbol substitution starting after the space.

It would likely make editing tools that grab "words" less reliable.

I guess the following example is the main problem noted:

const int x = 0x123 a;

though in my opinion this rationale is fairly weak. I still can't think of a real-world example to break it.

The "editing tools" rationale is even worse, since 1'234 breaks basically every syntax highlighter known to mankind (e.g. that used by Markdown in the above question itself!) and makes updated versions of said highlighters much harder to implement.

Still, for better or worse, this is the rationale that led to the adoption of apostrophes instead.
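A small sketch of the ambiguity the paper describes, assuming a hypothetical macro named `a45` (a name made up entirely of hex-digit characters). Today the space ends the literal, so the macro is expanded; under a space-as-separator rule the same characters could instead be read as the single literal 0x123a45:

```cpp
// Hypothetical macro whose name consists only of hex-digit characters.
#define a45 + 1

// Under current rules the literal ends at the space, `a45` is an identifier,
// and the macro is expanded: this is 0x123 + 1.
constexpr int spaced_today = 0x123 a45;

static_assert(spaced_today == 0x124, "0x123 + 1 under current rules");
static_assert(spaced_today != 0x123a45, "not the single literal 0x123a45");
```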

Altis answered 4/1, 2015 at 16:42 Comment(16)

I think the example would be better with const int x = 0x123 a45;. Note that unlike the string case, a45 is not another literal. – Maudiemaudlin 4/1, 2015 at 16:53

@aschepler: If I were President of Earth, it would be the case that a "literal" would include a space in its production, making 0x123 a45 a single, albeit-multi-token literal. Can you think of a scenario in which a45 being interpreted as part of an integer literal here would not be desired? There's no operator or anything before it so what else could it ever be? – Altis 4/1, 2015 at 16:55

#define abc + 1, const int x = 0x123 abc; – Carefree 4/1, 2015 at 17:21

@Carefree Macros are expanded in phase 4, and string literals are concatenated in phase 6. I would expect "number literal concatenation" to also take place in phase 6, thus maintaining the behaviour of your example code and not breaking anything. – Altis 4/1, 2015 at 17:28

@LightnessRacesinOrbit I'm not sure if it's that easy. To permit macro replacement you'd have to parse abc as an identifier, but then you'd have to specify some sort of concatenation of a pp-number and an identifier, which is...weird. Besides, there are apparently also significant concerns with breaking Objective-C. – Carefree 4/1, 2015 at 17:41

@LightnessRacesinOrbit: I have sometimes used things like x and q as temporary "metaprogramming" macros (undef'ed immediately after use) when it was necessary to define data tables that combined various bit-shifted values. I don't think I've used a-f in such fashion, but I don't think I deliberately avoided them, so it's plausible that a programmer might use metaprogramming macros that start with those characters, thus creating ambiguity. – Transonic 4/1, 2015 at 19:51

@supercat: It wouldn't be ambiguous, and your macros would still work, because macros are expanded early. – Altis 4/1, 2015 at 19:56

@LightnessRacesinOrbit: I was thinking of situations where macros expand to things like <<6) | (uint64)(, and one might produce some tables with things like V(12 x 12 y 5 z). Such macros could be ambiguous if they contained letters a-f, or if a blank could appear between the characters of the 0x prefix. – Transonic 4/1, 2015 at 20:13

@supercat: That's terrible code and I would rather we did not optimise for it ;p – Altis 4/1, 2015 at 20:28

@LightnessRacesinOrbit: I still see some meaning to T.C.'s argument. It could create extreme confusion with existing macros. Consider the new macro rule we'd have to add: "Do not write macros whose name only contains [0-9A-Fa-f], because that can break arbitrary hex numbers which used spaces." While it is technically backwards compatible because old code won't use this number notation, there's something very unpleasant about that. Especially since DEADBEEF is probably not an unreasonable macro name today. In the present system, at least DEADBEEF can never be a number without a 0x – Aleishaalejandra 4/1, 2015 at 20:39

@LightnessRacesinOrbit: Can you explain how exactly it comes to work correctly if there are #define'd things like GOOD (as something arbitrary), BAD (as something arbitrary), and somewhere down below there is a number 0xBAD BAD? (please don't mind the capitalization - it could be any) – Gujarat 4/1, 2015 at 20:41

@LightnessRacesinOrbit: The construct isn't used for "code" as such, but rather for data tables which will need to reside in ROM. Given a choice between writing (and having to maintain) a separate utility to convert some other format of data table into C constant declarations, or (ab)using the preprocessor, there can be advantages to keeping everything within one tool chain. In any case, such code exists, and thus the standard should not change in such a way as to alter its meaning. – Transonic 4/1, 2015 at 20:55

@LightnessRacesinOrbit LOL aside, Alisdair Meredith in one of his talks at CppCon 2014 said that whitespace-as-separator was rejected for that reason. – Carefree 4/1, 2015 at 23:25

@EugeneRyabtsev: Again, those macros are expanded first. Nothing would change. Nothing would be made more confusing than it already is by having a macro. If you know that the macro exists and what it expands to, then you understand the program; if you don't, then you are already confused today! I guess my entire counter-argument hinges on this. I shan't assert that it's a sufficient counter-argument, but it is one nonetheless! I just hope that I've put it forth adequately. People keep talking about altered meaning, but I do not see one (cont.) – Altis 5/1, 2015 at 0:56

(cont.) except in the array initialisation / function args cases, which are very convincing. – Altis 5/1, 2015 at 0:57

@PaulManta no kidding; makes me see the appeal of LISP, especially for language research. – Novice 5/1, 2015 at 4:22

The obvious reason for not using white space is that a new line is also white space, and that C++ treats all white space identically. And off hand, I don't know of any language which accepts arbitrary white space as a separator.

Presumably, Unicode 0xA0 (non-breaking space) could be used—it is the most widely used solution when typesetting. I see two problems with that, however: first, it's not in the basic character set, and second, it's not visually distinctive; you can't see that it isn't a space by just looking at the text in a normal editor.

Beyond that, there aren't many choices. You can't use the comma, since that is already a legal token: something like 1,234 is currently legal C++, with the meaning 234, and it can occur in legal code, e.g. a[1,234]. While I can't quite imagine any real code actually using this, there is a basic rule that no legal program, regardless of how absurd, should silently change semantics.
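The existing meanings of the comma can be checked directly. This is only a sketch with invented names (`last_operand`, `table`, `count_args`); it also covers the array-initialiser and overload cases raised in the comments. Compilers may warn that the left operand of the comma has no effect, which is rather the point.

```cpp
#include <cstddef>

// Comma operator: the left operand is discarded, the result is the right one.
constexpr int last_operand = (1, 234);

// In a braced initialiser the commas separate elements: three ints, not one.
constexpr int table[] = {123, 000, 000};
constexpr std::size_t table_size = sizeof(table) / sizeof(table[0]);

// With overloads, the comma separates arguments and selects the two-int version.
constexpr int count_args(int) { return 1; }
constexpr int count_args(int, int) { return 2; }

static_assert(last_operand == 234, "(1, 234) evaluates to 234");
static_assert(table_size == 3, "three elements");
static_assert(count_args(12, 345) == 2, "two arguments, not 12345");
```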

Similar considerations mean that _ can't be used either; if there is a #define _234 * 2, then a[1_234] would silently change the meaning of the code.

I can't say that I'm particularly pleased with the choice of ', but it does have the advantage of being used in continental Europe, at least in some types of texts. (I seem to remember having seen it in German, for example, although in typical running text, German, like most other languages, will use a point or a non breaking space. But maybe it was Swiss German.) The problem with ' is parsing; the sequence '1' is already legal, as is '123'. So something like 1'234 could be a 1, followed by the start of a character constant; I'm not sure how far you have to look-ahead to make the decision. There is no sequence of legal C++ in which an integral constant can be followed by a character constant, so there's no problem with breaking legal code, but it means that lexical scanning suddenly becomes very context dependent.
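On the look-ahead question, here is a rough sketch of a scanner for number-like preprocessing tokens. It is a simplification written for this discussion, not the standard's grammar verbatim, and the function names are invented. It assumes the rule that an apostrophe continues the token only when a digit, letter, or underscore immediately follows, so one character of look-ahead is enough in this model:

```cpp
#include <cstddef>

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_ident_char(char c) {
    return is_digit(c) || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Length of the number-like token at the start of the null-terminated
// string s, or 0 if s does not start with one.
constexpr std::size_t scan_pp_number(const char* s) {
    std::size_t i = 0;
    if (is_digit(s[0])) i = 1;
    else if (s[0] == '.' && is_digit(s[1])) i = 2;
    else return 0;
    while (s[i] != '\0') {
        const char c = s[i];
        if ((c == 'e' || c == 'E') && (s[i + 1] == '+' || s[i + 1] == '-')) i += 2;
        else if (is_ident_char(c) || c == '.') i += 1;
        else if (c == '\'' && is_ident_char(s[i + 1])) i += 2;  // digit separator
        else break;                                             // space, ';', lone '
    }
    return i;
}

static_assert(scan_pp_number("1'234;") == 5, "separator kept inside the token");
static_assert(scan_pp_number("1 000") == 1, "a space ends the token");
static_assert(scan_pp_number("0x123 a") == 5, "so does a space before a hex digit");
```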

(With regards to your comment: there is no logic in the choice of a decimal or a thousands separator. A decimal separator, for example, is certainly not a full stop. They are just arbitrary conventions.)

Leatrice answered 4/1, 2015 at 18:29 Comment(12)

"a new line is also white space". Sorry if I am being silly here, but why is that? :) – Bittner 4/1, 2015 at 18:41

@G.Samaras: C defines "whitespace" to be "... space, horizontal tab, new-line, vertical tab, and form-feed", and this is entirely conventional. – Altis 4/1, 2015 at 19:17

I don't buy the comma problem example at all. Why would anyone write 1,234? That it's currently valid doesn't mean it's useful. MSalters' example of array initialisation was pretty good, though. As for silently changing semantics more generally, though, there is precedent for doing so where the utility vastly outstrips the actual use cases (auto being the most obvious example). – Altis 4/1, 2015 at 19:18

@LightnessRacesinOrbit Or even void f(int); void f(int, int); f(12,345); – Carefree 4/1, 2015 at 19:18

Re a[1_234] silently being changed, again no because macros are already processed two stages of translation before the stage that performs string concatenation (where I would expect this literal "concatenation" to also take place). – Altis 4/1, 2015 at 19:20

Finally, I presented some logic as to why a decimal separator makes more sense as a period. – Altis 4/1, 2015 at 19:21

How is '123' legal? – Uncinariasis 4/1, 2015 at 22:56

@CraigMcQueen It's a multi-character literal. Not very useful, because of its implementation-defined nature. – Tarim 5/1, 2015 at 1:6

@G.Samaras Because C++ is not line oriented. A new line plays exactly the same role as any other white space in the language. – Leatrice 5/1, 2015 at 1:11

@LightnessRacesinOrbit Re the definition of white space being conventional, this is partially true. But there are certain unifying characteristics: none of the white space characters require any ink. More importantly, they can be used interchangeably, and can be repeated without effect. – Leatrice 5/1, 2015 at 1:14

@CraigMcQueen: Not only is it legal with implementation-defined semantics, but even if it weren't then this would more than likely be a semantic restriction, and still valid syntax, which is the key especially when discussing syntax highlighting problems. – Altis 5/1, 2015 at 1:16

@LightnessRacesinOrbit Re reasons why a period makes sense as a decimal separator: the only comments I can find with regards is to the effect that the decimal separator is truly a separator. Which argues against a full stop, since a full stop (or period) is a terminator, not a separator. (But it's completely irrelevant either way, since we are dealing with mathematical conventions, not textual punctuation.) – Leatrice 5/1, 2015 at 1:23

From wiki, we have a nice example:

auto floating_point_literal = 0.000'015'3;

Here, we have the . character and then, if another separator is to be encountered, my eyes would expect something visible, like a comma, not whitespace.

So an apostrophe does much better here than a whitespace would do.

With whitespaces it would be

auto floating_point_literal = 0.000 015 3;

which doesn't feel as right as the case with the apostrophes.

In the same spirit as Albert Renshaw's answer, I think that the apostrophe is clearer than the space that Lightness Races in Orbit proposes.

type a = 1'000'000'000'000'000'544'445'555;
type a = 1 000 000 000 000 000 544 445 555;

Space is used for many things, like the string concatenation the OP mentions, unlike the apostrophe, which in this case makes it clear to the reader that it is being used to separate the digits.

When the lines of code become many, I think that this will improve readability, but I doubt that is the reason they chose it.

About the spaces, it might be worth taking a look at this C question, which says:

The language doesn't allow int i = 10 000; (an integer literal is one token, the intervening whitespace splits it into two tokens) but there's typically little to no expense incurred by expressing the initializer as an expression that is a calculation of literals:

int i = 10 * 1000; /* ten thousand */
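As a quick check of the two spellings discussed in this answer (variable names invented for the sketch), the separated literals have exactly the same values as the plain ones, and the multiplication workaround gives the same constant where the number happens to be round:

```cpp
// The floating-point literal with and without C++14 digit separators.
constexpr double with_separators    = 0.000'015'3;
constexpr double without_separators = 0.0000153;

// The pre-C++14 workaround from the quoted C answer, next to the C++14 form.
constexpr int by_multiplication = 10 * 1000;
constexpr int by_separator      = 10'000;

static_assert(with_separators == without_separators, "same literal value");
static_assert(by_multiplication == by_separator, "both are ten thousand");
```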

Bittner answered 4/1, 2015 at 17:2 Comment(6)

Often the long number you're expressing doesn't end in all zeros, in which case your 10*1000 example doesn't work. – Jaynajayne 4/1, 2015 at 17:17

@MarkRansom this is an example pasted from the answer I linked. You think I should modify it? – Bittner 4/1, 2015 at 17:18

You are, I assume, aware of the publication date (specifically, the month and day, not so much the year) of that paper on whitespace overloading, right? – Strongwilled 4/1, 2015 at 17:24

Yeah not so modern, I am going to edit @BenjaminLindley. – Bittner 4/1, 2015 at 17:25

The modernity of it was not the concern. Investigate it a bit more carefully. If the date has no significant meaning in your part of the world, google it. – Strongwilled 4/1, 2015 at 17:26

Your opinions on readability are by no means universal; the apostrophe looks downright bizarre to most of the world population. And certainly all scientific documents use whitespace as the separator on the fractional side, and most use it as the separator on the integer side too. – Innumerable 5/1, 2015 at 0:31

It is true I see no practical meaning to:

if (a == 1 1 1 1 1) ...

so digits might be merged without real ambiguity, but what about a hexadecimal number?

0 x 1 a B 2 3

There would be no way to tell this apart from a typo (normally we would expect to see an error).

Claret answered 4/1, 2015 at 17:18 Comment(1)

Well, simple. It would be valid code now, instead of an error. A typo can still result in valid code, and there is absolutely no way to prevent this if your 'language' consists of more than one word. – Baud 5/1, 2015 at 1:32

I would assume it's because, while writing code, if you reach the end of a "line" (the width of your screen) an automatic line-break (or "word wrap") occurs. This would cause your int to get split in half, one half of it would be on the first line, the second half on the second... this way it all stays together in the event of a word-wrap.

Traveled answered 4/1, 2015 at 16:50 Comment(4)

I'm not on the C++ design committee, but from what I gather concerns like these typically don't factor into the decision-making. – Presber 4/1, 2015 at 16:51

I don't think that this is the reason, but it's an interesting one that I had not considered. Open to more ideas in more answers from people :) – Altis 4/1, 2015 at 16:51

@LightnessRacesinOrbit Also, it probably also prevents code compilers from omitting your numeric-breaks. Spaces would get stripped, these could be left. But that's just a silly possibility that anyone would care about this haha. – Traveled 4/1, 2015 at 17:10

@AlbertRenshaw: I don't follow? – Altis 4/1, 2015 at 17:14

-1

float floating_point_literal = 0.0000153;   /* C, C++*/

auto floating_point_literal = 0.0000153;    // C++11

auto floating_point_literal = 0.000'015'3;  // C++14

Commenting does not hurt:

/*  0. 0000 1530 */ 
float floating_point_literal = 0.00001530;

Binary strings can be hard to parse:

long bytecode = 0b1111011010011001; /* gcc , clang */  

long bytecode = 0b1111'0110'1001'1001;  //C++14
// 0b 1111 0110 1001 1001  would be better, really.
// It is how humans think.

A macro for consideration:

/* Each macro pastes its arguments onto a prefix to form a single literal. */
#define B(W,X,Y,Z)    (0b##W##X##Y##Z)
#define HEX(W,X,Y,Z)  (0x##W##X##Y##Z)
#define OCT(O)        (0##O)

long z = B(1001, 1001, 1010, 1011);   /* groups may only contain 0 and 1 */

// result :  long z = (0b1001100110101011);

long o = OCT(35);

// result :  long o = (035); // 35_oct => 29_dec

long h = HEX(FF, A6, 3B, D0);

// result :  long h = (0xFFA63BD0);

Ellen answered 26/8, 2018 at 20:21 Comment(4)

This doesn't answer the question. – Mulloy 26/8, 2018 at 21:18

Oh yes, commenting does hurt. One problem is that the comment might be wrong, now or in future. The other is that repetititititititition hinders readability and is error-prone. – Friedly 26/8, 2018 at 21:35

@Friedly In this case a wrong comment is pretty trivial to spot (the comment doesn't add meaning, it just re-formats the information below it). – Altis 28/8, 2018 at 10:38

Sure in this case it's easy to spot. If you divert your attention a bit to try doing so. – Friedly 28/8, 2018 at 13:55

-2

It has to do with how the language is parsed. It would have been difficult for the compiler authors to rewrite their products to accept space delimited literals.

Also, I don't think separating digits with spaces is very common. From what I've seen, it's always non-whitespace characters, even in different countries.

Spano answered 8/5, 2017 at 2:18 Comment(5)

They had to change their parsers anyway. – Altis 8/5, 2017 at 8:58

@BoundaryImposition I'm afraid you don't understand. Whitespace already has a meaning in the language. One that is fundamental. Changing 12'345'678 (digit separators) into the binary form is about the same as without digit separators. It takes the same amount of effort for the compiler author. Whereas to redefine the tokenizing system itself would have been difficult. Plus space separated numbers look ugly. – Spano 8/5, 2017 at 22:41

I can assure you I do understand. The "tokenizing system" would not need to be "redefined". Consider, for example, string literal concatenation, which already works just fine. – Altis 8/5, 2017 at 22:42

Whitespace only has a "fundamental" meaning inasmuch as it prevents two consecutive characters from being part of the same token. As the OP mentioned, this could be trivially slotted in to the "join adjacent string literals" preprocessor pass. The (main) parser would never even see it. – Welch 28/5, 2018 at 15:9

I am afraid you overmystify the tokenizer. You could either do as is done for string literals, for which concatenation happens in translation phase 6, i.e. in phase 6, ["foobar"] ["frob"] becomes ["foobarfrob"]. Or the tokenizer could be extended to absorb spaces: decimal_literal ::= [1-9][0-9]+[uU]?(l|L|ll|LL)? becomes decimal_literal ::= [1-9][ 0-9]+[uU]?(l|L|ll|LL)?, in which case the literal has to be normalized later. It's basically the same operation mode as for '. Not sure what you really want to say :| – Plains 13/11, 2018 at 12:46
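The "normalize later" step can be sketched as follows. This is purely hypothetical, since no such rule exists in C++; the function name is invented, and it assumes its input contains only decimal digits and single spaces:

```cpp
#include <cstddef>

// Hypothetical normalisation of a space-separated decimal literal:
// spaces are skipped, digits are accumulated as usual.
// Assumes s contains only the characters '0'-'9' and ' '.
constexpr unsigned long long parse_spaced_decimal(const char* s) {
    unsigned long long value = 0;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        if (s[i] == ' ') continue;  // separator, contributes nothing
        value = value * 10 + static_cast<unsigned long long>(s[i] - '0');
    }
    return value;
}

static_assert(parse_spaced_decimal("1 000 000") == 1'000'000,
              "same value as the apostrophe form");
```

The same loop with `'\''` in place of `' '` is, in effect, what a compiler already has to do for the apostrophe form.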
