Why can't variable names start with numbers?

Asked 4/12, 2008 at 21:32 Answered 5/8, 2018 at 7:24

Solved c++ variables programming-languages language-design variable-names

167

I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"

I couldn't come up with an answer except that some numeric literals contain letters (123456L, 123456U), which wouldn't be possible if the compiler treated everything containing alphabetic characters as a variable name.

Was that the right answer? Are there any more reasons?

string 2BeOrNot2Be = "that is the question"; // Why won't this compile?

Retorsion answered 4/12, 2008 at 21:32 Comment(15)

And why can't they have spaces in them? – Quoit 4/12, 2008 at 21:38

Re-tagged this with "c++" because this is a language limitation. It's quite possible that some languages will allow this (though I can't think of any offhand). – Kulun 4/12, 2008 at 21:41

This issue predates C++ by at least 20 years, if not back to the first macro assemblers. – Josi 4/12, 2008 at 21:50

The OP mentioned C++ specifically, but I like the new set of tags better anyway. – Kulun 4/12, 2008 at 22:14

Well, in FORTH, you can do it. AFAIK, there is a word called 0 that pushes 0 onto the stack. another one is 0= that checks whether 0 is on the stack. – Unseasonable 24/11, 2013 at 16:25

Why is this question so popular and the answers so wrong? Many languages do allow variables to start with numbers. C++ doesn't but it's just a convenient limitation that avoids certain ambiguities. Sometimes SO amazes me in all the wrong ways. – Thomasenathomasin 15/3, 2014 at 3:26

If this question were asked today on SO, it would be flagged as opinion-based and closed. Thanks for asking this. – Dower 26/6, 2015 at 2:17

@Thomasenathomasin Personally I expect that pretty much every single language limitation has a "why" question being asked somewhere, IMO that's a good thing, it means programmers are thinking about what they're doing and want to learn. – Tourer 8/1, 2018 at 13:30

@Dower Well... it's still open. IMO the POB close reason would be incorrect, because somebody, at some point in time, needed to implement this restriction, and there was a reason for it (even if it was just "I hate numbers" or "I wanted to leave early on Friday"), so that one person's answer would be the absolute truth. Hypothetically, if that person showed up to this question, or somebody happened to read their book / paper / blog / magazine article, the true answer would be found. – Tourer 8/1, 2018 at 13:33

Also, related post on SE.SE – Tourer 8/1, 2018 at 13:37

@jrh: No, the question is OK and it could have a good answer (which I could even write, but won't). The amazing thing is how many answers there are and how wrong most of them are (including the accepted answer). – Thomasenathomasin 9/1, 2018 at 23:20

@OutlawProgrammer one example is batch: this is a %valid variable name%. %2 Be Or Not 2 Be % is also valid. All the whitespaces are significant – Hewie 24/11, 2018 at 7:8

@ChristianFritz why do you remove the c++ tag? This isn't language agnostic since many languages do allow variables to start with a number, like shell scripts $1 – Hewie 24/11, 2018 at 7:10

@Quoit not in C++ but many other languages do allow that Why can't variable names have spaces in them?, Is there any language that allows spaces in its variable names, Why should identifiers not begin with a number? – Hewie 24/11, 2018 at 7:13

It is technically possible in every language, but it makes lexical analysis more complex. See en.wikipedia.org/wiki/Lexical_analysis – Diffluent 10/10, 2020 at 11:13

140

Because then a string of digits would be a valid identifier as well as a valid number.

int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";

Parthenogenesis answered 4/12, 2008 at 21:43 Comment(15)

Well, what if they said variables cannot be only numbers. Then what? – Delamination 4/12, 2008 at 21:45

It'd take me longer to come up with a regular expression for the lexer to pick up identifiers using that rule, if it's even possible, so I can see why no language has ever been implemented that way, in addition to the reasons given in other answers. – Parthenogenesis 4/12, 2008 at 21:48

you can make the rules as complex as you want, but you might regret it when you try to implement the compiler. ;-) – Mencius 4/12, 2008 at 21:49

note - I am not advocating it - just saying that that reason is way down on the list and most likely it is all just due to convention. – Quoit 4/12, 2008 at 21:50

I particularly like the ability to change numbers - "int 1 = 2; int a = 1 + 1;" would set a to 4. :-) – Funereal 4/12, 2008 at 22:15

If people are going to be silly, then "L" looks like "1" - as in l234 (that's L234) - looks like a number but is legal. If you want to write obtuse code like "17 = 497" then using "L" makes it possible. But why? -R – Billybillycock 5/12, 2008 at 0:12

This answer is actually on the right track. The real problem lies in performance. Backtracking can make well-behaved regular expressions painfully slow. – Glarus 20/5, 2009 at 19:57

If it had to be numbers+alpha, then you could still do String 0x123 = "Hello World". Unless you state that variable names are "numbers+alpha that don't parse to a valid numeric designation", and that's just silly. – Medullated 11/10, 2009 at 3:56

Some languages do support assigning on top of numbers. Those languages will allow code like assigning 3 to be 4. – Codee 15/11, 2011 at 18:52

Never mind the compiler: the people using the language need to be able to readily (at a glance) distinguish variable names from numbers. If the first character didn't tell you -- instead, if you needed to search through the rest of the word to tell if there was a non-numeric alpha somewhere in there -- the code would be harder to read. – Pasol 18/6, 2012 at 21:35

@eaolson: I've worked with an assembler which applied that rule to hex numbers which started with A-F and ended with h. Tripped me up the first time I tried to define a label to point to the music data for Bach's Two Part Invention #13 (logical name? Bach). – Veneering 20/11, 2012 at 17:52

This is wrong. The question was about variables starting with numbers, not consisting entirely of numbers. – Thomasenathomasin 15/3, 2014 at 3:21

"Unless you state that variable names are 'numbers+alpha that don't parse to a valid numeric designation', and that's just silly." But languages do exactly that for keywords: A variable name is a sequence of letters that don't parse to a valid reserved word. – Holarctic 19/12, 2016 at 15:51

@Holarctic True, but the list of reserved words is finite, whereas the list of valid numeric designators is infinite, or nearly so. – Roundworm 30/11, 2017 at 16:38

This is the accepted answer and it's dead wrong. I write compilers, and it's mind-numbingly easy to allow an identifier to be a string of characters containing at least one letter, regardless of what it starts with. – Thomasenathomasin 9/1, 2018 at 23:18

129

Well think about this:

int 2d = 42;
double a = 2d;

What is a? 2.0? or 42?

Hint, if you don't get it, d after a number means the number before it is a double literal

Delamination answered 4/12, 2008 at 21:38 Comment(3)

This is actually a [relatively] late coming notation ("d" for "double"), C89 standard IIRC. Leading numerics in identifiers aren't possible if this construct is in the language, but that is not the reason numerics can't start an identifier. – Josi 4/12, 2008 at 21:47

d isn't a valid floating literal suffix in C++. Floating literals are doubles by default, you can use f or l if you need a float or a long double literal. – Scourings 4/12, 2008 at 21:54

It is for Java, and while the original question was for C++, it also applies to many other languages, like Java. But I agree. This isn't the original reason why identifiers can't start with numbers. – Delamination 4/12, 2008 at 22:11

It's a convention now, but it started out as a technical requirement.

In the old days, parsers of languages such as FORTRAN or BASIC did not require the use of spaces. So, basically, the following are identical:

10 V1=100
20 PRINT V1

and

10V1=100
20PRINTV1

Now suppose that numeral prefixes were allowed. How would you interpret this?

101V=100

as

10 1V = 100

or as

101 V = 100

or as

1 01V = 100

So, this was made illegal.

Ema answered 26/4, 2011 at 16:58 Comment(3)

Minor nit: line numbers had to be in columns 1-6, and executable code following column 8. On the other hand DO 10 I=1,50 could be ambiguously parsed as DO1 0I=1,50 [incidentally, if one uses a period instead of a comma, the statement becomes an assignment to a floating-point variable named DO10I]. – Veneering 20/11, 2012 at 17:54

Interesting explanation! That makes sense for older languages, still makes me wonder why we've still continued the design choice for languages like Python or JavaScript or R. – Alanis 10/5, 2015 at 4:0

I definitely remember this with BASIC and feel this is probably the most valid practical reason of the practice. Technically though, I vaguely remember that it may actually go back to early assembly language. I'm unsure what assembler though, and I very well could be wrong. – Puglia 20/8, 2018 at 19:35

Because backtracking is avoided in lexical analysis while compiling. A variable like:

Apple;

the compiler will know it's an identifier right away when it meets the letter 'A'.

However a variable like:

123apple;

the compiler won't be able to decide whether it's a number or an identifier until it hits 'a', and it needs backtracking as a result.
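To make the point concrete, here is a minimal C++ sketch, not taken from any real compiler. It assumes a hypothetical rule where a token counts as an identifier if it contains at least one letter or underscore, and as a number otherwise. The function name charsNeededToClassify is made up for illustration. It reports how many characters a scanner has to read before it can be sure which class the token belongs to.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical rule (an assumption for this sketch): a token is an identifier
// if it contains at least one letter or '_', otherwise it is a number.
// Returns how many characters must be read before the class is certain.
std::size_t charsNeededToClassify(std::string_view token) {
    for (std::size_t i = 0; i < token.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(token[i]);
        if (std::isalpha(c) || c == '_') {
            return i + 1;  // first letter seen: it must be an identifier
        }
    }
    return token.size();  // all digits: only known to be a number at the end
}
```

Under that assumed rule, Apple is settled after one character, while 123apple is only settled at the fourth.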

Vaccination answered 13/10, 2014 at 11:16 Comment(1)

Thinking back to my compiler design class, this answer is exactly right! Kudos – Machellemachete 13/10, 2014 at 22:26

Compilers/parsers/lexical analyzers were a long, long time ago for me, but I think I remember there being difficulty in unambiguously determining whether a numeric character in the compilation unit represented a literal or an identifier.

Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.

This goes way back - before special notations to denote storage or numeric base.

Josi answered 4/12, 2008 at 21:43 Comment(0)

I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.

I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy-to-remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [A-Za-z0-9_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
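A rough C++ sketch of that "everything that is not a literal" rule, just to show the odd split it produces. The function names isHexLiteral and isIdentifierUnderRelaxedRule are hypothetical, and the sketch only knows about decimal and 0x-style hex literals. Keywords, suffixes and floating-point forms are left out on purpose.

```cpp
#include <cctype>
#include <string_view>

// True if token looks like a C-style hex literal: "0x" or "0X" followed by
// one or more hex digits (integer suffixes are ignored in this sketch).
bool isHexLiteral(std::string_view token) {
    if (token.size() < 3 || token[0] != '0' ||
        (token[1] != 'x' && token[1] != 'X')) {
        return false;
    }
    for (char ch : token.substr(2)) {
        if (!std::isxdigit(static_cast<unsigned char>(ch))) return false;
    }
    return true;
}

// Hypothetical relaxed rule: any run of letters, digits and '_' is an
// identifier unless it is all digits or a hex literal.
bool isIdentifierUnderRelaxedRule(std::string_view token) {
    if (token.empty()) return false;
    bool hasNonDigit = false;
    for (char ch : token) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (!std::isalnum(c) && c != '_') return false;
        if (!std::isdigit(c)) hasNonDigit = true;
    }
    return hasNonDigit && !isHexLiteral(token);
}
```

With these assumptions, 0xdeadpork and 2BeOrNot2Be pass as identifiers while 0xdeadbeef and 12345 do not, which is exactly the hard-to-remember split described above.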

When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned a lot more, it doesn't seem so bad, and I finally got around to learning it right.

Taranto answered 25/5, 2009 at 22:29 Comment(1)

LOL - "The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P." – Ambivert 12/5, 2011 at 16:9

Variable names cannot start with a digit, because it can cause some problems like below:

int a = 2;
int 2 = 5;
int c = 2 * a;

What is the value of c? Is it 4, or is it 10?

another example:

float 5 = 25;
float b = 5.5;

Is the first 5 a number, or is it an object (followed by the . operator)? There is a similar problem with the second 5.

Maybe there are some other reasons. So, we shouldn't use a digit at the beginning of a variable name.

Whitewall answered 13/1, 2013 at 23:56 Comment(1)

Even if one required that identifiers contain at least one non-digit character, one would also either have to require that numeric formats that contain letters must also contain a non-alphanumeric character [e.g. require 0x1234 to be written as $1234 and 1E6 to be written as 1.E6 or 1.0E6] or else have an odd combination of legal and illegal identifier names. – Veneering 16/6, 2013 at 21:33

It's likely a decision that came about for a few reasons. When you're parsing a token, you only have to look at the first character to determine whether it's an identifier or a literal, and then send it to the correct function for processing. So that's a performance optimization.
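As a minimal sketch of that first-character dispatch in C++ (the Token struct and the lexOne helper are invented for illustration, and only plain decimal numbers are handled):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

enum class TokenKind { Identifier, Number, Unknown };

struct Token {
    TokenKind kind;
    std::string text;
};

// Reads one token starting at position pos in source (hypothetical helper).
// The first character alone selects the branch; nothing is re-read.
Token lexOne(std::string_view source, std::size_t pos = 0) {
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(source[i]); };
    if (pos >= source.size()) return {TokenKind::Unknown, ""};
    std::size_t end = pos;
    if (std::isalpha(at(pos)) || at(pos) == '_') {
        // Identifier branch: letters, digits and '_' may follow.
        while (end < source.size() && (std::isalnum(at(end)) || at(end) == '_')) ++end;
        return {TokenKind::Identifier, std::string(source.substr(pos, end - pos))};
    }
    if (std::isdigit(at(pos))) {
        // Number branch: plain decimal digits only in this sketch.
        while (end < source.size() && std::isdigit(at(end))) ++end;
        return {TokenKind::Number, std::string(source.substr(pos, end - pos))};
    }
    return {TokenKind::Unknown, std::string(1, source[pos])};
}
```

Given "123apple;", this sketch commits to the number branch at the first digit and returns 123, leaving apple to be read as a separate token.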

The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.

There are also stylistic implications: identifiers are supposed to be mnemonics, so words are much easier to remember than numbers. When a lot of the original languages were being written, setting the styles for the next few decades, their designers weren't thinking about substituting "2" for "to".

Welcy answered 4/12, 2008 at 21:50 Comment(0)

The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.

Hilarius answered 12/1, 2012 at 5:4 Comment(0)

COBOL allows variables to begin with a digit.

Elissaelita answered 14/12, 2014 at 11:10 Comment(0)

Use of a digit to begin a variable name makes error checking during compilation or interpretation a lot more complicated.

Allowing use of variable names that began like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The added complexity added to the language parser may not justify this feature.

As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.

Physiology answered 4/12, 2008 at 21:55 Comment(1)

It isn't that difficult. It makes the lexical phase more difficult, that's all. Of course, back when I took compilers, I was told that lexical scanning could take over a quarter of the total compilation time. – Morena 1/5, 2009 at 21:13

As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.

That said, pretty much whenever a language doesn't allow variable names to begin with numbers, it is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you try to define it as permissible, they will give you parsing conflicts.

OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is likely not to resemble any popular conventional language very much, if at all.

For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.

Energetic answered 4/12, 2008 at 22:26 Comment(2)

Actually, there are several languages that allow you to have characters marking identifiers. They're called "sigils" and you have them in Perl and PHP. – Glarus 20/5, 2009 at 20:2

Except you still aren't allowed to begin a variable name in PHP with a number - the language rules forbid it. :-) But you can in Qompose for exactly the same reason. – Energetic 20/5, 2009 at 23:17

Because if you allowed keywords and identifiers to begin with numeric characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).

Whilst answered 26/4, 2011 at 16:57 Comment(1)

The lexing process is rarely the bottleneck. Sure, it makes the regex for identifier tokens more complex, but they can still be super-fast DFAs. The runtime of those is peanuts compared to most other tasks compilers have to accomplish. – Podgy 26/4, 2011 at 17:0

C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:

0x, 2d, 5555

Brotherton answered 4/12, 2008 at 22:42 Comment(1)

This restriction holds in languages where that kind of syntax isn't allowed though. – Glarus 20/5, 2009 at 20:0

One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.

Wasn't it Dijkstra who said that the "most important aspect of any tool is its effect on its user"?

Copepod answered 5/11, 2009 at 21:46 Comment(0)

Compiler textbooks typically list six phases, plus a symbol table shared by all of them:

Lexical analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table

Backtracking is avoided in the lexical analysis phase while compiling a piece of code. With a variable like Apple, the compiler knows it's an identifier as soon as it meets the letter ‘A’ in the lexical analysis phase. However, with a variable like 123apple, the compiler can't decide whether it's a number or an identifier until it hits ‘a’, and it would need to backtrack in the lexical analysis phase to identify it as a variable. The compiler does not support that.

When you’re parsing the token you only have to look at the first character to determine if it’s an identifier or literal and then send it to the correct function for processing. So that’s a performance optimization.

Overrate answered 5/8, 2018 at 7:8 Comment(0)

Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.

Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
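The lookup order described in the parentheses can be sketched in C++ roughly as follows. This is only an illustration, not how any particular Forth system is written. The name resolveToken and the use of a std::set as the dictionary are assumptions, and only unsigned decimal integers count as numbers here.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

enum class Resolution { Word, Number, Error };

// Resolve one whitespace-delimited token: dictionary lookup first,
// number conversion second, error otherwise.
Resolution resolveToken(const std::set<std::string>& dictionary,
                        const std::string& token) {
    if (dictionary.count(token) > 0) {
        return Resolution::Word;  // defined words win, even if they look numeric
    }
    const bool allDigits = !token.empty() &&
        std::all_of(token.begin(), token.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    return allDigits ? Resolution::Number : Resolution::Error;
}
```

With a dictionary containing 2, 0= and dup, the token 2 resolves as a word, 17 falls through to the number branch, and 2dup is an error because it is neither.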

Morena answered 1/5, 2009 at 21:20 Comment(1)

The thing is that Forth doesn't really have a very sophisticated parser. Really, all it cares about is if an identifier is between two sets of whitespace. – Glarus 20/5, 2009 at 19:55

Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.

There's a much better explanation of this here.

Glarus answered 20/5, 2009 at 19:54 Comment(1)

How would one design a regular expression to allow a variable named ifq or doublez but not if or double? The fundamental problem with allowing identifiers to start with digits would be that there are existing forms of hex literals and floating-point numbers which consist entirely of alphanumeric characters (languages that use something like $1234 or h'1234 instead of 0x1234, and require numbers like 1E23 to include a period, could avoid that issue). Note that attempts at regex-parsing C can already get tripped up by things like 0x12E+5. – Veneering 20/11, 2012 at 17:58

It is easier for a compiler to identify a variable using ASCII at a memory location rather than a number.

Sedative answered 22/11, 2013 at 23:35 Comment(0)

I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.

The question is akin to asking why the King can't move four spaces at a time in Chess. It's because in Chess that is an illegal move. Can it in another game? Sure. It just depends on the rules being played by.

Wikiup answered 4/12, 2008 at 21:54 Comment(5)

Except that C++ was invented recently by people who are still alive. We can ask them why they chose the things they did, and rejected the alternatives. Same doesn't apply to chess. – Colorado 4/12, 2008 at 22:49

But that is not the point I'm making. It's analogy as to why there can't be numbers at the start of variable names, and the simplest answer is, because rules of the language don't allow it. – Wikiup 5/12, 2008 at 1:0

Sure, but I don't think the questioner is an imbecile. He's probably worked out that far already by himself. The question IMO is "why don't the rules of the language allow it?". He wants to bridge the gap between knowing the rules and understanding them. – Colorado 5/12, 2008 at 15:55

Yeah, upon reflecting on this, I realized where you were going. You have a point. I guess I was applying Occam's razor a little too freely and assumed there is no real answer to why except that variables don't start with numbers because they're not numbers. – Wikiup 5/12, 2008 at 16:8

I'm not saying you're wrong, mind, occasionally the decisions of the C++ standards bodies do surpass mortal understanding, and you end up with "because they had to decide something and they decided this". But there is at least a question there to be asked :-) – Colorado 5/12, 2008 at 17:2

Originally it was simply because variable names are easier to remember (you can give them more meaning) as strings rather than numbers, although numbers can be included within the string to enhance its meaning or to allow the same variable name to be reused with a separate but closely related meaning or context. For example, loop1, loop2, etc. would always let you know that you were in a loop and/or that loop2 was a loop within loop1. Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember? However, if the language uses something to denote that it is not just text or numbers (such as the $ in $address), it really shouldn't make a difference, as that would tell the compiler that what follows is to be treated as a variable (in this case). In any case, it comes down to what the language designers want to use as the rules for their language.

Hectare answered 22/2, 2012 at 21:21 Comment(0)

The variable may be considered as a value also during compile time by the compiler so the value may call the value again and again recursively

Tunnel answered 8/4, 2015 at 13:38 Comment(0)

Backtracking is avoided in the lexical analysis phase while compiling a piece of code. With a variable like Apple; the compiler knows it's an identifier as soon as it meets the letter ‘A’ in the lexical analysis phase. However, with a variable like 123apple; the compiler can't decide whether it's a number or an identifier until it hits ‘a’, and it would need to backtrack in the lexical analysis phase to identify it as a variable. The compiler does not support that.

Reference

Jewel answered 23/4, 2018 at 16:10 Comment(0)

There may be nothing wrong with it when declaring the variable, but there is some ambiguity when you try to use that variable somewhere else, like this:

let 1 = "Hello world!"
print(1)
print(1)

print is a generic method that accepts all types of variable, so in that situation the compiler does not know which 1 the programmer refers to: the integer value 1 or the variable 1 that stores a string value. It might be better for the compiler to allow such a definition but, when the ambiguous name is used, to raise an error that suggests how to fix it and clear up the ambiguity.

Berthoud answered 5/8, 2018 at 7:24 Comment(0)
