Is the string "1a" an error for a lexical analyser or not?

Asked 29/5, 2013 at 16:52 Answered 1/6, 2013 at 0:3

Solved java programming-languages lexical-analysis

I am writing a basic lexical analyser in Java for my semester project, and I disagree with my subject teacher on one concept.

My view is that, in general, if an input like "1a" is given to a lexical analyser, the output should be:

"<Number><Identifier>"

But my teacher says it should flag this as an error: instead of treating it as a number and an identifier, it should flag the whole string (i.e. "1a") as an error. This is because (as he says) identifiers cannot start with a number.

On the contrary, I think it should be the responsibility of the next stage of the compiler (the syntax analyser) to decide whether something is a valid identifier. I know he is right that identifiers cannot start with a number, but I need closure on whether the lexical analyser should be the one deciding that.

I will really appreciate your help. Thank you

Fancyfree answered 29/5, 2013 at 16:52 Comment(0)

A lexical analyzer should deal with which kinds of tokens are legal and with dividing the text into tokens. It will report an error if a string cannot form a valid token.

The syntax analyzer only deals with the structure of the program once the tokens have been determined. It will give an error if the tokens cannot be parsed according to the given grammar.

So your teacher is correct. Determining whether an identifier is legal falls under lexical analysis.
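A minimal sketch of how a lexer could reject "1a" as one malformed lexeme, assuming a toy language with only integer literals and C-style identifiers. The class name `StrictLexer`, the token labels, and the extra "bad" rule are all hypothetical and not taken from the question's project:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class, method, and token names are illustrative only.
class StrictLexer {
    // Alternatives are tried in order, so the "bad" rule comes first:
    // a digit run immediately followed by a letter or underscore is
    // consumed as a single malformed lexeme.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<bad>[0-9]+[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<num>[0-9]+)"
        + "|(?<id>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<ws>\\s+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // No rule matches here: report the single offending character.
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
                continue;
            }
            if (m.group("bad") != null) {
                tokens.add("ERROR(" + m.group() + ")");
            } else if (m.group("num") != null) {
                tokens.add("NUMBER(" + m.group() + ")");
            } else if (m.group("id") != null) {
                tokens.add("IDENTIFIER(" + m.group() + ")");
            } // whitespace is matched but not emitted
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1a"));  // [ERROR(1a)]
        System.out.println(tokenize("1 a")); // [NUMBER(1), IDENTIFIER(a)]
    }
}
```

Removing the "bad" alternative would make the same lexer split "1a" into a number and an identifier, so the behaviour is a design decision in the token rules rather than something forced by the technique.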

Disappear answered 29/5, 2013 at 17:0 Comment(5)

Why can't the lexical analyser read '1', stop at 'a', flag '1' as a number, then start over, read 'a', and flag it as an identifier? Isn't 'a' a separator here for the NFA that deals with numbers? – Fancyfree 29/5, 2013 at 17:9

I'd say tokenization happens with predefined separators, usually with spaces. For the lexical analyzer to separate the string into 1 and a we'd need to define additional rules for doing this. I think what your teacher has in mind is that 1a is one token and as it doesn't match a regular expression for an identifier, the lexical analyzer will give an error. – Disappear 29/5, 2013 at 17:26

Depends how the lexical analyzer is written. Using flex it is actually very easy to generate a lexical analyzer that would parse 1a as a number followed by an identifier. – Macule 29/5, 2013 at 17:31

Yes, it depends on the analyzer. There's an advantage in detecting errors early, though. If a number followed by an identifier isn't valid input and you can detect this at the lexical analysis stage, then it is preferable to do so. – Disappear 29/5, 2013 at 17:39

Thank you very much. Your responses really made me get it. I need to define a list of separators for each of my NFAs. – Fancyfree 29/5, 2013 at 17:54

I agree with your teacher: recognizing which identifiers are valid is the lexical analyser's job. See http://en.wikipedia.org/wiki/Lexical_analysis

Rustie answered 29/5, 2013 at 16:59 Comment(0)

Detecting this in the parser would only work for grammars where a number followed by an identifier happens to be syntactically invalid. If 1 a was valid syntax in your language, you would have to handle this in the lexer because the parser can't distinguish between 1a (no whitespace) and 1 a (with whitespace).
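To illustrate why the parser cannot tell the two inputs apart, here is a minimal sketch of a lenient, longest-match scanner that discards whitespace. The class name `LenientLexer` and the token labels are hypothetical. Under these assumptions "1a" and "1 a" produce exactly the same token list, so nothing downstream can recover the difference:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a scanner with no rule against digit-then-letter.
class LenientLexer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                i++; // whitespace only separates tokens; it is never emitted
            } else if (Character.isDigit(c)) {
                // Consume the longest run of digits, then stop.
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add("NUMBER(" + input.substring(start, i) + ")");
            } else if (Character.isLetter(c)) {
                // Consume a letter followed by any letters or digits.
                while (i < input.length()
                        && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add("IDENTIFIER(" + input.substring(start, i) + ")");
            } else {
                tokens.add("ERROR(" + c + ")");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both lines print [NUMBER(1), IDENTIFIER(a)].
        System.out.println(tokenize("1a"));
        System.out.println(tokenize("1 a"));
    }
}
```

If the grammar accepts a number followed by an identifier anywhere, a parser fed by this scanner would accept "1a" silently, which is the situation the answer above warns about.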

Why not do this in the lexer? The lexer's job is to make the parser's job easier. Any work it can do to simplify your parser without adding a lot of complexity to the lexer itself is a good idea.

Madelainemadeleine answered 1/6, 2013 at 0:3 Comment(0)

The reason to flag it as an error is that languages often use suffixes on numbers: in C, 1L is the value 1 of type long instead of the default type int. You also want to be able to add suffixes to the language later. Consider your 1a. At first this would be lexed as the int value 1 followed by the identifier a. But suppose the language designer later decides to start using a as a suffix on numbers. Suddenly 1a becomes a single token.

There is also a special case for 1a: it could be meant as a hexadecimal number where the writer forgot the required prefix or suffix, i.e. 0x1a in C or 1ah in certain assemblers.
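As a rough sketch of the suffix argument, the classifier below assumes the input has already been cut into whole lexemes and uses a hypothetical rule set (C-style 0x hex prefix, an L suffix for long). The class name `LexemeClassifier` and the token labels are invented for illustration. Because "1a" is rejected today, a later version could give it a meaning without changing how any currently valid program is tokenized:

```java
// Hypothetical sketch: classifies one already-separated lexeme.
class LexemeClassifier {
    static String classify(String lexeme) {
        // String.matches requires the whole lexeme to match the pattern.
        if (lexeme.matches("0[xX][0-9a-fA-F]+")) {
            return "HEX_NUMBER";   // e.g. 0x1a
        }
        if (lexeme.matches("[0-9]+[lL]")) {
            return "LONG_NUMBER";  // e.g. 1L, a number with a type suffix
        }
        if (lexeme.matches("[0-9]+")) {
            return "INT_NUMBER";   // e.g. 1
        }
        if (lexeme.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            return "IDENTIFIER";   // e.g. a1
        }
        // Anything else, including 1a, is not a valid token in this rule set.
        return "ERROR";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1", "1L", "0x1a", "a1", "1a"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```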

Macule answered 29/5, 2013 at 17:25 Comment(0)
