Using flex in c and regular expressions

Asked 6/2, 2015 at 14:52 Answered 6/2, 2015 at 15:30

c regex compiler-construction flex-lexer lexical-analysis

I am trying to create a lexical analyzer for a compiler.But I have a problem using regular expressions to find things like keywords and real numbers.. for example some definitions :

id         [aA-zZ][aA-zZ-0-9_]* 

keyword    if|else|when|while

integer    [0-9]+

real       integer\.integer

..There are some problems though,the analyzer cant get a keyword for example instead if i give the word 'else' it sees it as a id(I get a warning like rule cannot be matched too.

Also if I try to give a real number for example 1.2 the linker sees it as integer delimeter integer instead as a real.I am not good at regural expressions language though,..I thought for the real/integer distinction to put a rule like ("Read a number that doesn't end with . and it's an integer , else it's a number") but how can I put that in reg. language.

Mastiff answered 6/2, 2015 at 14:52 Comment(0)

The reason Flex isn't detecting the desired strings is there are ambiguities when reading input. The word 'else' corresponds both to regex "else" and [a-zA-Z][a-zA-Z0-9]*, but the last one was written first so Flex decides input must fit this regular expression.

Flex is a lexical analyzer that reads input file and check the current string with any of regular expressions you want it to check for. You must always remember two fundamental rules driving its string-matching system:

Longest Best Match;
Priority rules;

'Longest Best Match' tells Flex that read input must match the longest pattern possible; LBM rules work better with priority.

Priority rules tell Flex that if input match more that one regular expression, only the first one must be considered.

When writing code, keywords must be considered as keywords and not as IDs, then you should write keywords' regex before anything else: if something doesn't match 'keyword' regular expression, it means we are reading something else, like an id or the name of a function.

I would write something like this:

keyword     "if"|"else"|"when"|"while"
id          [a-zA-Z][a-zA-Z0-9_-]*
nat         [0-9]|([1-9][0-9]*)
integer     [+|-]*{nat}
real        {integer}[.]{integer}

Keyword first; if we aren't reading a keyword, it could be an id; if there are only numbers, let's see what kind of number it is. Priority rules will let Flex read 'else', 'if' and other keywords as 'keyword' tokens, and Longest best match will make sure a number like 1.2 will be considered a 'real' instead of two 'nat' separated by a dot.

Nombles answered 6/2, 2015 at 15:30 Comment(4)

The order of definitions is irrelevant. Priority is determined by the order of the rules in the rules section. IMHO, definitions are overused by novices; it's probably better to not use them at all until you have some basic understanding of the tool. – Empirical 6/2, 2015 at 22:51

I partially agree with you. Priority is relevant in the rules section only, I might have been clear about this in my answer. But using some regular definitions instead of no definitions at all may work fine but makes the code harder to read and edit when mistakes are found. – Nombles 6/2, 2015 at 22:56

Your answered rearranges the definitions, which implies that that will make a difference. That's highly misleading and I strongly suggest you edit the answer. As far as definitions go, they might help if you use them more than once, but that's pretty rare. I find [[:digit:]]+ easier to read than {integer}, for example, precisely because I don't need to go looking for the definition in case it was something less obvious. I'm not saying they're useless; just that they are overused. – Empirical 6/2, 2015 at 23:0

(As an example, your {real} is wrong because 12.-17 is not a valid number; had you just used [[:digit:]]+, you wouldn't have created that bug, which was not immediately visible. Just sayin'.) – Empirical 6/2, 2015 at 23:2

When the same lexeme can be matched by more than one rule, lex chooses the one that appears first in the file. So your rule for keywords must come before that for identifiers.

The problem with the rule integer.integer isn't that it can't be disambiguated from your integer rule - in fact it can be disambiguated trivially as there is no overlap at all. The problem is that it matches the string "integer", followed by any character, followed by the string "integer". You can't reference other rules within a rule. You could create a definition and reference that using curly braces or you could just write [0-9]+ '.' [0-9]+.

Lester answered 6/2, 2015 at 15:5 Comment(0)

Recommended topics

Hot tags