How to make lex/flex recognize tokens not separated by whitespace?
Asked Answered
I

2

6

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.

A simplified version of the code I have:

%{
#include <stdio.h>
%}

%option main warn debug

%%

if      |
then    |
else    printf("keyword: %s\n", yytext);

[[:digit:]]+    printf("number: %s\n", yytext);

[[:alpha:]][[:alnum:]]*     printf("identifier: %s\n", yytext);

[[:space:]]+    // skip whitespace
[[:^space:]]+   { printf("ERROR: %s\n", yytext); exit(1); }

%%

When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:

number: 39
keyword: if

(I.e. the same as if I entered 39 if as the input.)

Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.

UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".

Ilke answered 15/4, 2013 at 23:33 Comment(9)
a) Have you run your simplified version? b) what does it do that is wrong?Reexamine
@IraBaxter Sorry, seems I forgot to be explicit about my test case while lost in the speculation in the last paragraph. The answers are a) yes; and b) reports the lexer error instead of two tokens. (I've also added them into the question.)Ilke
Ah. OK, yes, your "^space" rule will eat any sequence of non-space, and thus consume "39if". Secret: avoid rules whose regexes overlap, unless the longer rule comes safely first. In your case, I'd use (I'm not a lex-pert) something to replace :^space: that was "not a digit, not a letter, not a space". ...Reexamine
With really good lexical specifications, you can write things like [any]*-[digit]+-[alpha][alnum]+-[space]+ to easily specify anything that doesn't look like your legal tokens. I don't believe lex will let your write this.Reexamine
@IraBaxter Unfortunately, for the assignment, I'm stuck to lex and a requirement that I'd never practically have to bother with. (If I ever actually had to write a language I'd probably stick to ANTLR, based on the intuition that if what I'm doing can't be LL(1) I'm the wrong person to be doing it anyway.)Ilke
What should happen to white space? What tokens should be returned for 39 if (i.e. separated by white space), or is that illegal?Morita
@Bryan The second-to-last rule is to ignore whitespace. It's legal and should return the same as if they were concatenated. (Unless of course the concatenated word has a longer match, e.g. is a valid identifier.)Ilke
Then you should follow Ira Baxter's advice and just loose the last rule all together. Lex will always give the longest match and otherwise the first match. If white space is just ignored, then the absence of white space is not going to hurt.Morita
If you want to handle inputs like ifa<bthenx=4 (no spaces between names and keywords), you'll need something more complex.Steelworks
R
4

You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:

%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
.                 { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }

Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)

Look up start conditions in the flex manual for more info about %x and BEGIN

Redeemer answered 16/4, 2013 at 3:22 Comment(3)
I read up on start conditions but couldn't really put the pieces together, thanks!Ilke
It's a lot simpler to just return yytext[0] to the parser in the . rule and let the parser's error recovery deal with it. No start states required. This also eliminates all the rules for single special characters.Quittor
@EJP: The OP specifically states that one of the requirements is that the lexer must exit(1) when it encounters invalid input. There's no indication that there is a parser at all, with or without error recovery.Redeemer
R
1

This is a full solution, incorporating the suggestion of @rici. I noted a minor error in the call to yyless and also gave a solution that meets the needs to the original poster.

Put this into a file yyin

%x LEXING_ERROR
%%
\n      { yylineno++; }  // Track line number.
if      |
then    |
else    printf("keyword: %s\n", yytext);

[[:digit:]]+    printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]*     printf("identifier: %s\n", yytext);
[[:space:]]    ;// skip whitespace

.                 { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }
%%
int yywrap(void) { return 1;}

int main(void) {
yylex();
return 0;
}

To run, go to terminal, cd to the directory holding yyin and run

 flex yyin       
 gcc lex.yy.c -ll  
 ./a.out <atest.txt

where atest.txt is the text file that you are parsing.

To explain: first line '%x LEXING_ERROR' is a declaration that the term LEXING_ERROR is for an exclusive start condition. See the link in Rici's answer. The action for each new line is to update the yylinenum.

The final condition to capture any character, the dot, then triggers the start condition LEXING_ERROR. Note the call to yyless, where the argument is 0. Passing 1 points to the wrong character.

The skip whitespace has to be a single character match. If multicharacter it will swallow some new lines and the line update does not work correctly.

In this form, the original posters problem is solved. The single character exception produces a nice error message with the correct line number.

I don't know if the details might differ on another platform. This runs in terminal on a Mac running Sonoma 14.4.1 with gcc and flex installed.

Rockwood answered 4/7, 2024 at 2:25 Comment(1)
I wonder if it would be better to use yymore() or unput() instead of yyless(). I've never used any of them but yyless() strikes me as designed to some slightly different purpose. (some docs)Restive

© 2022 - 2025 — McMap. All rights reserved.