Semicolon insertion à la Google Go with flex

Asked 31/5, 2012 at 2:32 Answered 26/3, 2021 at 11:15

I'm interested in adding Go-style semicolon insertion to my flex file.

From the Go documentation:

Semicolons

Like C, Go's formal grammar uses semicolons to terminate statements; unlike C, those semicolons do not appear in the source. Instead the lexer uses a simple rule to insert semicolons automatically as it scans, so the input text is mostly free of them.

The rule is this. If the last token before a newline is an identifier (which includes words like int and float64), a basic literal such as a number or string constant, or one of the tokens
break continue fallthrough return ++ -- ) }
the lexer always inserts a semicolon after the token. This could be summarized as, “if the newline comes after a token that could end a statement, insert a semicolon”.

A semicolon can also be omitted immediately before a closing brace, so a statement such as
go func() { for { dst <- <-src } }()
needs no semicolons. Idiomatic Go programs have semicolons only in places such as for loop clauses, to separate the initializer, condition, and continuation elements. They are also necessary to separate multiple statements on a line, should you write code that way.

One caveat. You should never put the opening brace of a control structure (if, for, switch, or select) on the next line. If you do, a semicolon will be inserted before the brace, which could cause unwanted effects. Write them like this
if i < f() {
    g()
}
not like this
if i < f()  // wrong! 
{           // wrong!
    g()     // wrong!
}           // wrong!

How would I go about doing this? In particular, how can I insert tokens into the stream, and how can I see the last token that was matched to decide whether an insertion is appropriate?

I am using bison too, but Go seems to handle semicolon insertion entirely in its lexer.

Apology answered 31/5, 2012 at 2:32 Comment(8)

I'm personally confused. Are you implementing semicolon injection in a Go scanner/lexer/tokenizer, or is it for some other language? In the latter case: which language, or what rules for injection are to be implemented? I'm asking all of this because the specific implementation is heavily dependent on all of those circumstances. – Throughcomposed 3/6, 2012 at 7:40

No. What I'd like is to implement the ability to insert semicolon tokens within my gnu-flex lexer for a language not Go—but that doesn't matter. – Apology 3/6, 2012 at 13:30

My confusion continues. GNU flex is a tool that converts .l files into, usually, C code. Semicolon injection is done in code (i.e. not with some regexp). Thus, AFAICS, the problem is not flex-related. I might be wrong, admittedly ;-) Anyway, the results of lexing are lexemes and/or tokens. If the rules for injection are clear, then the client/wrapper of the lexer can implement e.g. a state machine which in some states doesn't call the lexer but returns the injection instead. Such an example in Go: code.google.com/p/godeit/source/browse/parser/token.go#291 – Throughcomposed 3/6, 2012 at 13:41

I'd like to insert extra tokens from my flex grammar. So, in pseudo-code, '}' { if (LOOKAHEAD == '\n') INSERT_TOKEN(';'); return '}' } – Apology 3/6, 2012 at 14:2

Well, that's the pattern/rule action, not the (lexical) grammar per se. What I said previously still applies: the solution is in writing code, not REs. It doesn't matter much whether that code is in some caller function or in the rule's action; that's only an implementation detail/design choice about the approach. – Throughcomposed 3/6, 2012 at 14:47

This is a question about the implementation details/design. – Apology 3/6, 2012 at 14:50

It seems like this should be handled by altering your grammar to make semicolons optional in certain cases instead of injecting fake tokens in the input. – Everara 4/6, 2012 at 21:1

@Craig, agreed, but that might require exposing the newlines in the grammar. Just ignoring them in the lexer could be more convenient. – Isagoge 4/6, 2012 at 22:8

You could pass lexer result tokens through a function that inserts semicolons where necessary. Upon detection of the need to insert, the next token can be put back to the input stream, basically lexing it again in the next turn.

Below is an example that inserts a SEMICOLON before a newline, when it follows a WORD. The bison file "insert.y" is this:

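%{
#include <stdio.h>
#include <stdlib.h>   /* free() */

int yylex(void);      /* provided by the flex-generated scanner */
int yyparse(void);

void yyerror(const char *str) {
  printf("ERROR: %s\n", str);
}

int main(void) {
  yyparse();
  return 0;
}
%}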
%union {
  char *string;
}
%token <string> WORD
%token SEMICOLON NEWLINE
%%
input: 
     | input WORD          {printf("WORD: %s\n", $2); free($2);}
     | input SEMICOLON     {printf("SEMICOLON\n");}
     ;
%%

and the lexer is generated by flex from this:

%{
#include <string.h>
#include "insert.tab.h"
int f(int token);
%}
%option noyywrap
%%
[ \t]          ;
[^ \t\n;]+     {yylval.string = strdup(yytext); return f(WORD);}
;              {return f(SEMICOLON);}
\n             {int token = f(NEWLINE); if (token != NEWLINE) return token;}
%%
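int insert = 0;  /* nonzero when the last token returned was a WORD */

/* Every token passes through f() on its way to the parser. */
int f(int token) {
  if (insert && token == NEWLINE) {
    unput('\n');       /* rescan the newline on the next call */
    insert = 0;
    return SEMICOLON;  /* deliver the inserted semicolon first */
  } else {
    insert = token == WORD;
    return token;
  }
}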

For input

abc def
ghi
jkl;

it prints

WORD: abc
WORD: def
SEMICOLON
WORD: ghi
SEMICOLON
WORD: jkl
SEMICOLON

Unputting a non-constant token requires a little extra work - I have tried to keep the example simple, just to give the idea.
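
As a rough illustration of that extra work, one alternative to unput() is a wrapper with a one-token pending slot: the wrapper holds back the real token (including its text) while it hands out the inserted semicolon. The sketch below is plain C and independent of flex; the token codes, `struct token`, `raw_next` and `filtered_next` are all made-up names for this example, and the rule "insert before a closing brace that follows a WORD" is only an assumption chosen to show the pending slot at work.

```c
#include <stddef.h>

/* Hypothetical token codes, for this sketch only. */
enum { T_EOF = 0, T_WORD, T_SEMICOLON, T_NEWLINE, T_LBRACE, T_RBRACE };

struct token { int type; const char *text; };

/* Stand-in for the raw scanner: replays tokens from an array. */
struct raw_lexer { const struct token *toks; size_t pos, len; };

static struct token raw_next(struct raw_lexer *lx)
{
    struct token eof = { T_EOF, "" };
    if (lx->pos < lx->len)
        return lx->toks[lx->pos++];
    return eof;
}

struct filter {
    struct raw_lexer *lx;
    int last;              /* type of the token returned most recently */
    int has_pending;       /* nonzero while a token is being held back */
    struct token pending;
};

/* Returns the next token, inserting semicolons where the rules ask for one. */
static struct token filtered_next(struct filter *f)
{
    const struct token semi = { T_SEMICOLON, ";" };
    struct token t;

    if (f->has_pending) {              /* deliver the token held back earlier */
        f->has_pending = 0;
        f->last = f->pending.type;
        return f->pending;
    }
    for (;;) {
        t = raw_next(f->lx);
        if (t.type == T_NEWLINE) {
            if (f->last == T_WORD || f->last == T_RBRACE) {
                f->last = T_SEMICOLON;
                return semi;           /* the newline becomes a semicolon */
            }
            continue;                  /* otherwise the newline is dropped */
        }
        if (t.type == T_RBRACE && f->last == T_WORD) {
            f->pending = t;            /* hold the brace, emit ';' first */
            f->has_pending = 1;
            f->last = T_SEMICOLON;
            return semi;
        }
        f->last = t.type;
        return t;
    }
}
```

With bison, a wrapper of this kind would take the place of the function the parser calls for tokens, and it would call the generated scanner where this sketch calls `raw_next`.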

Isagoge answered 4/6, 2012 at 21:47 Comment(2)

Thanks for the answer. If it wasn't clear from the message, I'm also using Bison--so, it seems to me at least, that this doesn't completely work (because I can't modify yyparse as you do here). – Apology 5/6, 2012 at 0:44

@luxun, not sure where you see the problem, but I have edited my above example to use bison as well. – Isagoge 5/6, 2012 at 6:10

Alter the lexer rules for \n and } to look at the last token returned by the lexer. This will require that your lexer record the last token returned for every rule.

Then your newline rule will look like this:

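\n   { if (newline_is_semi(last_token)) {
          /* record the semicolon so that a following blank line
             does not insert a second one */
          return last_token = SEMICOLON;
       }
     }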

newline_is_semi will check if last_token is in the list of tokens you listed.
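
A minimal sketch of such a check, following the token list quoted in the question, might look like the following. The `TOK_*` names are hypothetical; in a real project the codes would come from the bison-generated header rather than from this enum.

```c
#include <stdbool.h>

/* Hypothetical token codes, for this sketch only. */
enum token {
    TOK_NONE = 0,
    TOK_IDENT, TOK_NUMBER, TOK_STRING,
    TOK_BREAK, TOK_CONTINUE, TOK_FALLTHROUGH, TOK_RETURN,
    TOK_INC, TOK_DEC, TOK_RPAREN, TOK_RBRACE,
    TOK_LBRACE, TOK_PLUS, TOK_SEMICOLON
};

/* Returns true when a newline directly after `last` should be turned
   into a semicolon: identifiers, basic literals, and the tokens
   break continue fallthrough return ++ -- ) } */
static bool newline_is_semi(int last)
{
    switch (last) {
    case TOK_IDENT: case TOK_NUMBER: case TOK_STRING:
    case TOK_BREAK: case TOK_CONTINUE: case TOK_FALLTHROUGH:
    case TOK_RETURN: case TOK_INC: case TOK_DEC:
    case TOK_RPAREN: case TOK_RBRACE:
        return true;
    default:
        return false;
    }
}
```

A switch keeps the list in one place; any token not named there, including an already-present semicolon, leaves the newline ignored.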

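To handle the optional semicolon before a closing brace: when matching "}", check whether last_token was SEMICOLON; if not, unput the "}" and return SEMICOLON. Record SEMICOLON as the last token when doing so, otherwise the rescanned brace takes the same branch again and the scanner loops. Note that flex quotes literals with double quotes; '}' in a pattern would match the quote characters too.

"}"  { if (last_token != SEMICOLON) {
          unput('}');
          return last_token = SEMICOLON;
       }
       /* assumes the grammar uses the character literal '}' as its token */
       return last_token = '}';
     }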

Everara answered 5/6, 2012 at 15:10 Comment(0)

One simple way is to create a global variable

%{
    int ins_token = 0;
%}

Suppose you want a SEMICOLON inserted after ")". In the rule for ")" you set ins_token = 1, and in the rules for all other tokens you reset ins_token = 0.

When a "\n" then follows the ")", the newline rule checks the flag: if ins_token == 1 it returns SEMICOLON, otherwise it ignores the newline. Either way it resets ins_token = 0.

So ins_token acts as a flag. Set it whenever a SEMICOLON should be inserted at the next newline; the \n rule checks it and inserts the SEMICOLON if it's set.

The flag is needed because flex doesn't remember the previous token.

[\n] { if (ins_token == 1) { ins_token = 0; return SEMICOLON; } }
")"  { ins_token = 1; return ')'; }

...other tokens
...  { ins_token = 0; /* then return that token */ }
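
To see the flag in isolation, here is a small plain-C sketch that applies the same idea to a string instead of a flex scanner. The function name `insert_semicolons` and the choice to let spaces and tabs leave the flag untouched are assumptions of this example; the second point matters in a real scanner too, since a whitespace rule that reset the flag would suppress the semicolon whenever a line ends in trailing blanks.

```c
/* Copies src to dst, writing ';' in front of each newline that follows
   ")" (possibly with blanks in between). dst must have room for
   2 * strlen(src) + 1 bytes. */
static void insert_semicolons(const char *src, char *dst)
{
    int ins_token = 0;              /* flag: was the last token ")"? */

    for (; *src != '\0'; src++) {
        if (*src == '\n') {
            if (ins_token)
                *dst++ = ';';       /* the inserted semicolon */
            ins_token = 0;          /* always reset at a newline */
        } else if (*src == ')') {
            ins_token = 1;
        } else if (*src != ' ' && *src != '\t') {
            ins_token = 0;          /* any other token clears the flag */
        }
        *dst++ = *src;              /* the input itself is copied unchanged */
    }
    *dst = '\0';
}
```

For example, the three lines `f()`, `g(` and `h() ` (with a trailing space) come out as `f();`, `g(` and `h() ;`.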

Lodicule answered 26/3, 2021 at 11:15 Comment(0)
