Is it possible in ocamllex to define a rule that looks ahead at the next character without consuming it?
Asked Answered
S

1

7

I'm using ocamllex to write a lexer for a scripting language but I am facing a conflict with my rule for comments.

I want to allow my command arguments to be unquoted as long as they only contain alphanumeric characters and slashes "/". For example:

echo "quoted argument !@#%" /this/second/argument/is/unquoted

Additionally, one of my prerequesites is C++ style comments with "//"

//this is a comment
echo hello world

The problem this brings is things like

echo foo//comment

I would like my lexer to produce a "foo" token, while also leaving the "//" untouched, to be consumed in the next time I ask the lexer for token. Is that possible? The reason for this is that its possible that the input buffer might not have reached the end of the comment yet and I'd rather immediately return the "foo" token than needlessly block trying to eagerly consume the comment.

Stumble answered 27/11, 2014 at 22:10 Comment(6)
If your parser has a rule which only consumes echo foo, shouldn't that be enough for it to postpone comment lexing?Persecution
At lexer level, you only need to make sure that your unquoted params cannot contain double slashes, otherwise you'd have to discriminate the cases where it could happen, and perhaps have two sets of rules for when it is possible or notPersecution
You should probably provide a small program exhibiting the issue, so that we can see how to fix it.Persecution
MY difficulty is exactly in writing a rule that matches "foo" without consuming the slashes. If I write a regex for "strings without double slashes" then it will match "foo/" and leave "/comment" in the buffer which is not what I want. Fundamentally, the problem is that I can only tell if I should match "foo" or "foo/" when I see if the second slash is there or not.Stumble
I preferred to not include a full program because I though it would just add too much noise. If you want to simplify the problem then restricting the alphabet to "ab" and use as tokens "bb" and "words without bb in them". The problem is that when faced with "aaabba" as input I would like it to lex as "aaa bb a" and not "aaab" "ba"Stumble
btw, I managed to get it to "work" by directly fiddling with lexbuf.lex_curr_pos, subtracting "2" from it when I find a "//". HOwever I haev absolutely no idea if thats something safe to do.Stumble
P
8

The following is a small lexer which only matches echo, quoted and unquoted strings, comments, and prints out the resulting tokens:

{
  type token = NEWLINE | ECHO | QUOTED of string | UNQUOTED of string | COMMENT of string
  exception Eof

  type state = CODE | LINE_COMMENT
  let state = ref CODE
}

let newline      = '\n'
let alphanum     = [ 'A'-'Z' 'a'-'z' '0'-'9' '_' ] 
let comment_line = "//"([^ '\n' ]+)
let space        = [ ' ' '\t' ]
let quoted       = '"'([^ '"' ]+)'"'
let unquoted     = ('/'?(alphanum+'/'?)+)

rule code = parse
  space+                      { code lexbuf }
| newline                     { code lexbuf }
| "echo"                      { ECHO }
| quoted                      { QUOTED (Lexing.lexeme lexbuf) }
| "//"                        { line_comment "" lexbuf }
| ('/'|alphanum+)             { unquoted (Lexing.lexeme lexbuf) lexbuf }
| eof                         { raise Eof }

and unquoted buff = parse
  newline                     { UNQUOTED buff }
| "//"                        { state := LINE_COMMENT; if buff = "" then line_comment "" lexbuf else UNQUOTED buff }
| ('/'|alphanum+)             { unquoted (buff ^ Lexing.lexeme lexbuf) lexbuf }
| space+                      { UNQUOTED buff }
| eof                         { raise Eof }

and line_comment buff = parse
  newline                     { state := CODE; COMMENT buff }
| _                           { line_comment (buff ^ Lexing.lexeme lexbuf) lexbuf }

{

  let lexer lb =
    match !state with
      CODE -> code lb
    | LINE_COMMENT -> line_comment "" lb

  let _ =
    try
      let lexbuf = Lexing.from_channel stdin in
      while true do
        let () =
          match lexer lexbuf with
            ECHO -> Printf.printf "ECHO\n"
          | QUOTED s -> Printf.printf "QUOTED(%s)\n" s
          | UNQUOTED s -> Printf.printf "UNQUOTED(%s)\n" s
          | COMMENT s -> Printf.printf "COMMENT(%s)\n" s
          | NEWLINE -> Printf.printf "\n"
        in flush stdout
      done
    with Eof -> exit 0

}

It's a trick that I used in a project of mine, to overcome that same limitation in ocamllex (compared to the original C lex program which let one match patterns in "look ahead mode"). Basically, it splits the ambiguous rules in their distinct radicals, and switch the lexer to different parser accordingly. It also keeps track of the currently used parser ad the next entry point.

In your situation, the only states it needs to keep track of are the default one (CODE), and comment mode (LINE_COMMENT). This could be expanded to support other states if needed.

Persecution answered 28/11, 2014 at 11:51 Comment(7)
The problem with this version is that all slashes inside an "unquoted" muct be followed by an alphanum. This rules out tokens that end in a slash, like "foo/". As you already noticed, due to the ambiguity in the grammar ("foo//comment" being parseable as "foo/"+"/comment" or "foo"+"//comment") you need to lookahead whenever you encounter a "/" in order to decide if its part of a "//" or not and to choose how to disambiguate it.Stumble
This grammar also rules out tokens starting with a slash.Stumble
Now the unquoted tokens can start with "/" but they still can't end with a "/". The whole ambiguity w/ the rule for comments happens precisely because I want to allow the unquoteds to end with "/". Anyway, a regex that precisely describes the tokens (any sequence of alphanum or "/" without "//") that I already tried using is (alphanum '/'? | '/')(alphanum+ '/'?)*. The problme is that since the lexer tries to consume as much input as possible it lexes "foo//comment" as "foo/" + "/comment" so I don't think a pure regex-based solution will work here :(Stumble
Why don't you use a different comment syntax?Persecution
I guess this works although at the cost of making the lexer not reentrant anymore, right? As for the other question, now that I realise that this was much harder than I expected I was about to change my comments syntax to "#" but then I reviewed my prerequesites and realised that unquoted slashes weren't the hard requirement I thought they were so I got rid of them:)Stumble
Now that I think of it, I think you can make that lexer reentrant again by making state a parameter to the lexer function instead of a global. So I guess this wraps everything up :)Stumble
Indeed, but it becomes tricky when using it with an ocamlyacc parser.Persecution

© 2022 - 2024 — McMap. All rights reserved.