Lua long strings in fslex
I've been working on a Lua fslex lexer in my spare time, using the ocamllex manual as a reference.

I hit a few snags while trying to tokenize long strings correctly. "Long strings" are delimited by '[' ('=')* '[' and ']' ('=')* ']' tokens; the number of = signs must be the same.
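
For example (plain Lua input, not lexer code), each of these is a single long string; a closing bracket only terminates the string when its level, i.e. its number of = signs, matches the opening one:

[[a level-0 long string]]
[=[level 1; the inner ]] does not close it]=]
[==[level 2]==]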

In the first implementation, the lexer failed to recognize [[ patterns, producing two LBRACKET tokens despite the longest match rule, whereas [=[ and its variations were recognized correctly. In addition, the regular expression could not ensure that the closing token matched the opening one, stopping at the first ']' ('=')* ']' capture no matter the actual long string "level". Also, fslex does not seem to support the "as" construct in regular expressions.


let lualongstring =    '[' ('=')* '[' ( escapeseq | [^ '\\' '[' ] )* ']' ('=')* ']'

(* ... *)
    | lualongstring    { (* ... *) }
    | '['              { LBRACKET }
    | ']'              { RBRACKET }
(* ... *)


I've been trying to solve the issue with another rule in the lexer:


rule tokenize = parse
    (* ... *)
    | '[' ('=')* '['   { longstring (getLongStringLevel(lexeme lexbuf)) lexbuf }
    (* ... *)

and longstring level = parse 
    | ']' ('=')* ']'   { (* check level, do something *) }
    | _                { (* aggregate other chars *) }

    (* or *)

    | _    {
               let c = lexbuf.LexemeChar 0
               (* ... *)
           }

But I'm stuck for two reasons: first, I don't think I can "push", so to speak, a token to the next rule once I'm done reading the long string; second, I don't like the idea of reading char by char until the right closing token is found, which makes the current design useless.

How can I tokenize Lua long strings in fslex? Thanks for reading.

Blagoveshchensk answered 4/12, 2010 at 0:11 Comment(5)
Offhand, just wanted to mention: you could always choose to parse it, rather than lex it. – Jackscrew
@Brian, can you please elaborate? :) I'm a bit at a loss trying to understand how to parse a sequence of unrelated tokens to recreate the original long string, provided the lexer can produce tokens for all the content of the string. Thanks for your comment. – Blagoveshchensk
Yeah, it's probably not a good strategy, I was just throwing it out there. – Jackscrew
@Jackscrew thanks all the same, I'm still coming to grips with F# and fslex, and every little bit helps. – Blagoveshchensk
@Raine In any case, keep us informed; I'm also interested in both F# and Lua. – Summon

Apologies for answering my own question, but I'd like to contribute my own solution to the problem for future reference.

I am keeping state across lexer function calls with the LexBuffer<_>.BufferLocalStore property, which is simply a writable IDictionary<string, obj> instance.

Note: long brackets are used both by long strings and by multiline comments, an often overlooked part of the Lua grammar; a sketch of handling comments with the same machinery follows the lexer rules below.



let beginlongbracket =    '[' ('=')* '['
let endlongbracket =      ']' ('=')* ']'

rule tokenize = parse
    | beginlongbracket 
    { longstring (longBracketLevel(lexeme lexbuf)) lexbuf }

(* ... *)

and longstring level = parse
    | endlongbracket
    { if longBracketLevel(lexeme lexbuf) = level then
          LUASTRING(endLongString(lexbuf))
      else
          (* a closing bracket of a different level is part of the string *)
          (toLongString lexbuf (lexeme lexbuf); longstring level lexbuf)
    }

    | _
    { toLongString lexbuf (lexeme lexbuf); longstring level lexbuf }

    | eof
    { failwith "Unexpected end of file in string." }
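
Since long brackets also delimit multiline comments, the same machinery can be reused. Here is a sketch under the same naming assumptions, with a hypothetical longcomment rule that discards lexemes instead of accumulating them:

rule tokenize = parse
    (* ... *)
    | "--" beginlongbracket
    { longcomment (longBracketLevel(lexeme lexbuf)) lexbuf }

and longcomment level = parse
    | endlongbracket
    { if longBracketLevel(lexeme lexbuf) = level then
          tokenize lexbuf   (* comment fully consumed, resume normal lexing *)
      else
          longcomment level lexbuf }

    | _
    { longcomment level lexbuf }

    | eof
    { failwith "Unexpected end of file in comment." }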


Here are the functions I use to simplify storing data into the BufferLocalStore:

open System.Text
open System.Linq
open Microsoft.FSharp.Text.Lexing

(* the level of a long bracket is its number of '=' signs *)
let longBracketLevel (str : string) =
    str.Count(fun c -> c = '=')

(* create the StringBuilder that accumulates the string's contents *)
let createLongStringStorage (lexbuf : LexBuffer<_>) =
    let sb = new StringBuilder(1000)
    lexbuf.BufferLocalStore.["longstring"] <- box sb
    sb

(* append a lexeme to the accumulated string, creating the storage on first use *)
let toLongString (lexbuf : LexBuffer<_>) (s : string) =
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let storage = if hasString then (sb :?> StringBuilder) else (createLongStringStorage lexbuf)
    storage.Append(s) |> ignore

(* return the accumulated string and clear the storage for the next long string *)
let endLongString (lexbuf : LexBuffer<_>) : string =
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let ret = if not hasString then "" else (sb :?> StringBuilder).ToString()
    lexbuf.BufferLocalStore.Remove("longstring") |> ignore
    ret
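
A quick standalone check of the helpers (a hypothetical snippet, not part of the lexer; LexBuffer<char>.FromString merely provides a buffer to hang the BufferLocalStore on):

let lexbuf = LexBuffer<char>.FromString "dummy"
toLongString lexbuf "a"
toLongString lexbuf "bc"
assert (endLongString lexbuf = "abc")
assert (endLongString lexbuf = "")   (* the storage is cleared after each string *)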

Perhaps it's not very functional, but it seems to get the job done. To summarize:

  • Use the tokenize rule until the beginning of a long bracket is found.
  • Switch to the longstring rule and loop until a closing long bracket of the same level is found.
  • Store every lexeme that does not match a closing long bracket of the same level into a StringBuilder, which is in turn stored in the LexBuffer's BufferLocalStore.
  • Once the long string is over, clear the BufferLocalStore.
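
As a concrete trace (rule and token names as above), lexing the input [==[a ]] b]==] proceeds like this:

[==[     beginlongbracket, level 2 -> enter longstring
a, ' '   single characters, accumulated by the _ rule
]]       matches endlongbracket, but level 0 <> 2 -> accumulated as well
' ', b   single characters, accumulated
]==]     endlongbracket, level 2 -> longstring returns LUASTRING("a ]] b")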

Edit: You can find the project at http://ironlua.codeplex.com. Lexing and parsing should be okay. I am planning on using the DLR. Comments and constructive criticism welcome.

Blagoveshchensk answered 5/12, 2010 at 13:17 Comment(0)
