I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character at the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program, and & can signify a function, so &foo() can be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. Here is a simplified version as an MWE:
import lark

grammar = r'''start: instruction
?instruction: simple
    | func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple rule, then I need to start fiddling around with the SINGLESTR token definition, which is not convenient.
What I have tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
- If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
- If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
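To make the intended result concrete, here is a minimal plain-Python sketch of the split I'm after for a single, non-nested call. The name split_call is hypothetical and this is not a Lark solution; it only shows which pieces I expect to come out, assuming the last character closes the call and ",," separates the arguments.

```python
def split_call(source: str) -> tuple[str, list[str]]:
    """Split '&name(arg1,,arg2)' into the name and its raw arguments.

    Illustration only: it assumes one top-level call whose final
    character is the closing parenthesis, and it does not recurse
    into nested function arguments.
    """
    if not source.startswith("&") or not source.endswith(")"):
        raise ValueError(f"not a function call: {source!r}")
    # Drop the leading '&' and the final ')', then cut at the first '('.
    name, _, body = source[1:-1].partition("(")
    # Arguments are separated by the double comma; an empty body means no arguments.
    args = body.split(",,") if body else []
    return name, args


print(split_call("&foo($first arg (has) parentheses,,$second arg)"))
# ('foo', ['$first arg (has) parentheses', '$second arg'])
```

The parentheses inside the first argument survive because only the first "(" and the final ")" are treated as special.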
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
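A minimal sketch of what such helpers could look like. The names ESCAPES, escape and unescape are placeholders for illustration, not the actual helper functions, and the sketch assumes the escape words never occur as ordinary text.

```python
# Hypothetical mapping from escape words to the characters they stand for.
ESCAPES = {"LEFTPAR": "(", "RIGHTPAR": ")"}


def unescape(text: str) -> str:
    """Replace escape words with real parentheses after parsing."""
    for word, char in ESCAPES.items():
        text = text.replace(word, char)
    return text


def escape(text: str) -> str:
    """Inverse step: hide parentheses from the grammar before parsing."""
    for word, char in ESCAPES.items():
        text = text.replace(char, word)
    return text


print(unescape("This is a LEFTPARtestRIGHTPAR"))
# This is a (test)
```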
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr, I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
    | func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.
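One possible reading of that error, offered as an assumption rather than a confirmed diagnosis: once ( and ) are allowed inside SINGLESTR, a greedy match of TEXT can run straight through the closing parenthesis of the call, so nothing is left to close func. The sketch below only uses Python's re module with a hand-translated approximation of the TEXT terminal; the regex is not what Lark generates internally.

```python
import re

# Hand-written approximation of `TEXT: "$" SINGLESTR` from the revised grammar,
# where SINGLESTR allows "(" and ")" after its first character.
SINGLESTR = r"[A-Za-z0-9_ ][A-Za-z0-9_ ()]*"
TEXT = re.compile(r"\$" + SINGLESTR)

source = "&foo($arg1)"
# Try to match TEXT starting at the "$" inside the call.
match = TEXT.match(source, source.index("$"))
consumed = match.group(0)
remaining = source[match.end():]

print(repr(consumed), repr(remaining))
# '$arg1)' ''
```

With the greedy match, the input ends right after the text token, which would be consistent with a message saying that RPAR or ARGSEP was still expected at end of input.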
STARTSYMBOL) and you add separators and parentheses where required to be clear; I don't see any ambiguity here. You'd still have to split your STARTSYMBOL list into individual items to be distinguishable. – Whispering