Parsing quoted string with escape chars

I'm having a problem parsing a list of lines in the following format with ANTLR 4:

* this is a string
*  "first"  this is "quoted"
* this is "quoted with \" "

I want to build a parse tree like

(list 
(line * (value (string this is a string))) 
(line * (value (parameter first) (string   this is) (parameter quoted))) 
(line * (value (string this is) (parameter quoted with " )))
)

I have an antlr4 grammar of this format

grammar List;
list : line+;
line : '*' (WS)+ value* NEWLINE;
value : string
      | parameter
      ;
string : ((WORD) (WS)*)+;
parameter : '"'((WORD) (WS)*)+ '"';
WORD : (~'\n')+;
WS : '\t' | ' ';
NEWLINE     : '\n';

But this fails at the very first character, the '*' itself, which baffles me.

line 1:0 mismatched input '* this is a string' expecting '*'

Dinar answered 22/5, 2014 at 6:26 Comment(0)

The problem is that your lexer is too greedy. The rule

WORD : (~'\n')+;

matches almost everything. This causes the lexer to produce the following tokens for your input:

  • token 1: WORD (* this is a string)
  • token 2: NEWLINE
  • token 3: WORD (* "first" this is "quoted")
  • token 4: NEWLINE
  • token 5: WORD (* this is "quoted with \" ")

Yes, that is correct: only WORD and NEWLINE tokens. ANTLR's lexer tries to construct tokens with as many characters as possible; it does not "listen" to what the parser is trying to match.
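The effect can be reproduced outside ANTLR with a small simulation. The sketch below is not ANTLR code: it approximates the question's lexer rules with Python regular expressions and applies the longest-match rule, with ties going to the rule listed first. The names `RULES` and `tokenize` are made up for illustration, and it assumes the implicit literal tokens '*' and '"' are ordered ahead of the named rules.

```python
import re

# Rough stand-ins for the lexer rules of the question's grammar, in the
# order assumed here: implicit literals first, then the named rules.
RULES = [
    ("'*'", re.compile(r"\*")),
    ("'\"'", re.compile(r'"')),
    ("WORD", re.compile(r"[^\n]+")),   # WORD : (~'\n')+;
    ("WS", re.compile(r"[\t ]")),      # WS : '\t' | ' ';
    ("NEWLINE", re.compile(r"\n")),    # NEWLINE : '\n';
]


def tokenize(text):
    """Split text into (rule name, matched text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at index {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


print(tokenize("* this is a string\n"))
```

For the first input line this yields one WORD token holding the whole line, followed by NEWLINE. The '*' literal only wins when nothing else follows it on the line, because WORD always offers a longer match.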

The error message:

line 1:0 mismatched input '* this is a string' expecting '*'

is telling you this: at line 1, index 0, a token with the text '* this is a string' (type WORD) was encountered, but the parser was trying to match the token '*'.

Try something like this instead:

grammar List;

parse
 : NEWLINE* list* NEWLINE* EOF
 ;

list
 : item (NEWLINE item)*
 ;

item
 : '*' (STRING | WORD)* 
 ;

BULLET : '*';
STRING : '"' (~[\\"] | '\\' [\\"])* '"';
WORD : ~[ \t\r\n"*]+;
NEWLINE : '\r'? '\n' | '\r';
SPACE : [ \t]+ -> skip;

which parses your example input as follows:

(parse 
  (list 
    (item 
      * this is a string) \n 
    (item 
      * "first" this is "quoted") \n 
    (item 
      * this is "quoted with \" ")) 
   \n 
  <EOF>)
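To see how these lexer rules split a line that contains an escaped quote, here is a rough Python approximation. It is only a sketch: the regular expressions are hand translations of the rules above, `RULES`, `SKIP` and `tokenize` are illustrative names, and ANTLR's real lexer is not involved.

```python
import re

# Hand-translated approximations of the lexer rules in the grammar above.
RULES = [
    ("BULLET", re.compile(r"\*")),
    # A quote, then any mix of ordinary characters or an escaped \ or ",
    # then the closing quote.
    ("STRING", re.compile(r'"(?:[^\\"]|\\[\\"])*"')),
    ("WORD", re.compile(r'[^ \t\r\n"*]+')),
    ("NEWLINE", re.compile(r"\r?\n|\r")),
    ("SPACE", re.compile(r"[ \t]+")),
]
SKIP = {"SPACE"}  # mirrors "-> skip": matched but never emitted


def tokenize(text):
    """Split text into (rule name, matched text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at index {pos}")
        if best[0] not in SKIP:
            tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


for token in tokenize('* this is "quoted with \\" "'):
    print(token)
```

The third example line comes out as BULLET, two WORD tokens, and a single STRING token that still contains the escaped quote. Because WORD excludes the space, quote and asterisk characters, it can no longer swallow the whole line.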
Cheesy answered 22/5, 2014 at 7:28 Comment(0)
