Optional Prefix in ANTLR parser/lexer

Asked 20/4, 2015 at 16:20 Answered 20/4, 2015 at 17:52

Solved antlr antlr4

I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:

grammar MyGrammar;

parse : PREFIX? SEARCH;

PREFIX
  : [0-9]+ ':'
  ;

SEARCH
  : .+ 
  ;

For example, valid input strings include:

0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper

But the SEARCH rule always matches the whole string - whether it has a prefix or not.

I understand this is because the ANTLR4 lexer gives preference to the rules that match the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.

And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule - and the SEARCH rule doesn't have any other parts to constrain it.

If it helps, I could constrain the SEARCH text to exclude ':', but I would really prefer that it recognise anything else: Unicode characters, symbols, numbers, spaces etc.
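A minimal sketch of that constrained variant, assuming each input string is fed to the lexer on its own and that ':' only ever appears at the end of the prefix. The rule names are the ones from the grammar above; only the SEARCH body changes, and the trailing EOF is an addition. Because SEARCH can no longer cross a ':', the longest match at the start of 0: maracujá should be PREFIX (0:) rather than SEARCH (0):

```antlr
grammar MyGrammar;

parse : PREFIX? SEARCH EOF ;

// One or more digits followed by ':'.
PREFIX : [0-9]+ ':' ;

// Assumption: SEARCH may contain anything except ':'.
SEARCH : ~[:]+ ;
```

Any whitespace after the colon stays part of the SEARCH token here, and a ':' anywhere else in the text would not be matched by either rule.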

I've read "Lexer to handle lines with line number prefix", but in that case the body of the string (after the prefix) is significantly more constrained.

Note: SEARCH text might have a structure to it, like €53.60 and 10kg above (which I'd also like ANTLR4 to parse), or it might just be free text, like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.

Denizen asked 20/4, 2015 at 16:20 Comment(2)

It doesn't make sense to not constrain the SEARCH rule, because your grammar would be ambiguous. 0: x 1: y could be tokenized as either PREFIX SEARCH PREFIX SEARCH or PREFIX SEARCH. – Winze 20/4, 2015 at 16:29

In my language 0: x 1: y would be a PREFIX of 0: and a SEARCH of x 1: y, so there's only ever one PREFIX and everything that follows is the SEARCH. – Denizen 20/4, 2015 at 16:42

ANTLR does lexing before parsing. The lexer prefers the longest match, and SEARCH matches everything a PREFIX token matches plus any characters that follow it, so your complete line is matched by SEARCH.

To prevent this, keep the lexer rules disjoint, or at least make sure that no token subsumes another.

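parse : prefix? search;

// search is assembled in the parser from small, non-overlapping tokens
search: (WORD | NUMBER)+;

prefix: NUMBER ':';

NUMBER : [0-9]+;
// WORD excludes digits and ':' so it cannot swallow a prefix
WORD : (~[0-9:])+;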

Guidepost answered 20/4, 2015 at 17:52 Comment(2)

Unfortunately this doesn't do what I hoped. When I test with my examples: for "0: maracujá", search matches the whole string (I want prefix=0 and search=maracujá); "apple" is OK; for "3:€53.60", search matches the whole string (I expected prefix=3 and search=€53.60); "1: 10kg" is OK; for "2:chilli pepper", search matches the whole string (I expected prefix=2 and search=chilli pepper). – Denizen 21/4, 2015 at 16:14

Exclude the colon : from WORD. I corrected the grammar. If you want to allow : in your search, adjust the parser rule for search. – Guidepost 21/4, 2015 at 18:2
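A sketch of that adjustment, assuming the colon gets its own named token and that search should accept it like any other token. COLON and the trailing EOF are introduced here for illustration and are not part of the answer's grammar:

```antlr
grammar MyGrammar;

parse  : prefix? search EOF ;

// At most one leading "digits ':'" pair is treated as the prefix.
prefix : NUMBER COLON ;

// Everything else, including later numbers and colons, belongs to search.
search : (WORD | NUMBER | COLON)+ ;

NUMBER : [0-9]+ ;
COLON  : ':' ;
// Anything that is neither a digit nor ':' (whitespace included).
WORD   : ~[0-9:]+ ;
```

With these rules an input such as 0: x 1: y should lex to NUMBER COLON WORD NUMBER COLON WORD. The leading NUMBER COLON could then be read either as the prefix or as the start of search; since the optional prefix? is greedy, ANTLR would be expected to take the prefix branch and leave x 1: y for search, but that is worth verifying against the example inputs. Whitespace after the colon remains part of the first WORD token.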
