How to match any symbol in ANTLR parser (not lexer)?

Asked 16/5, 2013 at 22:21 Answered 5/6, 2013 at 16:46

How do I match any symbol in an ANTLR parser (not the lexer)? Where is the complete language description for ANTLR4 parsers?

UPDATE

Is the answer "impossible"?

Cribriform answered 16/5, 2013 at 22:21 Comment(2)

The answer to your updated question is: yes, it is impossible (as I indicated in my answer). – Wanids 18/5, 2013 at 18:38

I have no idea what exactly you need. But maybe you should look at "island grammars". They should help in cases where you need to parse one input with two different grammars. – Minta 20/5, 2013 at 6:57

You first need to understand the roles of each part in parsing:

The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).

The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.

As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
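To make the split concrete, here is a minimal sketch in Python (not ANTLR-generated code). The token names INT, ID, PLUS and WS and the functions tokenize and parse_sum are made up for illustration. The point is only that the parser function never inspects characters, just token types.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    type: str   # abstract token symbol
    text: str   # the characters the lexer grouped together


# Hypothetical token definitions, tried in this order at each position.
TOKEN_SPEC = [
    ("INT", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PLUS", r"\+"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text: str) -> List[Token]:
    """Lexer: turn a character stream into tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_sum(tokens: List[Token]) -> bool:
    """Parser: accept operand (PLUS operand)*, looking only at token types."""
    types = [t.type for t in tokens]
    if not types or types[0] not in ("INT", "ID"):
        return False
    i = 1
    while i < len(types):
        if types[i] != "PLUS" or i + 1 >= len(types) or types[i + 1] not in ("INT", "ID"):
            return False
        i += 2
    return True
```

Calling parse_sum(tokenize("x + 42")) goes through both stages; by the time parse_sum runs, the individual letters are no longer visible to it.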

Having said that, it would probably help to know why you want to skip individual input letters in your parser. It looks like your base concept needs adjustments.

Ilse answered 18/5, 2013 at 8:11 Comment(2)

As the previous answerer said, there is a DOT meta char in ANTLR. It could also have some lexer meta rule to convert all chars into tokens in 1:1 fashion. I see no reason for having a lexer at all. It does tagging just as the parser does, but poor and limited tagging. It is a redundant construct. – Cribriform 18/5, 2013 at 18:4

@SuzanCioc, you're just using the wrong tool. ANTLR does not work as you expect/want it to. What you want is a PEG, no need to complain about the fact that ANTLR does not do as you expect/want it to. – Wanids 18/5, 2013 at 18:39

It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck: ANTLR strictly separates parser rules from lexer rules, so it is not possible to match any character inside a parser rule.
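As a rough illustration of what a token-level wildcard means, here is a small Python sketch (not ANTLR's implementation). The helper matches, the constant ANY and the token names are hypothetical. The wildcard accepts exactly one token, however many characters that token spans.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)
ANY = "."                # stand-in for a parser-level wildcard


def matches(pattern: List[str], tokens: List[Token]) -> bool:
    """Return True if tokens match pattern one-for-one; ANY accepts any single token."""
    if len(pattern) != len(tokens):
        return False
    return all(p == ANY or p == tok_type for p, (tok_type, _) in zip(pattern, tokens))


# The STRING token covers seven characters, yet it is a single item for the wildcard.
tokens = [("LPAREN", "("), ("STRING", '"hello"'), ("RPAREN", ")")]
```

Here matches(["LPAREN", ANY, "RPAREN"], tokens) succeeds because the whole string literal arrives as one token; the wildcard cannot look inside it.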

Wanids answered 17/5, 2013 at 17:53 Comment(5)

Doesn't this seriously limit working with Unicode? Unicode consists of thousands of symbols, and ANTLR allows me either to treat most of them as the same token or to create a lexer definition for each. That looks like seriously bad design, doesn't it? – Cribriform 17/5, 2013 at 23:6

P.S. Suppose I want to describe a Java string literal. It is enclosed in quotes but can contain anything. If I parse it with the lexer, I will lose its content. Why is that so different from an integer literal? – Cribriform 17/5, 2013 at 23:8

@SuzanCioc sorry, I have no idea what you mean. String literals are usually handled in the lexer, why would you need to handle them in a parser rule? – Wanids 18/5, 2013 at 5:47

Are number literals also usually handled by lexers? String literals can contain some sophisticated constructions, like in JSTL/EL or C# strings. This is the work of the parser. I see no reason to tie this to the lexer, especially when the lexer has such limited functionality. – Cribriform 18/5, 2013 at 18:2

@SuzanCioc, that is the work for another parser in that case. What you're trying to do is parse 2 different languages: I know of no tool that does this (easily). – Wanids 18/5, 2013 at 18:31

It is possible, but only if your grammar is so basic that there is no real reason to use ANTLR anyway.

If you had the grammar:

text     : ANY_CHAR* ;
ANY_CHAR : . ;

it would do what you (seem to) want.

However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.

The parser is then used to define the grammar, so:

phrase     : CHAR* jstl CHAR* ;
jstl       : JSTL SLASH QUALIFIER ;

JSTL       : 'JSTL' ;
SLASH      : '/' ;
QUALIFIER  : [A-Z][A-Z] ;
CHAR       : . ;

would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
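To see which token sequences the parser would actually receive, here is a hedged Python simulation of the lexer rules above. It assumes ANTLR's usual lexer behaviour: the longest match wins, and ties go to the rule defined first. It also assumes the whole input must be consumed, as if phrase ended in EOF. The names lex and accepts are made up for this sketch.

```python
import re
from typing import List

# Lexer rules from the grammar above, in definition order.
RULES = [
    ("JSTL", re.compile(r"JSTL")),
    ("SLASH", re.compile(r"/")),
    ("QUALIFIER", re.compile(r"[A-Z][A-Z]")),
    ("CHAR", re.compile(r".", re.DOTALL)),
]


def lex(text: str) -> List[str]:
    """Return the token types for text using longest-match, first-rule-wins."""
    types = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        types.append(best_name)  # CHAR matches any character, so this is never None
        pos += best_len
    return types


# phrase : CHAR* jstl CHAR* ;   jstl : JSTL SLASH QUALIFIER ;
PHRASE = re.compile(r"(CHAR )*JSTL SLASH QUALIFIER( CHAR)*")


def accepts(text: str) -> bool:
    """Check the token-type sequence against the phrase rule."""
    return PHRASE.fullmatch(" ".join(lex(text))) is not None
```

Note that any "JSTL" in the input becomes a JSTL token and any other pair of capitals becomes a QUALIFIER token, wherever they appear, so text that mentions them outside the JSTL/XX form is rejected by phrase.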

I'd recommend looking at The Definitive ANTLR 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Schoolroom answered 5/6, 2013 at 16:46 Comment(2)

Your phrase rule will not work on the phrase "JSTL is capitalized version of jstl", because the lexer will eat the first "JSTL" as a JSTL token and the input will then not match the phrase rule. So the idea of a lexer is just a bad idea. – Cribriform 6/6, 2013 at 13:13

This is quite bizarre, as when I started with ANTLR I had pretty much the opposite view, and came very close a number of times to just using the lexer to provide tokens and then writing my own parser outside of ANTLR. All I can say is that, for many of the simple grammars on here which people use to discuss bugs and/or misunderstandings, yes, it seems like overkill to have both a parser and a lexer. However, once things get complicated, the split is beneficial: both the lexer and parser grammars for our application here run to hundreds of lines, and we find this decoupling to be a good idea. – Schoolroom 7/6, 2013 at 7:59
