ANTLR4 negative lookahead in lexer

I am trying to define lexer rules for PostgreSQL SQL.

The problem is that the operator definition conflicts with line comments.

For example, @--- is the operator token @- followed by the -- comment, not the operator token @---.

In Grako, a negative lookahead for the - fragment could be defined like this:

OP_MINUS: '-' ! ( '-' ) .

In ANTLR4 I could not find any way to roll back an already consumed fragment.

Any ideas?

Here is the original definition of what a PostgreSQL operator can be:

The operator name is a sequence of up to NAMEDATALEN-1
(63 by default) characters from the following list:

 + - * / < > = ~ ! @ # % ^ & | ` ?

There are a few restrictions on your choice of name:
-- and /* cannot appear anywhere in an operator name,
since they will be taken as the start of a comment.

A multicharacter operator name cannot end in + or -,
unless the name also contains at least one of these
characters:

~ ! @ # % ^ & | ` ?

For example, @- is an allowed operator name, but *- is not.
This restriction allows PostgreSQL to parse SQL-compliant
commands without requiring spaces between tokens.
Jeanajeanbaptiste asked 12/6, 2014 at 21:19 Comment(5)
Damal: Can you give a more specific example of what you're trying to do, what you already attempted, and why that didn't solve your problem?
Jeanajeanbaptiste: I need the lexer to return an Op class that (for simplification) can contain +, -, * and / in any combination. But -- and /* start a comment, and the lexer should return +--this_is_plus as two tokens, Op(+) and LineComment(--this_is_plus), not as Op(+--) and Ident(this_is_plus).
Damal: Have you tried it? What you are describing that you want is the only way ANTLR works.
Carranza: Does the operator token always consist of an @ followed by exactly 1 character?
Jeanajeanbaptiste: No, the operator token is 1 to 63 characters long.

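You can use a semantic predicate in your lexer rules to perform lookahead (or lookbehind) without consuming characters. For example, the following rule covers the comment-related restrictions for an operator: a - is only accepted when it is not followed by another -, and a / is only accepted when it is not followed by *.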

OPERATOR
  : ( [+*<>=~!@#%^&|`?]
    | '-' {_input.LA(1) != '-'}?
    | '/' {_input.LA(1) != '*'}?
    )+
  ;
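
To make the effect of the two predicates easier to see, here is a minimal Python sketch of the same scan done by hand, outside ANTLR. The names scan_operator and OP_CHARS are illustrative only and are not part of ANTLR or PostgreSQL; the sketch simply mirrors the rule above by consuming a - or / only when the following character would not start a comment.

```python
# Characters allowed in an operator, taken from the PostgreSQL list quoted in the question.
OP_CHARS = set("+-*/<>=~!@#%^&|`?")


def scan_operator(text, pos=0):
    """Return the operator starting at pos, stopping before '--' or '/*'.

    Returns an empty string when no operator character is found at pos.
    """
    end = pos
    while end < len(text) and text[end] in OP_CHARS:
        # One character of lookahead, like _input.LA(1) after matching '-' or '/'.
        nxt = text[end + 1] if end + 1 < len(text) else ""
        if text[end] == "-" and nxt == "-":
            break  # '--' starts a line comment
        if text[end] == "/" and nxt == "*":
            break  # '/*' starts a block comment
        end += 1
    return text[pos:end]


if __name__ == "__main__":
    for sample in ("+--this_is_plus", "@-", "<=/*c*/"):
        print(repr(sample), "->", repr(scan_operator(sample)))
```

As in the grammar rule, nothing here handles the restriction on a trailing + or -; the sketch only shows the comment lookahead.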

However, the above rule does not address the restrictions on including a + or - at the end of an operator. The easiest way to handle that is probably to split the two cases into separate rules.

// This rule does not allow + or - at the end of an operator.
// Note: a lone + or - is therefore not matched here and needs its own rule.
OPERATOR
  : ( [*<>=~!@#%^&|`?]
    | ( '+'
      | '-' {_input.LA(1) != '-'}?
      )+
      [*<>=~!@#%^&|`?]
    | '/' {_input.LA(1) != '*'}?
    )+
  ;

// This rule allows + or - at the end of an operator and sets the token type to OPERATOR.
// It requires a character from the special subset to appear.
OPERATOR2
  : ( [*<>=+]
    | '-' {_input.LA(1) != '-'}?
    | '/' {_input.LA(1) != '*'}?
    )*
    [~!@#%^&|`?]
    OPERATOR?
    ( '+'
    | '-' {_input.LA(1) != '-'}?
    )+
    -> type(OPERATOR)
  ;
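
When testing rules like these, it can help to have an independent check of the quoted restrictions to compare the lexer's tokens against. The following is a rough Python sketch under that assumption. The helper is_valid_operator_name and the constant names are hypothetical, and the 63-character limit is the default NAMEDATALEN-1 value from the documentation quoted in the question.

```python
# All characters allowed in an operator name (from the quoted PostgreSQL list).
OP_CHARS = set("+-*/<>=~!@#%^&|`?")
# At least one of these must be present if a multicharacter name ends in + or -.
SPECIAL_CHARS = set("~!@#%^&|`?")
MAX_LEN = 63  # NAMEDATALEN-1 by default, per the quoted documentation


def is_valid_operator_name(name):
    """Check a candidate operator name against the quoted restrictions."""
    if not 1 <= len(name) <= MAX_LEN:
        return False
    if any(ch not in OP_CHARS for ch in name):
        return False
    # '--' and '/*' would be taken as the start of a comment.
    if "--" in name or "/*" in name:
        return False
    # A multicharacter name may end in + or - only if it has a special character.
    if len(name) > 1 and name[-1] in "+-":
        return any(ch in SPECIAL_CHARS for ch in name)
    return True


if __name__ == "__main__":
    for name in ("@-", "*-", "+", "+--"):
        print(repr(name), is_valid_operator_name(name))
```

Every token the lexer emits with type OPERATOR should pass this check; a failure would point at a gap in one of the two rules.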
Damal answered 14/6, 2014 at 15:33 Comment(1)
Koto: Wow, I'm having a really hard time deciphering this. To put the first rule into words, an operator matches either one of [*<>=~!@#%^&|`?], or (one or more of: + or (- without another - ahead)) followed by one of [*<>=~!@#%^&|`?], or a / without a * ahead of it. Is this correct, or did I misunderstand some part of it?
