Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?

Here's my grammar:

lexer grammar StringLexer;

// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;

Here's my java block:

String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s)); 

Token t = lexer.nextToken();

if (t.getType() == StringLexer.STRING) {
    System.out.println("Saw a String");
}
else {
    System.out.println("Nope");
}

This outputs Saw a String. Should "\" really match STRING?

Edit: Both 280Z28's and Bart's solutions are great; unfortunately, I can only accept one.

Sublease answered 3/7, 2014 at 15:36 Comment(0)

For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:

'"' .*? '"'

To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.

StringLiteral
  : UnterminatedStringLiteral '"'
  ;

UnterminatedStringLiteral
  : '"' (~["\\\r\n] | '\\' (. | EOF))*
  ;

If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.

If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
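To see how the two rules interact, here is a minimal sketch that emulates them with java.util.regex instead of ANTLR. The class name StringLiteralSketch and the classify helper are hypothetical, and a backtracking regex engine is only an approximation of an ANTLR lexer; the sketch assumes the lexer prefers the longest match, as described for these rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: approximates the two lexer rules
// with regular expressions and picks whichever rule matches more input.
class StringLiteralSketch {
    // '"' (~["\\\r\n] | '\\' (. | EOF))*
    static final String UNTERMINATED_BODY =
            "\"(?:[^\"\\\\\\r\\n]|\\\\(?:[\\s\\S]|\\z))*";

    static final Pattern UNTERMINATED = Pattern.compile(UNTERMINATED_BODY);
    // UnterminatedStringLiteral '"'
    static final Pattern TERMINATED = Pattern.compile(UNTERMINATED_BODY + "\"");

    // Length of the match anchored at the start of the input, or -1 if none.
    static int matchLength(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    // Returns the name of the rule with the longer match at the start of the input.
    static String classify(String input) {
        int terminated = matchLength(TERMINATED, input);
        int unterminated = matchLength(UNTERMINATED, input);
        if (terminated < 0 && unterminated < 0) {
            return "no match";
        }
        return terminated >= unterminated
                ? "StringLiteral"
                : "UnterminatedStringLiteral";
    }

    public static void main(String[] args) {
        System.out.println(classify("\"abc\""));  // input text: "abc"
        System.out.println(classify("\"abc"));    // input text: "abc
        System.out.println(classify("\"\\\""));   // input text: "\"
    }
}
```

In this sketch the input "\" from the question is classified as UnterminatedStringLiteral, because the backslash consumes the quote that follows it and no closing quote remains.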

Rusticate answered 3/7, 2014 at 17:19 Comment(5)
Is there any reason you made these parser rules instead of lexer rules? I just implemented them as lexer rules and they appear to work fine. - Sublease
@RepickBroom They are lexer rules (they start with a capital letter). Parser rules start with a lowercase letter. - Rusticate
So much for reading comprehension on my part... I'm used to seeing the all-caps lexer rules; my eyes just glazed over those capitalized rules. - Sublease
What was the purpose of splitting the rule into two? As I understood it, the lexer is context-less? - Obstruent
@Obstruent I understand it's been a few years, but: the lexer goes for the longest possible token, so if you have "mogusEOF the lexer will see it as an unterminated string literal. However, if you have "mogus"EOF, the one extra quote makes StringLiteral the longer token, so that is the one that gets tokenized. - Boonie

Yes, "\" is matched by the STRING rule:

            STRING: '"' (ESC|.)*? '"';
                     ^       ^     ^
                     |       |     |
// matches:          "       \     "

If you don't want the . to match the backslash (and quote), do something like this:

STRING: '"' ( ESC | ~[\\"] )* '"';

And if your string can't be spread over multiple lines, do:

STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
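As a rough check of the difference, the following sketch translates the original non-greedy rule and the stricter rule into java.util.regex patterns. The class name EscapedQuoteSketch and its constants are hypothetical, and a regex engine is only an approximation of an ANTLR lexer, but for the input from the question it gives the same result as described above.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two STRING rules using regular expressions.
class EscapedQuoteSketch {
    // '"' (ESC | .)*? '"'  where ESC is '\\"' | '\\\\'
    static final Pattern NON_GREEDY =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[\\s\\S])*?\"");

    // '"' (ESC | ~[\\"])* '"'
    static final Pattern STRICT =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    // True if the whole input is accepted as a single string literal.
    static boolean isString(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    public static void main(String[] args) {
        String s = "\"\\\""; // the text "\" from the question
        System.out.println("non-greedy: " + isString(NON_GREEDY, s));
        System.out.println("strict:     " + isString(STRICT, s));
    }
}
```

The non-greedy pattern accepts "\" because the wildcard can consume the backslash on its own, leaving the quote to close the string. The strict pattern rejects it, since a backslash can only appear as part of ESC.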
Cosgrave answered 3/7, 2014 at 17:11 Comment(0)
