How do I escape an escape character with ANTLR 4?

Many languages bound a string with some sort of quote, like this:

"Rob Malda is smart."

ANTLR 4 can match such a string with a lexer rule like this:

QuotedString : '"' .*? '"';

To use certain characters within the string, they must be escaped, perhaps like this:

"Rob \"Commander Taco\" Malda is smart."

ANTLR 4 can match this string as well:

EscapedString : '"' ('\\"'|.)*? '"';

(taken from p96 of The Definitive ANTLR 4 Reference)

Here's my problem: Suppose that the character for escaping is the same character as the string delimiter. For example:

"Rob ""Commander Taco"" Malda is smart."

(This is perfectly legal in PowerShell.)

What lexer rule would match this? I would think this would work:

EscapedString : '"' ('""'|.)*? '"';

But it doesn't. The lexer tokenizes the escape character " as the end of string delimiter.
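To make the failure concrete, here is a rough analog of that rule using Python's re module. This is only a sketch: Python's backtracking regex engine is not ANTLR's lexer, and the names LAZY and tokens are mine, but the lazy loop stops at the first closing quote in the same way.

```python
import re

# Regex analog of: '"' ('""' | .)*? '"'   (lazy loop, as in the rule above)
LAZY = re.compile(r'"(?:""|.)*?"')

text = '"Rob ""Commander Taco"" Malda is smart."'
tokens = LAZY.findall(text)

# The lazy loop exits at the first quote it can, so the string is split in three.
print(tokens)
# ['"Rob "', '"Commander Taco"', '" Malda is smart."']
```

Each doubled quote ends up as the end of one token and the start of the next, which is the behavior described above.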

Rika answered 22/4, 2015 at 14:5

Negate certain characters with the ~ operator:

EscapedString : '"' ( '""' | ~["] )* '"';

or, if there can't be line breaks in your string, do:

EscapedString : '"' ( '""' | ~["\r\n] )* '"';

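You don't want to use the non-greedy operator here. With it, the loop exits at the first " it can, so "" would never be consumed and "a""b" would be tokenized as "a" and "b" instead of as a single token. With ~["], the loop cannot run past a lone quote anyway, so the greedy * is safe.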
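If it helps to see what the greedy rule does step by step, here is a small hand-written scanner in Python that mirrors '"' ( '""' | ~["] )* '"'. It is a sketch under my own assumptions (scan_quoted is a made-up helper, and unlike ANTLR it does not fall back to a shorter match when the input is unterminated).

```python
def scan_quoted(text, start=0):
    """Return the index just past the quoted string at `start`, or -1 if none."""
    if start >= len(text) or text[start] != '"':
        return -1
    i = start + 1
    while i < len(text):
        if text[i] == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                i += 2          # '""' alternative: an escaped quote, keep going
                continue
            return i + 1        # a lone quote closes the string
        i += 1                  # ~["] alternative: any non-quote character
    return -1                   # ran out of input before a closing quote


sample = '"a""b"'
end = scan_quoted(sample)
print(sample[:end])  # "a""b"
```

A quote is only treated as the closing delimiter when it is not followed by a second quote, which is why no non-greedy operator is needed.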

Fumble answered 22/4, 2015 at 15:12

Comments (3):
It works, thank you. But it only works if I use the greedy quantifier, not the non-greedy one. Why is that? - Rika
@Rika You're welcome. I added some extra info on the non-greedy matching. - Fumble
It's very interesting to see how the negation functions like a non-greedy quantifier. Cool. - Rika

(Don't vote for this answer; vote for @Bart Kiers' answer.)

I'm offering this for completeness, as it's a small piece of a PowerShell grammar. Combining the escape logic from p76 in The Definitive ANTLR 4 Reference with Bart's answer, here are the rules necessary for lexing escaped strings in PowerShell:

EscapedString
    : '"'      (Escape | '""'   | ~["])* '"'
    | '\''     (Escape | '\'\'' | ~['])* '\''
    | '\u201C' (Escape | .)*? ('\u201D' | '\u2033')   // smart quotes
    ;

fragment Escape
    : '\u0060\''    // backtick single-quote
    | '\u0060"'     // backtick double-quote
    ;

These rules handle the following four ways to escape strings in PowerShell:

'Rob ''Commander Taco'' Malda is smart.'
"Rob ""Commander Taco"" Malda is smart."
'Rob `'Commander Taco`' Malda is smart.'
"Rob `"Commander Taco`" Malda is smart."
Rika answered 22/4, 2015 at 17:6
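For a quick sanity check outside ANTLR, the first two alternatives of EscapedString can be approximated with a Python regex. This is only a sketch: the names ESCAPE, DOUBLE, SINGLE and ESCAPED_STRING are mine, the smart-quote alternative is left out, and the regex's ordered alternation is standing in for ANTLR's longest-match behavior.

```python
import re

ESCAPE = "`['\"]"                            # backtick followed by ' or "
DOUBLE = '"(?:' + ESCAPE + '|""|[^"])*"'     # '"'  (Escape | '""'   | ~["])* '"'
SINGLE = "'(?:" + ESCAPE + "|''|[^'])*'"     # '\'' (Escape | '\'\'' | ~['])* '\''
ESCAPED_STRING = re.compile(DOUBLE + "|" + SINGLE)

samples = [
    "'Rob ''Commander Taco'' Malda is smart.'",
    '"Rob ""Commander Taco"" Malda is smart."',
    "'Rob `'Commander Taco`' Malda is smart.'",
    '"Rob `"Commander Taco`" Malda is smart."',
]

# Each sample should be consumed as one token from the first to the last character.
results = [ESCAPED_STRING.fullmatch(s) is not None for s in samples]
print(results)
```

Putting the escape alternative first matters in the regex version, because otherwise the backtick would be consumed as an ordinary character and the following quote would close the string.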