lark grammar: How does the escaped string regex work?

The lark parser predefines some common terminals, including a string. It is defined as follows:

_STRING_INNER: /.*?/
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/ 

ESCAPED_STRING : "\"" _STRING_ESC_INNER "\""

I do understand _STRING_INNER. I also understand how ESCAPED_STRING is composed. But what I don't really understand is _STRING_ESC_INNER.

If I read the regex correctly, all it says is that whenever I find two consecutive literal backslashes, they must not be preceded by another literal backslash?

How can I combine those two into a single regex?

And wouldn't it be required for the grammar to only allow escaped double quotes in the string data?

Waterer answered 22/4, 2020 at 18:11 Comment(1)
Why does _STRING_INNER have two backslashes in /.*?/? - Apostate

Preliminaries:

  • .*? is a non-greedy match: it takes as few repetitions of . (any character) as possible. This only makes sense when followed by something else. So .*?X on input AAXAAX would match only the first AAX, instead of expanding all the way to the last X.

  • (?<!...) is a "negative look-behind assertion": "Matches if the current position in the string is not preceded by a match for ....". So .*(?<!X)Y would match AY but not XY.
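
Both preliminaries can be checked quickly with Python's re module. This is only a small sketch using the toy inputs from the bullets above (the variable names are mine, not part of Lark):

```python
import re

# Non-greedy: .*? stops at the first X it can reach.
lazy = re.match(r".*?X", "AAXAAX").group()
# Greedy, for contrast: .* runs on to the last X.
greedy = re.match(r".*X", "AAXAAX").group()

# Negative look-behind: the Y must not be directly preceded by X.
accepted = re.fullmatch(r".*(?<!X)Y", "AY")   # a match object
rejected = re.fullmatch(r".*(?<!X)Y", "XY")   # None

print(lazy)                  # AAX
print(greedy)                # AAXAAX
print(accepted is not None)  # True
print(rejected is None)      # True
```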

Applying this to your example:

  • ESCAPED_STRING: The rule says: "Match ", then _STRING_ESC_INNER, and then " again".

  • _STRING_INNER: Matches as few repetitions of any character as possible. As said before, this only makes sense when considering the regular expression that comes after it.

  • _STRING_ESC_INNER: We want this to match the shortest possible string that does not contain a closing quote. That is, for an input "abc"xyz", we want to match "abc", instead of also consuming the xyz" part. However, we have to make sure that the " is really a closing quote, in that it should not be itself escaped. So for input "abc\"xyz", we do not want to match only "abc\", because the \" is escaped. We observe that the closing " has to be directly preceded by an even number of \ (with zero being an even number). So " is ok, \\" is ok, \\\\" is ok etc. But as soon as " is preceded by an odd number of \, that means the " is not really a closing quote.

    (\\\\) matches \\. The (?<!\\) says "the position before should not have \". So combined (?<!\\)(\\\\) means "match \\, but only if it is not preceded by \".

    The following *? applies to the group (\\\\) only, so the look-behind is checked once, at the start of the run of \\ pairs. It repeats the group as few times as possible, which again only makes sense when considering what comes after it, namely the closing " from the ESCAPED_STRING rule. (Possible point of confusion: the \" in ESCAPED_STRING refers to a literal " in the actual input we want to match, in the same way that \\\\ refers to \\ in the input.) So (?<!\\)(\\\\)*?\" means "match the shortest run of \\ pairs that is followed by " and not preceded by \". In other words, (?<!\\)(\\\\)*?\" matches only a " that is preceded by an even number of \ (including zero).

    Now combining it with the preceding _STRING_INNER, the _STRING_ESC_INNER rule then says: Match everything up to the first " preceded by an even number of \, in other words, up to the first " that is not itself escaped.
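
As for combining the pieces into a single regex: a rough sketch, assuming the terminal is equivalent to its parts simply written one after another (I have not checked how Lark builds the pattern internally). The helper closing_quote_by_parity is a hypothetical name of mine; it applies the "even number of \" rule by hand so it can be compared against what the regex finds.

```python
import re

# Assumption: ESCAPED_STRING behaves like its pieces joined in order:
#   "  +  .*?  +  (?<!\\)(\\\\)*?  +  "
ESCAPED_STRING = re.compile(r'".*?(?<!\\)(\\\\)*?"')

def closing_quote_by_parity(text: str) -> int:
    """Index of the first '"' after position 0 that is preceded by an
    even number of backslashes, or -1 if there is none."""
    for i in range(1, len(text)):
        if text[i] != '"':
            continue
        j = i - 1
        while j >= 1 and text[j] == "\\":
            j -= 1
        if (i - 1 - j) % 2 == 0:  # even run of backslashes: a real closing quote
            return i
    return -1

samples = [
    r'"abc"xyz"',     # no backslash before the second quote
    r'"abc\"xyz"',    # one backslash: that quote is escaped
    r'"abc\\"xyz"',   # two backslashes: the quote closes the string
    r'"abc\\\"xyz"',  # three backslashes: the quote is escaped again
]

results = []
for s in samples:
    m = ESCAPED_STRING.match(s)
    # The regex should stop at the same quote the manual parity scan finds.
    assert m.end() - 1 == closing_quote_by_parity(s)
    results.append(m.group())
    print(s, "->", m.group())
```

The four samples stop after "abc", at the very end, after "abc\\", and at the very end again, which is the even/odd pattern described above.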

Platypus answered 22/4, 2020 at 20:6 Comment(6)
Thanks. But why do we want to match an escaped quote \"? That basically means the string is not yet complete and there are more characters to be consumed. - Waterer
Got it. When you write \" you actually mean the literal ", just in regex style so it has to be escaped. - Waterer
Yes, I'm sorry, that was poorly worded on my part. I edited the answer to make it more clear. So the \" in the program code corresponds to a " in the input. - Platypus
I'm also confused. Why does _STRING_INNER have two backslashes in /.*?/? - Apostate
@CharlieParker I'm not fully sure what you mean? If your question is why there are two forward slashes in _STRING_INNER: /.*?/, this is just the syntax for specifying a regular expression in Lark (lark-parser.readthedocs.io/en/latest/grammar.html#terminals). I guess this was chosen to make it easier to differentiate between regular strings ("somestring") and regular expressions (/someregex/). - Platypus
In general, you can refer to #15662469 for some more background on this convention. - Platypus
