Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.

Two such symbols are InputElementDiv and InputElementRegExp.

In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can be a division operator, the start of a regex literal, or a comment delimiter. The lexer cannot distinguish between a division operator and a regex literal on its own, so it must rely on context information from the parser.
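
For instance (a purely illustrative snippet), all three readings of the slash can occur within a few lines:

    const a = 6, b = 2;
    const q = a / b;       // `/` after an identifier is the division operator
    const re = /ab+c/gi;   // `/` where an expression may start opens a regex literal
    /* and a slash followed by `*` or `/` opens a comment */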

I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.

Questions

  • Why not just use a single goal symbol like so:
InputElement ::
     [...]
     DivPunctuator
     RegularExpressionLiteral
     [...]

and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?

  • What are some other languages that use multiple goal symbols in their lexical grammar?

  • How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the left-hand sides of its productions are not non-terminals surrounded by a context of terminal and non-terminal symbols).

Applicable answered 16/11, 2021 at 2:25 Comment(1)
"goal" and "start" symbol for a grammar are equivalent terms. Lexers should work independent from the parser for performance. But, practically speaking, most parsers for the major languages do! A lexer is a recognizer that has its own grammar containing alphabet, rules, and start symbol(s). In Antlr4, this is explicit using “lexer grammar” syntax, except the start symbol is not explicit because it is trivially a rule that can derive any token. In Antlr4, you can switch the start symbol with “mode”. grep for “mode” in [grammars-v4]( github.com/antlr/grammars-v4).Outofdoors

Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).

However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.

As you note, context-sensitivity is usually formalised by a grammar whose productions have a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context (everything other than the non-terminal to be expanded) needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars, writable in the same production notation, which don't conform even to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive; it's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected, but it always has a clear underlying definition.
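
For concreteness, here is a toy context-sensitive production (invented for illustration, not drawn from any real grammar); the surrounding symbols a and c are the context, and they must pass through the rewrite untouched:

    a B c → a x c        (B may be rewritten as x, but only between a and c)

In a context-free grammar, by contrast, every left-hand side is a lone non-terminal, e.g. B → x, which can be applied wherever B occurs.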

In formal language theory, there are no separate lexical and syntactic productions; there are just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out to be somewhat easier to write the grammar for a language, and to parse it, with a division between lexical and syntactic analysis.

In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.

If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the acceptable token non-terminal would be InputElementDiv, while in other productions it would be InputElementRegExp. As I noted, that's not context-sensitivity (it's just the normal functioning of a context-free grammar), but it does require reorganising the program so that the parser's goals can be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback", and sometimes by terms which are rather less value-neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated.

C++ is a pretty extreme example, and indeed there are C++ programs which are hard for humans to parse as well, which is telling. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without any noticeable intellectual effort. And while the lexical feedback required to implement an ECMAScript parser makes the architecture a little less tidy, it's really not a difficult task, either.
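
As a rough sketch of what lexical feedback looks like in practice (all names are invented for illustration, and the scanning is drastically simplified; this is not any real engine's API), the parser simply passes its current lexical goal along with each token request:

    // Minimal sketch of lexical feedback. The parser chooses the goal symbol.
    function nextToken(source, pos, goal) {
      const ch = source[pos];
      if (ch === "/" && goal === "InputElementRegExp") {
        // Scan a whole regexp literal /body/flags (no classes or escapes here).
        let end = pos + 1;
        while (end < source.length && source[end] !== "/") end++;
        while (/[a-z]/.test(source[end + 1] ?? "")) end++;  // trailing flags
        return { type: "RegExpLiteral", text: source.slice(pos, end + 1) };
      }
      if (ch === "/") {
        // Division context: the slash may fuse with `=` into a compound operator.
        const text = source[pos + 1] === "=" ? "/=" : "/";
        return { type: "DivPunctuator", text };
      }
      // ... all other token kinds, unaffected by the goal, elided ...
      return { type: "Other", text: ch };
    }

    console.log(nextToken("/abc/gi", 0, "InputElementRegExp")); // one regexp token
    console.log(nextToken("a /= 4", 2, "InputElementDiv"));     // the /= operator

The point is that no extra machinery is needed beyond threading one extra argument through: the goal is computable from the parser's own state, since a regexp literal can only begin where an expression may begin.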

Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are ordinary lexical non-terminals, like any others, and the appropriate goal for each parser state can be derived from the syntactic grammar itself (by computing the set of possible initial lexical non-terminals in that state). So no additional parsing power is involved, and there's no essential difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.

Notes

  1. The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in

    a /=4/gi;
    

    is a single token (a compound assignment operator), 4 is a numeric literal, and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), the whole of /=4/gi would be scanned as one regular-expression token (body =4, flags gi): the / and the = would no longer combine into a compound operator, and 4, g and i would be absorbed into the literal instead of forming tokens of their own. (Both token streams are spelled out in the sketch after these notes.)

  2. Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.

  3. There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly called "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
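
To make note 1 concrete, here are the two tokenizations of that line side by side (the second is hypothetical, since the real grammar never offers the regexp goal immediately after an identifier):

    Goal InputElementDiv (what a parser actually requests after `a`):
      a    /=    4    /    gi    ;
      Identifier, DivPunctuator, NumericLiteral, DivPunctuator, Identifier, Punctuator

    Goal InputElementRegExp (hypothetical at this position):
      a    /=4/gi    ;
      Identifier, RegularExpressionLiteral (body "=4", flags "gi"), Punctuator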

Counterpoison answered 16/11, 2021 at 4:49 Comment(9)
(there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production"): That would be true if the productions in question were those of goal symbols, but that's not the case in the original question.Flop
@michaelDyck: OK, that's fair. I'll rewrite it to be more precise.Counterpoison
rici, @michaelDyck, just to clarify, my main question asks why we need the multiple goal symbols. Could they just be the spec authors' version of "lexical states"? This article mentions that lexical states are particularly useful when we need to tokenize language constructs that accept strings from a recursive sub-language like regex or template literals.Applicable
@user51462: yes, that's correct. But unlike the full generality of "lexical states", goal symbols can be derived from the syntactic grammar by constructing the set of possible initial lexical non-terminals for each state in the parser's state machine. (You have to do that to write a parser; the standard only provides a simplified algorithm.) So I stand by my basic claim that no additional parsing power is being provided; the goal symbols are inherent in the grammar. But they do make for a convenient model to organise the work.Counterpoison
I'll edit all that into the answer. But not this instant.Counterpoison
Thanks for getting back to me so quickly @rici. When you say "inherent" in the grammar, do you mean that they are an essential feature? That can't be true if they are just there for convenience, right?Applicable
No, i mean the reverse. They are derivable from the grammar so they are there whether or not stated explicitly. That's what "inherent" means to me. Maybe I use it incorrectly.Counterpoison
@user51462: perhaps "implicit" would have been a better choice of word.Counterpoison
Oh I see, sorry, it makes sense when you say derivable from the grammar.Applicable

Why not just use a single goal symbol like so:

InputElement ::
  ...
  DivPunctuator
  RegularExpressionLiteral
  ...

and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?

Note that DivPunctuator and RegExLiteral aren't productions per se; rather, they're nonterminals. And in this context, they're right-hand sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)

In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).

E.g., you could define the parameter Div, where:

  • +Div means "a slash should be recognized as a DivPunctuator", and
  • ~Div means "a slash should be recognized as the start of a RegExLiteral".

So then your production would become

InputElement[Div] ::
  ...
  [+Div] DivPunctuator
  [~Div] RegularExpressionLiteral
  ...

But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
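
Spelling out the expansion that section 5.1.5 defines makes the "modulo renaming" point concrete: a parameterized production is shorthand for one ordinary production per parameter setting (the suffixed names below follow the spec's convention; the elided alternatives are placeholders):

    InputElement_Div ::             (used where the spec uses InputElementDiv)
      ...
      DivPunctuator
      ...

    InputElement ::                 (used where the spec uses InputElementRegExp)
      ...
      RegularExpressionLiteral
      ...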

What are some other languages that use multiple goal symbols in their lexical grammar?

I think most don't try to define a single symbol that derives all tokens (or input elements), let alone divide it into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.

Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
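
As an illustration of that more common arrangement, here is an invented mini-grammar (not from any real specification) in which token rules are simply referenced by name from a syntactic production, with no umbrella "input element" symbol at all:

    Identifier ::
      Letter
      Identifier Letter

    NumericLiteral ::
      Digit
      NumericLiteral Digit

    AssignmentStatement :
      Identifier = NumericLiteral ;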

How would we classify the ECMAScript lexical grammar?

It's basically context-free, plus some extensions.

Flop answered 18/11, 2021 at 2:41 Comment(9)
What part of the lexical grammar is not context-free? (For that matter, what part of the syntactic grammar as provided, not including the many context-sensitive constraints listed in the narrative)?Counterpoison
Grammatical parameters, '?' (for optional), lookahead constraints, order-disambiguation (in the Annex B Pattern grammar), "but not", "but only if", cover grammars, automatic semicolon insertion.Flop
Grammatical parameters can be eliminated with macro substitution. They are all finite (and not even unpractically large). Optionality (like repetition) can also be macro substituted. Lookahead constraints could be used to create context-sensitive grammars but they are not used that way anywhere. (The intersection of a CFG and a regular grammar is CFG.) Ditto for "but not" and "but only if". (Could be CS but not used that way)Counterpoison
My understanding of the ordering constraints is that they are only used for disambiguation. An ambiguous grammar can still be context-free. Also, my impression was that the ordering could be replaced by an unordered predicate requiring only bounded lookahead. However, I haven't done a thorough analysis.Counterpoison
Automatic semicolon insertion is definitely bounded context; in fact, unless something has changed recently, only a single lookahead is necessary.Counterpoison
I noticed that there is now a context-sensitive requirement on the use of numbered backreferences in regular expressions, where the backreference cannot specify a number greater than the number of captures in the regex. So I'll give you that one.Counterpoison
If by "covering grammar" you mean the second parse mandated by section 5.2.4, then it's possible that some of those are context-sensitive. But as I said above, if all it is doing is removing ambiguity, then the language is still context-free. Anyway, that's not part of the lexical grammar.Counterpoison
There are a lot of "early error" rules which are definitely context-sensitive. Those are the narrative constraints I referred to initially. That makes the language context-sensitive, like most programming languages. If you want to argue that those are grammatical constraints as well, I can only agree. I've made precisely that argument many times. :-)Counterpoison
@rici: Much as I'd like to respond to your points, I think we're already into "secondary discussion".Flop
