Is JavaScript a Context Free Language?
Asked Answered
A

3

19

This article on how browsers work explains how CSS is context free, while HTML is not. But what about JavaScript, is JavaScript context free?

I am learning about CFG and formal proofs, but am a long way away from understanding how to figure this out. Does anyone know if JavaScript is context free or not?

Aurelia answered 7/6, 2015 at 18:52 Comment(1)
This might have been a better fit for Computer ScienceFoiled
F
21

No, JavaScript is not a context-free language.

It is very close to one, and the ECMAScript 5 specification does indeed use a context-free grammar1 to describe the language's syntax (you can find all productions in Annex A).

Of course, it does make some extensions to pure context-free grammatical productions, and describes extra behaviour of the parser. One particular thing is the usage of lookahead which still makes a context-free languages, but would complicate the grammar a lot if it couldn't be used for some rules. Not allowing certain things to appear in strict mode code is similar - it could be done by adjusting the grammar (with far more productions), but the rule is much easier expressed by leaving the BNF.

However, there are also some2 rules that do make the language not context-free. You'll find an overview in the description of early errors, which can make a program code invalid. That object literals must not contain duplicate property names and that function parameter lists must not contain duplicate identifiers are two rules that cannot be expressed using (finite) context-free grammars.
My gut tells me that the automatic semicolon insertion falls in the same box, but I think its rules are too complicated to even attempt a proof here.

1: Actually it uses two grammars, a lexical and a syntactical one, where the first disambiguates between division expressions and regular expressions, and does produce the tokens that are the input to the second grammar.
2: Rather few actually, compared to other programming languages

Foiled answered 7/6, 2015 at 22:10 Comment(16)
Automatic semicolom insertion is definitely context-free. I think its even lr(1). But the grammar would be a monster. The same can bw said for disambiguating /. The duplicate id issues are clearly not CF, although if you are going to be that strict, then almost no language is CF.Medora
@rici: Hm, I'm not convinced. ASI needs to detect "[a token] that is not allowed by any production of the grammar, [but] is then allowed if preceded by a semicolon". That sounds like complement and intersection to me, which both do not (necessarily) form a context-free language. But I would agree that the resulting grammar had to be a monster :-)Foiled
"a token ... that is not allowed by any production of the grammar" only requires a lookahead computation. The semicolon is not inserted if the next token is acceptable, even if there is no parse with that prefix. So I think it can be treated as a left-derivative, under which CFGs are closed. There is no requirement that the sentence be accepted with the semicolon, but of course normal parsing will guarantee that. I'm unable to locate the quote "is then allowed if preceded by a semicolon" in the JS standard documents I'm aware of.Medora
@rici: You might be right. I didn't think that ASI only looks at a single token. And sorry for that misquote, everything following the "[but]" was me paraphrasing that the normal parsing needs to accept the sentence with the semicolon.Foiled
"Not allowing certain things to appear in strict mode code is similar - it could be done by adjusting the grammar (with far more productions), but the rule is much easier expressed by leaving the BNF", this is not correct, as one can have endless variations which should all be taken into account, so it makes the number of productions/rules in CFG infinite (thus not applicable as CFG)Dapple
@rici, exactly, almost no language is CF, unless by grammar specificatiion one means that a grammar can accept all valid strings but not reject non-valid ones, given the language semantics and specification (which is another definition). Then in that case almost all languages would be CFDapple
to put it another way, for every "rule-augmented" CFG grammar which can handle "strict mode" productions i can give a valid example code which is not handled, by it, for example all i have to do is add dummy statements in betweenDapple
@Bergi, any example of context-sensitive parsing that does not use "strict mode" examples for JavaScript (also not "automatic semicolon insertion")? i think there should be some, but i'm still thinking about it (given the complexity it is not easy to think of good examples on-the-fly)Dapple
@Foiled ASI is indeed a single token lookahead, even though this isn't immediately obviously from how it's defined (sadly).Rector
@Bergi, added some javascript examples not requiring "strict mode" in my answer, open to suggestionsDapple
@NikosM.: No, strict mode rules do not require endless variations of rules. You'd just replace the current FunctionBody = SourceElements* with FunctionBody = SloppySourceElements* | UseStrictDirective StrictSourceElements*, and would have to duplicate all productions for the SourceElement goal (and distinguish between strict and sloppy mode for each). This duplication is ugly, but does still define a context-free language (with a finite set of grammar rules)Foiled
@Bergi, you talk about a subset of strict mode rules (excluding unique function declarations et all). Still i have some doubts this is so, provided the strict mode has a function-scope level interleaved with "sloppy" code. Certainly needs further investigation whether and how it can be defined only as CFG (i understand one can split the CFG in two parts and follow either route, but the interleaving is a little complicated)Dapple
@NikosM.: Yeah, I'm talking about a subset of the strict mode rules, specifically the one that does not allow with statements in strict code. That duplicate identifiers are forbidden is an other problem. The interleaving actually wouldn't be that complicated, it's basically StrictFunctionBody = UseStrictDirective? StrictSourceElements* and SloppyFunctionBody = SloppySourceElements* | UseStrictDirective StrictSourceElements*Foiled
@Bergi, ok i'm convinced , i agree for some subset the grammar can be split in the various casesDapple
@NikosM.: Languages which require variables to be declared or which prohibit the same identifier to be declared twice in the same context are certainly not context-free, but one could argue that the non-context-freeness is unimportant. ES (which has some such prohibitions even in non-strict mode, eg. es5.github.io/#x11.1.5, step 4 for the second production for PropertyNameAndValueList) falls into this category. Scheme (iirc) and Lua are truly context-free. C is non-trivially not CF because the "kind" of an identifier changes the identifier's token-type. C++ is even worse.Medora
@rici, agree from a practical point-of-view, even a mild context-sensitive language can be treated as if CF with slight modifications.Dapple
D
10

No programming language is (completely) context-free (i would say including CSS). Even though context-free grammars (CFGs) may be used to define/generate compilers/parsers for the language.

The simple fact (for example) that variables need to be defined first, before used, or that declarations involving identifiers should be unique, makes the language "context-sensitive".

A grammar for a (programming) language is supposed to describe (and generate) strings which are only the valid programs in that language (syntacticaly, but also semanticaly). Yet a CFG can describe and generate strings which are not valid programs (given the language semantics and specification). Conditions which describe valid programs (like for example: 1. a class needs to be defined before using new class(), 2. ids must match etc..) require context-sensitivity.

No CFG (with any finite number of productions) can correctly represent only the valid strings of this language: { anbncn : n >= 1 }, where n should be the same for a, b, c (it should match). Note one can indeed define a CFG for (a superset of) this language, but it will accept also non-valid strings along with valid ones (and then by other means filter them out), this is not what a grammar specification for a language is supposed to do. It should accept only the valid strings and reject the non-valid. In an analogy with statistics, one could say that a grammar specification for a language should eliminate/minimise both Type-I (reject valid strings) and Type-II (accept non-valid strings) errors, not just one of them.

Let me give a simple example in the context of JavaScript (since variables may seem as posing no problem for JavaScript).

In JavaScript (in strict mode), duplicate named function declaration is not valid. So this is not valid:

function duplicateFunc(){}
function duplicateFunc(){} // duplicate named function declaration

So the program is not correct, yet a CFG cannot handle this type of condition.

Even turning on strict mode itself is context-sensitive a subset of strict mode rules can be handled by spliting the CFG in cases and parsing accordingly as per @Bergi's answer (strict mode examples removed)

[UPDATE]

i will try to give a couple of examples of JavaScript non-context-free code which does not require "strict mode" (open to suggestions/corrections).

The use of reserved words/keywords is an extension (or limitation) on the grammar. It is an extraneous feature, so the following examples should count as examples of non-CF behaviour.

var var; // identifier using reserved name
var function; // identifier using reserved name
obj.var; // reserved name used as (explicit) property
obj["var"]; // this is fine!!
Object++; // built-in type used as numeric variable

[/UPDATE]

So the context plays a part in the correct parsing of the program. As it is said "context is everything"!

However this context-sensitivity can be handled (hopefuly) by only slight extensions to context-free grammars (like for example Attribute Grammars, Affix Grammars, TAG Grammars and so on), which still make for efficient parsing (meaning in polynomial time).

[UPDATE]

"i would say including CSS"

To elaborate a little on this statement. CSS1 would be CF, but as CSS specification adds more features inclufing variable support (e.g css-counters) it makes the CSS code context-sensitive in the sense described above (e.g variables need to be defined before used). so the following css code would be parsed by the browser (and ignored as it is not valid) but it cannot be described by a CFG

body { }
h3::before {
  counter-increment: section;               /* no counter section has been defined, not valid css code */
  content: "Section" counter(section) ": "; /* Display the counter */
}

[/UPDATE]

Dapple answered 7/6, 2015 at 18:59 Comment(14)
In JavaScript, a variable does not need to be declared before being used. And even if it throws an error at runtime (it often doesn't), that doesn't mean the program is syntactically invalid.Foiled
Your example is wrong. Even a context-free language can distinguish between [b] being used as a property accessor or an array literal.Foiled
@Bergi, added another example with duplicate function declaration, i would argue that the first example is also correct, but it is not so criticalDapple
Thanks! You might want to mention that it's only invalid in strict mode, though.Foiled
@Bergi, yeah sure, added i!Dapple
"So the context plays a part in the correct parsing of the program" - I don't think that's an accurate description of what context-free language means. It's about context of substituion rules in the grammar, not that a string can mean different things in different "contexts" (productions). Neither strict mode nor reserved keywords make the language non-context-free.Foiled
@Bergi, how about this Object++;?Dapple
Object++ is not a syntax error. In fact it works quite well and returns NaN.Foiled
@Bergi, but the thing is is it valid javascript code (given the language semantics) or not? It does not matter whether it produces a runtime error instead of compile-time error (which in some cases javascript parser would prefer for sure), neither whether it can be forgiven so to speak.Dapple
Yes, it is valid javascript code (it doesn't even throw a runtime error). It might be a mistake of the programmer, and "wrong" for the requirements of the application, but it still is valid. And btw, no, you don't want a language where compilation is undecidable (although they actually exist, using some advanced type systems).Foiled
@Bergi, true that is why minimal extensions to CFGs are used to cover these cases but be tractable (read polynomial). But we are not discussing this here, whether one needs a polynomial parser or not. But whether a pure CFG can describe the valid javascript programs and only them (not a superset of them). Anyway this can be a long discusion, editing my answer to remove the strict mode examples (as per your answer), will leave the rest though (as per my comments)Dapple
I'm saying that no parser for a language that describes turing-complete programs can reject all programs that would throw runtime errors, because that would be equivalent to solving the halting problem.Foiled
@NikosM.: The fact that a reserved word is sometimes not reserved does not make a language context-sensitive. I'm not sure what you intended to imply in your example, but my reading ES5 is that var function and var var are syntax errors, while obj.var and obj["var"] are both perfectly valid (and semantically similar). I base that on where the CFG uses Identifier and where it uses IdentifierName, and the fact that you can write a CFG like that makes it clear that the feature does not make the grammar non-context-free.Medora
@rici, correct, but the argument is if a pure CFG can describe this, since as far as i can see these are extraneous features put on top of the CFG (so we are talking about sth else). Maybe better examples can be found??Dapple
D
-2

I'm pretty certain JS is not context free — given an arbitrary code artefact, you cannot necessarily determine its exact meaning without knowing its context.

The first example that comes to mind is {} — does this represent an empty object literal or an empty statement block? It's impossible to decide without context, but because the language allows semicolons to be omitted from statements ending in '}' (as do most languages with C-like syntax) it may also be undecidable with context! Consider {x: {}} — this could be an object literal with the "x" field containing an empty object, or a statement block with a labelled sub-statement (where the label is 'x' and the sub-statement is {}). Perhaps the language specification has some rules for selecting the correct interpretation in such scenarios, but in any case the language does not appear to be context-free, judging by these examples alone.

JavaScript's 'automatic semicolon insertion' feature certainly doesn't help in distinguishing expressions and statements.

Here's another one to think about: function x() {} — what does this do? If it's a statement, it declares a new hoisted variable 'x' with this function as its value. If it's an expression, it simply evaluates to a function which has an upvalue 'x' bound to the same function (for self-reference).

Dearden answered 7/6, 2015 at 19:27 Comment(4)
Not sure I understand this answer. {} is an empty object literal. function x() {} is a statement.Friede
@Jamie: {} without function x() in front of it is a legal code block in javascript - try it. You can do {var x = 1} or {x:1} and one is a code block and the other is an object.Catabasis
@Jamie: function x() {} is both a valid statement and a valid expression. You can't know until you add some context: it's definitely a statement in the case of ;function x() {};, while it's definitely an expression in the case of (function x() {}).Dearden
@cauterite: yes, but that's not what context-free means.Medora

© 2022 - 2024 — McMap. All rights reserved.