Is the Python's grammar LL(1)?

Asked 3/12, 2018 at 15:14 Answered 3/12, 2018 at 16:1

Solved python parsing compiler-construction grammar

Possible duplicate for this question however for me it's not specific enough.

The python grammar is claimed to be LL(1), but I've noticed some expressions in the Python grammar that really confuse me, for example, the arguments in the following function call:

foo(a)
foo(a=a)

corresponds to the following grammar:

argument: ( test [comp_for] |
            test '=' test |
            '**' test |
            '*' test )

test appears twice in the first position of the grammar. It means that by only looking at test Python cannot determine it's test [comp_for] or test '=' test.

More examples:

comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'

Note 'is' and 'is' 'not'

subscript: test | [test] ':' [test] [sliceop]

test also appears twice.

Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.

Heterotaxis answered 3/12, 2018 at 15:14 Comment(5)

@PatrickHaugh I don't think it's fair to label this a duplicate when neither the linked question nor answer address OP's specific concerns regarding is not and test. – Malamut 3/12, 2018 at 15:19

Doesn't the (1) mean the parser has one token of lookahead? So if it's "on" the test, it can look at the next token (but no farther). (My compilers class was many moons ago, so grain of salt) – Intracutaneous 3/12, 2018 at 15:19

@JETM @Malamut Fair enough. I think this question boils down to a misunderstanding about the nature of LL(1) grammars/parsers though. – Intracutaneous 3/12, 2018 at 15:22

@PatrickHaugh It counts the current token as lookahead (if you can look at it without consuming it). So LL(1) means you can look at the current token only and LL(2) would mean that you can look one token further. LL(0) (which isn't really a thing) would mean that you can never look at a token without consuming it, so it wouldn't be possible to select an alternative based on the current token. – Malamut 3/12, 2018 at 15:23

As per en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form , a EBNF describes a language rather than grammar. There's no 1-1 correspondence between a EBNF and the resulting grammar, there are multiple possible grammars for a given EBNF. So it only matters if some of them can be LL(1). – Chauvin 3/12, 2018 at 15:50

The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality ([a]) and Kleene closure ((a b c)*). LL(1), however, is a category which appies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.

In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)

In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:

simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

In the above production, the repetition of (';' small_stmt)* may be followed by a ';', which means that a simple while loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into CFG by left-factoring after expanding the repetition:

simple_stmt: small_stmt rest_A
rest_A     : ';' rest_B
           | NEWLINE
rest_B     : small_stmt rest_A
           | NEWLINE

Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing or for explaining the syntax. It would be hard to read, and the EBNF can be directly transformed into a parser.

This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)

Bassorilievo answered 3/12, 2018 at 16:1 Comment(2)

"Similarly, the entire EBNF can be transformed into an LL(1) grammar." How do we know this is true? Is there a theoretical result about EBNF grammars in general, or is it just true for Python's grammar specifically (based on the fact that Python's parser generator handles the grammar without errors)? – Foregone 29/4, 2020 at 2:49

@user200783: it's specifically about Python. It is certainly not true of all grammars. And yes, the demonstration is being able to produce a one-token lookahead LL- parser. – Bassorilievo 29/4, 2020 at 4:43

You're correct that constructs like 'is' | 'is' 'not' aren't LL(1). They can be left-factored to LL(1) quite easily by changing it to 'is' notOpt where notOpt: 'not' | ϵ or, if you allow EBNF syntax, just 'is' 'not'? (or 'is' ['not'] depending on the flavor of EBNF).

So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.

Malamut answered 3/12, 2018 at 15:28 Comment(2)

Just an additional thought from Wikipedia: A ε-free LL(1) grammar is also a SLR(1) grammar. So Python is not SLR(1) – Siusiubhan 3/12, 2018 at 15:31

#53597080 – Chauvin 3/12, 2018 at 15:50

Recommended topics

Hot tags