Purpose of FIRST and FOLLOW sets in LL(1) parsers?
Asked Answered
S

1

30

Can anyone explain to me how FIRST and FOLLOW should be used in LL(1) grammar? I understand that they are used for syntax table construction, but I don't understand how.

Selfsatisfaction answered 1/12, 2013 at 20:58 Comment(0)
H
53

In an LL(1) parser, the parser works by maintaining a workspace initially seeded to the start symbol followed by the end-of-string marker (usually denoted $). At each step, it does one of the following:

  • If the first symbol of the workspace is a terminal, it matches it against the next token of input (or reports an error if it doesn't match.)
  • If the first symbol of the workspace is a nonterminal, it predicts what production to replace that nonterminal with.

The predict step is where FIRST and FOLLOW show up. The parser needs to be able to guess, based purely on the current nonterminal and the next token of input, which production to use. The question is how to do this.

Let's suppose that the current nonterminal is A and the next token of input is t. If you know the productions of A, which one would you choose to apply? There's one simple case to consider: if there's a production of the form A → tω, where ω is some arbitrary string, then you should pick that production because the t you're looking at as input will match the t at the front of the production.

There are also some complex cases to consider. Suppose you have a production of the form A → Bω, where B is a nonterminal and ω is some string. Under what circumstances would you want to guess this production? Well, if you know that the next terminal symbol is a t, you wouldn't want to guess this production unless you knew that B can expand to a string that starts with the terminal t (there's another case that we'll talk about in a second). This is where FIRST sets come in. In grammars without ε productions, the set FIRST(X) for some nonterminal X is the set of all terminals that can potentially appear at the start of some string derived from X. If you have a production of the form A → Bω and you see the nonterminal t, you'd guess to use that production precisely when t ∈ FIRST(B); that is, B can derive some string that starts with t. If B doesn't derive anything starting with t, then there's no reason to choose it, and if B does derive something starting with t, you'd want to make this choice so that you could eventually match the t against it.

Things get a bit trickier when ε productions are introduced. Now, let's suppose that you have a production of the form A → BCω, where B and C are nonterminals and ω is a string. Let's also suppose the next token of input is t. If t ∈ FIRST(B), then we'd choose this production, as mentioned above. However, what happens if t ∉ FIRST(B)? If there are ε productions in the grammar, we might still want to choose this production if B can derive ε and t ∈ FIRST(C). Why is this? If this happens, it means that we might be able to match the t by producing BCω, then producing ε from B, leaving Cω against which to match the t. This is one context where we might have to "look through" a nonterminal. Fortunately, this is handled by FIRST sets. If a nonterminal X can produce ε, then ε ∈ FIRST(X). Therefore, we can use FIRST sets to check whether we need to "look through" a nonterminal by seeing whether ε ∈ FIRST(X).

So far we haven't talked about FOLLOW sets. Where do they come in? Well, suppose that we're processing the nonterminal A, we see the terminal t, but none of the productions for A can actually consume the t. What do we do then? It turns out there's still a way that we can eat up that t. Remember that LL(1) parsers work by maintaining a workspace with a string in it. It's possible that the t we're looking at is not supposed to be matched against the current nonterminal A at all, and instead we're supposed to have A produce ε and then let some later nonterminal in the workspace match against the t. This is where FOLLOW sets come in. The FOLLOW set of a nonterminal X, denoted FOLLOW(X), is the set of all terminal symbols that can appear immediately after X in some derivation. When choosing which production to choose for A, we add in a final rule - if the terminal symbol t is in the FOLLOW set of A, we choose some production that ultimately will produce ε. That way, the A will "disappear" and we can match the t against some character that appears after the A nonterminal.

This isn't a complete introduction to LL(1) parsing, but I hope it helps you see why we need FIRST and FOLLOW sets. For more information, pick up a book on parsing (I recommend Parsing Techniques: A Practical Guide by Grune and Jacobs) or take a course on compilers. As a totally shameless plug, I taught a compilers course in Summer 2012-2013 and all of the lecture slides are available online.

Hope this helps!

Heathenry answered 1/12, 2013 at 21:13 Comment(9)
Thank you, it is very well explained, and it is very helpful! I wish you all the best!Selfsatisfaction
I would just add that it is an honor to get an answer from a Stanford professor. :) It would be great to listen courses at your university...Selfsatisfaction
@templatetypedef:You're slides are very good and interesting.Not just these one(CS143) but CS103 too.I would like to know which is given more priority:FIRST or FOLLOW?I have looked through wikipedia and it says that it(FIRST-FOLLOW conflicts) come under situations like left recursion?Is there any other case that this happens other than the mentioned case(left recursion)?Also why FOLLOW sets doesn't contain 'epsilon'.Is there any reason behind or is it just a rule?Artwork
@templatetypedef:Sorry for commenting about first-follow conflicts since you have already answered it.You're slides are very good and interesting not just these one(CS143) but CS103 too.Could you tell why FOLLOW sets doesn't contain epsilon?Artwork
@Artwork I think it's just how it's defined. FOLLOW(A) is the set of all characters that can legally appear after A in any derivation, and since the empty string isn't a character, it can't appear in FOLLOW sets.Heathenry
@templatetypedef:Okay.That's great.Is it good to have epsilon in FOLLOW(A)?Artwork
@templatetypedef:Is this getting very bad?I'm just curious to know why 'epsilon' is not present in FOLLOW(A) but it's included in FIRST(A). I can see that people say that 'epsilon' is not considered a visible input to follow any symbol.Then could you tell why there is 'epsilon' in FIRST(A) if 'epsilon' is not a visible input.Artwork
@Heathenry Thank you for making the course notes publicly available, they were very helpful.Reminiscent
I found your lecture on Advanced Parsing very interesting. Back in the 1980s I discovered and published a way to take the LR(0) states and apply a kind of simulation of lookahead to them in order to generate a full LR(1) grammar (or LR(k)). While it is true that in the worst case the result is an exponential explosion of states, for practical languages such as programming languages the resulting LR(1) grammar only has a few more states to handle the small number of conflicts. This procedure eliminates the need for LALR(1) and the many other partial solutions that sometimes fail.Gyrus

© 2022 - 2024 — McMap. All rights reserved.