Chomsky Language Types
Asked Answered
S

2

16

I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is free grammar, type 1 is context sensitive, type 2 is context free whilst type 3 is regular. So, could someone please explain this and put it into context, thanks.

Soukup answered 23/5, 2012 at 12:1 Comment(3)
The clearest difference is in the rewrite/production rules, have you studied those?Akerboom
I'm not sure, so probably not.Soukup
This is quite good question, something many people don't get their mind wrapped around. But it makes a huge difference when you are actually writing parsers.Xenogenesis
P
28

A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.

Note: there can be more than one set of rules that desrcibe the same language.

In general, the more restrictions placed on the rules, the less expressive the language (less words can be generated from the rules), but easier to recognize if a word belongs to the language the rules specify. Because of the latter, we want to identify languages with the most restrictions on their rules that will still allow us to generate the same language.

A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):

  1. The set of non-terminal symbols (N)
  2. The set of terminal symbols (T)
  3. The set of production rules (P)
  4. The start symbol (S)

The terminal symbols (AKA letters) are the symbols that words of the language consist of, ususally a subset of lowercase English letters (e.g. 'a', 'b', 'c')

The non-terminal symbols are marking an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets are empty). The symbols used for non-terminals are usually subsets of uppercase English letters (e.g. 'A', 'B', 'C')

The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.

The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.

The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).

The generation of a word is unsuccessful, when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied on the current intermediary "word". In this case the generation has to strart anew from the starting symbol, following a different path of production rule applications.

Example:

L={T, N, P, S},

where

T={a, b, c}

N={A, B, C, S}

P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}

which denotes the language of three words: "a", "abb" and "bbccc"

An example application of the rules:

S->AB->aB->abb

where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.

When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).

The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).

  • Languages generated by Type 0 (or Unrestricted) grammars are most expressive (less restricted). The set of Recursively Enumerable languages contain the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each an every (and only these) words of the language. The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").

  • Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty, and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rules). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this rule example can be replaced with gamma, only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta does not change).

  • Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and have no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages have context free generating grammars.

  • Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the starting symbol S does not appear on the right-side of any rules), which means that each non-terminal must produce exactly one terminal symbol (and possibly one non-terminal as well). The regular expressions generate/recognize languages of this type.

Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.

Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).

Examples

The regular grammar

T={a, b}

N={A, B, S}

P={S->aA, A->aA, A->aB, B->bB, B->b}

generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that is it not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.

The context-free grammar

T={a, b}

N={A, B, S}

P={S->ASB, A->a, B->b}

generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.

The context-sensitive grammar

T={a, b, c}

N={A, B, C, H, S}

P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}

generates tha language which contains words that start with non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that it is not possible to describe a language where the words consist of a series of 'a's, where the number of 'a's is a prime with context-sensitive grammars, but it is possible to write an unrestricted grammar that generates that language.

Plug answered 30/5, 2012 at 12:21 Comment(10)
Great answer. It took a some time to understand each and every bit and I build parsers professionally. Could it be explained in simpler terms? (I should try.)Xenogenesis
I found some confusion of the use of Word and Language. Words are terminals and need to be listed each and every. A grammar will not generate words, it will generate sentences. Chomsky did not classify languages but grammars. A language can have multiple valid grammar definitions each being of different order.Xenogenesis
@SeanFarrell - Usually the following terminology is used for formal languages: terminals are symbols (letters), words (of the language) are series of terminals and the language is a set of words. You use the name sentence, for what I called word. You are correct that the Chomsky hierarchy is classifying grammars, I hope my changes made this more clear.Plug
This is odd, the way I remember it from the university was different. But then my courses where in German... Then again that was a LONG time ago. Nevertheless your answer is great.Xenogenesis
@Xenogenesis I covered Type III, II, and I in this answer (https://mcmap.net/q/617118/-the-difference-between-chomsky-type-3-and-chomsky-type-2-grammar). Come up with an easy-to-understand description of a Type 0 and you have your wish.Nutty
Very nice answer, but don't you sometimes confuse left and right side of the rules ? For example you say ... the rules have one restriction: the right-side of a rule cannot be empty ..., ... This restricts the rules to have at least one non-terminal on their right-side ... and ... This restricts the rules to have exactly one non-terminal on their right-side ..., shouldn't that all be left-side ?Indention
@Indention - You are right. At least I was consistent throughout... Corrected, thx.Plug
By rereading your answer, I think, I found a couple more typos, could please check my edits ?Indention
This is an old answer, but you mentioned that English is not recognisable by Type-0 grammars. If this is true, then I'd really like to see a proof.Osher
@Osher Can't give you proof, but the trouble starts with the definition of what to consider "English". Think of sentences that include "foreign" words or ones like "The the is a definite article."Plug
V
1

There are 4 types of grammars

TYPE-0 :

  • Grammar accepted by Unrestricted Grammar

  • Language accepted by Recursively enumerable language

  • Automaton is Turing machine

TYPE-1 :

  • Grammar accepted by Context sensitive Grammar

  • Language accepted by Context sensitive language

  • Automaton is Linear bounded automaton

TYPE-2 :

  • Grammar accepted by Context free Grammar

  • Language accepted by Context free language

  • Automaton is PushDown automaton

TYPE-3 :

  • Grammar accepted by Regular Grammar

  • Language accepted by Regular language

  • Automaton is Finite state automaton

  • -

Chomsky is a Normal Form that each production has form:

  • A->BC or A->a

There can be Two variables or Only one terminal for rule in Chomsky

Viborg answered 26/12, 2018 at 7:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.