grammarchomsky-hierarchy

Chomsky Language Types


I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is free grammar, type 1 is context sensitive, type 2 is context free whilst type 3 is regular. So, could someone please explain this and put it into context, thanks.


Solution

  • A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.

    Note: there can be more than one set of rules that desrcibe the same language.

    In general, the more restrictions placed on the rules, the less expressive the language (less words can be generated from the rules), but easier to recognize if a word belongs to the language the rules specify. Because of the latter, we want to identify languages with the most restrictions on their rules that will still allow us to generate the same language.

    A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):

    1. The set of non-terminal symbols (N)
    2. The set of terminal symbols (T)
    3. The set of production rules (P)
    4. The start symbol (S)

    The terminal symbols (AKA letters) are the symbols that words of the language consist of, ususally a subset of lowercase English letters (e.g. 'a', 'b', 'c')

    The non-terminal symbols are marking an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets are empty). The symbols used for non-terminals are usually subsets of uppercase English letters (e.g. 'A', 'B', 'C')

    The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.

    The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.

    The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).

    The generation of a word is unsuccessful, when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied on the current intermediary "word". In this case the generation has to strart anew from the starting symbol, following a different path of production rule applications.

    Example:

    L={T, N, P, S},

    where

    T={a, b, c}

    N={A, B, C, S}

    P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}

    which denotes the language of three words: "a", "abb" and "bbccc"

    An example application of the rules:

    S->AB->aB->abb

    where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.

    When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).

    The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).

    Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.

    Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).

    Examples

    The regular grammar

    T={a, b}

    N={A, B, S}

    P={S->aA, A->aA, A->aB, B->bB, B->b}

    generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that is it not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.

    The context-free grammar

    T={a, b}

    N={A, B, S}

    P={S->ASB, A->a, B->b}

    generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.

    The context-sensitive grammar

    T={a, b, c}

    N={A, B, C, H, S}

    P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}

    generates tha language which contains words that start with non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that it is not possible to describe a language where the words consist of a series of 'a's, where the number of 'a's is a prime with context-sensitive grammars, but it is possible to write an unrestricted grammar that generates that language.