Grammatical Format
This chapter corresponds to chapter 13 of the second edition of Cohen (1997).
Regular Grammars
The regular grammars form a subset of the set of context-free grammars. On
the righthand side of every production of a regular grammar is either a word
(which is a string of terminals) or a semiword (which is a string of termi-
nals followed by exactly one nonterminal). We shall see that the languages
generated by the regular grammars are exactly the regular languages.
Theorem 21 states that every regular language is also a context-free lan-
guage. Recall that the languages accepted by FAs are precisely the regular
languages. Cohen takes any FA and shows how to turn it into a CFG, thus
he shows how to write down a CFG that generates the same language that
is accepted by the FA. Theorem 22 states that the languages generated by
regular grammars are all regular. Although is does not follow directly from
these two theorems, every regular language can be generated by a regular
grammar. The reason is that the productions of the grammar that is built
up in the proof of Theorem 21 are of the correct form, i.e. we are actually
constructing a regular grammar.
a
?
a
?
a
?
a
?
−S b - X b - +Y b - +Z
b
?
-
a, b
Figure 1: An FA - acceptor of a regular language
Suppose we have a CFG that is not a regular grammar. Can we be sure
1
that the language generated by this grammar is definitely not regular? No.
It is possible that such a CFG does generate a regular language. What we
do know from the above results is that, if we have a CFG that generates a
regular language, we will be able to find a regular grammar generating the
same language.
Chomsky Normal Form
Every now and then we want to prove that something or other holds for every
CFG that exists. The problem is that CFGs are very flexible in terms of what
their productions may look like. If we could find some type of ‘representative’
CFG with a more restrictive form, a form in which any given CFG can be
rewritten, it would make life easier for us. This is where the Chomsky Normal
Form (CNF) comes in. A CFG is said to be in CNF if every production has
the form
• nonterminal −→ exactly two nonterminals, or
• nonterminal −→ exactly one terminal.
The amazing thing is that, for every CFG that does not generate the
empty word, there is a CFG in CNF that generates exactly the same lan-
guage. And for every CFG that can generate the empty word, there is is a
CFG which is almost in CNF that generates the same language. To obtain
the CFG which is almost in CFG, we convert the original grammar to CNF
and then just add the production S −→ Λ.
The conversion of a CFG into CNF consists of a number of steps. One
starts by removing (‘killing’, according to Cohen) the Λ-productions, then
the unit productions are killed and finally the productions are written in the
required form.
When the modified replacement rule is used in the process of killing the
Λ-productions, it is important to make sure that all the productions as de-
scribed by the rule are added. This will ensure that the grammar can still
generate all the words that could be generated when the Λ-productions were
part of the CFG. Consider the following CFG:
S −→ N M aN
M −→ a | Λ
N −→ b | Λ.
Since M and N are nullible, we have to add the following productions:
S −→ M aN | N aN | N M a | aN | M a | N a | a.
The new CFG will be
2
S −→ N M aN | M aN | N aN | N M a | aN | M a | N a | a
M −→ a
N −→ b.
Suppose we forgot one combination of nullible nonterminals occurring in
the production S −→ N M aN , say the combination consisting of the first
N and the M . That would leave our new CFG without the new production
S −→ aN , and the new grammar will not be able to generate the word ab.
This word could be generated by the original grammar. From this exam-
ple we see that it is quite possible to have more than one occurrence of the
same nullable nonterminal in a production (N above) and thus instead of
the phrase ‘all possible subsets of nullable nonterminals’ used by Cohen, we
should rather use the phrase ‘all possible combinations of nullable nonter-
minals on the righthand side of the production’ in the modified replacement
rule. The possible combinations in our example are
• the first N
• the M
• the second N
• the first N and the M
• the two N ’s
• the M and the second N
• both the N ’s and the M .
Let us look at an example of the second step in the process of converting
a CFG to CNF, namely the killing of all unit productions. The CFG
S −→ aS | A
A −→ aA | bS | a
becomes
S −→ aS | aA | bS | a
A −→ aA | bS | a
after the unit production S −→ A has been removed.
Theorem 26 states that it is possible to convert any CFG to CNF. There
is a subtle point in the (straightforward) proof of this theorem, namely that
the new nonterminals R1 , R2 , ... are not used anywhere else than once at the
end of the righthand side of a production and once on the lefthand side of a
production.
3
We give one complete example of the conversion of a CFG into CNF.
Consider the following CFG:
S −→ XaY | b
X −→ aY | Λ
Y −→ X | bY | a.
We have to perform three tasks in the conversion process, namely
Step 1: Eliminate the Λ-productions.
Step 2: Eliminate the unit productions.
Step 3: Rewrite the productions in the required form.
in this fixed order. (Make sure that you understand why the order is crucial.)
• Step 1. Here we must add all productions that can be obtained by
replacing the nullable nonterminals with Λ. There is one Λ-production,
X −→ Λ. The nullable nonterminals are X and Y . (This can easily be
seen: Λ can be produced by the derivation Y =⇒ X =⇒ Λ. We could
also have used the blue paint method to find the nullible nonterminals.)
Now we have to look at each production in which any combination, i.e.
either one or both, of these nonterminals appear in order to find the
new productions that must be added.
– From S −→ XaY we get S −→ aY and S −→ Xa and S −→ a,
– from X −→ aY we get X −→ a,
– from Y −→ X we do not get a new production (because we do
not want to produce another Λ-production — see the modified
replacement rule) and
– from Y −→ bY we get Y −→ b.
So at the completion of this step we have the following grammar:
S −→ XaY | aY | Xa | a | b
X −→ aY | a
Y −→ X | bY | a | b.
• Step 2: There is only one unit production namely Y −→ X. We remove
this and add the appropriate productions to get
S −→ XaY | aY | Xa | a | b
X −→ aY | a
Y −→ bY | aY | a | b.
• Step 3: We now add the productions A −→ a and B −→ b and replace
terminals with nonterminals where necessary. Then we have
4
S −→ XAY | AY | XA | a | b
X −→ AY | a
Y −→ BY | AY | a | b
A −→ a
B −→ b.
• There is now only one production which is not in the required form,
namely S −→ XAY . Let us convert it to get the CFG finally in CNF:
S −→ XR1 | AY | XA | a | b
R1 −→ AY
X −→ AY | a
Y −→ BY | AY | a | b
A −→ a
B −→ b.
The above process will work correctly for any CFG. When asked to con-
vert some CFG to CNF, it is sometimes very tempting to take short cuts.
Please do not yield to the temptation. It is very easy to make a mistake.
There is, however, one part of the algorithm which you may apply less rig-
orously: it is unnecessary to apply the blue paint algorithm provided that
you make very sure that you have found all the nullable nonterminals. Fi-
nally, note that there is a standardised way of deriving words, a so called
leftmost derivation. Any word generated by a given CFG can be generated
by a leftmost derivation.