MA513: Formal Languages and Automata Theory
Topic: Properties of Context-free Languages
Lecture Number 28 Date: October 17, 2011
1 Eliminating Unit Productions
Definition: A unit production U in a CFG G = (V, T, R, S) is a production (i.e.,
member of R) of the form U : A → B, where A and B are both variables. The
variable B is said to be A-derivable.
For a variable A ∈ V of the CFG G = (V, T, R, S), the Algorithm 1 find the set of
A-derivable variables.
Algorithm 1 Derivable(G, A)
1: Input: CFG G = (V, T, R, S) and a variable A ∈ V .
2: Output: Set W of A-derivable variables in the grammar G.
3: W ← ∅, W 0 ← ∅
4: for (each production A → B ∈ R) do
5: W = W ∪ {B}
6: end for
7: while (W 0 6= W ) do
8: W0 = W
9: for (each C ∈ W 0 ) do
10: for (each C → B ∈ R such that B 6= A) do
11: W = W ∪ {B}
12: end for
13: end for
14: end while
15: Return (W )
1
• If G = (V, T, R, S) is a CFG with no null production, then we can design a algo-
rithm (see Algorithm 2) to find a CFG G1 = (V, T, R1 , S) having no unit production
such that L(G1 ) = L(G).
Algorithm 2 Elimination of unit Productions(G)
1: Input: CFG G = (V, T, R, S)
2: Output: CFG G1 = (V, T, R1 , S) having no unit production s. t. L(G1 ) = L(G)
3: R1 ← R
4: for (each A ∈ V ) do
5: W = Derivable(G, A) /* Using the Algorithm 1 */
6: for (each B ∈ W ) do
7: for (each non-unit production B → α ∈ R) do
8: if (A → α 6∈ R1 ) then
9: R1 = R1 ∪ {A → α}
10: end if
11: end for
12: end for
13: end for
14: Delete all unit productions from R1
15: Return (G1 = (V, T, R1 , S))
Now, we can summarize the various simplifications on grammar described so far. We
want to convert any CFG G into an equivalent CFG that has no useless symbols,
-productions, or unit productions. Some care must be taken in the order of
application of the constructions. A safe order is:
1. Eliminate -productions.
2. Eliminate unit productions.
3. Eliminate useless symbols.
Theorem: If G is a CFG generating a language that contains at least one string
other than , then there is another CFG G1 such that L(G1 ) = L(G) − {}, and G1
has no -productions, unit productions, or useless symbols.
2 Chomsky Normal Form (CNF)
In this section, we shall show that every nonempty CFL without has a grammar
G in which all productions are in one of two simple forms, either:
1. A → BC, where each of the A, B, and C are variables, or
2. A → a, where A is a variable and a is a terminal.
2
Further, G has no useless symbols. Such a grammar is said to be Chomsky Normal
Form, or CNF.
To put a grammar in CNF, start with one that satisfies the following restrictions:
1. The grammar has no -productions,
2. The grammar has no unit productions, and
3. The grammar has no useless symbols.
Every production of such a grammar is either of the form A → a, which is already
in a form allowed by CNF, or it has a body 1 of length 2 or more. Our tasks are to:
a) Arrange all that bodies of length 2 or more consists only of variables.
b) Break bodies of length 3 or more into a cascade of productions, each with a
body consisting of two variables.
Construction of (a): For every terminal a that appears in a body of length 2 or
more, create a new variable, say A. This variable has only one production
A → a. Now, we use A in place of a everywhere a appears in body of length
2 or more. At this point, every production has a body that is either a single
terminal or at least two variables and no terminals.
Construction of (b): We break all the productions of the form A → B1 B2 . . . Bk ,
for k ≥ 3 into a group of productions with two variables in each body. We
introduce k − 2 new variables, C1 , C2 , . . . , Ck−2. The original production is
replaced by the following k − 1 productions:
A → B1 C1 , C1 → B2 C2 , C2 → B3 C3 , . . ., Ck−2 → Bk−1 Bk
Theorem: If G is a CFG whose language contains at least one string other than
, then there is a grammar G1 in Chomsky Normal Form, such that L(G1 ) =
L(G) − {}.
1
If A → α is a production, then the part α is said to be body of that production.