Automata Theory & Compiler Design 210551 Module 2
REGULAR EXPRESSIONS AND LANGUAGE;
Introduction
Instead of focusing on the power of a computing device, Iet's look at the task that we need to
perform, Let's consider problems in which our goal is to match finite or repeating patterns.
For example regular expressions are used as pattern description language in
+ Lexical analysis.-- compiler
+ Filtering email for spam.
= Sorting email into appropriate mailboxes based on sender and/or content words and
phrases.
+ Searching a complex directory structure by specifying patterns that are known to occur in
the file we want.
A regular expression is a pattern description language, which is used to describe particular
patterns of interest. A regular expression provides a concise and flexible means for
"matching" strings of text, such as particular characters, words, or patterns of characters.
Example: [ |: A character class which matches any character within the brackets
[* \en] matches any character except space, tab and newline character.
Regular expression
A language accepted by finite- automata is called as regular language
A regular- languages can be described using regular expressions, in the form of algebraic
notations consisting of the symbols such as alphabets in ¥, the operators such as +. and *, where
+ iy used for univ uperati
+ iy used for concatenation aud * iy used for elusure uperatis
‘Thus the regular expressions are the structural representation of finite-automata using algebraic
notations, which can serve as the input language for many systems that process strings. Example
such as Lex program, Unix Grep ete.
Definition of Regular expression
Define Regular expression,
A regular expression is defined as follows:
is a regular expression denoting an empty language.
eis regular expression denoting the language containing empty string,
ais regular expression denoting the language containing only {a}
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 1Automata Theory & Compiler Design 210551 Module 2
IfR is a regular expression denoting the language Ly and S is a regular expression denoting the
language Ls then R + S is a regular expression corresponding to the language Lx U Ls
RSS is a regular expression corresponding to the language Lx Ls.
R’ is a regular expression corresponding to the language Lp, Thus the expressions obtained by
applying any of the rules are regular expressions
Examples of Regular expressions
Regular expression | Meaning
a Siring consisting of any number of a's. (zero of more as)
a ‘String consisting of at least one a. (one or more a’s)
(a+b) String consisting of either a or b
(aby String consisting of any nuber of a's and b's including ©
(arby ab Strings of a's and b’s ending with ab,
ab(atby Strings of a's and b's starting with ab
(a+b) ab (ab) Strings of a's and b’s with substring ab.
a, Strings of a’s and b’s having length 2:
Regular expression = (a+ b) ( a* b).
b. Strings of a’s and b’s of length < 10:
Regular expression = («+ a+b)".
c. Strings of a’s and b’s of ‘even length.
(a+b) (a+b) T°
d. Strings of a’s and b’s of odd length
Regular expression
Regular expression = ( a +b) |(a +b) (a+ b)]"
e. Strings of a's of even length
Regular expression = (aa)"
f. Strings of a's of odd length
Regular expression = a(aa)
‘Strings of a’s and b’s with altemate as and b's.
Alternate a’s and b’s can be obtained by concatenating the string (ab) zero or more times. ie (ab)"
and adding an optional b to the front ie: (¢ +b) and adding an optional a at the end, ie: (e +a)
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 2Automata Theory & Compiler Design 210551 Module 2
The regular expression = (e-+b) (ab) (+a)
OR
Je
String should contain at least one a and one b, so the regular expression corresponding to this is
given by = ab+ba
There is no restriction on ¢”s. Insert any number of a’s, b’s and ¢;s ic: (a+b+e)" in between the
above regular expression.
So the final regular expression = (atb+c)’ a (atbte) b(atb+e)’ + (atbte)”
D(atb+e)"a(atb+e)”
Regular expression for string containing 3 consecutive 0’s = 000
The above regular expression can be preceded or followed by any number of 0’s and 1’s, ie:
(oH)
Regular expression = (0+1)"000(0+1)"
Regular expression for strings of a’s and b’s ending with b and has no substring aa is nothing but
the string containing any combinations of either b or ab without «.
Regular expression = (b + ab) (b +ab)"
means even number of a’s, regular expression = (aa)"
'b™ means even number of b’s, regular expression = (bb).
‘The regular expression for the given language =(aa)" (bb)"
means odd number of a’s, regular expression = a(aa)”
b*” means even number of b’s, regular expression = (bb)
‘The regular expression for the given language = a(aa) (bb)”
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 3Automata Theory & Compiler Design 210551 Module 2
a?'! means odd number of a’s, regular expression = a(aa)"
b*™"! means odd number of b’s, regular expression = b(bb)*
‘The regular expression for the given language = a(aa)'b(bb)"
Regular eapressivut for exactly ume 1 = 1
Even number of 0’s = (00)"
So here 1 can be preceded or followed by even number of 0's or 1 can be preceded and followed
by odd number of 0's.
‘The regular expression for the given language = (00)" 1 (00)" + 0(00)" 1 0(00)"
OR
Whenever a 0 occurs it should be followed by 1, But there is no restriction on number of I's. So
it isa string consisting of any combinations of 1’s and 01’s, ie regular expression = (1+01)"
Suppose string ends with 0, the above regular expression can be modified by inserting (0 + ¢) at
the end.
Regular expression for the given language = (1+01)" (0+)
OR
‘Whenever a 1 occurs it should be followed by 0, But there is no restriction on number of 0's. So
it is a string consisting of any combinations of 0’s and 10’s, ie regular expression = (0+10)"
Suppose string ends with 1, the above regular expression can be modified by inserting (1+ ¢ ) at
the end.
Regular expression for the given language = (0+10)" (1+ ©)
Strings of a’s and b’s with substring ab.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru. Page 4Automata Theory & Compiler Design 210551 Module 2
Regular expression = (a+b)" aab(at+b)"
ii. Strings of a’s and b’s such that 4" symbol from right end is b and the 5" symbol from
right end is a.
Here the 4" symbol from right end is b and the 5" symbol from right end is a the
corresponding regu:lar expression = ab(a+b)(a+b)(a+b),
But the above regular expression can be preceded with any number of a’s and b’s.
‘Therefore the regular expression for the given language = (a+b) ab(a+b)(a+b)(a+b).
iii. Strings of a’s and b’s such that 10" symbol from right end is b.
The regular expression for the given language = (a+b) b(a+b)”
Strings of a’s and b’s whose lengths are multiple of 3
OR
L = { |w|mod 3 =0, where w is in = {a,b}
Length of string w is multiple of 3, the regular expression = [(a+b) (a+b) (a+b)]"
Vv. Strings of a’s and b’s whose lengths are multiple of 5
OR
, where w is in == { a,b}
Length of string w is multiple of 5,
L={ |w|mod 5 =
‘The regular expression = [(a+b) (a+b) (a+b) (atb)(at+b)]"
vi. Strings of a’s and b’s not more than 3 a's:
Not more than 3 a's, regular expression= (ea) (e+a) (ea),
But thete is uy resuivtivu on b's, so we cau include Lin between the abuve regular
expression,
‘The regular expression for the given language = b’(c+a) b (eta) b” (eta) b”
vii, Obtain the regular expression to accept the words with two or more letters but
beginning and ending with the same letter. 2 = { a, b}
Regular expression beginning and ending with same letter is = a a + b b. In between
include any number of a’s and b’s.
‘Therefore the regular expression = a(a*b)"
+b(atb)b
viii. Strings of a’s and b’s of length is either even or multiple of 3.
Multiple of regular expression = [(a+b) (a+b) (a+b)]"
Length is of even, regular expression = [(a+b) (a+b)]”
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 5Automata Theory & Compiler Design 21C851 Module 2
So the regular expression for the given language = |(a+b) (a+b) (atb)|"+ [(a+b)
(a+b)
Obtain the regular expression to accept the language L~ { a" | m+n is even }
Here n represents number of a’s and m represents number of b’s.
mn is even results in two possible cases;
case i. when even number of a’s followed by even number of bs.
regular expression : (aa) (bb)”
‘asc ii. Odd number of a’s followed by odd number of b’s.
regular expression = a(aa)’ b(bb)’.
So the regular expression for the given language = (aa)'(bb)" + a(aa)’ b(bb)"
X. Obtain the regular expression to accept the language L= { "B® |n>4 and m <3 }.
Here n 2 4 means at least 4 a’s, the regular expression for this = aaaa(a)”
‘m_< 3 means at most 3 b’s, regular expression for this = (e+b) (+b) (+b).
So the regular expression for the given language = aaaa(a)’ («+b) (e+b) (e+b).
Obtain the regular expression to accept the language L~= { a" e? |m> 4 andm <3 p<
a.
Here n > 4 means at least 4 a’s, the regular expression for this = aaaa(a)"
'm <3 means at most 3 b’s, regular expression for this = (<+b) (+b) (e+b)
p<2 means at most 2 ¢’s, regular expression for this ~ (ce) (c+e)
So the regular expression for the given language = aaaa(a)'(etb) (eb) (e+b) (e+e)
(ete).
All strings of a's and b’s that do not end with ab.
‘Suings of length 2 and that do uot end with ab ave ba, wa and bb.
So the regular expression = (a+b)'(aa + ba +bb)
xiii. All strings ofa’s, b’s and c’s with exactly one a
‘The regular expression = (b+c)" a (b+c)"
xiv. All strings of a’s and b’s with at least one occurrence of each symbol in ¥ = {a, b}
Atleast one occurence of a’s and b’s means ab + ba, in between we have 1 number
ofa’s and b's.
So the regular expression =(a+b)" a (a+b)" b(a+b)’ +(a+b)" b(a+b)” a(a+b)”
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 6Automata Theory & Compiler Design 210551 Module 2
Solution:
Cave i, Since nm 23, if'm = 1 then m should be & 3, The equivalent regular expression is given
by: RE=a bbb(b)”
Case ii, Since nm > 3, ifm = 1 then n should be 2 3. The equivalent regular expression is given
by: RE= aaa(a) b
Case iii. Since nm 2 3, ifm >2and n> 2 then the equivalent regular expression 1s given by:
RE = aa(a)" bb(b)"
So the final regular expression is obtained by adding all the above regular expression.
Regular expression = abbb(b)" + aaa(a)"b + aa(a)"bb(b)”
Application of Regular expression:
1. Regular expressions are used in UNIX.
2. Regular expressions are extensively used in the design of Lexical analyzer phase.
3. Regular expressions are used to search patterns in text.
FINITE AUTOMATA AND REGULAR EXPRESSIONS,
1. **™*Converting Regular Expressions to Automata:
Prove that every language defined by a regular expression is also defined by a finite automata.
Proof:
Suppose L = L(R) for a regular expression R, we show that L = L(E) for some -NFA E with:
a, Exactly one accepting state.
b, No ares into the initial state,
c. No ares out of the accepting state.
The proof must be discussed with the following transition diagrams for the basis of the
construction of an automaton.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page7Automata Theory & Compiler Design 21C851 Module 2
——©6)
By definition of regular expression, if R is a RE and S is a RE then R+S is also a RE
corresponding to the language L (R+S), its automaton is given by:
1s
Starting at new start state, we can go to the start state of either the automaton for R or S. We then
reach the accepting state of one of these automata R or S. We can follow one of the ¢- ares to the
accepting state of the new automaton.
Automaton for R.S is given by:
RS)
The start state of the first( R) automata becomes the start state of the whole and the final state of
the second(S) automata becomes the final state of the whole. Oo
Automaton for R* is given by:
-QEST SS
Loe)
From start state to final state one arc labeled g ( for ¢ in R") or the to the start state of automaton
R through that automaton one or more time and then to the final state.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 8Automata Theory & Compiler Design 210551 Module 2
The automaton for L = 0 is given by:
—O--0
‘The automaton for L = 1 is given by:
—O2—O
‘The automaton for L = 0+1 is given by:
oO
=
1-0
‘The automaton for L = (0+1)" is given by:
ea
‘The automaton for L = (0+1)" 1 is given by:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page9Automata Theory & Compiler Design 210551 Module 2
Finally the e-NFA for the regular expression: (0+1)"1(0+1) is given by:
‘The automaton for L
‘The automaton for L = 1 is given by:
‘The automaton for L= 01+1 is given by:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 10Automata Theory & Compiler Design 210551 Module 2
‘The automaton for L = 01 is given by:
The automaton for L= 101 is given
O70 0
The automaton for L = (01+101) is given by:
: 2 Q4—¢y e
~ @+ 944
The automaton for L = 01 is given by:
Epsilon-NFA for the regular expression (0+ 1)°01 is given by:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 11Automata Theory & Compiler Design 210551 Module 2
Zo +
<_—® Se + O-.@1—@O
ten Oem ae
‘The automaton for L
The automaton for L = 1 is given by:
Oo <
The automaton for L = 2 is given by:
—O7—O
‘The automaton for L = 0° is given by:
‘The automaton for L = 2° is given by:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 12Automata Theory & Compiler Design 21C851 Module 2
Se
a.
CONVERTING DFA’s TO REGULAR EXPRESSION USING STATE ELIMINATION
TECHNIQUE
How to build a regular expression for a FSM. Instead of limiting the labels on the transitions of
an FSM to a single character or s, we will allow entire regular expressions as labels.
* For a given input FSM/FA M, we will construct a machine M’ such that M and M’ are
equivalent and M’ has only vo states, start state and a single accepting state.
+ M’ will also have just one transition, which will go from its start state to its accepting
state. The label on that transition will be a regular expression that describes all the strings
that could have driven the original machine M from its start state to some accepting state
Algorithm to create a regular expression from FSM: (State elimination)
1. Remove any states from given FSM M that are unreachable from the start state
2. If FSM M has no accepting states then helt and return the simple regular expression @.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 13Automata Theory & Compiler Design 210551 Module 2
3. If the start state of FSM M is part of a loop (i.e: it has any transitions coming into it), then
create a new start state s and connects to M ‘s start state via an e-transition, This new.
start state s will have no transitions into it.
4, Ifa FSM M has more than one accepting state or if there is just one but there are any
transitions out of it, create a new accepting state and connect each of M’s accepting states
to it via an e-transition, Remove the old accepting states from the set of accepting states.
Note that the new accepting state will have no transitions out from it
5. At this point, if M has only one state, then that state is both the start state and the
accepting state and M has no transitions. So L (M} = {2}. Halt and return the simple
regular expression as &
6. Until only the start state and the accepting state remain do:
6
Select some state s of Mwhich is of any state except the start state or the accepting
state,
6.2 Remove that state s from M.
6.3. Modify the transitions among the remaining states so that M accepis the same
strings The labels on the rewritten transitions may be any regular expression.
7. Return the regular expression that labels the one remaining transition from the start state
to the accepting state
OR
OOO
‘We can build an equivalent machine M’ by eliminating state q2 and replacing it by a transition
from qj to q; labeled with the regular expression ab’a.
So M'is:
© ava ©
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 14
Regular Expression = ab‘aAutomata Theory & Compiler Design 21C851 Module 2
‘There is uo ines
ag edge into the
ial state as well as no oulgoing edge fiom final state. So
there is only two states, initial and final.
abe
Regular expression = (a+b+e) or (aU b Ue)
There is no incoming edge into the initial state as well as no outgoing edge from final state.
Oa ©
After eliminating the state B:
Regular expression = ab
Y
H—O-—©)
There is no incoming edge into the initial state as well as no outgoing edge from final state.
After eliminating the state B
o> ©)
Regular expression = ab’e
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 15Automata Theory & Compiler Design 21C851 Module 2
@ : OD
Since initial state has incoming edge, and final sate has outgoing edge, we have to create a new
iniatial and final state by connecting new initial state to old initial state through ¢ and old final
state to new final state through ¢, Make old final state has non-final state.
After removing state A.
10
7
0 «
B
Afier removing state B:
"©
Regular expression: 0(10)"
start,(qQ)__a B £ @)
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 16Automata Theory & Compiler Design 21C851 Module 2
Since there are multiple final states, we have to create a new final state.
After removing states C, D and E:
@ @)—t@)
Afier removing state B
Regular Expression: a(b+e+d)
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 17Automata Theory & Compiler Design 21C851 Module 2
After removing state A.
After removing state B:
Regular expression: b(c tab) d
After removing state B
After removing state A’
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 18Automata Theory & Compiler Design 210551 Module 2
2
No ald
Regular expression: (0+10°1)"
By creating new start and final states:
After removing state qr
S ro
After removing state q2
10 01 =
After removing state qs
® 1700"1 +1071)" ®
Regular expression: 1°00°1(0+10" 1)"
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 19Automata Theory & Compiler Design 21C851 Module 2
After removing qi state:
After removing qp state:
After removing qa state
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 20Automata Theory & Compiler Design 210551 Module 2
After removing qpy state:
| @1+10)"
Regular expression: (01+10)"
OR
Since start state 1 has incoming transitions, we create a new start state and link that state to state
1 through e.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 21Automata Theory & Compiler Design 210551 Module 2
Since accepting state 1 and 2 has outgoing transitions, we create a new accepting state and link
that state to state I and state 2 through ¢, Remove the old accepting states from the set of
accepting states. (ie: consider | and 2 has non final states)
Consider the state 1 and remove that state:
Finally we have only start and final states with one transition from start state I to final state 2,
The labels on transition path indicates the regular edpression,
Regular Expression = (ab U aaa” b)" (a Uc)
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 22Automata Theory & Compiler Design 21C851 Module 2
After removing qz state:
After removing q) state
ut
After removing qo state
Regular expression: 0° (¢ + 1°)=0" 1°
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 23Automata Theory & Compiler Design 210551 Module 2
After removing D state:
o
©
After removing E state:
After removing A state:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 24Automata Theory & Compiler Design 210551 Module 2
(s) Govt (2) * (c) ©
After removing B state:
ery
(s ) OTT —« x e
After removing C state:
—~@
Regular expression = (00) 11(11)"
OR
a
“Ox,
Co,
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 25Automata Theory & Compiler Design 210551 Module 2
By creating final state
After removing qistate:
After removing qastate:
— =
oro
After removing qsstate:
Oa
©,
Regular expression= 01°01"
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 26Automata Theory & Compiler Design 210551 Module 2
By creating new start and final states:
After removing qo state:
—) ony e oa
After removing qr state:
Afier remaving qs state
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 27Automata Theory & Compiler Design 21C851 Module 2
nen OL
ene
After removing q; state:
(0+1)*1(0+1) + (0+1)°100+1)00+1) ®
oT
Regular expression: (0-+1)"1(0+1) +(0+1)'1(0+1)(0+1)
CONVERTING DFA’s TO REGULAR EXPRESSION USING KLEEN’S THEOREM
The construction regular expression using this method describes sets of strings that label certain
paths in the DFA’s transition diagram, However the paths are allowed to pass through only a
limited subset of the states. We start with the simplest expressions that describe paths that are not
allowed to pass through any states (ie: they are single state or single arc), and inductively build
the expressions that let the paths go through progressively larger sets of states. Finally the paths
are allowed to go through any state, At the end these expressions represent all possible paths.
Let us consider a DFA with ‘n’ number of states and use Ry as the name of regular expression
whose language is the set of strings w is the label of a path from state I to state j in a given DFA
and that path has no intermediate node whose number is greater than k. Note that beginning and
end points of the path are not intermediate, so there is no constraint that i and/or j be less than or
equal to k.
To construct the regular expression Rj" we use the inductive definition, starting at k = 0 and
finally k is reaching k = n which is the number of states in DFA.
When k=0
Regular expressions for the paths that can go through no intermediate state at all, there are 2
Kinds of paths that mect such condition:
1. Anare from node (state) i to node j
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 28Automata Theory & Compiler Design 210551 Module 2
2. A path of length 0 that consists of only some node i
If 4 j then only case (1) is possible, We must examine DFA and find those input symbols a such
that there is a transition from state i to state j on symbol a.
a) If there is no such symbol a, then R;;” =@
b) If there is exactly one such symbol a, then Ri! =a
©) If there are symbols ay, a2, as, ay that label ares from state Ito state j, then
Ry’ =a; tay tay, +a
If i —j, then legal paths are the path of length 0 and all loops from i to itself. The path length 0 is
represented by the regular expression c. Thus we add ¢ to the various expressions devised in a)
through c) above, That is in case a) expression becomes @ + ¢ = e, in case b) expression becomes
ta, in case c) expression becomes ¢ + a; + a2 +35, ay
Suppose there is a path from state i to state j that goes through no state higher than k, There are
‘two possible cases to consider.
1. The path does not go through state k at all, In this case, the label of path is in the
language of Ri"
2. The path goes through state k at least once, Then we can break the path into several
pieces, as suggested in below figure:
A ee
ier en
tn Re? In REY
Zero or more strings in RG
The first gocs from state i to state k without passing through k, the last picee gocs from k to j
without passing through k, and all the pieces in the middle go from k to itself, without passing
through k. When we combine the expressions for the paths of the two types above, we have the
expression for the labels of all paths from state ito state j that go through no state higher than k.
RD = RG eR POR PR
Ry’ = Regular expressions for the paths that can go through no intermediate states at all.
Ry! = Regular expressions for the paths that can go through an intermediate state 1 only.
Rj? = Regular expressions for the paths that can go through an intermediate state 1 and state 2
only.
Ry’ = Regular expressions for the paths that can go through an intermediate state 1, state 2 and
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 29Automata Theory & Compiler Design 21C851 Module 2
State3 only and so on.
NOTE:
Tdentity rule Example
eR=Re=R le =el=1
O@R=RO=G 19=01=0
e =e
(@) =e
O+R=R1O=RK
R4R=R
RR'=RR=R™
(RY=R
RR=R
e+RR=R eral
(PHOR= PRIOR
PHO =P'Q)= PHD)
R@+R)=@+R)R=
(e+R) =R
e+R=R
(PQ P=P(QP)
RR+R=RR-=R™
Start A @D 1
‘When k =0; (passing through no intermediate state), the various regular expressions are:
Answer:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 30Automata Theory & Compiler Design 210551 Module 2
When
=1; (passing through sate 1 as intermediate state), the various regular expressions are:
Therefore the regular expression corresponding to the language accepted by the DFA is given by:
Rie’ (state 1(i
the start state and state 2(j) is the final state). By using the formula:
Rix =Ru! +Ri2! (Ra!)’ Ra!
A? | 1°0+1°0(e +041)" +041)
“0 (0 +1)"
‘Answer:
Number of states in DFA = 3; ie: k=3
By renaming the states of DFA:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 31Automata Theory & Compiler Design 210551
Module 2
RQ [ise] [RO [> R® |r
R2 |o RQ [ro | [R® [a0
FS | RY RQ | 01
|e yy RQ |e
RE |e RQ |e R® le
RQ [a Comin @ Ta
RY |e RY |¢ RQ |e
RQ {i r@ |r Re [a
]0+e sy [O+e @ [ore
@) () ©
Regular expressions for paths that can go througi a) no state, b) state 1 only and c) states 1 and 2
only.
Therefore the regular expression corresponding to the language accepted by the DFA is given by:
Ris’ (state 1(i) is the start state and state 3 (j) is the final state). By using the formula:
a = ro +
we get
Where i
sje dand k=
Rus) = Rus? + Ris? (Rss?) Ras?
=101 +101 (0+e+11) (O+e+11)
=101+1°01O+e+11)]
Regular expression = 1°01 (0+ 11)"
é 0 1
=a ca ai
cs ca @
“qs we q
Answer:
Number of states in DFA = 3; ie: k =3
By renaming the states of DFA as a= 1. a> =2.a:=3
Transition diagram of DFA:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 32Automata Theory & Compiler Design 210551 Module 2
Ri” jert Ri TI
Ra” [oO RT [10 RL [100
Rw” | O Ra” |TO=0 R37 [Toor
Ra” |O Ra? |O Ri? |
R27 [O+e Rx® [Ote Rx” [0
Ra” [1 Ra” [1 Ra” [Fit
Ri” |@ Rai [Oo Ri” [0
Ra I Re [1 Ru? [10
Ras Ore Ru” [Ore Ra? | @+101)
‘Therefore the regular expression corresponding fo the language accepted by the DFA is given by:
Rjs' (state 1(j) is the start state and state 3 (j) is the final state). By using the formula
RAP = REY 4 REN REY) RED
Where i=, j= 3 and k = 3; we get
Ris = Ris + Ris’ (Rav) Ras?
1 00'1 + 1°00 1 (0 +10°1)') + 1071)”
Regular expression = 1° 0° 1 (0 + 10°1)"
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 33Automata Theory & Compiler Design Module 2
By renaming the states of DFA, we get
Regular expressions for paths that can go through 3 intermediate states: states 1, states 2 and
states 3 only.
Re
RW [oO
Ra [hb
Ri” | (a+ bb)b
Ra” [aba + bbba
Ri? |
Rae Oo
Ras bb’
Ra? [bba
Ri [0
Rx” 18
Rs [8
Ru [a
Ri? |O
Ro [Oo
Rw [Oo
Rw [Oo
‘The regular expression corresponding to the language accepted by the DFA is given by: Rus’
(tate 1(j) is the start state and state 4 (j) is the final state), By using the formula:
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 34Automata Theory & Compiler Design 210551 Module 2
AE = RE? + RMA?
Where i=1, j= 4 and k =4; we get
Rush = Rig + Rig (Ras) Rag?
fab’a + bbb’a) + (ab‘a + bbb’a) (0) @
We know that 0° =e and OR= 0
Therefore Regular expression reduces to
=ab'a+ bbb‘a
REGULAR LANGUAGES AND PROPERTIES OF REGULAR LANGUAGES.
Regular Languages:
The regular languages are the languages accepted by Finite automata (DFA, NFA” and ¢ —
NFA") and defined by regular expressions.
{ Strings of a’s and b’s ending abb}
L.={ even of a’s and even number of b’s} etc
Example: L
There are many languages which are not regular. We can prove that certain languages are not
regular using one powerful too! called pumping lemma.
Example: L= {a" b" | n>=0}
L= { equal number of 0s and 1’s}_ ete.
PROVING LANGUAGES NOT TO BE REGULAR
Pumping Lemma (PL) for Regular Languages:
Theorem (Statement) :
Let L be a regular language. Then there exists a constant ‘n’ (which depends on L) such that for
every string ‘w’ in L such that |w| > n, we can break w into three strings, w=xyz, such that:
Lly>0 ie: y#e
2. fey sn
3. For all k > 0, the string xy*z is also in L.
Proof. Suppose L — L(A) for sume DFA ‘A’ and regular language L. Suppose ‘A’ ha s'n”
number of states. Consider any string w = ajazas ay of length ’m’ where m >= n and
each aj is an input symbol. Since we have ‘m’ input symbols, naturally we should have ‘m+1”
states, in sequence qo, qu, dz qm Where qp is > start state and qm is + final state
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 35Automata Theory & Compiler Design 21C851 Module 2
EY HO) AO HE
n, by the pigeonhole principle itis net possible to have distinet transitions, since there
Since |
are only ‘n’ different states. So one of the state can have a loop. Thus we can find two different
integers i and j with 0
0, xy*z is also accepted by DFA ‘A’; that is for a language L to be a regular,
xy‘zis in L forall k>0.
Applications of Pumping lemma:
1 Itis useful to prove certain languages are non-regular.
2. Itis possible to check whether a language accepted by FA is finite or infinite,
‘Show that L= {a b" |m>=0} is not regular.
Let L 1s regular language and ‘n’ be the number of states in FA.
since |w|=n-+n=2n > n, we can split ‘w’ into xyz such that |xy| 1 as
2 a
on ———r—rD
w = aaaaaaaaazazzzaaa a bbbbbbbbbLbLbELbb
x y z
‘Where [x|=n-1 and |y| =I so that xy] =n-1 +1 = no.
If*k? = 0, the string *y’ does not appear, so the number of ‘a’s will be less than number of *b’s
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 36Automata Theory & Compiler Design 210551 Module 2
ie: w= a"! DY
But according to pumping lemma ‘n’ number of ‘a’s should be followed by ‘n” of “b’s, which is
‘a contradiction to the assumption that the language is regular.
So the language L= {a" b” | n>= 0} is not regular language.
‘Show that L= {a'b!|i>j} is not regular.
Let L is regular language and ‘n’ be the number of states in FA.
Consider the string w = a"! b”
since |w|— n+] +n—2nt1 =n, we can split ‘w’ into ‘xyz’ such that xy| 1 as
atl 2
—— —~__.
me smn aaa
FS
keel yea =
Where [x|=n-I and |y| =I so that [xy] =n-1 +1 = n 0.
It'k
= 0, the string *y’ does not appear, so the string ‘w’ has ‘n’ number of ‘a‘s followed by ‘n”
number of ‘b's. ie: w= a" b”
But according to pumping lemma ‘n+I’number of ‘a’s should be followed by ‘n’ of *b’s, which
is a contradiction to the assumption that the language is regular.
So the language L= {a' bj | i>j} is not regular language.
‘Show that L= {w | n,(W) < nyo } iS not regular,
Let L is regular language and ‘n’ be the number of states in FA.
Consider the string w= a™! b”
since |w|=n-1 +n =2n-1 > n, we can split ‘w" into “xyz? such that [xy| 1 as
al a
——, ——~—_~
Lm daaaaaaaaaaaaazaaaa bbdbbbobbtbdbobbdob:
zope
Where [x|=n-1 and |y|= 1 so that [xy| =n-1 +1 = nn, which is true.
According to pumping lemma xy‘z € L for all k> 0.
If *k’ = 0, the string ‘y’ does not appear, so the string ‘w” has ‘n-1” number of ‘a's followed by
“nel” number of ‘b's. ie: w= a"! b™
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 37Automata Theory & Compiler Design 210551 Module 2
But according to pumping lemma ‘n-I’number of ‘a’s should be followed by ‘n’ of “b’s, which is
a contradiction to the assumption that the language is regular,
So the language L~ {wv | ng(w’) < mw) } is not regular.
‘We can prove that L is not regular by taking string w= ab" | n>=0.
For solution refer problem1.
ie: ij meansi
j ori < j; 90 we can take string “w’ ~e"*!'b* or w- ab"
Solution is similar to the previous problems.
"Show that L= fa" b c"™ | n,m >= 0} is not regular.
Let L is regular language and ‘n’ be the number of states in FA.
ince L is regular it is closed under homomorphism. So we can take h(a) = a, h(b) = a and h(c) =
.
Now the language L is reduced to L = {a al c"""| ntm >= 0}
ie: L= fa") ntm >= 0} which is in the form
L={a'b)|i>=0},
Consider w= a"b"
since |w|=n-4n=2n > n, we can split ‘w’ into xyz such that [xy| 1 as
a n
—_S,
w = aaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbb
‘Where [x|=n-l and |y|= 1 so that jxy| =n-1 +1 = 0.
If‘ k=
|, the string ‘y’ does not appear, so the number of ‘a’s will be less than number of “b's
jes w= a"! bY
Which is a contradiction to the assumption that the language is regular. So the given language
So the language L=
a b® | n>= 0} is not regular language L= {a" b ¢*™" | n,m >= 0} is not
regular.
Let L is regular language and ‘n’ be the number of states in TA.
Consider the string w =a" b®, therefore ww =a" b” a" b™
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 38Automata Theory & Compiler Design 210551 Module 2
since |w| = ntrtn +n = 4n > n, we can split ‘w" into ‘xyz? such that [xy| 1 as
a a a a
—— ae ee ———
ws amma ‘bbbbbbbbbbbbbbbtbbb aaaasasaaasagaaaaaaa bbbbbbbbbbbbbbbbbbb
a
x-# 0 ya z
‘Where [x|=n-l and |y|=1 so that xy] =n-l +
nn, which is tue.
According to pumping lemma xy*z € L for all k> 0.
If “k’ = 0, the string *y’ does not appear, so the number of *a’s on the left of first b will be less
than number of a’s after the first b
ie: ww=a"" b'a"b?.
Which is a contradiction to the assumption that the language is regular.
So the language L= {ww | w € (a+b)’} is not regular is not regular language.
‘Show that L= {ww* | w € (a+b)" } is not regular.
Let L is regular language and ‘n’ be the number of states in FA.
=a boa?
Consider the string w =a" b*, therefore ww
since |w|=ntnin + n=4n > n, we can split ‘w’ into ‘xyz’ such that |xy| 0.
If ‘k’ = 0, the string ‘y’ does not appear, so the number of ‘a's on the left of first b will be less
than number of ‘a’s after the first b
ie: ww* =a"! b"b" a”
Which is a contradiction to the assumption that the language is regular.
So the language L= {ww® | w € (a+b)’} is not regular is not regular language.
Let L is regular language and ‘n’ be the number of states in FA.
Consider the string w= a™
since |w|=n 2n, we can split ‘w" into ‘xyz’ such that |xy| 0.
If*k’ =0, the string ‘y’ does not appear, ie: a’ (a) a”?
Itis clear that n! > n! —} ie: when j=1, nl >n!
But according to pumping lemma, language to be regular, n! = n! ~ 1, which is not true and it is a
contradiction. So L can not be regular.
Let L is regular language and ‘n’ be the number of states in FA,
Consider the string w= 0°
since |w|=n > n, we can split ‘w’ into ‘xyz’ such that |xy| las
‘Where |x| =i and |y| =j so that |xy|=i+j1
According to pumping lemma xy'z € L for all k> o.
If ‘k’ = 0, the string ‘y’ does not appear, ie: 0' (0) 0°”
Itis clear that when,
» n¢(n-1)asaprime
But according to pumping lemma, language to be regular, when n is prime, n-1 is also prime,
which is not true and it is a contradiction, So L ean not be regular.
‘Show that L= {0° | isa perfect square } is uot vegular.
Let L is regular language and ‘n’ be the number of states in FA.
Consider the string w= 0"
since |w| =n? > n, we can split ‘w’ into ‘xyz’ such that [xy| 1 as
x=0'
y-0i
z 0”? is
‘Where [x|~ i and ly =j so that y|=i+j1
According to pumping lemma xy*z € L for all k> o.
If ‘k* = 0, the string ‘y* does not appear, ie: 0! (0) 0"?
It is clear that when (n° -1 ) is not a perfect square.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 40Automata Theory & Compiler Design 210551 Module 2
But according to pumping lemma, language to be regular, when n” is a perfect square, n°-1 is
also a perfect square, which is not true and it is a contradiction, So L can not be regular.
‘Show that L~ (0° | n is a perfect cube } is not regular.
Let L is regular language and ‘n’ be the number of states in FA.
Consider the string w =
since |w|=n' >n, we can split ‘w" into ‘xyz’ such that [xy| 1
According to pumping lemma xy'z € L for all k> 0,
3
If “k’ = 0, the string “y’ does not appear, ie: 0' (0) 0") = 0" TEL
It is clear that when j=1, (n° -1 ) is not a perfect cube.
But according to pumping lemma, language to be regular, when n’ is a perfect cube, n’-1 is also
aperfect cube, which is not true and it is a contradiction. So L can not be regular.
LEXICAL ANALYSIS PHASE OF COMPILER DESIGN
Lexical Analysis,
Lexical analysis reads characters from left to right and groups into tokens. A simple way to build
lexical analyzer is to construct a diagram to illustrate the structure of tokens of the source
program, We can also produce a lexical analyzet automatically by specifying the lexeme pattems
to a lexical-analyzer generator and compiling those pattems into code that functions as a lexical
analyzer. This approach makes it easier to modify a lexical analyzer, since we have only to
rewrite the affected patterns, not the entire program,
Three general approaches for implementing lexical analyzer are:
1. Use lexical analyzer generator (LEX) from a regular expression based specification
that provides routines for reading and buffering the input.
2. Write lexical analyzer in conventional language (C ) using V/O facilities to read input.
3. Write lexical analyzer in assembly language and explicitly manage the reading of
input.
Note: The speed of lexical analysis is a concern in compiler design, since only this phase reads the
source program character-by character.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 41Automata Theory & Compiler Design 21¢851 Module 2
1
2,
Lexical analyzer reads the source program character by character to produce tokens,
Normally a lexical analyzer doesn’t return a list of tokens at one shot, it returns a token
when the parser asks a token from it.
Normally L.A. don’t return a comment as a token. It skips a comment, and return the
next token (which is not a comment) to the parser.
Correlating error messages: It can associate a line number with each error message. In
some compilers it makes a copy of the source program with the error messages inserted
at the appropriate positions.
If the sou
ce program uses a macra-prepracessor, the expansion of macros may be
performed by the lexical analyzer.
Role of Lexical Analyzer
source Lexical token + To semastic analysis
oven Analyzer |* get sexisked
\
o
ymbol
Read the input characters of the source program, group them into lexemes and produces
output as a sequence of tokens.
It interacts with the symbol table.
Initially parser calls the lexical analyzer, by means of getNextToken command.
In response to this command LA read characters from its input until it can identify the
next lexeme and produce a token for that lexeme, which can be returned to parser.
It eliminates comments and white space.
It displays error messages with line number.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 42Automata Theory & Compiler Design 210551 Module 2
Lexical Analysis versus Parsing:
‘The main reasons are;
1 P|
that had to deal with comments and whitespace as syntactic units would be
ty of design: Separation allows us to simplify the task. For example a parser
considerably more complex than one that can assume comments and whitespace have
already been removed by the LA.
Compiler efficiency is improved: A specialized buffering techniques for reading
input characters can speed up the compiler significantly.
3. Compiler portability is enhanced: Input device peculiarities can be restricted to the
lexical analyzer.
‘Tokens, Patterns and Lexemes:
i
ii.
Token: It describes the class or category of input string. A token is a pair consisting of a token
name and an optional attribute value.
or example, identifier, keywords, constants are called tokens
Pattern: Set of rule that describes the tokens. Itis a description of the form that the lexemes of a
token may take.
Example: letter [A-Za-z]
Lexeme: Sequence of characters in the source program that are matched with the pattern of the
token,
Example: int, a, num, ans ete
Token representation:
In many programming languages, the following classes cover most or all of the tokens
i. One token for each keyword; The pattern for a keyword is the same as the keyword itself
ji, Tokens for the operators, either individuelly or in classes such as the token comparison.
iii, One token representing all identifiers.
iv. One or more tokens representing constants, such as numbers and literal.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 43Automata Theory & Compiler Design 210551 Module 2
v. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Attributes for tokens
A token has only a single attribute that is a pointer to the symbol-table entry in which the
information about the token is kept.
Example: The token names and associated attribute values for the statement E =M * C ** 2 are
written below as a sequence of pairs.
id, pointer to symbol-table entry for E~
Lexical errors:
1. Itis hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error. Tor instance, if the string fi is encountered for the first time in a C
program in the context
fi(a==f(x))
A lexical analyzer cannot tell whether fi is 2 misspelling of the keyword if or an undeclared
funetion identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must retum
the when id tw the parser and let sume oth
phase of the compiler probably the paises
this case handle an error due to transposition of the letters.
2. Suppose a situation arises in which the lexical analyzer is unable to proceed because none of
the patterns for tokens matches any prefix of the remaining input, The simplest recovery
strategy is "panic mode" recovery. We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the beginning of what input is left
This recovery technique may confuse the parser, but in an interactive computing environment
it may be quite adequate.
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 44Automata Theory & Compiler Design 210551 Module 2
Possible error-recovery actions are:
1, Delete one character from the remaining input
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4, Transpose two adjacent characters
INPUT BUFFERING
Lexical analyzer scans the input string from left to right, one character at a time. It uses two
pointers as:
> beginLexeme pointer
> forward pointer
To keep track of the position of the input scanned. Initially both the pointers point to the first
character of the input string,
7
|
|
‘The forward pointer moves ahead to search for end of lexeme. As soon as the blank space is
encountered, it indicates end of lexeme. In above example as soon as forward pointer encounters
ablank space, the lexeme is identified
be
|
CC CO EC CG
The fp will be moved ahead whi it sees white space. That is when fp encounters white space
it ignores and moves ahead. Then both fp and bp is set at next token
‘What is meant by input buffering?
To recognize tokens reading data/source program from hard disk is done. Accessing hard disk
each time is costly and time consuming so special buffer technique has been developed to reduce
the amount of overhead required. This process of reading source program into a buffer is called
input buffering.
A block of data is first read into a buffer and scanned by lexical analyzer. There are two methods
used
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 45Automata Theory & Compiler Design 210551 Module 2
1. One buffer
2. Two buffer
Onc buffer scheme:
Here only one buffer is used to store the input string. But the problem with this scheme is that, if
a lexeme is very long, then it crosses the buffer boundary. To scan the remaining part of lexeme,
the buffer has to be refilled, that makes overwriting of first part of lexeme. Sometimes it may
result in loss of data due to the user misinterpretation.
Two Buffer scheme:
‘Why two buffer schemes is used in lexical analysis? Explain.
Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized two
buffering techniques have been developed to reduce the amount of overhead required to process
a single input character.
Here a buffer (array) divided into two N-character halves, where N = number of
characters on one disk block Ex: 4096 bytes — If fewer than N characters remain in the
input file , then special character, represented by eof, marks the end of source file and it is
different from input character.
© One read command is used to read N characters. Two pointers are maintained: be;
of the lexeme pointer and forward pointer.
* Initially, both pointers point to the first character of the next lexeme.
* Using this method we can overcome the problem faced by one buffer scheme, even
though the mput 1s lengthier the user knows from where he has to begin in the next
buffer, as he can see the contents of previous buffer. Thus there is no scope for loss of
any data,
Sentinels:
In two buffering scheme we must check the forward pointer, each time it is incremented. Thus
we make two tests: one for the end of the buffer, and one to determine what character is read,
We can combine these two tests. if we use a sentinel character at the end of buffer.
ar
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 46Automata Theory & Compiler Design 21C851 Module 2
Sentinel is a special character inserted at the end of buffer, that cannot be a part of source
program; eof is used as sentinel,
Look ahead code:
Switeh(*forward++)
{
case eof
if (forward is at end of first buffer)
{ reload second buffer:
eginning of second buffer:
forward
}
elseif (forward is at end of second buffer)
{ reload first buffer:
forward = beginning of frst buffer:
}
else /* eof within a buffer marks the end of input terminate lexical analysis*
break:
case for the other characters
}
Operations on Languages
Give the formal definitions of operations on languages with notations.
In lexical analysis the most important operations on languages are:
i. Union
ii, Concatenation
iii, Star duswe
iv. Positive closure
‘These operations are formally defined as follows
‘OPERATION DEFINITION AND
Union of Land M___ [LUM ={o| isin Lorik
Concatenation of Land M
in MY
{ot [is in L and tis in MY
‘Kleene closure of L TUR,
Positive closure of L Tt up, i
Regular Expressions
We use regular expressions to describe tokens of a programming language.
+ A regular expression is built up of simpler regular expressions (using defining rules)
‘ATHMARANJAN K Department of ISE, SIT Mangaluru. Page 47,Automata Theory & Compiler Design 210551 Module 2
+ Each regular expression denotes a language.
* A language denoted by a regular expression is called as a regular set,
© *, concatenation and | has highest to lowest precedence with left associative.
* If two regular expression ‘r’ and ‘s’ denote the same language, we say ‘r’ and ‘s’ are
equivalent and write ms
‘Write the definition of regular expression
Regular expression over alphabet 5 can be defined as follows:
&- is a regular expression for the set containing empty string.
ais a regular expression, if a belongs to ¥: that is {a} set containing the string a.
Suppose r and s are regular expression denoting the languages L(r) and L(s) then,
(rs) is a regular expression denoting L(r) U L(s).
rs is a regular expression denoting L(r)L(s)
(0)* is a regular expression denoting (L(1))*
The following table shows some of the regular expressions along with their possible regular
sets:
Regular expression Set
alb fa, b}
{(alb)(alb) o aajab|bajbb faa, ab, ba, bb}
at {e,a, aa, aaa,...}
(alb)* or (a*b*)* fe, a,aa,b,bb,...}
ala*b {a, b, ab,aab,aaab,...}
Algebraic properties
11s =s\t
r((sit)= (ris)
(rs)t=n(st)
(sit) =rs)rt
(sit)r = srltr
er
re
r= (re) t=
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 48Automata Theory & Compiler Design 21C851 Module 2
[abe] denotes the regular expression albje
[a-z] denotes the regular expression alb|......!2
Regular definition:
* To write regular expression for some languages can be difficult, because their regular
expressions can be quite complex. In those cases, we may use regular definitions.
+ Weccan give names to regular expressions, and we can use these names as symbols to define
other regular expressions.
Define regular definition aud write the regular definition for C identifier.
co
A regular definition is a sequence of the definitions of the form:
di > rl where Gyis a distinct name and Ip is a regular expression over symbols
in DU { di. dads dua}
a2a- 2 NN
. . basic symbols previously defined names
dsm
Regular definition for *C’ identifier:
letter + A|B |e
digit > 0] 1]....19
id — letter (letter | digit )°
OR
leter > [A-Za-z]
digit > [0-9]
id > letter (letter | digit )”
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 49Automata Theory & Compiler Design
21C851
Module 2
digit >
digits >
optionalFraction —>
optionalExponent —>
number —
OR
digit > [0-9]
digits + digit”
umber >
Recognition of Tokens
O/1...19
(digit )°
- digits |e
(E(+|-[e) digits) |e
digits optionalFraction optionalExponent
digits (digits)? (F [+-]? digits )?
Our current goal is to perform the lexical analysis needed for the following grammar.
stmt — if expr then stmt
if expr then stmt else stmt
e
expr — term relop term // relop is relational operator =. >. ete
tem
term — id
number
Recall that the terminals are the tokens the nanterminals produce termi
A regular definition for the terminals is
digit — [0-9]
digits — digits”
number — digits (. digits)? (E[+-]? digits)?
letter —> [A-Za-2]
id — letter (letter | digit
if if
then — then
else — else
relop > <|
‘ATHMARANJAN K Department of ISE, SIT Mangaluru
Page 50Automata Theory & Compiler Design 21CS51
Module 2
Lexeme Token Attribute
Whitespace ws
if if _
then then = —
else else —
An identifier id Pointer to table entry
Amumber — number Pointer to table entry
< relop LT
« relop Lr
= relop EQ
=> relop NE
> relop GT
relop GE
We also want the lexer to remove whitespace so we define a new token
ws — ( blank | tab | newline ) +
where blank, tab, and newline are symbols used to represent the corresponding ascii
characters,
Recall that the lexer will be called by the parser when the latter needs a new token. If the
lexer then recognizes the token ws, it does nor retum it to the parser but instead goes on
to recognize the next token, which is then retumed. Note that you can't have two
consecutive ws tokens in the input because, for a given token, the lexer will match the
Iongest lexeme starting at the current position that yields this token. The table on the
right summarizes the situation
For the parser, all the relational ops are to be treated the same so they are all the same
token, relop. Naturally, other parts of the compiler, for example the code generator, will
need to distinguish between the various relational ops so that appropriate code is
generated. Hence, they have distinct attribute values,
‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 51Automata Theory & Compiler Design 21851.
Specification of Token
To specify tokens Regular Expressions are used.
Recogniti
n of Token: To recognize tokens there are 2 steps
1. Design of Transition Diagram
2. Implementation of Transition Diagram
‘Transition Diagrams
‘A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisi
ns that must be made based on the input seen, The two main
components are circles representing states (think of them as decision points of the lexer) and arrows
representing edges (think of them as the decisions made).
Itis fairly clear how to write code corresponding to this diagram. You look at the first character, if
it is <, you look at the next character. If that character is =, you return (relop, LE) to the parser. If
instead that character is >, you return (relop, NE). If it is another character, return (relop, LT) and
adjust the input buffer so that you will read this character again since you have not used it for the
current lexeme. If the first character was =, you retum (relop, EQ),
i. relop (relational operator)
Identifier and keyword
iii, Unsigned number
iv. Integer constant
vw Whitespace
i, ‘Transition diagram for relop:
@)returnteelop, Ge)
[ @igeren
‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 52Automata Theory & Compiler Design 21851.
Transition diagram for identifier and keyword:
letter or digit
iv. Integer constant:
start tor
ve Whitespace:
Whitespace characters are represented by delimiter, where delim includes the characters like
blank, tab, new line and other characters that are not considered by the language design to be part
of any token,
‘ATHMARAN Page 53Automata Theory & Compiler Design 210851
Recognition of reserved words and :
letter or digit
\ raturn(getToken(), installi() )
‘There are two ways we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially: When we find an identifier, a call
to installID( ) function places that identifier into the symbol table if it is not already there
and returns a pointer to the symbol table entry. The function getToken( ) examines the
symbol table for the lexeme found, and returns token name as either id or one of the
keyword token that was initially installed in the table.
2. Create separate transition diagrams for each keyword
Architecture of a transition diagram based lexical analyzer
The idea is that we write a piece of code for each decision diagram. This piece of code contains a
case for each state, which typically reads a character and then goes to the next case depending on
the character read. nextehari) 1s used to read a next char from the input butfer. Ihe numbers in the
circles are the names of the cases. Accepting states often need to take some action and return to the
parser. Many of these accepting states (the ones with stars) need to restore one character of input.
This is called retract() in the code.
What should the code for a particular diagram do if at one state the character read is not one of
those for which a next state has been defined? That is, what if the character read is not the label of
any of the outgoing arcs? This means that we have failed to find the token corresponding to this
diagram.
The code calls fail(, is not an error case. It simply means that the current input does not match
this particular token. So we need to go to the code section for another diagram after restoring the
‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 54Automata Theory & Compiler Design 21851.
input pointer so that we start the next diagram at the point where this failing diagram started. If
wwe have tried all the diagram, then we have a real failure and need to print an error message and
perhaps try to repair the input
Transition diagram:
stare return( relop, LE)
othe
returntretep, tay @returntrelop. ert
—+@)returnivaton, oe)
—
a ae
Coding part:
TOKEN getRelop( )
{
JUKEN retloken = new(KELUP);
while(1)
( * repeat character processing until a return or failure occurs */
Switch (state)
=nextChar( );
if(e
‘<*) state = 1; else
‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 55,Automata Theory & Compiler Design 210851
else fail(); * Iexeme is not a relational operator...other */
break;
cease 1: ¢ = nextChar( );
4,
else fail(); /* lexeme is not a relational operator...other */
break;
case 8: retract( );
retToken.attribute = GT;
return(retToken);
ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU.
Page 56