KEMBAR78
Atc-21cs51 Module 2 | PDF
0% found this document useful (0 votes)
393 views56 pages

Atc-21cs51 Module 2

Uploaded by

Deekshith M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
393 views56 pages

Atc-21cs51 Module 2

Uploaded by

Deekshith M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 56
Automata Theory & Compiler Design 210551 Module 2 REGULAR EXPRESSIONS AND LANGUAGE; Introduction Instead of focusing on the power of a computing device, Iet's look at the task that we need to perform, Let's consider problems in which our goal is to match finite or repeating patterns. For example regular expressions are used as pattern description language in + Lexical analysis.-- compiler + Filtering email for spam. = Sorting email into appropriate mailboxes based on sender and/or content words and phrases. + Searching a complex directory structure by specifying patterns that are known to occur in the file we want. A regular expression is a pattern description language, which is used to describe particular patterns of interest. A regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Example: [ |: A character class which matches any character within the brackets [* \en] matches any character except space, tab and newline character. Regular expression A language accepted by finite- automata is called as regular language A regular- languages can be described using regular expressions, in the form of algebraic notations consisting of the symbols such as alphabets in ¥, the operators such as +. and *, where + iy used for univ uperati + iy used for concatenation aud * iy used for elusure uperatis ‘Thus the regular expressions are the structural representation of finite-automata using algebraic notations, which can serve as the input language for many systems that process strings. Example such as Lex program, Unix Grep ete. Definition of Regular expression Define Regular expression, A regular expression is defined as follows: is a regular expression denoting an empty language. eis regular expression denoting the language containing empty string, ais regular expression denoting the language containing only {a} ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 1 Automata Theory & Compiler Design 210551 Module 2 IfR is a regular expression denoting the language Ly and S is a regular expression denoting the language Ls then R + S is a regular expression corresponding to the language Lx U Ls RSS is a regular expression corresponding to the language Lx Ls. R’ is a regular expression corresponding to the language Lp, Thus the expressions obtained by applying any of the rules are regular expressions Examples of Regular expressions Regular expression | Meaning a Siring consisting of any number of a's. (zero of more as) a ‘String consisting of at least one a. (one or more a’s) (a+b) String consisting of either a or b (aby String consisting of any nuber of a's and b's including © (arby ab Strings of a's and b’s ending with ab, ab(atby Strings of a's and b's starting with ab (a+b) ab (ab) Strings of a's and b’s with substring ab. a, Strings of a’s and b’s having length 2: Regular expression = (a+ b) ( a* b). b. Strings of a’s and b’s of length < 10: Regular expression = («+ a+b)". c. Strings of a’s and b’s of ‘even length. (a+b) (a+b) T° d. Strings of a’s and b’s of odd length Regular expression Regular expression = ( a +b) |(a +b) (a+ b)]" e. Strings of a's of even length Regular expression = (aa)" f. Strings of a's of odd length Regular expression = a(aa) ‘Strings of a’s and b’s with altemate as and b's. Alternate a’s and b’s can be obtained by concatenating the string (ab) zero or more times. ie (ab)" and adding an optional b to the front ie: (¢ +b) and adding an optional a at the end, ie: (e +a) ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 2 Automata Theory & Compiler Design 210551 Module 2 The regular expression = (e-+b) (ab) (+a) OR Je String should contain at least one a and one b, so the regular expression corresponding to this is given by = ab+ba There is no restriction on ¢”s. Insert any number of a’s, b’s and ¢;s ic: (a+b+e)" in between the above regular expression. So the final regular expression = (atb+c)’ a (atbte) b(atb+e)’ + (atbte)” D(atb+e)"a(atb+e)” Regular expression for string containing 3 consecutive 0’s = 000 The above regular expression can be preceded or followed by any number of 0’s and 1’s, ie: (oH) Regular expression = (0+1)"000(0+1)" Regular expression for strings of a’s and b’s ending with b and has no substring aa is nothing but the string containing any combinations of either b or ab without «. Regular expression = (b + ab) (b +ab)" means even number of a’s, regular expression = (aa)" 'b™ means even number of b’s, regular expression = (bb). ‘The regular expression for the given language =(aa)" (bb)" means odd number of a’s, regular expression = a(aa)” b*” means even number of b’s, regular expression = (bb) ‘The regular expression for the given language = a(aa) (bb)” ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 3 Automata Theory & Compiler Design 210551 Module 2 a?'! means odd number of a’s, regular expression = a(aa)" b*™"! means odd number of b’s, regular expression = b(bb)* ‘The regular expression for the given language = a(aa)'b(bb)" Regular eapressivut for exactly ume 1 = 1 Even number of 0’s = (00)" So here 1 can be preceded or followed by even number of 0's or 1 can be preceded and followed by odd number of 0's. ‘The regular expression for the given language = (00)" 1 (00)" + 0(00)" 1 0(00)" OR Whenever a 0 occurs it should be followed by 1, But there is no restriction on number of I's. So it isa string consisting of any combinations of 1’s and 01’s, ie regular expression = (1+01)" Suppose string ends with 0, the above regular expression can be modified by inserting (0 + ¢) at the end. Regular expression for the given language = (1+01)" (0+) OR ‘Whenever a 1 occurs it should be followed by 0, But there is no restriction on number of 0's. So it is a string consisting of any combinations of 0’s and 10’s, ie regular expression = (0+10)" Suppose string ends with 1, the above regular expression can be modified by inserting (1+ ¢ ) at the end. Regular expression for the given language = (0+10)" (1+ ©) Strings of a’s and b’s with substring ab. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru. Page 4 Automata Theory & Compiler Design 210551 Module 2 Regular expression = (a+b)" aab(at+b)" ii. Strings of a’s and b’s such that 4" symbol from right end is b and the 5" symbol from right end is a. Here the 4" symbol from right end is b and the 5" symbol from right end is a the corresponding regu:lar expression = ab(a+b)(a+b)(a+b), But the above regular expression can be preceded with any number of a’s and b’s. ‘Therefore the regular expression for the given language = (a+b) ab(a+b)(a+b)(a+b). iii. Strings of a’s and b’s such that 10" symbol from right end is b. The regular expression for the given language = (a+b) b(a+b)” Strings of a’s and b’s whose lengths are multiple of 3 OR L = { |w|mod 3 =0, where w is in = {a,b} Length of string w is multiple of 3, the regular expression = [(a+b) (a+b) (a+b)]" Vv. Strings of a’s and b’s whose lengths are multiple of 5 OR , where w is in == { a,b} Length of string w is multiple of 5, L={ |w|mod 5 = ‘The regular expression = [(a+b) (a+b) (a+b) (atb)(at+b)]" vi. Strings of a’s and b’s not more than 3 a's: Not more than 3 a's, regular expression= (ea) (e+a) (ea), But thete is uy resuivtivu on b's, so we cau include Lin between the abuve regular expression, ‘The regular expression for the given language = b’(c+a) b (eta) b” (eta) b” vii, Obtain the regular expression to accept the words with two or more letters but beginning and ending with the same letter. 2 = { a, b} Regular expression beginning and ending with same letter is = a a + b b. In between include any number of a’s and b’s. ‘Therefore the regular expression = a(a*b)" +b(atb)b viii. Strings of a’s and b’s of length is either even or multiple of 3. Multiple of regular expression = [(a+b) (a+b) (a+b)]" Length is of even, regular expression = [(a+b) (a+b)]” ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 5 Automata Theory & Compiler Design 21C851 Module 2 So the regular expression for the given language = |(a+b) (a+b) (atb)|"+ [(a+b) (a+b) Obtain the regular expression to accept the language L~ { a" | m+n is even } Here n represents number of a’s and m represents number of b’s. mn is even results in two possible cases; case i. when even number of a’s followed by even number of bs. regular expression : (aa) (bb)” ‘asc ii. Odd number of a’s followed by odd number of b’s. regular expression = a(aa)’ b(bb)’. So the regular expression for the given language = (aa)'(bb)" + a(aa)’ b(bb)" X. Obtain the regular expression to accept the language L= { "B® |n>4 and m <3 }. Here n 2 4 means at least 4 a’s, the regular expression for this = aaaa(a)” ‘m_< 3 means at most 3 b’s, regular expression for this = (e+b) (+b) (+b). So the regular expression for the given language = aaaa(a)’ («+b) (e+b) (e+b). Obtain the regular expression to accept the language L~= { a" e? |m> 4 andm <3 p< a. Here n > 4 means at least 4 a’s, the regular expression for this = aaaa(a)" 'm <3 means at most 3 b’s, regular expression for this = (<+b) (+b) (e+b) p<2 means at most 2 ¢’s, regular expression for this ~ (ce) (c+e) So the regular expression for the given language = aaaa(a)'(etb) (eb) (e+b) (e+e) (ete). All strings of a's and b’s that do not end with ab. ‘Suings of length 2 and that do uot end with ab ave ba, wa and bb. So the regular expression = (a+b)'(aa + ba +bb) xiii. All strings ofa’s, b’s and c’s with exactly one a ‘The regular expression = (b+c)" a (b+c)" xiv. All strings of a’s and b’s with at least one occurrence of each symbol in ¥ = {a, b} Atleast one occurence of a’s and b’s means ab + ba, in between we have 1 number ofa’s and b's. So the regular expression =(a+b)" a (a+b)" b(a+b)’ +(a+b)" b(a+b)” a(a+b)” ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 6 Automata Theory & Compiler Design 210551 Module 2 Solution: Cave i, Since nm 23, if'm = 1 then m should be & 3, The equivalent regular expression is given by: RE=a bbb(b)” Case ii, Since nm > 3, ifm = 1 then n should be 2 3. The equivalent regular expression is given by: RE= aaa(a) b Case iii. Since nm 2 3, ifm >2and n> 2 then the equivalent regular expression 1s given by: RE = aa(a)" bb(b)" So the final regular expression is obtained by adding all the above regular expression. Regular expression = abbb(b)" + aaa(a)"b + aa(a)"bb(b)” Application of Regular expression: 1. Regular expressions are used in UNIX. 2. Regular expressions are extensively used in the design of Lexical analyzer phase. 3. Regular expressions are used to search patterns in text. FINITE AUTOMATA AND REGULAR EXPRESSIONS, 1. **™*Converting Regular Expressions to Automata: Prove that every language defined by a regular expression is also defined by a finite automata. Proof: Suppose L = L(R) for a regular expression R, we show that L = L(E) for some -NFA E with: a, Exactly one accepting state. b, No ares into the initial state, c. No ares out of the accepting state. The proof must be discussed with the following transition diagrams for the basis of the construction of an automaton. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page7 Automata Theory & Compiler Design 21C851 Module 2 ——©6) By definition of regular expression, if R is a RE and S is a RE then R+S is also a RE corresponding to the language L (R+S), its automaton is given by: 1s Starting at new start state, we can go to the start state of either the automaton for R or S. We then reach the accepting state of one of these automata R or S. We can follow one of the ¢- ares to the accepting state of the new automaton. Automaton for R.S is given by: RS) The start state of the first( R) automata becomes the start state of the whole and the final state of the second(S) automata becomes the final state of the whole. Oo Automaton for R* is given by: -QEST SS Loe) From start state to final state one arc labeled g ( for ¢ in R") or the to the start state of automaton R through that automaton one or more time and then to the final state. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 8 Automata Theory & Compiler Design 210551 Module 2 The automaton for L = 0 is given by: —O--0 ‘The automaton for L = 1 is given by: —O2—O ‘The automaton for L = 0+1 is given by: oO = 1-0 ‘The automaton for L = (0+1)" is given by: ea ‘The automaton for L = (0+1)" 1 is given by: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page9 Automata Theory & Compiler Design 210551 Module 2 Finally the e-NFA for the regular expression: (0+1)"1(0+1) is given by: ‘The automaton for L ‘The automaton for L = 1 is given by: ‘The automaton for L= 01+1 is given by: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 10 Automata Theory & Compiler Design 210551 Module 2 ‘The automaton for L = 01 is given by: The automaton for L= 101 is given O70 0 The automaton for L = (01+101) is given by: : 2 Q4—¢y e ~ @+ 944 The automaton for L = 01 is given by: Epsilon-NFA for the regular expression (0+ 1)°01 is given by: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 11 Automata Theory & Compiler Design 210551 Module 2 Zo + <_—® Se + O-.@1—@O ten Oem ae ‘The automaton for L The automaton for L = 1 is given by: Oo < The automaton for L = 2 is given by: —O7—O ‘The automaton for L = 0° is given by: ‘The automaton for L = 2° is given by: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 12 Automata Theory & Compiler Design 21C851 Module 2 Se a. CONVERTING DFA’s TO REGULAR EXPRESSION USING STATE ELIMINATION TECHNIQUE How to build a regular expression for a FSM. Instead of limiting the labels on the transitions of an FSM to a single character or s, we will allow entire regular expressions as labels. * For a given input FSM/FA M, we will construct a machine M’ such that M and M’ are equivalent and M’ has only vo states, start state and a single accepting state. + M’ will also have just one transition, which will go from its start state to its accepting state. The label on that transition will be a regular expression that describes all the strings that could have driven the original machine M from its start state to some accepting state Algorithm to create a regular expression from FSM: (State elimination) 1. Remove any states from given FSM M that are unreachable from the start state 2. If FSM M has no accepting states then helt and return the simple regular expression @. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 13 Automata Theory & Compiler Design 210551 Module 2 3. If the start state of FSM M is part of a loop (i.e: it has any transitions coming into it), then create a new start state s and connects to M ‘s start state via an e-transition, This new. start state s will have no transitions into it. 4, Ifa FSM M has more than one accepting state or if there is just one but there are any transitions out of it, create a new accepting state and connect each of M’s accepting states to it via an e-transition, Remove the old accepting states from the set of accepting states. Note that the new accepting state will have no transitions out from it 5. At this point, if M has only one state, then that state is both the start state and the accepting state and M has no transitions. So L (M} = {2}. Halt and return the simple regular expression as & 6. Until only the start state and the accepting state remain do: 6 Select some state s of Mwhich is of any state except the start state or the accepting state, 6.2 Remove that state s from M. 6.3. Modify the transitions among the remaining states so that M accepis the same strings The labels on the rewritten transitions may be any regular expression. 7. Return the regular expression that labels the one remaining transition from the start state to the accepting state OR OOO ‘We can build an equivalent machine M’ by eliminating state q2 and replacing it by a transition from qj to q; labeled with the regular expression ab’a. So M'is: © ava © ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 14 Regular Expression = ab‘a Automata Theory & Compiler Design 21C851 Module 2 ‘There is uo ines ag edge into the ial state as well as no oulgoing edge fiom final state. So there is only two states, initial and final. abe Regular expression = (a+b+e) or (aU b Ue) There is no incoming edge into the initial state as well as no outgoing edge from final state. Oa © After eliminating the state B: Regular expression = ab Y H—O-—©) There is no incoming edge into the initial state as well as no outgoing edge from final state. After eliminating the state B o> ©) Regular expression = ab’e ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 15 Automata Theory & Compiler Design 21C851 Module 2 @ : OD Since initial state has incoming edge, and final sate has outgoing edge, we have to create a new iniatial and final state by connecting new initial state to old initial state through ¢ and old final state to new final state through ¢, Make old final state has non-final state. After removing state A. 10 7 0 « B Afier removing state B: "© Regular expression: 0(10)" start,(qQ)__a B £ @) ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 16 Automata Theory & Compiler Design 21C851 Module 2 Since there are multiple final states, we have to create a new final state. After removing states C, D and E: @ @)—t@) Afier removing state B Regular Expression: a(b+e+d) ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 17 Automata Theory & Compiler Design 21C851 Module 2 After removing state A. After removing state B: Regular expression: b(c tab) d After removing state B After removing state A’ ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 18 Automata Theory & Compiler Design 210551 Module 2 2 No ald Regular expression: (0+10°1)" By creating new start and final states: After removing state qr S ro After removing state q2 10 01 = After removing state qs ® 1700"1 +1071)" ® Regular expression: 1°00°1(0+10" 1)" ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 19 Automata Theory & Compiler Design 21C851 Module 2 After removing qi state: After removing qp state: After removing qa state ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 20 Automata Theory & Compiler Design 210551 Module 2 After removing qpy state: | @1+10)" Regular expression: (01+10)" OR Since start state 1 has incoming transitions, we create a new start state and link that state to state 1 through e. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 21 Automata Theory & Compiler Design 210551 Module 2 Since accepting state 1 and 2 has outgoing transitions, we create a new accepting state and link that state to state I and state 2 through ¢, Remove the old accepting states from the set of accepting states. (ie: consider | and 2 has non final states) Consider the state 1 and remove that state: Finally we have only start and final states with one transition from start state I to final state 2, The labels on transition path indicates the regular edpression, Regular Expression = (ab U aaa” b)" (a Uc) ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 22 Automata Theory & Compiler Design 21C851 Module 2 After removing qz state: After removing q) state ut After removing qo state Regular expression: 0° (¢ + 1°)=0" 1° ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 23 Automata Theory & Compiler Design 210551 Module 2 After removing D state: o © After removing E state: After removing A state: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 24 Automata Theory & Compiler Design 210551 Module 2 (s) Govt (2) * (c) © After removing B state: ery (s ) OTT —« x e After removing C state: —~@ Regular expression = (00) 11(11)" OR a “Ox, Co, ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 25 Automata Theory & Compiler Design 210551 Module 2 By creating final state After removing qistate: After removing qastate: — = oro After removing qsstate: Oa ©, Regular expression= 01°01" ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 26 Automata Theory & Compiler Design 210551 Module 2 By creating new start and final states: After removing qo state: —) ony e oa After removing qr state: Afier remaving qs state ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 27 Automata Theory & Compiler Design 21C851 Module 2 nen OL ene After removing q; state: (0+1)*1(0+1) + (0+1)°100+1)00+1) ® oT Regular expression: (0-+1)"1(0+1) +(0+1)'1(0+1)(0+1) CONVERTING DFA’s TO REGULAR EXPRESSION USING KLEEN’S THEOREM The construction regular expression using this method describes sets of strings that label certain paths in the DFA’s transition diagram, However the paths are allowed to pass through only a limited subset of the states. We start with the simplest expressions that describe paths that are not allowed to pass through any states (ie: they are single state or single arc), and inductively build the expressions that let the paths go through progressively larger sets of states. Finally the paths are allowed to go through any state, At the end these expressions represent all possible paths. Let us consider a DFA with ‘n’ number of states and use Ry as the name of regular expression whose language is the set of strings w is the label of a path from state I to state j in a given DFA and that path has no intermediate node whose number is greater than k. Note that beginning and end points of the path are not intermediate, so there is no constraint that i and/or j be less than or equal to k. To construct the regular expression Rj" we use the inductive definition, starting at k = 0 and finally k is reaching k = n which is the number of states in DFA. When k=0 Regular expressions for the paths that can go through no intermediate state at all, there are 2 Kinds of paths that mect such condition: 1. Anare from node (state) i to node j ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 28 Automata Theory & Compiler Design 210551 Module 2 2. A path of length 0 that consists of only some node i If 4 j then only case (1) is possible, We must examine DFA and find those input symbols a such that there is a transition from state i to state j on symbol a. a) If there is no such symbol a, then R;;” =@ b) If there is exactly one such symbol a, then Ri! =a ©) If there are symbols ay, a2, as, ay that label ares from state Ito state j, then Ry’ =a; tay tay, +a If i —j, then legal paths are the path of length 0 and all loops from i to itself. The path length 0 is represented by the regular expression c. Thus we add ¢ to the various expressions devised in a) through c) above, That is in case a) expression becomes @ + ¢ = e, in case b) expression becomes ta, in case c) expression becomes ¢ + a; + a2 +35, ay Suppose there is a path from state i to state j that goes through no state higher than k, There are ‘two possible cases to consider. 1. The path does not go through state k at all, In this case, the label of path is in the language of Ri" 2. The path goes through state k at least once, Then we can break the path into several pieces, as suggested in below figure: A ee ier en tn Re? In REY Zero or more strings in RG The first gocs from state i to state k without passing through k, the last picee gocs from k to j without passing through k, and all the pieces in the middle go from k to itself, without passing through k. When we combine the expressions for the paths of the two types above, we have the expression for the labels of all paths from state ito state j that go through no state higher than k. RD = RG eR POR PR Ry’ = Regular expressions for the paths that can go through no intermediate states at all. Ry! = Regular expressions for the paths that can go through an intermediate state 1 only. Rj? = Regular expressions for the paths that can go through an intermediate state 1 and state 2 only. Ry’ = Regular expressions for the paths that can go through an intermediate state 1, state 2 and ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 29 Automata Theory & Compiler Design 21C851 Module 2 State3 only and so on. NOTE: Tdentity rule Example eR=Re=R le =el=1 O@R=RO=G 19=01=0 e =e (@) =e O+R=R1O=RK R4R=R RR'=RR=R™ (RY=R RR=R e+RR=R eral (PHOR= PRIOR PHO =P'Q)= PHD) R@+R)=@+R)R= (e+R) =R e+R=R (PQ P=P(QP) RR+R=RR-=R™ Start A @D 1 ‘When k =0; (passing through no intermediate state), the various regular expressions are: Answer: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 30 Automata Theory & Compiler Design 210551 Module 2 When =1; (passing through sate 1 as intermediate state), the various regular expressions are: Therefore the regular expression corresponding to the language accepted by the DFA is given by: Rie’ (state 1(i the start state and state 2(j) is the final state). By using the formula: Rix =Ru! +Ri2! (Ra!)’ Ra! A? | 1°0+1°0(e +041)" +041) “0 (0 +1)" ‘Answer: Number of states in DFA = 3; ie: k=3 By renaming the states of DFA: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 31 Automata Theory & Compiler Design 210551 Module 2 RQ [ise] [RO [> R® |r R2 |o RQ [ro | [R® [a0 FS | RY RQ | 01 |e yy RQ |e RE |e RQ |e R® le RQ [a Comin @ Ta RY |e RY |¢ RQ |e RQ {i r@ |r Re [a ]0+e sy [O+e @ [ore @) () © Regular expressions for paths that can go througi a) no state, b) state 1 only and c) states 1 and 2 only. Therefore the regular expression corresponding to the language accepted by the DFA is given by: Ris’ (state 1(i) is the start state and state 3 (j) is the final state). By using the formula: a = ro + we get Where i sje dand k= Rus) = Rus? + Ris? (Rss?) Ras? =101 +101 (0+e+11) (O+e+11) =101+1°01O+e+11)] Regular expression = 1°01 (0+ 11)" é 0 1 =a ca ai cs ca @ “qs we q Answer: Number of states in DFA = 3; ie: k =3 By renaming the states of DFA as a= 1. a> =2.a:=3 Transition diagram of DFA: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 32 Automata Theory & Compiler Design 210551 Module 2 Ri” jert Ri TI Ra” [oO RT [10 RL [100 Rw” | O Ra” |TO=0 R37 [Toor Ra” |O Ra? |O Ri? | R27 [O+e Rx® [Ote Rx” [0 Ra” [1 Ra” [1 Ra” [Fit Ri” |@ Rai [Oo Ri” [0 Ra I Re [1 Ru? [10 Ras Ore Ru” [Ore Ra? | @+101) ‘Therefore the regular expression corresponding fo the language accepted by the DFA is given by: Rjs' (state 1(j) is the start state and state 3 (j) is the final state). By using the formula RAP = REY 4 REN REY) RED Where i=, j= 3 and k = 3; we get Ris = Ris + Ris’ (Rav) Ras? 1 00'1 + 1°00 1 (0 +10°1)') + 1071)” Regular expression = 1° 0° 1 (0 + 10°1)" ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 33 Automata Theory & Compiler Design Module 2 By renaming the states of DFA, we get Regular expressions for paths that can go through 3 intermediate states: states 1, states 2 and states 3 only. Re RW [oO Ra [hb Ri” | (a+ bb)b Ra” [aba + bbba Ri? | Rae Oo Ras bb’ Ra? [bba Ri [0 Rx” 18 Rs [8 Ru [a Ri? |O Ro [Oo Rw [Oo Rw [Oo ‘The regular expression corresponding to the language accepted by the DFA is given by: Rus’ (tate 1(j) is the start state and state 4 (j) is the final state), By using the formula: ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 34 Automata Theory & Compiler Design 210551 Module 2 AE = RE? + RMA? Where i=1, j= 4 and k =4; we get Rush = Rig + Rig (Ras) Rag? fab’a + bbb’a) + (ab‘a + bbb’a) (0) @ We know that 0° =e and OR= 0 Therefore Regular expression reduces to =ab'a+ bbb‘a REGULAR LANGUAGES AND PROPERTIES OF REGULAR LANGUAGES. Regular Languages: The regular languages are the languages accepted by Finite automata (DFA, NFA” and ¢ — NFA") and defined by regular expressions. { Strings of a’s and b’s ending abb} L.={ even of a’s and even number of b’s} etc Example: L There are many languages which are not regular. We can prove that certain languages are not regular using one powerful too! called pumping lemma. Example: L= {a" b" | n>=0} L= { equal number of 0s and 1’s}_ ete. PROVING LANGUAGES NOT TO BE REGULAR Pumping Lemma (PL) for Regular Languages: Theorem (Statement) : Let L be a regular language. Then there exists a constant ‘n’ (which depends on L) such that for every string ‘w’ in L such that |w| > n, we can break w into three strings, w=xyz, such that: Lly>0 ie: y#e 2. fey sn 3. For all k > 0, the string xy*z is also in L. Proof. Suppose L — L(A) for sume DFA ‘A’ and regular language L. Suppose ‘A’ ha s'n” number of states. Consider any string w = ajazas ay of length ’m’ where m >= n and each aj is an input symbol. Since we have ‘m’ input symbols, naturally we should have ‘m+1” states, in sequence qo, qu, dz qm Where qp is > start state and qm is + final state ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 35 Automata Theory & Compiler Design 21C851 Module 2 EY HO) AO HE n, by the pigeonhole principle itis net possible to have distinet transitions, since there Since | are only ‘n’ different states. So one of the state can have a loop. Thus we can find two different integers i and j with 0 0, xy*z is also accepted by DFA ‘A’; that is for a language L to be a regular, xy‘zis in L forall k>0. Applications of Pumping lemma: 1 Itis useful to prove certain languages are non-regular. 2. Itis possible to check whether a language accepted by FA is finite or infinite, ‘Show that L= {a b" |m>=0} is not regular. Let L 1s regular language and ‘n’ be the number of states in FA. since |w|=n-+n=2n > n, we can split ‘w’ into xyz such that |xy| 1 as 2 a on ———r—rD w = aaaaaaaaazazzzaaa a bbbbbbbbbLbLbELbb x y z ‘Where [x|=n-1 and |y| =I so that xy] =n-1 +1 = no. If*k? = 0, the string *y’ does not appear, so the number of ‘a’s will be less than number of *b’s ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 36 Automata Theory & Compiler Design 210551 Module 2 ie: w= a"! DY But according to pumping lemma ‘n’ number of ‘a’s should be followed by ‘n” of “b’s, which is ‘a contradiction to the assumption that the language is regular. So the language L= {a" b” | n>= 0} is not regular language. ‘Show that L= {a'b!|i>j} is not regular. Let L is regular language and ‘n’ be the number of states in FA. Consider the string w = a"! b” since |w|— n+] +n—2nt1 =n, we can split ‘w’ into ‘xyz’ such that xy| 1 as atl 2 —— —~__. me smn aaa FS keel yea = Where [x|=n-I and |y| =I so that [xy] =n-1 +1 = n 0. It'k = 0, the string *y’ does not appear, so the string ‘w’ has ‘n’ number of ‘a‘s followed by ‘n” number of ‘b's. ie: w= a" b” But according to pumping lemma ‘n+I’number of ‘a’s should be followed by ‘n’ of *b’s, which is a contradiction to the assumption that the language is regular. So the language L= {a' bj | i>j} is not regular language. ‘Show that L= {w | n,(W) < nyo } iS not regular, Let L is regular language and ‘n’ be the number of states in FA. Consider the string w= a™! b” since |w|=n-1 +n =2n-1 > n, we can split ‘w" into “xyz? such that [xy| 1 as al a ——, ——~—_~ Lm daaaaaaaaaaaaazaaaa bbdbbbobbtbdbobbdob: zope Where [x|=n-1 and |y|= 1 so that [xy| =n-1 +1 = nn, which is true. According to pumping lemma xy‘z € L for all k> 0. If *k’ = 0, the string ‘y’ does not appear, so the string ‘w” has ‘n-1” number of ‘a's followed by “nel” number of ‘b's. ie: w= a"! b™ ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 37 Automata Theory & Compiler Design 210551 Module 2 But according to pumping lemma ‘n-I’number of ‘a’s should be followed by ‘n’ of “b’s, which is a contradiction to the assumption that the language is regular, So the language L~ {wv | ng(w’) < mw) } is not regular. ‘We can prove that L is not regular by taking string w= ab" | n>=0. For solution refer problem1. ie: ij meansi j ori < j; 90 we can take string “w’ ~e"*!'b* or w- ab" Solution is similar to the previous problems. "Show that L= fa" b c"™ | n,m >= 0} is not regular. Let L is regular language and ‘n’ be the number of states in FA. ince L is regular it is closed under homomorphism. So we can take h(a) = a, h(b) = a and h(c) = . Now the language L is reduced to L = {a al c"""| ntm >= 0} ie: L= fa") ntm >= 0} which is in the form L={a'b)|i>=0}, Consider w= a"b" since |w|=n-4n=2n > n, we can split ‘w’ into xyz such that [xy| 1 as a n —_S, w = aaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbb ‘Where [x|=n-l and |y|= 1 so that jxy| =n-1 +1 = 0. If‘ k= |, the string ‘y’ does not appear, so the number of ‘a’s will be less than number of “b's jes w= a"! bY Which is a contradiction to the assumption that the language is regular. So the given language So the language L= a b® | n>= 0} is not regular language L= {a" b ¢*™" | n,m >= 0} is not regular. Let L is regular language and ‘n’ be the number of states in TA. Consider the string w =a" b®, therefore ww =a" b” a" b™ ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 38 Automata Theory & Compiler Design 210551 Module 2 since |w| = ntrtn +n = 4n > n, we can split ‘w" into ‘xyz? such that [xy| 1 as a a a a —— ae ee ——— ws amma ‘bbbbbbbbbbbbbbbtbbb aaaasasaaasagaaaaaaa bbbbbbbbbbbbbbbbbbb a x-# 0 ya z ‘Where [x|=n-l and |y|=1 so that xy] =n-l + nn, which is tue. According to pumping lemma xy*z € L for all k> 0. If “k’ = 0, the string *y’ does not appear, so the number of *a’s on the left of first b will be less than number of a’s after the first b ie: ww=a"" b'a"b?. Which is a contradiction to the assumption that the language is regular. So the language L= {ww | w € (a+b)’} is not regular is not regular language. ‘Show that L= {ww* | w € (a+b)" } is not regular. Let L is regular language and ‘n’ be the number of states in FA. =a boa? Consider the string w =a" b*, therefore ww since |w|=ntnin + n=4n > n, we can split ‘w’ into ‘xyz’ such that |xy| 0. If ‘k’ = 0, the string ‘y’ does not appear, so the number of ‘a's on the left of first b will be less than number of ‘a’s after the first b ie: ww* =a"! b"b" a” Which is a contradiction to the assumption that the language is regular. So the language L= {ww® | w € (a+b)’} is not regular is not regular language. Let L is regular language and ‘n’ be the number of states in FA. Consider the string w= a™ since |w|=n 2n, we can split ‘w" into ‘xyz’ such that |xy| 0. If*k’ =0, the string ‘y’ does not appear, ie: a’ (a) a”? Itis clear that n! > n! —} ie: when j=1, nl >n! But according to pumping lemma, language to be regular, n! = n! ~ 1, which is not true and it is a contradiction. So L can not be regular. Let L is regular language and ‘n’ be the number of states in FA, Consider the string w= 0° since |w|=n > n, we can split ‘w’ into ‘xyz’ such that |xy| las ‘Where |x| =i and |y| =j so that |xy|=i+j1 According to pumping lemma xy'z € L for all k> o. If ‘k’ = 0, the string ‘y’ does not appear, ie: 0' (0) 0°” Itis clear that when, » n¢(n-1)asaprime But according to pumping lemma, language to be regular, when n is prime, n-1 is also prime, which is not true and it is a contradiction, So L ean not be regular. ‘Show that L= {0° | isa perfect square } is uot vegular. Let L is regular language and ‘n’ be the number of states in FA. Consider the string w= 0" since |w| =n? > n, we can split ‘w’ into ‘xyz’ such that [xy| 1 as x=0' y-0i z 0”? is ‘Where [x|~ i and ly =j so that y|=i+j1 According to pumping lemma xy*z € L for all k> o. If ‘k* = 0, the string ‘y* does not appear, ie: 0! (0) 0"? It is clear that when (n° -1 ) is not a perfect square. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 40 Automata Theory & Compiler Design 210551 Module 2 But according to pumping lemma, language to be regular, when n” is a perfect square, n°-1 is also a perfect square, which is not true and it is a contradiction, So L can not be regular. ‘Show that L~ (0° | n is a perfect cube } is not regular. Let L is regular language and ‘n’ be the number of states in FA. Consider the string w = since |w|=n' >n, we can split ‘w" into ‘xyz’ such that [xy| 1 According to pumping lemma xy'z € L for all k> 0, 3 If “k’ = 0, the string “y’ does not appear, ie: 0' (0) 0") = 0" TEL It is clear that when j=1, (n° -1 ) is not a perfect cube. But according to pumping lemma, language to be regular, when n’ is a perfect cube, n’-1 is also aperfect cube, which is not true and it is a contradiction. So L can not be regular. LEXICAL ANALYSIS PHASE OF COMPILER DESIGN Lexical Analysis, Lexical analysis reads characters from left to right and groups into tokens. A simple way to build lexical analyzer is to construct a diagram to illustrate the structure of tokens of the source program, We can also produce a lexical analyzet automatically by specifying the lexeme pattems to a lexical-analyzer generator and compiling those pattems into code that functions as a lexical analyzer. This approach makes it easier to modify a lexical analyzer, since we have only to rewrite the affected patterns, not the entire program, Three general approaches for implementing lexical analyzer are: 1. Use lexical analyzer generator (LEX) from a regular expression based specification that provides routines for reading and buffering the input. 2. Write lexical analyzer in conventional language (C ) using V/O facilities to read input. 3. Write lexical analyzer in assembly language and explicitly manage the reading of input. Note: The speed of lexical analysis is a concern in compiler design, since only this phase reads the source program character-by character. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 41 Automata Theory & Compiler Design 21¢851 Module 2 1 2, Lexical analyzer reads the source program character by character to produce tokens, Normally a lexical analyzer doesn’t return a list of tokens at one shot, it returns a token when the parser asks a token from it. Normally L.A. don’t return a comment as a token. It skips a comment, and return the next token (which is not a comment) to the parser. Correlating error messages: It can associate a line number with each error message. In some compilers it makes a copy of the source program with the error messages inserted at the appropriate positions. If the sou ce program uses a macra-prepracessor, the expansion of macros may be performed by the lexical analyzer. Role of Lexical Analyzer source Lexical token + To semastic analysis oven Analyzer |* get sexisked \ o ymbol Read the input characters of the source program, group them into lexemes and produces output as a sequence of tokens. It interacts with the symbol table. Initially parser calls the lexical analyzer, by means of getNextToken command. In response to this command LA read characters from its input until it can identify the next lexeme and produce a token for that lexeme, which can be returned to parser. It eliminates comments and white space. It displays error messages with line number. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 42 Automata Theory & Compiler Design 210551 Module 2 Lexical Analysis versus Parsing: ‘The main reasons are; 1 P| that had to deal with comments and whitespace as syntactic units would be ty of design: Separation allows us to simplify the task. For example a parser considerably more complex than one that can assume comments and whitespace have already been removed by the LA. Compiler efficiency is improved: A specialized buffering techniques for reading input characters can speed up the compiler significantly. 3. Compiler portability is enhanced: Input device peculiarities can be restricted to the lexical analyzer. ‘Tokens, Patterns and Lexemes: i ii. Token: It describes the class or category of input string. A token is a pair consisting of a token name and an optional attribute value. or example, identifier, keywords, constants are called tokens Pattern: Set of rule that describes the tokens. Itis a description of the form that the lexemes of a token may take. Example: letter [A-Za-z] Lexeme: Sequence of characters in the source program that are matched with the pattern of the token, Example: int, a, num, ans ete Token representation: In many programming languages, the following classes cover most or all of the tokens i. One token for each keyword; The pattern for a keyword is the same as the keyword itself ji, Tokens for the operators, either individuelly or in classes such as the token comparison. iii, One token representing all identifiers. iv. One or more tokens representing constants, such as numbers and literal. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 43 Automata Theory & Compiler Design 210551 Module 2 v. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon. Attributes for tokens A token has only a single attribute that is a pointer to the symbol-table entry in which the information about the token is kept. Example: The token names and associated attribute values for the statement E =M * C ** 2 are written below as a sequence of pairs. id, pointer to symbol-table entry for E~ Lexical errors: 1. Itis hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. Tor instance, if the string fi is encountered for the first time in a C program in the context fi(a==f(x)) A lexical analyzer cannot tell whether fi is 2 misspelling of the keyword if or an undeclared funetion identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must retum the when id tw the parser and let sume oth phase of the compiler probably the paises this case handle an error due to transposition of the letters. 2. Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input, The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate. ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 44 Automata Theory & Compiler Design 210551 Module 2 Possible error-recovery actions are: 1, Delete one character from the remaining input 2. Insert a missing character into the remaining input. 3. Replace a character by another character. 4, Transpose two adjacent characters INPUT BUFFERING Lexical analyzer scans the input string from left to right, one character at a time. It uses two pointers as: > beginLexeme pointer > forward pointer To keep track of the position of the input scanned. Initially both the pointers point to the first character of the input string, 7 | | ‘The forward pointer moves ahead to search for end of lexeme. As soon as the blank space is encountered, it indicates end of lexeme. In above example as soon as forward pointer encounters ablank space, the lexeme is identified be | CC CO EC CG The fp will be moved ahead whi it sees white space. That is when fp encounters white space it ignores and moves ahead. Then both fp and bp is set at next token ‘What is meant by input buffering? To recognize tokens reading data/source program from hard disk is done. Accessing hard disk each time is costly and time consuming so special buffer technique has been developed to reduce the amount of overhead required. This process of reading source program into a buffer is called input buffering. A block of data is first read into a buffer and scanned by lexical analyzer. There are two methods used ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 45 Automata Theory & Compiler Design 210551 Module 2 1. One buffer 2. Two buffer Onc buffer scheme: Here only one buffer is used to store the input string. But the problem with this scheme is that, if a lexeme is very long, then it crosses the buffer boundary. To scan the remaining part of lexeme, the buffer has to be refilled, that makes overwriting of first part of lexeme. Sometimes it may result in loss of data due to the user misinterpretation. Two Buffer scheme: ‘Why two buffer schemes is used in lexical analysis? Explain. Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized two buffering techniques have been developed to reduce the amount of overhead required to process a single input character. Here a buffer (array) divided into two N-character halves, where N = number of characters on one disk block Ex: 4096 bytes — If fewer than N characters remain in the input file , then special character, represented by eof, marks the end of source file and it is different from input character. © One read command is used to read N characters. Two pointers are maintained: be; of the lexeme pointer and forward pointer. * Initially, both pointers point to the first character of the next lexeme. * Using this method we can overcome the problem faced by one buffer scheme, even though the mput 1s lengthier the user knows from where he has to begin in the next buffer, as he can see the contents of previous buffer. Thus there is no scope for loss of any data, Sentinels: In two buffering scheme we must check the forward pointer, each time it is incremented. Thus we make two tests: one for the end of the buffer, and one to determine what character is read, We can combine these two tests. if we use a sentinel character at the end of buffer. ar ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 46 Automata Theory & Compiler Design 21C851 Module 2 Sentinel is a special character inserted at the end of buffer, that cannot be a part of source program; eof is used as sentinel, Look ahead code: Switeh(*forward++) { case eof if (forward is at end of first buffer) { reload second buffer: eginning of second buffer: forward } elseif (forward is at end of second buffer) { reload first buffer: forward = beginning of frst buffer: } else /* eof within a buffer marks the end of input terminate lexical analysis* break: case for the other characters } Operations on Languages Give the formal definitions of operations on languages with notations. In lexical analysis the most important operations on languages are: i. Union ii, Concatenation iii, Star duswe iv. Positive closure ‘These operations are formally defined as follows ‘OPERATION DEFINITION AND Union of Land M___ [LUM ={o| isin Lorik Concatenation of Land M in MY {ot [is in L and tis in MY ‘Kleene closure of L TUR, Positive closure of L Tt up, i Regular Expressions We use regular expressions to describe tokens of a programming language. + A regular expression is built up of simpler regular expressions (using defining rules) ‘ATHMARANJAN K Department of ISE, SIT Mangaluru. Page 47, Automata Theory & Compiler Design 210551 Module 2 + Each regular expression denotes a language. * A language denoted by a regular expression is called as a regular set, © *, concatenation and | has highest to lowest precedence with left associative. * If two regular expression ‘r’ and ‘s’ denote the same language, we say ‘r’ and ‘s’ are equivalent and write ms ‘Write the definition of regular expression Regular expression over alphabet 5 can be defined as follows: &- is a regular expression for the set containing empty string. ais a regular expression, if a belongs to ¥: that is {a} set containing the string a. Suppose r and s are regular expression denoting the languages L(r) and L(s) then, (rs) is a regular expression denoting L(r) U L(s). rs is a regular expression denoting L(r)L(s) (0)* is a regular expression denoting (L(1))* The following table shows some of the regular expressions along with their possible regular sets: Regular expression Set alb fa, b} {(alb)(alb) o aajab|bajbb faa, ab, ba, bb} at {e,a, aa, aaa,...} (alb)* or (a*b*)* fe, a,aa,b,bb,...} ala*b {a, b, ab,aab,aaab,...} Algebraic properties 11s =s\t r((sit)= (ris) (rs)t=n(st) (sit) =rs)rt (sit)r = srltr er re r= (re) t= ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 48 Automata Theory & Compiler Design 21C851 Module 2 [abe] denotes the regular expression albje [a-z] denotes the regular expression alb|......!2 Regular definition: * To write regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions. + Weccan give names to regular expressions, and we can use these names as symbols to define other regular expressions. Define regular definition aud write the regular definition for C identifier. co A regular definition is a sequence of the definitions of the form: di > rl where Gyis a distinct name and Ip is a regular expression over symbols in DU { di. dads dua} a2a- 2 NN . . basic symbols previously defined names dsm Regular definition for *C’ identifier: letter + A|B |e digit > 0] 1]....19 id — letter (letter | digit )° OR leter > [A-Za-z] digit > [0-9] id > letter (letter | digit )” ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 49 Automata Theory & Compiler Design 21C851 Module 2 digit > digits > optionalFraction —> optionalExponent —> number — OR digit > [0-9] digits + digit” umber > Recognition of Tokens O/1...19 (digit )° - digits |e (E(+|-[e) digits) |e digits optionalFraction optionalExponent digits (digits)? (F [+-]? digits )? Our current goal is to perform the lexical analysis needed for the following grammar. stmt — if expr then stmt if expr then stmt else stmt e expr — term relop term // relop is relational operator =. >. ete tem term — id number Recall that the terminals are the tokens the nanterminals produce termi A regular definition for the terminals is digit — [0-9] digits — digits” number — digits (. digits)? (E[+-]? digits)? letter —> [A-Za-2] id — letter (letter | digit if if then — then else — else relop > <| ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 50 Automata Theory & Compiler Design 21CS51 Module 2 Lexeme Token Attribute Whitespace ws if if _ then then = — else else — An identifier id Pointer to table entry Amumber — number Pointer to table entry < relop LT « relop Lr = relop EQ => relop NE > relop GT relop GE We also want the lexer to remove whitespace so we define a new token ws — ( blank | tab | newline ) + where blank, tab, and newline are symbols used to represent the corresponding ascii characters, Recall that the lexer will be called by the parser when the latter needs a new token. If the lexer then recognizes the token ws, it does nor retum it to the parser but instead goes on to recognize the next token, which is then retumed. Note that you can't have two consecutive ws tokens in the input because, for a given token, the lexer will match the Iongest lexeme starting at the current position that yields this token. The table on the right summarizes the situation For the parser, all the relational ops are to be treated the same so they are all the same token, relop. Naturally, other parts of the compiler, for example the code generator, will need to distinguish between the various relational ops so that appropriate code is generated. Hence, they have distinct attribute values, ‘ATHMARANJAN K Department of ISE, SIT Mangaluru Page 51 Automata Theory & Compiler Design 21851. Specification of Token To specify tokens Regular Expressions are used. Recogniti n of Token: To recognize tokens there are 2 steps 1. Design of Transition Diagram 2. Implementation of Transition Diagram ‘Transition Diagrams ‘A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each possible token. It shows the decisi ns that must be made based on the input seen, The two main components are circles representing states (think of them as decision points of the lexer) and arrows representing edges (think of them as the decisions made). Itis fairly clear how to write code corresponding to this diagram. You look at the first character, if it is <, you look at the next character. If that character is =, you return (relop, LE) to the parser. If instead that character is >, you return (relop, NE). If it is another character, return (relop, LT) and adjust the input buffer so that you will read this character again since you have not used it for the current lexeme. If the first character was =, you retum (relop, EQ), i. relop (relational operator) Identifier and keyword iii, Unsigned number iv. Integer constant vw Whitespace i, ‘Transition diagram for relop: @)returnteelop, Ge) [ @igeren ‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 52 Automata Theory & Compiler Design 21851. Transition diagram for identifier and keyword: letter or digit iv. Integer constant: start tor ve Whitespace: Whitespace characters are represented by delimiter, where delim includes the characters like blank, tab, new line and other characters that are not considered by the language design to be part of any token, ‘ATHMARAN Page 53 Automata Theory & Compiler Design 210851 Recognition of reserved words and : letter or digit \ raturn(getToken(), installi() ) ‘There are two ways we can handle reserved words that look like identifiers: 1. Install the reserved words in the symbol table initially: When we find an identifier, a call to installID( ) function places that identifier into the symbol table if it is not already there and returns a pointer to the symbol table entry. The function getToken( ) examines the symbol table for the lexeme found, and returns token name as either id or one of the keyword token that was initially installed in the table. 2. Create separate transition diagrams for each keyword Architecture of a transition diagram based lexical analyzer The idea is that we write a piece of code for each decision diagram. This piece of code contains a case for each state, which typically reads a character and then goes to the next case depending on the character read. nextehari) 1s used to read a next char from the input butfer. Ihe numbers in the circles are the names of the cases. Accepting states often need to take some action and return to the parser. Many of these accepting states (the ones with stars) need to restore one character of input. This is called retract() in the code. What should the code for a particular diagram do if at one state the character read is not one of those for which a next state has been defined? That is, what if the character read is not the label of any of the outgoing arcs? This means that we have failed to find the token corresponding to this diagram. The code calls fail(, is not an error case. It simply means that the current input does not match this particular token. So we need to go to the code section for another diagram after restoring the ‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 54 Automata Theory & Compiler Design 21851. input pointer so that we start the next diagram at the point where this failing diagram started. If wwe have tried all the diagram, then we have a real failure and need to print an error message and perhaps try to repair the input Transition diagram: stare return( relop, LE) othe returntretep, tay @returntrelop. ert —+@)returnivaton, oe) — a ae Coding part: TOKEN getRelop( ) { JUKEN retloken = new(KELUP); while(1) ( * repeat character processing until a return or failure occurs */ Switch (state) =nextChar( ); if(e ‘<*) state = 1; else ‘ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 55, Automata Theory & Compiler Design 210851 else fail(); * Iexeme is not a relational operator...other */ break; cease 1: ¢ = nextChar( ); 4, else fail(); /* lexeme is not a relational operator...other */ break; case 8: retract( ); retToken.attribute = GT; return(retToken); ATHMARANJAN K, DEPARTMENT OF ISE, SIT MANGALURU. Page 56

You might also like