Chapter-2 Compiler Design
Chapter-2 Compiler Design
Chapter – two
Lexical analysis
1
Outline
Introduction
Interaction of Lexical Analyzer with Parser
Token, pattern, lexeme
Specification of patterns using regular expressions
Regular expressions
Regular expressions for tokens
2
Introduction
❑ The role of lexical analyzer is:
• to read a sequence of characters from the source
program
• group them into lexemes and
• produce as output a sequence of tokens for each
lexeme in the source program.
3
Interaction of the Lexical Analyzer
with the Parser
Source
Program
symbol
table
(Contains a record
for each identifier)
5
Token, pattern, lexeme…
Example: The following table shows some tokens and
their lexemes in Pascal (a high level, case insensitive
programming language)
Token Some lexemes pattern
begin Begin, Begin, BEGIN, Begin in small or capital
beGin… letters
if If, IF, iF, If If in small or capital letters
ident Distance, F1, x, Dist1,… Letters followed by zero or
more letters and/or digits
Regular expressions
Regular expressions for tokens
7
Regular expression: Definitions
10
Regular expression: Language Operations
Union of L and M
L ∪ M = {s |s ∈ L or s ∈ M}
Concatenation of L and M
LM = {xy | x ∈ L and y ∈ M}
Exponentiation of L
L0 = {ε}; Li = Li-1L The following shorthands
are often used:
Kleene closure of L
L* = ∪i=0,…,∞ Li r+ =rr*
r* = r+| ε
Positive closure of L
r? =r|ε
L+ = ∪i=1,…,∞ Li
11
RE’s: Examples
L(01) = ?
L(01|0) = ?
L(0(1|0)) = ?
Note order of precedence of operators.
L(0*) = ?
L((0|10)*(ε|1)) = ?
12
RE’s: Examples
L(01) = {01}.
L(01|0) = {01, 0}.
L(0(1|0)) = {01, 00}.
Note order of precedence of operators.
13
RE’s: Examples (more)
1- a|b=?
2- (a|b)a = ?
3- (ab) | ε = ?
4- ((a|b)a)* = ?
Reverse
1 – Even binary numbers =?
2 – An alphabet consisting of just three alphabetic
characters: Σ = {a, b, c}. Consider the set of all strings
over this alphabet that contains exactly one b.
14
RE’s: Examples (more)
1- a | b = {a,b}
2- (a|b)a = {aa,ba}
3- (ab) | ε ={ab, ε}
4- ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}
Reverse
1 – Even binary numbers (0|1)*0
2 – An alphabet consisting of just three alphabetic
characters: Σ = {a, b, c}. Consider the set of all strings
over this alphabet that contains exactly one b.
(a | c)*b(a|c)* {b, abc, abaca, baaaac, ccbaca, cccccb}
15
Exercises
Describe the languages denoted by the following
regular expressions:
1- a(a|b)*a
2- ((ε|a)b*)*
3- (a|b)*a(a|b)(a|b)
4- a*ba*ba*ba*
16
Regular Expressions (Summary)
Definition: A regular expression is a string over
∑ if the following conditions hold:
1. ε, Ø, and a Є ∑ are regular expressions
2. If α and β are regular expressions, so is αβ
3. If α and β are regular expressions, so is α+β
4. If α is a regular expression, so is α*
5. Nothing else is a regular expression if it doesn’t
follow from (1) to (4)
Let α be a regular expression, the language
represented by α is denoted by L(α).
17
Regular expressions for tokens
21
Example: Divide the following Java program into
appropriate tokens.
public class Dog {
private String name;
private String color;
22
Automata
Abstract machines
Characteristics
Input: input values (from an input alphabet ∑) are applied
to the machine
23
Automata: cont’d
Types of automata
Finite State Automata (FSA)
• Deterministic FSA (DFSA)
• Nondeterministic FSA (NFSA)
24
Finite Automata
25
Finite-state Automata…
state
a b c a = { a, b, c }
0 1 2 3 4
final state
start state transition
Input
• Representation State a b c
– An FSA may also be
represented with a 0 1
state-transition table. 1 2
The table for the 2 3
above FSA:
3 4
4 26
Design of a Lexical Analyzer/Scanner
Finite Automata
❑ Lex – turns its input program into lexical analyzer.
❑ Finite automata are recognizers; they simply say "yes"
or "no" about each possible input string.
❑ Finite automata come in two flavors:
27
The Whole Scanner Generator Process
Overview
❑ Direct construction of Nondeterministic finite
Automation (NFA) to recognize a given regular
expression.
❑ Easy to build in an algorithmic way
❑ Requires ε-transitions to combine regular sub expressions
❑ Construct a deterministic finite automation
(DFA) to simulate the NFA Optional
❑ Use a set-of-state construction
❑ Minimize the number of states in the DFA
❑ Generate the scanner code.
28
Design of a Lexical Analyzer …
Token ➔ Pattern
Pattern ➔ Regular Expression
Regular Expression ➔ NFA
NFA ➔ DFA
DFA’s or NFA’s for all tokens ➔ Lexical Analyzer
29
Non-Deterministic Finite Automata
(NFA)
Definition
An NFA M consists of five tuples: ( Σ,S, T, S0, F)
A set of input symbols Σ, the input alphabet
a finite set of states S,
a transition function T: S × (Σ U { ε}) -> S (next state),
a start state S0 from S, and
a set of accepting/final states F from S.
The language accepted by M, written L(M), is defined as:
The set of strings of characters c1c2...cn with each ci from
Σ U { ε} such that there exist states s1 in T(s0,c1), s2 in
T(s1,c2), ... , sn in T(sn-1,cn) with sn an element of F.
30
NFA…
It is a finite automata which has choice of
edges
• The same symbol can label edges from one state to
several different states.
An edge may be labeled by ε, the empty
string
• We can have transitions without any input
character consumption.
31
Transition Graph
The transition graph for an NFA recognizing the
language of regular expression (a|b)*abb
all strings of a's and b's ending in the
particular string abb
a
start a b b
0 1 2 3
b S={0,1,2,3}
Σ={a,b}
S0=0
F={3}
32
Transition Table
The mapping T of an NFA can be represented
in a transition table
State Input Input Input
a b ε
0 {0,1} {0} ø
a a b b
0 0 1 2 3 YES
a a b b
0 0 0 0 0 NO
34
Another NFA
a
a
start
b
b
aa*|bb*
35
Deterministic Finite Automata (DFA)
36
DFSA: Example
a S = {S, A, B, C, D}
S a A
∑ = {a, b}
b a So = S
b
F = {C, D}
B C b D
a
b
State Input Next state Check whether the following
S a A
S b B strings are accepted or not:
A a A • ab
A b C • ba
B b B • bbaba
B a C
C b D • aa
D a D • aaabbaaa
37
DFA example
A DFA that accepts (a|b)*abb
38
Simulating a DFA: Algorithm
How to apply a DFA to a string.
INPUT:
An input string x terminated by an end-of-file character
eof.
A DFA D with start state So, accepting states F, and
transition function move.
OUTPUT: Answer ''yes" if D accepts x; "no" otherwise
METHOD
Apply the algorithm in (next slide) to the input string x.
The function move(s, c) gives the state to which there is
an edge from state s on input c.
The function nextChar() returns the next character of
the input string x.
39
Simulating a DFA
s = so;
c = nextchar();
while ( c != eof ) {
s = move(s, c);
c = nextchar();
}
if ( s is in F ) return
"yes";
DFA accepting (a|b)*abb
else return "no";
41
Why do we study RE,NFA,DFA?
Goal: To scan the given source program
Process:
Start with Regular Expression (RE)
Build a DFA
• How?
– We can build a non-deterministic finite automaton, NFA
(Thompson's construction)
– Convert that to a deterministic one, DFA
(Subset construction)
– Minimize the DFA (optional)
(different algorithms)
Implement it
Existing scanner generator: Lex/Flex
42
RE→NFA→DFA→ Minimize DFA states
Step 1: Come up with a Regular Expression
(a|b)*ab
Step 2: Use Thompson's construction to create
an NFA for that expression
43
RE→NFA→DFA→ Minimize DFA states
Step 1: Come up with a Regular Expression
(a|b)*ab
Step 2: Use Thompson's construction to create
an NFA for that expression
44
RE→NFA→DFA→ Minimize DFA states
Step 3: Use subset construction to convert the NFA to a DFA
45
Design of a Lexical Analyzer Generator
Two algorithms:
1- Translate a regular expression into an NFA
(Thompson’s construction)
46
From regular expression to an NFA
It is known as Thompson’s construction.
Rules:
1- For an ε, a regular expressions, construct:
start a
47
From regular expression to an NFA…
2- For a composition of regular expression:
Case 1: Alternation: regular expression(s|r), assume
that NFAs equivalent to r and s have been
constructed.
48
48
From regular expression to an NFA…
Case 2: Concatenation: regular expression sr
ε
…r …s
Case 3: Repetition r*
49
From RE to NFA:Exercises
50
From an NFA to a DFA
(subset construction algorithm)
Rules:
Start state of D is assumed to be unmarked.
Start state of D is = ε-closer (S0),
where S0 -start state of N.
51
NFA to a DFA…
ε- closure
ε-closure (S’) – is a set of states with the following
characteristics:
1- S’ € ε-closure(S’) itself
2- if t € ε-closure (S’) and if there is an edge labeled
ε from t to v, then v € ε-closure (S’)
3- Repeat step 2 until no more states can be added
to ε-closure (S’).
E.g: for NFA of (a|b)*abb
ε-closure (0)= {0, 1, 2, 4, 7}
ε-closure (1)= {1, 2, 4}
52
NFA to a DFA…
Algorithm
While there is unmarked state
X = { s0, s1, s2,..., sn} of D do
Begin
Mark X
For each input symbol ‘a’ do
Begin
Let T be the set of states to which there is a transition ‘a’ from state si
in X.
Y= ε-Closer (T)
If Y has not been added to the set of states of D then {
Mark Y an “Unmarked” state of D add a transition from X to Y labeled a
if not already presented
}
End
End 53
NFA for identifier: letter(letter|digit)*
ε
letter
3 4
ε ε
start
letter ε ε
0 1 2 7 8
digit ε
ε 5 6
54
NFA to a DFA…
Example: Convert the following NFA into the corresponding
DFA. letter (letter|digit)*
A={0}
B={1,2,3,5,8}
start letter C={4,7,2,3,5,8}
A B
D={6,7,8,2,3,5}
letter digit
letter
digit D digit
C
letter
55
Exercise: convert NFA of (a|b)*abb in to DFA.
56
Other Algorithms
57
The Lexical- Analyzer Generator: Lex
The first phase in a compiler is, it reads the
input source and converts strings in the source
to tokens.
Lex: generates a scanner (lexical analyzer or
lexer) given a specification of the tokens using
REs.
The input notation for the Lex tool is referred to as
the Lex language and
The tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a
transition diagram and generates code, in a file
called lex.yy.c, that simulates this transition diagram.
58
Lex…
59
General Compiler Infra-structure
Parse tree
Program source Tokens Parser
Scanner Semantic
(tokenizer) Routines
(stream of
characters) Annotated/decorated
tree
Analysis/
Transformations/
Symbol and optimizations
literal Tables
IR: Intermediate
Representation
Code
Generator
Assembly code
60
Scanner, Parser, Lex and Yacc
6161
Generating a Lexical Analyzer using Lex
Lex is a scanner generator ----- it takes lexical specification as
input, and produces a lexical analyzer written in C.
Lex source
program Lex compiler lex.yy.c
lex.l
lex.yy.c
gcc compiler a.out
Lexical Analyzer
62
Lex specification
➢ Program structure C declarations in %{
...declaration section... %}
%%
P1 { action1 }
...rule section... P2 { action2 }
%%
...user defined functions...
Rules section – regular expression <--> action.
• The actions are C program.
Declaration section – variables, constants
63
The rules section
%%
[RULES SECTION]
64
Design of a Lexical Analyzer Generator:
RE to NFA to DFA
Thompson’s
construction
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅
68
Combining and simulation of NFAs of a Set of
Regular Expressions: Example 2
start a
a {action1} 1 2
start b
abb {action2} a b
3 4 5 6
a*b+ {action3}
start a
When two or more b
accepting states are 7 b 8
reached, the first action
is executed a Action 1
ε 1 2
start b
a b b b
0 ε 3 a 4 5 6
0 2 5 6
1 4 8 8 ε b Action 2
a 7 b
3 7 8
7 None a Action 3
Action 2
Action 3 69
DFA's for Lexical Analyzers
NFA DFA. Transition table for DFA
State a b Token
found
0137 247 8 None
247 7 58 a
8 - 8 a*b+
7 7 8 None
58 - 68 a*b+
68 - 8 abb
71
Pattern matching examples
72
Meta-characters
73
Lex Regular Expression: Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*”.”[0-9]+
74
Regular Expression: Examples…
• a delimiter for an English sentence
“.” | “?” | ! OR
[“.””?”!]
• C++ comment: // call foo() here!!
“//”.*
•white space
[ \t]+
• English sentence: Look at this!
([ \t]+|[a-zA-Z]+)+(“.”|”?”|!)
75
Two Rules
76
Lex variables
yyin - of the type FILE*. This points to the current file
being scanned by the lexer.
yyout - Of the type FILE*. This points to the location
where the output of the lexer will be written.
• By default, both yyin and yyout point to standard input
and output.
yytext – variable, a pointer to the matched strings (char
*)
yyleng - Gives the length of the matched pattern.
yylineno - Provides current line number information.
77
Lex functions
78
Lex predefined variables
79
Let us run a lex program
80
Lex : programs
The first example is the shortest possible lex file:
%%
Input is copied to output, one character at a time.
The first %% is always required, as there must
always be a rules section.
However, if we don’t specify any rules, then the
default action is to match everything and copy it to
output.
Defaults for input and output are stdin and stdout,
respectively.
Here is the same example, with defaults explicitly
coded:
81
Rule %%
section /* match everything except newline */
. ECHO;
/* match newline */
\n ECHO;
%%
int yywrap(void) { Invokes the
return 1; Lexical
analyzer
}
int main(void) {
User yylex();
definition return 0;
section
}
82
Developing Lexical analyzer using
Lex : Linux (Fedora)
vi – used to edit lex and yacc source files.
w – save
q – quit
w filename – save as
wq – save and quit
q! – exit overriding change
85
How to compile and run LEX programs...
4. Press esc
5. Press :wq
6. lex lab1.l
7. gcc lex.yy.c -ll
8. ./a.out <hello.c
86
Examples (more) Regular
definitions
%% %{
/*Match every thing #include <stdio.h>
except new line*/ %}
digit [0-9]
. ECHO;
letter [A-Za-z]
/*Match new line*/ id {letter}({letter}|{digit})*
\n ECHO; %%
%% {digit}+ { printf(“number: %s\n”, yytext); }
int yywrap(void) { {id} { printf(“ident: %s\n”, yytext); }
. { printf(“other: %s\n”, yytext); }
return 1; %%
} main()
int main(void) { {
yylex(); yylex();
} Translation
retrun 0; rules
}
87
Example :Finding the number of identifier in a given
program
digit [0-9]
letter [A-Za-z]
%{
int count=0;
%}
%%
{letter}({letter}/{digit})* count++;
%%
int main(void) {
yylex();
printf(“The number of identifiers are=%4d\n”,count);
return 0; }
88
Example: Here is a scanner that counts the number of
characters, words, and lines in a file.
%{
int nchar, nword, nline;
%}
%%
\n { nline++;}
[^ \t\n]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void) {
yylex();
printf("%d\t%d\t%d\n", nchar, nword, nline);
return 0;
}
89
%{ /* definitions of manifest constants */
LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER,
Regular definitions
RELOP */
%}
delim [ \t\n]
ws {delim}+ Return token to parser
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {/*no action and no return*/ }
if {return IF;} Token attribute
then {return THEN;}
else {return ELSE;}
{id} {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
“<“ {yylval = LT; return RELOP;}
“<=“ {yylval = LE; return RELOP;}
“=“ {yylval = EQ; return RELOP;}
Install yytext as identifier
“<>“ {yylval = NE; return RELOP;} in symbol table
“>“ {yylval = GT; return RELOP;}
“>=“ {yylval = GE; return RELOP;}
%%
int install_id() {}
int install_num() {} 90
Assignment on Lexical Analyzer
91
Individual
92
The MINI Language Introduction
Assumptions:
Source code – MINI language
Target code – Assembly language
Specifications:
➢ There are no procedures and declarations.
➢ All variables are integer variables, and variables are
declared simply by assigning values to them.
➢ There are only two control statements:
✓ An if – statement and
✓ A repeat statement
➢ Both the control statements may themselves
contain statement sequences.
93
The MINI Language Introduction...
➢ An if – statement has an optional else part and must
be terminated by the key word end.
➢ There are also read and write statements that
perform input/output.
➢ Comments are allowed with curly brackets,
comments cannot be nested.
➢ Expression in MINI are also limited to Boolean and
integer arithmetic expressions.
➢ A Boolean expressions consists of a comparison of
two arithmetic expressions using either of the two
comparison operators < and =.
94
The MINI Language...
➢ An arithmetic expression may involve integer constants,
variables, parenthesis, and any of the four integer
operators +, -, *, and / (integer division).
➢ Boolean expressions may appear only as tests in
control statements – i.e. There are no Boolean
variables, assignment, or I/O.
➢ Here is a sample program in this language for factorial
function.
95
{ sample program
in MINI language – computes factorials
}
read x; { input an integer }
if x > 0 then { don’t compute if x<= 0}
fact:= 1;
repeat
fact := fact * x ;
X:= x-1
until x = 0;
write fact { output factorial of x}
end
96
The MINI Language...
In addition to the tokens, MINI has the following
lexical conventions:
➢ Comments : are enclosed in curly brackets {...} and
cannot be nested.
➢ White space : consists of blanks, tabs, and
newlines.
➢ The principles of longest substring is followed in
recognizing tokens.
97
Design a scanner for MINI language
98