Compilers and Execution
Environments (ID2202)
Fall 2020
Lecture 2: Lexical Analysis
David Broman
Associate Professor, KTH Royal Institute of Technology
Associate Director Operations, Digital Futures
David Broman
Part I
Lexical Analysis
Part II Part III
Regular Expressions Deterministic Finite Automata
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Part I
Lexical Analysis
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Lexical Analysis / Scanning
Line comment: Recognize “//“,
int
ignore until new line Identifier token, IDENT
foo
( “foo” is a specific
lexeme
// A function int
int foo(int x){ x
Lexical Analyzer /
return x + 1; Scanner )
} {
Source Text Tokens return Keyword, RETURN
x
Ignores white space +
and comments. 1 Integer Literal / Constant
Space 0x20, Tab 0x09, Lexical Error ; INT
Line Feed 0xa, Carriage }
Return 0xd Specific token
CR LF Windows/DOS RCURLY
LF Unix, MacOS X, Linux
CR Classic Mac, C64
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Lexical Analysis / Scanning
Exercise: which lexical errors would Not a lexical error. Would be
a C/C++ compiler detect? detected by the parser
void foo(int x){
while(x < 10]{
if(x == 7) // Works?
x = x + 2;
braek;
x = x + 1 Not a lexical error. “braek”
is an identifier.
}
return "Hello\z";
}
How do we implement scanning?
Lexical warning. Unknown
1. By hand, programmatically longest match
escape sequence ‘\z’
2. Scanner tool
Specify => regular expressions (regex) NOTE: most of the errors in this example will
Implement => deterministic finate automata (DFA) be parsing errors.
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Languages
Language
A set of strings
String
A finite sequence of symbols
Symbol
An element in a finite set (the alphabet)
Example 1: set of strings representing an integer
Finite or infinite set? Answer: infinite
Example 2: set of strings forming an identifier, which
represents a keyword in the C language
Finite or infinite set? Answer: finite
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Chomsky Hierarchy
Grammars Languages Accepting Machines
Type 0 Recursively Turing machine
Unrestricted grammars enumerable
Type 1 Context- Linear-bounded
Context-sensitive sensitive automata
grammars
Used in parsing
Type 2 Context-free Pushdown
Context-free automata
grammars
Part of this
course
Type 3 Regular Deterministic finite Used in lexing / scanning
Regular grammars language automaton
Nondeterministic finite
automaton
For details, see for instance “Languages and Machines By Sudkamp, 2006
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Part II
Regular Expressions
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Regular Expressions
Regular expression
A way to specify a (possibly infinite) set of strings
Example: Alphabet {a, b, c}
Symbol a
A language just containing a
Alternation M|N
M | N is a new regular expression,
Regex a|b represents the
where M and N are regular expressions
language {“a”, “b”}
Concatenation M⋅N
M⋅N is a new regular expression, forming the
Regex (a|b)⋅c represents the
concatenation of M and N, where M and N are
language {“ac”, “bc”}
regular expressions
Epsilon ε Regex (a⋅c)|ε represents the
Regular expression ε is the language of an language {“ac”, “”}
empty string
Repetition M* Regex ((a⋅b)|c)* represents the
* is called the Kleene closure of M. M* forms a language {“abc”, “abc”, “abcc”,
regular expression, representing the “abccc”, “abccabab …}
concatenation of zero or more M Structure from Apple (1998)
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Regular Expressions, Convensions
Conventions
Symbol a
A language just containing a
May omit ⋅ or ε
Alternation M|N
ab is the same as a⋅b
M | N is a new regular expression,
where M and N are regular expressions a| is the same as a|ε
Concatenation M⋅N
Kleene closure binds tighter
M⋅N is a new regular expression, forming the
concatenation of M and N, where M and N are ab* is the same as a(b)*
regular expressions
Epsilon ε Concatenations binds tighter than
Regular expression ε is the language of an alternation
empty string ab|c is the same as (ab)|c
Repetition M*
* is called the Kleene closure of M. M* forms a
regular expression, representing the
concatenation of zero or more M
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Regular Expressions, Abbreviations
Abbreviations
Symbol a
A language just containing a
Alternation abbreviation
Alternation M|N [abc] is the same as a|b|c
M | N is a new regular expression,
[b-h] is the same as [bcdefgh]
where M and N are regular expressions
[b-d01A-C] is the same as [bcd01ABC]
Concatenation M⋅N
M⋅N is a new regular expression, forming the
concatenation of M and N, where M and N are Optional and repetition
regular expressions M? with the meaning (M| ε)
Epsilon ε M+ with the meaning (M⋅M*)
Regular expression ε is the language of an
empty string Any characters represented by a
period .
Repetition M*
* is called the Kleene closure of M. M* forms a
regular expression, representing the
concatenation of zero or more M
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Regular Expressions, Abbreviations
Symbol a Alternation abbreviation
Alternation M|N Optional and repetition
[abc] is the same as a|b|c
Concatenation M⋅N M? with the meaning (M| ε)
[b-h] is the same as [bcdefgh]
Epsilon ε M+ with the meaning (M⋅M*)
[b-d01A-C] is the same as [bcd01ABC]
Repetition M*
Unsigned integer [0-9]+
Keyword while “while”
Unsigned integer,
avoiding octal number 0|[1-9][0-9]* Keyword if “if”
Hexadecimal number,
0[xX][0-9a-fA-F]+ What about tokenizing the string “ifwhile”?
“0x01a” “0XfF31”
Typically, match identifier,
Identifiers, e.g. “foo”, check if keyword
“Bar”, “_foo12”. It is not [_a-zA-Z][_a-zA-Z0-9]*
allowed to start with a What about matching operators + and ++
digit. Two rules to disambiguate: longest
match and rule priority
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Part III
Deterministic Finite Automata
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Deterministic Finite Automata (DFA)
1 Finite Automata
A DFA is a machine M, represented as is a 5-tuple:
The language recognized by M is the
set of strings that M accepts. It is
ata M = (S, ⌃, , s0 , F ) written as L(M).
1 Finite Automata
Set of final states
What is the Set of states
L(M )
M = (S, ⌃, , s0 , F )
formalization of the :Q⇥⌃!Q Start state
DFA? Transition
Finite set of L(M )
function Note: deterministic
input symbols because only one
S = {1, 2, 3, 4} (the alphabet) : S ⇥ ⌃ ! S(1) possible output
⌃ = {r, f, o} (2)S= {1,o2, 3, 4} r
f
= {((1, f), 2), ((2, o), 3), ((3, r), 4)} (3)
⌃= {r, f, o}
S = {1, 2, 3, 4}
s0 = 1 (4)
1 2 3 4
F = {4} (5) = {((1,
⌃ = {r, f),f,2),
o}((2, o), 3), (
Part I Part II
s0 = 1 Part=III {((1, f), 2), ((2, o), 3),
Lexical Analysis Regular Expressions sDeterministic
0 = 1 Finite Automata
David Broman
⇥⌃!Q
Deterministic Finite Automata (DFA)
S = {1, 2, 3, 4} (1)
⌃ = {r, f, o} (2)
= {((1, f),
Exercise: Write ((2,ao),
2),down 3), ((3,
regular r), 4)}for
expression (3)
s0a =
lower
1 case identifier (can include an (4) a-z
underscore as first character)
F = {4} (5)
[_a-z][a-z0-9]* _
1 2
S = {1, 2} (6)
⌃ = { , a, b, . . . , z, 0, 1, . . . , 9} (7)
a-z
= {((1, a), 2), . . . , (2, 0), 2), . . . , } (8)
s0 = 1 (9) 0-9
F = {2} (10)
Abbreviation. Represents one
transition line for each character
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Deterministic Finite Automata (DFA)
Task 1. Suppose your alphabet contains the
characters
Task 2. Write out
N, a, v, /
the DFA
We assume that N represents a new line
a
Write down the regular expression that matches a
line comment, starting with “//“
//[av/]*N / /
N
Task 3. Given the DFA M to the
right, which of the the following 1 2 3 4
strings is a string in L(M)?
1 “aavaN” Is rejected by M
2. “//avvN” Is accepted by M v /
This is how you would write it in
"//" [^'\n']* '\n'
ocamlflex. ^'\n' means not a new line.
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Reading Guidelines - Module 1
• Compiler overview: C&T Chapters 1.1 - 1.4
• Lexing and scanning: C&T Chapters 2.1 - 2.5
• Parsing: C&T Chapters 3.1 - 3.5. Focus on 3.2 and 3.3.
C&T = the course book by Cooper and Torczon
Reading Guidelines
See the course webpage
for more information.
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Soon time for coffee…
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman
Conclusions
Some key take away points:
• Lexical analysis (scanning) is the first phase in the
front end. The output is a stream of tokens
• Regular expressions can be used when specifying the
scanner.
• A deterministic finite automata (DFA) may be used
when implementing the lexical analysis.
Thanks for listening!
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata