Compiler Construction
COMP 451
Introduction
1
Today’s Adenda
§ A basic intro in which we will see
— Types of compiler
— Major parts of compilation process
— Other programs that work with compiler to execute a program
§ Structure / phases of a compiler
§ A brief overview of all the phases of compiler.
2
Compiler
§ Compiler is a program that reads a program written in one language and
translates it into an equivalent program into another language.
§ A compiler also reports errors present in the source program as part of
translation process.
3
Compilation Process
Major Parts of the Compilation Process
§ Analysis (Front End)
— Takes the input source program and breaks it into parts.
— Creates an intermediate representation of the source program.
— Generates symbol table.
§ Synthesis (Back End)
— Takes the intermediate representation as input.
— Creates desired target program.
§ Also called passes of a compilation process.
4
Phases of a Compiler
5
Phases of a Compiler
Lexical Analyzer
§ It reads the program and converts it into tokens.
§ Tokens are defined by regular expressions which are understood by the
lexical analyzer.
Syntax Analyzer
§ Also called parser.
§ It takes all the tokens one by one and uses Context Free Grammar to
construct the parse tree.
Semantic Analyzer
§ Verifies the parse tree, whether it’s meaningful or not.
§ Produces a verified/corrected/annotated parse tree.
§ It also performs type checking.
Intermediate Code Generator
§ Last phase of front end.
§ An intermediate code is one which can be readily executed by machine.
§ Three Address Code is a popular example of IC.
6
Phases of a Compiler
Intermediate Code Generator …
§ Till intermediate code, it is same for every compiler out there, but
after that, it depends on the platform.
§ To build a new compiler we don’t need to build it from scratch.
§ We can take the intermediate code from the already existing compiler
and build the last two parts.
7
Phases of a Compiler
Code Optimizer
§ It transforms the code so that it consumes fewer resources and produces
more speed.
§ The meaning of the code being transformed is not altered.
Target Code Generator
§ This is the final stage of compilation.
§ The main purpose of Target Code generator is to write a code that the
machine can understand.
§ The output is dependent on the type of assembler.
8
Symbol Table
§ It stores information
— about scope
— about instances of various entities such as variable and function
names, classes, objects, etc.
§ It is built in lexical and syntax analysis phases.
§ The information is collected by the analysis phases of compiler and is
used by synthesis phases of compiler to generate code.
§ It is used by compiler to achieve compile time efficiency.
§ Each entry in symbol table is associated with attributes that support
compiler in different phases.
§ Items stored in symbol table:
— Variable names and constants
— Procedure and function names
— Literal constants and strings
— Compiler generated temporaries
— Labels in source languages
9
Types of Compilers
§ Single Pass Compiler
§ Multi Pass Compiler
10
Single Pass Compiler
Single pass compiler
• passes through the part of
each compilation unit exactly
once.
• faster and smaller than the
multi pass compiler.
• less efficient in comparison
with multi-pass compiler.
• processes the input exactly
once, so going directly from
lexical analysis to code
generator, and then going
back for the next read.
• can’t backup and re-process,
so grammar should be
simplified.
• Examples may include
command interpreters such
as bash/sh/tcsh
11
Two Pass Compiler
Two pass compiler
• processes the source code of a
program multiple times.
• In multi-pass compiler, we
divide phases in two passes.
• First pass is called front end,
and is platform independent.
• Second pass is called back end
and is platform dependent.
12
Front End / Back End Division
§ As discussed earlier, modern compilers contain two (large) parts, each of
which is often subdivided further.
§ These two parts are the front end, and the back end.
§ The front end
— analyzes the source program,
— determines its constituent parts, and
— constructs an intermediate representation of the program.
§ Typically the front end is independent of the target language.
§ The back end
— synthesizes the target program from the intermediate representation
produced by the front end.
§ Typically the back end is independent of the source language.
13
Front End / Back End Division
§ This front/back division very much reduces the work for a compiling
system that can handle several (N) source languages and several (M)
target languages.
§ Instead of NM compilers, we need N front ends and M back ends.
§ For gcc (originally abbreviating "Gnu C Compiler", but now abbreviating
"Gnu Compiler Collection"), N=7 and M~30 so the savings are
considerable.
14
Front End / Back End Division
• Compiler for more
than one programming
languages but same
machine.
• For each programming
language there is
requirement of making
Front end/first pass for
each of them and only
one Back end/second
pass as shown.
15
Front End / Back End Division
• compiler for same
programming language
but different
machines/systems.
• In this case we make
different Back ends for
different machines and
make only one Front
end for same
programming language
as shown.
16
Front End / Back End Division
17
Programs which help
Compiler accomplish its task
18
Cousins of Compilers
Cousins of Compiler
§ These are different programs that work with compiler to generate
machine code.
§ Other than compiler, there are three programs that work together to do
the task
— Pre processor
— Assembler
— Loader/Linker
19
Cousins of Compilers
Pre processor
§ Macro expansion
§ File inclusion
§ Provides a modified source program.
§ Strip off comments
Compiler
§ Checks for syntax and semantic errors.
§ If no error, the expanded code is converted into assembly code specific
for the processor architecture.
Assembler
§ Produces a re-locatable machine code.
20
Cousins of Compilers
Linker
§ Combines object files and library routines required to run the complete
source program.
§ It then hands over the stuff to the loader.
Loader
§ Takes the entire program (the one provided by the assembler and the
linker) to main memory for execution.
21
Cousins of Compilers
§ Write a simple program in C on *NIX.
§ Compile it using
$ gcc –save-temps prg.c
§ This will produce four files prg.i, prg.s, prg.o, and a.out
§ Go through these files to understand the working of cousins of compilers
discussed in previous slides.
22
Cousins of Compilers
§ Linking is of two types.
— Static linking
• Code of external function/s is placed in the executable.
— Dynamic linking
• Code placement is delayed until load time / run time.
§ To illustrate the concept, try compiling the program prg.c as follows
$ gcc –static prg.c –o prgstatic
And
$ gcc prg.c –o prgdyn
§ Now assuming both prgstatic and prgdyn are in the PWD
$ ls –li
§ Look at the size of both executables.
23
Coming Back to Phases of a
Compiler
Understanding with an Example
24
Lexical Analysis (Scanning)
Source Text into Tokens
§ Reading the input text (character by character) and grouping individual
characters into tokens such as identifiers, integers, reserved words, and
delimiters.
§ The main action of building tokens is often driven by token descriptions.
§ How to describe these tokens?
§ Regular expression notation is an effective approach to describing
tokens.
25
Lexical Analysis (Scanning)
Lexemes and Tokens
§ The character stream input is grouped into meaningful units called
lexemes, which are then mapped into tokens,
§ Tokens are the output of the lexical analyzer.
§ For example, any one of the following C statements
x3 = y + 3;
x3=y + 3 ;
x3=y+3 ;
but not
x 3 = y + 3;
would be grouped into the lexemes x3, =, y, +, 3, and ;
26
Lexical Analysis (Scanning)
Lexemes and Tokens
§ For example, any one of the following C statements
x3 = y + 3;
x3 = y + 3 ;
x3 =y+3 ;
but not
x 3 = y + 3;
would be grouped into the lexemes x3, =, y, +, 3, and ;
§ A token is a <token-name, attribute-value> pair. For example
— The lexeme x3 would be mapped to a token such as <id,1>.
— The name id is short for identifier.
— The value 1 is the index of the entry for x3 in the symbol table
produced by the compiler.
27
Lexical Analysis (Scanning)
Lexemes and Tokens
§ The lexeme =
— would be mapped to the token <=>.
— In reality it is probably mapped to a pair, whose second component is
ignored.
— The point is that there are many different identifiers so we need the
second component, but there is only one assignment symbol =.
§ The lexeme y is mapped to the token <id,2>
§ The lexeme + is mapped to the token <+>.
28
Lexical Analysis (Scanning)
Lexemes and Tokens
§ The lexeme ‘3’ is somewhat interesting and will be discussed later.
§ The lexeme ‘;’ is mapped to the token <;>
§ Note that non-significant blanks are normally removed during scanning.
§ In C, most blanks are non-significant.
§ That does not mean the blanks are unnecessary. Consider
int x; and
intx;
§ The blank between int and x is clearly necessary, but it does not become
part of any token.
§ Blanks inside strings are an exception, they are strictly considered as
part of the token.
29
Syntax Analysis (Parsing)
§ The parser is based on a formal syntax specification such as a CFGs.
§ What is a CFG?
— Context-free Grammars allow a non-terminal to be replaced by a
corresponding production rule whenever it appears in a derivation
process.
— The replacement occurs irrespective of what lies before or after the
non-terminal.
30
Syntax Analysis (Parsing)
§ The parser is based on a formal syntax specification such as a CFGs.
§ What is a CFG?
— Context-free Grammars allow a non-terminal to be replaced by a
corresponding production rule whenever it appears in a derivation
process.
— The replacement occurs irrespective of what lies before or after the
non-terminal.
§ And what is a CSG?
— Context-sensitive Grammars allow replacement of a non-terminal
based on what lies before or after the non-terminal.
31
Syntax Analysis (Parsing)
§ Consider the rule
!→0!1
§ What this says is "wherever you find !, you can replace it with 0!1".
§ Now, consider the rule
"!#→"0!1#
§ This says "You can replace ! with 0!1 only if it is preceded by " and
followed by #”.
§ Here, it imposes a condition on when ! can be replaced with 0!1.
§ You can apply this rule only if ! appears in this particular context.
32
Syntax Analysis (Parsing)
§ Parsing involves a further grouping in which tokens are grouped into
grammatical phrases, which are often represented in a parse tree.
§ For example
x3 = y + 3;
would be parsed into a tree.
§ This parsing would result from a grammar containing rules such as
asst-stmt → id = expr ;
expr → number | id | expr + expr
§ Note the recursive definition of expression (expr).
33
Syntax Analysis (Parsing)
34
Syntax Analysis (Parsing)
§ Often we utilize a simpler tree called the syntax tree with operators as
interior nodes and operands as the children of the operator.
§ The syntax tree below corresponds to the parse tree in previous slide.
§ More on this point later.
35
Semantic Analysis
§ Also called the Type Checker.
§ Type checking is purely dependent on the semantic rules of the source
language. It is independent of the compiler’s target.
§ The semantics of a language provide a set of rules that specify which
syntactically legal programs are actually valid.
§ Such rules typically require that
— all identifiers be declared,
— operators and operands be type-compatible, and
— procedures be called with the proper number of parameters.
36
Semantic Analysis
§ The compiler needs semantic information, e.g., the types (integer, real,
pointer to array of integers, etc) of the objects involved.
§ This enables checking for semantic errors and inserting type conversion
where necessary.
§ For example, if y was declared to be a real and x3 an integer, we need to
insert conversion operators "inttoreal" and "realtoint" as shown.
37
Intermediate Code Generation
§ Many compilers internally generate intermediate code for an "idealized
machine".
§ For example, the intermediate code generated would assume that the
target has an unlimited number of registers and that any register can be
used for any operation.
§ With these assumptions of a machine with an unlimited number of
registers and instructions with three operands, one generates "three-
address code" by walking the semantic tree.
§ Our example C instruction would produce
temp1 = inttoreal(3)
temp2 = id2 + temp1
temp3 = realtoint(temp2)
id1 = temp3
38
Intermediate Code Generation
§ Our example C instruction would produce
temp1 = inttoreal(3)
temp2 = id2 + temp1
temp3 = realtoint(temp2)
id1 = temp3
§ Sometimes three-address code is called quadruples because one can view
the previous code sequence as
inttoreal temp1 3 -–
add temp2 id2 temp1
realtoint temp3 temp2 -–
assign id1 temp3 -–
§ Each "quad" has the form
operation target source1 source2
39
Code Optimazation
§ This is a very serious subject, one that we will not really do justice to in
this introductory course. Some optimizations are fairly easy to see.
§ Since 3 is a constant, the compiler can perform the int to real conversion
and replace the first two quads with
add temp2 id2 3.0
§ The last two quads can be combined into
realtoint id1 temp2
§ In addition to optimizations performed on the intermediate code, further
optimizations can be performed on the machine code by the machine-
dependent back end.
40
Code Generation
§ Modern processors have only a limited number of register.
§ Although some processors, such as the x86, can perform operations
directly on memory locations, we will for now assume only register
operations.
§ Some processors (e.g., the MIPS architecture) use three-address
instructions.
§ We follow this model.
§ Other processors permit only two addresses; the result overwrites one of
the sources.
§ Using three-address instructions restricted to registers (except for load
and store instructions, which naturally must also reference memory),
code something like the following would be produced for our example,
after first assigning memory locations to id1 and id2.
LD R1, id2
ADDF R1, R1, #3.0 // add float
RTOI R2, R1 // real to int
ST id1, R2
41