COSC 408 - Compiler Construction
COSC 408 - Compiler Construction
COURSE MATERIAL
FOR
1
ACKNOWLEDGEMENT
We acknowledge the use of the Courseware of the National Open University of
Nigeria (NOUN) as the primary resource. Internal reviewers in the Ahmadu Bello
University have also been duly listed.
2
COPYRIGHT PAGE
© 2019 Ahmadu Bello University (ABU) Zaria, Nigeria
ISBN:
Published and printed in Nigeria by:
Ahmadu Bello University Press Ltd,
Ahmadu Bello University
Zaria, Nigeria.
Tel: +234
E-mail:
3
COURSE WRITERS/DEVELOPMENT TEAM
Editor
Prof. M.I Sule
Language Reviewer
Enegoloinu Ojokojo
Instructional Designers/Graphics
Emmanuel Ekoja / Ibrahim Otukoya
ODL Expert
Dr. Abdulkarim Muhammad
4
COURSE STUDY GUIDE
i. COURSE INFORMATION
Course Code: COSC 408
Course Title: Compiler Construction
Credit Units: 3
Year of Study: 4
Semester: 2
Description:
COSC408 – Compiler Construction is a three (3) credit unit course of 17 study
sessions. With the increasing diversity and complexity of computers and their
5
applications, the development of efficient, reliable software has become
increasingly dependent on automatic support from compilers and other program
analysis and translation tools. This course covers principal topics in understanding
and transforming programs at the code block, function, program, and behaviour
levels. Specific techniques for imperative languages include data flow,
dependence, inter-procedural, and profiling analyses, resource allocation, and
multi-grained parallelism on both CPUs and GPUs
It is a course for B. Sc. Computer science major students, and is normally taken in
a student's fourth year. It should appeal to anyone who is interested in the design
and implementation of programming languages. Anyone who does a substantial
amount of programming should find the material valuable.
This course is divided into four modules. The first module deals with the review of
grammars, languages and automata, introduction to compiler.
The second module treats, extensively, lexical analysis. Under which such
concepts as the scanner, lexical analyzer generation were discussed.
The third module deals with syntax analysis and discusses context-free grammars,
LL (k), LR (k), operator-precedence grammars, etc. In addition, implementation of
various parsing techniques was discussed.
The fourth module which is the concluding module of the course discusses code
generation issues such as symbol tables, intermediate representation, code
optimization techniques and code generation.
This Course Guide gives you a brief overview of the course contents, course
duration, and course materials.
6
iii. COURSE PREREQUISITES
You should note that although this course has no subject pre-requisite, you are
expected to have:
1. Satisfactory level of English proficiency
2. Basic Computer Operations proficiency
3. Online interaction proficiency
4. Web 2.0 and Social media interactive skills
5. COSC 212 or equivalent
7
syntax analysis,plus discussion of parsing difficulties caused
by features of various source languages.
Keith,D. Cooper & Linda Torczon, "Engineering a Compiler",
Morgan Kaufmann Publishers, 2004
Pemberton, S. & Daniels, M.C. Pascal Implementation. The
P4Compiler ISBN 0853123586(Discussion)
andISBN085312437X(Compiler listing) Complete listing
and readablecommentary for a Pascal compiler written in
Pascal.
8
3. Construct and use the major components of a modern compiler
4. Work together in teams on a substantial software implementation project.
9
B. Summative assessment (Semester examination)
CBT based 30
Essay based 30
TOTAL 100%
C. Grading Scale:
A = 70-100
B = 60 – 69
C = 50 - 59
D = 45-49
F = 0-44
D. Feedback
Courseware based:
1. In-text questions and answers (answers preceding references)
2. Self-Assessment Question(s) (answers preceding references)
Tutor based:
1. Discussion Forum tutor input
2. Graded Continuous assessments
Student based:
1. Online program assessment (administration, learning resource, deployment,
and assessment).
10
xi. LINKS TO OPEN EDUCATION RESOURCES
OSS Watch provides tips for selecting open source, or for procuring free or open
software.
SchoolForge and SourceForge are good places to find, create, and publish open
software. SourceForge, for one, has millions of downloads each day.
Open Source Education Foundation and Open Source Initiative, and other
organisation like these, help disseminate knowledge.
Creative Commons has a number of open projects from Khan
Academy to Curriki where teachers and parents can find educational materials for
children or learn about Creative Commons licenses. Also, they recently launched
the School of Open that offers courses on the meaning, application, and impact of
"openness."
Numerous open or open educational resource databases and search engines
exist. Some examples include:
OEDb: over 10,000 free courses from universities as well as reviews of colleges
and rankings of college degree programs
Open Tapestry: over 100,000 open licensed online learning resources for an
academic and general audience
OER Commons: over 40,000 open educational resources from elementary school
through to higher education; many of the elementary, middle, and high school
resources are aligned to the Common Core State Standards
Open Content: a blog, definition, and game of open source as well as a friendly
search engine for open educational resources from MIT, Stanford, and other
universities with subject and description listings
Academic Earth: over 1,500 video lectures from MIT, Stanford, Berkeley,
Harvard, Princeton, and Yale
11
JISC: Joint Information Systems Committee works on behalf of UK higher
education and is involved in many open resources and open projects including
digitising British newspapers from 1620-1900!
12
Livebinders: search, create, or organise digital information binders by age, grade,
or subject (why re-invent the wheel?)
13
xii. DLC ACADEMIC CALENDAR/PLANNER
PERIOD
Semester Semester 1 Semester 2 Semester 3
Activity JAN FEB MAR APR MAY JUN JUL AUG SEPT OCT NOV DEC
Registration
Resumption
Late Registn.
Facilitation
Revision/
Consolidation
Semester
Examination
14
xiii. COURSE STRUCTURE AND OUTLINE
Course Structure
WEEK MODULE STUDY SESSION ACTIVITY
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 1: 3. Listen to the Audio on this Study Session
Week1 Title: Review of 4. View any other Video/YouTube
Grammars, (https://bit.ly/2uEQto1)
Languages and 5. View referred OER (address)
Automata 6. View referred Animation
(https://bit.ly/2JKkXzj)
Pp. 25 7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
STUDY 2. View the Video(s) on this Study Session
Week 2 MODULE 1 Study Session 2 3. Listen to the Audio on this Study Session
Title: What is a 4. View any other Video/YouTube
Compiler? (https://bit.ly/2FvTlsi)
5. View referred OER (address/site)
6. View referred Animation
Pp. 48 (https://bit.ly/2UdO1Tx)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 3 Study Session 3 3. Listen to the Audio on this Study Session
Title: The Structure 4. View any other Video/YouTube
of a Compiler (https://bit.ly/2TDpwef)
Pp. 59 5. View referred OER (address/site)
6. View referred Animation
15
(https://bit.ly/2OrxINT)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 4 3. Listen to the Audio on this Study Session
Study Session 1 4. View any other Video/YouTube
Title: The Scanner (https://bit.ly/2YnifTo)
5. View referred OER (address/site)
Pp. 70 6. View referred Animation
STUDY (https://bit.ly/2JKpHF7)
7. Read Chapter/page of Standard/relevant text.
MODULE 2
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 5 & 6 Study Session 2 3. Listen to the Audio on this Study Session
Title: Hand 4. View any other Video/YouTube
Implementation of (https://bit.ly/2TDBqon)
Lexical Analyser 5. View referred OER (address/site)
6. View referred Animation
Pp. 77 (https://bit.ly/2uxsnex)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 3 3. Listen to the Audio on this Study Session
Week 7 Title: Automatic 4. View any other Video/YouTube
Generation of (https://bit.ly/2Udqx0T)
Lexical Analyser 5. View referred OER (address/site)
Pp. 86 6. View referred Animation
16
(https://bit.ly/2FFMzkV)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 4 3. Listen to the Audio on this Study Session
Title: Implementing 4. View any other Video/YouTube
a Lexical Analyser (https://bit.ly/2Ous2Tl)
5. View referred OER (address/site)
Pp. 103 6. View referred Animation
(https://bit.ly/2CRAnMj)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
Study Session 1 2. View the Video(s) on this Study Session
Week 8 Title: Context-Free 3. Listen to the Audio on this Study Session
Grammars 4. View any other Video/YouTube
(https://bit.ly/2TF4k7N)
Pp. 117 5. View referred OER (address/site)
STUDY 6. View referred Animation
MODULE 3 (https://bit.ly/2U27l7c)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 2 3. Listen to the Audio on this Study Session
Title: Bottom-up 4. View any other Video/YouTube
Parsing Techniques (https://bit.ly/2HJTvjo)
5. View referred OER (address/site)
Pp. 135 6. View referred Animation
17
(https://bit.ly/2Wy9HHP)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 9 3. Listen to the Audio on this Study Session
Study Session 3 4. View any other Video/YouTube
Title: Precedence (https://bit.ly/2V0OsxE)
Parsing 5. View referred OER (address/site)
6. View referred Animation
Pp. 149 (https://bit.ly/2I1SXnY)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 4 3. Listen to the Audio on this Study Session
Title: Top-down 4. View any other Video/YouTube
Parsing Techniques (https://bit.ly/2FDJrG2)
5. View referred OER (address/site)
Pp. 164 6. View referred Animation
(https://bit.ly/2FEbhlz)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
18
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 5 3. Listen to the Audio on this Study Session
Week 10 Title: LR Parsers 4. View any other Video/YouTube
(https://bit.ly/2SxnDQ5)
Pp. 184 5. View referred OER (address/site)
6. View referred Animation
(https://bit.ly/2HXpJqG)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
Study Session 1 2. View the Video(s) on this Study Session
Title: Error 3. Listen to the Audio on this Study Session
Handling 4. View any other Video/YouTube
(https://bit.ly/2FJcvuX)
Pp. 206 5. View referred OER (address/site)
STUDY 6. View referred Animation
MODULE 4 (https://bit.ly/2HX7f9q)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 11 Study Session 2 3. Listen to the Audio on this Study Session
Title: Symbol 4. View any other Video/YouTube
Tables (https://bit.ly/2UYI9L5)
5. View referred OER (address/site)
Pp. 223 6. View referred Animation
(https://bit.ly/2HSIWtg)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
19
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
3. Listen to the Audio on this Study Session
Study Session 3 4. View any other Video/YouTube
Title: Intermediate (https://bit.ly/2uwwvM0)
Code Generation 5. View referred OER (address/site)
6. View referred Animation (https://bit.ly/2WtBNUw)
Pp. 245 7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Week 12 Study Session 4 3. Listen to the Audio on this Study Session
Title: Code 4. View any other Video/YouTube
Generation (https://bit.ly/2TZEO1X)
5. View referred OER (address/site)
Pp. 266 6. View referred Animation
(https://bit.ly/2HRxq1v)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session
Study Session 5 3. Listen to the Audio on this Study Session
Title: Code 4. View any other Video/YouTube
Optimisation (https://bit.ly/2JIlaml)
5. View referred OER (address/site)
Pp. 287 6. View referred Animation
(https://www.youtube.com/watch?v=D7JmYNaYt4k)
7. Read Chapter/page of Standard/relevant text.
8. Any additional study material
9. Any out of Class Activity
20
Week 13 REVISION/TUTORIALS (On Campus or Online)& CONSOLIDATION
WEEK
Week 14 & SEMESTER EXAMINATION
15
21
Contents
Title Page……………………………………………...……………………………..……1
Acknowledgement Page …………………………………………………….……………2
Copyright Page ………………………………………..…………..………….…………..3
Course Writers/Development Team ……………………...……….………….…………4
22
Study Session 2: Bottom-Up Parsing Techniques..............................................................135
Study Session 3: Precedence Parsing..................................................................................149
Study Session 4: Top-Down Parsing Techniques...............................................................164
Study Session 5: LR Parsers...............................................................................................184
23
Course Outline
MODULE 1: Introduction To Compilers
Study Session 1: Review of Grammars, Languages and Automata
Study Session 2: What is a Compiler?
Study Session 3: The Structure of a Compiler
24
xiv. STUDY MODULES
MODULE 1: Introduction to Compilers
Contents:
Study Session 1: Review of Grammars, Languages and Automata
Study Session 2: What is a Compiler?
Study Session 3: The Structure of a Compiler
STUDY SESSION 1
Review of Grammars, Languages and Automata
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 –Formal Grammar
2.1.1 – The Syntax of Grammar
2.1.2 – The Semantic of Grammar
2.2- Types of Grammar and Automata
2.2.1 – Type-0: Unrestricted Grammars
2.2.2 – Type-1: Context-Sensitive Grammars
2.2.3 – Type-2: Context-Free Grammars
2.2.4 – Type-3: Regular Grammars
2.2.5 – Analytic Grammars
2.3- Chomsky Hierarchy
2.3.1 – The Hierarchy
2.4- Formal Languages
2.4.1 – Words over an Alphabet
2.4.2 – Language-Specification Formalisms
2.4.3 – Operations on Languages
2.4.4 – Uses of Formal Languages
25
2.5- Programming Languages
2.5.1 – Formal Theories, Systems and Proofs
2.5.2 – Interpretations and Models
3.0 Tutor Marked Assignments (Individual or Group assignments)
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
As you learnt in COSC 402: Formal Languages and Automata Theory, in the
field of Computer Science, there are different types of grammars on which
different languages are defined. For each of these grammars, there is a class of
automata that can parse or recognize strings form from the grammar. The set of
all strings that can be generated from the grammar constitutes the language of
the grammar.
In this study session, you will be taken through some of the things you learnt
previously on formal grammar, formal language and automata.
26
6. Describe the Chomsky hierarchy
7. Explain the relevance of formal grammar and language to computer
programming
A formal grammar is a set of rules for rewriting strings, along with a "start
symbol" from which rewriting must start. Therefore, a grammar is usually
thought of as a language generator. However, it can also sometimes be used as
the basis for a "recogniser". Recogniser is a function in computing that
determines whether a given string belongs to the language or is grammatically
incorrect. To describe such recognisers, formal language theory uses separate
formalisms, known as Automata Theory.
The process of recognising an utterance (a string in natural languages) by
breaking it down to a set of symbols and analysing each one against the
grammar of the language is referred to as Parsing. Most languages have the
meanings of their utterances structured according to their syntax – a practice
known as compositional semantic s. As a result, the first step to describing the
meaning of an utterance in language is to break it down into parts and look at its
analysed form (known as its parse tree in Computer Science).
27
A grammar mainly consists of a set of rules for transforming strings. (If it
consists of these rules only, it would be a semi-Thue system). To generate a
string in the language, one begins with a string consisting of a single start
symbol. The production rules are then applied in any order, until a string that
contains neither the start symbol nor designated non-terminal symbols is
produced. The language formed by the grammar consists of all distinct strings
that can be generated in this manner. Any particular sequence of production
rules on the start symbol yields a distinct string in the language. If there are
multiple ways of generating the same single string, the grammar is said to be
ambiguous.
Example 2.1
Assuming the alphabet consists of a and b, the start symbol is S and we have the
following production rules:
1.
2.
then we start with S, and can choose a rule to apply to it. If we choose rule 1, we
obtain the string aSb. If we choose rule 1 again, we replace S with aSb and
obtain the string aaSbb. If we now choose rule 2, we replace S with ba and
obtain the string aababb, and are done. We can write the series of choices more
briefly, using symbols: . The language of the
grammar is then the indefinite set
{ | { , where is a repeated
k times (and n in particular represents the number of times production rule 1 has
been applied).
2.1.1 The Syntax of Grammars
In the classic formalisation of generative grammars first proposed by Noam
Chomsky in the 1950s, a grammar G consists of the following components:
28
- A finite set N of non-terminal symbols, none of which appear in strings
formed from G.
- A finite set Σ of terminal symbols that is disjoint from N.
- A finite set P of production rules, each rule of the form
*
- where is the Kleene star operator and denotes set union. That is,
each production rule maps from one string of symbols to another, where
the first string (the "head") contains at least one non-terminal symbol. In
the case that the second string (the "body") consists solely of the empty
string – i.e. it contains no symbols at all, it may be denoted with a special
notation (often Λ, e orε) in order to avoid confusion.
A distinguished symbol , that is, the start symbol.
A grammar is formally defined as the tuple (N, Σ, P, S). Such a formal grammar
is often called a rewriting system or a phrase structure grammar in the
literature.
29
- the language of G, denoted as , is defined as all those sentences that
can be derived in a finite number of steps from the start symbol S, that is,
the set .
Note that the grammar G = (N, Σ, P, S) is effectively the semi-Thue
- system , rewriting strings in exactly the same way; the only
difference is that, we distinguish specific non-terminal symbols which
must be rewritten in rewrite rules, and are only interested in rewritings
from the designated start symbol S to strings without non-terminal
symbols.
Example 2.2
Note that for these examples, formal languages are specified using set builder
notation.
Consider the grammar G where , , S is the start
symbol, and P consists of the following production rules:
1.
2.
3.
4.
30
(Note on notation: reads "String P generates string Q by means of
production i", and the generated part is each time indicated in bold type.)
2.2 Types of Grammars and Automata
In the field of Computer Science, there are four basic types of grammars:
(a) Type-0 grammars (unrestricted grammars) include all formal grammars.
(b) Type-1 grammars (context-sensitive grammars) generate the context-
sensitive languages.
(c) Type-2 grammars (context-free grammars) generate the context-free
languages.
(d) Type-3 grammars (regular grammars) generate the regular languages.
The difference between these types is that they have increasingly strict
production rules and can express fewer formal languages. Two important types
are context-free grammars (Type 2) and regular grammars (Type 3). The
languages that can be described with such a grammar are called context-free
languages and regular languages, respectively. Although much less powerful
than unrestricted grammars (Type 0), which can in fact express any language
that can be accepted by a Turing machine, these two restricted types of
grammars are most often used because parsers for them can be efficiently
implemented.
Recently, there have been other types of classification of grammars such as
analytical grammars identified.
2.2.1 Type-0: Unrestricted Grammars
These grammars generate exactly all languages that can be recognised by a
Turing machine. These languages are also known as the Recursively
31
Enumerable Languages. Note that this is different from the recursive languages
which can be decided by an always-halting Turing machine.
2.
32
is a member of the language, where n is the length of the string. Further, some
important subsets of the context-free languages can be recognised in linear time
using other algorithms.
These are exactly all languages that can be recognised by a non-deterministic
pushdown automaton. Context-free languages are the theoretical basis for the
syntax of most programming languages.
2.2.4 Type-3: Regular Grammars
In regular grammars, the left hand side is again only a single non-terminal
symbol, but now the right-hand side is also restricted. The right side may be the
empty string, or a single terminal symbol, or a single terminal symbol followed
by a non-terminal symbol, but nothing else. (Sometimes a broader definition is
used: one can allow longer strings of terminals or single non-terminals without
anything else, making languages easier to denote while still defining the same
class of languages.)
The language defined above is not regular, but the language
(at least 1 a followed by at least 1 b, where the numbers may be different) is, as
it can be defined by the grammar G3 with , , S the
start symbol, and the following production rules:
34
In-text Question 1
In the context of Computer Science, what do you understand by the word „grammar‟?
Answer
A formal grammar (sometimes called a grammar) is a set of rules of a specific kind, for
forming strings in a formal language.
35
Every regular language is context-free, every context-free language, not
containing the empty string, is context-sensitive and every context-sensitive
language is recursive and every recursive language is recursively enumerable.
These are all proper inclusions, meaning that there exist recursively enumerable
languages which are not context-sensitive, context-sensitive languages which
are not context-free and context-free languages which are not regular.
The following table summarises each of Chomsky's four types of grammars, the
class of language it generates, the type of automaton that recognises it, and the
form its rules must have.
Table 1: Summary of the Languages, Automata and Production
Rules of Chomsky’s Four Types of Grammars
Production rules
Grammar Languages Automaton
(constraints)
Recursively (no
Type-0 Turing machine
enumerable restrictions)
Linear-bounded
Context- non-
Type-1
sensitive Deterministic
Turing machine
Non-
Deterministic
Type-2 Context-free
pushdown
automaton
and
Finite State
Type-3 Regular
automaton
36
2.4 Formal Language
A formal language is a set of words, i.e. finite strings of letters, symbols, or
tokens. The set from which these letters are taken is called the alphabet over
which the language is defined. A formal language is often defined by means of
a formal grammar (also called its formation rules); accordingly, words that
belong to a formal language are sometimes called well-formed words (or well-
formed formulas). Formal languages are
studied in computer science and
linguistics; the field of formal language
theory studies the purely syntactical
aspects of such languages (that is, their
internal structural patterns).
37
form a new word, whose length is the sum of the lengths of the original words.
The result of concatenating a word with the empty word is the original word.
In some applications, especially in logic, the alphabet is also known as the
vocabulary and words are known as formulas or sentences; this breaks the
letter/word metaphor and replaces it by a word/sentence metaphor.
Therefore, a formal language L over an alphabet Σ is just a subset of Σ*, that is,
a set of words over that alphabet.
In computer science and mathematics, which do not usually deal with natural
languages, the adjective "formal" is often omitted as redundant. While formal
language theory usually concerns itself with formal languages that are described
by some syntactical rules, the actual definition of the concept "formal language"
is only as above: a (possibly infinite) set of finite-length strings, no more nor
less. In practice, there are many languages that can be described by rules, such
as regular languages or context-free languages. The notion of a formal grammar
may be closer to the intuitive concept of a "language," one described by
syntactic rules. By an abuse of the definition, a particular formal language is
often thought of as being equipped with a formal grammar that describes it.
Examples 2.3
(a) The following rules describe a formal language L over the alphabet Σ =
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, =}:
(b) Every nonempty string that does not contain + or = and does not start
with 0 is in L.
The string 0 is in L.
(c) A string containing = is in L if and only if there is exactly one =, and it
separates two valid strings in L.
(d) A string containing + but not = is in L if and only if every + in the string
separates two valid strings in L.
(e) No string is in L other than those implied by the previous rules.
38
(f) Under these rules, the string "23+4=555" is in L, but the string
"=234=+" is not. This formal language expresses natural numbers, well-
formed addition statements, and well-formed addition equalities, but it
expresses only what they look like (their syntax), not what they mean
(semantics). For instance, nowhere in these rules is there any indication
that 0 means the number zero or that + means addition.
(g) For finite languages one can simply enumerate all well-formed words.
For example, we can describe a language L as just L = {"a", "b",
"ab", "cba"}.
However, even over a finite (non-empty) alphabet such as
Σ = {a, b} there are infinitely many words: "a", "abb",
"ababba","aaababbbbaab‖… Therefore formal languages are typically
infinite, and describing an infinite formal language is not as simple as
writing L = {"a", "b", "ab", "cba"}. Here are some examples of formal
languages:
L =Σ*, the set of all words overΣ;
L = {a}*= {an}, where n ranges over the natural numbers and anmeans "a"
repeated n times (this is the set of words consisting only of the symbol
"a");
(h) The set of syntactically correct programs in a given programming
language (the syntax of which is usually defined by a context-free
grammar);
(i) The set of inputs upon which a certain Turing machine halts; or
(j) The set of maximal strings of alphanumeric ASCII characters on this line,
(i.e., the set {"the", "set", "of", "maximal", "strings", "alphanumeric",
"ASCII", "characters", "on", "this", "line", "i", "e"}).
2.4.2 Language-Specification Formalisms
39
Formal language theory rarely concerns itself with particular languages (except
as examples), but is mainly concerned with the study of various types of
formalisms to describe languages. For instance, a language can be given as:
- those strings generated by some formal grammar;
- those strings described or matched by a particular regular expression;
- those strings accepted by some automaton, such as a Turing machine or
finite state automaton;
- those strings for which some decision procedure (an algorithm that asks a
sequence of related YES/NO questions) produces the answer YES.
Typical questions asked about such formalisms include:
- What is their expressive power? (Can formalism X describe every
language that formalism Y can describe? Can it describe other
languages?)
- What is their recognisability? (How difficult is it to decide whether a
given word belongs to a language described by formalism X?)
- What is their comparability? (How difficult is it to decide whether two
languages, one described in formalism X and one in formalism Y, or in X
again, are actually the same language?).
Surprisingly often, the answer to these decision problems is "it cannot be done
at all", or "it is extremely expensive" (with a precise characterisation of how
expensive exactly). Therefore, formal language theory is a major application
area of computability theory and complexity theory. Formal languages may be
classified in the Chomsky hierarchy based on the expressive power of their
generative grammar as well as the complexity of their recognising automaton.
Context-free grammars and regular grammars provide a good compromise
between expressivity and ease of parsing, and are widely used in practical
applications.
40
Certain operations on languages are common. This includes the standard set
operations, such as union, intersection, and complement. Another class of
operation is the element-wise application of string operations.
Example 2.4
(a) Suppose L1 and L2 are languages over some common alphabet.
(b) The concatenation L1L2 consists of all strings of the form vw where
v is a string from L1 and w is a string from L2.
(c) The intersection L1∩L2 of L1 and L2 consists of all strings which
are contained in both languages
(d) The complement ¬ L of a language with respect to a given alphabet
consists of all strings over the alphabets that are not in the language.
(e) The Kleene star: the language consisting of all words that are
concatenations of 0 or more words in the original language; Reversal:
(f) Let e be the empty word, then eR = e, and
(g) for each non-empty word w = x1… xn over some alphabet, let wR
= xn… x1,
(h) then for a formal language L, LR= {wR | w L}.
String homomorphism
Such string operations are used to investigate closure properties of classes of
languages. A class of languages is closed under a particular operation when the
operation, applied to languages in the class, always produces a language in the
same class again. For instance, the context- free languages are known to be
closed under union, concatenation, and intersection with regular languages, but
not closed under intersection or complement.
2.4.4 Uses of Formal Languages
41
Formal languages are often used as the basis for richer constructs endowed with
semantics. In computer science they are used, among other things, for the
precise definition of data formats and the syntax of programming languages.
Formal languages play a crucial role in the development of compilers, typically
produced by means of a compiler, which may be a single program or may be
separated in tools like lexical analyser generators (e.g. lex), and parser
generators (e.g. yacc). Since formal languages alone do not have semantics,
other formal constructs are needed for the formal specification of program
semantics.
Formal languages are also used in logic and in foundations of mathematics to
represent the syntax of formal theories. Logical systems can be seen as a formal
language with additional constructs, like proof calculi, which define a
consequence relation "Tarski's definition of truth" in terms of a T-schema for
first-order logic, is an example of fully interpreted formal language; all its
sentences have meanings that make them either true or false.
In-text Question 2
Define formal languages.
Answer
A formal language is a set of words, i.e. finite strings of letters, symbols, or tokens.
42
belongs to the programming language for which the compiler was built. Of
course, compilers do more than just parse the source code; they translate it into
some executable format; because of this, a parser usually outputs more than a
yes/no answer. Typically an abstract syntax tree, which is used in subsequent
stages of the compiler to eventually generate an executable containing machine
code that runs directly on the hardware, or some intermediate code that requires
a virtual machine to execute.
2.5.1 Formal Theories, Systems and Proofs
43
expressions. Although a formal language can be identified with its formulas, a
formal system cannot be likewise identified by its theorems. Two formal
systems may have all the same theorems and yet differ in some
significant proof-theoretic way (a formula A may be a syntactic consequence of
a formula B in one but not another for instance).
A formal proof or derivation is a finite sequence of well-formed formulas
(which may be interpreted as propositions) each of which is an axiom or follows
from the preceding formulas in the sequence by a rule of inference. The last
sentence in the sequence is a theorem of a formal system. Formal proofs are
useful because their theorems can be interpreted as true propositions.
2.5.2 Interpretations and Models
Formal languages are entirely syntactic in nature but may be given semantics
that give meaning to the elements of the language. For instance, in mathematical
logic, the set of possible formulas of a particular logic is a formal language, and
an interpretation assigns a meaning to each of the formulas - usually, a truth
value.
The study of interpretations of formal languages is called formal semantics. In
mathematical logic, this is often done in terms of model theory. In model
theory, the terms that occur in a formula are interpreted as mathematical
structures, and fixed compositional interpretation rules determine how the truth
value of the formula can be derived from the interpretation of its terms; a model
for a formula is an interpretation of terms such that the formula becomes true.
3.0 Tutor Marked Assignment
1. Name the class of automata that are used in recognizing the following
grammars:
(a) Regular grammars
(b) Context-sensitive grammars
(c) Type – 0 grammars
(d) Context-free grammars
44
2. What are the use(s) of formal languages?
3. What does it mean to say a class of languages is closed under a particular
operation? Hence or otherwise suppose L1 and L2 are languages over
some common alphabet; state (with appropriate examples) the standard
operations that can be performed on the languages.
4. From what you have learnt so far in this course, justify the relevance of
formal languages to computer programming?
5. Briefly discuss the Chomsky hierarchy. What is the relationship among
the various types of grammars described in the Chomsky hierarchy?
4.0 Conclusion/Summary
In this study session, you have been taken through a brief revision of formal
grammars, formal languages and automata because of crucial roles they play in
the development of compilers. You should read more on these various topics
before proceeding to the next study session. In the subsequent study sessions,
you will be learning about compiler construction. Precisely, study session 2 of
this module will introduce you to the concept of compilers.
You also learnt that a formal grammar is a set of rules of a specific kind, for
forming strings in a formal language. It has four components that form its
syntax and a set of operations that can be performed on it, which form its
semantic.
Each type of grammars is recognised by a particular type of automata. For
example, type-2 grammars are recognised by pushdown automata while type-3
grammars are recognised by finite state automata.
Parsing is the process of recognising an utterance by breaking it down to a set
of symbols and analysing each one against the grammar of the language.
According to Chomsky hierarchy, there are four types of grammars. The
difference between these types is that they have increasingly strict production
rules and can express fewer formal languages.
45
A formal language is a set of words, i.e. finite strings of letters, symbols or
tokens. The set from which these letters are taken is called the alphabet over
which the language is defined. A formal language is often defined by means of
a formal grammar.
Formal languages play a crucial role in the development of compilers and
precise definition of the syntax of programming languages. A formal system
consists of a formal language together with a deductive apparatus.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=BEiWsKutaY0. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=s8RJPQv5bo8and
critique it in the discussion forum
c. Take a walk and engage any 3 students on Syntax and Semantics In 2
paragraphs summarise their opinion of the discussed topic. etc.
46
iii. Type-2 grammars (context-free grammars) generate the context-
free
iv. Type-3 grammars (regular grammars) generate the regular
languages.
2.
i. No
ii. Yes
iii. Yes
8.0 References/Further Reading
Arnon, Avron (1994). What is a logical system? Chapter 8. In: Dov M.
Gabbay (Ed.). What is a logical system? Oxford University Press.
Chomsky, N. & Schützenberger, M. P. (1963). "The al gebraic theory of
context free languages". In: P. Braffort, D. Hirschberg.Computer
Programming and Formal Languages. Amsterdam: NorthHolland. pp.
118–161.
Chomsky, Noam (1959). "On certain formal properties of grammars".
Information and Control 2 (2): 137–167.
Davis, M. E., Sigal, R. & Weyuker, E. J. (1994).
Computability,Complexity, and Languages: Fundamentals of Theoretical
Computer Science. Boston: Academic Press, Harcourt, Brace.pp. 327.
Ginsburg, S. (1975). Algebraic and Automata Theoretic Properties of
Formal Languages. North-Holland.
Godel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter
Grzegorz Rozenberg & Arto Salomaa (1997). Handbook of Formal
Languages: Volume I-III, Springer. ISBN 3 540 61486 9.
Hamilton, G. (1978). Logic for Mathematicians. Cambridge University
Press. ISBN 0 521 21838 1.
Harrison, M. A. (1978). Introduction to Formal Language Theory.
Addison-Wesley.
47
Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata
Theory, Languages, and Computation. Addison-Wesley:Publishing,
Reading Massachusetts.
Suppes, P. (1957). Introduction to Logic. D. Van Nostrand.
48
STUDY SESSION 2
What is a Compiler?
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Translators
2.2- Why do we need Translators?
2.3- What is a Compiler?
2.4- The Challenges in Compiler Development
2.5- Compiler Architecture
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous session, you were taken through some basic concepts you learnt
in an earlier course. This was done because of their relevance/importance to
your understanding of this course.
In this session, you will be introduced to the concept of compilers and their
importance to program development.
49
3. Discuss the major challenges to be faced in building compilers state the
qualities of compilers
4. Mention some of the knowledge required for building compilers
5. Describe the architecture of a compiler.
51
You should note that both interpreters and compilers (like any other program)
are written in some high-level programming language (which may be different
from the language they accept) and they are translated into machine code. For
example, a Java interpreter can be completely written in Pascal, or even Java.
The interpreter source program is machine independent since it does not
generate machine code. (Note the difference between generate and translated
into machine code.) An interpreter is generally slower than a compiler because
it processes and interprets each statement in a program as many times as the
number of the evaluations of this statement. For example, when a for-loop is
interpreted, the statements inside the for-loop body will be analysed and
evaluated on every loop step. Some languages, such as Java and Lisp, come
with both an interpreter and a compiler. Java source programs (Java classes with
.java extension) are translated by the java compiler into byte-code files (with
.class extension). The Java interpreter, java, called the Java Virtual Machine
(JVM), may actually interpret byte codes directly or may internally compile
them to machine code and then execute that code.
Like was mention in section 3.1, compilers and interpreters are not the only
examples of translators. In the table below are a few more:
Table 1.2.1: Table of Translators, Source Language and Target Language
52
English text Natural Language semantics (meaning)
Understanding
Regular
JLex scanner generator a scanner in Java
Expressions
This course deals mainly with compilers for high-level programming languages,
but the same techniques apply to interpreters or to any other compilation
scheme.
In-text Question 1
What is a translator?
Answer
A translator is a program that takes as input a program written in one programming
language ( the source language) and produces as output a program in another language (the
object or target language).
53
2. Qualities of a compiler: these concerns the qualities that are compiler
must possess in other to be effective and useful. These are listed below in
order of importance:
(a) the compiler itself must be bug-free
(b) it must generate correct machine code
(c) the generated machine code must run fast
(d) the compiler itself must run fast (compilation time must be
proportional to program size)
(e) the compiler must be portable (i.e. modular, supporting separate
compilation)
(f) it must print good diagnostics and error messages
(g) the generated code must work well with existing debuggers must have
consistent and predictable optimization.
3. In-depth knowledge:
Building a compiler requires in-depth knowledge of:
(a) Programming languages (parameter passing, variable scoping,
memory allocation, etc.)
(b) Theory (automata, context-free languages, etc.)
(c) Algorithms and data structures (hash tables, graph algorithms,
dynamic programming, etc.)
(d) Computer architecture (assembly programming) software
engineering.
You should try building a non-trivial compiler for a Pascal-like language as the
course project. This will give you a hands-on experience on system
implementation that combines all this knowledge.
2.5 Compiler Architecture
As we saw earlier mentioned, a compiler can be viewed as a program that
accepts a source code (such as a Java program) and generates machine code for
some computer architecture. Suppose that you want to build compilers for n
54
programming languages (e.g. FORTRAN, C, C++, Java, BASIC, etc.) and you
want these compilers to run on m different architectures (e.g. MIPS, SPARC,
Intel, alpha, etc.). If you do that naively, you need to write n*m compilers, one
for each language-architecture combination.
The holy grail of portability in compilers is to do the same thing by writing n +
m programs only. You can do this by using a universal Intermediate
Representation (IR) and you make the compiler a two-phase compiler. An IR is
typically a tree-like data structure that captures the basic features of most
computer architectures. One example of an IR tree node is a representation of a
3-address instruction, such as d s1+ s2that gets two source addresses, s1and s2,
(i.e. two IR trees) and produces one destination address, d. The first phase of
this compilation scheme, called the front-end, maps the source code into IR, and
the second phase, called the back-end, maps IR into machine code. That way,
for each programming language you want to compile, you write one front-end
only, and for each computer architecture, you write one back-end. So, totally
you have n + m components.
But the above ideal separation of compilation into two phases does not work
very well for real programming languages and architectures. Ideally, you must
encode all knowledge about the source programming language in the front end,
you must handle all machine architecture features in the back end, and you must
design your IRs in such a way that all language and machine features are
captured properly.
A typical real-world compiler usually has multiple phases (this will be treated to
greater details in study session 3 of this module. This increases the compiler's
portability and simplifies retargeting. The front end consists of the following
phases:
- scanning: a scanner groups input characters into tokens
- parsing: a parser recognises sequences of tokens according to some
grammar and generates Abstract Syntax Trees (ASTs)
55
- semantic analysis: performs type checking (i.e. checking whether the
variables, functions etc. in the source program are used consistently with
their definitions and with the language semantics) and translates ASTs
into IRs
- optimisation: optimises IRs.
The back end consists of the following phases:
- instruction selection: maps IRs into assembly code
- code optimisation: optimises the assembly code using control-flow and
data-flow analyses, register allocation, etc
- Code emission: generates machine code from assembly code.
The generated machine code is written in an object file. This file is not
executable since it may refer to external symbols (such as system calls). The
operating system provides the following utilities to execute the code:
- Linking: A linker takes several object files and libraries as input and
produces one executable object file. It retrieves from the input files (and
puts them together in the executable object file) the code of all the
referenced functions/procedures and it resolves all external references to
real addresses. The libraries include the operating system libraries, the
language-specific libraries, and, maybe, user-created libraries.
- Loading: A loader loads an executable object file into memory, initializes
the registers, heap, data, etc. and starts the execution of the program.
Relocatable shared libraries allow effective memory use when many different
applications share the same code.
In-text Question 2
What is a compiler?
Answer
A compiler is a program that translates a source program written in some high-level
programming language (such as Java) into machine code for some computer architecture
56
3.0 Tutor Marked Assignment
1. What are the challenges involved in developing compilers?
2. Enumerate the essential qualities of a compiler.
3. Distinguish between the function of loader and a linker.
4. Outline some specific knowledge require for building a compiler.
4.0 Conclusion/Summary
In this study session, you have been taken through the definition, functions, and
architecture of a compiler
You also learnt that:
- A translator is a program that takes as input a program written in one
programming language ( the source language) and produces as output a
program in another language (the object or target language)
- A compiler is a program that translates a source program written in some
high-level programming language (such as Java) into machine code for
some computer architecture (such as the Intel Pentium architecture)
- An interpreter reads an executable source program written in a high-level
programming language as well as data for this program, and it runs the
program against the data to produce some results
- Translators are needed to free the programr from the hassles of
programming in machine language and its attendant problems
- There are several challenges to be surmounted in developing a compiler.
In the next study session, you will be learning more about the structure of a
compiler and the various phases involved in compilation process.
57
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=OVTu4XcmnwE. Watch
the video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=VDslRumKvRAand critique it in the
discussion forum
c. Take a walk and engage any 3 students on the similarities and differences
between Translators, Compilers and Interpreters. In 2 paragraphs summarise
their opinion of the discussed topic. etc.
58
8.0 References/Further Reading
Aho, A. V. & Ullman, J. D. (1977). Principles of Compiler Design.
Addison-Wesley Publishing Company. ISBN 0-201-00022-9.
Chomsky, Noam & Schützenberger, Marcel P. (1963). " The
algebraic Theory of Context free Languages". In: P. Braffort
& D. Hirschberg. Computer Programming and Formal
Languages. Amsterdam: North Holland. pp. 118–161.
Davis, Martin E., Sigal, Ron & Weyuker, Elaine J. (1994). Computability,
Complexity, and Languages: Fundamentals of Theoretical Computer
Science. Boston: Academic Press,Harcourt, Brace. pp. 327. ISBN
0122063821.
Grzegorz Rozenberg and Arto Salomaa (1997). Handbook of Formal
Languages. Volume I-III. Springer. ISBN 3 540 61486 9.
John, E. Hopcroft & Jeffrey D. Ullman (1979). Introduction toAutomata
Theory, Languages, and Computation. Addison-Wesley Publishing,
Reading Massachusetts. ISBN 0-201-029880-X.
Michael, A. Harrison(1978). Introduction to Formal Language Theory,
Addison-Wesley.
59
STUDY SESSION 3
The Structure of a Compiler
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 -The Structure of a Compiler
2.2- Phases of a Compiler
2.2.1 – The Lexical Analyser
2.2.2 – The Syntax Analyser
2.2.3 – The Intermediate Code Generator
2.2.4 – Code Optimisation
2.2.5 – Code Generation
2.2.6 – The Table Management or Bookkeeping
2.2.7 – The Error Handler
2.3 - Passes
2.4 - Reducing the Number of Passes
2.4 - Cross Compilation
2.4 - T – Diagrams
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
60
Introduction:
In the previous session, you were introduced to the concept of compilers and
their roles in programming.
In this session, you will learn about the structure of a compiler and the various
phases involved in the compilation process.
61
3. Tables of Information: It includes the symbol-table and there are some
other tables that provide information during compilation process.
4. Run-Time Library: It is used for run-time system support.
Languages for Writing Compiler
(a) Machine language
(c) High level language or high level language with bootstrapping facilities
62
instance of token ―identifier‖ is passed along with the integer code for
―identifier‖. For Example, in the FORTRAN statement:
IF (5 .EQ. MAX) GO TO 100
We find the following eight tokens: IF; (; 5; .EQ; MAX; ); GOTO; 100.
63
2.2.3 The Intermediate Code Generator
This uses the structure produced by the syntax analyser to create a stream of
simple instructions. Many styles of intermediate code are possible. One
common style uses instructions with one operator and a small number of
operands. These instructions with one operator and a small number of operands.
These instructions can be viewed as simple macros like the macro ADD2. The
primary difference between intermediate code and assembly code is that the
intermediate code need not specify the registers to be used for each operation.
2.2.4 Code Optimization
This is an optional phase designed to improve the intermediate code so that the
ultimate object program runs faster and/or takes less space. Its output is another
intermediate code program that does the same job as the original, but perhaps in
a way that saves time and/or space.
2.2.5 Code Generation
This is the final phase and it produces the object code by deciding on the
memory locations for data, selecting code to access each datum, and selecting
the registers in which each computation is to be done. Designing a code
generator that produces truly efficient object programs is one of the most
difficult parts of a compiler design, both practically and theoretically.
2.2.6 The Table Management or Bookkeeping
This portion of the compiler keeps track of the names used by the program and
records essential information about each, such as its type (integer, real, etc.).
The data structure used to record this information is called a symbol table.
2.2.7 The Error Handler
This is invoked when a flaw in the source program is detected. It must warn the
program by issuing a diagnostic, and adjust the information being passed from
phase to phase so that each phase can proceed. It is desirable that compilation
be completed on flawed programs, at least through the syntax-analysis phase;
we can detect many errors as possible in one compilation. Both the table
64
management and error handling routines interact with all phases of the
compiler.
In-text Question 1
Which is the first phase of the compiler?
Answer
Lexical Analysis
2.3 Passes
In an implementation of a compiler, portions of one or more phases are
combined into a module called a pass. A pass reads the source program or the
output of the previous pass, makes the transformations specified by its phases,
and writes output into an intermediate file, which may then be read by a
subsequent pass. If several phases are grouped into one pass, then the operation
of the phases may be interleaved, with control alternating among several phases.
The numbers of passes, and the grouping of phases into passes, are usually
dictated by a variety of considerations germane to a particular language and
machine, rather than by any mathematical optimality criteria.
The structure of the source language has strong effect on the number of passes.
Certain languages require at least two passes to generate code easily. For
example, languages such as PL/I or ALGOL 68 allow the declaration of a name
to occur after uses of that name. Code for expression containing such a name
cannot be generated conveniently until the declaration has been seen.
2.4 Reducing the Number of Passes
Since each phase is a transformation on a stream of data representing an
intermediate form of the source program, you may wonder how several phases
can be combined into one pass without the reading and writing of intermediate
files. In some cases, one pass produces its output with little or no memory of
prior inputs. Lexical analysis is typical. In this situation, a small buffer serves as
the interface between passes. In other cases, you may merge phases into one
pass by means of a technique known as ‗back patching‘. In general term s, if the
65
output of a phase cannot be determined without looking at the remainder of the
phase‘s input, the phase can generate output with ‗ slots‘ which can be filled in
later, after more of the input is read.
2.5 Cross Compilation
Consider porting a compiler for C written in C from an existing machine A to
an existing machine B.
Steps to Cross Compilation
(a) Write new back-end in C to generate code for computer B
(b) Compile the new back-end and using the existing C compiler running on
computer A generating code for computer B.
(c) We now have a compiler running on computer A and generating code for
computer B.
(d) Use this new compiler to generate a complete compiler for computer B.
In other words, we can compile the new compiler on computer A to
generate code for computer B
(e) We now have a complete compiler for computer B that will run on
computer B.
(f) Copy this new compiler across and run it on computer B (this is cross
Compilation).
In-text Question 2
What is the job of the Syntax Analyser?
Answer
This groups tokens together into syntactic structures.
66
- The lexical analysis ( or scanning)
- The syntax analysis Front-end
- Semantic analysis
In this course, we will concern ourselves with mostly the front-end i.e. those
parts of compilation that can be automated which are lexical, syntax and
probably semantic analyses.
2.7 T-Diagrams
They are used to describe the nature of a compilation. It is usually in the form of
T and is diagrammatically represented as in figure 2 below:
Fig. 1.3.3:T-Diagrams
4.0 Conclusion/Summary
In this study session, you have been taken through the structure, phases and
functions of each phase of a compiler.
You also learnt that:
- The compilation process can be partitioned into a series of sub-
processes called phases
- Several phases can be grouped into one pass, so that the operation of
the phases may be interleaved, with control alternating among several
phases
- The numbers of passes, and the grouping of phases into passes, are
usually dictated by a variety of considerations germane to a particular
language and machine
- The operations of a compiler can be classified into two: front-end
comprising the lexical analysis (or scanning), the syntax analysis, and
semantic analysis and the back-end comprising of intermediate code
optimization, code generation, and code optimization.
In the next study session, you will be learning more about the compilation
process; such as cross compilation and hand implementation.
68
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=Qkwj65l_96I. Watch the
video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=OZSGjRSHYUkand critique it in the
discussion forum
c. Take a walk and engage any 3 students on front end and backend of a
compiler ; In 2 paragraphs summarise their opinion of the discussed topic. etc.
69
MODULE 2
Lexical Analysis
Contents:
Study Session 1: The Scanner
Study Session 2: Hand Implementation of Lexical Analyser
Study Session 3: Automatic Generation of Lexical Analyser
Study Session 4: Implementing a Lexical Analyser
STUDY SESSION 1
The Scanner
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1- Role of the Lexical Analyser
2.2- The Need for Lexical Analyser
2.3- The Scanner
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
As you learnt in study session 3 of the previous module, the function of the
lexical analyser is to read the source program, one character at a time, and
translate it into a sequence of primitive units called tokens. Keywords,
identifiers, constants, and operators are examples of tokens.
70
This module, starting from this session, exposes you to the problem of
designing and implementing lexical analysers.
71
more efficient recogniser can be constructed for tokens than for syntactic
structures.
By including certain constructs in the lexical rather than the syntactic structure,
we can greatly simplify the design of the syntax analyser.
Lexical analyser also performs other functions such as keeping track of line
numbers, producing an output listing if necessary, stripping out white space
(such as redundant blanks and tabs), and deleting comments.
In-text Question 1
What is the advantage of implementing the lexical analyser and the parser in the same pass?
Answer:
It eliminates the need for the intermediate file.
=
id(x)
*
(
id(b)
+
num(1)
)
;
where id(x) indicates the identifier with name x (a program variable in this case)
and num(1) indicates the integer 1. Each time the parser needs a token, it sends
a request to the scanner. Then, the scanner reads as many characters from the
input stream as it is necessary to construct a single token. The scanner may
72
report an error during scanning (e.g. when it finds an end-of-file in the middle
of a string). Otherwise, when a single token is formed, the scanner is suspended
and returns the token to the parser. The parser will repeatedly call the scanner to
read all the tokens from the input stream or until an error is detected (such as a
syntax error).
Tokens are typically represented by numbers. For example, the token * may be
assigned number 35. Some tokens require some extra information. For example,
an identifier is a token (so it is represented by some number) but it is also
associated with a string that holds the identifier name. For example, the token
id(x) is associated with the string, "x". Similarly, the token num(1) is associated
with the number, 1. Tokens are specified by patterns, called regular
expressions. For example, the regular expression [a-z][a-zA-Z0-9]* recognises
all identifiers with at least one alphanumeric letter whose first letter is lower-
case alphabetic.
A typical scanner:
- Recognises the keywords of the language (these are the reserved words
that have a special meaning in the language, such as the word class in
Java);
- Recognises special characters, such as ( and ), or groups of special
characters, such as := and ==;
- Recognises identifiers, integers, reals, decimals, strings, etc;
- Ignores whitespaces (tabs and blanks) and comments;
- Recognises and processes special directives (such as the #include "file"
directive in C) and macros.
A key issue is speed. One can always write a naive scanner that groups the input
characters into lexical words (a lexical word can be either a sequence of
alphanumeric characters without whitespaces or special characters, or just one
special character), and then tries to associate a token (i.e. number, keyword,
identifier, etc.) to this lexical word by performing a number of string
73
comparisons. This becomes very expensive when there are many keywords
and/or many special lexical patterns in the language. In this study session you
will learn how to build efficient scanners using regular expressions and finite
automata. There are automated tools called scanner generators, such as flex for
C and JLex for Java, which construct a fast scanner automatically according to
specifications (regular expressions). You will first learn how to specify a
scanner using regular expressions, then the underlying theory that scanner
generators use to compile regular expressions into efficient programs (which are
basically finite state machines), and then you will learn how to use a scanner
generator for Java, called JLex
In-text Question 2
A scanner groups input characters into tokens. True or False
Answer
True
4.0 Conclusion/Summary
In this study session, you have been taken through the fundamental concepts
and importance of the lexical analyser.
You also learnt that:
- the lexical analyser could be a separate pass, placing its output on an
intermediate file from which the parser would then take its input
- when the lexical analyser and the parser are together in the same pass, the
lexical analyser acts as a subroutine or co-routine, which is called by the
parser whenever it needs a new token
74
- the analysis of the source program is usually split into two phases, lexical
analysis and syntactic analysis, to simplify the overall design of the
compiler
- a scanner groups input characters into tokens
- the key issue in the design of a scanner is speed.
In the next study session, you will be learning more about the workings and
implementation of a lexical analyser.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=WccZQSERfCM. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=VGgIZl5WjH0
and critique it in the discussion forum
c. Take a walk and engage any 3 students on why we need a Lexical Analyser;
In 2 paragraphs summarise their opinion of the discussed topic. etc.
75
8.0 References/Further Reading
Aho, A. V. & Ullman, J. D. (1977). Principles of Compiler Design.
Addison-Wesley Publishing Company. ISBN 0-201-00022-9
Chomsky, Noam & Schützenberger, Marcel P. (1963). " The
algebraic Theory of Context Free Languages." In: P. Braffort
& D. Hirschberg. Computer Programming and Formal
Languages. Amsterdam: North Holland. pp. 118–161.
Davis, Martin E., Sigal, Ron & Weyuker, Elaine J. (1994).
Computability, Complexity, and Languages: Fundamentals
of Theoretical Computer Science. Boston: Academic
Press,Harcourt, Brace. pp. 327. ISBN 0122063821.
Grzegorz Rozenberg & Arty Salomaa (1997). Handbook of
FormalLanguages: Volume I-III, Springer. ISBN 3 540
61486 9.
John, E. Hopcroft & Jeffrey D. Ullman (1979). Introduction
toAutomata Theory, Languages, and Computation. Addison-
Wesley Publishing, Reading Massachusetts. ISBN 0-201-
029880-X.
Michael, A. Harrison (1978). Introduction to Formal Language Theory.
Addison-Wesley.
76
STUDY SESSION 2
Hand Implementation of Lexical Analyser
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Lexical Analysis
2.2- Construction of Lexical Analyser
2.2.1- Hand Implementation
2.2.1.1- Input Buffering
2.2.1.2- Transition Diagram (TD)
2.2.1.3- How to Handle Keywords
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
Having learnt about the role and need of a lexical analyser in the previous study
session, we will now proceed to discuss various methods of constructing and
implementing a lexical analyser.
77
4. State the problems with hand implementation method of constructing
lexical analysers
5. Construct transition diagrams to handle keywords, identifiers and
delimiters.
78
2.2.1.1 Input Buffering
The lexical analyser scans the characters of the source program one at a time to
discover tokens. Often, however, many characters beyond the next token may
have to be examined before the next token itself can be determined. For this and
other reasons, it is desirable for the lexical analyser to read its input from an
input buffer. There are many schemes that can be used to buffer input but we
shall discuss only one of these schemes here. Figure 1shows a buffer divided
into two halves of, say, 100 characters. One pointer marks the beginning of the
token being discovered. We view the position of each pointer as been between
the character last read and the character next to be read. In practice, each
buffering scheme adopts one convention; either a pointer is at the symbol last
read or the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual
token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2... ARGn)
- without knowing whether DECLARE is a keyword or an array name until
we see the character that follows the right parenthesis. In either case, the
token itself ends at the second E. If the lookahead pointer travels beyond
the buffer half in which it began, the other half must be loaded with the
next characters from the source.
Since the buffer of Figure 1 is of limited size, there is an implied constraint on
how much lookahead can be used before the next token is discovered. For
79
example, in Figure 1, if the lookahead travelled to the left half and all the way
through the left half to the middle, we could not reload the right half, because
we would lose characters that had not yet been grouped into tokens. While we
can make the buffer larger if we choose or use another buffering scheme, we
cannot ignore the fact that lookahead is limited.
By using buffer, we mean that you read part of the text into the buffer
(temporary storage) and then begin to scan by using get character (GETCHAR)
to scan and you form the token as you go along.
e.g.
real x, y, z
integer r, q
read (6, 7) (a (i), I = 1, 20)
You can prepare a buffer of 80 characters and read the first line to it and begin
to scan, and then go to the next line.
Problems associated with hand implementation
Where the language allows keyword to be used as an identifier e.g. DC: in PL2
Where the language does not recognise blanks as a separator, you will not stop
scanning until you reach a comma. Like in FORTRAN: using hand
implementation for real x, y, z; you take ―realx‖ as an identifier.
In-text Question 1
What is a token?
Answer:
A token is the smallest unit recognisable by the compiler.
80
position in a flowchart is a valuable too, so much so that a specialised kind of
flowchart for lexical analysers, called transition diagram, has evolved. In a TD,
the boxes of the flowchart are drawn as circles and called states. The states are
connected by arrows, called edges. The labels onthe various edges leaving a
state indicate the input characters that can appear after that state.
In TD, we try to construct a TD for each token, and then link up.
For example, the TD for keywords and identifiers are the same. See in Figure 2
below an automation you can use for an identifier.
81
Fig. 2.2.4: Transition diagram for keywords
NOTE:
Here, we make the keywords the basis of our search whereas in the first
method, we make the identifiers the basis of our search.
We are also assuming here that the keyword cannot be the beginning of an
identifier. Therefore, if it sees something like ―THENPAP‖, it will break it
into two like ―THEN PAP‖ because you are not allowed to use keywords as
the beginning of an identifier.
For delimiters especially relational operators, you can also construct a TD like
the one above.
82
In-text Question 2
The lexical analyser scans the characters of the source program many at a time to discover
tokens. True or False?
Answer
False
4.0 Conclusion/Summary
In this study session, you have been taken through hand implementation method
of lexical analyser construction. This method can be carried out in two ways:
input buffer and transition diagram. Due to the attendant problems of this
method, there are now tools that are used in automatic generation of lexical
analyser. This will be discussed in later study session of this module.
You also learnt that:
- the lexical analyser could be constructed by using hand implementation
or automatic generation
- hand implementation can be done by input buffering or transition diagram
- hand implementation has some attendant problems
- using TD, keywords can be handled two ways.
83
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=XjffJ6_B6Vs. Watch the
video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=C7NykE7PS9Aand critique it in the
discussion forum
c. Take a walk and engage any 3 students on the two ways to construct a
Lexical Analyser; In 2 paragraphs summarise their opinion of the discussed
topic. etc.
84
Davis, Martin E., Sigal, Ron & Weyuker, Elaine J. (1994). Computability,
Complexity, and Languages: Fundamentals of Theoretical Computer
Science. Boston: Academic Press,Harcourt, Brace. pp. 327. ISBN
0122063821.
John, E. Hopcroft & Jeffrey, D. Ullman (1979). Introduction toAutomata
Theory, Languages, and Computation, Addison-Wesley Publishing,
Reading Massachusetts. ISBN 0-201-029880-X.
85
STUDY SESSION 3
Automatic Generation of Lexical Analyser
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Automatic Generation of Lexical Analyser
2.2- Language Theory Background
2.2.1- Definitions
2.2.2- Operations on Strings
2.2.3- Operations on Languages
2.3- Regular Expressions (REs)
2.3.1- Definition of a Regular Expression and the Language it
denotes
2.3.1.1- Basis
2.3.1.2- Introduction
2.3.1.3- Extensions of Regular Expressions
2.3.3- Lex Regular Expressions
2.4- Tokens/Patterns/Lexemes/Attributes
2.5- Languages for Specifying Lexical Analyser
2.5.1- Specifying a Lexical Analyser with LEX
2.5.2- LEX Source
2.5.3- Steps in LEX Implementation
2.5.4- Sample LEX programs
2.5.5- Creating a Lexical Processor with LEX
2.5.6- LEX History
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
86
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous study session, you learnt about hand implementation and the two
ways it can be carried out. You also learnt about the problems with hand
implementation.
In this study session, we will discuss regular expressions and the automatic
generation of lexical analysers.
87
2.2 Language Theory Background
2.2.1 Definitions
Symbol: (character, letter)
Alphabet: a finite nonempty set of characters. E.g. {0, 1}, ASCII,Unicode
String (sentence, word): a finite sequence of characters, possibly empty.
Language: a (countable) set of strings, possibly empty.
2.2.2 Operations on Strings
i. Concatenation
ii. Exponentiation
(a) x0is the empty stringε.
(b) xi= xi-1x, for i > 0
iii. prefix, suffix, substring, subsequence
2.2.3 Operations on Languages
i. Union
ii. Concatenation
iii. Exponentiation
(a) L0is {ε}, even when L is the
empty set.
(b) Li= Li-1L, for i > 0 Fig 2.3.1: Operations on Languages
88
Many of today's programming languages use regular expressions to match
patterns in strings. E.g., awk, flex, lex, java, javascript, perl, python.
2.3.1 Definition of a Regular Expression and the Language It Denotes
2.3.1.1 Basis
- ε is a regular expression that denotes { ε }.
- A single character a is a regular expression that denotes { a }.
2.3.1.2 Induction
Suppose r and s are regular expressions that denote the languages L(r) and L(s):
- (r)|(s) is a regular expression that denotes L(r) L(s)
- (r)(s) is a regular expression that denotes L(r)L(s)
- (r)* is a regular expression that denotes L(r)*
- (r) is a regular expression that denotes L(r).
We can drop redundant parenthesis by assuming:
- the Kleene star operator * has the highest precedence and is left
associative
- concatenation has the next highest precedence and is left associative
- the union operator | has the lowest precedence and is left associative E.g.,
under these rules r|s*t is interpreted as (r)|((s)*(t)).
2.3.1.3 Extensions of Regular Expressions
- Positive closure: r+ = rr*
- Zero or one instance: r? = ε | r
- Character classes:
(a) [abc] = a | b | c
(b) [0-9] = 0 | 1 | 2 | … | 9
We can freely put parentheses around REs to denote the order of evaluation.
For example, (a| b)c. To avoid using many parentheses, we use the following
rules: concatenation and alternation are associative (i.e. ABC means (AB)C and
is equivalent to A(BC)), alternation is commutative (i.e., A| B = B| A), repetition
89
is idempotent (i.e., A** = A*), and concatenation distributes over alternation (eg,
(a| b)c = ac| bc).
For convenience, we can give names to REs so we can refer to them by their
name. For example:
for – keyword = For
Letter = [a - zA - Z]
Digit = [0 - 9]
Identifier = letter (letter | digit)*
=
Sign +|-|
Integer = sign (0 | [1 - 9]digit*)
Decimal = integer . digit*
= (integer | decimal ) E sign
Real digit+
There is some ambiguity though: If the input includes the characters for8, then
the first rule (for for-keyword) matches 3 characters (for), the fourth rule (for
identifier) can match 1, 2, 3, or 4 characters, the longest being for8. To resolve
this type of ambiguities, when there is a choice of rules; scanner generators
choose the one that matches the maximum number of characters. In this case,
the chosen rule is the one for identifier that matches 4 characters (for8). This
disambiguation rule is called the longest match rule. If there are more than one
rule that match the same maximum number of characters, the rule listed first is
chosen. This is the rule priority disambiguation rule. For example, the lexical
word for is taken as a for-keyword even though it uses the same number of
characters as an identifier.
Today regular expressions come in many different forms.
(a) The earliest and simplest are the Kleene regular expression
(b) Awk and egrep extended grep's regular expressions with union and
parentheses.
90
(c) POSIX has a standard for Unix regular expressions.
(d) Perl has an amazingly rich set of regular expression operators.
(e) Python uses pcre regular expressions.
\\ matches \.
Examples of Lex regular expressions and the strings they match are:
1. "a.*b" matches the string a.*b.
2. . matches any character except a newline.
3. ^ matches the empty string at the beginning of a line.
4. $ matches the empty string at the end of a line.
5. [abc] matches an a, or a b, or a c.
6. [a-z] matches any lowercase letter between a and z.
7. [A-Za-z0-9] matches any alphanumeric character.
8. [^abc] matches any character except an a, or a b, or a c.
9. [^0-9] matches any nonnumeric character.
10.a* matches a string of zero or more a's.
11.a+ matches a string of one or more a's.
12.a? matches a string of zero or one a's.
13.a{2,5} matches any string consisting of two to five a's.
91
(a) matches an a.
14.a/b matches an a when followed by a b.
15.\n matches a newline.
16.\t matches a tab.
Lex chooses the longest match if there is more than one match. E.g., ab*
matches the prefix abb in abbc.
2.4 Tokens/Patterns/Lexemes/Attributes
A token is a pair consisting of a token name and an optional attribute value.
e.g., <id, ptr to symbol table>, <=>
Some regular expressions corresponding to tokens:
- For the token keyword, we can say
- Keyword = BEGIN END IF THEN ELSE
- Identifier = letter (letter digit)*
- Constant = digit*
- Relop = <<= <>>>=
A pattern is a description of the form that the lexemes making up a token in a
source program may have. We will use regular expressions to denote patterns.
e.g., identifiers in C: [_A-Za-z][_A-Za-z0-9]*
A lexeme is a sequence of characters that matches the pattern for a token, e.g.,
- identifiers: count, x1, i, position
- keywords: if
- operators: =, ==, !=, +=
An attribute of a token is usually a pointer to the symbol table entry that gives
additional information about the token, such as its type, value, line number, etc.
2.5 Languages for Specifying Lexical Analyser
There are tools that can generate lexical analyzers. An example of such tools is
LEX which is discussed in the next section. You are to read about other tools in
your further reading.
92
In-text Question 1
What is a lexeme?
Answer:
A lexeme is a sequence of characters that matches the pattern for a token
93
Fig. 2.3.2:The role of LEX
The input to the LEX compiler is called LEX source and the output of LEX
compiler is called lexical analyser i.e. the LEX compiler generate lexical
analyser.
94
2. The Translation Rules: these are of the form:
P1{A1}
P2 {A2}
.
.
.
Pm {Am}
Where each Pi is a regular expression called Pattern over the alphabet consisting
of sigma (∑) and the auxiliary definition names. The patterns described the form
of the tokens.
Each Ai is a program segment describing what action the lexical analyser should
take when token Pi is found.
To create the lexical analyser ‗ L‘, each of the A is, must be compile into
machine codes just like any other program written in the language of the Ais.
In summary, the lexical analyser L created by LEX behaves in the following
manner:
- L reads its input one character at a time until it has found thelargest prefix
in the input which matches one of the regular expressions Pi.
- Once L has found that prefix, L removes it from the input and places it in
a buffer called token. L then executes the actions Ai. After completing Ai,
L returns control to the parser.
- When requested to, L repeats these series of actions on the remaining
input.
95
5. Generate parsing tables & code
%%
[A-Za-z]+ { printf("%s\n", yytext); }
.|\n {}
The pattern part of the first translation rule says that if the current prefix of the
unprocessed input stream consists of a sequence of one or more letters, then the
longest such prefix is matched and assigned to the Lex string variable yytext.
The action part of the first translation rule prints the prefix that was matched. If
this rule fires, then the matching prefix is removed from the beginning of the
unprocessed input stream.
The dot in pattern part of the second translation rule matches any character
except a newline at the beginning of the unprocessed input stream. The \n
matches a newline at the beginning of the unprocessed input stream. If this rule
fires, then the character of the beginning of the unprocessed input stream is
removed. Since the action is empty, no output is generated.
Lex repeatedly applies these two rules until the input stream is exhausted.
Example 2: Lex program to print number of words, numbers, and lines in a file
int num_words = 0, num_numbers = 0, num_lines = 0;
word [A-Za-z]+
number [0-9]+
%%
{word} {++num_words;}
96
{number} {++num_numbers ;}
\n {++num_lines;}
. {}
%%
int main()
{
yylex();
In-text Question 2
List the parts that a LEX source program consists of …
Answer
i. A sequence of auxiliary definitions
ii. A sequence of translation rules
97
/* regular definitions */
delim [ \t\n]
Ws {delim}+
Lette
r [A-Za-z]
Digit [0-9]
Id {letter}({letter}|{digit})*
{digit}+(\.{digit}+)?(E[+-
number ]?{digit}+)?
%%
{ws} { }
If {return(IF);}
Else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID);}
%%
int installID()
{
/* function to install the lexeme, whose first character is pointed
to by yytext, and whose length is yyleng, into the symbol table;
returns pointer to symbol table entry */
}
int installNum() {
98
/* analogous to installID */
}
99
1. Construct Lex-style regular expressions for the following patterns.
(a) All lowercase English words beginning and ending with the substring
"ad".
(b) All lowercase English words in which the letters are in strictly
increasing alphabetic order.
(c) Strings of the form abxba where x is a string of a‘s, b‘s, and c‘s that
does not contain ba as a substring.
2. Write a Lex program that converts a file of English text to "Pig Latin."
Assume the file is a sequence of words separated by whitespace. If a
word begins with a consonant, move the consonant to the end of the word
and add "ay". (E.g., "pig" gets mapped into "igpay".) If a word begins
with a vowel, just add "ay" to the end. (E.g., "art" gets mapped to
"artay".)
3. Describe a language (different from the one discussed in this course) for
specifying Lexical analyser.
4. How is the language described in question (3) above similar or different
from LEX.
4.0 Conclusion/Summary
In this study session, you have been taken through the concept of regular
expressions, and the languages that are used to automatically generate lexical
analysers.
You also learnt that:
- regular expressions are a very convenient form of representing (possibly
infinite) sets of strings
- parentheses can be put around REs to denote the order of evaluation
- lexical analyser generators Flex and Lex use extended regular expressions
to specify lexeme patterns making up tokens
- Lex chooses the longest match if there is more than one match
100
- a pattern is a description of the form that the lexemes making up a token
in a source program may have
- a lexeme is a sequence of characters that matches the pattern for a token
- an attribute of a token is usually a pointer to the symbol table entry that
gives additional information about the token, such as its type, value, line
number, etc
- Lex is a special-purpose programming language for creating programs to
process streams of input characters
- the input to the LEX compiler is called LEX source and the output of
LEX compiler is called lexical analyser LEX source is an input that states
the characteristics of the lexical analyser for the language
- LEX source differs from language to language
- the LEX source program consists of two parts: A sequence of auxiliary
definitions followed by a sequence of translation rules
- there are several tools for automatically generating lexical analysers of
which LEX is one.
In the next study session you will be taken through how to convert regular
expressions to non-deterministic finite automata (NFA) and deterministic finite
automata (DFA).
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=xnolCgyx2LM. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=Q4AXO9S2E3E
101
and critique it in the discussion forum
c. Take a walk and engage any 3 students on steps in Lex implementation; In 2
paragraphs summarise their opinion of the discussed topic. etc.
102
7.0 Self-Assessment Question Answers
1. All strings over the input alphabets {a, b} that contain any number of a‘s
and/or b‘s.
2. All strings over the input alphabets {a, b} that must start and end with an
‗a‘.
8.0 References/Further Reading
Alfred, V. Aho et al. (2007). Compilers: Principles, Techniques,
andTools. Second Edition. Pearson Addison-Wesley.
Andrew, W. Appel (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Keith, D. Cooper & Linda Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Steven, S. Muchnick (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
Michael, L. Scott (2009). Programming Language Pragmatics.
Third Edition. Morgan Kaufman.
Robert, W. Sebesta (2010). Concepts of Programming Languages. Ninth
Edition. Addison-Wesley.
103
STUDY SESSION 4
Implementing a Lexical Analyser
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Finite Automata
2.1.1- Nondeterministic Finite Automaton (NFA)
2.1.2- Deterministic Finite Automata (DFAs)
2.2- Equivalence of Regular Expressions and Finite Automata
2.3- Converting a Regular Expression to an NFA
2.4- Converting a Regular Expression into a Deterministic Finite
Automaton
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous session, you were taken through automatic generation of lexical
analyser. The importance of regular expressions in the generation of lexical
analysers was brought out.
In this study session, we will discuss finite state machines, which are the
recognisers that recognise regular expressions and also how to convert REs into
finite automata and vice versa.
104
1.0 Study Session Learning Outcomes
After studying this session, I expect you to be able to:
1. Define finite automata
2. Convert REs to NFA
3. Convert NFA to DFA.
105
2.1.2 Deterministic Finite Automata (DFAs)
A deterministic finite automaton (DFA) is an NFA in which:
- there are no ε moves, and
- for each state s and input symbol a there is exactly one transition out of s
labelled a.
Therefore, a DFA represents a finite state machine that recognises a RE.
For example, the following DFA:
106
But it is very common to ignore state 0 (called the error state) since it is
implied. (The arrows with two or more characters indicate transitions in case of
any of these characters.) The error state serves as a black hole, which does not
let you escape.
A DFA is represented by a transition table T, which gives the next state T[s, c]
for a state s and a character c. For example, the T for the DFA above is:
The corresponding DFA has 4 final states: one to accept the for-keyword and 3
to accept an identifier:
107
(The error state is omitted again). Notice that for each state and for each
character, there is a single transition.
A scanner based on a DFA uses the DFA's transition table as follows:
state = initial_state;
current_character = get_next_character();
while ( true )
{ next_state = T[state,current_character];
if (next_state == ERROR)
break;
state = next_state;
if ( is_final_state(state) )
`we have a valid token'
else `report an error'
This program does not explicitly take into account the longest match
disambiguation rule since it ends at EOF. The following program is more
general since it does not expect EOF at the end of token but still uses the longest
match disambiguation rule.
state = initial_state;
final_state = ERROR;
current_character = get_next_character();
108
while ( true )
{ next_state = T[state,current_character];
if (next_state == ERROR)
break;
state = next_state;
if ( is_final_state(state) )
final_state = state;
current_character = get_next_character();
if (current_character == EOF)
break;
};
if ( final_state == ERROR )
`report an error'
...
if ( current_character == 'c' )
goto s2;
109
...
s2: current_character = get_next_character();
Answer
A recogniser for a language L is a program that takes as input a string x and answers “yes”
if x is a sentence of L and “no” otherwise.
110
Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have
the same expressive power. The problem is that when converting a NFA to a
DFA we may get an exponential blow-up in the number of states.
We will first learn how to convert a RE into a NFA. This is the easy part. There
are only 5 rules, one for each type of RE:
As it can been shown inductively, the above rules construct NFAs with only one
final state. For example, the third rule indicates that, to construct the NFA for
the RE AB, we construct the NFAs for A and B, which are represented as two
boxes with one start state and one final state for each box. Then the NFA for AB
is constructed by connecting the final state of A to the start state of B using an
empty transition.
111
For example, the RE (a| b)c is mapped to the following NFA:
The algorithm below can be used to convert the TD above into a finite state
automaton.
112
Algorithm:
State 0: C:= GETCHAR()
If LETTER(C) then go to state 1
Else FAIL ()
State 1: C: = GETCHAR ()
If LETTER(C) or DIGIT(C) then go to state 1
Else FAIL ()
2.4 Converting a Regular Expression into a Deterministic Finite
Automaton
To convert a RE into a DFA, you first convert the RE into an NFA as discussed
in section 3.3. The next step is to convert the NFA to a DFA (called subset
construction). Suppose that you assign a number to each NFA state. The DFA
states generated by subset construction have sets of numbers, instead of just one
number. For example, a DFA state may have been assigned the set {5, 6, 8}.
This indicates that arriving to the state labelled {5, 6, 8} in the DFA is the same
as arriving to the state 5, the state 6, or the state 8 in the NFA when parsing the
same input (Recall that a particular input sequence when parsed by a DFA,
leads to a unique state, while when parsed by a NFA it may lead to multiple
states).
First, we need to handle transitions that lead to other states for free (without
consuming any input). These are the transitions. We define the closure of a
NFA node as the set of all the nodes reachable by this node using zero, one, or
more transitions. For example, the closure of node 1 in the left Figure 7 below.
113
Fig. 2.4.7:Closure of NFA
- is the set {1, 2}. The start state of the constructed DFA is labelled by the
closure of the NFA start state. For every DFA state labelled by some set
{s1,..., sn} and for every character c in the language alphabet, you find all
the states reachable by s1, s2, ..., or sn using c arrows and you union
together the closures of these nodes. If this set is not the label of any other
node in the DFA constructed so far, you create a new DFA node with this
label. For example, node {1, 2} in the DFA above has an arrow to a {3, 4,
5} for the character a since the NFA node 3 can be reached by 1 on a and
nodes 4 and 5 can be reached by 2. The b arrow for node {1, 2} goes to
the error node which is associated with an empty set of NFA nodes.
The following NFA recognises (a| b)*(abb | a+b), even though it was not
constructed with the above RE-to-NFA rules. It has the following DFA:
114
In-text Question 2
What are the steps in converting a Regular Expression into a Deterministic Finite
Automaton?
Answer:
i. Convert the RE into an NFA
ii. Convert the resulting NFA to a DFA
4.0 Conclusion/Summary
In this study session, you have been taken through the concept of finite
automata, equivalence of REs and finite automata and also how to convert REs
into NFAs and subsequently DFAs.
You also learnt that:
- a DFA represents a finite state machine that recognises a RE
115
- a finite automaton consists of a finite set of states, a set of transitions
(moves), one start state, and a set of final states (accepting states). In
addition, a DFA has a unique transition for every state-character
combination
- to covert a RE into a DFA, you will first have to convert it into a non-
deterministic finite automaton (NFA) and then convert the NFA into a
DFA.
- An NFA is similar to a DFA but it also permits multiple transitions over
the same character and transitions over ε
- there are several tools for automatically generating lexical analysers of
which LEX is one.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=Qa6csfkK7_I. Watch the
video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=RxfXyvfTsgQ
and critique it in the discussion forum
c. Take a walk and engage any 3 students on steps in converting an NFA to
DFA; In 2 paragraphs summarise their opinion of the discussed topic. etc.
116
8.0 References/Further Reading
Aho, A. V. & Ullman, J. D. (1977). Principles of Compiler Design.
Addison-Wesley Publishing Company. ISBN 0-201-00022-9.
Alfred, V. Aho et al. (2007). Compilers: Principles, Techniques, and
Tools. Second Edition. Wesley: Pearson Addison.
Andrew, W. Appel (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Davis, Martin E., Sigal, Ron & Weyuker, Elaine J. (1994).
Computability, Complexity, and Languages: Fundamentals of Theoretical
Computer Science. Boston: Academic Press,Harcourt, Brace. pp. 327.
ISBN 0122063821.
John, E. Hopcroft & Jeffrey, D. Ullman (1979). Introduction toAutomata
Theory, Languages, and Computation. ReadingMassachusetts :Addison-
Wesley Publishing. ISBN 0-201-029880-X.
Keith, D. Cooper& Linda, Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Michael, L. Scott (2009). Programming Language Pragmatics.
Third Edition. Morgan Kaufman.
Robert, W. Sebesta (2010). Concepts of Programming Languages. Ninth
Edition. Wesley: Addison.
Steven, S. Muchnick (1997). Advanced Compiler Designand
Implementation. Morgan Kaufmann.
117
MODULE 3
Syntax Analysis
Contents:
Study Session 1: Context-Free Grammars
Study Session 2: Bottom-Up Parsing Techniques
Study Session 3: Precedence Parsing
Study Session 4: Top-Down Parsing Techniques
Study Session 5: LR Parsers
STUDY SESSION 1
Context-Free Grammars
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - The Parse
2.1.1- Role of the Parser
2.2- Context-Free Grammars (CFG's)
2.2.1- Notational Conventions
2.3- Derivations and Parse Trees
2.3.1- Derivations
2.3.2- Parse Trees
2.4- Ambiguity
2.5- Verifying Grammars
2.5.1- Immediate Left Recursion
2.5.2- Indirect Left Recursion
2.5.3- Removing Left Recursion
2.5.3.1- Removing Immediate Left Recursion
2.5.3.2- Removing Indirect Left Recursion
118
2.5.4- Algorithm for Elimination of Left Recursion
2.6- Verifying Grammars
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous module, you have been taken through the lexical analysis phase
of a compiler. The module showed you that the lexical structure of tokens could
be specified by regular expressions and that from a regular expression you could
automatically construct a lexical analyser to recognise the tokens denoted by the
expression. Module 3 will give similar treatment of the next phase of a
compiler, which is the syntax analysis phase. For the syntactic specification of
a programming language, we shall use a notation called a context-free
grammar (grammar, for short), which is also called BNF (Backus-Naur Form)
description.
In this first study session of the third module, you will be exposed to grammars,
how a grammar defines a language, and what features of programming
languages can, and cannot, be specified by context-free grammars.
119
5. Verify grammars.
2.0 MAIN CONTENT
2.1 The Parser
The Parser takes tokens from lexer (scanner or lexical analyser) and builds a
parse tree. The parser has two basic functions. It checks that the tokens
appearing in its input, which is the output of the lexical analyser, occur in
patterns that are permitted by the
specification for the source
language. It also imposes on the
token a tree-like structure that is
used by the subsequent phases of
the compiler. Fig 3.1.1: The Parser
120
A context- free grammar is formally represented by the tuple:
CFG = ( N, T, P, s )
where: N and T are finite sets of nonterminals and terminals under the
assumption that N and T are disjoint
P is a finite set of productions such that each production is
of the form: A→ α (where A is a nonterminal and α is a
string of symbols)
s is the start symbol
2.2.1 Notational Conventions
Terminals are usually denoted by:
- lower-case letters early in the alphabet: a, b, c;
- operator symbols: +, -, *, etc.;
- punctuation symbols: (, ), {, }, ; etc.;
- digits: 0, 1, 2, ..., 9;
- boldface strings: if, else, etc.;
Nonterminals are usually denoted by:
- upper-case letters early in the alphabet: A, B, C;
- the letter S representing the start symbol;
- lower-case italic names: expr, stmt, etc.;
121
F → (E) | id
- The terminal symbols are the alphabet from which strings are formed. In
this grammar the set of terminal symbols is {id, +, *, (, ) }. The terminal
symbols are the token names.
- The nonterminal symbols are syntactic variables that denote sets of
strings of terminal symbols. In this grammar the set of nonterminal
symbols is { E, T, F}.
- The start symbol is E.
123
2.3.1.1 Leftmost Derivations
Derivations in which only the leftmost nonterminal in any sentential form is
replaced at each step are called leftmost: α lmβ.
if w Aγ wδγ then α lm β.
where A → δ is a production and w is a string of terminals, and is a string of
grammar symbols.
To emphasise the fact that α derives β by a leftmost derivation may be written:
α lm* β.
If S lm* α then α is a left-sentential form of the grammar.
F → (E) | id
E E+T
T+T
F+T
id + T
id + T * F
id + F * F
id + id * F
id + id * id
124
2.3.1.2 Rightmost Derivations
Derivations in which only the rightmost nonterminal in any sentential form is
replaced at each step are called rightmost:
α rmβ.
If γ Aw γδw then α rm β.
α rm* β.
125
E + F * id
E + id * id
T + id * id
F + id * id
id + id * id
Note that these two derivations have the same parse tree.
In-text Question 1
A leftmost derivation expands the leftmost nonterminal in each sentential form. True or
False?
Answer
True
The leaves of the parse tree are labelled by terminals or nonterminals and, read
from left to right. They constitute a sentential form, called the yield or frontier
of the tree. For example, the parse tree for –(id + id)implied by the derivation of
Example 1 above is as shown in Figure 1
below:
126
Fig. 3.1.1: Parse Tree
2.4 Ambiguity
A grammar that produces more than one parse tree for some sentence is said to
be ambiguous i.e. An ambiguous grammar is one that produces more than one
leftmost or more than one rightmost derivation for some sentence. For certain
types of parsers, it desirable that the grammar be made unambiguous, for if it is
not, we cannot uniquely determine which parse tree to select for a sentence.
Consider the context-free grammar G with the productions
E → E + E | E * E | ( E ) | id
This grammar has the following leftmost derivation for id + id * id
E E+E
id + E
id + E * E
id + id * E
id + id * id
This grammar also has the following leftmost derivation for id + id * id
E E*E
E+E*E
id + E * E
id + id * E
127
id + id * id
These derivations have different parse trees.
A grammar is ambiguous if there is a sentence with two or more parse trees.
The problem is that the grammar above does not specify
- the precedence of the + and * operators, or
- the associativity of the + and * operators
However, the grammar in section (3.2) generates the same language and is
unambiguous because it makes * of higher precedence than +, and makes both
operators left associative.
A context-free language is inherently ambiguous if it cannot be generated by
any unambiguous context-free grammar.
The context-free language { ambmanbn | m> 0 and n> 0} { ambnanbm | m > 0 and
n > 0} is inherently ambiguous.
Most natural languages are inherently ambiguous but no programming
languages are inherently ambiguous.
Unfortunately, there is no algorithm to determine whether a CFG is ambiguous;
that is, the problem of determining whether a CFG is ambiguous is undecidable.
We can, however, give some practically useful sufficient conditions to
guarantee that a CFG is unambiguous.
2.5 Left Recursion
"A grammar is left-recursive if we can find some non-terminal A which will
eventually derive a sentential form with itself as the left-symbol."
2.5.1 Immediate Left Recursion
Immediate left recursion occurs in rules of the form
where α and β are sequences of nonterminals and terminals, and β doesn't start
with A. For example, the rule
128
is immediately left-recursive. The recursive descent parser for this rule might
look like:
function Expr()
{
Expr(); match('+'); Term();
}
and a recursive descent parser would fall into infinite recursion when trying to
parse a grammar which contains this rule.
In-text Question 2
What is a Parse Tree?
Answer
A Parse tree is a graphical representation for derivation that filters out the choice regarding
replacement
129
2.5.3 Removing Left Recursion
2.5.3.1 Removing Immediate Left Recursion
The general algorithm to remove immediate left recursion follows. Several
improvements to this method have been made, including the ones described in
"Removing Left Recursion from Context-Free Grammars", written by Robert C.
Moore. For each rule of the form
where:
A is a left-recursive nonterminal
is a sequence of nonterminals and terminals that is not null ( )
β is a sequence of nonterminals and terminals that does not start with A.replace
the A-production by the production:
This newly created symbol is often called the "tail", or the "rest".
As an example, consider the rule
130
2.5.3.2 Removing Indirect Left Recursion
If the grammar has no -productions (no productions of the form | | )
and is not cyclic (no derivations of the form for any nonterminal
A), this general algorithm may be applied to remove indirect left recursion :
Arrange the nonterminals in some (any) fixed order
for i = 1 to n {
for j = 1 to i – 1 {
let the current Aj productions be
A → Aα|β
Example 4: S → Aa | b
A → Ac | Sd |
S → Aa | b
A → bd A' | A'
A' → cA'
| adA' |
132
2.6 Verifying Grammars
Given a grammar G, we may want to prove that L(G) = L for some given
language L. To do this, we need to show that every string generated by G is in
L, and every string in L can be generated by G.
Consider the grammar with the productions: S → ( S ) S | ε.
Let L be the set of strings of balanced parentheses.
We can use induction on the number of steps in a derivation to show that every
string generated from S is balanced, thus showing that every string generated by
the grammar is in L.
We can use induction on the length of a string to show that every balanced
string can be generated from S, thus showing that every balanced string is
generated by the grammar.
133
(d) 4
(e) m?
4.0 Conclusion/Summary
In this study session, you have been taken through the concept of
context-free grammars, derivations and parse trees.
You also learnt that:
- a leftmost derivation expands the leftmost nonterminal in each sentential
form
- a rightmost derivation expands the rightmost nonterminal in each
sentential form
- parse tree is a graphical representation for derivation that filters out the
choice regarding replacement
- the leaves of the parse tree are labelled by terminals or nonterminals and,
read from left to right and they constitute a sentential form, called the
yield or frontier of the tree
- an ambiguous grammar is one that produces more than one leftmost or
more than one rightmost derivation for some sentence.
In the rest of this module, you will learn about parsing techniques.
134
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=grZvMHn2sC4. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=5_tfVe7ED3g
and critique it in the discussion forum
c. Take a walk and engage any 3 students on grammar ambiguity; In 2
paragraphs summarise their opinion of the discussed topic. etc.
135
136
STUDY SESSION 2
Bottom-Up Parsing Techniques
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Parsing Techniques
2.2- Bottom-Up Parsing
2.2.1- Types of Bottom-Up Parsers
2.2.2- Shift-Reduce Parsers
2.2.2.1- Action Table
2.2.2.2- Shift and Reduce
2.2.2.3- Algorithm: Shift-Reduce Parsing
2.2.3- Shift-Reduce Parsing
2.2.3.1- Handles
2.2.4- Stack Implementation of Shift-Reduce Parsing
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the first session of this module, you were exposed to context free
grammars, how a grammar defines a language, and what features of
programming languages can, and cannot, be specified by context-free
grammars. In this study session, you are going to learn about parsing
techniques. There are various kinds of parsing techniques and these can be
137
broadly categorised into two viz: top-down and bottom-up parsing techniques.
1.0 Study Session Learning Outcomes
After studying this session, I expect you to be able to:
1. Define parsing techniques
2. Distinguish between top-down and bottom-up parsing techniques
3. Describe shift reduce parsing
4. Define handle
5. Analyse an input string using shift-reduce parsing.
(b) Bottom-up Parsing (start from string and reduce to start symbol): A
bottom-up parser builds a parser tree by starting at the leaves and working
up towards the root. Bottom-up parsing is characterised by the following:
- It is not easy to handle by hands, usually compiler-generating software
generate bottom-up parser
- It handles larger class of grammar
Example is LR parsers, simple precedence parser, operator precedence parsers,
etc.
138
2.2 Bottom-Up Parsing
In programming language compilers, bottom-up parsing is a parsing method
that works by identifying terminal symbols first, and combines them
successively to produce non-terminals. The productions of the parser can be
used to build a parse tree of a program written in human-readable source code
that can be compiled to assembly language or pseudo code.
2.2.1 Types of Bottom-up Parsers
The common classes of bottom-up parsers are:
- LR parser (you will learn more about this in study session 5 of this
module)
(a) LR(0) - No lookahead symbol
(b) SLR(1) - Simple with one lookahead symbol
(c) LALR (1) - Lookahead bottom up, not as powerful as full LR (1)
but simpler to implement. YACC deals with this kind of grammar.
(d) LR (1) - Most general grammar, but most complex to implement.
(e) LR(k) - (where k is a positive integer) indicates an LR parser with k
lookahead symbols; while grammars can be designed that require
more than 1 lookahead, practical grammars try to avoid this
because increasing k can theoretically require exponentially more
code and data space (in practice, this may not be as bad). Also, the
class of LR (k) languages is the same as that of LR (1) languages.
- Precedence parsers (you will learn more about this in study session 5
of this module)
(a) Simple precedence parser
(b) Operator-precedence parser
(c) Extended precedence parser.
139
In-text Question 1
A bottom-up parser builds a parse tree by starting at the root and working down towards the
leaves.
Answer
False
2.2.2 Shift-Reduce Parsers
The most common bottom-up parsers are the shift-reduce parsers. These parsers
examine the input tokens and either shift (push) them onto a stack or reduce
elements at the top of the stack, replacing a right-hand side by a left-hand side.
2.2.2.1 Action Table
Often an action (or parse) table is constructed which helps the parser determine
what to do next. The following is a description of what can be held in an action
table.
Actions
(a) Shift - push token onto stack
(b) Reduce - remove handle from stack and push on corresponding
nonterminal
(c) Accept - recognise sentence when stack contains only the distinguished
symbol and input is empty
(d) Error - happens when none of the above is possible; means original input
was not a sentence.
2.2.2.2 Shift and Reduce
A shift-reduce parser uses a stack to hold the grammar symbols while awaiting
reduction. During the operation of the parser, symbols from the input are shifted
onto the stack. If a prefix of the symbols on top of the stack matches the RHS of
a grammar rule which is the correct rule to use within the current context, then
the parser reduces the RHS of the rule to its LHS, replacing the RHS symbols
on top of the stack with the non-terminal occurring on the LHS of the rule. This
shift-reduce process continues until the parser terminates, reporting either
140
success or failure. It terminates with success when the input is legal and is
accepted by the parser. It terminates with failure if an error is detected in the
input.
The parser is a stack automaton which is in one of several discrete states. In
reality, in the case of LR parsing, the parse stack contains states, rather than
grammar symbols, as you will learn in study session 5 of this module. However,
since each state corresponds to a unique grammar symbol, the state stack can be
mapped onto the grammar symbol stack mentioned earlier.
2.2.2.3 Algorithm: Shift-Reduce Parsing
Step 1: Start with the sentence to be parsed as the initial sentential form
Step 2: Until the sentential form is the start symbol do:
(a) Scan through the input until we recognise something that corresponds to
the RHS of one of the production rules (this is called a handle)
(b) Apply a production rule in reverse; i.e., replace the RHS of the rule which
appears in the sentential form with the LHS of the rule (an action known
as a reduction)
In step 2(a) above we are "shifting" the input symbols to one side as we move
through them; hence a parser which operates by repeatedly applying steps 2(a)
and 2(b) above is known as a shift-reduce parser.
A shift-reduce parser is most commonly implemented using a stack, where we
proceed as follows:
- start with an empty stack
- a "shift" action corresponds to pushing the current input symbol onto
the stack
- a "reduce" action occurs when we have a handle on top of the stack.
To perform the reduction, we pop the handle off the stack and replace
it with the nonterminal on the LHS of the corresponding rule. This is
referred to as handle pruning.
Example 1
141
Take the grammar:
Sentence--> NounPhrase VerbPhrase
NounPhrase --> Art Noun
142
Example 2
Given the grammar:
<Expression> --><Term> | <Term> + <Expression>
--><Factor> | <Factor> *
<Term> <Term>
<Factor> --> [ <Expression> ] | 0...9
143
>)
(<Factor>*[<Expression>]) () REDUCE
(<Factor>*<Factor>) () REDUCE
(<Factor>*<Term>) () REDUCE
(<Term>) () REDUCE
SUCCES
(<Expression>) () S
This is analogous to replacing the right hand side of production with left hand
side.
The problem is how to decide which production to reduce at which points. This
bottom-up parsing method is called shift-reduce because it consists of shifting
input symbols unto a stack until the right side of a production appears on the top
of the stack. The right side may then be replaced by the symbols of the left side
of the production, and the process repeated.
Informally, a substring which is the right side of a production such that a
replacement of that substring by the variable on the left side eventually leads to
144
a reduction to the stack symbols through the reverse of a rightmost derivation is
called a ―handle‖
The process of bottom up parsing may be viewed as one of finding and reducing
handle.
Technically, given grammar G, we say a string of terminal omega ωω ε L(G)
iff, S⇒ ω
ω sentence of G, if S ⇒* α where α may contains non-terminal then say α is a
sentential form of G.
In-text Question 2
What data structure does a shift-reduce parse use to hold grammar symbols while awaiting
reduction?
Answer:
Stack
2.2.3.1 Handles
A handle of a string is a substring that matches the right side of a production
whose reduction to the nonterminal on the left represents one step along the
reverse of a rightmost derivation. A handle of a right-sentential form γ is a
production, A → β (A derives β) at a position of γ where the string β may be
found and replaced by A to produce the previous sentential form in the
rightmost derivation of γ.
i.e. If S ⇒ * αAω⇒ αβω, then A→ βin the position following α is a
handle of αβω.
For example, let us consider the grammar:
E→E+E
E→E*E
145
E → (E)
E→i
Using top-down parsing (rightmost derivation) to find the handles, we have:
E ⇒ E + E ⇒rm E + E * E
⇒rm E + E * i1
⇒rm E + i2 * i
⇒rmi3 + i * i
(Note that handles have been underlined above)
Another rightmost derivation:
146
- reduce - the parser knows the right end of the handle is at the top of
the stack. It must then locate the left end of the handle within the stack
and decide with what nonterminal to replace the handle
- accept - the parser announces successful completion of parsing
- error - the parser discovers that a syntax error has occurred.
Stack Input
1) $ αβγ yz$
2) $ αβB yz$
3) $ αβBy z$
(b) S ⇒*rmαBxAz⇒rmαBxyz⇒rmαγxyz
Inpu
Stack t
1) $ αγ xyz$
2) $ αBxy z$
147
$ id1+ id2 * id3 $ Shift
$ id1 + id2* id3 $ reduce by E→id
$E + id2* id3 $ Shift
id2 *
$ E+ id3 $ Shift
$ E+id2 * id3$ reduce by E→id
$ E+E * id3$ Shift
$ E+E* id3$ Shift
$ E+E*id3 $ reduce by E→id
$ E+E*E $ reduce by E→E*E
$ E+E $ reduce by E→E+E
$E $ accept
148
- Reject/error
4.0 Conclusion/Summary
In this session, you have been taken through bottom-up parsing using as an
example shift-reduce parsing. You have also learnt how to implement shift-
reduce parser using stack.
In the next session of this module, you will learn about another type of bottom-
up parsing called precedence parsing.
149
You also learnt that:
- a top-down parser builds a parse tree by starting at the root and working
down towards the leaves
- a bottom-up parser builds a parser tree by starting at the leaves and
working up towards the root
- a handle of a right-sentential form G is a production A→ b and a position
in g where b may be found and replaced by A to produce the previous
right-sentential form in a rightmost derivation of a grammar G.
- the process to construct a bottom-up parse is called handle-pruning
- a shift-reduce parser has just four canonical actions: shift, reduce, accept,
and error.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=APJ_Eh60Qwo. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=HEl_0CMs3Rg
and critique it in the discussion forum
c. Take a walk and engage any 3 students on types of bottom-up parsers; In 2
paragraphs summarise their opinion of the discussed topic. etc.
150
iii. Accept
iv. Error
2. Reduction
151
8.0 References/Further Reading
Alfred, V. Aho, et al. (2007). Compilers: Principles, Techniques, and
Tools. Second Edition. Wesley: Pearson Addison.
Andrew, W. Appel (2002).Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Keith, D. Cooper & Linda, Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Steven, S. Muchnick (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
152
STUDY SESSION 3
Precedence Parsing
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Operator Precedence Parser
2.1.1- Relationship to other Parsers
2.1.2- Operator Grammars
2.1.3- Operator Precedence Grammar
2.1.4- Precedence Relations between Operators (Terminals)
2.1.4.1 - Methods of Generating Relationship between
Operators
2.1.4.1.1 - Intuition Method
2.1.4.1.1 - formal Method
2.1.5- Construction of Matrix of all Precedence Relations for a
Grammar
2.1.6- Operator Precedence Parsing Algorithm
2.1.7- Making Operator Precedence Relations
2.2- Simple Precedence Parser
2.2.1- Implementation of Simple Precedence Parser
2.3- Algorithm for Constructing Precedence Functions
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
153
Introduction:
You have been introduced to bottom-up parsing and a type of bottom-up parsing
referred to as shift-reduce parsing in the previous study session. In this study
session, you shall learn about another type of bottom-up parsing known as
precedence parsing. This parser can be developed using operator grammars.
154
2.1.1 Relationship to other Parsers
An operator-precedence parser is a simple shift-reduce parser capable of parsing
a subset of LR(1) grammars. More precisely, the operator-precedence parser can
parse all LR(1) grammars where two consecutive nonterminals never appear in
the right-hand side of any rule.
Operator-precedence parsers are not used often in practice; however, they do
have some properties that make them useful within a larger design. First, they
are simple enough to write by hand, which is not generally the case with more
sophisticated shift-reduce parsers. Second, they can be written to consult an
operator table at runtime, which makes them suitable for languages that can add
to or change their operators while parsing.
To show that one grammar is operator precedence, first it should be operator
grammar. Operator precedence grammar is the only grammar which can
construct the parse tree even though the given grammar is ambiguous.
We try to look at the relationship between operators to guide us in parsing.
2.1.2 Operator Grammars
Operator grammars have the property that no production right side is empty or
two or more adjacent non-terminals. This
property enables the implementation of
efficient operator-precedence parsers.
These parsers rely on the following three
precedence relations:
155
2.1.4 Precedence Relations between Operators (terminals)
For terminals a and b,
i. < : ifa < b, we say a ―yields precedence to‖b. This means that a
production involving a has to yield to a production involving b.
ii. = : ifa=b,we saya―has the same precedence as‖b. This means that a
and b are part of the same production.
iii. > : ifa b, we saya―has precedence over‖b. This means thatproduction
involving a will be reduced before the production involving b.
An operator grammar is one that has no production whose right-hand side has
two or more adjacent non-terminals.
2.1.4.1 Methods of Generating Relationship between Operators
i. Intuition
ii. Formal
2.1.4.1.1 Intuition Method
We can use associativity and precedence of the operations to assign the relation
intuitively as follows:
i. If operator θ1 and θ2 has higher precedence than θ2, then we make
θ1 > θ2and θ2< θ1
ii. If θ1 and θ2 are operators of equal precedence then either we make θ1
>θ2 and θ2>θ1, if the operators are left associative or make θ1<θ2 and
θ2<θ1, if they are right associative.
156
a < b if for some non-terminal A there is a right side of the formαaAβand A *
γbδ, where γ is either ε or a single non-terminal. i.e. a < b if a non-terminal A
appears immediately to the right of a and derives a string in which b is the first
terminal symbol. E.g. S → iCtS and C *b, therefore i < b. Also define $ <b if
there is a derivation S * γbδ and γ is either empty or a single non-terminal.
a > b if for some non-terminal A there is a right side of the formαAbβand A *
γaδ, where δ is either ε or a single non-terminal. i.e. a > b if a non-terminal A
appearing immediately to the left of b and derives a string whose last terminal
symbol is a. E.g. in S → iCtS and C * b, therefore b > t. Also define a> $ if
there is a derivation S * γaδ and δis either empty or a single non-terminal.
2.1.5 Construction of Matrix of all Precedence Relations for a Grammar
To construct the matrix of precedence operations for a grammar, you need to
first deduce for each nonterminal those terminals that can be the first or last
terminal in a string derived from that nonterminal. To demonstrate this, let us
consider the grammar below:
E→E+T/T
T→T*F/F
F → (E) / i
For this grammar, all derivations from F will have the symbols ( or i as the first
terminal and the symbols ) or i as the last. A derivation from T could begin T
T * F, showing that * could be both first and last terminal. Or a derivation could
begin T F, meaning that every first and last terminals derivable from F is also
a first or last terminal derivable from T. Thus, the symbols *,), and i can be first
and *, ), and i can be last in a derivation from T. A similar argument applies to
E, and we see that +, *, (, or i can be first and *, +, ), or i can be last. These facts
are summarised in the table below.
157
In-text Question 1
For any pair of terminals a and b, more than one of these relations: a <b, a = b, a >b can be
true. True of False?
Answer:
False
E +, *, (, i *, +, ), i
T *, (, i *, ), i
F (, i ), i
Now, to compute the relation, we look for right sides with two terminals
separated by nothing or by a nonterminal. Only one right side,
(E), qualifies, so we determine (=).
Next, consider <. We look for right sides with a terminal immediately to the left
of a nonterminal to play the roles of a and A in rule (ii). For each such pair, a is
related by < to any terminal which can be first in a string derivable from A. The
candidates in the grammar above are + and T in the right side E + T, * and F in
T * F, and ( and E in (E). The first of
these gives + < *, + < (, and + < i. The *:F pair gives * < ( and * < i. The (:E
pair gives ( < *, ( < +, ( < (, and ( < i. We then add the relationships $ < *,
$ < +, $ < (, $ < i, since $ must be related by < to all possible first terminals
derivable from the start symbol E.
Symmetrically, we can construct the > relation. We look for the right sides with
a nonterminal immediately to the left of a terminal, to play the roles of A and b
of rule (iii). Then, every terminal that could be the last in a string derivable from
158
A is related by > to b. In our grammar, the pair corresponding to A and b are
E:+, T:*, and E:). Thus, we have the
relations * > +, + > +, ) > +, i > +, * > *, ) > *, i > *, * > ), + > ), ) > ), and i >
). We add the relations * > $, + > $, ) > $, and i > $ according to rule (iii). The
precedence relations for our grammar are as shown in the table below:
Table 3.3.2: Operator-Precedence Relations
These operator precedence relations allow us to delimit the handles in the right
sentential forms: <· marks the left end, =· appears in the interior of the handle,
and ·> marks the right end.
Let us assume that between the symbols ai and ai+1 there is exactly one
precedence relation. Suppose that $ is the end of the string. Then for all
terminals we can write: $ <· b and b ·> $. If we remove all nonterminals and
place the correct precedence relation:
<·, =·, ·> between the remaining terminals, there remain strings that can be
analysed by easily developed parser.
Example: The input string:
i1+ i2 * i3
after inserting precedence relations becomes
$ <· i1 ·> + <· i2 ·>*<· i3 ·> $
Having precedence relations allows us to identify handles as follows:
159
- scan the string from left until seeing ·>
- scan backwards the string from right to left until seeing <·
- everything between the two relations <· and ·> forms the handle
Note that not the entire sentential form is scanned to find the handle.
160
2.2 Simple Precedence Parser
A Simple precedence parser like the
operator precedence parser discussed earlier
in this study session is a type of bottom-up
parser for context-free grammars that can be
used only by Simple precedence grammars.
Answer:
True
161
a. Shift:
b. Push(Stack, relationship)
c. Push(Stack, NextToken(Input))
d. RemoveNextToken(Input)
c. if the relationship is
a. Reduce:
b. SearchProductionToReduce(Stack)
c. RemovePivot(Stack)
d. Search in the table the relationship between the Non
terminal from the production and first symbol in the stack
(Starting from top)
e. Push(Stack, relationship)
f. Push(Stack, Non terminal)
Step 5: SearchProductionToReduce (Stack)
a. search the Pivot in the stack the nearest from the top
b. search in the productions of the grammar which one have the
same right side than the Pivot
Example 1
Given the grammar below:
E → E + T' | T'
T' → T
T→T*F|F
F → ( E' ) | num
E' → E
num is a terminal, and the lexer parses any integer as num.
162
Table 3.3.3: Parsing Table for the Grammar in Example 1 above
By steps 2 through
step 5, we have:
163
2. Partition the created symbols into as many groups as possible such that
fa and gb are in the same group if a =· b ( there can be symbols in the
same group even if they are not connected by this relation);
3. Create a directed graph whose nodes are in the groups, next for each
symbols a and b do: place an edge from the group of gb to the group of
fa if a<· b, otherwise if a ·>b place an edge from the group of fa to that
of gb;
4. If the constructed graph has a cycle then no precedence functions exist.
When there are no cycles collect the length of the longest paths from
the groups of fa and gb respectively.
Id + * $
Id ·> ·> ·>
+ <· ·> <· ·>
* <· ·> ·> ·>
$ <· <· <· ·>
164
from which we extract the following precedence functions:
Id + * $
f 4 2 4 0
g 5 1 3 0
S → a | ^ | (T)
T → T,S | S
a. Compute the operator-precedence relations for this grammar. Is it an
operator-precedence grammar?
b. Using the operator-precedence relations computed in (a) above,
analyse the following input string:
i. (a, (a, a))
ii. (((a, a), ^, (a)), a)
c. Find precedence functions for the relations computed in (a) above.
4.0 Conclusion/Summary
In this study session, you have been taken through another type of bottom-up
parsing technique known as precedence parsing. This technique is very suitable
for languages that can add to or change their operators while parsing.
You also learnt that:
- an operator precedence parser is a bottom-up parser that interprets an
operator-precedence grammar
- an operator-precedence parser is a simple shift-reduce parser capable of
parsing a subset of LR(1) grammars
- operator grammars have the property that no production rightside is
empty or two or more adjacent non-terminals
165
- an operator precedence grammar is an ε-free operator grammar in which
the precedence relations define above are disjoint i.e. for any pair of
terminals a and b, never more than one of these relations: a<b, a = b, a>b
is true
- precedence parsing techniques is suitable for languages that can add to or
change their operators while parsing.
In the next study session, you will learn more about top-down parsing
techniques.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=n5UWAaw_byw. Watch the
video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=hF9aIV5H7Xo
and critique it in the discussion forum
c. Take a walk and engage any 3 students on methods of generating relationship
between operators; In 2 paragraphs summarise their opinion of the discussed
topic. etc.
166
8.0 References/Further Reading
Alfred, V. Aho et al. (2007). Compilers: Principles, Techniques, and
Tools. Second edition. Wesley: Pearson Addison.
Andrew, W. Appel, (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Keith, D. Cooper & Linda, Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Steven, S. Muchnick (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
Michael, L. Scott (2009). Programming Language Pragmatics. Third
edition. Morgan Kaufman.
Robert, W. Sebesta (2010).Concepts of Programming Languages. Ninth
edition. Wesley: Addison.
Aho, A.V & Ullman, J. D. (1972). The Theory of Parsing, Translation,
and Compiling, Volume 1: Parsing. Englewood Cliffs, N.J.: Prentice-
Hall.
167
STUDY SESSION 4
Top-Down Parsing Techniques
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Top-Down Parsing Techniques
2.1.1- Difficulties with Top-Down Parsing
2.1.1.1- Minimising the Difficulties.
2.2 - LL (K) GRAMMARS
2.3 - Recursive Descent Parsing
2.4 - Nonrecursive Predictive Parsing
2.4.1 - Nonrecursive Predictive Parsing Algorithm
2.4.2 - Functions Definitions
2.4.2.1 - FIRST and FOLLOW
2.4.2.1.1 - FIRST
2.4.2.1.2 - FOLLOW
2.4.3 - How to Construct a Parsing Table
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous study session of this module you have been introduced to some
types of bottom-up parsing. In this study session, you will be exposed to some
top-down parsing techniques such as recursive descent parsing and
168
nonrecursive predictive parsing.
1.0 Study Session Learning Outcomes
After studying this session, I expect you to be able to:
1. define top-down parsing
2. state the difficulties with top-down parsing and how they can be
overcome
3. describe recursive descent parsing
4. define LL(k) grammars
5. describe nonrecursive predictive parsing
6. write an algorithm for nonrecursive predictive parsing
7. construct a predictive parser for LL(k) grammars
8. use the predictive parsing table for a grammar to analyse/determine input
strings belonging to the grammar.
ii. Order of trying the alternatives may influence your getting the input or
not. You can use a method called factoring i.e left factoring is used to
ensure that the right sequence/order is used or followed.
e.g. A → αβ
A → αγ
If we have the above, which one do we expand A to? We can get rid of this
dilemma by using left factoring to determine the right sequence. That is the
above becomes:
170
A → αA‘
A‘ → β | γ
2.2 LL (K) GRAMMARS
LL (K) grammars are those for which the left parser can be made to work
deterministically, if it is allowed to look at K-input symbols, to the right of its
current input position. The left parser expands the leftmost non-terminal or
variable.
In other words, a context-free grammar is called LL grammar when it is
elaborated to enable scanning the input from left to right (denoted by the first L)
and to produce a leftmost derivation (denoted by the second L).
Often we consider LL (1) grammars that use one input symbol of lookahead at
each step to make parsing decisions.
These two properties allow us to perform efficient recursive parsing. Moreover
there can be automatically constructed predictive parsers from LL grammars.
We use what is called a predictive parsing method to parse certain things
derivable from LL (K)
171
Therefore, today, recursive descent parsing is a top-down approach that avoids
backtracking.
The Recursive procedures can be quite easy to write and fairly efficient if
written in a language that implement procedure calls efficiently.
Example 1
Consider the following grammar
(a) E → TE‘ (b) E → E + T | T
E1 → TE‘ | E T→T*F|F
T → FT‘ F → (E) | i
T1 → *FT‘ ε
F → (E) | i
Grammar (a) and (b) are alike/identical because grammar (b) has left recursion,
we can transform it to the form written in grammar (a) above. We can then
easily implement the parser by writing a procedure for each non-terminal like
the one below.
For grammar (a), we have five non-terminals: E, E‘, T, T‘, F
By writing five (5) procedures for each of these non-terminals, then our parser
is complete.
Procedure E ( );
Begin
T ( ) ; EPRIME ( )
End
Procedure EPRIME ( );
Begin
If input-symbol = “+” then
Begin
ADVANCE ( ) ; T( ); Eprime ( )
172
End
end
Procedure F( ) ;
Begin
If input-symbol = “i” then Advance ( )
Example 2:
Suppose we have a grammar G:
E→E+T
E→T
T→T*F
T→F
F → (E)
F→a
173
E→T (2)
→T*F (3)
→F*F (4)
→a*F (6)
→ a * (E) (5)
→ a * (T + T) (1)
→ a * (T+T) (2)
→ a * (F + T) (4)
→ a * (a + T) (6)
→ a * (a + F) (4)
→ a + (a + a) (6)
Answer:
True
174
Fig. 3.4.2: Model of a Predictive Parser
176
pointed to by ip
if X is a terminal or $then
if X = a then
pop X from the stack and advance ip
else error()
else /* X is a nonterminal */
if M[ X, a ] = X → Y1, Y2, ..., Ykthen
begin
pop X from the stack
push Yk, Yk-1, ..., Y1 onto the stack, with Y1on top
output the production X→Y1, Y2, ..., Yk
end
else error()
until X = $
Fig. 3.4.3: Predictive Parsing program
177
To compute FIRST(X) for all grammar symbols X, apply the following rules
until no more terminals of ε can be added to any FIRST set.
1. If X is terminal, then FIRST(X) is (X)
2. If X is nonterminal and X →aα is a production, then add a to FIRST(X).
If X → ε is a production, then add ε to FIRST(X).
3. If X → Y1 Y2 ...Yk is a production, then for all i such that all of Y1,...,Yi-1
are nonterminals and FIRST(Yj) contains ε for j = 1, 2,..., i-1 (i.e.
Y1,.Y2..,Yi-1 * ε), add every non-ε symbol in FIRST(Yi) to FIRST(X). If
ε is in FIRST (Yj) for all for j = 1, 2,..., k, then add ε to FIRST(X).
Now we can compute FIRST for any string X1X2...Xn as follows. Add to
FIRST(X1X2...Xn) all the non-ε symbols of FIRST(X1). Also add the the non-ε
symbols of FIRST(X2) if ε is in FIRST(X1), the non-ε symbols of FIRST(X2) if
ε is in both FIRST(X1) and FIRST(X2), etc. Finally, add ε to FIRST(X1X2...Xn)
if, for all i, FIRST(Xi) contains ε.
Example: Consider the Grammar below:
G: E → TE‘
E‘ → +TE‘/ ε
T → FT‘
T‘ → *FT‘/ ε
F → (E) / i
Suppose we want to find FIRST(E) i.e. What are the first elements of strings
which E can derive?
Solution
Elements of FIRST(E) = { (, i }
Elements of FIRST(E‘) = { +, ε }
Elements of FIRST T‘) = { * , ε }
2.4.2.1.2 FOLLOW
178
Define FOLLOW(A), for non-terminal A to be the set of terminals a, that can
appear immediately to the right of A in some sentential form. i.e. s →* αAaβ
If A can be the rightmost symbol in some sentential form, then ε is in
FOLLOW(A)
If A is the distinguish or is the start symbol, then FOLLOW (A) contains ε
To compute FOLLOW(A) for all nonterminals A, apply the following rules
until nothing can be added to any follow set.
ε is in FOLLOW(S), where S is the start symbol.
Solution:
FOLLOW (E) = FOLLOW (E‘) = { ε, )}
FOLLOW (T) = FOLLOW (T‘) = { ε, ), +}
FOLLOW (F) = {ε, ), +, *}
179
The following algorithm can be used to construct a predictive parsing table for a
grammar G. The idea behind the algorithm is simple. Suppose A→α is a
production with a in FIRST(α). Then whenever the parser has A on top of the
stack with a the current symbol, the parser will expand A by α. The only
complication occurs when α = ε or α * ε. In this case, you should also expand
A by α if the current input symbol is in FOLLOW(A), or if the $ on the input
has been reached and ε is in FOLLOW(A).
Example 4
180
Using the grammar G below, let us construct a parsing table.
G: E → TE‘
E‘ → +TE‘/ ε
T → FT‘
T‘ → *FT‘/ ε
F → (E) / i
181
Since ε is in FIRST(ε), therefore, we find FOLLOW(T‘) =
{+, ), ε}
g. Considering the 7th production F‘ → (E)
A ≡ F;α ≡ (E)
FIRST ((E)) = {(} and ε is not in FIRST((E))
h. Considering the 8th production F → i
A ≡ F;α ≡ i
FIRST(i) = {i} and ε is not in FIRST(i)
Table 3.4.2: Parsing Table for the Grammar in Example 4
We want to know if the sentence i+i*i can be formed from the grammar above
using the parsing table above
Solution
Using Top-down parsing
182
Since the stack and input is empty, therefore the sentence i+i*i can be formed
from the grammar and the parsing table.
The problem with the parsing table above is that the grammar must be non-left
recursive otherwise it will not be easy to construct the above parsing table.
In-text Question 2
What is a nonrecursive parser?
Answer:
A nonrecursive parser is an implementation of a predictive parser that keeps the symbols on
a stack explicitly, without maintaining them implicitly via recursive calls.
183
T → FT‘
T‘ → T | ε
F → PF‘
F‘ → *F‘ | ε
P → (E) | a | b |
a. Compute FIRST and FOLLOW for each nonterminal of the above
grammar
b. Show that the grammar is LL(1)
c. Construct the predictive parsing table for the grammar
d. Construct a recursive-descent parser for the grammar.
2. Consider the following grammar:
S →aSa | aa
Clearly the grammar generates all even length strings of a‘s except for the
empty string
a. By tracing through the steps of a top-down parser (with backtracking)
which tries the alternate aSa before aa, show that S succeeds on 2, 4,
or 8 a‘s, but fail on 6 a‘s
b. What strings cause the parser to succeed?
4.0 Conclusion/Summary
In this study session, you have been taken through some top-down parsing
techniques such as recursive descent parsing and predictive parsing techniques.
These two techniques like their counterpart bottom-up techniques that you have
learnt about so far in this module can be implemented by hand. The predictive
parser has the limitation that it cannot be constructed for left-recursive
grammars.
You also learnt that:
184
- LL(K) grammars are those for which the left parser can be made to work
deterministically, if it is allowed to look at K-input symbols, to the right
of its current input position
- Top-down parsing aims to identify a leftmost derivation of a given input
string with respect to the predefined grammar. The top-down parsing
process constructs a parse tree starting from the root and proceeding to
expand the tree by creating children nodes till reaching the leaves
- Recursive descent is a top-down parsing strategy that operate doing
backtracking
- Using backtracking makes the recursive descent parsers inefficient, and
they are only a theoretically possible alternative
- A nonrecursive parser is an implementation of a predictive parser that
keeps the symbols on a stack explicitly, without maintaining them
implicitly via recursive calls
- The general definition of FIRST is FIRSTk(α) meaning the first k
symbols in any string that α can derive
- FOLLOW(A), for non-terminal A is the set of terminals a, that can
appear immediately to the right of A in some sentential form. i.e. s →*
αAaβ
In the next study session, you will learn about another type of parser that
cannot be implemented by hand. These are LR parsers which are so called
because they can scan the input from left-to-right and construct a rightmost
derivation in reverse. They are efficient bottom-up parsers that can be
constructed for a large class of context-free grammars.
185
other characters are terminal symbols
[ ]|
| |
|
Using the given Grammar above, complete the tables in (1) and (2)
1.
NON FIRST
TERMINAL
S
X
2.
NON FOLLOW
TERMINAL
S
X
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=4b8rX0wRxMI. Watch
the video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=UR6B5RoYOLQ
and critique it in the discussion forum
c. Take a walk and engage any 3 students rules to construct the FIRST and
FOLLOW sets; In 2 paragraphs summarise their opinion of the discussed topic.
etc.
186
7.0 Self-Assessment Question Answers
1.
NON FIRST
TERMINAL
S { [, a }
X { +, -, b, }
Y { -, }
2.
NON FOLLOW
TERMINAL
S { $, +, -, b, ], c}
X { ], c }
Y { ], c, b}
187
Scott, Michael L. (2009). Programming Language Pragmatics.
Thirdedition, Morgan Kaufman.
Sebesta,Robert W. (2010). Concepts of Programming Languages. Ninth
edition. Wesley: Pearson Addison.
188
STUDY SESSION 5
LR Parsers
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - LR (K) GRAMMARS
2.1.1 - Why study LR grammars?
2.2 - LR Parsing
2.2.1 - Benefits of LR Parsing
2.2.2 - Drawback of LR Parsing
2.2.3 - Techniques for Producing LR Parsing Tables
2.3 - LR (K) PARSER
2.3.1 - Configuration of LR Parser
2.4 - Simple LR (SLR) Parser
2.4.1 - Definitions
2.4.1.1 - Viable Prefix
2.4.1.2 - LR(0) Item (or Simple Item)
2.4.1.3 - Augmented Grammar
2.4.1.4 - CLOSURE
2.4.1.5 - GOTO Function
2.4.1.6 - The Sets-of-Item Construction
2.4.2 - Constructing SLR Parsing Table
2.4.2.1 - Algorithm for Construction of an SLR Parsing Table
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
189
8.0 References/Further Reading
Introduction:
The problem with the parsing table discussed in the previous study session is
that the grammar must be non-left recursive otherwise it will not be easy to
construct the parsing table. In this session, you will be shown how to construct
efficient bottom-up parsers for a large class of context-free grammars. These
parsers are called LR parsers because they can scan the input from left-to-right
and construct a rightmost derivation in reverse. These class of parsers are
attractive for a variety of reasons amongst which is that they can be constructed
to recognise virtually all programming language constructs for which context-
free grammars can be written.
190
when it is elaborated to enable scanning the input from left to right (denoted by
L) and to produce a rightmost derivation (denoted by R).
2.1.1 Why Study LR Grammars?
We study LR grammars for a variety of reasons amongst which are the
following:
- LR (1) grammars are often used to construct parsers. We call these
parsers LR(1) parsers and it is everyone‘s favourite parser
- virtually all context-free programming language constructs can be
expressed in an LR(1) form
- LR grammars are the most general grammars parse-able by a
deterministic, bottom-up parser
- efficient parsers can be implemented for LR(1) grammars
- LR parsers detect an error as soon as possible in a left-to-right scan of the
input
- LR grammars describe a proper superset of the languages recognised by
predictive (i.e., LL) parsers
LL (k): recognise use of a production A → β seeing first k symbols of β
LR (k): recognise occurrence ofβ(the handle) having seen all of what is derived
from β plus k symbols of lookahead.
2.2 LR Parsing
LR parsing can be generalised as arbitrary context-free language parsing
without a performance penalty, even for LR (k) grammars. This is because most
programming languages can be expressed with LR (k) grammars, where k is a
small constant (usually 1). Note that parsing of non-LR (k) grammars is an
order of magnitude slower (cubic instead of quadratic in relation to the input
length).
2.2.1 Benefits of LR Parsing
191
LR parsing is attractive for a variety of reasons amongst which are the
following reasons:
a. LR parsing can handle a larger range of languages than LL parsing, and is
also better at error reporting, i.e. it detects syntactic errors when the input
does not conform to the grammar as soon as possible. This is in contrast
to an LL (k) (or even worse, an LL (*) parser) which may defer error
detection to a different branch of the grammar due to backtracking, often
making errors harder to localise across disjunctions with long common
prefixes.
b. LR parsers can be constructed to recognise virtually all programming
language constructs for which context-free grammars can be written
c. It is more general than operator precedence or any other common shift-
reduce techniques discussed so far in this module, yet it can be
implemented with the same degree of efficiency as these other methods.
d. LR parsing also dominates the common forms of top-down parsing
without backtrack. That is it is the most general non-backtracking parsing
method.
2.2.2 Drawback of LR Parsing
The principal drawback of the method is that it is too much work to implement
an LR parser by hand for a typical programming language grammar. However,
an LR parser generator is now generally available to assist. You need a
specialised tool called an LR parser generator. The design of one of such will be
discussed in this session.
2.2.3 Techniques for Producing LR Parsing Tables
There are three common techniques for building tables for an ―LR‖ parser.
These are itemised below with their peculiar characteristics.
The first method is called simple LR (SLR (1) for short). It is easiest to
implement. Unfortunately, it may fail to produce a table for certain grammars
192
on which the other methods succeed. In summary, SLR has the following
characteristics:
- smallest class of grammars
- smallest tables (number of states)
- simple, fast construction
The second method is called canonical LR (or LR (1)). It is the most powerful
and works on a very large class of grammars. Unfortunately, the canonical LR
method can be very expensive to implement. In summary, the characteristics of
the LR(1) are as follows:
- full set of LR(1) grammars
- largest tables (number of states)
- slow, expensive and large construction
The third method is called lookahead LR (LALR (1) for short). It is
intermediate in power between the SLR and the canonical LR. The LALR
method works on most programming-language grammars and, with some effort,
can be implemented efficiently. In summary, the characteristics of the LALR(1)
are as follows:
- intermediate sized set of grammars
- same number of states as SLR(1)
- canonical construction is slow and large
- better construction techniques exist
Note that an LR (1) parser for either Algol or Pascal has several thousand states;
while an SLR (1) or LALR (1) parser for the same language may have several
hundred states.
In subsequent sections of this session you will be exposed to how ambiguous
grammars can be used to simplify the description of languages and produce
efficient parsers.
In-text Question 1
What is the principal drawback of LR parsing?
193
Answer:
The principal drawback of LR parsing is that it is too much work to implement an LR parser
by hand for a typical programming language grammar.
194
Fig. 3.5.1: Generating an LR Parser
The parser has an input, a stack and a parsing table. The input is read from left
to right one symbol at a time. The stack contains a string of the form:
SoX1S1X2S2 … XmSm
195
Fig. 3.5.2: LR Parser
The function GOTO takes a state and grammar symbol as arguments and
produces a state. It is essentially the transition the transition table of a
deterministic finite automaton whose input symbols are the terminals and
nonterminals of the grammar.
A configuration of an LR parser is a pair whose first component is the stack
contents and whose second component is the unexpended input.
(SoX1S1X2S2.X3 … X mSm, ai ai+1, ai+2 … a n$)
196
The next move of the parser is determined by reading ai, the current input
symbol and Sm, the state on top of the stack and then consulting the parsing
action table entry ACTION[Sm, ai]
(SoX1S1X2S2….. X mSm, ai, ai+1, a i+2, …, a n$)
Note that:
i. ACTION () will take care of parsing
ii. GOTO() will take care of states
The configurations resulting after each of the four types of move are as follows:
1. If ACTION [Sm, ai] = Shift S, the parser executes a shift move, entering
the configuration:
{SoX1S1X2S2 … XmSmaiS, ai+1ai+2…..an$)
i.e. Shift ai to join Sm and go to a new state S.
Here the parser has shifted the current input symbol ai and the next state
S = GOTO [Sm, ai] onto the stack; ai+1 becomes the new current input
symbol.
2. If ACTION[Sm, ai] = Reduce A → β, then the parser executes a reduce
move entering the configuration:
(SoX1S1X2S2…..Xm-rSm-r A S, ai, ai+1, ai+2, …, an$)
Where S = GOTO[Sm-r, A] and r is the length of β, the right side of the
production. Here the parser first pops two r symbols off the stack (r state
symbols and r grammar symbols), exposing states Sm-r. The parser then
pushed both A, the left side of the production, and S, the entry for
ACTION[Sm-r, A], onto the stack. The current input symbol is not
changed in a reduce move. For the LR parsers we shall construct, X m-r+1
... Xm, the sequence of grammar symbols popped off the stack, will
always match β, the right side of the reducing production.
3. IF ACTION[Sm, ai] = Accept, parsing is completed
197
4. IF ACTION[Sm, ai] = Error, the parser has discovered an error and would
call the error recovery routine.
The LR parsing algorithm is very simple. Initially the LR parser is in the
configuration: (So, a1a2…… an$) where So is a designated initial states and
a1a2...an is the string to be parsed. Then the parser executes moves until an
accept, or error action is encountered. You will either accept or reject the string.
All LR parsers behave in this manner. The only difference between one LR
parser and another is the information in the parsing action and goto fields of the
parsing table.
A→ .XYZ
A → X.YZ
198
A → XY.Z
A → XYZ. are all LR (0) item.
F → (E) | i
199
2.4.1.4 CLOSURE
If I is a set of items for a grammar G, then the set of items CLOSURE[I] is
constructed from I by the following rules:
Every item in I is in CLOSURE[I]
If A → α.Bβ is in CLOSURE [I] and B → γ is a production, then add the item B
→ γ is a production, then we add the item B → .γ to I, if it is not already there.
The function CLOSURE can be computed as in the algorithm below.
Example 2:
Consider the augmented grammar
E‘ → E
E→E+T|T
T → T*F | F
F → (E) | i
I = {E‘ → .E},
Find CLOSURE[I]
Solution:
If I is the set of one item {[E‘ → E]}, then CLOSURE[I] contains the
items
E‘ → .E
E → .E + T
200
E → .T
T → .T*F
T → .F
F → .(E)
F → .i
T → T*F | F
F → (E) | i
E→ E + .T
T → .T*F
T → .F
201
F → .(E)
F →.i
That is, we examine I for items with + immediately to the right of the dot. E‘ →
E. Is not such an item, but E → E.+T is. We move the dot over the + to get [E
→ E+.T] and take the closure of this set.
Example 4:
For the augmented grammar below,
202
Find the canonical collection of sets of items (i.e. CLOSURE(E‘ → .E))
Represent the GOTO function for this set of items as the transition diagram of a
deterministic finite automaton D.
E‘ → E
E→E+T|T
T → T*F | F
F → (E) | i
Solution:
C = CLOSURE (E‘ → .E)
I0 : E‘ → .E
E → .E + T | .T
T → .T*F | .F
F → .(E) | .i
Expanding I0
I1 : GOTO (I0, E) I2 : GOTO (I0, T)
E‘→ E. E → T.
E → E. + T T → T. * F
I5: GOTO (I0,
I3:GOTO (I2, F) I4:GOTO (I0, () i)
T → F. F → (.E) F → i.
E → .E + T | .T
T → .T*F |
.F
F → .(E) | .i
Expanding I1 Expanding
203
I2
Expanding I4
I8 : GOTO (I4, E) GOTO (I4, T) = I2 GOTO (I4, () = I4
GOTO (I4, F) = I3 GOTO (I4, i) = I5
Expanding I6
I9 : GOTO (I6, T) GOTO (I6, F) = I3 GOTO (I6, () = I4
E → E + T.
T → .T*F GOTO (I6, i) = I5
Expanding I7
I
10: GOTO (I7, F) GOTO (I7, () = I4 GOTO (I7, i) = I5
T → T*F.
Expanding I8
I11: GOTO (I8, )) GOTO (I8, +) = I6
F → (E).
Expanding I9
GOTO (I9, +) = I7
NOTE:
The final state for the given grammar G is I11 since it cannot be expanded
further. Therefore I11 is the closure.
204
Fig. 3.5.5: Deterministic Finite Automaton D
In-text Question 2
LR (K) parsers are so called because they scan the input from left to right and construct a
rightmost derivation in reverse. True or False?
Answer:
True
Example 5:
206
Using the augmented grammar and the corresponding computed canonical
collection of sets of items in example 4 above, construct an SLR (1) parsing
table for the grammar.
We use S to denote states
E‘ → E
E→E+T
E→T
T → T*F
T→F
F → (E)
F→i
Example 6:
Use the SLR parsing table developed in example 5 above to process the string:
i*i+i
Solution:
207
Table 3.5.2: String i * i + i Processing Table
E‟→ Ec
E → ES | S
S → aEb | ab
208
(a) Develop the canonical LR(0) collection of sets of items for this
grammar using the sets of items construction algorithm.
(b) Construct the SLR parsing table
(c) Use the developed parsing table to process the string: aababbc$
4.0 Conclusion/Summary
In the concluding session of this module, you have been taken through the
concept of LR parsing techniques and how to develop SLR (1) parser for any
grammar. You also learnt how to use the generated SLR parser to parse
sentences derived from the grammar.
You also learnt that:
- a context-free grammar is called LR grammar when it is elaborated to
enable scanning the input from left to right (denoted by L) and to produce
a rightmost derivation (denoted by R)
- there are three techniques/algorithms to build tables for an ―LR‖ parser
viz:
a. SLR
b. Canonical LR
c. LALR
- an LR parser consists of two parts, a driver routine and a parsing table
- the driver routine is the same for all LR parsers, only the parsing table
changes from one parser to another.
In the next module you will be taken through the code generation phase of the
compiler.
5.0 Self-Assessment Questions
1. When is a context-free grammar called an LR grammar?
2. List the two parts an LR Parser Consists of
209
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=MWX0-_mHYcc. Watch
the video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=W11NdvhkBuYand critique it in the
discussion forum
c. Take a walk and engage any 3 students on techniques for producing LR
Parsing Tables; In 2 paragraphs summarise their opinion of the discussed topic.
etc.
210
Chapman, Nigel P. (1987). LR Parsing: Theory andPractice.
Cambridge University Press.
Cooper, Keith D. & Torczon, Linda (2004). Engineering a Compiler.
Morgan Kaufmann.
CS.VU.nl, Parsing Techniques - A Practical Guide web page of book
includes downloadable pdf.
Hopcroft, J. E. & Motwani, R. (2000).Introduction to Automata Theory,
Languages, and Computation . Wesley: Addison
Leermakers, R. (1993).The Functional Treatment of Parsing. Boston:
Kluwer Academic.
Muchnick, Steven S. (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
Scott, Michael L. (2009). Programming Language Pragmatics. Third
edition, Morgan Kaufman.
Sebesta, Robert W. (2010). Concepts of Programming Languages. Ninth
edition, Wesley: Addison.
211
MODULE 4
Code Generation
Contents:
Study Session 1: Error Handling
Study Session 2: Symbol Tables
Study Session 3: Intermediate Code Generation
Study Session 4: Code Generation
Study Session 5: Code Optimisation
STUDY SESSION 1
Error Handling
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1- Dealing with Errors
2.2- Historical Notes
2.3- Integrated Development Environment (IDE)
2.4- Compile-time Errors
2.4.1- Errors during Lexical Analysis
2.4.2- Errors during Syntax Analysis
2.4.3- Errors during Semantic Analysis
2.5- Reporting the Position of Run-Time Errors
2.6- Run-Time Speed versus Safety
2.6.1- Cheap Detection of 'Undefined'
2.6.2- How to Check for 'Undefined'
2.6.3- How to Check at Compile-time
212
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
So far in this course you have learnt about the front end of the compiler. In the
concluding module of the course you will be learning more about the back end
of the compiler namely, intermediate code generation, code generation and code
optimisation. To start with, this first study session will be introducing you to the
concept of error detection and recovery in compilers.
213
Fig 4.1.1: Dealing with Errors
lots of mistakes, and may not understand the programming language very well,
so they need clear, precise and jargon-free error reports. Especially in a learning
environment, the main function of a compiler is to report errors in source
programs; as an occasional side-effect you might actually get a program
translated and run.
As a general rule, compiler writers should attempt to express error messages in
moderately plain English, rather than with reference to the official programming
language definition (some language definitions use somewhat obscure or
specialised terminology).
For example, a message "can't convert string to integer" is probably clearer than
"no coercion found."
2.2 Historical Notes
In the 1960s and much of the 1970s, batch processing was the normal way of
using a (large) mainframe computer (personal computers only started to become
household items in the early 1980s). It could well be several hours, or even a
day, from when you handed your deck of punched cards to a receptionist until
you could collect the card deck along with a printed listing of your program,
accompanied either by error messages or by some useful results.
Under such circumstances, it was important that compilers report as many errors
as possible, so part of the job of writing a compiler was to 'recover' from an
error and continue checking (but not translating) in the hope of finding more
errors. Unfortunately, once an error has occurred (especially if the error affects
a declaration), it is quite possible for the compiler to get confused and produce a
host of spurious error reports.
Programrs then had the task of deciding which errors to try and fix, and which
ones to ignore in the hope that they would vanish once earlier errors were fixed.
Some compilers were particularly prone to producing spurious error reports.
The only useful advice that helpdesk staff could provide was: fix the first error,
since the compiler hasn't had a chance to confuse itself at that point.
214
A significant amount of compiler development effort was often devoted to
attempts at error recovery. You could try and guess what the programr might
have intended, or insert some token to at least allow parsing to continue or just
give up on that statement and skip to the next semicolon. The latter action could
skip an end or other significant program structure token and so get the compiler
even more confused.
2.3 Integrated Development Environment (IDE)
Fast personal computers are now available, so IDEs are becoming more
popular, with an editor and compiler tightly coupled and usable from a single
graphical interface. Many IDEs also include a debugger as well. In some cases,
the editor is language-sensitive, so it can supply matching brackets and/or
statement schemas to help reduce the number of trivial errors. An IDE may also
use different colours for different concepts within a source language, e.g.
reserved words in bold, comments in green, constants in blue, or whatever.
This speed and tight coupling allows the compiler writer to adopt a much
simpler approach to errors: the compiler just stops as soon as it finds an error,
and the editor then places the cursor at the point in the source text where the
error was detected and displays some specific error message. Note that the point
where an error was detected could well be some distance after the point where
the error actually occurred.
There were line-mode IDEs back in 1964, many BASIC systems were examples
of such systems; we are going to implement something like this in the book
section case study - a simple interpreter.
215
point of error, or (less conveniently) by providing the line number and column
number of that point.
Remember that the actual position of the error (as distinct from where it was
detected) may well be at some earlier point in the program; in some cases (e.g.
bracket mismatch) the compiler may be able to indicate the nature of the earlier
error.
It is important that error messages be clear, correct, and relevant.
The worst counter-example that Murray Langton has encountered was a
compiler which reported "Missing semicolon" when the actual error was an
extra space in the wrong place. To further confuse matters, no indication was
given as to where in the program the error was. Just to add insult to injury, the
source language didn't even use semicolons!
216
- Invalid numbers: A number such as 123.45.67 could be detected as
invalid during lexical analysis (provided the language does not allow a
full stop to appear immediately after a number). Some compiler writers
prefer to treat this as two consecutive numbers 123.45 and .67 as far as
lexical analysis is concerned and leave it to the syntax analyser to report
an error. Some languages do not allow a number to start with a full
stop/decimal point, in which case the lexical analyser can easily detect
this situation.
2.4.2 Errors during Syntax Analysis
During syntax analysis, the compiler is usually trying to decide what to do next
on the basis of expecting one of a small number of tokens. Hence in most cases
it is possible to automatically generate a useful error message just by listing the
tokens which would be acceptable at that point.
Source: A + * B
Error: | Found '*', expect one of: Identifier, Constant, '('
More specific hand-tailored error messages may be needed in cases of bracket
mismatch.
Source: C := ( A + B * 3 ; Error: | Missing ')' or earlier surplus '('
217
The extent to which such type checking is possible depends very much on the
source language.
PL/1 allows an amazingly wide variety of automatic type conversions, so
relatively little checking is possible.
Pascal is much more fussy; you can't even assign a real value to an integer
variable without explicitly specifying whether you want the value to be rounded
or truncated.
Some writers have argued that type checking should be extended to cover the
appropriate units as well for even more checking, e.g. it doesn't make sense to
multiply a distance by a temperature.
Other possible sources of semantic errors are parameter miscount and subscript
miscount. It is generally an error to declare a subroutine as having 4 parameters
and then call that routine with 5 parameters (but some languages do allow
routines to have a variable number of parameters). It is also generally an error to
declare an array as having 2 subscripts and then try and access an array element
using 3 subscripts (but some languages may allow the use of fewer subscripts
than declared in order to select a 'slice' of the array).
In-text Question 1
During compilation it is always possible to give the precise position at which the error was
detected. True or False?
Answer:
True
219
current IDE's do have a debugging option which may help detect some of these
run-time errors:
- attempt to divide by 0
- overflow (and possibly underflow) during arithmetic operations.
- attempt to use a variable before it has been set to some sensible value
(undefined variable)
- attempt to refer to a non-existent array element (invalid subscript).
- attempt to set a variable (defined as having a limited range) to some value
outside this range
- various errors related to pointers:
- Attempt to use a pointer before it has been set to point to somewhere
useful.
- attempt to use a nil pointer, which explicitly doesn't point anywhere
useful
- attempt to use a pointer which points outside the array it should point to.
- attempt to use a pointer after the memory it points to has been released.
Historically, the main reason for not doing these checks is the effect on
performance. When FORTRAN was first developed (circa, 1957), it had to
compete with code written in assembler; indeed, many of the optimising
techniques used in modern compilers were first developed and used at that time.
C was developed (circa, 1971) initially as a replacement for assembler for use
by experienced system programrs when writing operating systems.
In both the above cases there was a justifiable reason for not doing these checks.
Nowadays, computer hardware is very much faster than it was in 1957 or 1971,
and there are many more less-experienced programrs writing code, so the
arguments for avoiding checks are much weaker. Actually adding the checks on
a supposedly working program can be enlightening/surprising/embarrassing;
even programs which have been 'working' for years may turn out to have a
surprising number of bugs.
220
Hoare was responsible for an Algol 60 compiler in the early 1960s; subscript
checking was always done. Indeed, Hoare has said in "Hints on Programming
Language Design" that: "Carrying out checks during testing and then
suppressing then in production is like a sailor who wears a lifejacket when
training on dry land and then removes the lifejacket when going to sea."
In his book "The Psychology of Computer Programming", Wienberg recounts
the following anecdote:
- After months of effort, a particular application was still not working, so a
consultant was called in from another part of the company. He concluded
that the existing approach could never be made to work reliably. While
on his way home he realised how it could be done. After a few days work
he had a demonstration program working and presented it to the original
programming team.
- Team leader: How long does your program take when processing?
- Consultant: About 10 seconds per case.
- Team leader: But our program only takes 1 second. {Team look smug at
this point}
- Consultant: But your program doesn't work. If the program doesn't have
to work, then I can make it as fast as you like.
- Wirth designed Pascal as a teaching language (circa 1972); for many
Pascal compilers the default was to perform all safety checks. Some
Pascal systems had an option to suppress the checks for some limited part
of the program.
- When a programming language allows the use of pointers and pointer
arithmetic for accessing array elements, the cost of doing checks for
access to non-existent array elements might be significant. Note that it
can indeed be done: each pointer is large enough to contain three
addresses, the first being the one which is directly manipulated and used
by the programr, and the other two addresses being the lower and upper
221
limits on the first. This approach may have problems when the language
allows interconversion between integers and pointers.
In the case of 'undefined variables', note that setting all variables initially to 0 is
a really bad idea (unless the language mandates this of course). Such an initial
setting reduces program portability and may also disguise serious logic errors.
2.6.1 Cheap Detection of 'Undefined'
Murray Langton has had some success in
checking for 'undefined' in a 140,000 line
safety-critical legacy FORTRAN program.
The fundamental idea is to set all global
variables to recognisably strange values
which are highly likely to produce visibly
strange results if used. For an IBM
mainframe, the strange values were: Fig 4.1.2: Cheap Detection of 'Undefined'
222
FORTRAN routine which set all these silly values. If any changes were made to
a COMMON block, it was a simple matter to rerun this analysis program.
During execution, the routine which sets silly values uses less than 0.1% of the
total CPU time. When these silly values were first used, it took several months
to track down and eliminate the resulting flood of asterisks and question marks
which appeared in the output, despite the fact that the program had been
'working' for over 20 years.
2.6.2 How to Check for 'Undefined'
The basic idea is to ensure that all variables are flagged as 'undefined' when
declared. Some languages allow simultaneous declaration and initialization, in
which case a variable is flagged as 'defined'. Whenever a value is assigned to a
variable the flag is changed to 'defined'. Whenever a variable is used the flag is
checked and an error is reported if it is 'undefined'.
In the past a few lucky implementors have had hardware assistance in the form
of an extra bit attached to each word in memory (Burroughs 5000+). On modern
byte-addressable machines you could attach an extra byte to each variable to
hold the flag. Unfortunately, due to alignment requirements, this would tend to
double the amount of memory needed for data (many systems require 4-byte
items such as numbers to have an address which is a multiple of 4; even if
misalignment is allowed its use may slow the program down significantly).
The simplest way of providing a flag is to use some specific value which is
(hopefully) unlikely to appear in practice. Particular values depend on the type
of the variable involved.
- Boolean: Such variables are most likely to be allocated one byteof
storage with 0 for false and 1 for true. A value such as 255 or 128 is a
suitable flag.
- Character: When used for binary input/output, any value could appear,
so no checking is possible. Hence it must be possible to switch off
223
checking in such cases. When used as a character there are many possible
non-printing characters. 127 or 128 or 255 may be suitable choices.
- Integer: Most computer systems use two's complement representation for
negative numbers which gives an asymmetric range (for 16-bits, range is
-32768 to +32767). We can restore symmetry by using the largest
negative number as the 'undefined' flag.
- Real: If your hardware conforms to the IEEE standard (most PC's do)
you can use NaN (not a number).
In-text Question 2
The extent to which type checking is possible is independent of the source language True or
False?
Answer:
False
225
4. Briefly describe some ways to check for ‗undefined. ‘
4.0 Conclusion/Summary
In this study session, you have been taken through the concept of error detection
and recovery in compilers. Certain errors occur at certain stage of compiling
and the way the error is handled depends on the type of error and at what stage
it was detected.
You also learnt that:
- compiler writers should attempt to express error messages in moderately
plain English, rather than with reference to the official programming
language definition
- during compilation it is always possible to give the precise position at
which the error was detected
- it is important that error messages be clear, correct, and relevant
- there is variation as to how the location of the error of division by 0 is
reported
- the extent to which type checking is possible depends very much on the
source language
- in a count-controlled loop you can often check the range of the control
variable by checking the loop bounds before entering the loop.
In the next study session, you will be learning about another concept that is
paramount to error handling which is symbol table organisation.
5.0 Self-Assessment Questions
1. List out some of the run-time errors that can be encountered
2. List out three errors that can occur during Lexical analysis
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=JhZUc3JEl78. Watch the
video & summarise in 1 paragraph
b. View the animation on
226
https://www.youtube.com/watch?v=rMfp6gKR_kkand critique it in the
discussion forum
c. Take a walk and engage any 3 students on compile time errors; In 2
paragraphs summarise their opinion of the discussed topic. etc.
227
Alfred, Aho, V. et al. (2007). Compilers: Principles, Techniques, and
Tools. Second edition. Wesley: Pearson Addison.
Alfred, V. Aho & Jeffrey, D. Ullman. The theory of Parsing,Translation, and
Compiling, available fromACM.org
Andrew, Appel W. (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Chapman, Nigel P. (1987). LR Parsing: Theory and Practice.
Cambridge University Press.
CS.VU.nl, Parsing Techniques - A Practical Guide web page of book
includes downloadable pdf
Keith, D. Cooper & Linda, Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Leermakers, R. (1993). The Functional Treatment of Parsing. Boston:
Kluwer Academic.
Robert, W. Sebesta (2010). Concepts of Programming Languages. Ninth
edition. Addison-Wesley.
Steven, S. Muchnick (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
228
STUDY SESSION 2
Symbol Tables
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Semantic Analysis
2.2 - Symbol Tables
2.2.1 - Symbol Table Information/Entries
2.2.2 - Symbol Table Organisation
2.2.3 - Attribute information
2.2.4 - Characters in a Name
2.3 - Table Organisation
2.3.1 - Storage Allocation Information
2.3.2 -The List Data Structure for Symbol Tables
2.3.3 - Hash Addressing
2.3.3.1 - Hash tables
2.3.4 - Collision Resolution
2.3.4.1 - Re-Hashing
2.3.4.2 - Chaining
2.3.5 - Representing Scope Information
2.3.6 - Implementation of Block Structured Languages
2.4 - Parameter Passing
2.4.1 - Call by Reference
2.4.2 - Call by Value
2.4.3 - Call by Result
2.4.4 - Call by Value Result
2.4.5 - Call by Name
229
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
A compiler needs to collect and use information about the names appearing in
the source program. This information is usually entered into a data structure
called a symbol table. The information collected about a name includes the
string of characters by which it is denoted, its type, its form, its location in
memory and other attributes depending on the language. Information about a
name is entered into the table during lexical and syntactic analysis.
The information collected in the symbol table is used during several stages in
the compilation process. There are also a number of ways in which the symbol
table can be used to aid in error detection and correction. Also space in the
symbol table can be used for code optimisation purposes.
In this study session, you will learn the principal ways of organising and
accessing symbol tables.
230
7. Describe collision resolution methods in hashing and advantages.
2.0 MAIN CONTENT
2.1 Semantic Analysis
Semantic analysis is roughly the equivalent of checking that some ordinary text
written in a natural language
(e.g. English) actually means
something (whether or not
that is what it was intended
to mean).
232
The symbol table entry itself can be set up when the role of a name becomes
clear, with the attribute values being filled in as the information become
available. In some cases, the entry can be initiated from the lexical analyser as
soon as a name is seen in the input. More often, one name may denote several
different objects, even in the same block or procedure.
For example, the C declarations
int x;
struct x {float y, z;};
uses x as both an integer and as the tag of a structure with two fields. In such
cases, the lexical analyser can only return to the parser the name itself rather
than a pointer to the symbol table entry. The record in the symbol table is
created when the syntactic role played by the name is discovered. Attributes of a
name are entered in response to declarations. Labels are identifiers followed by
a colon, so one action associated with recognising such an identifier may be to
enter this fact into symbol table.
233
2.2.2 Symbol Table Organisation
Symbol tables may be implemented using linear lists, hash-tables and various
sorts of tree structures. An issue is the speed with which an entry can be added
or accessed. A linear list is slow to access but simple to implement. A hash
table is fast but more complex. Tree structures give intermediate performance.
In summary, each of these structures has the following characteristics:
Linear List
- O(n) probes per lookup
- easy to expand — no fixed size
- one allocation per insertion
Ordered Linear List
- O(log2n) probes per lookup using binary search
- insertion is expensive (to reorganise list)
Binary Tree
- O(n) probes per lookup, if the tree is unbalanced
- O(log2n) probes per lookup, if the tree is balanced
- easy to expand with no fixed size
- one allocation per insertion
Hash Table
- O(1) probes per lookup on the average
- expansion costs vary with specific scheme
In the abstract, a symbol table is merely a table with two fields, a name field
and an information field. We require several capabilities of the symbol table.
We need to be able to:
- determine whether a given name is in the table,
- add a name to the table,
- access the information associated with a given name, and
- add new information for a given name,
234
- delete a name or group of names from the table.
- in a compiler, the names in the symbol table denote object of various
sorts. There may be separate tables for variable names, labels, procedure
names, constants, field names (for structures) and other types of names
depending on the programming language.
2.2.3 Attribute information
Attributes are internal representation of declarations. Symbol table associates
names with attributes. Names
may have different attributes
such as below depending on their
meaning:
Fig 4.2.2: Attribute information
235
hold a lexeme, utilise space more efficiently if there is only space for a pointer
in the symbol table entry. In the record for a name, a pointer is placed to a
separate array of characters giving the position of the first character of the
lexeme. The indirect scheme permits the size of the name field of the symbol
table entry itself to remain a constant.
The complete lexeme constituting a name must be stored to ensure that all uses
of the same name can be associated with the symbol table record.
S o R t
A
R e a d a r r a Y
I
Arguments are the symbols or identifiers while values are the attributes
obtained from declaration and usage on the symbols. Arguments can be stored
by letting the argument portion of the symbol table contain pointers to the actual
236
symbols stored in the string space. The argument portion may also contain the
length of the symbol in the string space.
Two ways to implement tables are:
i. One-table for all entries
ii. One-table for each entry i.e. if k-entries, then k-tables.
A sorted table is searched using binary search and the average search number of
comparisons is nlog2n. An unsorted table requires on the average
comparisons.
2.3.1 Storage Allocation Information
Information about the storage locations that will be bound to names at run time
is kept in the Symbol table. If machine code is to be generated by the compiler,
then the position of each data object relative to a fixed origin, such as the
beginning of an activation record must be ascertained. The same remark applies
to a block of data loaded as a module separate from the program. For example,
COMMON blocks in FORTRAN are loaded separately, and the positions of
names relative to the beginning of the COMMON block in which they lie must
be determined.
2.3.2 The List Data Structure for Symbol Tables
The simplest and easiest to implement data structure for a symbol table is a
linear list of records. Arrays are used to store their names and their associated
information. New names are added to the list in the order in which they are
encountered. The position of the end of the array is marked by the pointer
available, pointing to where the next symbol table entry will go. The search for
a name proceeds backwards from the end of the array to the beginning. When
the name is located, associated information can be found in the words following
next. If we reach the beginning of the array without finding the name, a fault
occurs- an expected name is not in the table.
237
Making for an entry for a name and looking up the name in the symbol table are
independent operations. In a block-structured language, an occurrence of a
name is in the scope of the most closely nested declaration of the name. This
scope can be implemented by making a fresh entry for a name every time it is
declared. A new entry is made in the words immediately following the pointer
available; that pointer is increased by the size of the symbol table record. Since
entries are inserted in order, starting from the beginning of the array, they
appear in the order they are created in.
Table 4.3.3: A Linear List of Records
If the symbol table contains n names the work necessary to insert a new name is
constant if we do the insertion without checking to see the name is already in
the table. If multiple entries for names are not allowed look the entire table
before discovering that a name is not in the table. Insertions and inquiries take
time proportional to n, names and m inquiries is at most c n (n + e), c is a
constant.
2.3.3 Hash Addressing
This is a method for converting symbols into indices of n-entries in the symbol
table with fixed size. Collision occurs when two or more symbols hashed into
the same index.
2.3.3.1 Hash Tables
Variation of the searching technique is known as hashing. Open hashing refers
to the property that there need be no limit on the number of entries that can be
238
in the table. This scheme gives the capability of performing e enquiries on n
names in time proportional to n (n + e)/m. Since m can be made large up to n
this is more efficient than linear list. In the basic hashing scheme, there are two
parts to the data structure:
a. A hash table consisting of a fixed array of m pointers to table entries.
b. Table entries organised into m separate linked lists, called buckets. Each
record in the symbol table appears on exactly one of these lists. Storage
for the records may be drawn from an array of records. The dynamic
storage allocations facilities of the implementation language can be used
to obtain space for the records.
To determine whether there is an entry for string s in the symbol table, apply a
hash function h to s, such that h(s) returns an integer between 0 and m-1. If s is
in the symbol table, then it is on the list numbered h(s). If s is not in the symbol
table, it is entered by creating a record for s that is linked at the front of the list
numbered h (s). The average list is n/m record long if there are n names in a
table of size m. By choosing m so that n/m is bounded by a small constant the
time to access a table entry is constant. This space taken by the symbol table
consists m words for the hash table and cn words for the table entries, where c is
the number of words per table entry. Thus the space for the hash table depends
only on m and the space for table entries depends only on the number of entries.
The choice of m depends on the intended application for a symbol table. One
suitable approach for computing hash functions is to proceed as follows:
a. Determine a positive integer h from the characters c1, c2, ..., ck in string
s. The conversion of single characters to integers is usually supported by
the implementation language.
b. Convert the integer h determined above into the no. of the list. Divide by
m and take the reminder.
239
A technique for computing h is to add up the integer values of the characters in
a string. Multiply the old value of h by a constant @ before adding in the next
character. That is, h0=0, hi=@ hi-1+ci
1. #define PRIME 211
2. #define EOS ‗\0‘
3. int hashpjw(s)
4. char*s;
5. {
6. char *p;
7. unsigned h=0, g;
8. for( p=0;*p !=EOS; p=p+1){
9. h=(h << 4)+(*p)‘
10.if(g= h@0xf0000000){
11.h=h^(g >>24);
12.h=h ^ g ;
13.}
14.}
15.return h % PRIME;
16.}
In the hash function hashpjw, the sizes included the first primes larger than
100,200,...,1500. A close second was the function that computed the old value
by 6559,i goring overflows, adding in the next character. Function hashpjw is
computed by starting with h=0. For each character c, shift bits of h left 4
positions and add in c. If any of four high-order bits of h is 1, shift four bits
right 24 positions , exclusively-or them into h, and reset to 0 any of the high
order bits that was 1.
pThe sim
240
In-text Question 1
What is Hash Addressing?
Answer
Hash Addressing is a method for converting symbols into indices of n-entries in the symbol
table with fixed size.
2.3.4.1 Re-Hashing
Suppose we hash a symbol s to h and find that a different symbol already
occupies the entry h. Then a collision has occurred. We then compare s against
an entry h + p1(modulo the table size) for some integer p1. If a collision occurs
again, we compare with an entry h + p2 (modulo the table size) for some integer
p2.
This continues until h = h + pi (modulo table size) is empty, contains s or is
again entry h. In other words pi = 0.
In the latter case, we stop the program because the table is full.
If {Pi} is the set if natural numbers then it is a linear rehash otherwise it is non-
linear.
2.3.4.2 Chaining
Suppose we hash a symbol s to h and find that a different symbol already
occupies the entry h, a collision has occurred.
Chaining method uses a hash table called bucket of a fixed size as the symbol
table. It is a table of pointers to the elements of the symbol table and points to
nothing initially. Another pointer points to the last symbol entered into the
241
symbol table. Symbols hash to buckets of the hash table. Each bucket points to
nil or to the first element in the symbol table that hashes to it.
Symbols are entered in first-come-first-served FCFS manner in the symbol
table. The symbol table has an additional pointer field used to chain entries
which hash to the same bucket.
243
1. A hash link that chains the entry to other entries whose names hash to the
same value
2. A scope link that chains all entries in the same scope.
Deletion of entries from the hash table must be done with the care, because
deletion of an affects the previous one on its list. When we delete the i-1st entry
points to i+1st entry.
The i-1st entry can be found if the hash links from a circular link list. We can
use a stack to keep track of the lists containing entries to be deleted. A marker is
placed in the stack when a new procedure is scanned. When we finish
processing the procedure, the list numbers can be popped from the stack until
the marker for the procedure is reached.
2.3.6 Implementation of Block Structured Languages
Languages with block structure such as ALGOL present certain complexities
not found in C. First, in a block-structured language, not only procedures, but
blocks as well, may define their own data. Thus activation records or portions of
activation records must be reserved for blocks. Second, many languages,
ALGOL, for example, permit arrays of adjustable length. Third, the data-
referencing environment of a procedure or block includes all procedures and
blocks surrounding it in the program
Consider the code segment below:
244
Fig. 2: Block Structured Program
The symbol table for this block structure can be arranged as in Figure 3 below:
The way to build the symbol table is to construct a block list consisting of block
numbers, surrounding block numbers, Number of entries and Pointers.
245
Symbols are entered into the table in the order in which their blocks closed. A
block list containing entries, surrounding block numbers, number of entries and
pointers is used to implement this (symbol table) on a stack.
The rule for finding the current declaration corresponding to the use of an
identifier/symbol is first to look in the current block (the one in which the
identifier is used), then the surrounding block and so on until a declaration of
that identifier is found.
Temporary locations may be reserved for identifiers declared whose blocks
have not been closed and then transferred into the main symbol table when their
blocks have closed. We number the blocks in the manner they open.
246
In-text Question 2
List the two ways of resolving collision
Answer
i. Re-hashing
ii. Chaining
247
to the called procedure, but when the called procedure finishes, the results P 1,
P2, P3, …, Pn will be stored in a1, a2,a3, …, an.
2.4.4 Call by Value Result
A parameter can be stored as both value and result. In this case, the local
location of the formal parameters is initialised to the value contained in the
address of the actual parameter and the called procedure returns the result back
to the actual parameter.
2.4.5 Call by Name
This call implementation requires a textual substitution of the formal parameter
name by the actual parameter. It is implemented by using a routine called
THUNK to evaluate the actual parameter at each reference and returns its
address. i.e. Here, the called procedure will be recompiled substituting a1, a2, a3,
…, an for the parameters P1, P2, P3, …, Pn (the name without the address or the
value).
4.0 Conclusion/Summary
In this study session, you have been taken through symbol table and how it can
be constructed for a block structured language. As you have learnt in this study
248
session, symbol table is a very important feature of all compilers and it is made
use of at all phases of compilation.
You also learnt that:
- semantic analysis is roughly the equivalent of checking that some
ordinary text written in a natural language actually means something
- the purpose of semantic analysis is to check that we have a meaningful
sequence of tokens
- a symbol table contains the environmental information concerning the
attributes of various programming language constructs
- symbol tables may be implemented using linear lists, hash-tables and
various sorts of tree structures
- in block-structured implementation, symbols are entered into the table in
the order in which their blocks closed
- parameters can be passed by value, by result, by reference, etc.
In the next study session, you will learn about how codes are generated.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=oyG_JfrbTCQ. Watch the
video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=gxorCOk0YPwand critique it in the
discussion forum
c. Take a walk and engage any 3 students on symbol table information; In 2
paragraphs summarise their opinion of the discussed topic. etc.
249
7.0 Self-Assessment Question Answers
1.
- Call by reference
- Call by value
- Call by result
- Call by value result
- Call by name
2. The purpose of semantic analysis is to check that we have a meaningful
sequence of tokens
250
8.0 References/Further Reading
Alfred, V. Aho, Monica Lam, Ravi Sethi, & Jeffrey D. Ullman (2007).
Compilers: Principles, Techniques, and Tools. Second Edition.
Pearson Addison-Wesley.
Andrew, W. Appel (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Keith, D. Cooper & Linda, Torczon (2004). Engineering a Compiler.
Morgan Kaufmann.
Steven, S. Muchnick (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
Michael, L. Scott (2009). Programming Language Pragmatics. Third
edition. Morgan Kaufman.
Robert, W. Sebesta (2010). Concepts of Programming Languages. Ninth
edition. Addison-Wesley.
251
STUDY SESSION 3
Intermediate Code Generation
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Machine Independent Languages
2.1.1- Intermediate Code Generator
2.2 - Intermediate Languages/Representation
2.2.1 - Graphical Representations
2.2.2 - Three-Address Code
2.2.2.1 - Types of Three-Address Statements
2.2.3 - Stack-based Representation
2.3 - Conversion Algorithms
2.4 - Intermediate Code Generation for Declarations, Expressions,
Commands, and Procedures
2.4.1 - Intermediate Code Generation for Declarations
2.4.2 - Intermediate Code Generation for Assignment Statements
2.4.3 - Intermediate Code Generation for Boolean Expressions
2.4.4 - Intermediate Code Generation for Commands
2.4.5 - Generating Intermediate Code for a Simple Program
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
252
Introduction:
Having learnt about symbol tables in the previous study session, you will be
taken further into code generation in this study session. Code generation phase
of the compiler starts with intermediate code generation. In this study session,
you will learn specifically about three-address code which is the most popular
type of intermediate language representation.
1.0 Study Session Learning Outcomes
After studying this session, I expect you to be able to:
1. Define intermediate representation
2. Define three-address code
3. State with examples types of three-address code
4. Describe stack-based implementation of intermediate representation
5. Convert representations like three-address code to the stack-based code
6. Convert representations stack-based code to three-address code
7. Generate intermediate code for Declarations, Expressions, Commands,
and Procedures.
253
- The compiler becomes optimisable and can be considerably improved.
2.1.1 Intermediate Code Generator
The data structure passed between the analysis and synthesis phases is called the
intermediate representation (IR) of the program. A well designed intermediate
representation facilitates the independence of the analysis and syntheses (front-
and back-end) phases. Intermediate representations may be:
- Assembly language like or
- Be an abstract syntax tree.
In Intermediate code generation we use syntax directed methods to translate the
source program into an intermediate form programming language constructs
such as declarations, assignments and flow-of-control statements.
254
Fig. 4.3.3: A Syntax Tree for the Assignment Statement a:=b*-c+b*-c
255
This same syntax-directed definition will produce the dag if the functions
mkunode(op, child) and mknode(op, left, right) return a pointer to an existing
node whenever possible, instead of constructing new nodes. The token id has an
attribute place that points to the symbol-table entry for the identifier id.name,
representing the lexeme associated with that occurrence of id. lf the lexical
analyser holds all lexemes in a single array of characters, then attribute name
might be the index of the first character of the lexeme. Two representations of
the syntax tree in Figure 1 appear in Figure 3. Each node is represented as a
record with a field for its operator and additional fields for pointers to its
children. In Figure 3(b), nodes are allocated from an array of records and the
index or position of the node serves as the pointer to the node. All the nodes in
the syntax tree can be visited by following pointers, starting from the root at
position IO.
256
Fig. 4.3.5 (b): Two Representations of the Syntax Tree
257
where t1 and t2 are compiler-generated temporary names.
This unravelling of complicated arithmetic expressions and of nested flow-of-
control statements makes three-address code desirable for target code
generation and optimisation. The use of names for the intermediate values
computed by a program allow- three-address code to be easily rearranged –
unlike postfix notation. Three-address code is a linearised representation of a
syntax tree or a dag in which explicit names correspond to the interior nodes of
the graph.
The syntax tree and dag in Figure 1 are represented by the three-address code
sequences in Figure 4.3.4. Variable names can appear directly in three-address
statements, so Figure 4(a) has no statements corresponding to the leaves in
Figure 4.3.4.
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5
Fig. 4.3.4(a): Three-address Code for Syntax Tree
t1 := -c
t2 := b * t1
t5 := t2 + t2
a := t5
Fig. 4.3.4(b): Three-address Code for DAG
The reason for the term ―three-address code‖ is that each statement usually
contains three addresses, two for the operands and one for the result. In the
implementations of three-address code given later in this section, a programr-
defined name is replaced by a pointer to a symbol-table entry for that name.
258
2.2.2.1 Types of Three-Address Statements
Three-address statements are akin to assembly code. Statements can have
symbolic labels and there are statements for flow of control. A symbolic label
represents the index of a three-address statement in the array holding inter-
mediate code. Actual indices can be substituted for the labels either by making a
separate pass, or by using ―back patching,‖. Here are the most common three-
address statements and they are the ones used in the remainder of this course
material:
1. Assignment statements of the form: x := y op z where op is a binary
arithmetic or logical operation;
2. Assignment statements of the form: x := op y where op is a unary
operation. Essential unary operations include unary minus, logical
negation, and shift operators;
3. Copy statements of the form: x := y where the value of y is assigned to x;
4. The unconditional jump goto L. The three-address statement with label L
is the next to be executed;
5. Conditional jumps such as: if x relop y goto L.
This instruction applies a relational operator (<, =, >, <=, etc.) to x and y,
and executes the statement with label L next if x stands in relation relop
to y. If not, the three-address statement following if x relop y goto L is
executed next, as in the usual sequence.;
6. Procedure calls: call p,n and returned values from functions:
return y.
259
param xn
call p,n
generated as part of a call of the function: p(x1, x2,... xn). The integer n
indicating the number of actual parameters in‖call p, n‖ is not redundant
because calls can be nested;
7. Indexed assignments of the form: x := y[i] and x[i] := y.
The first one sets x to the value in the location i memory
units beyond y. The statement x[i] := y sets the contents of
the location i units beyond x to the value of y. In both these
instructions, x, y, and i refer to data objects;
8. Address and pointer assignments: x := &y and x := *y. The first of
these sets the value of x to be the location of y. Presumably y is a name,
perhaps a temporary, that denotes an expression with an I-value such as
A[i, j], and x is a pointer name or temporary. That is, the r-value of x is
the l-value (location) of some object!. In the statement x: = ~y,
presumably y is a pointer or a temporary whose r- value is a location. The
r-value of x is made equal to the contents of that location. Finally, +x: = y
sets the r-value of the object pointed to by x to the r-value of y.
The choice of allowable operators is an important issue in the design of an
intermediate form. The operator set must clearly be rich enough to implement
the operations in the source language. A small operator set is easier to
implement on a new target machine. However, a restricted instruction set may
force the front end to generate long sequences of statements for some source,
language operations. The optimiser and code generator may then have to work
harder if good code is to be generated.
In-text Question 1
Using an intermediate language representation entails two properties. What are they?
Answer:
Retargetability and Optimisability
260
2.2.3 Stack-Based Representation
In this study session, we discuss the stack-based representation of intermediate
code. It has a number of advantages, some of which are:
- an interpreter for the stack-based language tends to be more compact and
straightforward
- a syntax of the language tends to be simple.
- But the representation also has the following disadvantages, which make
it unsuitable for manipulating and improving code:
- it is not trivial to change the order of instructions
- little research has been done to the stack-based code.
Complications with the stack-based code arise often with control flows.
2.3 Conversion algorithms
It is usually trivial to convert representations like three-address code to the
stack-based code, so the case is left as an exercise. It is the inverse of this that is
a challenging problem.
The main task behind the algorithm converting the stack-based code is to
identify dependencies among operations. And it is conditional and
unconditional jumps that make hard to figure these dependencies. So the code
without them can be transformed into the three-address code in a
straightforward way, as follows:
- push 2
- push 3
- push 4
- add
- mul
261
We can see each stack position has a corresponding temporary variable. Put in
another way, store and load are done only by push and pop, respectively, and a
temporary variable that can be accessed at a time is limited to only the top as
opposed to a usual case in which variables are specified freely.
s0 = 2
s1 = 3
s2 = 4
s1 = s1 + s2
s0 = s0 * s1
When a variable is typed, it may be beneficial to adopt SSA form. This
dispenses with the need to analyse what type each variable holds at a moment,
which, as illustrated below, can be quite tricky. The adaptation can be done after
the conversion or simultaneously as the code is being converted.
Now, suppose the execution may not go from top to bottom. In that case, we
basically have to analyse the control flow before translating the code.
More specifically, we calculate how each instruction contributes to the
depth of the stack. For example,
...
goto A // unconditionally jump to label A
...
A: // a label
add // push the sum of two values popped from the stack.
...
As we can see, at label A, the status of the stack depends on the operation
before instruction "goto A".
One conservative approach is to annotate the byte-code before converting it.
The basic idea is that when we interpret the code, we know both where we are
and how tall the stack is. So by pretending as if we were evaluating the code, we
can calculate the height of the stack for each position. An algorithm for this
262
would be like (Note that in actual writing you have to arrange the code so that it
will terminate):
procedure eval(start, depth)
{
for i from start to code.length
{
depth_at[i] = depth
case code[i]
{
'push': depth = depth + 1
'pop': depth = depth - 1
'goto': i = code[i].target
'if_so_goto': eval(code[i].target, depth)
...
}
}
}
eval(0, 0) // start the calculation
Coding the above solution may be tedious in practice, especially when a number
of instructions in the language is large. Java byte code is a notable example of
this. So a radical alternative below is to convert the stack-based code not in the
usual sequential way but per basic block (i.e., a block that has no jumps). To see
this, consider the following:
0 (A): push 10
1 (A): push 13
2 (A): less_than // pop < pop
3 (A): if_not_goto 6
4 (B): push '10 < 13'
263
5 (B): goto 7
6 (C): push 'it is not 10 < 13; your computer is broken!'
7 (C): print
In the above we can identify three basic blocks: the first (A) from line 0 to 3, the
second (B) from line 4 to 5 and the third (C) from line6 to 7. We first compile
A, then we know the height of the stack with which either B or C begins. After
each block is compiled, we output blocks in the order they appear in the source
code.
If you are an astute reader/learner, you would notice that throughout this section
we are assuming the depth of the stack is fixed at each instruction position and
thus can be determined at compiler time. If the assumption does not hold, then
we have to have some kind of stack at runtime.
2.4 Intermediate Code Generation for Declarations, Expressions,
Commands, and Procedures
2.4.1 Intermediate Code Generation for Declarations
264
2.4.2 Intermediate Code Generation for Assignment Statements
265
Example 2: a or b and not c
t1 := not c
t2 := b and t1
t3 := a or t2
Example 3: a < b
100: if (a < b) goto 103
101: t := 0
102: goto 104
103: t := 1
104:
266
2.4.4 Intermediate Code Generation for Commands
goto Lnext
L2: if (c < d) goto L3
goto L4
L3: t1 := y + z
267
x := t1
goto L1
L4: t2 := y – z
x := t2
goto L1
Lnext:
In-text Question 2
Certain conversion algorithms can be applied to convert stack based code of intermediate
representation to three-address code and vice versa. True or False?
Answer:
True
268
2.4.5 Generating Intermediate Code for a Simple Program
Generate three-address code for the following program fragment:
269
{
a[i] = getch();
++i;
}
Ssort(a, N);
Three-address Code:
270
20) t5 := 4 * min 42) param N
21) t6 := a[t5] 43) call s, 2
enterproc(t, m, proc,
22) t7 := 4 * i 44) 40+2*4)
4.0 Conclusion/Summary
In this study session, you have learnt about intermediate code generation in
compilers.
You also learnt that:
- the front part of the compiler usually translates the source code program
into an intermediate language representation, which after that is
converted into a machine code for the concrete target computer
- using an intermediate language representation entails two properties:
retargetability and optimisability
- a popular intermediate language is the three-address code
- certain conversion algorithms can be applied to convert stack-based code
of intermediate representation to three-address code and vice versa.
In the next study session, you will learn about code generation.
271
5.0 Self-Assessment Questions
1. What is the purpose of the front part of the compiler?
2. List the three types of intermediate representation
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=xtouovp9kvQ. Watch the
video & summarise in 1 paragraph
b. View the animation on
https://www.youtube.com/watch?v=XoCpoAuXW7kand critique it in the
discussion forum
c. Take a walk and engage any 3 students on conversion algorithms; In 2
paragraphs summarise their opinion of the discussed topic. etc.
272
8.0 References/Further Reading
Alfred, Aho V. et al. (2007). Compilers: Principles, Techniques, and
Tools. Second Edition. Wesley: Pearson Addison.
Andrew, Appel W. (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Cooper, Keith D. & Torczon, Linda (2004). Engineering a Compiler.
Morgan Kaufmann.
Muchnick, Steven S. (1997). Advanced Compiler Design and
Implementation. Morgan Kaufmann.
Scott, Michael L. (2009). Programming Language Pragmatics.
Third Edition. Morgan Kaufman.
Sebesta Robert W. (2010). Concepts of Programming Languages. Ninth
Edition. Wesley: Addison
273
STUDY SESSION 4
Code Generation
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Code Generation
2.2 - Issues in the Design of a Code Generator
2.2.1 - Input to the Code Generator
2.2.2 - Target Programs
2.2.3 - Memory Management
2.2.4 - Instruction Selection
2.2.5 - Register Allocation
2.2.6 - Choice of Evaluation Order
2.3 - Approaches to Code Generation
2.4 - Run-Time Storage Management
2.4.1 - Static Allocation
2.4.2 - Stack Allocation
2.4.3 - Run-Time Addresses for Names
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
Introduction:
In the previous study session, you have learnt about intermediate code
generation and the various intermediate languages that can be used for
274
intermediate code generation. In this study session, you will learn about the
final phase of a compiler, which is code generation. It takes as input an
intermediate representation of the source program and produces as output an
equivalent target program.
275
Fig. 4.4.1: Position of Code Generator
277
2.2.3 Memory Management
Mapping names in the source program to addresses of data objects in run time
memory is done cooperatively by the front end and the code generator. We
assume that a name in a three-address statement refers to a symbol table entry
for the name.
If machine code is being generated, labels in three address statements have to be
converted to addresses of instructions. This process is analogous to the ―back
patching‖. Suppose that labels refer to quadruple numbers in a quadruple array.
As we scan each quadruple in turn we can deduce the location of the first
machine instruction generated for that quadruple, simply by maintaining a count
of the number of words used for the instructions generated so far. This count
can be kept in the quadruple array (in an extra field), so if a reference such as j:
goto i is encountered, and i is less than j, the current quadruple number, we may
simply generate a jump instruction with the target address equal to the machine
location of the first instruction in the code for quadruple i. If, however, the jump
is forward, so i exceeds j, we must store on a list for quadruple i the location of
the first machine instruction generated for quadruple j. Then we process
quadruple i, we fill in the proper machine location for all instructions that are
forward jumps to i.
2.2.4 Instruction Selection
The nature of the instruction set of the target machine determines the difficulty
of instruction selection. The uniformity and completeness of the instruction set
are important factors. If the target machine does not support each data type in a
uniform manner, then each exception to the general rule requires special
handling.
Instruction speeds and machine idioms are other important factors. If we do not
care about the efficiency of the target program, instruction selection is
straightforward. For each type of three- address statement we can design a code
skeleton that outlines the target code to be generated for that construct.
278
For example, every three address statement of the form x: = y + z, where x, y,
and z are statically allocated, can be translated into the code sequence
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0
ADD e, R0
MOV R0, d
Here the fourth statement is redundant, and so is the third if ‗a‘ is not
subsequently used.
The quality of the generated code is determined by its speed and size.
A target machine with a rich instruction set may provide several ways of
implementing a given operation. Since the cost differences between different
implementations may be significant, a naive translation of the intermediate code
may lead to correct, but unacceptably inefficient target code. For example if the
target machine has an ―increment‖ instruction (INC), then the three address
statement a := a+1 may be implemented more efficiently by the single
279
instruction INC a, rather than by a more obvious sequence that loads a into a
register, add one to the register, and then stores the result back into a.
MOV a, R0
ADD #1,R0
MOV R0, a
Instruction speeds are needed to design good code sequence but unfortunately,
accurate timing information is often difficult to obtain. Deciding which machine
code sequence is best for a given three address construct may also require
knowledge about the context in which that construct appears.
2.2.5 Register Allocation
Instructions involving register operands are usually shorter and faster than
those involving operands in memory. Therefore, efficient utilisation of register
is particularly important in generating good code. The use of registers is often
subdivided into two sub-problems:
1. During register allocation, we select the set of variables that will reside
in registers at a point in the program.
2. During a subsequent register assignment phase, we pick the specific
register that a variable will reside in.
Finding an optimal assignment of registers to variables is difficult, even with
single register values. Mathematically, the problem is NP-complete. The
problem is further complicated because the hardware and/or the operating
system of the target machine may require that certain register usage conventions
be observed.
Certain machines require register pairs (an even and next odd numbered
register) for some operands and results. For example, in the IBM System/370
machines integer multiplication and integer division involve register pairs. The
multiplication instruction is of the form:
M x, y
280
where x, is the multiplicand, is the even register of an even/odd register pair.
The multiplicand value is taken from the odd register pair. The multiplier y is a
single register. The product occupies the entire even/odd register pair.
D x, y
where the 64-bit dividend occupies an even/odd register pair whose even
register is x; y represents the divisor. After division, the even register holds the
remainder and the odd register the quotient.
Now consider the two three address code sequences (a) and (b) in figure 2
below in which the only difference is the operator in the second statement. The
shortest assembly sequence for (a) and (b) are given in (c).
Ri stands for register i. L, ST and A stand for load, store and add respectively.
The optimal choice for the register into which ‗a‘ is to be loaded depends on
what will ultimately happen to e.
t := a +
t := a + b b
:= t +
t := t * c t c
t := t / d t := t / d
(a) (b)
281
L R1, a L R0, a
A R1, b A R0, b
M R0, c A R0, c
R0,
D R0, d SRDA 32
ST R1, t D R0, d
ST R1, t
(a) (b)
2.2.5 Choice of Evaluation
Fig.4.4. 3: Optimal machine code sequence Order
The order in which computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold
intermediate results than others. Picking a best order is another difficult, NP-
complete problem. Initially, we shall avoid the problem by generating code for
the three -address statements in the order in which they have been produced by
the intermediate code generator.
2.3 Approaches to Code Generation
The most important criterion for a code generator is that it produces correct
code. Correctness takes on special significance because of the number of special
cases that code generator must face. Given the premium on correctness,
designing a code generator so it can be easily implemented, tested, and
maintained is an important design goal.
2.4 Run-Time Storage Management
The semantics of procedures in a language determines how names are bound to
storage during allocation. Information needed during an execution of a
procedure is kept in a block of storage called an activation record; storage for
names local to the procedure also appears in the activation record.
An activation record for a procedure has fields to hold parameters, results,
machine-status information, local data, temporaries and the like. Since run-time
282
allocation and de-allocation of activation records occurs as part of the procedure
call and return sequences, we focus on the following three-address statements:
1. Call
2. Return
3. Halt
4. action, a placeholder for other statements
For example, the three-address code for procedures c and p in fig. 4 contains
just these kinds of statements. The size and layout of activation records are
communicated to the code generator via the information about names that is in
the symbol table. For clarity, we show the layout in Fig. 4 rather than the form
of the symbol-table entries.
We assume that run-time memory is divided into areas for code, static data, and
a stack.
283
2.4.1 Static Allocation
Consider the code needed to implement static allocation. A call statement in the
intermediate code is implemented by a sequence of two target-machine
instructions. A MOV instruction saves the return address, and a GOTO
instruction transfers control to the target code for the called procedure:
GOTO callee.code_area
The code for a procedure ends with a return to the calling procedure ends with a
return to the calling procedure, except the first procedure has
no caller, so its final instruction is HALT, which presumably returns control to
the operating system. A return from procedure callee is implemented by
GOTO *callee.static_area
this transfers control to the address saved at the beginning of the activation
record.
Example 1:
The code in Figure 5 is constructed from the procedures c and p in Figure 4. We
use the pseudo-instruction ACTION to implement the statement action, which
284
represents three-address code that is not relevant for this discussion. We
arbitrarily start the code for these procedures at addresses 100 and 200,
respectively, and assume that each ACTION instruction takes 20 bytes. The
activation records for the procedures are statically allocated starting at location
300 and 364, respectively.
285
In-text Question 1
Information needed during an execution of a procedure is kept in a block of storage called an
activation record. True / False?
Answer:
False
286
A procedure call sequence increments SP, saves the return address, and
transfers control to the called procedure:
ADD #caller.recordsize, SP
MOV #here+16, SP /* save return address*/
GOTO callee.code_area
SUB #caller.recordsize, SP
Example 2:
The program in figure 4.4.4 is a condensation of the three-address code for the
Pascal program for reading and sorting integers. Procedure q is recursive, so
more than one activation of q can be alive at the same time.
287
Fig. 4.4.4:Three-Address Code to Illustrate Stack Allocation
Suppose that the sizes of the activation records for procedures s, p, and q have
been determined at compile time to be ssize, psize, and qsize, respectively. The
first word in each activation record will hold a return address. We arbitrarily
assume that the code for these procedures starts at addresses 100,200 and 300
respectively, and that the stack starts at 600. The target code for the program in
figure 4.4.4 is as follows:
/*code for s*/
100: MOV #600, SP /*initialize the stack*/
108: ACTION1
128: ADD #ssize, SP /*call sequence begins*/
136: MOV #152, *SP /*push return address*/
144: GOTO 300 /*call q*/
152: SUB #ssize, SP /*restore SP*/
160: ACTION2
180: HALT
………
288
/*code for p*/
200: ACTION3
289
We assume that ACTION4 contains a conditional jump to the address 456of the
return sequence from q; otherwise, the recursive procedure q is condemned to
call itself forever. In an example below, we consider an execution of the
program in which the first call of q does not return immediately, but all
subsequent calls do.
If ssize, psize, and qsize are 20,40, and 60, respectively, then SP is initialized to
600, the starting address of the stack, by the first instruction at address 100. SP
holds 620 just before control transfers
from s to q, because ssize is 20. Subsequently, when q calls p, the instruction at
address 320 increments SP to 680, where the activation record for p begins; SP
reverts to 620 after control returns to q. If the next two recursive calls of q
return immediately, the maximum value of SP during this execution is 680.
However, the last stack location used is 739, since the activation record for q
starting at location 680 extends for 60 bytes.
In-text Question 2
Static allocation cannot become stack allocation by using relative addresses for storage in
activation records. True or False?
Answer:
True
2.4.3 Run-Time Addresses for Names
The storage allocation strategy and the layout of local data in an activation
record for a procedure determine how the storage for names is accessed.
If we assume that a name in a three-address statement is really a pointer to a
symbol-table entry for the name; it makes the compiler more portable, since the
front end need not be changed even if the compiler is moved to a different
machine where a different run-time organisation is needed. On the other hand,
generating the specific sequence of access steps while generating intermediate
code can be of significant advantage in an optimising compiler, since it lets the
optimiser take advantage of details it would not even see in the simple three-
address statement.
290
In either case, names must eventually be replaced by code to access storage
locations. We thus consider some elaborations of the simple three-address
statement x: = 0. After the declarations in a procedure are processed, suppose
the symbol-table entry for x contains a relative address 12 for x. First consider
the case in which x is in a statically allocated area beginning at address static.
Then the actual run-time address for x is static+12. Although, the compiler can
eventually determine the value of static+12 at compile time, the position of the
static area may not be known when intermediate code to access the name is
generated. In that case, it makes sense to generate three-address code to
―compute‖ static+12, with the understanding that this computation will be
carried out during the code-generation phase, or possibly by the loader, before
the program runs. The assignment x := 0 then translates into
static [12] := 0
If the static area starts at address 100, the target code for this statement is
MOV #0, 112
On the other hand, suppose our language is one like Pascal and that a display is
used to access non-local names. Suppose also that the display is kept in
registers, and that x is local to an active procedure whose display pointer is in
register R3. Then we may translate the copy x := 0 into the three-address
statements
t1:= 12+R3
*t1:= 0
291
3.0 Tutor Marked Assignment
1. For the following program in Simple, generate stack code.
A program in Simple
let
integer
n,x,n.
in
read n;
if n < 10 then x := 1; else
skip; fi; while n < 10 do x
:= 5*x; n := n+1;
end;
skip;
write
n;
write
x;
end
4.0 Conclusion/Summary
In this study session, you have been taken through the basic issues in code
generation.
You also learnt that:
- the primary objective of the code generator, the final phase of the
compiler, is to convert atoms or syntax trees to instructions
- code generator takes as input an intermediate representation of the source
program and produces as output an equivalent target program
292
- producing an assembly language program as output makes the process of
code generation somewhat easier
- producing a relocatable machine language program as output allows
subprograms to be compiled separately
- mapping names in the source program to addresses of data objects in run
time memory is done cooperatively by the front end and the code
generator
- information needed during an execution of a procedure is kept in a block
of storage called an activation record
- static allocation can become stack allocation by using relative addresses
for storage in activation records
- relative addresses in an activation record can be taken as offsets from any
known position in the activation record.
In the next study session, which is the concluding study session of this course
you will be learning about code optimisation.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit U-tube https://www.youtube.com/watch?v=-ti07Z0xKKg. Watch the
video & summarise in 1 paragraph
b. View the animation onhttps://www.youtube.com/watch?v=Zbf_XzgtEQ4 and
critique it in the discussion forum
c. Take a walk and engage any 3 students on Issues in the design of a code
generator; In 2 paragraphs summarise their opinion of the discussed topic. etc.
293
7.0 Self-Assessment Question Answers
1. The primary objective of the code generator is to convert atoms or syntax
trees to instructions.
2. Mapping names in the source program to addresses of data objects in run
time memory is done cooperatively by the front end and code generator.
8.0 References/Further Reading
Alfred, V. Aho et al. (2007). Compilers: Principles, Techniques, and
Tools. Second Edition. Wesley: Pearson Addison.
Andrew, W. Appel (2002). Modern Compiler Implementation in Java.
Second edition. Cambridge University Press.
Cooper, Keith D. & Torczon, Linda (2004). Engineering a Compiler.
Morgan Kaufmann.
Muchnick, Steven S. (1997).Advanced Compiler Design
andImplementation. Morgan Kaufmann.
Scott, Michael L. (2009). Programming Language Pragmatics.
Thirdedition, Morgan Kaufman.
Sebesta,Robert W. (2010). Concepts of Programming Languages. Ninth edition.
Wesley: Pearson Addison.
294
STUDY SESSION 5
Code Optimisation
Section and Subsection Headings:
Introduction
1.0 Learning Outcomes
2.0 Main Content
2.1 - Code Optimisation
2.2- Criteria for Code-Improving Transformations
2.3- Improving Transformations
2.4- Optimising Compiler
2.5- Common Optimisation Algorithms
2.5.1- Function-Preserving Transformations
2.5.1.1 - Copy propagation
2.5.1.2 - Dead Code Elimination
2.5.1.3 - Common Sub-expression Identification/Elimination
2.5.2 - Loop Optimisations
2.5.2.1 - Strength Reduction
2.5.2.2- Induction variable elimination
2.5.2.3- Code Motion
2.5.3- Function Chunking
3.0 Tutor Marked Assignment
4.0 Study Session Summary and Conclusion
5.0 Self-Assessment Question(s)
6.0 Additional Activities (Videos, Animations & Out of Class activities)
7.0 Self-Assessment Question Answers
8.0 References/Further Reading
295
Introduction:
In the concluding study session of the course, you will be learning about code
optimisation. Code optimisation is code transformation techniques that are
applied to the intermediate codes to make faster and better in terms of
performance and memory management. Although, the optimisation phase is an
optional phase in compilers it makes for better and efficient code generation
when it is present.
Optimisation is a very rich and complex topic, so this study session will only
attempt to introduce the basics.
296
Design using good algorithms.
- Ensure your data structures match the algorithms.
- Structure using modules with clean simple interfaces.
- Implement using clear straightforward code.
- When there is an overall performance problem
- Measure the actual performance in reasonable detail.
- Identify the troublesome areas.
- Redesign and re-implement these problem areas.
In this study session, we will consider various algorithms and data structures
and discuss their likely impact on performance.
Note that actual measurement is crucial, since the problems are often not where
you guess they might be. For your initial implementation you may well have
selected simple algorithms which are known to perform poorly in order to get
something working quickly. Nevertheless, you should still measure performance
in detail, since there may be some other source of (at least some of) your
problems.
If you are very lucky, your implementation language might have some optional
facility for selectively measuring CPU time. Take care to only activate such a
feature for a few crucial routines; the timing overhead could easily exceed the
execution time for small routines and distort the result.
More commonly you will be unlucky and will have to explicitly add timing
code to selected routines; make sure you can easily disable it and enable it as
required. Typically, you will have to insert calls to some CPU timing routine at
the beginning and end of a routine, and then subtract the two values to get the
time for that routine, which will include the time for any routines called by it.
Various measurements on the performance of actual compilers have been
reported over the years. Specific areas which have been known to cause
problems include:
297
- multiple routine calls during lexical analysis for each and every source
character
- skipping white space during lexical analysis
- skipping over a comment
- decoding a tightly packed parse table during syntax analysis
- looking things up in the name table during semantic analysis
- determining whether some name is a reserved keyword or a user-
definable identifier.
Optimisation within a compiler is concerned with improving in some way the
generated object code while ensuring the result is identical. Technically, a
better name for this unit might be "Improvement", since compilers only attempt
to improve the operations the programr has requested. Optimisations fall into
three categories:
a. Speed: improving the runtime performance of the generated object code.
This is the most common optimization
b. Space: reducing the size of the generated object code
c. Safety: reducing the possibility of data structures becoming corrupted
(for example, ensuring that an illegal array element is not written to).
Unfortunately, many "speed" optimisations make the code larger, and many
"space" optimisations make the code slower. This is known as the space-time
trade-off.
298
original version of the source program. The influence of this criterion pervades
this chapter; at all times we take the "safe" approach of missing an opportunity
to apply a transformation rather than risk changing what the program does.
Second, a transformation must, on the average, speed up programs by a
measurable amount. Sometimes we are interested in reducing the space taken by
the compiled code, although the size of code has less importance than it once
had. Of course, not every transformation succeeds in improving every program,
and occasionally an "optimisation" may slow down a program slightly, as long
as on the average it improves things.
Third, a transformation must be worth the effort. It does not make sense for a
compiler writer to expend the intellectual effort to implement a code improving
transformation and to have the compiler expend the additional time compiling
source programs if this effort is not repaid when the target programs are
executed. Certain local or "peephole" transformations of the kind are simple
enough and beneficial enough to be included in any compiler.
Some transformations can only be applied after detailed, often time-consuming,
analysis of the source program, so there is little point in applying them to
programs that will be run only a few times. For example, a fast, no optimising,
compiler is likely to be more helpful during debugging or for "student jobs‖ that
will be run successfully a few times and thrown away. Only when the program
in question takes up a significant fraction of the machine's cycles does improved
code quality justify the time spent running an optimising compiler on the
program.
299
The main subjects of research are machine-independent optimisations. They are
implemented by algorithms that improve the target code without taking into
consideration any properties of the target machine. Making machine-dependent
optimisations, such as register allocation and utilisation of machine idioms is
also possible.
The purpose behind making such optimisations is to make more efficient the
most frequently executed parts of the compiler.
In-text Question 1
Optimisations fall into three categories. What are they?
Answer
Speed, space and safety
300
- Loop optimisations
- induction variables and reduction in strength
- code motion
- Function Chunking.
A code improving transformation is called local if it is performed by looking at
statements within one concrete block. Respectively, a code improving
transformation is global if it is performed by looking at statements not only in
one concrete block, but also outside in global and other outside blocks.
2.5.1 Function-Preserving Transformations
There are a number of ways in which a compiler can improve a program
without changing the function it computes. Common sub-expression
elimination, copy propagation, dead-code elimination, and constant folding are
common examples of such function-preserving transformations. The other
transformations come up primarily when global optimisations are performed.
Frequently, a program will include several calculations of the same value, such
as an offset in an array. Some of these duplicate calculations cannot be avoided
by the programmer because they lie below the level of detail accessible within
the source language. For example, block B5 shown in figure 1 recalculates 4*i
and 4*j.
{
int i,j;
301
int v,x;
if ( n <= m ) return;
i = m - 1;
j = n;
v = a[ n ];
while ( 1 )
{
do i = i + 1; while ( a[ i ] < v );
do j = j - 1; while ( a[ j ] > v );
if ( I >= j ) break;
x = a[ i ];
a[ i ] = a[ j ];
a[ j ] = x;
}
x = a[ i ];
a[ i ] = a[ n ];
302
a[ n ] = x;
quicksort( m, j );
quicksort( i+1, n );
}
Three-address Code :
1) i := m – 1 16 ) t7 := 4 * i
2) j := n 17 ) t8 := 4 * j
18 ) t9 := a[ t8
3) t1 := 4 * n ]
4) v := a[ t1 ] 19 ) a[ t7 ]:= t9
20 ) t10 := 4 *
5) i := i + 1 j
6) t2 := 4 * i 21 ) a[ t10 ]:= x
303
7) t3 := a[ t2 ] 22 ) goto (5)
8) if ( t3< v ) goto (5) 23 ) t11 := 4 * i
9) j := j – 1 24 ) x := a[ t11 ]
10 ) t4 := 4 * j 25 ) t12 := 4 * i
t13 := 4 *
11 ) t5 := a[ t4 ] 26 ) n
12 ) if ( t5> v ) goto (9) 27 ) t14 := a[ t13 ]
if ( i >= j ) goto (23) 28
13 ) ) a[ t12 ]:= t14
29 ) t15 := 4 *
14 ) t6 := 4 * i n
15 ) x := a[ t6 ] 30 ) a[ t15 ]:= x
B1
i := m – 1
j := n
t1 := 4 * n
v := a[ t1 ]
B2
i := i + 1
t2 := 4 * i
t3 := a[ t2 ]
if ( t3< v ) goto B2
B3
j := j – 1
t4 := 4 * j
t5 := a[ t4 ]
if ( t5> v ) goto B3
B4
if ( i >= j ) goto B6
304
B5 B6
t6 := 4 * i t11 := 4 * i
x := a[ t6 ] x := a[ t11 ]
t7 := 4 * i t12 := 4 * i
t8 := 4 * j t13 := 4 * n
t9 := a[ t8 ] t14 := a[ t13 ]
a[ t7 ]:= t9 a[ t12 ]:= t14
t10 := 4 * j t15 := 4 * n
a[ t10 ]:= x a[ t15 ]:= x
goto B2 2.5.1.1 Copy
Propagation
Input: A flow graph
with a set of copies x= y that reach a block in the graph along every path, with
no assignment of x or y following the last occurrence of x= y on the path to the
block.
Output: A revised flow graph
Algorithm: For each copy x := y , do the following:
1. Determine those uses of x that are reached by this definition of x namely,
s : x := y
2. Determine whether for every use of x found in 1 ) no definitions of x or y
can occur prior to this use of x within the ultimate block of use.
3. If the block s meets the conditions of 2 ) then remove s and replace
all uses of x found in 1 ) by y.
Block B5 in the code below can be further improved by eliminating x using two
new transformations. One concerns assignments of the form f:=g called copy
statements, or copies for short. For example, when the common sub expression
in c:=d+e is eliminated in Figure 2, the algorithm uses a new variable t to hold
the value of d+e. Since control may reach c:=d+e either after the assignment to
305
a or after the assignment to b, it would be incorrect to replace c:=d+e by either
c:=a or by c:=b.
The idea behind the copy-propagation transformation is to use g for f, wherever
possible after the copy statement f:=g. For example, the assignment x:=t 3 in
block B5 of Figure 2 is a copy.
Copy propagation applied to B5 yields:
x:=t3
a[t2]:=t5
a[t4]:=t3
goto B2
This may not appear to be an improvement, but as we shall see, it gives us the
opportunity to eliminate the assignment to x
B1
i := m - 1
j := n
t1 := 4 * n
v := a[ t1 ]
B
2
i := i + 1
t2 := 4 * i
t3 := a[ t2 ]
306
if ( t3< v ) goto B2
B
3
j := j - 1
t4 := 4 * j
t5 := a[ t4 ]
if ( t5> v ) goto B3
B4
if ( i >= j ) goto B6
B5 B6
//x := t3 //x := t3
a[ t2 ]:= t5 t14 := a[ t1 ]
a[ t4 ]:= t3 a[ t2 ]:= t14
goto B2 a[ t1 ]:= t3
307
It is obvious that the complicated calculation will never be performed; since the
last value assigned to a before the IF statement is a constant, we can calculate
the result at compile-time. Simple substitution of arguments produces if (5 !=
5), which is false. Since the body of an if(false) statement will never execute - it
is dead code we can rewrite the code:
a=5
// some statements
The algorithm was used to identify and remove sections of dead code
2.5.1.3 Common Sub-Expression Identification/Elimination
Common sub-expression elimination is a speed optimisation that aims to reduce
unnecessary recalculation by identifying, through code-flow, expressions (or
parts of expressions) which will evaluate to the same value: the re-computation
of an expression can be avoided if the expression has previously been computed
and the values of the operands have not changed since the previous
computation.
2.5.1.3.1 Common Sub-Expression Identification Algorithm
Input: A flow graph with available expression information
Output: A revised flow graph
Algorithm: For every statement s of the form x := y + z such that y + zis
available at the beginning of s‘s block, and neither y nor z is defined prior to
statement s in that block, do the following:
1. To discover the evaluations of y + z that reach s‘s block follow going
through any block that evaluates y + z /. The last evaluation of y + z in
each block encountered is an evaluation of y + z that reaches x.
2. Create a new variable u
3. Replace each statement w: = y + z found in (1) by
u := y + z
w := u
4. Replace statement s by x := u
308
Example 1: u := 4 * i
t2 := 4 * i t2 := u
t3 := a[ t2 ] t3 := a[ t2 ]
t t
6 := 4 * i 6 := u
t7 := a[ t6 ] t7 := a[ t6 ]
B
1
i := m - 1
j := n
t1 := 4 * n
v := a[ t1 ]
B
2
i := i + 1
t2 := 4 * i
t3 := a[ t2 ]
if ( t3< v ) goto B2
B
3
j := j - 1
t4 := 4 * j
t5 := a[ t4 ]
if ( t5> v ) goto B3
B4
if ( i >= j ) goto B6
309
B5 B6
x := t3 x := t3
a[ t2 ]:= t5 t14 := a[ t1 ]
a[ t4 ]:= x a[ t2 ]:= t14
goto B2 a[ t1 ]:= x
Example 2:
Consider the following program:
a=b+c
d=e+f
g=b+c
In the above example, the first and last statement's right hand side are identical
and the value of the operands do not change between the two statements; thus
this expression can be considered as having a common sub-expression.
The common sub-expression can be avoided by storing its value in a temporary
variable which can cache its result. After applying this Common Sub-
Expression Elimination technique, the program becomes:
t0 = b + c
a = t0
d=e+f
g = t0
310
code motion, which moves code outside a loop; induction-variable elimination,
which we apply to eliminate I and j from the inner loops B2 and B3 and,
reduction in strength, which replaces and expensive operation by a cheaper one,
such as a multiplication by an addition.
There are three main techniques for loop optimisation (loops are usually
processed inside out):
- Strength Reduction which replaces an expensive (time consuming)
operator by a faster one;
- Induction Variable Elimination which eliminates variables from inner
loops;
- Code Motion which moves pieces of code outside loops.
2.5.2.1 Strength Reduction
This concept refers to the compiler optimisation method of substituting some
machine instruction by a cheaper one and still maintaining equivalence in
results. Certain machine instructions are considerably cheaper than others and
can often be used as special cases of more expensive operators. For example, x²
is invariably cheaper to implement as x*x than as a call to an exponentiation
routine. Fixed-point multiplication or division by a power of two is cheaper to
implement as a shift. Floating-point division by a constant can be implemented
as multiplication by a constant, which may be cheaper.
This type of optimisation can generate high gains especially when targeting
different hardware and the compiler is aware of the subtle differences it can
benefit from.
Example 3:
Apply strength reduction to the code below:
B1
i := m - 1
j := n
311
t1 := 4 * n
v := a[ t1 ]
B2
B3
j := j - 1
t4 := 4 * j
t5 := a[ t4 ]
if ( t5> v ) goto B3
B
4
if ( i >= j ) goto B6
B5 B6
Solution:
As the relationship t4:=4*j surely holds after such an assignment to t 4 in the
code above and t4 is not changed elsewhere in the inner loop around B3, it
follows that just after the statement j:=j-1 the relationship t4 := 4*j-4 must hold.
We may therefore replace the assignment t4:= 4*j by t4:= t4-4. The only problem
is that t4 does not have a value when we enter block B3 for the first time. Since
we must maintain the relationship t4=4*j on entry to the block B3, we place an
initialisation of t4 at the end of the block where j itself is initialised, shown by
the last line of block B1 in the code below
The replacement of a multiplication by a subtraction will speed up the object
code if multiplication takes more time than addition or subtraction, as is the case
on many machines
After Strength Reduction is applied to 4*j in block B3 we have:
B1
i := m - 1
312
j := n
t1 := 4 * n
v := a[ t1 ]
t4 := 4 * j
B2
B3
j := j - 1
t4 := t4 - 4
t5 := a[ t4 ]
if ( t5> v ) goto B3
B4
if ( i >= j ) goto B6
B5 B6
Diagrammatically it can be depicted as in figure 1 below
313
2.5.2.2 Induction Variable Elimination
Some loops contain two or more induction variables that can be combined into
one induction variable. Induction variable elimination can reduce the number of
additions (or subtractions) in a loop, and improve both run-time performance
and code space. Some architectures have auto-increment and auto-decrement
instructions that can sometimes be used instead of induction variable
elimination.
Example 4:
For the code in example 3 above, consider the loop around B3.
Note that the values of j and t4 remain in lock-step; every time the value of j
decreases by 1, that of t4 decreases by 4 because 4*j is assigned to t4.Such
identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible to
get rid of all but one, by the process of induction-variable elimination. For the
inner loop around B3 in Fig. we cannot get rid of either j or t4 completely. t4 is
used in B3 and j in B4.
After Induction Variable Elimination is applied we have:
B
1
i := m - 1
j := n
t1 := 4 * n
v := a[ t1 ]
t2 := 4 * i
t4 := 4 * j
B
2
t2 := t2 + 4
314
t3 := a[ t2 ]
if ( t3< v ) goto B2
B3
t4 := t4 - 4
t5 := a[ t4 ]
if ( t5> v ) goto B3
B
4
if ( t2>= t4 ) goto B6
B5 B6
a[ t2 ]:= t5 t14 := a[ t1 ]
a[ t4 ]:= t3 a[ t2 ]:= t14
goto B2 a[ t1 ]:= t3
Example 5:
The code fragment below has three induction variables (i1, i2, and
i3) that can be replaced with one induction variable, thus
eliminating two induction variables.
int a[SIZE];
int b[SIZE];
void f (void)
{
int i1, i2, i3;
a[i2++] = b[i3++];
315
return;
}
The code fragment below shows the loop after induction variable
elimination.
int a[SIZE];
int b[SIZE];
void f (void)
{
int i1;
return;
316
While (i<= limit-2)
Code motion will result in the equivalent of
t= limit-2;
while (i<=t)
Example 6:
Consider the code below:
for (i = 0; i < n; ++i) {
x = y + z;
a[i] = 6 * i + x * x;
}
The calculations x = y + z and x * x can be moved outside the loop since within
they are looping invariant (i.e. they do not change over the iterations of the
loop) so our optimised code will be something like this:
x = y + z;
t1 = x * x;
for (i = 0; i < n; ++i) {
a[i] = 6 * i + t1;
}
This code can be optimized further. For example, strength reduction could
remove the two multiplications inside the loop (6*i and a[i]).
Example 7:
for(i=0;i<10;i++)
{
a = a + c;
317
}
In the above mentioned code, a = a + c can be moved out of the 'for' loop, and
the new code is
a = a + 10*c;
In-text Question 2
The best program transformations are those that yield the most benefit for the least effort.
True or False?
Answer:
True
4.0 Conclusion/Summary
In this concluding study session of the course, you have been taken through the
concept of code optimisation and common optimisation algorithms.
318
You also learnt that:
- optimisations fall into three categories: speed, space, safety
- the best program transformations are those that yield the most benefit for
the least effort
- the code produced by straightforward compiling algorithms can be made
to run faster using code improving transformations
- compilers using such transformations are called optimising compilers
- a code improving transformation is called local if it is performed by
looking at statements within one concrete block
- a code improving transformation is global if it is performed by looking at
statements not only in one concrete block.
Dramatic improvements in the running time of a program-such as cutting the
running time form a few hours to a few seconds-are usually obtained by
improving the program at all levels, from the source level to the target level. At
each level, the available options fall between the two extremes of finding a
better algorithm and of implementing a given algorithm so that fewer operations
are performed.
6.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube https://www.youtube.com/watch?v=bc3ysHc5RH0. Watch the
video & summarise in 1 paragraph
b. View the animation on https://www.youtube.com/watch?v=D7JmYNaYt4k
and critique it in the discussion forum
c. Take a walk and engage any 3 students on common optimisation algorithms;
In 2 paragraphs summarise their opinion of the discussed topic. etc.
319
320
7.0 Self-Assessment Question Answers
1. A code improving transformation is called local if it is performed by
looking at statements within one concrete block.
2. A code improving transformation is global if it is performed by looking at
statements not only in one concrete block, but also outside in global and
other outside blocks.
321
Glossary
Formal grammar: A formal grammar (sometimes called a grammar) is a set
of rules of a specific kind, for forming strings in a formal language.
Word: A word over an alphabet can be any finite sequence, or string, of letters
Keywords: these are the reserved words that have a special meaning in the
language, such as the word class in Java
322
Token: A token is the smallest unit recognisable by the compiler.
Parse tree: is a graphical representation for derivation that filters out the choice
regarding replacement.
323
Handle: A handle of a string is a substring that matches the right side of a
production whose reduction to the nonterminal on the left represents one step
along the reverse of a rightmost derivation.
LL (K) grammars: LL (K) grammars are those for which the left parser can be
made to work deterministically, if it is allowed to look at K-input symbols, to
the right of its current input position.
Open hashing: refers to the property that there need be no limit on the number
of entries that can be in the table.
Syntax tree: A syntax tree depicts the natural hierarchical structure of a source
program.
324