
Compiler Design

Aim of the Course

The course aims at understanding the working of a compiler in detail, so as to impart knowledge of the
whole spectrum of language processing technology.

Objectives of Course

 The primary objective is that, at the end of the course, students are
comfortable with the concepts related to compilers and are able to apply this
knowledge in various related fields.
 Students should be confident that they can use language processing technology in
various software development efforts.
 Students should be confident that they can design, develop, understand, modify/enhance,
and maintain compilers for (even complex!) programming languages.

Contents:

UNIT I
Translators, interpreters, assemblers, Compilers, Types of Compilers, Model of a compiler.
Analysis of source program, The phases of a compiler, Cousins of the compilers.
UNIT II
Finite automata, non-deterministic and deterministic finite automata, Acceptance of strings by
NDFA and DFA, Transforming NDFA to DFA, minimization/optimization of a DFA, related
algorithm. Regular sets and regular expression. Obtaining regular expression from finite
automata.

Lexical analyzer design, The role of Lexical Analyzer, Input Buffering, Specification of tokens,
and Recognition of tokens.

UNIT III
Syntax analysis, CFG, derivation of a parse tree, elimination of left recursion. Regular grammar,
right-linear and left-linear grammar. Parsing, top-down and bottom-up parsing, general
parsing strategies.

Top-down Parsing techniques: Brute-force approach, recursive descent parser and algorithms,
Simple LL (1) grammar, LL (1) with null and without null rules grammars, predictive parsing.

Bottom-up parsing- Handle of a right sentential form, Shift-reduce parsers, operator precedence
parsing, LR parsing.
UNIT IV
Symbol table contents, symbol table organization for block-structured languages: stack symbol tables,
stack-implemented hash-structured symbol tables. Symbol table organization for object-oriented
programming languages.

Intermediates code generation, translation schemes for programming language constructs.

Code Optimization: - Definition, Local code optimization techniques, Elimination of local and
global common sub Expressions, loop optimization.

UNIT V
Code Generation: - Definition, machine model, simple code generation method. Peephole
optimization.

Error Handling: - Error recovery; recovery from errors in the various phases and in parsing.

Text Book

Alfred V. Aho, Ravi Sethi, Jeffery D. Ullman, Compilers: Principles, Techniques, and Tools,
Addison Wesley Longman

Essential References
1. Jean Paul Tremblay, Paul G. Sorenson, The Theory and Practice of Compiler Writing.
2. Barrett, Bates, Gustafson, Couch, Compiler Construction: Theory and Practice.

Electronic Materials, Web Sites etc

https://nptel.ac.in/courses/106/104/106104072/

https://nptel.ac.in/courses/106/104/106104123/
Motivation

Language processing is an important component of programming.

Why study Compiler?

Compilers use the whole spectrum of language processing technology

Everything Goes Through a Compiler

Almost every piece of software used on any computer has been produced by a compiler. This
claim follows from the extreme difficulty any human faces in writing absolute
binary instructions for a computer.

The "compiler" used may be rather primitive, perhaps just a simple symbolic assembler that
maps symbols directly to instructions. But the use of symbols to specify computer operations is
nearly universal. A few examples:

 Operating Systems (command-line processing): Nearly every component of a typical
operating system (its kernel, libraries, support utilities, etc.) is written in a
high-level symbolic form and translated through a compiler to machine form.
 Databases (Query language processing): Database management systems are written in
high-level language form. The popular SQL database access language is an interpreted
language: some machine translator is needed to make sense of its directions and turn
them into useful actions.
 Markup Languages: HTML is an interpreted language--the "L" stands for "language":
HyperText Markup Language. It's a powerful way to display text, images, forms, menus,
and more. And it's fairly easy to write directly with a text editor.
 Browsers: The popular HTML browsers, for example those written by Netscape and
Microsoft, were written in a high-level language, most likely C++ or Java. By using a
high-level language, it was possible to port the browser almost automatically to a large
number of different machine platforms with little additional human effort. Browsers also
perform form processing, extracting information automatically from forms.
 Compilers, assemblers and linkers: Compilers themselves are written in a high-level
language, sometimes in the source language of the compiler itself! A compiler for C was
organized in a modular fashion years ago by Steve Johnson to make it easier to generate
code for a variety of different machines.
 VLSI Design: New microprocessor designs require one or more compilers for the
machine. These compilers usually start with a cross-compiler, which runs on one
established platform, but generates machine code for the new system. The cross-compiler
is written in a high-level language using an established compiler. Again, C is popular for
this kind of effort.

Wherever input has a structure, one can think of language processing.
Unit I

Contents: Translators, interpreters, assemblers, Compilers, Model of a compiler, Analysis of
source program, The phases of a compiler, Types of Compilers, Cousins of the compilers.

Language Processing System


Phase 1: Creating a Program
Phase 1 consists of editing a file with an editor program. The programmer creates a program
(typically referred to as source code) in an editor, makes any necessary corrections, and saves the
program on a secondary storage device, such as a hard drive.

Phase 2: Preprocessing
In phase 2, the programmer gives the command to compile the program. A preprocessor program
executes automatically before the compiler's translation phase begins. The preprocessor obeys
commands called preprocessor directives, which indicate that certain manipulations are to be
performed on the program before compilation. These manipulations usually include inserting
other text files into the file to be compiled and performing various text replacements.

Phase 3: Compiling
In phase 3, the compiler translates the program into target-language code (also referred to as
object code).

Phase 4: Linking
Phase 4 is called linking. Programs typically contain references to functions and data defined
elsewhere, such as in the standard libraries or in the private libraries of groups of programmers
working on a particular project. The object code produced by the compiler typically contains
"holes" due to these missing parts. A linker links the object code with the code for the missing
functions to produce an executable image (with no missing pieces). If the program compiles and
links correctly, an executable image is produced.

Phase 5: Loading
Phase 5 is called loading. Before a program can be executed, it must first be placed in memory.
This is done by the loader, which takes the executable image from disk and transfers it to
memory. Additional components from shared libraries that support the program are also loaded.

Phase 6: Execution
Finally, the computer, under the control of its CPU, executes the program one instruction at a
time.

Cousins of Compiler

All the modules involved in the program development environment, along with the compiler,
are termed cousins of the compiler. Mainly they include the following:

1. Preprocessor

2. Assembler

3. Linker

4. Loader
C++ Programming Environment:

1. The Preprocessor
The Preprocessor accepts source code (*.cpp) as input and is responsible for
 removing comments
 interpreting special preprocessor directives denoted by #.
For example, #include – paste in a named file. Usually used for header files.
2. C++ Compiler
The C++ compiler translates source to assembly code. The source code is received from
the preprocessor.
3. Assembler
The assembler creates object code. On a UNIX system you may see files with a *.o suffix
(*.OBJ on MS-DOS) to indicate object code files.
4. Linker
If a source file references functions defined in other source files, the link editor (in
short, the linker) combines the code to create an executable file. External variable
references are resolved by the linker as well. The linker also looks for the main() function
to mark it as the starting point of the executable.
The linker combines multiple object files (*.o) and libraries (*.a) into one executable.

C++ v/s Java

C++ decides the location of each class member at compile time; Java defers this decision until
you attempt to load and run the class.

To be more technical:
 C++ compiler (by default) will do: preprocessing, compilation and assembly.
 Java compiler (by default) will do: compilation only. It has no assembly step, since its
output (bytecode) is not machine specific.
 C++ linker will link the object files to an executable.
 In Java, linking is done at runtime. Whenever execution of Java bytecode requires the
presence of a new class then a class loader is invoked to fetch the class material.
Translator: A translator is a programming language processor that converts a computer
program from one language to another. It takes a program written in a source language and
converts it into a target language. It also discovers and identifies errors
during translation.

In order to design and implement a useful compiler, one must be reasonably expert in all the
following disciplines:

 Source Language: The language in which the input (source) program (that is to be
translated) is written.
 Target Language: The language into which the program is translated; the output
generated by the translator.
 Host Language: The language in which the translator itself is written.
 Host Meta-language: Special little languages that describe tokens or language structure.
 Target Machine: The architecture of the machine on which the program will finally execute.
 Compiler tools, the algorithms and data structures that are commonly used to design a
compiler.

For example: Java Language compiler

Source Language: Java

Target Language: Bytecode

Host Language: C++

Types of Language Processors:

1. Assembler
2. Interpreter
3. Compiler

Assembler:
An assembler is a program that takes a program written in assembly language as input and
converts it into an equivalent program in machine language. It produces machine-specific
code.

Source code (assembly language) → ASSEMBLER → Object code (machine language)

Interpreter
An interpreter is a program that takes a program, usually written in a high-level language, as
input, then translates and executes the source code statements one by one. If an error occurs,
it stops at the error point. It does not generate any object code.

Source code (high-level language) + Data → INTERPRETER → Output / Errors
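As a sketch of this behavior, the following toy interpreter (illustrative only, with a made-up two-form statement syntax) executes statements one by one and halts at the first error, producing output but no object code:

```python
# A minimal interpreter sketch: statements are translated and executed
# one by one; execution stops at the first error; no object code is
# produced. The statement syntax ("name = number", "print name") is
# invented for illustration.

def interpret(lines):
    env = {}          # runtime variable bindings
    output = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) == 3 and parts[1] == "=":       # e.g. "x = 5"
            env[parts[0]] = int(parts[2])
        elif len(parts) == 2 and parts[0] == "print": # e.g. "print x"
            if parts[1] not in env:
                return output, f"error at line {lineno}: undefined {parts[1]}"
            output.append(env[parts[1]])
        else:
            return output, f"error at line {lineno}: bad statement"
    return output, None

out, err = interpret(["x = 5", "print x", "print y", "x = 6"])
# Execution stops at line 3 ('y' is undefined); line 4 never runs.
```

Note how, unlike a compiler, errors surface only when the offending statement is reached, and every run re-translates the program from scratch.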

Compiler

A compiler is a program that takes a program, usually written in a high-level language, as input
and converts it into an equivalent program in a target language (which may be machine
language, symbolic language or some intermediate language). It reports errors with line numbers
after scanning the source code.

Source code (high-level language) → COMPILER → Target code / Error messages
Difference between Interpreter and Compiler

Translation process: An interpreter translates and executes statements one by one; a compiler
translates the full source code in one go.

Object code: An interpreter produces no object code; a compiler generates object code.

Memory required: An interpreted program must be translated every time it runs, so the
interpreter must be present in memory every time the program is executed. If a program is
successfully compiled once, the generated object code can be executed any number of times
on the same machine without retranslation, so the compiler need not be present in memory
when the program is executed.

Time required: An interpreter takes more time, as translation and execution occur
simultaneously; a compiler requires comparatively less time.

Error reporting: An interpreter stops immediately at the error point if an error occurs
(generally useful for debugging); a compiler reports errors with line numbers after reading
the whole source code.
Types of Compiler
1. Self Resident Compiler
It generates code for its host machine only. For example, a COBOL compiler for
mainframe computers.
2. Cross Compiler
A cross compiler generates object code for a machine other than the host machine, i.e., it
runs on one machine and generates code for a different type of machine. These compilers
are used to generate object code that can run on machines with new architectures or on
special devices that cannot host their own compiler. For example, generating object code
for a microcontroller in a microwave oven that does not support an operating system.
3. Self Compiling Compiler / Bootstrapping
A self-compiling compiler is written in its own source language, i.e., the source language and
host language are the same. First, a bootstrap compiler is developed: a small compiler written
in some available language (for which a compiler already exists). This bootstrap compiler
compiles the subset of the source language that is sufficient to compile the compiler's own code.
Once the compiler code is successfully compiled, it can be used to compile the full source
language. For example, the compiler of the C language is written in C itself.
4. Ahead-of-Time (AOT) Compiler
If we translate the entirety of a program into something interpretable, we call this ahead-of-
time (AOT) compilation. AOT compilers may be chained, the last of which is often an existing
assembler.

5. Just-in-time (JIT) compiler


When we compile portions of a program while other previously compiled portions are
running, we have just-in-time (JIT) compilation. JIT compilers remember what they have
already compiled instead of compiling the same source over and over again. They can
even do adaptive compilation and re-compilation based on runtime behavior.
Many languages allow you to execute code during compile time and compile new
code at run time, as with macros in Julia.
Note: Julia is a high-level, high-performance, dynamic programming language. It
provides Lisp-like macros and other metaprogramming facilities. Julia has foreign
function interfaces for C/Fortran, C++, Python, R, Java, and many other languages. Julia can
also be embedded in other programs through its embedding API. Specifically, Python
programs can call Julia using PyJulia.
Uses of cross compilers

• The fundamental use of a cross compiler is to separate the build environment from target
environment. This is useful in a number of situations:
• Embedded computers where a device has extremely limited resources.
For example, a microwave oven will have an extremely small computer to read its
touchpad and door sensor, provide output to a digital display and speaker, and to control
the machinery for cooking food. This computer will not be powerful enough to run a
compiler, a file system, or a development environment.
Since debugging and testing may also require more resources than are available on an
embedded system, cross-compilation can be less involved and less prone to errors than
native compilation.
• Compiling for multiple machines.
For example, a company may wish to support several different versions of an operating
system or to support several different operating systems. By using a cross compiler, a
single build environment can be set up to compile for each of these targets.
• Compiling on a server farm.
Similar to compiling for multiple machines, a complicated build that involves many
compile operations can be executed across any machine that is free, regardless of its
underlying hardware or the operating system version that it is running.

Examples for Cross Compilers


1. C4droid compiler for Android
C4droid is a user-friendly (but powerful) C/C++ IDE + C/C++ compiler for
Android. Note that C4droid supports devices with ARM processors only (not
devices with Intel x86 or MIPS processor).

Basic features:
 Offline compiler: create your own applications on Android device and run
them even without Internet access
 No root required (but C4droid can use it for your programs if you want)
 Full ANSI C and ISO C99 support with TCC + uClibc
 Source code editor with syntax highlighting, tabs, code completion, code
formatting, file association and infinite undo/redo
 Export & share your programs as APKs or native executables (for terminal apps).
2. AIDE - Android IDE - Java, C++
AIDE is an integrated development environment (IDE) for developing real
Android Apps directly on Android devices. AIDE supports the full edit-compile-
run cycle: write code with the feature rich editor offering advanced features like
code completion, real-time error checking, refactoring and smart code navigation,
and run your App with a single click. AIDE will turn your Android tablet with
keyboard into a real development box. We use the Transformer Prime running Ice
Cream Sandwich to code with AIDE. AIDE will turn your Android Phone into a
small development computer to browse and touch your code on the go.
AIDE supports building Apps with Java/Xml and the Android SDK as well as
Apps with C/C++ and the Android NDK.
AIDE is fully compatible with Eclipse projects. You can just copy the source code
to your device and open the Eclipse project in AIDE to start coding. Alternatively,
you can keep your source code in your Dropbox; AIDE integrates with Dropbox
and allows you to easily download from it and sync back your changes.
AIDE can also open Android Studio projects which follow the default folder
structure.
AIDE supports Git for professional development.
Models of Compiler /Phases of Compilation Process

Source File → 1. File System (records) → 2. Character Manipulator (character stream)
→ 3. Scanner (tokens) → 4. Parser → 5. Intermediate Code Generator (intermediate
representation) → 6. Code Optimizer (optimized code) → 7. Object Code Generator
(object code) → 8. Peep-hole Optimizer → Target Code

Model 1
Source Program → ANALYSIS PHASE (Lexical Analyzer → Syntactic Analyzer → Semantic
Analyzer) → SYNTHESIS PHASE (Code Generator → Code Optimizer) → Object Program

Both phases read and update shared TABLES.
Model 2
Character Stream
↓ Lexical Analyzer
Token Stream
↓ Syntax Analyzer
Syntax Tree
↓ Semantic Analyzer
Annotated Syntax Tree
↓ Intermediate Code Generator
Intermediate Representation
↓ Machine-Independent Code Optimizer
Optimized Intermediate Representation
↓ Code Generator
Target Machine Code
↓ Machine-Dependent Code Optimizer
Optimized Target Machine Code

(The Symbol Table Manager and the Error Handler interact with all of the phases.)

Model 3
Phases of Compilation Process:

According to model 3 of the compiler, there are seven phases in the compilation process.

They are as follows:

1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Machine-Independent Code Optimization
6. Code Generation
7. Machine-Dependent Code Optimization
1. Lexical Analysis
The compiler takes the service of the operating system's file access mechanism to read the
source file. The file system reads the source and provides records to a character manipulator.
The character manipulator accepts the records and emits characters. It also detects and removes
comments, extra spaces and special control commands that are not part of the programming
language.
The lexical analyzer / scanner accepts the stream of characters and groups them into lexemes.
Lexemes are then matched against token patterns, and a sequence of tokens is generated.
A token is an atomic unit of the language, a logical entity of a programming language.
Examples of tokens include identifiers, keywords, operators and so on. A token is
recognized and handled as a pair (Token_Type, Token_Value). A delimiter is a special token
that is handled by its token type alone, without a token value.

Example of lexical analysis:


a = b + c;

Lexeme   Token
a        (ID, 1)
=        ASSIGN_OP
b        (ID, 2)
+        ARITH_OP
c        (ID, 3)
;        DELIMITER
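A scanner for the statement above can be sketched with regular-expression matching. The token names mirror the table; the (ID, n) values index identifiers in order of first appearance, mimicking a symbol-table numbering (an assumption for this sketch):

```python
import re

# A minimal scanner sketch for statements like "a = b + c;".
# Each token class is described by a regular expression; identifiers
# are entered into a small symbol table on first sight.
TOKEN_SPEC = [
    ("ID",        r"[A-Za-z_]\w*"),
    ("ASSIGN_OP", r"="),
    ("ARITH_OP",  r"[+\-*/]"),
    ("DELIMITER", r";"),
    ("SKIP",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    tokens, symtab = [], {}
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                     # drop whitespace
        if kind == "ID":
            index = symtab.setdefault(lexeme, len(symtab) + 1)
            tokens.append(("ID", index)) # (Token_Type, Token_Value)
        else:
            tokens.append((kind, lexeme))
    return tokens

print(scan("a = b + c;"))
# [('ID', 1), ('ASSIGN_OP', '='), ('ID', 2), ('ARITH_OP', '+'),
#  ('ID', 3), ('DELIMITER', ';')]
```

In a real scanner the regular expressions would also cover keywords, literals and comments; this sketch covers only the token classes of the example.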

2. Syntax Analysis
This is the second phase of the compiler. In this phase, we check the syntax and construct
the abstract syntax tree. This phase is modeled through context-free grammars, and the
structure is recognized through pushdown automata or table-driven parsers. The syntax
analysis phase verifies that the string can be generated by the grammar for the source
language. In case of any syntax errors in the program, the parser tries to report as many
errors as possible. Error reporting and recovery form a very important part of the syntax
analyzer.
Example of Syntax Analysis:
a = b + c;
Output of Lexical Analyzer is stream of tokens as follows:

Lexeme   Token
a        id1
=        =
b        id2
+        +
c        id3
;        ;

Let the context-free grammar (CFG) for arithmetic expressions in the given language be:
E→E=E
E → T+E | T
T → F*T | F
F → id
Then the parse tree is:

            E
          / | \
         E  =  E
         |    /|\
         T   T + E
         |   |   |
         F   F   T
         |   |   |
        id1 id2  F
                 |
                id3
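Top-down recognition of such a grammar can be sketched with a recursive-descent parser. The sketch uses S → id = E ; in place of the left-recursive E → E = E, since left recursion is unsuitable for top-down parsing; the other production names follow the grammar above:

```python
# A recursive-descent parser sketch for:
#   S -> id = E ;    E -> T + E | T    T -> F * T | F    F -> id
# Input is a token stream; the result is a nested-tuple parse tree.

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos][0] if pos < len(tokens) else None
    def eat(kind):
        nonlocal pos
        assert peek() == kind, f"syntax error: expected {kind}, got {peek()}"
        tok = tokens[pos]
        pos += 1
        return tok
    def F():                      # F -> id
        return ("F", eat("id"))
    def T():                      # T -> F * T | F
        left = F()
        if peek() == "*":
            eat("*")
            return ("T", left, "*", T())
        return ("T", left)
    def E():                      # E -> T + E | T
        left = T()
        if peek() == "+":
            eat("+")
            return ("E", left, "+", E())
        return ("E", left)
    def S():                      # S -> id = E ;
        lhs = eat("id"); eat("="); rhs = E(); eat(";")
        return ("S", lhs, "=", rhs)
    return S()

# Token stream for "a = b + c;" as produced by the lexical analyzer
tree = parse([("id", 1), ("=",), ("id", 2), ("+",), ("id", 3), (";",)])
```

Each nonterminal of the grammar becomes one function, and each function consumes exactly the tokens its productions derive; this is the structure of the recursive-descent parsers discussed under top-down parsing in Unit III.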

3. Semantic Analysis
Type checking and type casting are performed in the semantic analysis phase. The output of
the semantic analyzer is an annotated parse tree with more information on the data types. A few
tasks performed by the semantic analyzer are:
 To check whether variables are of types on which the operations are allowed.
 To check whether a variable has been declared before use.
 To check whether a variable has been initialized.

Example of Semantic Analysis:

a = b + c;

Let the data types of ‘a’ and ‘b’ be float and of ‘c’ be integer; the semantic analyzer
must upcast the data type of ‘c’ to float.

            E
          / | \
         E  =  E
         |    /|\
         T   T + E
         |   |   |
         F   F   T
         |   |   |
        id1 id2  F
                 |
          inttofloat(id3)

Syntax Analyzer and Semantic Analyzer are together known as Parser.
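The coercion step can be sketched as a small tree-walking type checker. The types and symbol names here are assumptions taken from the example above:

```python
# A sketch of type checking with int-to-float coercion.
# Types come from a (hypothetical) symbol table; the analyzer wraps
# any int operand of a mixed-type '+' in an inttofloat(...) node.

symbol_table = {"id1": "float", "id2": "float", "id3": "int"}

def annotate(expr):
    # expr is either an identifier name or a tuple (op, left, right)
    if isinstance(expr, str):
        return expr, symbol_table[expr]
    op, l, r = expr
    (l_node, l_ty) = annotate(l)
    (r_node, r_ty) = annotate(r)
    if l_ty != r_ty:                   # upcast the int operand to float
        if l_ty == "int":
            l_node, l_ty = ("inttofloat", l_node), "float"
        else:
            r_node, r_ty = ("inttofloat", r_node), "float"
    return (op, l_node, r_node), l_ty

tree, ty = annotate(("+", "id2", "id3"))
# tree == ('+', 'id2', ('inttofloat', 'id3')), ty == 'float'
```

The annotated tree carries the inserted conversion node, exactly as the picture above shows for id3.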

4. Intermediate Code Generation


Intermediate code is a representation of the source code in a form that can be directly
mapped to target code. There are a number of representations for intermediate code, such as:
 Syntax Tree
 Postfix Notation
 Three-Address Code
Example of Intermediate Code Representation
Three Address Code
a = b + c;

Operator   Operand1   Operand2   Result
+          id2        id3        t1
=          t1                    id1
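Generating such quadruples from an expression tree can be sketched as a post-order walk that allocates fresh temporaries, assuming the (operator, operand1, operand2, result) layout of the table above:

```python
# A three-address code generation sketch: every interior node of the
# expression tree gets a fresh temporary t1, t2, ..., and the final
# assignment copies the last temporary into the target variable.

def gen_tac(target, expr):
    code, counter = [], [0]
    def walk(node):
        if isinstance(node, str):
            return node                      # leaf: identifier name
        op, left, right = node
        l, r = walk(left), walk(right)
        counter[0] += 1
        temp = f"t{counter[0]}"
        code.append((op, l, r, temp))        # (operator, op1, op2, result)
        return temp
    code.append(("=", walk(expr), None, target))
    return code

tac = gen_tac("id1", ("+", "id2", "id3"))    # a = b + c;
# [('+', 'id2', 'id3', 't1'), ('=', 't1', None, 'id1')]
```

The output reproduces the quadruple table for a = b + c;, with None standing for the empty Operand2 field.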

5. Machine Independent Code Optimization


The input to this phase is intermediate code. This code optimizer performs machine-independent
code optimization, i.e., optimization that does not depend on the target machine. It performs
local as well as global code optimization. There are a number of optimization techniques that
can be employed by a compiler, such as algebraic simplification, constant folding, loop
optimization and so on.
Example of Code Optimization- Three Address Code
a = b + c;

Operator   Operand1   Operand2   Result
+          id2        id3        t1
=          t1                    id1

After optimization

Operator   Operand1   Operand2   Result
+          id2        id3        id1
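The elimination of the copy through t1 can be sketched as a simple pass over the quadruples (a minimal form of copy propagation, assuming the quadruple layout used above):

```python
# An optimization sketch: when a temporary is computed and then
# immediately copied into a variable, fold the two quadruples into
# one by writing the result directly into the variable.

def eliminate_copies(tac):
    optimized = []
    for quad in tac:
        op, a1, a2, res = quad
        if op == "=" and a2 is None and optimized and optimized[-1][3] == a1:
            # fold "tN = ..." followed by "x = tN" into one quadruple
            prev = optimized.pop()
            optimized.append((prev[0], prev[1], prev[2], res))
        else:
            optimized.append(quad)
    return optimized

opt = eliminate_copies([("+", "id2", "id3", "t1"),
                        ("=", "t1", None, "id1")])
# [('+', 'id2', 'id3', 'id1')]
```

This is safe only when the temporary is used nowhere else; a real optimizer would check that with liveness or use-count information.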

6. Code Generation
The optimized intermediate code is mapped to target language code in this phase. The
object code may take different forms, such as:
 Binary machine code
 Symbolic code (e.g. assembly language code)
 Some intermediate language code (bytecode)
Example of Code Generation
MOV R1, id2
MOV R2, id3
ADD R1, R2
MOV id1, R1
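Mapping quadruples to such symbolic code can be sketched as a table-driven walk; the register names and instruction forms follow the hypothetical two-register machine of the example above:

```python
# A code generation sketch: each quadruple is expanded into a fixed
# pattern of symbolic instructions for an assumed register machine.

def codegen(tac):
    asm = []
    for op, a1, a2, res in tac:
        if op == "+":
            asm += [f"MOV R1, {a1}",     # load first operand
                    f"MOV R2, {a2}",     # load second operand
                    "ADD R1, R2",        # add into R1
                    f"MOV {res}, R1"]    # store the result
        elif op == "=":
            asm += [f"MOV R1, {a1}", f"MOV {res}, R1"]
    return asm

print(codegen([("+", "id2", "id3", "id1")]))
# ['MOV R1, id2', 'MOV R2, id3', 'ADD R1, R2', 'MOV id1, R1']
```

A real code generator would also allocate registers and choose instructions by cost; this sketch shows only the pattern expansion step.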

7. Machine Dependent Code Optimization


In this phase, short sequences of object code are examined and machine-specific
optimizations are performed.
For example, the use of machine idioms (hardware instructions):
Statement: a = i++;
Code for i++:
MOV AX, i
ADD AX, #1
can be replaced by the single instruction INCR i

NOTE: Code optimization phases are optional in compilation process.


position = initial + rate * 60

Lexical Analyzer

<id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax Analyzer

        =
       / \
    id1   +
         / \
      id2   *
           / \
        id3   60

SYMBOL TABLE
1  position  …
2  initial   …
3  rate      …

Semantic Analyzer

        =
       / \
    id1   +
         / \
      id2   *
           / \
        id3   inttofloat
                  |
                  60
Intermediate Code Generator

t1 = inttofloat (60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Code Optimizer

t1 = id3 * 60.0
id1 = id2 + t1

Code Generator

LDF R2, id3


MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1

Translation of an assignment statement


Symbol Table Management

A symbol table is a data structure used to keep track of scope and binding information about
names in the source program. The names may include programming elements such as variable
names, constants, function names, statement labels, class names, structure names and so on.

The symbol table is searched every time a name is encountered in the source program. Changes
to the table are made if a new name or information about a new name is encountered. Therefore,
the data structure chosen to build the symbol table must be efficient in terms of adding new
entries to it and searching existing names from it.

Example:

3 int a, b, c;

7 a = b + c;

Symbol Table Entries (sample)

ID_Name   Data_Type   Declared at line   Referenced at line   Object Address   ...
a         int         3                  7                    108
b         int         3                  7                    110
c         int         3                  7                    112
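A hash-based symbol table with efficient insertion and lookup can be sketched as follows, replaying the example above (lines 3 and 7 are the declaration and reference lines; the attribute names are choices made for this sketch):

```python
# A symbol table sketch backed by a hash table (Python dict):
# insertion on declaration, lookup plus update on every reference.

class SymbolTable:
    def __init__(self):
        self.entries = {}          # name -> attribute record

    def declare(self, name, dtype, line):
        self.entries[name] = {"type": dtype, "declared": line, "refs": []}

    def reference(self, name, line):
        if name not in self.entries:
            raise NameError(f"{name} used before declaration")
        self.entries[name]["refs"].append(line)

st = SymbolTable()
for n in ("a", "b", "c"):
    st.declare(n, "int", 3)        # int a, b, c;   (line 3)
for n in ("a", "b", "c"):
    st.reference(n, 7)             # a = b + c;     (line 7)
```

Because every name encountered in the source triggers a lookup, the constant-time behavior of a hash structure is exactly what the paragraph above asks of the chosen data structure.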

Error Handling

In case of any errors in the program, the compilation phase in which the error occurred tries to
report as many errors as possible. Error reporting and recovery form a very important part of the
compilation process. The error handler has the following goals:
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect subsequent errors.
 It should not significantly slow down the processing of correct programs.

The Grouping of Phases into Passes


Grouping of phases deals with the logical organization of a compiler. The phases of the
compilation process can be broadly divided into two parts:

1. Analysis Phase / Front end of Compiler


Lexical analysis, syntax analysis, semantic analysis, intermediate code generation and
the (optional) machine-independent code optimization phase form the front end of the
compiler. These phases are highly dependent on the source language and are independent
of the code requirements of the target machine.
Different languages can also be implemented on the same machine by rewriting the front
end and reusing the same back end. To do this, all the front ends are required to produce
the same intermediate code; the same back end can then map the intermediate code to
target code.
In practice this is sometimes difficult, because different languages are designed with
different viewpoints; it is therefore hard to write front ends that produce the same
intermediate code for different languages.

2. Synthesis Phases / Back end of Compiler


In the synthesis phase, intermediate code is mapped to target language code. The code
generation phase and the (optional) machine-dependent code optimization phase depend
on the architecture of the target machine. These phases form the back end of the
compiler.

Pass Structure of Compiler

Several phases of the compilation process can be implemented in a single pass. A relatively
small number of passes is desirable from the point of view of compilation efficiency. Therefore,
to reduce the number of passes, some phases are grouped into one pass.

Sometimes it is easy to combine a number of phases into one pass. For example, the lexical
analyzer and syntax analyzer can easily be grouped into one pass because the interface between
them is only a token.

But grouping certain phases into a single pass is difficult. For example, grouping the
intermediate code generator and the object code generator is difficult because it is hard to
perform object code generation until a sufficient number of intermediate code statements have
been generated. The interface between the two is not based on only one intermediate instruction
(certain languages permit the use of a variable before its declaration; similarly, forward jumps
are allowed by source languages).

To overcome this problem and enable merging of two such phases into a single pass, the
technique of “back-patching” is used. In this technique, the object code is generated by leaving
“statement holes”, and these holes are filled in later when the information becomes available.
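Back-patching can be sketched as follows: the hole's position is recorded when a forward jump is emitted, and filled once the target address becomes known (the instruction names are invented for illustration):

```python
# A back-patching sketch: a forward jump is emitted with a "hole"
# for its target; the hole's index is remembered and filled in
# once the target address is finally known.

def emit_with_backpatch():
    code, holes = [], []
    code.append("LOAD x")
    holes.append(len(code))            # remember where the hole goes
    code.append("JMP ____")            # forward jump, target unknown
    code.append("LOAD y")
    target = len(code)                 # label position now known
    code.append("STORE z")
    for i in holes:                    # fill every recorded hole
        code[i] = f"JMP {target}"
    return code

print(emit_with_backpatch())
# ['LOAD x', 'JMP 3', 'LOAD y', 'STORE z']
```

The same bookkeeping handles variables used before declaration: the generated code records where the missing address belongs and patches it when the declaration is finally processed.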
Compiler Construction Tools:

1. Scanner Generator: It produces a lexical analyzer from a regular-expression description
of the tokens of the language.
2. Parser Generator: It produces a syntax analyzer from a grammatical description of a
programming language.
3. Syntax-Directed Translation Engine: It produces a collection of routines for traversing
parse trees and generating intermediate code.
4. Code-Generator Generator: It produces a code generator from a collection of rules
that define the translation of each operation of the intermediate instructions into target
language instructions.
5. Data-Flow Analysis Engine: It facilitates data-flow analysis, which determines how
values are transmitted from one part of a program to another. Data-flow analysis is a key
part of code optimization.
6. Compiler Construction Toolkits: They provide an integrated set of routines for
constructing the various phases of a compiler.
