Unit I - CD - Intro
The course aims at understanding the working of a compiler in detail, so as to gain knowledge of the
whole spectrum of language processing technology.
Objectives of Course
The primary objective is that at the end of the course the students must be quite
comfortable with the concepts related to compilers and should be able to deploy their
knowledge in various related fields.
Students should be confident that they can use language processing technology for
various software development tasks.
Students should be confident that they can design, develop, understand, modify /enhance,
and maintain compilers for (even complex!) programming languages.
Contents:
UNIT I
Translators, interpreters, assemblers, Compilers, Types of Compilers, Model of a compiler.
Analysis of source program, The phases of a compiler, Cousins of the compilers.
UNIT II
Finite automata, non-deterministic and deterministic finite automata, Acceptance of strings by
NDFA and DFA, Transforming NDFA to DFA, minimization/optimization of a DFA, related
algorithm. Regular sets and regular expression. Obtaining regular expression from finite
automata.
Lexical analyzer design, The role of Lexical Analyzer, Input Buffering, Specification of tokens,
and Recognition of tokens.
UNIT III
Syntax analysis, CFG, derivation of a parse tree, elimination of left recursion. Regular grammar,
right-linear and left-linear grammar. Parsing, top-down and bottom-up parsing, general
parsing strategies.
Top-down Parsing techniques: Brute-force approach, recursive descent parser and algorithms,
Simple LL (1) grammar, LL (1) with null and without null rules grammars, predictive parsing.
Bottom-up parsing- Handle of a right sentential form, Shift-reduce parsers, operator precedence
parsing, LR parsing.
UNIT IV
Symbol table contents, organization for block-structured languages: stack symbol tables, stack-
implemented hash-structured symbol tables. Symbol table organization for object-oriented
programming languages.
Code Optimization: - Definition, local code optimization techniques, elimination of local and
global common subexpressions, loop optimization.
UNIT V
Code Generation: - Definition, machine model, simple code generation method. Peephole
optimization.
Error Handling: - Error recovery, recovery from errors in the various phases and in parsing.
Text Book
Alfred V. Aho, Ravi Sethi, Jeffery D. Ullman, Compilers: Principles, Techniques, and Tools,
Addison Wesley Longman
Essential References
1. Jean Paul Tremblay, Paul G. Sorenson, The Theory & Practice of Compiler Writing.
2. Barrett, Bates, Gustafson, Couch, Compiler Construction: Theory & Practice.
https://nptel.ac.in/courses/106/104/106104072/
https://nptel.ac.in/courses/106/104/106104123/
Motivation
Almost every piece of software used on any computer has been produced by a compiler. This
astonishing claim arises from the extreme difficulties faced by any human in writing absolute
binary instructions for any computer.
The "compiler" used may be rather primitive, perhaps just a simple symbolic assembler that
maps symbols directly to instructions. But the use of symbols to specify computer operations is
nearly universal. Few Example:
Where ever input has a structure one can think of language processing
Unit I
Phase 2: Preprocessing
In phase 2, the programmer gives the command to compile the program. A preprocessor program
executes automatically before the compiler's translation phase begins. The preprocessor obeys
commands called preprocessor directives, which indicate that certain manipulations are to be
performed on the program before compilation. These manipulations usually consist of including
other text files in the file to be compiled and performing various text replacements.
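For illustration, a minimal sketch of a source file before preprocessing (the file name and the PI macro are hypothetical); the preprocessor pastes in the named header and performs the text replacement before the compiler's translation phase begins:

// area.cpp (illustrative file name)
#include <iostream>   // directive: paste in the iostream header file
#define PI 3.14159    // directive: replace every PI below with 3.14159

int main() {
    double r = 2.0;
    // After preprocessing, PI has already been replaced by 3.14159.
    std::cout << "area = " << PI * r * r << std::endl;
    return 0;
}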
Phase 3: Compiling
In phase 3, the compiler translates the program into target-language code (also referred to as
object code).
Phase 4: Linking
Phase 4 is called linking. Programs typically contain references to functions and data defined
elsewhere, such as in the standard libraries or in the private libraries of groups of programmers
working on a particular project. The object code produced by the compiler typically contains
"holes" due to these missing parts. A linker links the object code with the code for the missing
functions to produce an executable image (with no missing pieces). If the program compiles and
links correctly, an executable image is produced.
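As an illustration (file and program names are hypothetical), a call to a function defined in another source file leaves such a "hole" in the object code, which the linker later resolves:

// math_utils.cpp - defines add()
int add(int a, int b) { return a + b; }

// main.cpp - uses add(); only a declaration is visible here
int add(int a, int b);            // definition lives in another object file
int main() { return add(2, 3); }  // this call is a "hole" until link time

// Typical build commands with g++ (one example toolchain):
//   g++ -c main.cpp math_utils.cpp     produces main.o and math_utils.o (object code with holes)
//   g++ main.o math_utils.o -o app     the linker fills the holes and produces the executable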
Phase 5: Loading
Phase 5 is called loading. Before a program can be executed, it must first be placed in memory.
This is done by the loader, which takes the executable image from disk and transfers it to
memory. Additional components from shared libraries that support the program are also loaded.
Phase 6: Execution
Finally, the computer, under the control of its CPU, executes the program one instruction at a
time.
Cousins of Compiler
All the modules that are involved in the program development environment along with the compiler
are termed the cousins of the compiler. Mainly they include the following:
1. Preprocessor
2. Assembler
3. Linker
4. Loader
C++ Programming Environment:
1. The Preprocessor
The Preprocessor accepts source code (*.cpp) as input and is responsible for
removing comments
interpreting special preprocessor directives denoted by #.
For example, #include – paste in a named file. Usually used for header files.
2. C++ Compiler
The C++ compiler translates source to assembly code. The source code is received from
the preprocessor.
3. Assembler
The assembler creates object code. On a UNIX system you may see files with a *.o suffix
(*.OBJ on MS-DOS) to indicate object code files.
4. Linker
If a source file references functions defined in other source files, the link editor (in
short, the linker) combines the code to create an executable file. External variable
references are resolved by the linker as well. The linker also looks for the main() function
to mark it as the starting point of the executable.
The linker combines multiple object files (*.o) and libraries (*.a) into one executable.
C++ decides the location of each class member at compile time, whereas Java defers this decision
until you attempt to load and run the class.
To be more technical:
C++ compiler (by default) will do: pre-compilation, compilation and assembly
Java compiler (by default) will do: compilation. It has no assembly step, since it is not
machine specific.
C++ linker will link the object files to an executable.
In Java, linking is done at runtime. Whenever execution of Java bytecode requires the
presence of a new class, a class loader is invoked to fetch the class.
Translator: A translator is a programming language processor that converts a computer
program from one language to another. It takes a program written in the source language (code) and
converts it into the target language (code). It also discovers and identifies errors
during translation.
In order to design and implement a useful compiler, one must be reasonably expert in all the
following disciplines:
Source Language: The language in which the input (source) program (that is to be
translated) is written.
Target Language: The language into which the program is translated; the output
generated by the translator.
Host Language: The language in which the translator itself is written.
Host Meta-language: Special little languages that describe tokens or language structure
Target machine: The architecture of the machine on which the program will finally execute.
Compiler tools: the algorithms and data structures that are commonly used to design a
compiler.
Three commonly used language translators are:
1. Assembler
2. Interpreter
3. Compiler
Assembler:
An assembler is a program that takes a program written in assembly language as input and
converts it into an equivalent program written in machine language. It produces machine specific
code.
Source code written in Assembly Language → ASSEMBLER → Object code in machine language
Interpreter
An interpreter is a program that takes a program, usually written in a high-level language, as input and
translates and executes the source code statements one by one. If an error occurs, it stops at the error
point. It does not generate any object code.
Source program + DATA → INTERPRETER → output / ERROR
Compiler
A compiler is a program that takes a program, usually written in a high-level language, as input and
converts it into an equivalent program written in a target language (which may be machine
language, symbolic language or some intermediate language). It reports errors with line numbers
after scanning the source code.
Source code written in High Level Language → COMPILER → Target code / ERROR MESSAGES
Difference between Interpreter and Compiler
An interpreter translates and executes the source statements one by one, stops at the first error and
produces no object code, whereas a compiler translates the whole program into target code and reports
errors with line numbers after scanning the source.
Cross Compiler
A cross compiler runs on one machine (the build environment) and produces code for a different
machine (the target environment).
• The fundamental use of a cross compiler is to separate the build environment from the target
environment. This is useful in a number of situations:
• Embedded computers where a device has extremely limited resources.
For example, a microwave oven will have an extremely small computer to read its
touchpad and door sensor, provide output to a digital display and speaker, and to control
the machinery for cooking food. This computer will not be powerful enough to run a
compiler, a file system, or a development environment.
Since debugging and testing may also require more resources than are available on an
embedded system, cross-compilation can be less involved and less prone to errors than
native compilation.
• Compiling for multiple machines.
For example, a company may wish to support several different versions of an operating
system or to support several different operating systems. By using a cross compiler, a
single build environment can be set up to compile for each of these targets.
• Compiling on a server farm.
Similar to compiling for multiple machines, a complicated build that involves many
compile operations can be executed across any machine that is free, regardless of its
underlying hardware or the operating system version that it is running.
1. C4droid - C/C++ compiler & IDE
Basic features:
Offline compiler: create your own applications on Android device and run
them even without Internet access
No root required (but C4droid can use it for your programs if you want)
Full ANSI C and ISO C99 support with TCC + uClibc
Source code editor with syntax highlighting, tabs, code completion, code
formatting, file association and infinite undo/redo
Export & share your programs as APKs or native executables (for terminal
apps).
2. AIDE - Android IDE - Java, C++
AIDE is an integrated development environment (IDE) for developing real
Android Apps directly on Android devices. AIDE supports the full edit-compile-
run cycle: write code with the feature rich editor offering advanced features like
code completion, real-time error checking, refactoring and smart code navigation,
and run your App with a single click. AIDE will turn your Android tablet with
keyboard into a real development box. We use the Transformer Prime running Ice
Cream Sandwich to code with AIDE. AIDE will turn your Android Phone into a
small development computer to browse and touch your code on the go.
AIDE supports building Apps with Java/Xml and the Android SDK as well as
Apps with C/C++ and the Android NDK.
AIDE is fully compatible with Eclipse projects. You can just copy the source code
to your device and open the Eclipse project in AIDE to start coding. Alternatively
you can keep your source code in your Dropbox - AIDE integrates with Dropbox
and allows you to easily download from your Dropbox and sync back your changes.
AIDE can also open Android Studio projects, which follow the default folder
structure.
AIDE supports GIT for professional development
Models of Compiler / Phases of Compilation Process
Model 1 (figure): the scanner converts the character stream into tokens, the parser analyses the token
stream, and the intermediate code generator produces an intermediate representation from which the
target code is obtained.
Model 2 (figure): the compiler viewed as a black box that reads the source program and writes the
object program, with tables shared by all phases.
Model 3 (figure): Character Stream → Lexical Analyzer → Token Stream → Syntax Analyzer →
Syntax Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Target Code.
Phases of Compilation Process:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Machine Independent Code Optimization
6. Code Generation
7. Machine Dependent Code Optimization
1. Lexical Analysis
The compiler uses the operating system's file access mechanism to read the source file from
storage. The file system reads the source and provides records to a character manipulator.
The character manipulator accepts the records and emits characters. It also detects and
removes comments, extra spaces and special control commands that are not part of the
programming language.
The lexical analyzer / scanner accepts the stream of characters and groups them into lexemes.
Lexemes are then matched against the patterns of tokens and a sequence of tokens is generated.
A token is an atomic unit of the language / a logical entity of a programming language.
Examples of tokens include: identifiers, keywords, operators and so on. A token is generally
represented as a pair (Token_Type, Token_Value). A delimiter is a special token
that is represented by its token type (name) alone, without a token value.
Lexeme   Token
a        (ID, 1)
=        ASSIGN_OP
b        (ID, 2)
+        ARITH_OP
c        (ID, 3)
;        SEMICOLON (delimiter: token type only)
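A minimal sketch of a hand-written scanner for statements of the shape a = b + c; is given below (written in C++; the token type names follow the table above and are illustrative):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token { std::string type; std::string value; };

// Group the character stream into lexemes and emit (type, value) pairs.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> tokens;
    for (std::size_t i = 0; i < src.size(); ) {
        char ch = src[i];
        if (std::isspace(static_cast<unsigned char>(ch))) { ++i; continue; }
        if (std::isalpha(static_cast<unsigned char>(ch))) {              // identifier lexeme
            std::string lexeme;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i])))
                lexeme += src[i++];
            tokens.push_back({"ID", lexeme});
        } else if (ch == '=') { tokens.push_back({"ASSIGN_OP", "="}); ++i; }
        else if (ch == '+' || ch == '-') { tokens.push_back({"ARITH_OP", std::string(1, ch)}); ++i; }
        else if (ch == ';') { tokens.push_back({"SEMICOLON", ";"}); ++i; }   // delimiter
        else ++i;                                                            // skip unrecognized characters
    }
    return tokens;
}

int main() {
    for (const Token& t : scan("a = b + c;"))
        std::cout << "(" << t.type << ", " << t.value << ")\n";
}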
2. Syntax Analysis
This is the second phase of the compiler. In this phase, we check the syntax and construct
the abstract syntax tree. This phase is modeled through context free grammars and the
structure is recognized through push down automata or table-driven parsers. The syntax
analysis phase verifies that the string can be generated by the grammar for the source
language. In case of any syntax errors in the program, the parser tries to report as many
errors as possible. Error reporting and recovery form a very important part of the syntax
analyzer.
Example of Syntax Analysis:
a = b + c;
The output of the lexical analyzer is the following stream of tokens:
Lexeme   Token
a        id1
=        =
b        id2
+        +
c        id3
;        ;
Let the context-free grammar (CFG) for arithmetic expressions in the given language be:
E→E=E
E → T+E | T
T → F*T | F
F → id
The abstract syntax tree then has the assignment at the root: the left E derives id1 (through
T → F → id1), and the right E derives id2 + id3 (through T + E, with T → F → id2 and
E → T → F → id3).
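A minimal recursive-descent sketch for the expression part of the grammar above (E → T + E | T, T → F * T | F, F → id); the left-recursive rule E → E = E is excluded, and the token strings are illustrative:

#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Parser {
public:
    explicit Parser(std::vector<std::string> toks) : tokens(std::move(toks)) {}
    void parse() { E(); if (pos != tokens.size()) throw std::runtime_error("trailing input"); }
private:
    std::vector<std::string> tokens;
    std::size_t pos = 0;

    bool match(const std::string& t) {
        if (pos < tokens.size() && tokens[pos] == t) { ++pos; return true; }
        return false;
    }
    void E() { T(); if (match("+")) E(); }   // E -> T + E | T
    void T() { F(); if (match("*")) T(); }   // T -> F * T | F
    void F() { if (!match("id")) throw std::runtime_error("expected id"); }   // F -> id
};

int main() {
    Parser p({"id", "+", "id"});   // token stream for  b + c
    try { p.parse(); std::cout << "b + c parsed successfully\n"; }
    catch (const std::exception& e) { std::cout << "syntax error: " << e.what() << "\n"; }
}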
3. Semantic Analysis
Type checking and type casting are performed in the semantic analysis phase. The output of the
semantic analyzer is an annotated parse tree carrying more information about the data types. A few
tasks performed by the semantic analyzer are:
To check whether variables are of types on which operations are allowed.
To check whether a variable has been declared before use.
To check whether a variable has been initialized.
a = b + c;
Let the data type of ‘a’ and ‘b’ be float and of ‘c’ be integer; the semantic analyzer
must upcast the data type of ‘c’ to float.
In the annotated syntax tree, the node for id3 is wrapped in a conversion node intofloat(id3), so that
the integer value of 'c' is converted to float before the addition.
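A minimal sketch of the kind of type check and implicit upcast the semantic analyzer performs for a = b + c (the type and function names are illustrative):

#include <iostream>

enum class Type { Int, Float };

// Determine the result type of an addition; a mixed int/float operand pair
// forces an implicit conversion of the int side, like intofloat(id3) above.
Type checkAdd(Type left, Type right) {
    if (left == right) return left;
    std::cout << "inserting int-to-float conversion\n";
    return Type::Float;
}

int main() {
    Type a = Type::Float, b = Type::Float, c = Type::Int;
    Type result = checkAdd(b, c);   // b + c  -> float after the conversion
    std::cout << (result == a ? "a = b + c type-checks\n" : "type error\n");
}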
6. Code Generation
The optimized intermediate code is mapped to target-language code in this phase. The
object code may take different forms, such as:
Binary machine code
Symbolic code (e.g. assembly language code)
Some intermediate language code (bytecode)
Example of Code Generation
MOV R1, id2
MOV R2, id3
ADD R1, R2
MOV id1, R1
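A toy sketch of simple code generation for a single statement of the form result = left + right, mirroring the instruction sequence above (the register names and instruction set are illustrative):

#include <iostream>
#include <string>

// Emit register-machine code for  result = left + right.
void genAdd(const std::string& result, const std::string& left, const std::string& right) {
    std::cout << "MOV R1, " << left  << "\n";   // load the first operand
    std::cout << "MOV R2, " << right << "\n";   // load the second operand
    std::cout << "ADD R1, R2\n";                // R1 = R1 + R2
    std::cout << "MOV " << result << ", R1\n";  // store the result
}

int main() { genAdd("id1", "id2", "id3"); }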
Worked example: translation of the statement  position = initial + rate * 60
Lexical Analyzer: produces the token stream id1 = id2 + id3 * 60 and enters position (id1),
initial (id2) and rate (id3) into the SYMBOL TABLE.
Syntax Analyzer: builds the syntax tree for id1 = id2 + (id3 * 60).
Semantic Analyzer: annotates the tree, wrapping the constant as inttofloat(60) so the
multiplication is performed in floating point.
Intermediate Code Generator:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code Optimizer:
t1 = id3 * 60.0
id1 = id2 + t1
Code Generator: translates the optimized intermediate code into target machine code.
A symbol table is a data structure used to keep track of scope and binding information about
names in the source program. The names may include programming elements like variable names,
constants, function names, statement labels, class names, structure names and so on.
The symbol table is searched every time a name is encountered in the source program. Changes
to the table are made if a new name or information about a new name is encountered. Therefore,
the data structure chosen to build the symbol table must be efficient in terms of adding new
entries to it and searching existing names from it.
Example (the numbers on the left are source line numbers): the names a, b and c are entered into
the symbol table when the declaration on line 3 is processed, and are looked up again when the
assignment on line 7 is processed.
3 int a, b, c;
7 a = b + c;
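A minimal sketch of a hash-structured symbol table supporting the insert and lookup operations described above (the stored fields are illustrative):

#include <iostream>
#include <string>
#include <unordered_map>

struct SymbolInfo {
    std::string type;   // e.g. "int"
    int declaredAt;     // source line of the declaration
};

class SymbolTable {
public:
    // Insert a new name; returns false if the name was already declared.
    bool insert(const std::string& name, const SymbolInfo& info) {
        return table.emplace(name, info).second;
    }
    // Look a name up; returns nullptr if it was never declared.
    const SymbolInfo* lookup(const std::string& name) const {
        auto it = table.find(name);
        return it == table.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::string, SymbolInfo> table;   // hash-structured storage
};

int main() {
    SymbolTable st;
    st.insert("a", {"int", 3});                 // int a, b, c;  on line 3
    st.insert("b", {"int", 3});
    st.insert("c", {"int", 3});
    if (const SymbolInfo* s = st.lookup("a"))   // a = b + c;  on line 7
        std::cout << "a : " << s->type << ", declared at line " << s->declaredAt << "\n";
}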
Error Handling
In case of any errors in the program, the compilation phase in which the error occurred tries to
report as many errors as possible. Error reporting and recovery form a very important part of the
compilation process. The error handler has the following goals:
It should report the presence of errors clearly and accurately.
It should recover from each error quickly enough to be able to detect subsequent errors.
It should not significantly slow down the processing of correct programs.
Several phases of the compilation process can be implemented in a single pass. A relatively small
number of passes is desirable from the point of view of the compilation process. Therefore, to
reduce the number of passes, grouping of some phases into one pass is required.
Sometimes it is easy to combine a number of phases into one pass. For example, the Lexical Analyzer
and the Syntax Analyzer can easily be grouped into one pass because the interface between them is
only a token, as sketched below.
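A toy sketch of that single-pass grouping: the parser drives the scanner and requests one token at a time, so no intermediate token file is needed between the two phases (class and token names are illustrative):

#include <iostream>
#include <string>
#include <utility>
#include <vector>

class Scanner {
public:
    explicit Scanner(std::vector<std::string> toks) : tokens(std::move(toks)) {}
    // The only interface the parser needs: hand over the next token.
    std::string nextToken() {
        if (pos < tokens.size()) return tokens[pos++];
        return "$";   // end-of-input marker
    }
private:
    std::vector<std::string> tokens;
    std::size_t pos = 0;
};

class Parser {
public:
    explicit Parser(Scanner& s) : scanner(s) {}
    void parse() {
        for (std::string tok = scanner.nextToken(); tok != "$"; tok = scanner.nextToken())
            std::cout << "parser consumed token: " << tok << "\n";   // stand-in for real parsing actions
    }
private:
    Scanner& scanner;
};

int main() {
    Scanner sc({"id", "=", "id", "+", "id", ";"});
    Parser p(sc);
    p.parse();
}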
But grouping certain phases into a single pass is difficult. For example, grouping the Intermediate
Code Generator and the Object Code Generator is difficult because it is hard to perform object code
generation until a sufficient number of intermediate code statements has been generated. The
interface between the two is not based on only one intermediate instruction (certain languages
permit the use of a variable before its declaration; similarly, forward jumps are allowed by source
languages).
To overcome this problem and enable merging of two such phases into a single pass, the technique
of "back-patching" is used. In this technique, the object code is generated by leaving
"statement holes", and these holes are filled in later when the information becomes available.
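A toy sketch of back-patching (the instruction strings and the list-of-holes representation are illustrative): a forward jump is emitted with a placeholder target, its position is recorded, and the hole is filled once the target address becomes known:

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> code;    // generated object code, one instruction per entry
    std::vector<std::size_t> holes;   // positions of instructions whose target is still unknown

    code.push_back("JMP ????");       // forward jump: the target label has not been seen yet
    holes.push_back(code.size() - 1);
    code.push_back("MOV R1, id2");    // ... more code ...

    std::size_t target = code.size(); // the jump target turns out to be here
    code.push_back("ADD R1, R2");

    for (std::size_t h : holes)       // back-patch: fill every recorded hole
        code[h] = "JMP " + std::to_string(target);

    for (const std::string& instr : code) std::cout << instr << "\n";
}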
Compiler Construction Tools: