Compiler Construction
Module 1
Introduction to Compiler
Review of Grammars, Languages and Automata
What is a Compiler?
The Structure of a Compiler
Introduction to Compiler
A Compiler is a software that typically takes a high level language (Like C++ and
Java) code as input and converts the input to a lower level language at once. It lists
all the errors if the input code does not follow the rules of its language. This
process is much faster than interpreter but it becomes difficult to debug all the
errors together in a program.
Review of Grammars, Languages and Automata
In automata theory, a grammar is a formal structure that defines the syntax of a
language. It consists of a set of production rules that describe how symbols of the
language can be combined to produce strings. Grammatical frameworks are
essential in the study of formal languages, which are sets of strings defined over a
specific alphabet.
Definition of Compiler
A compiler is a translating program that translates the instructions of high level
language to machine level language. A program which is input to the compiler is
called a Source program. This program is now converted to a machine level
language by a compiler is known as the Object code.
1
Compiler Construction
How does a compiler work?
The process of compiling code involves several steps.Basically, Code passes
through these steps sequentially and if there is any mistake in code then it will be
examined through these steps and thus compilation process stops in-between and
show compilation error.Oherwise if everything is OK then compiler does not show
any error and compile the source code . There are the following steps which are
involved during compiling
Lexical Analysis
The source code is first tested by the compiler’s lexer, which breaks the source
code into tokens, such as keywords, identifiers, operators, and punctuation.
Syntax Analysis
The next step is syntax analysis, where the compiler’parser checks the code for
syntax errors and ensures that it follows the rules of the programming language.
The compiler generates an Abstract Syntax Tree (AST) that represents the structure
of the code.
Semantic Analysis
Once the code is syntactically correct, the compiler performs semantic analysis on
parsed code to find the meaning. The compiler check for logical errors, such as
type mismatches, undeclared variables, and incorrect usage of operators.
Optimization
The compiler may perform various optimizations to improve the performance of
the resulting code. It is an optional phase of compiler. It removes dead code and
arranges the sequence of instructions in order to boost the program execution
Code generations
The last step is code generation, where the compiler translates the AST into
machine-readable code. The code generator creates assembly language code, which
is then translated into binary code that can be executed by the computer.
Why are compilers important?
There are the following reasons why compilers are important to us:
1)Enabling concept of High Level Language
2
Compiler Construction
They enable developers to write code in high-level programming languages,
which are easier to understand and more human-readable than machine code.
2)One time effort(Efficiency)
Compilers convert high-level code into machine code .Now, this machine code is
stored in a file having extension(.exe) which is directly executed by computer.
So,programmer need only one time effort to compile code into machine code and
after that they can use code whenever they want using (.exe) file. This process
allows programs to run much faster and more efficiently than if they were
interpreted at runtime.
3)Portability
Compilers translates source code which is written in high-level programming
language into machine code.And thus,it enable portability feature because this
machine code can be runned on many different operating systems and hardware
architectures because it is platform independent.
4)Security
Compiler can provide programmer secuirity by preventing memory-related errors,
such as buffer overflows, by analyzing and optimizing the code. It can also
generate warnings or errors if it detects potential memory issues.
There are different Compilers :
Cross-Compiler – The compiled program can run on a computer whose
CPU or Operating System is different from the one on which the compiler
runs.
Bootstrap Compiler – The compiler written in the language that it intends
to compile.
Decompiler – The compiler that translates from a low-level language to a
higher level one.
Transcompiler – The compiler that translates high level languages.
A compiler can translate only those source programs which have been written in the
language for which the compiler is meant. Each high level programming language
requires a separate compiler for the conversion.
3
Compiler Construction
For Example, a FORTRAN compiler is capable of translating into
a FORTRAN program. A computer system may have more than one compiler to
work for more than one high level languages.
Top most Compilers used according to the Computer Languages –
C– Turbo C, Tiny C Compiler, GCC, Clang, Portable C Compiler
C++ -GCC, Clang, Dev C++, Intel C++, Code Block
JAVA– IntelliJ IDEA, Eclipse IDE, NetBeans, BlueJ, JDeveloper
Kotlin– IntelliJ IDEA, Eclipse IDE
Python– CPython, JPython, Wing, Spyder
JavaScript– WebStorm, Atom IDE, Visual Studio Code, Komodo Edit
Tokens in compiler design are the sequence of characters which represents a unit of information
in the source program.
The above program is fed to Lexical Analyzer
Examples of Tokens created
Lexeme Token
int Keyword
maximum Identifier
( Operator
int Keyword
x Identifier
, Operator
int Keyword
Y Identifier
) Operator
{ Operator
If Keyword
What is Identifier?
Identifier refers to name given to entities such as variables, functions, structures etc. Identifiers
must be unique. They are created to give a unique name to an entity to identify it during the
execution of the program. Identifier name is made up of Only alphabetic characters, numeric
digits, and the underscore character (_) are legal in an identifier. The first character of an
identifier must be alphabetic or an underscore (it cannot be a numeric digit). Upper case letters
are considered distinct from lower case letters; that is, identifiers are case sensitive.
Transition diagram is a special kind of flowchart for language analysis
4
Compiler Construction
Types of Compiler
Single Pass Compilers
Two Pass Compilers
Multipass Compilers
Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use
modern software development environments containing tools such as
language editors, debuggers, version managers, profilers, test harnesses, and so
on. In addition to these general software-development tools, other more specialized
tools have been created to help implement various phases of a compiler.
These tools use specialized languages for specifying and implementing specific
components, and many use quite sophisticated algorithms. The most successful
tools are those that hide the details of the generation algorithm and produce
components that can be easily integrated into the remainder of thE
compiler. Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a
grammatical description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression
description of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for
walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a
collection of rules for translating each operation of the intermediate language
into the machine language for a target machine.
5
Compiler Construction
5. Data-flow analysis engines that facilitate the gathering of information about how
values are transmitted from one part of a program to each other part. Data-flow
analysis is a key part of
code optimization.
6. Compiler-construction toolk2ts that provide an integrated set of routines for
constructing various phases of a compiler.
Instruction formats refer to the way instructions are encoded and represented in
machine language. There are several types of instruction formats, including zero,
one, two, and three-address instructions.
Each type of instruction format has its own advantages and disadvantages in terms
of code size, execution time, and flexibility. Modern computer architectures
typically use a combination of these formats to provide a balance between
simplicity and power.
Different Types of Instruction Fields
A computer performs a task based on the instructions provided. Instructions in
computers are comprised of groups called fields. These fields contain different
information for computers which are all written in 0s and 1s. Each field has a
different significance or meaning, based on which a CPU decides what to perform.
The most common fields are:
The operation field specifies the operation to be performed, like addition.
Address field which contains the location of the operand, i.e., register or
memory location.
Mode field which specifies how operand is to be founded.
An instruction is of variable length depending upon the number of addresses it
contains. Generally, CPU organization is of three types based on the number of
address fields:
Single Accumulator organization: In this, the operation is performed using a
special register called the accumulator.
6
Compiler Construction
General register organization : In Register Organization, multiple registers
are used for the computation purpose.
Stack organization: In the third organization, the work on stack basis
operation due to which it does not contain any address field.
Types of Instructions
Based on the number of addresses, instructions are classified
as:
NOTE: We will use the X = (A+B)*(C+D) expression to
showcase the procedure.
Zero Address Instructions
These instructions do not specify any operands or addresses.
Instead, they operate on data stored in registers or memory
locations implicitly defined by the instruction. For example, a
zero-address instruction might simply add the contents of two
registers together without specifying the register names.
A stack-based computer does not use the address field in the instruction. To
evaluate an expression, it is first converted to reverse Polish Notation i.e. Postfix
Notation.
7
Compiler Construction
One Address Instructions
These instructions specify one operand or address, which typically refers to a
memory location or register. The instruction operates on the contents of that
operand, and the result may be stored in the same or a different location. For
example, a one-address instruction might load the contents of a memory location
into a register.
This uses an implied ACCUMULATOR register for data manipulation. One
operand is in the accumulator and the other is in the register or memory location.
8
Compiler Construction
Implied means that the CPU already knows that one operand is in the accumulator
so there is no need to specify it
Two Address Instructions
These instructions specify two operands or addresses, which may be memory
locations or registers. The instruction operates on the contents of both operands,
and the result may be stored in the same or a different location. For example, a
two-address instruction might add the contents of two registers together and store
the result in one of the registers.
9
Compiler Construction
This is common in commercial computers. Here two addresses can be specified in
the instruction. Unlike earlier in one address instruction, the result was stored in
the accumulator, here the result can be stored at different locations rather than just
accumulators, but require more number of bit to represent the address.
10
Compiler Construction
11