Compiler Design Unit-1
UNIT - 1
TRANSLATOR
A translator is a program that takes as input a program written in one language and
produces as output a program in another language. Besides program translation, the
translator performs another very important role: error detection. Any violation of the HLL
specification is detected and reported to the programmer. The important roles of a
translator are:
1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of
the HLL.
TYPES OF TRANSLATORS:-
a. Compiler
b. Interpreter
c. Preprocessor
Compiler
A compiler translates a source program written in an HLL into a target program, reporting
error messages as it does so. Executing a program written in an HLL programming language
basically consists of two parts: the source program must first be compiled, i.e. translated into
an object program; then the resulting object program is loaded into memory and executed.
Interpreter
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters; JAVA also
uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
Modification of the user program can easily be made and implemented as execution
proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for an interpreted
program.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into object form that
the loader could "relocate" directly behind the user's program. The task of adjusting
programs so they may be placed in arbitrary core locations is called relocation.
STRUCTURE OF A COMPILER
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of the machine language program are carried out during this phase. The
output of the code generator is the machine language program for the specified computer.
Table Management (or) Book-keeping:-
This is the portion that keeps track of the names used by the program and records
essential information about each. The data structure used to record this information is called
a 'Symbol Table'.
Error Handlers:-
It is invoked when a flaw (error) in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the
syntax analyzer or parser. The SA groups the tokens together into syntactic structures called
expressions. Expressions may further be combined to form statements. The syntactic
structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer
occur in patterns that are permitted by the specification of the source language. It also
imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /,
the syntax analyzer should detect an error situation, because the presence of these two
adjacent binary operators violates the formation rules of an expression.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space. Its output is another intermediate code program that
does the same job as the original, but in a way that saves time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program
to make an improvement. For example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common
sub-expressions:
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
2. Loop Optimization:-
Another important source of optimization concerns increasing
the speed of loops. A typical loop improvement is to move a
computation that produces the same result each time around the loop
to a point in the program just before the loop is entered, as the
sketch below illustrates.
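As a C++ sketch (the names a, x, and y are illustrative, and real compilers apply this transformation to intermediate code rather than source), loop-invariant code motion looks like this:

#include <cstddef>
#include <vector>

// Before optimization: x * y is recomputed on every trip around the
// loop, although neither x nor y changes inside it.
void before(std::vector<int>& a, int x, int y) {
    for (std::size_t i = 0; i < a.size(); i++)
        a[i] = x * y + (int)i;
}

// After loop optimization: the invariant computation is moved to a
// point just before the loop is entered, so it is done only once.
void after(std::vector<int>& a, int x, int y) {
    int t = x * y;
    for (std::size_t i = 0; i < a.size(); i++)
        a[i] = t + (int)i;
}

int main() {
    std::vector<int> a(4);
    after(a, 6, 7);      // a becomes {42, 43, 44, 45}
    before(a, 6, 7);     // same result, but more work per iteration
    return 0;
}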
Code Generator:-
Code Generator produces the object code by deciding on the memory locations
for data, selecting code to access each datum and selecting the registers in which each
computation is to be done. Many computers have only a few high speed registers in which
computations can be performed quickly. A good code generator would attempt to utilize
registers as efficiently as possible.
Table Management OR Book-keeping:-
A compiler needs to collect information about all the data objects that appear
in the source program. The information about data objects is collected by the early phases of
the compiler (the lexical and syntactic analyzers). The data structure used to record this
information is called a Symbol Table.
Error Handling:-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the programmer to
determine exactly where the errors have occurred. Errors may occur in any of the phases of a
compiler.
Whenever a phase of the compiler discovers an error, it must report the error to
the error handler, which issues an appropriate diagnostic message. Both the table-management
and error-handling routines interact with all phases of the compiler.
Example:
position := initial + rate * 60
The Lexical Analyzer turns this statement into the token stream id1 := id2 + id3 * 60.
The Syntax Analyzer builds a parse tree for the assignment: := applied to id1 and the
expression id2 + (id3 * 60).
The Semantic Analyzer inserts a conversion of the integer constant 60 to real
(int to real), since id3 holds a real value.
After the Code Optimizer, the Code Generator produces, for example:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The history of programming languages spans from documentation of early mechanical computers to
modern tools for software development. Early programming languages were highly specialized, relying
on mathematical notation.
The first step towards more people-friendly programming languages was the development of mnemonic
assembly languages in the early 1950's. The instructions in assembly languages were just mnemonic
representations of machine instructions.
A major step towards higher-level languages was made in the latter half of the 1950's with the
development of Fortran for scientific computation, Cobol for business data processing, and Lisp for
symbolic computation.
In the following decades many more languages were created with innovative features to help
make programming easier, more natural, and more robust.
Languages can also be classified in a variety of ways.
Classification by Generation: 1st generation languages are the machine languages, 2nd generation are
the assembly languages, and 3rd generation are the higher-level languages like Fortran, Cobol, Lisp, C,
etc. 4th generation languages are those designed for specific applications, like NOMAD, SQL, and
Postscript. The term fifth-generation language has been applied to logic- and constraint-based languages
like Prolog and OPS5.
Classification by Use: imperative languages, in which your program specifies how the
computation is to be done, and declarative languages, in which your program specifies what
computation is to be done.
Examples:
Imperative languages: C, C++, C#, Java.
Declarative languages: ML, Haskell, Prolog.
An object-oriented language is one that supports object-oriented programming, a
programming style in which a program consists of a collection of objects that interact with
one another.
Examples: Simula 67, Smalltalk, C++, Java, Ruby.
Scripting languages are interpreted languages with high-level operators designed for
"gluing together" computations. These computations were originally called scripts.
A compiler must accept all source programs that conform to the specification of the language;
the set of source programs is infinite and any program can be very large, consisting of possibly millions
of lines of code. Any transformation performed by the compiler while translating a source program must
preserve the meaning of the program being compiled. Compiler writers thus have influence over not just
the compilers they create, but all the programs that their compilers compile. This leverage makes
writing compilers particularly rewarding; however, it also makes compiler development challenging.
Modelling in compiler design and implementation: The study of compilers is mainly a study of how
we design the right mathematical models and choose the right algorithms. Some of the most fundamental
models are finite-state machines and regular expressions. These models are useful for describing the
lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the
compiler to recognize those units. Also among the most fundamental models are context-free grammars,
used to describe the syntactic structure of programming languages, such as the nesting of parentheses or
control constructs. Similarly, trees are an important model for representing the structure of programs
and their translation into object code.
The science of code optimization: The term "optimization" in compiler design refers to the attempts
that a compiler makes to produce code that is more efficient than the obvious code. In modern times,
the optimization of code that a compiler performs has become both more important and more complex.
It is more complex because processor architectures have become more complex, yielding more
opportunities to improve the way code executes. It is more important because massively parallel
computers require substantial optimization, or their performance suffers by orders of magnitude.
Compiler optimizations must meet the following design objectives:
1. The optimization must be correct, that is, preserve the meaning of the compiled program,
2. The optimization must improve the performance of many programs,
3. The compilation time must be kept reasonable, and
4. The engineering effort required must be manageable.
Thus, in studying compilers, we learn not only how to build a compiler, but also the general
methodology of solving complex and open-ended problems.
Specialized Architectures: Over the last three decades, many architectural concepts have been
proposed. They include data flow machines, vector machines, VLIW (Very Long Instruction Word)
machines, SIMD (Single Instruction, Multiple Data) arrays of processors, systolic arrays,
multiprocessors with shared memory, and multiprocessors with distributed memory. The development
of each of these architectural concepts was accompanied by the research and development of
corresponding compiler technology.
Program Translations: The following are some of the important applications of program-translation
techniques.
Binary Translation: Compiler technology can be used to translate the binary code for one machine to
that of another, allowing a machine to run programs originally compiled for another instruction set.
Binary translation technology has been used by various computer companies to increase the availability
of software for their machines.
Hardware Synthesis: Not only is most software written in high-level languages; even hardware
designs are mostly described in high-level hardware description languages like Verilog and VHDL.
Hardware designs are typically described at the register transfer level (RTL), where variables represent
registers and expressions represent combinational logic.
Database Query Interpreters: Besides specifying software and hardware, languages are useful in
many other applications. For example, query languages, especially SQL (Structured Query Language),
are used to search databases. Database queries consist of predicates containing relational and boolean
operators. They can be interpreted or compiled into commands to search a database for records
satisfying that predicate.
Programming Language Basics:
Most languages, including C and its family, use static scope. We consider static-scope rules for a
language with blocks, where a block is a grouping of declarations and statements. C uses braces { and }
to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol.
A C program consists of a sequence of top-level declarations of variables and
functions. Functions may have variable declarations within them, where variables include local variables
and parameters. The scope of each such declaration is restricted to the function in which it appears.
The scope of a top-level declaration of a name x consists of the entire program that follows, with the
exception of those statements that lie within a function that also has a declaration of x.
A block is a sequence of declarations followed by a sequence of statements, all surrounded by
braces. A declaration D "belongs" to a block B if B is the most closely nested block containing D; that
is, D is located within B, but not within any block that is nested within B. The static-scope rule for
variable declarations in a block-structured language is as follows. If declaration D of name x belongs
to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in
which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x
belongs to B'.
An equivalent way to express this rule is to focus on a use of a name x. Let B1, B2, ..., Bk be
all the blocks that surround this use of x, with Bk the smallest, nested within Bk-1, which is nested
within Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to Bi.
This use of x refers to the declaration in Bi. Alternatively, this use of x is within the scope of the
declaration in Bi.
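A minimal C++ sketch of these rules (the variable names and printed values are illustrative):

#include <iostream>

int main() {
    int x = 1;           // this declaration of x belongs to the outer block B1
    {                    // block B2, nested within B1
        int x = 2;       // x is redeclared; this declaration hides the outer x
        std::cout << x << '\n';  // prints 2: the use refers to the smallest
                                 // enclosing block declaring x, namely B2
    }
    std::cout << x << '\n';      // prints 1: back in B1, the outer declaration applies
    return 0;
}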
Explicit Access Control
Through the use of keywords like public, private, and protected, object-oriented languages such as
C++ or Java provide explicit control over access to member names in a superclass. These keywords
support encapsulation by restricting access. Thus, private names are purposely given a scope that
includes only the method declarations and definitions associated with that class and any "friend" classes
(the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside
the class.
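A short C++ sketch of the three access levels (the Account class is hypothetical):

#include <iostream>

class Account {
public:
    int balance() const { return bal; }  // public: accessible from outside the class
protected:
    void audit() {}                      // protected: accessible to subclasses
private:
    int bal = 0;                         // private: only this class and its friends
};

int main() {
    Account a;
    std::cout << a.balance() << '\n';    // OK: public member
    // a.bal = 10;                       // error: bal is private
    return 0;
}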
Dynamic Scope
Any scoping policy is dynamic if it is based on factors that can be known only when the program
executes. The term dynamic scope, however, usually refers to the following policy: a use of a
name x refers to the declaration of x in the most recently called procedure with such a declaration.
Dynamic scoping of this type appears only in special situations. We shall consider two examples of
dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented
programming.
Declarations and Definitions
Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a
declaration of i, while i = 1 is a definition of i.
The difference is more significant when we deal with methods or other procedures. In C++, a method
is declared in a class definition, by giving the types of the arguments and result of the method (often
called the signature for the method). The method is then defined, i.e., the code for executing the method
is given, in another place. Similarly, it is common to define a C function in one file and declare it in
other files where the function is used.
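A small C++ sketch of the distinction (the function max is illustrative):

// Declaration: gives the types of the arguments and the result
// (the signature); no code is supplied here.
int max(int a, int b);

// Definition: the code for executing the function is given,
// possibly in another file.
int max(int a, int b) { return a > b ? a : b; }

int main() { return max(1, 2) == 2 ? 0 : 1; }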
Parameter Passing Mechanisms
In this section, we shall consider how the actual parameters (the parameters used in the call of a
procedure) are associated with the formal parameters (those used in the procedure definition). Which
mechanism is used determines how the calling-sequence code treats parameters. The great majority of
languages use either "call-by-value," or "call-by-reference," or both.
Call - by - Value
In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable).
The value is placed in the location belonging to the corresponding formal parameter of the called
procedure. This method is used in C and Java, and is a common option in C++, as well as in most
other languages. Call-by-value has the effect that all computation involving the formal parameters done
by the called procedure is local to that procedure, and the actual parameters themselves cannot be
changed.
Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the
callee. Likewise, array names passed as parameters in C, C++, or Java give the called procedure what
is in effect a pointer or reference to the array itself. Thus, if a is the name of an array of the calling
procedure, and it is passed by value to corresponding formal parameter x, then an assignment such as
x[i] = 2 really changes the array element a[i]. The reason is that, although x gets a copy of the value
of a, that value is really a pointer to the beginning of the area of the store where the array named a is
located.
Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This
observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value
exclusively, whenever we pass the name of an object to a called procedure, the value received by that
procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the
object itself.
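The following C++ sketch (with hypothetical names) shows both effects: the copied scalar is unchanged in the caller, while the array element really changes:

#include <iostream>

// v is copied (call-by-value), while the array name x is in effect
// a pointer to the caller's array.
void callee(int v, int x[]) {
    v = 99;        // changes only the local copy of the actual parameter
    x[1] = 2;      // really changes the caller's element a[1]
}

int main() {
    int n = 1;
    int a[] = {10, 20, 30};
    callee(n, a);
    std::cout << n << ' ' << a[1] << '\n';   // prints: 1 2
    return 0;
}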
Call - by - Reference
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented
by following this pointer to the location indicated by the caller. Changes to the formal parameter thus
appear as changes to the actual parameter.
If the actual parameter is an expression, however, then the expression is evaluated before the call, and
its value stored in a location of its own. Changes to the formal parameter change this location, but can
have no effect on the data of the caller.
Call-by-reference is used for "ref" parameters in C++ and is an option in many other languages. It is
almost essential when the formal parameter is a large object, array, or structure. The reason is that strict
call-by-value requires that the caller copy the entire actual parameter into the space belonging to the
corresponding formal parameter. This copying gets expensive when the parameter is large. As we noted
when discussing call-by-value, languages such as Java solve the problem of passing arrays, strings, or
other objects by copying only a reference to those objects. The effect is that Java behaves as if it used
call-by-reference for anything other than a basic type such as an integer or real.
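A minimal C++ sketch of a reference parameter:

#include <iostream>

// The callee works through a reference to the caller's location,
// so changes to the formal parameter appear as changes to the
// actual parameter.
void increment(int& p) { p = p + 1; }

int main() {
    int n = 5;
    increment(n);
    std::cout << n << '\n';   // prints 6: the actual parameter changed
    return 0;
}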
Call - by - Name
A third mechanism, call-by-name, was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the actual
parameter (with renaming of local names in the called procedure, to keep them distinct). When the
actual parameter is an expression rather than a variable, some unintuitive behaviors occur, which is one
reason this mechanism is not favored today.
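C and C++ have no call-by-name, but a preprocessor macro gives a rough feel for the textual-substitution semantics. This is only an analogy: unlike Algol 60, macros do not rename local names. The hypothetical SQUARE macro below shows the kind of unintuitive behavior the text mentions:

#include <iostream>

// The macro body behaves as if the actual parameter were
// substituted literally for the formal parameter.
#define SQUARE(e) ((e) * (e))

int main() {
    int i = 2;
    // SQUARE(i + 1) expands to ((i + 1) * (i + 1)): the "parameter"
    // expression is evaluated twice. With a side-effecting argument
    // such as i++, this double evaluation becomes visible, one of the
    // surprises that made call-by-name unpopular.
    std::cout << SQUARE(i + 1) << '\n';   // prints 9
    return 0;
}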
LEXICAL ANALYSIS
Upon receiving a get-next-token command from the parser, the lexical analyzer
reads input characters until it can identify the next token. The LA returns to the parser a
representation for the token it has found. The representation will be an integer code if the
token is a simple construct such as a parenthesis, comma, or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank,
tab, and newline characters. Another is correlating error messages from the compiler with the
source program.
LEXICAL ANALYSIS VS PARSING:
The lexical analyzer (the "lexer") parses individual symbols from the source code file
into tokens. From there, the "parser" proper turns those whole tokens into sentences of
your grammar. A parser does not give the nodes any meaning beyond structural cohesion;
the next thing to do is extract meaning from this structure (sometimes called contextual
analysis).
INPUT BUFFERING
The LA scans the characters of the source program one at a time to discover tokens.
Because a large amount of time can be consumed scanning characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for
the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into
two halves of, say, 100 characters each. One pointer marks the beginning of the token being
discovered. A lookahead pointer scans ahead of the beginning point, until the token is
discovered. We view the position of each pointer as being between the character last read and
the character next to be read. In practice each buffering scheme adopts one convention: either
a pointer is at the symbol last read or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be
large. For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2, ... ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the
character that follows the right parenthesis. In either case, the token itself ends at the second E.
If the lookahead pointer travels beyond the buffer half in which it began, the other half must
be loaded with the next characters from the source file.
Since the buffer shown in the figure above is of limited size, there is an implied constraint on how
much lookahead can be used before the next token is discovered. In the above example, if the
lookahead traveled to the left half and all the way through the left half to the middle, we could
not reload the right half, because we would lose characters that had not yet been grouped into
tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we
cannot ignore the fact that lookahead is limited.
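The scheme can be sketched as follows in C++; the half size N, the helper names, and the use of '\0' as the sentinel character are illustrative choices, not a fixed interface:

#include <cstdio>

const int N = 100;                   // characters per buffer half
char buf[2 * (N + 1)];               // two halves, each with a sentinel slot
char* forward = buf;                 // lookahead pointer

// Load one half from the source and plant the sentinel after the
// characters actually read, so the scanner tests for "end of half"
// (or end of file) only when it sees the sentinel.
void loadHalf(char* half, std::FILE* src) {
    std::size_t n = std::fread(half, 1, N, src);
    half[n] = '\0';                  // sentinel
}

// Deliver the next input character, refilling halves as needed.
// Returns '\0' when the true end of input is reached.
char nextChar(std::FILE* src) {
    char c = *forward++;
    if (c != '\0') return c;
    if (forward == buf + N + 1) {              // end of first half
        loadHalf(buf + N + 1, src);            // refill second half
    } else if (forward == buf + 2 * (N + 1)) { // end of second half
        loadHalf(buf, src);                    // wrap around: refill first half
        forward = buf;
    } else {
        return '\0';                           // sentinel mid-half: real EOF
    }
    return *forward++;
}

int main() {
    loadHalf(buf, stdin);            // prime the first half
    for (char c = nextChar(stdin); c != '\0'; c = nextChar(stdin))
        std::putchar(c);             // echo the input, one character at a time
    return 0;
}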
Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) Keywords 3) Operators 4) Special symbols 5) Constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token.
Example:
Token | Pattern | Sample lexemes
if | if | if
relop | < or <= or = or <> or >= or > | <, <=
id | letter followed by letters & digits | i, pi
num | any numeric constant | 3.14, 60
LEXICAL ERRORS:
Lexical errors are the errors thrown by your lexer when it is unable to continue, which
means that there is no way to recognise a lexeme as a valid token for your lexer. Syntax
errors, on the other hand, will be thrown by your parser when a given set of already
recognised valid tokens does not match any of the right sides of your grammar rules.
A simple panic-mode error handling system requires that we return to a high-level
parsing function when a parsing or lexical error is detected.
REGULAR EXPRESSIONS
A regular expression over an alphabet ∑ is built from the following rules:
o ε is a regular expression denoting { ε }, that is, the language containing only the
empty string.
o For each 'a' in ∑, a is a regular expression denoting { a }, the language with only
one string consisting of the single symbol 'a'.
o If R and S are regular expressions, then (R)|(S), (R)(S), and (R)* are regular
expressions denoting the union L(R) ∪ L(S), the concatenation L(R)L(S), and the
Kleene closure (L(R))*, respectively.
REGULAR DEFINITIONS
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because it presents an interesting structure of lexemes. The
terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens
as far as the lexical analyzer is concerned; the patterns for these tokens are described using
regular definitions:
digit --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id --> letter (letter | digit)*
if --> if
then --> then
else --> else
relop --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when we
recognize it, we do not return it to the parser, but rather restart the lexical analysis from the
character that follows the white space. It is the following token that gets returned to the
parser.
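As a rough cross-check of these definitions, they can be transcribed into C++ std::regex syntax (ECMAScript notation, which differs slightly from the notation above); this is an illustrative sketch, not part of a real lexer:

#include <iostream>
#include <regex>

int main() {
    // The regular definitions for id and number, transcribed.
    std::regex id("[A-Za-z][A-Za-z0-9]*");
    std::regex number("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

    std::cout << std::regex_match("pi", id) << '\n';          // prints 1
    std::cout << std::regex_match("6.02E23", number) << '\n'; // prints 1
    std::cout << std::regex_match("2E", number) << '\n';      // prints 0
    return 0;
}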
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking for a
lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled
by a symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s
labeled by a. If we find such an edge, we advance the forward pointer and enter the
state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the lexemeBegin
and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start, or initial, state; it is indicated by an edge labeled "start"
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been read.
If = if
Then = then
Else = else
Relop = < | <= | = | <> | > | >=
Id = letter (letter | digit)*
Num = digit+ (. digit+)? (E (+ | -)? digit+)?
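Such a diagram translates almost mechanically into code. Below is a C++ sketch of the relop diagram; the states are folded into nested tests, and the token names are illustrative. Where the diagram has a starred accepting state, the code simply avoids consuming the lookahead character:

#include <iostream>
#include <string>

enum Token { LT, LE, NE, EQ, GT, GE, NONE };

Token relop(const std::string& s, std::size_t& pos) {
    switch (s[pos++]) {                 // start state: read one symbol
    case '<':
        if (s[pos] == '=') { pos++; return LE; }   // lexeme <=
        if (s[pos] == '>') { pos++; return NE; }   // lexeme <>
        return LT;                      // starred state: lookahead not consumed
    case '=':
        return EQ;
    case '>':
        if (s[pos] == '=') { pos++; return GE; }   // lexeme >=
        return GT;                      // starred state: lookahead not consumed
    default:
        pos--;                          // retract: not a relop at all
        return NONE;
    }
}

int main() {
    std::size_t pos = 0;
    std::string src = "<=";
    std::cout << (relop(src, pos) == LE) << '\n';   // prints 1
    return 0;
}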
2.10 AUTOMATA
DESCRIPTION OF AUTOMATA
Deterministic Automata
Non-Deterministic Automata.
DETERMINISTIC AUTOMATA
A deterministic finite automaton has at most one transition from each state on any
input. A DFA is a special case of an NFA in which:
1. no state has an ε-transition, and
2. for each state S and input symbol a, there is at most one edge labeled a leaving S.
The finite automaton is called a DFA if there is only one path for a specific input from the
current state to the next state. A regular expression is converted into a minimized DFA by
first building an NFA, converting the NFA to a DFA (subset construction), and then
minimizing the number of DFA states.
(Figure: a DFA transition diagram with states S0, S1, and S2 and edges labeled a.)
NONDETERMINISTIC AUTOMATA
An NFA consists of:
A set of states S.
A set of input symbols ∑.
A transition function move that maps state-symbol pairs to sets of states.
A state s0 that is distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
In an NFA there may be a number of transitions out of a state on a single input symbol.
An NFA can be diagrammatically represented by a labeled directed graph, called a
transition graph, in which the nodes are the states and the labeled edges represent
the transition function. This graph looks like a transition diagram, but the same character
can label two or more transitions out of one state, and edges can be labeled by the special
symbol ε as well as by input symbols. As an example, consider an NFA that recognizes the
language (a | b)*abb.
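The equivalent DFA for (a | b)*abb can be simulated with a small transition table; the following C++ sketch (the state numbering is one conventional choice, not the only one) accepts exactly the strings of a's and b's that end in abb:

#include <iostream>
#include <string>

bool matchesABB(const std::string& s) {
    //                       input:  a  b
    static const int delta[4][2] = {{1, 0},   // state 0 (start)
                                    {1, 2},   // state 1: just seen "a"
                                    {1, 3},   // state 2: just seen "ab"
                                    {1, 0}};  // state 3: just seen "abb" (accepting)
    int state = 0;
    for (char c : s) {
        if (c != 'a' && c != 'b') return false;  // symbol not in the alphabet
        state = delta[state][c == 'b'];
    }
    return state == 3;    // accept iff we end in the final state
}

int main() {
    std::cout << matchesABB("aabb") << matchesABB("abab") << '\n';  // prints 10
    return 0;
}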
DEFINITION OF CFG
A context-free grammar (CFG) consists of a set of terminals, a set of nonterminals, a
designated start symbol, and a set of productions, each of which has a nonterminal on its
left side and a string of terminals and/or nonterminals on its right side.
Creating a lexical analyzer with Lex
Lex specifications:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants, and
regular definitions.
2. The translation rules each consist of a pattern and an associated action; the pattern is a
regular expression and the action is a fragment of code to be executed when a lexeme
matches the pattern.
3. The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.