6th SEM CSE
A translator inputs and then converts a source program into
an object or target program.
Source program is written in a source language
Object program belongs to an object language
A translator could be an assembler, a compiler, or an interpreter.
Assembler:
source program (in assembly language) → Assembler → object program (in machine language)
- Compiler: translates a source program written in a High-
Level Language (HLL) such as Pascal, C++ into
computer’s machine language (Low-Level Language
(LLL)).
* The time of conversion from source program into
object program is called compile time
* The object program is executed at run time
- Interpreter: processes an internal form of the source
program and data at the same time (at run time); no object
program is generated.
A compiler can be treated as a single box that maps a
source program into a semantically equivalent target
program.
The analysis part (Front End) breaks up the source
program into constituent pieces and imposes a
grammatical structure on them.
The synthesis part (Back End) constructs the desired
target program from the intermediate representation
and the information in the symbol table.
Lexical analyzer reads the stream of characters making up the
source program and groups the characters into meaningful
sequences called lexeme.
For each lexeme, the lexical analyzer produces a token of the
form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis.
Token-name: an abstract symbol is used during syntax analysis
Attribute-value: points to an entry in the symbol table for this
token.
1. "position" is a lexeme mapped into the token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component. For notational convenience, the lexeme itself is used as the name of the abstract symbol.
3. "initial" is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. "rate" is a lexeme mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
Blanks separating the lexemes would be discarded by the lexical analyzer.
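As a rough illustration (not part of the original notes), here is a minimal C sketch of how such (token-name, attribute-value) pairs might be represented for the statement position = initial + rate * 60; the struct layout and enum names are assumptions.

#include <stdio.h>

/* A token is a pair: an abstract token name plus an optional attribute.
   For id tokens the attribute is an index into the symbol table. */
enum token_name { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct token {
    enum token_name name;
    int attribute;          /* symbol-table index, literal value, or unused */
};

int main(void) {
    /* Token stream for: position = initial + rate * 60 */
    struct token stream[] = {
        { ID, 1 }, { ASSIGN, 0 }, { ID, 2 },
        { PLUS, 0 }, { ID, 3 }, { TIMES, 0 }, { NUMBER, 60 }
    };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d, %d>\n", stream[i].name, stream[i].attribute);
    return 0;
}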
The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation
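A possible C representation of such a syntax-tree node, sketched here as an assumption for illustration:

/* Interior node: an operator with its operand subtrees.
   Leaves carry a symbol-table index (or literal value) and no children. */
struct syntax_node {
    char op;                     /* '+', '*', '=' ...; 0 for a leaf */
    int  symtab_index;           /* meaningful only for leaves */
    struct syntax_node *left;
    struct syntax_node *right;
};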
The semantic analyzer uses the syntax tree and the information in the symbol
table to check the source program for semantic consistency with the language
definition.
Gathers type information and saves it in either the syntax tree or the symbol table,
for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler
checks that each operator has matching operands. For example, many
programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used to index an array.
The language specification may permit some type conversions called coercions.
For example, a binary arithmetic operator may be applied to either a pair of
integers or to a pair of floating-point numbers. If the operator is applied to a
floating-point number and an integer, the compiler may convert or coerce the
integer into a floating-point number.
After syntax and semantic analysis of the source program, many compilers
generate an explicit low-level or machine-like intermediate representation
(a program for an abstract machine). This intermediate representation
should have two important properties:
◦ It should be easy to produce and
◦ It should be easy to translate into the target machine.
The intermediate form considered here is called three-address code, which consists
of a sequence of assembly-like instructions with three operands per instruction.
Each operand can act like a register.
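For the running statement position = initial + rate * 60, a three-address sequence along the following lines could be produced; the temporary names t1, t2, t3 and the identifiers id1, id2, id3 follow the usual textbook convention and are assumptions here:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3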
The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result.
Usually better means:
◦ faster, shorter code, or target code that consumes less power.
The optimizer can deduce that the conversion of 60 from integer to floating
point can be done once and for all at compile time, so the int-to-float
operation can be eliminated by replacing the integer 60 with the floating-point
number 60.0. Moreover, t3 is used only once, so it too can be eliminated.
There are simple optimizations that significantly improve the running time
of the target program without slowing down compilation too much.
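Under those observations, the sequence above could shrink to something like the following (again using the assumed temporary-name convention):

t1 = id3 * 60.0
id1 = id2 + t1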
If the target language is machine code, registers or memory
locations are selected for each of the variables used by the
program.
Then, the intermediate instructions are translated into sequences
of machine instructions that perform the same task.
A crucial aspect of code generation is the judicious assignment
of registers to hold variables.
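A possible target-code rendering of the optimized sequence, assuming two floating-point registers R1 and R2 and illustrative load/store-style mnemonics (these are assumptions, not the notes' own instruction set):

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1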
The symbol table is a data structure containing a
record for each variable name, with fields for the
attributes of the name.
The data structure should be designed to allow the
compiler to find the record for each name quickly and
to store or retrieve data from that record quickly
These attributes may provide information about the
storage allocated for a name, its type, its scope and in
the case of procedure names, such things as the
number and types of its arguments, the method of
passing each argument (for example, by value or by
reference), and the type returned.
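A minimal C sketch of one possible symbol-table record holding some of these attributes; the field names and layout are assumptions chosen for illustration:

struct symbol {
    char *name;           /* lexeme of the identifier */
    char *type;           /* e.g. "int", "float", "proc" */
    int   scope_level;    /* nesting depth at declaration */
    int   offset;         /* storage allocated for the name */
    int   num_args;       /* number of arguments, for procedure names */
    struct symbol *next;  /* chaining, so lookups stay fast */
};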
Phases deal with the logical organization of a compiler; in an implementation, activities from several phases may be grouped into passes.
For example, the front-end phases of lexical analysis,
syntax analysis, semantic analysis, and intermediate
code generation can be grouped together into one
pass.
Code optimization might be an optional pass.
Then there could be a back-end pass consisting of
code generation for a particular target machine.
Parser generators that automatically produce syntax
analyzers from a grammatical description of a
programming language.
Scanner generators that produce lexical analyzers
from a regular-expression description of the tokens of
a language.
Syntax-directed translation engines that produce
collections of routines for walking a parse tree and
generating intermediate code.
Code generators that translate each operation of the
intermediate language into the machine language for
a target machine.
Data flow analysis engines that facilitate the
gathering of information about how values are
transmitted from one part of a program to each other
part. Data-flow analysis is a key part of code
optimization.
Compiler Construction Toolkits that provide an
integrated set of routines for constructing various
phases of a compiler.
The Move to Higher-level Languages
First-generation - machine languages.
Second-generation - assembly languages.
Third-generation - higher-level languages like Fortran,
Cobol, Lisp, C, C++, C#, and Java.
Fourth-generation languages are languages designed for
specific applications like SQL for database queries and
Postscript for text formatting.
Fifth-generation language has been applied to logic and
constraint-based languages like Prolog and OPS5.
An imperative language specifies how a computation is
to be done.
Languages such as C, C++, C#, and Java are imperative
languages.
In imperative languages there is a notion of program
state and statements that change the state.
Functional languages such as ML and Haskell and
constraint logic languages such as Prolog are often
considered to be declarative languages (what
computation is to be done).
Scripting languages are interpreted languages with
high-level operators designed for "gluing together"
computations.
Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are
popular examples of scripting languages.
Compiler writers must develop new algorithms and translations
to support new programming-language features.
New translation algorithms must also take advantage of the
new hardware capabilities.
An input program may contain millions of lines of code.
A compiler must translate correctly the potentially infinite
set of programs that could be written in the source
language.
The problem of generating the optimal target code from a
source program is undecidable.
Modeling in Compiler Design and Implementation
The study of compilers is about how we design the
right mathematical models and choose the right
algorithms.
Some of the fundamental models are finite-state
machines, regular expressions, and context-free
languages.
The term "optimization" in compiler design refers to
the attempts that a compiler makes to produce code
that is more efficient than the obvious code.
Compiler optimizations must meet the following
design objectives:
◦ Optimization must be correct, that is, it must preserve the meaning
of the compiled program.
◦ Optimization must improve the performance of programs.
◦ The compilation time must be kept reasonable, and
◦ The engineering effort required must be manageable.
Optimizations speed up execution time and also
conserve power.
Compilation time should be short to support a
rapid development and debugging cycle.
A compiler is a complex system; we must keep
its design simple to ensure that the
engineering and maintenance costs of the
compiler are manageable.
Implementation of High-Level Programming Languages
Higher-level programming languages are easier to
program in, but are less efficient, that is, the target
programs generated run more slowly.
Programmers using a low-level language have more
control over a computation and can produce more efficient
code.
The register keyword in the C programming language is an
early example of the interaction between compiler
technology and language evolution.
Different Programming Languages support different
levels of abstractions.
Data-flow optimizations have been developed to
analyze the flow of data through a program and
remove redundancies across its constructs.
The key ideas behind object orientation are
◦ Data abstraction.
◦ Inheritance of properties
Both of which have been found to make programs
more modular and easier to maintain.
Compiler optimizations have been developed
to reduce this overhead, for example by eliminating
unnecessary range checks and deallocating unreachable objects.
Effective algorithms have been developed to
minimize the overhead of garbage collection.
All high-performance systems take advantage
of the same two basic techniques: parallelism
and memory hierarchies.
Instruction-level parallelism can also appear
explicitly in the instruction set.
VLIW (Very Long Instruction Word) machines
have instructions that can issue multiple
operations in parallel. The Intel IA64 is a well
known example of such an architecture.
Programmers can write multithreaded code for
multiprocessors or parallel code can be automatically
generated by a compiler.
The compiler hides from the programmer the details of
finding parallelism in a program, distributes the
computation across the machine, and minimizes
synchronization and communication among the processors.
Compilers focus on optimizing the processor
execution by making the memory hierarchy more
effective.
In the early days of computer architecture
design, compilers were developed after the
machines were built.
That has changed.
Since programming is now done mostly in high-level
languages, the performance of a computer
system is also determined by how well compilers
can exploit its features.
Compilers influenced the design of the RISC
(Reduced Instruction Set Computer) architecture.
Prior to RISC, CISC (Complex Instruction Set Computer)
architectures, which make assembly programming easier, were used.
Compiler optimizations often can reduce these
instructions to a small number of simpler operations by
eliminating the redundancies across complex
instructions.
RISC – Power PC, SPARC.
CISC – x86.
They include:
◦ Data-flow machines.
◦ VLIW (Very Long Instruction Word) machines.
◦ SIMD (Single Instruction, Multiple Data) arrays of processors.
◦ Multiprocessors with shared memory.
◦ Multiprocessors with distributed memory.
Compiler technology is needed not only to support
programming for these architectures, but also to
evaluate proposed architectural designs.
Binary Translation - Compiler technology can be
used to translate the binary code for one machine to
that of another.
Binary translation can also be used to provide
backward compatibility.
Hardware Synthesis - hardware designs are mostly
described in high-level hardware description
languages like Verilog and VHDL.
Hardware Synthesis tools translate RTL descriptions
automatically into gates, which are then mapped to
transistors and eventually to a physical layout.
Database Query Interpreters – Database queries consist
of predicates containing relational and boolean
operators. They can be interpreted or compiled into
commands to search a database for records satisfying
that predicate.
Compiled Simulation - Simulations can be very
expensive. Instead of writing a simulator that interprets
the design, it is faster to compile the design to produce
machine code that simulates that particular design natively.
Compiled simulation can run orders of magnitude
faster than an interpreter-based approach.
Type Checking – Type checking is an effective and
well-established technique to catch inconsistencies in
programs. It can be used to catch errors.
Bounds Checking - It is easier to make mistakes when
programming in a lower-level language than a higher
level one.
Memory Management Tools - Garbage collection is
another excellent example. Purify is a widely used tool
that dynamically catches memory management errors
as they occur.
As the first phase of a compiler, the main task of the
lexical analyzer is to read the input characters of the
source program, group them into lexemes, and
produce as output a token for each lexeme in the
source program.
The stream of tokens is sent to the parser for syntax
analysis.
Stripping out comments and whitespace.
Correlating error messages with the source program.
Keeping track of line numbers.
Expanding macros.
Sometimes, lexical analyzers are divided into
a cascade of two processes:
◦ Scanning consists of the simple processes that
do not require tokenization of the input, such as
deletion of comments and compaction of
consecutive whitespace characters into one.
◦ Lexical analysis proper is the more complex
portion, which produces the sequence of
tokens from the output of the scanner.
The separation of lexical and syntactic analysis often
allows us to simplify at least one of these tasks.
Simplicity of design is the most important
consideration.
Compiler efficiency is improved.
- A separate lexical analyzer allows us to apply specialized
techniques that serve only the lexical task, not the job of
parsing.
- In addition, specialized buffering techniques for reading
input characters can speed up the compiler significantly.
Compiler portability is enhanced.
A token is a pair consisting of a token name and an
optional attribute value.
The token name is an abstract symbol representing a
kind of lexical unit.
e.g., a particular keyword, or a sequence of input
characters denoting an identifier.
A pattern is a description of the form that the
lexemes of a token may take.
In the case of a keyword as a token, the pattern
is just the sequence of characters that form the
keyword.
A lexeme is a sequence of characters in the
source program that matches the pattern for a
token and is identified by the lexical analyzer
as an instance of that token.
The token names and associated attribute values for the
Fortran statement E = M * C ** 2 are:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
fi ( a == f(x) ) . . .
A lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared
function identifier.
Since fi is a valid lexeme for the token id, the lexical
analyzer must return the token id to the parser and let
some other phase of the compiler handle the error.
Suppose the lexical analyzer is unable to proceed because
none of the patterns for tokens matches any prefix of the
remaining input.
The simplest recovery strategy is "panic mode" recovery:
delete successive characters from the remaining input until
the lexical analyzer can find a well-formed token at the
beginning of what input is left (e.g., iffff).
Other possible error-recovery actions are:
1. Delete one character from the remaining input (iff).
2. Insert a missing character into the remaining input (pritf).
3. Replace a character by another character (ef).
4. Transpose two adjacent characters.
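A minimal C sketch of panic-mode recovery in the lexer; the helper can_start_token() and its notion of "characters that may begin a token" are assumptions made only for illustration:

#include <ctype.h>
#include <string.h>

/* Assumed approximation: a token may start with an alphanumeric
   character or one of a few operator/punctuation characters. */
static int can_start_token(int c) {
    return isalnum(c) || strchr("<>=+-*/(){};", c) != NULL;
}

/* Panic mode: delete successive characters until a well-formed
   token could begin, then resume scanning from that point. */
const char *panic_mode_recover(const char *p) {
    while (*p != '\0' && !can_start_token((unsigned char)*p))
        p++;
    return p;
}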
A two-buffer scheme that handles large lookaheads
safely is called buffer pairs.
Specialized buffering techniques have been developed
to reduce the amount of overhead required to process a
single input character.
- Each buffer has the same size N, typically the size of a
disk block, e.g. 4096 bytes.
- Using one system read command, N characters can be
read into a buffer.
Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the
current lexeme, whose extent we are attempting to
determine.
2. Pointer forward scans ahead until a pattern match is
found.
Thus, for each character read, we make two tests: one
for the end of the buffer and one to determine what
character is read.
Each buffer can hold a sentinel character at the end.
The sentinel is a special character that cannot be part of
the source program and a natural choice is the character
eof.
switch ( *forward++ ) {
    case eof:
        if ( forward is at end of first buffer ) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer ) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
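For concreteness, a self-contained C sketch of the buffer-pair scheme with sentinels; the single array holding both halves, the use of '\0' as the sentinel, and the helper names are all assumptions made for this illustration:

#include <stdio.h>

#define N 4096                         /* buffer size = disk-block size */

static char  buf[2 * N + 2];           /* two halves, each with a sentinel slot */
static char *lexemeBegin, *forward;
static FILE *src;

/* Fill one half with up to N characters and terminate it with the sentinel. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';                    /* sentinel marks end of valid data */
}

void init_input(FILE *fp) {
    src = fp;
    reload(buf);
    lexemeBegin = forward = buf;
}

/* Return the next character, switching halves when a sentinel is reached. */
int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != '\0')
            return (unsigned char)c;
        if (forward == buf + N + 1) {            /* sentinel ends first half */
            reload(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) { /* sentinel ends second half */
            reload(buf);
            forward = buf;
        } else {
            return EOF;                          /* sentinel inside a half: real end of input */
        }
    }
}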
Regular expressions are an important notation for
specifying lexeme patterns.
While they cannot express all possible patterns, they
are very effective in specifying those types of patterns
that we actually need for tokens.
Strings and Languages
An alphabet is any finite set of symbols such as
letters, digits, and punctuation.
Examples:
◦ The set {0, 1} is the binary alphabet.
◦ ASCII is used in many software systems.
◦ Unicode.
◦ A String over an alphabet is a finite sequence of symbols
drawn from that alphabet
◦ In language theory, the terms "sentence" and "word" are
often used as synonyms for "string."
◦ |s| represents the length of a string s;
e.g., banana is a string of length 6.
◦ The empty string, denoted ε, is the string of length zero.
◦ if x and y are strings, then the concatenation of x and y is
also string, denoted xy, For example, if x = usp and y = cd,
then xy = uspcd.
◦ The empty string is the identity under concatenation; that is,
for any string s, εs = sε = s.
A language is any countable set of strings over some
fixed alphabet
Let L = {A, ..., Z}; then {"A", "B", "C", ...,
"BE", ..., "ABZ", ...} can be considered a language
defined over L.
Abstract languages like ∅, the empty set, or
{ε}, the set containing only the empty string, are
languages under this definition.
Example:
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and
let D be the set of digits {0, 1, ..., 9}.
L and D are, respectively, the alphabets of uppercase and
lowercase letters and of digits.
Other languages can be constructed from L and D, using the
operators illustrated above
1. L U D is the set of letters and digits - strictly speaking the
language with 62 (52+10) strings of length one, each of which
strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one
letter followed by one digit.(10×52).
Ex: A1, a1,B0,etc
3. L⁴ is the set of all four-letter strings (e.g., aaba, bcef).
4. L* is the set of all strings of letters, including ε, the
empty string.
5. L(L U D)* is the set of all strings of letters and digits
beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expressions are an important notation for
describing all the languages that can be built from the
operators applied to the symbols of some alphabet
For example, we were able to describe identifiers by giving
names to sets of letters and digits and using the language
operators union, concatenation and closure.
In this notation, if letter_ stands for any letter or the
underscore and digit stands for any digit, then we can
describe the language of C identifiers by:
letter_ ( letter_ | digit )*
Regular expressions are built recursively out of smaller regular
expressions, using the rules described below.
Each regular expression r denotes a language L(r), which is also
defined recursively from the languages denoted by r's
subexpressions.
Here are the rules that define the regular expressions
over an alphabet ∑ and the languages that those
expressions denote.
Basis: there are two rules that form the basis:
ε is a regular expression, and L(ε) is {ε}, that is,
the language containing only the empty string.
If a is a symbol in ∑, then a is a regular expression,
and L(a) = {a} is the language with only one
string, consisting of the single symbol a.
Induction:
- Larger regular expressions are built from smaller ones.
Let r and s be regular expressions denoting languages L(r)
and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of
parentheses around expressions without changing the
language they denote.
For example, we may replace the regular expression
(a) | ((b)*(c)) by a | b*c.
Example: Let ∑ = {a, b}
The regular expression a|b denotes the language {a, b}.
(a|b) (a|b) denotes {aa, ab, ba, bb} the language of all strings of
length two over the alphabet ∑.
a* denotes the language consisting of all strings of zero or more
a's, that is, {Є, a, aa, aaa, ... }
(a|b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {Є, a, b, aa,
ab, ba, bb, aaa, ... }.
a|a*b denotes the language {a, b, ab, aab, aaab, ... }, that is, the
string a and all strings consisting of zero or more a's and ending
in b.
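These denotations can be checked mechanically; the short C sketch below uses the POSIX regex API to test a few strings against a|a*b (anchored) and is an illustration, not part of the notes:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    const char *tests[] = { "a", "b", "aab", "ba", "aa" };

    /* ^(a|a*b)$ : the string a, or zero or more a's followed by b */
    if (regcomp(&re, "^(a|a*b)$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        printf("%-4s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "matches" : "no match");

    regfree(&re);
    return 0;   /* a, b, aab match; ba, aa do not */
}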
A language that can be defined by a regular
expression is called a regular set.
If two regular expressions r and s denote the same regular set,
we say they are equivalent and write r = s.
For instance, (a|b) = (b|a).
If ∑ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
Each di is a new symbol, not in ∑ and not the same as
any other of the d's.
Each ri is a regular expression over the alphabet
∑ ∪ {d1, d2, ..., di-1}.
Example strings such as 5280, 0.01234, 6.336E4,
or 1.89E-4.
1) One or more instances:
the unary operator + denotes the positive closure of a
regular expression and its language.
2) Zero or one instance:
the unary operator ?
ex: r? = r | ε
3) Character classes:
the regular expression a1|a2|…|an can be replaced by the shorthand
[a1a2…an]
Moreover, when a1, a2, …, an form a logical sequence,
we can replace them by [a1-an].
Regular definition for C identifiers:
letter_ -> [A-Z a-z _]
digit -> [0-9]
Id -> letter_ ( letter_ | digit )*
Regular definition for unsigned numbers:
digit -> [0-9]
digits -> digit+
number -> digits ( . digits)? (E[+-]? digits )?
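A minimal C sketch that checks a string against the C-identifier definition letter_ ( letter_ | digit )* given above; the function names are assumptions:

#include <ctype.h>
#include <stdio.h>

/* letter_ -> [A-Za-z_]      digit -> [0-9]
   id      -> letter_ ( letter_ | digit )*                */
static int is_letter_(int c) { return isalpha(c) || c == '_'; }

int matches_id(const char *s) {
    if (!is_letter_((unsigned char)*s))     /* must begin with letter_ */
        return 0;
    for (s++; *s; s++)
        if (!is_letter_((unsigned char)*s) && !isdigit((unsigned char)*s))
            return 0;
    return 1;
}

int main(void) {
    printf("%d %d %d\n", matches_id("rate"), matches_id("_tmp1"), matches_id("60ft"));
    return 0;   /* prints: 1 1 0 */
}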
Given the grammar of branching statements:
The terminals of the grammar, which are if, then,
else, relop, id, and number, are the names of tokens
as used by the lexical analyzer.
The lexical analyzer also has the job of stripping out
whitespace, by recognizing the "token" ws defined
by:
Token ws is not returned to the parser; instead, the lexical analyzer
restarts scanning from the character that follows the whitespace.
The goal of the lexical analyzer is to decide, for each lexeme
or family of lexemes, which token name is
returned to the parser and what attribute
value is returned, as summarized in the accompanying
table of tokens and attribute values.
To recognize tokens there are 2 steps
1. Design of Transition Diagram
2. Implementation of Transition Diagram
A transition diagram is similar to a flowchart for (a part of)
the lexer.
We draw one for each possible token. It shows the decisions
that must be made based on the input seen.
Transition diagrams have a collection of nodes or circles,
called States.
Each state represents a condition that could occur during
the process of scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the transition diagram
to another; edges are labeled by input symbols.
Ex: RELOP = < | <= | = | <> | > | >=
[Transition diagram for relop: from start state 0, '<' leads to state 1;
in state 1, '=' accepts in state 2 and returns (relop, LE), '>' accepts in
state 3 and returns (relop, NE), and any other character accepts in state 4
with input retraction (*) and returns (relop, LT). From state 0, '=' accepts
in state 5 and returns (relop, EQ); '>' leads to state 6, where '=' accepts
in state 7 and returns (relop, GE), and any other character accepts in
state 8 with retraction (*) and returns (relop, GT).
* indicates input retraction.]
Ex2: ID = letter ( letter | digit )*
[Transition diagram: from start state 9, a letter leads to state 10;
state 10 loops on letter or digit; any other character (*, input retraction)
leads to accepting state 11, which returns (getToken(), installID()).
* indicates input retraction.]
Two questions remain.
1. How do we distinguish between identifiers and
keywords such as if, then and else, which also match
the pattern in the transition diagram?
2. What is (getToken(), installID())?
1) Install the reserved words in the symbol table initially.
installID() checks whether the lexeme is already in the symbol
table; if it is not present, the lexeme is installed (placed in
the symbol table) as an id token. In either case a pointer
to the entry is returned.
getToken() examines the symbol-table entry for the lexeme
found and returns the token name.
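A rough C sketch of this first approach, with reserved words pre-installed in the symbol table so that getToken() distinguishes keywords from ordinary identifiers; the table layout and fixed capacity are assumptions:

#include <string.h>

enum token_name { IF, THEN, ELSE, ID };

struct st_entry { char *lexeme; enum token_name token; };

static struct st_entry symtab[1024];
static int st_count = 0;

/* Pre-install the reserved words before scanning starts. */
void init_reserved(void) {
    static const struct { const char *w; enum token_name t; } kw[] =
        { { "if", IF }, { "then", THEN }, { "else", ELSE } };
    for (unsigned i = 0; i < sizeof kw / sizeof kw[0]; i++) {
        symtab[st_count].lexeme = strdup(kw[i].w);
        symtab[st_count].token  = kw[i].t;
        st_count++;
    }
}

/* Return the index of the lexeme, installing it as an id if absent. */
int installID(const char *lexeme) {
    for (int i = 0; i < st_count; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return i;
    symtab[st_count].lexeme = strdup(lexeme);
    symtab[st_count].token  = ID;
    return st_count++;
}

/* Return the token name recorded for the symbol-table entry. */
enum token_name getToken(int index) {
    return symtab[index].token;
}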
2) Create a separate transition diagram for each keyword,
giving multiple accepting states.
[Transition diagram for unsigned numbers with multiple accepting states:
one accepting an integer (e.g. 12), one accepting a float (e.g. 12.31),
and one accepting a float with an exponent (e.g. 12.31E4).]
There are several ways that a collection of
transition diagrams can be used to build a lexical analyzer.
Each state is represented by a piece of code.
The implementation of the relop transition diagram is
shown below.
TOKEN getRelop() {
    TOKEN retToken = new(RELOP);
    while (1) {                 /* repeat until a return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c is white space) state = 0;   /* stay in state 0 */
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();        /* lexeme is not a relop */
            break;
        case 1: ...
        case 8:
            retract(1);
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
1. Try the transition diagram for each token sequentially;
the function fail() resets the pointer forward and starts
the next transition diagram each time it is called.
2. Run the various transition diagrams "in parallel," feeding the next
input character to all of them and allowing each one to
make whatever transitions it requires.
3. The preferred approach is to combine all the transition diagrams into
one.
Parameters: actual parameters (in the call) and formal parameters (in the declaration).
Parameter-passing mechanisms:
1. Call-by-value
2. Call-by-reference
3. Call-by-name: actual parameters are substituted
literally for the formal parameters (as if by macro
expansion) in the code of the callee.
Two formal parameters can refer to the same
location; such variables are called aliases of one
another.
Ex: a is an array of procedure p.
p calls another procedure q(x, y) with the call q(a, a).
Now x and y have become aliases of each other:
x[10] = 2 also makes y[10] = 2.
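C has no call-by-name, but the aliasing effect described above can be reproduced with call-by-reference (arrays decay to pointers); a small sketch, with all names chosen for illustration:

#include <stdio.h>

/* x and y become aliases when q is called as q(a, a). */
void q(int x[], int y[]) {
    x[10] = 2;
    printf("y[10] = %d\n", y[10]);   /* prints 2: same storage as x[10] */
}

int main(void) {
    int a[20] = { 0 };
    q(a, a);
    return 0;
}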
Programming languages have precise rules
that give the syntactic structure of programs.
Ex: in C, a program is made up of functions.
The syntax can be specified by a context-free grammar (CFG).
A grammar benefits both language designers
and compiler writers:
Gives a precise syntactic specification of a
programming language.
Can construct a parser that determines the
syntactic structure.
Useful for translating source programs into
correct object code and for detecting errors.
A grammar allows a language to evolve or be
developed iteratively.
Verifies that the tokens can be generated by the
grammar.
Reports syntax errors.
Recovers from commonly occurring errors.
Constructs a parse tree and passes it to the rest of
the compiler.
We categorize parsers into three groups:
1. Universal parsers:
can parse any grammar but are too inefficient to
use in production compilers.
2. Top-down parsers:
the parse tree is created top to bottom,
starting from the root.
3. Bottom-up parsers:
the parse tree is created bottom to top, starting
from the leaves.
Both top-down and bottom-up parsers
scan the input from left to right
(one symbol at a time).
Efficient top-down and bottom-up parsers
can be implemented only for sub-classes of
context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Common Programming errors can occur at
many different levels.
1. Lexical errors: include misspelling of
identifiers, keywords, or operators.
2. Syntactic errors : include misplaced
semicolons or extra or missing braces.
3. Semantic errors: include type mismatches
between operators and operands.
4. Logical errors: incorrect reasoning, such as the
use of = instead of ==.
Report the presence of errors clearly
and accurately
Recover from each error quickly
enough to detect subsequent errors.
Add minimal overhead to the
processing of correct programs
Panic-Mode Recovery
Phrase-Level Recovery
Error Productions
Global Correction
On discovering an error, the parser discards
input symbols one at a time until one of a
designated set of Synchronizing tokens is
found.
Synchronizing tokens are usually delimiters.
Ex: semicolon or } whose role in the source
program is clear and unambiguous.
It often skips a considerable amount of input
without checking it for additional errors.
Advantages:
Simplicity
Is guaranteed not to go into an infinite loop
A parser may perform local correction on the
remaining input, i.e.,
it may replace a prefix of the remaining input by
some string that allows the parser to continue.
Ex: replace a comma by a semicolon, insert a
missing semicolon
The choice of local correction is left to the compiler
designer.
It is used in several error-repairing compilers,
as it can correct any input string.
We can augment the grammar for the language
at hand with productions that generate the
erroneous constructs.
Then we can use the grammar augmented
by these error productions to Construct a
parser.
If an error production is used by the parser, we
can generate appropriate error diagnostics to
indicate the erroneous construct that has been
recognized in the input.
We use algorithms that perform a minimal
sequence of changes to obtain a globally least-cost
correction.
Given an incorrect input string x and grammar G,
these algorithms will find a parse tree for a
related string y.
Such that the number of insertions, deletions and
changes of tokens required to transform x into y
is as small as possible.
It is too costly to implement in terms of time and
space, so these techniques are only of theoretical
interest.
It is implemented by filling in the blank
entries in the predictive parsing table with
pointers to error routines.
These routines may change, insert, or delete
symbols in the input and issue appropriate
error messages.