Module -I
Introduction to Compiling:
1.1 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Fig 1.1: Language Processing System
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: these preprocessors augment older languages with more modern flow-of-
control and data structuring facilities.
4. Language Extensions: These preprocessors attempt to add capabilities to the language by providing what amounts to built-in macros.
COMPILER
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler is reporting errors to the programmer.
Fig 1.2: Structure of Compiler
Executing a program written in an HLL programming language is basically a two-part process: the source program must first be compiled, i.e., translated into an object program; then the resulting object program is loaded into memory and executed.
Fig 1.3: Execution process of source program in Compiler
ASSEMBLER
Programmers found it difficult to write or read programs in machine language, so they began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler is called the source program; the output is its machine language translation (object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.
Fig1.4: Execution in Interpreter
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also uses an interpreter. The process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modification of the user program can easily be made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes the program machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.
LOADER AND LINK-EDITOR:
Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate the program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into object form which the loader could "relocate" directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine language (ML) program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
1.3 LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
1.4 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two parts to compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
Fig 1.5: Phases of Compiler
Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this phase expressions,
statements, declarations etc… are identified by using the results of lexical analysis. Syntax analysis is
aided by using techniques based on formal grammar of the programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of the machine language program are carried out during this phase. The output of the code generator is
the machine language program of the specified computer.
Table Management (or) Book-keeping:- This is the portion that keeps track of the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.
Error Handlers:-
An error handler is invoked when a flaw (error) in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures such as expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification of the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error, because the presence of these two adjacent binary operators violates the formation rules of an expression. Syntax analysis also makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped.
For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. The output of the syntax analyzer is some representation of a parse tree; the intermediate code generation phase transforms this parse tree into an intermediate-language representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For example,
If A > B goto L2
Goto L3
L2 :
can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions
A := B + C + D
E := B + C + F
Might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
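As a rough illustration (a sketch only; the function and variable names are hypothetical, and a real compiler applies this transformation to intermediate code rather than to source code), such a loop optimization looks as follows in C:
/* Sketch of loop-invariant code motion (hypothetical example). */
void scale_and_add(int n, float a[], const float b[], float x, float factor)
{
    /* Before the optimization, x * factor would be recomputed on every
       iteration even though its value never changes inside the loop:
           for (int i = 0; i < n; i++) a[i] = b[i] + x * factor;        */

    /* After the optimization, the invariant computation is moved to a
       point in the program just before the loop is entered:            */
    float t = x * factor;
    for (int i = 0; i < n; i++)
        a[i] = b[i] + t;
}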
Code generator :-
Code Generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear in the source program.
The information about data objects is collected by the early phases of the compiler-lexical and
syntactic analyzers. The data structure used to record this information is called the symbol table.
Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
Phases of a compiler
If we examine the compilation process in more detail, it operates as a sequence
of phases, each of which transforms one representation of the source program to
another.
A typical decomposition of a compiler into phases.
In practice, several phases may be grouped together, and the intermediate
representations between the grouped phases need not be constructed explicitly. The
symbol table, which stores information about the entire source program, is used by all
phases of the compiler.
1. Lexical Analysis:
Lexical analysis or scanning forms the first phase of a compiler. The lexical analyzer reads the stream of characters that makes up the source program and groups them into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces a token as output, in the format shown below:
<token-name, attribute-value>
These tokens pass on to the subsequent phase known as syntax analysis. The token
elements are listed below:
o Token-name: This is an abstract symbol used during syntax analysis.
o Attribute-value: This points to an entry in the symbol table for the corresponding token.
Information from the symbol-table entry is needed for semantic analysis and code generation.
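As an illustration only (not part of the original notes), a token of the form <token-name, attribute-value> might be represented in C roughly as follows; the type and field names are assumptions:
/* A minimal sketch of a token record; the names are illustrative only. */
enum token_name { TOK_ID, TOK_NUMBER, TOK_ASSIGN, TOK_PLUS, TOK_STAR };

struct token {
    enum token_name name;   /* abstract symbol used during syntax analysis   */
    int attr;               /* e.g., index of the symbol-table entry for an
                               identifier, or the value of a constant        */
};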
For example, let us try to analyze a simple arithmetic expression in a lexical context:
position = initial + rate * 60
The individual units in the above expression can be grouped into lexemes.
a. position is a lexeme that would be mapped into a token <id, 1>, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position. The symbol-table entry for an identifier holds information about the
identifier, such as its name and type.
b. The assignment symbol = is a lexeme that is mapped into the token <=>. Since this
token needs no attribute-value, we have omitted the second component. We could
have used any abstract symbol such as assign for the token-name, but for notational
convenience we have chosen to use the lexeme itself as the name of the abstract
symbol.
c. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the
symbol-table entry for initial.
d. + is a lexeme that is mapped into the token <+>.
e. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the
symbol-table entry for rate.
f. * is a lexeme that is mapped into the token <*>.
g. 60 is a lexeme that is mapped into the token <60>. Blanks would be discarded by the
lexical analyzer.
The expression is seen by the lexical analyzer as
<id,1> <=> <id,2> <+> <id,3> <*> <60>
The token names =, +, and * are abstract symbols for the assignment, addition, and
multiplication operators, respectively.
2. Syntax Analysis:
Syntax analysis forms the second phase of the compiler.
The list of tokens produced by the lexical analysis phase forms the input, and this phase arranges the tokens into a tree structure called the syntax tree. This tree reflects the structure of the program. This phase is also called parsing.
In the syntax tree, each interior node represents an operation and the children of that node represent its arguments. A syntax tree for the token stream of the above example is shown below.
Operators appear as the interior (root) nodes of this syntax tree. In the above case, = has a left and a right child. The left child is <id, 1>, while the right-hand side is parsed further and its immediate operator becomes the right child.
Figure 1.4 Syntax Tree
The tree has an interior node labeled * with <id, 3> as its left child and the integer
60 as its right child. The node <id, 3> represents the identifier rate.
The node labeled * makes it explicit that we must first multiply the value of rate by
60. The node labeled + indicates that we must add the result of this multiplication to
the value of initial.
The root of the tree, labeled =, indicates that we must store the result of this addition
into the location for the identifier position.
This ordering of operations is consistent with the usual conventions of arithmetic
which tell us that multiplication has higher precedence than addition, and hence that
the multiplication is to be performed before the addition.
3. Semantic analysis:
Semantic analysis forms the third phase of the compiler. This phase uses the syntax
tree and the information in the symbol table to check the source program for
consistency with the language definition. This phase also collects type information and
saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation.
Type checking forms an important part of semantic analysis. Here the compiler
checks whether each operator has matching operands. For example, many programming
language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.
Coercions (implicit type conversions) may be permitted by the language specification.
If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number.
Coercion exists in the example quoted
position = initial + rate * 60
Suppose position, initial, and rate are declared as float. Since <id, 3> is a floating-point value, 60 must also be converted to floating point. The semantic tree (Fig 1.5) is the syntax tree modified to include the conversion/semantic aspects: in the example, 60 is converted to float by an inttofloat node.
Figure 1.5 Semantic Tree
4. Intermediate Code Generation:
Intermediate code generation forms the fourth phase of the compiler. After syntax and semantic analysis of the source program, many compilers generate a low-level or machine-like intermediate representation, which can be thought of as a program for an abstract machine.
This intermediate representation must have two important properties:
(a) It should be easy to produce
(b) It should be easy to translate into the target machine.
The above example is converted into the three-address code sequence shown below.
There are several points worth noting about three-address instructions.
1. First each three-address assignment instruction has at most one operator on the
right side. Thus, these instructions fix the order in which operations are to be done;
the multiplication precedes the addition in the source program (1.1).
2. Second, the compiler must generate a temporary name to hold the value computed
by a three-address instruction.
3. Third, some "three-address instructions" like the first and last in the sequence have
fewer than three operands.
t1 = inttofloat (60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
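One possible in-memory form for such instructions is a quadruple record holding one operator and up to three addresses. The following is only a sketch with assumed field names, not the representation used by any particular compiler:
/* Sketch of a quadruple: one operator and at most three addresses. */
struct quad {
    const char *op;       /* e.g., "inttofloat", "*", "+", "="       */
    const char *arg1;     /* first operand, e.g., "60", "id3", "t2"  */
    const char *arg2;     /* second operand, or "" when unused       */
    const char *result;   /* destination, e.g., "t1", "id1"          */
};

/* The three-address sequence above, written as quadruples. */
static const struct quad code[] = {
    { "inttofloat", "60",  "",   "t1"  },
    { "*",          "id3", "t1", "t2"  },
    { "+",          "id2", "t2", "t3"  },
    { "=",          "t3",  "",   "id1" },
};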
5. Code Optimization:
Code Optimization forms the fifth phase in the compiler design.
This is a machine-independent phase which attempts to improve the intermediate
code for generating better (faster) target code.
For example, a straightforward algorithm generates the intermediate code (1.3),
using an instruction for each operator in the tree representation that comes from the
semantic analyzer.
The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 with the floating-point number 60.0.
Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform the above three-address code sequence into the following shorter sequence.
t1= id3 * 60.0
id1 = id2 + t1
6. Code Generator:
Code Generator forms the sixth phase in the compiler design. This phase takes the intermediate representation of the source program as input and maps it to the target language.
The intermediate instructions are translated into sequences of machine instructions
that perform the same task. A critical aspect of code generation is the assignment of
registers to hold variables.
Using registers R1 and R2, the intermediate code is converted into the following machine code.
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers. The code loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0.
The # signifies that 60.0 is to be treated as an immediate constant. The third
instruction moves id2 into register R1 and the fourth adds to it the value
previously computed in register R2. Finally, the value in register R1 is stored into the
address of id1, so the code correctly implements the assignment statement.
Symbol-Table Management
An essential function of a compiler is to record the variable names used in the
source program and collect information about various attributes of each name.
These attributes may provide information about the storage allocated for a name,
its type, its scope (where in the program its value may be used), and in the case of
procedure names, such things as the number and types of its arguments, the
method of passing each argument (for example, by value or by reference), and the
type returned.
The symbol table is a data structure containing a record for each variable name,
with fields for the attributes of the name.
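A minimal sketch of such a record in C is shown below; the field names and sizes are assumptions, and practical compilers use richer structures (for example, hash tables and scope chains) for fast lookup:
/* Sketch of a symbol-table entry and a fixed-size table; names are
   illustrative only.                                                 */
struct sym_entry {
    char name[64];       /* the identifier's lexeme                   */
    char type[16];       /* its type, e.g., "int" or "float"          */
    int  scope;          /* where in the program the name may be used */
    int  offset;         /* storage allocated for the name            */
};

struct sym_table {
    struct sym_entry entries[1024];
    int count;           /* number of names recorded so far           */
};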
The Grouping of Phases into Passes
Activities from several phases may be grouped together into a pass that reads an
input file and writes an output file.
For example, the front-end phases of lexical analysis, syntax analysis, semantic
analysis, and intermediate code generation might be grouped together into one
pass.
Code optimization might be an optional pass.
A back-end pass could then consist of code generation for a particular target machine.
Some compiler collections have been created around carefully designed
intermediate representations that allow the front end for a particular language to
interface with the back end for a certain target machine.
With these collections, we can produce compilers for different source languages for
one target machine by combining different front ends with the back end for that
target machine.
Similarly, we can produce compilers for different target machines, by combining a
front end with back ends for different target machines.
Compiler-Construction Tools
Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression
description of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for
walking a parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of
rules for translating each operation of the intermediate language into the machine
language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how
values are transmitted from one part of a program to each other part. Data-flow
analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for
constructing various phases of a compiler
The Role of Lexical Analyzer:
As the first phase of a compiler, the main task of the lexical analyzer is to read
the input characters of the source program, group them into lexemes, and produce as
output a sequence of tokens for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis. It is common for the
lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs
to enter that lexeme into the symbol table. In some cases, information regarding the
kind of identifier may be read from the symbol table by the lexical analyzer to assist it
in determining the proper token it must pass to the parser.
The interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read
characters from its input until it can identify the next lexeme and produce for it the
next token, which it returns to the parser. Since the lexical analyzer is the part of the
compiler that reads the source text, it may perform certain other tasks besides
identification of lexemes.
One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input). Another task is correlating
error messages generated by the compiler with the source program. The lexical
analyzer may keep track of the number of newline characters seen, so it can associate a
line number with each error message.
Sometimes, lexical analyzers are divided into a cascade of two processes:
a. Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
b. Lexical analysis proper is the more complex portion, where the scanner produces
the sequence of tokens as output.
Lexical Analysis versus Parsing
There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.
Tokens, Patterns, and Lexemes
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier.
A pattern is a description of the form that the lexemes of a token may take. In
the case of a keyword as a token, the pattern is just the sequence of characters that form
the keyword. For identifiers and some other tokens, the pattern is a more complex
structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the
pattern for a token.
Figure 1.2 Examples of tokens
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source
program.
Tokens have at most one associated attribute, although this attribute may have a
structure that combines several pieces of information. Information about an
identifier - e.g., its lexeme, its type, and the location at which it is first found (in case
an error message about that identifier must be issued) - is kept in the symbol table.
Example: The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
< assign-op >
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>
<number, integer value 2 >
Note that in certain pairs, especially operators, punctuation, and keywords, there is
no need for an attribute value.
In this example, the token number has been given an integer-valued attribute.
In practice, a typical compiler would instead store a character string representing
the constant and use as an attribute value for number a pointer to that string.
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, on seeing the input
fi(a == f(x)) . . .
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let a later phase handle the error. However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery. Delete successive
characters from the remaining input, until the lexical analyzer can find a well-
formed token at the beginning of what input is left.
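A minimal sketch of panic-mode recovery is given below; peek_char(), advance(), and starts_token() are assumed helper routines, not functions from these notes:
#include <stdio.h>               /* for EOF */

int  peek_char(void);            /* assumed: look at the next input character  */
void advance(void);              /* assumed: consume one input character       */
int  starts_token(int c);        /* assumed: can c begin a well-formed token?  */

/* Delete successive characters from the remaining input until a character
   that can start a well-formed token is found (or the input is exhausted). */
void panic_mode_recover(void)
{
    int c = peek_char();
    while (c != EOF && !starts_token(c)) {
        advance();
        c = peek_char();
    }
}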
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Before discussing the problem of recognizing lexemes in the input, let us examine some ways in which the simple but important task of reading the source program can be speeded up. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters and the large number
of characters that must be processed during the compilation of a large source program,
specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character.
An important scheme involves two buffers that are alternately reloaded, as shown in the figure below.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file and is different from any possible
character of the source program.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Sentinels
For each character read, we make two tests:
- one for the end of the buffer, and
- one to determine what character was read.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
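The following sketch shows one way the sentinel scheme could be coded in C; the buffer layout, the choice of '\0' as the eof sentinel, and the fill_buffer() helper are assumptions made for illustration:
#include <stdio.h>

#define N    4096            /* size of each half of the buffer pair        */
#define EOFC '\0'            /* assumed sentinel character marking eof      */

static char buffer[2 * (N + 1)];   /* two buffers, each ending in a sentinel */
static char *forward = buffer;     /* scans ahead until a pattern match      */
static FILE *src;                  /* source file; after src is opened,
                                      fill_buffer(buffer) is called once
                                      before scanning starts                 */

/* Assumed helper: read up to N characters into the half starting at p and
   place the sentinel immediately after the last character read.             */
static void fill_buffer(char *p)
{
    size_t n = fread(p, 1, N, src);
    p[n] = EOFC;
}

/* Return the next character.  The buffer-end test is combined with the
   character test: only when the sentinel is seen do we check whether we are
   at the end of a buffer (reload the other one) or at the real end of input. */
static int next_char(void)
{
    char c = *forward++;
    if (c != EOFC)
        return c;                                  /* the common, one-test case */
    if (forward == buffer + N + 1) {               /* end of first buffer       */
        fill_buffer(buffer + N + 1);
        forward = buffer + N + 1;
    } else if (forward == buffer + 2 * (N + 1)) {  /* end of second buffer      */
        fill_buffer(buffer);
        forward = buffer;
    } else {
        return EOF;                                /* eof inside a buffer       */
    }
    return next_char();
}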
o Lexical analysis enables more efficient execution in modern compilers. Techniques
like input buffering (e.g., double-buffering) improve performance by reducing the
number of input operations needed.
4. Modularity:
o Lexical analysis modularizes the compilation process by separating the task of
recognizing tokens from more complex tasks like syntax checking, making the
compiler easier to design and maintain.
By performing these tasks efficiently, the lexical analyzer ensures that the compiler can process the source
code correctly and efficiently, setting the stage for subsequent phases of compilation.
1.3.2 INPUT BUFFERING
Input buffering is a technique used by the lexical analyzer (scanner) in a compiler to efficiently read
and process characters from the source code during lexical analysis. The main goal of input buffering is
to minimize the overhead of repeated input operations and improve the performance of the lexical
analyzer.
Since the lexical analyzer processes the source code character by character, if it reads directly from the
input source (e.g., a file or a stream) every time a character is needed, it could result in a lot of unnecessary
I/O operations. To address this, buffering is used to hold a block of input characters in memory, allowing
the lexical analyzer to process them efficiently in bulk.
Key Concepts of Input Buffering
1. Buffering Basics:
o Buffer: A contiguous block of memory used to hold a portion of the input source
code. The lexical analyzer reads from the buffer instead of directly from the source
file.
o By using a buffer, the program can access characters from memory, which is faster
than accessing characters from disk or other input devices.
2. The Role of Input Buffering:
o The lexical analyzer needs to read the source code in chunks and process tokens.
Instead of constantly reading from the source (which can be slow), the input is read
into a buffer. The buffer acts as an intermediary between the input source and the
lexical analyzer.
o The buffer allows the lexical analyzer to look ahead and consume characters as
needed without having to frequently perform input operations.
3. Challenges:
o The main challenge is that the lexical analyzer often needs to look at multiple
characters ahead in the input (to determine where a token starts and ends).
o It also needs to handle situations where a token might span multiple reads from the
buffer or the source code.
Types of Input Buffering
There are two common types of input buffering techniques used in lexical analysis:
1. Single Buffering:
o In single buffering, only one buffer is used to hold the input characters.
o The lexical analyzer reads from this buffer character by character.
o Once the buffer is exhausted, it needs to be refilled from the source code, which may
cause some delays as the analyzer waits for new input.
Drawback: Single buffering requires the analyzer to frequently access the input source (e.g., disk or
terminal), which can be inefficient.
2. Double Buffering:
o In double buffering, two buffers are used to store input characters.
o While the lexical analyzer is reading from one buffer, the other buffer is
simultaneously being filled with the next block of characters from the source code.
o This method minimizes the idle time spent waiting for new input because one buffer
is always ready to be processed while the other is being populated with data from the
source code.
Advantage: Double buffering significantly improves performance by reducing the frequency of input
operations and ensuring that characters are always available for processing.
How Double Buffering Works
In double buffering, the input is divided into two buffers (usually named Buffer 1 and Buffer 2):
1. Initial Setup: The lexical analyzer starts by filling Buffer 1 with a chunk of characters from
the input source.
2. Processing: The lexical analyzer processes the characters in Buffer 1. While it is reading
and processing Buffer 1, Buffer 2 is simultaneously filled with the next chunk of input
characters from the source.
3. Buffer Swap: Once Buffer 1 is fully processed (or when more input is needed), the lexical
analyzer switches to Buffer 2 and starts reading from it. The roles of the buffers are swapped,
with Buffer 1 being filled with the next chunk of input while the lexical analyzer processes
Buffer 2.
4. Loop: This process continues, with the buffers alternating between being read and being
filled, allowing the lexical analyzer to process characters efficiently without waiting for
input operations.
Example of Double Buffering:
Let’s say the input source is the string "int sum = 10 + 5;". In double buffering:
Buffer 1 could initially contain: "int sum = ".
Buffer 2 could be filled simultaneously with the next part of the input: "10 + 5;".
The lexical analyzer reads the characters from Buffer 1 while Buffer 2 is being filled. Once Buffer 1 is
exhausted, the analyzer switches to Buffer 2, and Buffer 1 is filled with the next chunk of data.
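A rough sketch of the alternating-buffer loop is given below; the buffer names, the fill() and scan() helpers, and the sequential ordering of the two steps (a real implementation might overlap them with asynchronous I/O) are all assumptions:
#include <stdio.h>
#include <stddef.h>

#define BUFSIZE 4096

static char buf[2][BUFSIZE + 1];          /* Buffer 1 and Buffer 2            */

size_t fill(FILE *src, char *b);          /* assumed: load one buffer, return
                                             the number of characters read    */
void scan(const char *b, size_t n);       /* assumed: tokenize n characters   */

void run_scanner(FILE *src)
{
    int cur = 0;
    size_t n = fill(src, buf[0]);         /* initial setup: load Buffer 1     */
    while (n > 0) {
        size_t next = fill(src, buf[1 - cur]);  /* fill the other buffer      */
        scan(buf[cur], n);                      /* process the current one    */
        cur = 1 - cur;                          /* buffer swap                */
        n = next;
    }
}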
Advantages of Input Buffering
1. Improved Efficiency:
o Input buffering reduces the number of times the lexical analyzer has to perform an
I/O operation, which makes the process of lexical analysis much faster, especially
for large programs.
o Double buffering is particularly effective because it ensures that the lexical analyzer
is never waiting for input; there's always a buffer ready for processing.
2. Reduced Latency:
o By having a second buffer ready, double buffering minimizes delays that would occur
if the analyzer had to wait for new characters to be read from the source.
3. Optimized Token Recognition:
o Input buffering ensures that the lexical analyzer can look ahead at characters to
correctly identify tokens, which is especially important for languages with complex
tokens (e.g., where a token might start in one buffer and end in another).
Example: How Buffering Helps Lexical Analysis
Consider the following input:
int a = 5 + 3;
Without input buffering, the lexical analyzer would read each character individually, possibly making
many trips to the source code and spending significant time in each one.
With double buffering, the lexical analyzer reads two blocks at a time (say, "int a =" in one buffer and "5
+ 3;" in another). This improves efficiency by reducing I/O overhead and enabling the lexer to quickly
process the input and tokenize it.
Buffering in Action in Lexical Analyzers
1. Efficient Reading: In a typical lexical analyzer, input buffering ensures that after the initial
load of characters, the analyzer spends most of its time processing tokens, not waiting for
more data.
2. Look-Ahead: Input buffering allows the lexer to efficiently perform look-ahead in
tokenization. For example, in the case of a multi-character token, such as an identifier or
keyword, the lexical analyzer can read ahead and determine where the token ends.
Input buffering is a crucial optimization in lexical analysis, enabling the lexical analyzer to efficiently
process input by reducing the number of direct I/O operations. Double buffering is particularly effective,
ensuring continuous input availability while the lexer is processing the current buffer. These techniques
improve the performance of the compiler, especially when analyzing large programs, by minimizing
delays caused by waiting for input and enabling faster token recognition.
1.3.3 SPECIFICATION OF TOKENS
In compiler design, the specification of tokens refers to the process of defining the patterns or rules that
identify the basic building blocks of a programming language. Tokens are the smallest units of meaning
in a programming language and serve as the input for the syntax analyzer (parser). These tokens can
include keywords, operators, identifiers, literals, and punctuation marks.
The process of token specification is typically done using regular expressions or context-free
grammars, which describe the pattern or structure of valid tokens.
1. What are Tokens?
A token is a sequence of characters in the source code that matches a pattern defined by the programming
language's syntax rules. Tokens are categorized into classes, each representing a specific type of construct
in the language.
Common types of tokens include:
Keywords: Reserved words in the programming language that have a predefined meaning,
such as if, while, return, int, for, etc.
Identifiers: Names used for variables, functions, classes, etc. An identifier typically starts
with a letter or underscore and may be followed by letters, digits, or underscores.
Literals: Constants or fixed values in the program. These can be:
o Integer literals (e.g., 10, 0, -5).
o Floating-point literals (e.g., 3.14, 0.1, -0.56).
o String literals (e.g., "hello", "world").
o Character literals (e.g., 'a', '%').
Operators: Symbols that represent operations, such as +, -, *, /, ==, etc.
Delimiters: Punctuation marks used for separating different components in the source code,
such as ;, ,, (), {}, [], etc.
2. Specifying Tokens Using Regular Expressions
In lexical analysis, regular expressions (regex) are commonly used to define the patterns that tokens must
match. A regular expression describes a set of strings that belong to a particular class of tokens.
For example:
Identifiers: In many languages, an identifier consists of a letter (or underscore) followed by any number
of letters, digits, or underscores. The regular expression for an identifier might be:
[a-zA-Z_][a-zA-Z0-9_]*
This regex ensures that the token begins with a letter or underscore, followed by any number of alphanumeric characters or underscores (a hand-coded check of this pattern is sketched after these examples).
Integer Literals: An integer literal consists of one or more digits. The regex for an integer might be:
[0-9]+
This pattern matches one or more digits.
Keywords: Keywords are typically specified directly, as they are reserved words with a fixed set. For
example:
if | else | return | int | for
Operators: The regex for operators such as +, -, *, /, ==, etc., can be:
\+ | - | \* | / | == | != | < | >
String Literals: String literals are typically enclosed in double quotes and may contain escaped characters
like \n, \t, etc. The regex might look like:
\"(\\.|[^\\"])*\"
This regex matches a string literal, handling escape sequences within the string.
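For illustration only (a hand-written check, not something LEX would generate), the identifier pattern above can be tested in C as follows; the function name is hypothetical:
#include <ctype.h>

/* Sketch: does the string s match [a-zA-Z_][a-zA-Z0-9_]* ? */
int is_identifier(const char *s)
{
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                             /* must start with letter or _ */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                         /* only letters, digits, _     */
    return 1;
}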
3. Token Specification Examples
Here’s a table of some common tokens and their corresponding regular expressions in a typical
programming language:
4. Using Finite Automata (DFA) for Token Specification
A deterministic finite automaton (DFA) is a mathematical model used by lexical analyzers to recognize
tokens. Regular expressions can be translated into a DFA, which then processes the input stream character
by character.
The process works as follows (a small hand-coded illustration appears after these steps):
1. Conversion to DFA: The regular expressions for token classes are converted into a DFA.
2. Input Processing: The DFA processes the input source code character by character,
transitioning between states based on the input symbols.
3. Token Recognition: When the DFA reaches an accepting state, a token has been successfully
recognized, and the lexical analyzer outputs the token.
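As an example, consider the integer pattern [0-9]+. A hand-coded sketch of the corresponding DFA, driven character by character, might look like this in C (the function name is hypothetical, and LEX-generated automata are table-driven rather than hand-written):
#include <ctype.h>

/* States of a tiny DFA for [0-9]+ :
   state 0 = start, state 1 = accepting (one or more digits seen). */
int matches_integer(const char *s)
{
    int state = 0;
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s))
            state = 1;            /* transition to / stay in accepting state */
        else
            return 0;             /* no transition defined: reject           */
    }
    return state == 1;            /* accept only if we end in accepting state */
}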
5. Example of Token Specification Process
Consider a simple language with the following token types:
Keywords: if, else, int
Identifiers: Any string starting with a letter or underscore, followed by letters, digits, or
underscores.
Integer literals: A sequence of digits.
Here’s how token specification might work:
1. Keywords: We define keywords as a special class using direct matching (e.g., if, else, int).
2. Identifiers: We define identifiers using a regular expression, [a-zA-Z_][a-zA-Z0-9_]*.
3. Integer Literals: We define integer literals with the regular expression [0-9]+.
Sample Input: int x = 10;
The lexical analyzer would process this input and recognize the following tokens:
Keyword: int
Identifier: x
Operator: =
Integer Literal: 10
Punctuation: ;
6. Lexical Analyzer Tools (e.g., LEX)
Tools like LEX (a lexical analyzer generator) automate the token specification process. A user writes
regular expressions for token classes in the LEX input file, and LEX generates the C code that implements
the lexical analyzer. For example:
%%
int { return KEYWORD_INT; }
[0-9]+ { return INTEGER; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
%%
This LEX specification defines tokens for the keyword int, integer literals, and identifiers, and generates
a scanner that can recognize these tokens.
The specification of tokens is an essential part of the lexical analysis phase in compiler design. By using
regular expressions or finite automata, the compiler defines patterns for different types of tokens,
including keywords, identifiers, literals, operators, and punctuation. This allows the lexical analyzer to
efficiently break down the source code into meaningful components, which can then be processed by later
phases of the compiler (like the syntax analyzer).
1.3.4 THE LEXICAL-ANALYZER GENERATOR LEX
LEX is a powerful tool used for generating lexical analyzers (also known as scanners) in compilers. It
automates the process of turning regular expressions, which specify the patterns of tokens, into a C-based
program that can identify those tokens in a given input stream. LEX is widely used to simplify the
implementation of the lexical analysis phase in a compiler.
What is LEX?
LEX is a lexical analyzer generator that takes a set of regular expressions (used to define the patterns of
tokens) as input and generates a C program that performs lexical analysis on a source code file. The
generated C code uses a finite automaton to recognize the defined patterns (tokens) and perform actions
based on what it matches.
How LEX Works
1. Input Specification: The user provides a specification file for the lexical analyzer, which
consists of a set of regular expressions for token classes and associated C code for actions
to be taken when tokens are matched.
2. Regular Expressions: LEX uses regular expressions to define patterns for different tokens.
For example, it might have regular expressions for keywords (if, else), identifiers (variable
names), integer literals, etc.
3. Action Code: Alongside each regular expression, the user provides action code (usually in
C) that is executed when a pattern is matched. The action might involve returning a token to
the parser, performing an operation, or printing an error message.
4. Finite Automaton Generation: LEX translates the regular expressions into a finite
automaton, which is a state machine that can process the input string character by character,
transitioning between states based on the input and matching tokens.
5. C Code Generation: After processing the regular expressions and actions, LEX generates a
C program that includes the logic to scan input based on the defined regular expressions.
This C program can then be compiled and used to perform lexical analysis.
Structure of a LEX Specification File
A LEX specification file generally consists of three sections:
1. Definitions Section:
o This section is used to define macros, regular expressions, or include necessary
libraries.
o It typically contains global declarations or defines for constants and regular
expressions.
2. Rules Section:
o This section is the heart of the LEX specification.
o Each rule consists of a regular expression (defining the token) followed by a block
of C code that is executed when the token is recognized.
o Rules are separated by new lines, and LEX processes them in the order they are
specified.
3. User Code Section:
o This section allows for additional C code that will be included in the generated
scanner.
o It often includes function definitions and main() function code that initializes the
scanner and calls the lexical analyzer functions.
Example LEX Specification
Here’s an example of a simple LEX specification for a lexical analyzer that recognizes integer literals,
identifiers, and a few keywords (if, else, int):
%{
/* Definitions section: include necessary libraries or macros */
#include <stdio.h>
#include <stdlib.h>
%}
/* Rules section: specify token patterns and actions */
%%
if { printf("Keyword: if\n"); }
else { printf("Keyword: else\n"); }
int { printf("Keyword: int\n"); }
[0-9]+ { printf("Integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); }
\n { /* ignore newlines */ }
[ \t] { /* ignore whitespace */ }
. { printf("Invalid character: %s\n", yytext); }
%%
/* User code section */
int main(int argc, char *argv[]) {
yylex(); /* Call the lexical analyzer */
return 0;
}
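On a typical Unix system, this specification (saved, for example, as a file named scanner.l; the name is arbitrary) would be processed with the lex or flex tool, which writes the generated scanner to lex.yy.c. That file is then compiled with a C compiler and linked with the lex support library (conventionally cc lex.yy.c -ll for lex, or -lfl for flex), and the resulting executable reads its standard input and prints the tokens it recognizes.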
Explanation of the Example
Definitions Section:
o This section includes the necessary libraries (stdio.h for input/output and stdlib.h for
general functions).
o There are no macros or advanced definitions in this simple example.
Rules Section:
o if, else, and int: These are keywords in the language. When one of these keywords
is matched, the corresponding action (printf) is executed.
o [0-9]+: This regular expression matches one or more digits, representing an integer
literal. The action prints the matched number using yytext, which is a special
variable that holds the current matched token.
o [a-zA-Z_][a-zA-Z0-9_]*: This matches an identifier (a sequence starting with a
letter or underscore, followed by letters, digits, or underscores).
o [ \t]: This matches whitespace characters (spaces and tabs), which are ignored.
o \n: This matches a newline character, which is also ignored.
o .: This matches any character not matched by the previous rules and prints an error
message.
User Code Section:
o The main() function calls yylex(), which is the function generated by LEX that
performs the lexical analysis.
o The yylex() function is the entry point for the lexical analyzer. It reads input, applies
the rules, and invokes the corresponding actions.
LEX Output
When this LEX file is processed, the lex tool generates a C file (e.g., lex.yy.c) that contains a lexical
analyzer. This C file is then compiled, resulting in an executable that can read an input stream and perform
lexical analysis. For example, given the input:
int x = 10;
if (x > 5) {
x = 20;
}
The output from running the generated program might be:
Keyword: int
Identifier: x
Invalid character: =
Integer: 10
Invalid character: ;
Keyword: if
Invalid character: (
Identifier: x
Invalid character: >
Integer: 5
Invalid character: )
Invalid character: {
Identifier: x
Invalid character: =
Integer: 20
Invalid character: ;
Invalid character: }
Key Features of LEX
1. Token Definition: LEX allows you to define tokens using regular expressions, which are
simple and concise.
2. Action Code: LEX allows you to specify C code that is executed when a token is recognized.
This can include returning tokens to the parser, printing information, or handling errors.
3. Efficient: LEX uses finite automata to match regular expressions efficiently. This ensures
that lexical analysis is fast even for large inputs.
4. Error Handling: LEX allows you to define actions for unmatched characters, making it
easy to handle lexical errors.
Advantages of Using LEX
1. Automation: LEX automates the creation of lexical analyzers, reducing the amount of code
that needs to be written manually.
2. Efficiency: The finite automaton approach used by LEX ensures that lexical analysis is
performed efficiently.
3. Portability: LEX generates C code, which can be compiled and run on different platforms,
making it portable across systems.
4. Flexibility: LEX allows you to define a wide range of token patterns and actions, making it
highly customizable for different programming languages.