Compiler Questions
A compiler translates high-level code into a lower-level language, such as machine or assembly code. To design a
compiler for arithmetic expressions, we follow these essential steps:
1. Lexical Analysis: Breaks the source code into tokens.
2. Syntax Analysis: Checks the structure of the expression against grammar rules.
3. Semantic Analysis: Verifies that the expression is meaningful (e.g., operand types are compatible).
4. Intermediate Representation (IR): Converts the expression into an abstract form like a parse tree.
We will translate:
a = b + c * d - e
Tokens: a, =, b, +, c, *, d, -, e
• a, b, c, d, e: Identifiers (variables)
• =, +, *, -: Operators
Parse tree for the right-hand side (operator precedence gives (b + (c * d)) - e):
        -
       / \
      +   e
     / \
    b   *
       / \
      c   d
Three-address code:
t1 = c * d
t2 = b + t1
t3 = t2 - e
a = t3
Assembly (accumulator-style):
LOAD c ; Load c
MUL d ; Multiply by d (t1 = c * d)
STORE t1 ; Save t1
LOAD b ; Load b
ADD t1 ; Add t1 (t2 = b + t1)
SUB e ; Subtract e (t3 = t2 - e)
STORE a ; Store the result in a
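To make the translation pipeline concrete, here is a minimal Python sketch (not part of the original answer) that generates the three-address code above from a nested-tuple AST; the tuple format, the gen_tac name, and the temporary-naming scheme are assumptions chosen for illustration only.

def gen_tac(node, code, counter):
    # Post-order walk: emit one TAC instruction per operator node (sketch).
    if isinstance(node, str):              # leaf: variable name
        return node
    op, left, right = node
    l = gen_tac(left, code, counter)
    r = gen_tac(right, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"
    code.append(f"{temp} = {l} {op} {r}")
    return temp

ast = ('-', ('+', 'b', ('*', 'c', 'd')), 'e')   # (b + (c * d)) - e
code = []
code.append(f"a = {gen_tac(ast, code, [0])}")
print("\n".join(code))   # t1 = c * d, t2 = b + t1, t3 = t2 - e, a = t3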
The lexical analyzer (or scanner) is the first phase of a compiler. Its primary purpose is to process the raw source code
and break it into meaningful tokens.
Key Responsibilities:
1. Tokenization: Splits the code into tokens, which are the smallest meaningful units like keywords, identifiers,
literals, and operators.
o Example: For x = y + 42;, tokens are [IDENTIFIER: x, ASSIGN_OP: =, IDENTIFIER: y, ADD_OP: +,
NUM_LITERAL: 42, SEMICOLON].
2. Pattern Recognition: Uses regular expressions or finite automata to identify different token types, e.g., \d+ for
numbers, [a-zA-Z_]\w* for identifiers.
3. Error Reporting: Flags invalid characters or malformed tokens as lexical errors.
4. Preprocessing: Removes comments and white space, which are irrelevant to syntax analysis.
5. Symbol Table Integration: Sends identifiers and literals to the symbol table for later use by the compiler phases.
Example: For the statement int a = b + 10;, the scanner's output is:
[(KEYWORD, int), (IDENTIFIER, a), (ASSIGN_OP, =), (IDENTIFIER, b), (ADD_OP, +), (NUM_LITERAL, 10), (SEMICOLON, ;)]
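As a hedged illustration of tokenization, the following Python sketch scans the statement above using regular expressions like those mentioned earlier; the token names and patterns are assumptions chosen for this example, not a fixed specification.

import re

TOKEN_SPEC = [
    ("KEYWORD",     r"\b(int|float)\b"),
    ("NUM_LITERAL", r"\d+"),
    ("IDENTIFIER",  r"[a-zA-Z_]\w*"),
    ("ASSIGN_OP",   r"="),
    ("ADD_OP",      r"\+"),
    ("SEMICOLON",   r";"),
    ("SKIP",        r"\s+"),              # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":     # drop whitespace tokens
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("int a = b + 10;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'a'), ('ASSIGN_OP', '='), ...]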
1. Compiler
A compiler translates high-level source code into machine code or assembly language. It does so in multiple phases,
including lexical analysis, syntax analysis, optimization, and code generation.
int result = x * y + z;
LOAD x
MUL y
ADD z
STORE result
2. Interpreter
An interpreter directly executes high-level code without generating intermediate machine code. It translates and
executes line-by-line.
Role in Execution (e.g., for the Python statement print(3 + 5)):
1. Reads the statement print(3 + 5).
2. Evaluates 3 + 5.
3. Prints 8.
3. Assembler
An assembler converts assembly language into machine code. Assembly language is human-readable and closely mirrors
machine instructions.
Role in Compilation:
• Bridges the gap between the high-level compiler output (assembly) and low-level machine code.
LOAD A
ADD B
STORE C
Machine Code: the assembler encodes each of these instructions as binary opcodes and operands for the target CPU.
Conclusion
1. The compiler is responsible for analyzing, optimizing, and translating high-level code into low-level machine
instructions.
2. The interpreter allows immediate execution of high-level code but trades speed for convenience.
3. The assembler translates human-readable assembly into binary machine code for execution.
The lexical analyzer plays a critical role in ensuring the raw source code is transformed into structured tokens, laying the
groundwork for accurate and efficient compilation. Together, these tools create a streamlined pipeline for turning
human-readable programs into executable forms for computers.
2. Question (a): Given the grammar, calculate the First( ) and Follow( ) sets and identify the terminals and non-
terminals.
i. Grammar:
1. E → T E'
2. E' → + T E' / ε
3. T → F T'
4. T' → * F T' / ε
5. F → (E) / id
ii. Grammar:
1. S → X Y
2. X → a X / b
3. Y → c Y / d
Answer (a)
i. Grammar Analysis
For this grammar, the non-terminals are E, E', T, T', F; the terminals are +, *, (, ), id.
The First( ) set of a non-terminal contains all terminals that can appear at the beginning of strings derived from that non-terminal. If a production can derive ε, include ε in First( ).
1. First(F):
F → (E) / id
First(F) = { (, id }
2. First(T'):
T' → * F T' / ε
First(T') = { *, ε }
3. First(T):
T → F T'
First(T) = First(F) = { (, id }
4. First(E'):
E' → + T E' / ε
First(E') = { +, ε }
5. First(E):
E → T E'
First(E) = First(T) = { (, id }
Summary of First( ):
Non-terminal | First( )
E            | { (, id }
E'           | { +, ε }
T            | { (, id }
T'           | { *, ε }
F            | { (, id }
The Follow( ) set of a non-terminal contains all terminals that can appear immediately after that non-terminal in some derivation.
1. Follow(E):
E is the start symbol, and E appears inside (E).
Follow(E) = { ), $ } (Here, $ denotes the end of input.)
2. Follow(E'):
E' appears only at the end of E → T E' and E' → + T E', so it inherits Follow(E).
Follow(E') = { ), $ }
3. Follow(T):
T appears in E → T E' and E' → + T E'. Hence, T is followed by First(E') \ {ε} = { + }; because E' can derive ε, Follow(T) also includes Follow(E') = { ), $ }.
Follow(T) = { +, ), $ }
4. Follow(T'):
T' appears only at the end of T → F T' and T' → * F T', so Follow(T') = Follow(T).
Follow(T') = { +, ), $ }
5. Follow(F):
F appears in T → F T' and T' → * F T'. Hence, F is followed by First(T') \ {ε} = { * }; because T' can derive ε, Follow(F) also includes Follow(T') = { +, ), $ }.
Follow(F) = { *, +, ), $ }
Summary of Follow( ):
Non-terminal | Follow( )
E            | { ), $ }
E'           | { ), $ }
T            | { +, ), $ }
T'           | { +, ), $ }
F            | { *, +, ), $ }
ii. Grammar Analysis
Non-terminals: S, X, Y; terminals: a, b, c, d.
First( ) Sets:
Non-terminal | First( )
S            | { a, b }
X            | { a, b }
Y            | { c, d }
Follow( ) Sets:
Non-terminal | Follow( )
S            | { $ }
X            | { c, d }
Y            | { $ }
o For each production A → α, place α in M[A, a] for every terminal a ∈ First(α).
o If ε ∈ First(α), place α in M[A, b] for every b ∈ Follow(A).
• The grammar has no left recursion or ambiguity, and the parsing table (constructed using First/Follow) will have
no conflicts.
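As a hedged sketch of this table-construction rule, the following Python fragment builds M for grammar (ii); the dictionary encoding, the first_of helper, and the lowercase-terminal convention are assumptions for the example.

GRAMMAR = {"S": ["XY"], "X": ["aX", "b"], "Y": ["cY", "d"]}
FIRST = {"S": {"a", "b"}, "X": {"a", "b"}, "Y": {"c", "d"}}
FOLLOW = {"S": {"$"}, "X": {"c", "d"}, "Y": {"$"}}

def first_of(alpha):
    # FIRST of a production body; a lowercase first symbol is a terminal.
    head = alpha[0]
    return {head} if head.islower() else FIRST[head]

table = {A: {} for A in GRAMMAR}
for A, productions in GRAMMAR.items():
    for alpha in productions:
        for a in first_of(alpha):
            if a in table[A]:                 # two entries in one cell
                raise ValueError("grammar is not LL(1)")
            table[A][a] = alpha
        if "ε" in first_of(alpha):            # ε-rule: fill Follow(A) columns
            for b in FOLLOW[A]:
                table[A][b] = alpha

print(table["X"])   # {'a': 'aX', 'b': 'b'}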
Grammar:
1. S → X X
2. X → a X / z
Answer (c)
An LL(1) parser is a type of top-down parser that processes the input from Left to right and constructs a Leftmost
derivation of the string. The "1" in LL(1) indicates that the parser uses one input symbol of lookahead to make parsing
decisions.
1. Input Buffer: Holds the input string, ending with the end marker "$".
2. Stack: Stores grammar symbols (terminals and non-terminals). Initially contains the start symbol and end marker
"$".
3. Parsing Table: Guides the parser on how to replace non-terminals with their productions based on the current
input.
Step-by-Step Process
We calculate the First( ) and Follow( ) sets for the given grammar:
First( ) Sets:
Non-terminal | First( )
S            | { a, z }
X            | { a, z }
Follow( ) Sets:
Non-terminal | Follow( )
S            | { $ }
X            | { a, z, $ }
Parsing Table:
Non-terminal | a      | z      | $
S            | S → XX | S → XX |
X            | X → aX | X → z  |
Parsing algorithm (Python):
def ll1_parse(tokens, parsing_table, start_symbol):
    stack, pointer = ["$", start_symbol], 0
    tokens = tokens + ["$"]              # append the end marker
    while stack:
        top, current_symbol = stack.pop(), tokens[pointer]
        if top == current_symbol:                    # terminal: match input
            pointer += 1
        elif current_symbol in parsing_table.get(top, {}):
            production = parsing_table[top][current_symbol]
            if production != "ε":                    # push RHS in reverse
                stack.extend(reversed(production))
        else:
            return False                             # no table entry: reject
    return pointer == len(tokens)                    # accept iff input consumed
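A brief usage sketch with the table for this grammar (the dictionary encoding of the table is an assumption for the example):

table = {"S": {"a": "XX", "z": "XX"}, "X": {"a": "aX", "z": "z"}}
print(ll1_parse(list("zz"), table, "S"))    # True: zz is accepted
print(ll1_parse(list("az"), table, "S"))    # False: az cannot form two X's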
Step-by-Step Parsing of the input zz:
Step | Stack | Input | Action
1    | $ S   | zz$   | Expand S → XX
2    | $ X X | zz$   | Expand X → z
3    | $ X z | zz$   | Match z
4    | $ X   | z$    | Expand X → z
5    | $ z   | z$    | Match z
6    | $     | $     | Match $
7    | []    | $     | Accept
• The parser uses the First( ) and Follow( ) sets to determine the correct production for non-terminals based on
the current input symbol.
• Each step involves matching terminals or expanding non-terminals until the stack is empty.
• If a mismatch occurs (e.g., no valid entry in the parsing table), the string is rejected.
Diagrammatic View
+-----------------------+
|   Initialize Stack    |
+-----------------------+
            |
            v
+-------------------------+
| Top of Stack:           |
| Non-terminal or Symbol? |
+-------------------------+
      |             |
      v             v
+-------------+ +-------------+
| Consult     | | Match Input |
| Table       | | Symbol      |
+-------------+ +-------------+
      |             |
      +------+------+
             v
+------------------+
|   Stack Empty?   |
| Accept or Reject |
+------------------+
Conclusion
• The LL(1) parser for this grammar correctly parses strings like zz using the predictive parsing table and the
systematic elimination of grammar symbols from the stack.
• The methodology ensures efficiency and simplicity, suitable for grammars that conform to the LL(1) constraints.
Question3 (a): Explain the process of generating a CLR(1) parsing table from an LR(0) item set with lookahead symbols.
Use the grammar provided in 2(c) to illustrate your explanation. Highlight any challenges encountered during the
construction and how they are resolved.
The CLR(1) parser uses lookahead symbols to construct a more accurate parsing table compared to LR(0). Here's the
step-by-step process:
1. S' → S
2. S → XX
3. X → aX / z
An LR(0) item is a grammar production with a "dot" (•) indicating the position of the parser. Example:
S' → •S
Steps to generate LR(0) item sets:
1. Start with the closure of the augmented start item S' → •S.
2. Add transitions by shifting the "dot" over the next symbol, creating new item sets.
Representative item sets (numbered to match the parsing table below):
• I0: { S' → •S, S → •XX, X → •aX, X → •z }
• I1: { S' → S• }
• I2: { X → a•X, X → •aX, X → •z }
• I3: { X → z• }
• I4: { S → X•X, X → •aX, X → •z }
Each CLR(1) item is an LR(0) item paired with a lookahead symbol derived from the Follow( ) set of the non-terminal on
the left-hand side.
• I0: { S' → •S, $ ; S → •XX, $ ; X → •aX, {a, z} ; X → •z, {a, z} }
1. Action Entries:
o For each state IiI_i, add a shift, reduce, or accept action based on the transitions and items.
2. Goto Entries:
o For each transition on a non-terminal from state Ii to state Ij, set Goto[i, non-terminal] = j.
State | a           | z           | $           | X | S
0     | S2          | S3          |             | 4 | 1
1     |             |             | Accept      |   |
2     | S2          | S3          |             | 5 |
3     | R3 (X → z)  | R3 (X → z)  |             |   |
4     | S6          | S7          |             | 8 |
5     | R2 (X → aX) | R2 (X → aX) |             |   |
6     | S6          | S7          |             | 9 |
7     |             |             | R3 (X → z)  |   |
8     |             |             | R1 (S → XX) |   |
9     |             |             | R2 (X → aX) |   |
1. Large State Space: Adding lookahead symbols creates more states than LR(0), making CLR(1) tables significantly
larger.
Solution: Use LALR(1), which merges states with identical cores.
Question (b): Using the grammar in 2(c), construct the parsing table for an LALR(1) parser by merging states with
identical cores. Explain how merging reduces table size while maintaining correctness.
An LALR(1) parser reduces the size of the CLR(1) parsing table by merging states that have the same core (LR(0) items
without lookaheads). The lookaheads of merged states are combined, ensuring no conflicts arise.
In the CLR(1) automaton above, the state pairs (I2, I6), (I3, I7), and (I5, I9) have identical cores; combining the lookaheads of each pair merges them.
Merged States and Parsing Table:
State | a           | z           | $           | X | S
0     | S2          | S3          |             | 4 | 1
1     |             |             | Accept      |   |
2     | S2          | S3          |             | 5 |
3     | R3 (X → z)  | R3 (X → z)  | R3 (X → z)  |   |
4     | S2          | S3          |             | 8 |
5     | R2 (X → aX) | R2 (X → aX) | R2 (X → aX) |   |
8     |             |             | R1 (S → XX) |   |
Step 3: Advantages of LALR(1)
1. Smaller State Space: Merging states reduces the number of rows in the parsing table.
2. Efficiency: LALR(1) tables are the same size as SLR/LR(0) tables, with improved accuracy due to lookaheads.
Conclusion
• The CLR(1) parsing table is derived by adding lookahead symbols to the LR(0) item sets, providing more precise
parsing decisions but at the cost of a larger table.
• The LALR(1) parsing table reduces this size by merging states with identical cores, balancing efficiency and
accuracy.
4. Question (a): Describe the primary function of parsers in the context of programming languages. How do parsers
contribute to the compilation process?
Answer (a):
The primary function of parsers is to analyze the syntax of source code based on the grammar rules of the programming
language. Parsers verify whether the input code adheres to the language's syntax and generate an intermediate
representation for further processing.
1. Syntax Analysis:
o A parser takes a sequence of tokens (generated by the lexical analyzer) and checks if they follow the
syntax rules.
o If any syntax errors are detected, the parser provides error messages to help developers debug.
2. Intermediate Representation:
o The parser converts source code into a parse tree or abstract syntax tree (AST). This tree represents the
hierarchical structure of the program.
o The AST generated by the parser serves as input for semantic analysis, optimization, and code
generation.
Types of Parsers:
1. Top-Down Parsers:
o Example: LL(1) / recursive-descent parsers.
2. Bottom-Up Parsers:
o Example: LR parser.
• Parsers act as a bridge between the source code and machine code.
• They ensure that the source code adheres to grammatical rules before further processing.
• Efficient parsing techniques improve the compilation speed and error detection accuracy.
Question (b): Differentiate between top-down and bottom-up parsing techniques. Provide examples of programming
languages where each parsing technique is commonly used.
Answer (b):
Aspect              | Top-Down Parsing                            | Bottom-Up Parsing
Construction        | Builds the parse tree from root to leaves.  | Builds the parse tree from leaves to root.
Grammar Restriction | Works with LL grammars (no left recursion). | Can handle a wider range of grammars (e.g., left recursion).
Example Languages   | Python, JavaScript (LL parsers).            | C, C++, Java (LR parsers).
1. Top-Down Parsing:
o Often used in scripting languages like Python or JavaScript for their interpreters.
2. Bottom-Up Parsing:
o Widely used in system-level programming languages like C, C++, and Java due to their complex
grammars.
Question (c): Explain LR parsing and LALR parsing techniques. How do these parsing methods differ, and what are their
respective strengths in parsing complex grammars?
Answer (c):
LR Parsing:
• A bottom-up parsing technique that builds the parse tree by tracing a rightmost derivation of the input string in reverse, through successive reductions.
Steps:
1. Read the input from left to right, maintaining a stack of states and grammar symbols.
2. Shift, reduce, or accept based on the input symbol and current state.
Example Techniques:
• SLR(1): Simple LR; uses Follow( ) sets to decide reductions.
• CLR(1): Canonical LR parser with one lookahead symbol for better accuracy.
LALR Parsing:
• A simplified version of CLR(1) parsing where states with identical cores (LR(0) items) are merged.
Aspect      | LR Parsing (CLR(1))                               | LALR Parsing
State Space | Larger due to unique states for each lookahead.   | Smaller by merging states with identical cores.
Accuracy    | Can handle more grammars with fewer conflicts.    | May face reduce/reduce conflicts in some grammars.
Complexity  | More complex to implement and maintain.           | Simpler and more efficient than CLR(1).
Usage       | Used in advanced tools requiring precise parsing. | Commonly used in production compilers.
Examples    | CLR(1): rare, as LALR is preferred for efficiency.| LALR(1): used in YACC, Bison.
Strengths:
1. LR Parsing:
o Handles a very wide class of deterministic context-free grammars with precise error detection.
2. LALR Parsing:
o Preferred in production compilers due to its balance between accuracy and performance.
Conclusion:
• Parsers are critical for syntax verification and forming an intermediate representation of the source code.
• Top-down parsing is simpler and suited for smaller languages, while bottom-up parsing is robust and used in
complex compilers.
• LR parsing ensures high accuracy but is resource-intensive, while LALR parsing strikes a balance, making it the
preferred choice in modern compiler design.
Given Grammar:
1. S → S + S
2. S → S × S
3. S → x
Analysis of Ambiguity:
A grammar is considered ambiguous if it allows multiple distinct parse trees or derivation structures for the same string.
Let's check whether the string x + x × x can be generated ambiguously.
Parse Tree 1 (× nested under the right operand of +):
        S
      / | \
     S  +  S
     |   / | \
     x  S  ×  S
        |     |
        x     x
Derivation:
1. S ⇒ S + S
2. ⇒ x + S × S
3. ⇒ x + x × x
Parse Tree 2 (+ nested under the left operand of ×):
           S
         / | \
        S  ×  S
      / | \   |
     S  +  S  x
     |     |
     x     x
Derivation:
1. S ⇒ S × S
2. ⇒ S + S × S
3. ⇒ x + x × x
Conclusion:
Since the string x + x × x can be derived using two distinct parse trees, the grammar is ambiguous. The ambiguity arises because the grammar does not specify operator precedence or associativity for + and ×.
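To see why the two trees matter, here is a small hedged Python check (the value 2 for x is an arbitrary assumption) that evaluates both groupings:

x = 2
tree1 = x + (x * x)      # + at the root: x + (x × x)
tree2 = (x + x) * x      # × at the root: (x + x) × x
print(tree1, tree2)      # 6 8 -> the two parse trees denote different values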
The Intermediate Code Generation phase in a compiler bridges the gap between the source code and machine code. It
transforms high-level code into an intermediate representation (IR), which is closer to machine code but still abstract
enough to be independent of the target machine.
Significance:
1. Portability: IR is independent of the target machine, enabling retargeting for different architectures.
2. Optimization: Simplifies implementation of optimizations, such as common subexpression elimination and dead
code removal.
3. Simplifies Code Generation: IR provides a structured form for easier final code generation.
Common Intermediate Code Representations:
1. Three-Address Code (TAC): each instruction has at most one operator.
o Example: a = b + c, or for x = (a - b) * (c + d):
o t1 = a - b
o t2 = c + d
o x = t1 * t2
2. Syntax Trees (ASTs):
o Example: x = a + b:
o       =
o      / \
o     x   +
o        / \
o       a   b
3. Control Flow Graphs (CFGs):
o Represents the program's flow using basic blocks and control edges.
Steps (e.g., for the statement d = (a - b) + ((a - c) + (a - c))):
o t1 = a - b
o t2 = a - c
o t3 = t2 + t2
o d = t1 + t3
LOAD R1, a ; R1 = a
SUB R1, b ; R1 = a - b
STORE t1, R1 ; t1 = a - b
LOAD R1, a ; R1 = a
SUB R1, c ; R1 = a - c
STORE t2, R1 ; t2 = a - c
LOAD R2, t2 ; R2 = t2
ADD R2, t2 ; R2 = t2 + t2
STORE t3, R2 ; t3 = t2 + t2
LOAD R3, t1 ; R3 = t1
ADD R3, t3 ; R3 = t1 + t3
STORE d, R3 ; d = t1 + t3
Non-Deterministic Behavior:
Non-deterministic behavior occurs when a system has multiple choices for the next state from a given state for the same
input symbol. In the context of automata, Nondeterministic Finite Automata (NFA) allow:
1. Multiple Transitions: For a given state and input symbol, there can be multiple possible next states.
2. ε-Transitions: The automaton may change state without consuming any input symbol.
1. Equivalence:
o Every NFA has an equivalent Deterministic Finite Automaton (DFA). Although NFAs are easier to design,
DFAs are computationally easier to implement.
2. Conversion Process:
o The Subset Construction Algorithm is used to convert an NFA into a DFA by creating states in the DFA
corresponding to subsets of NFA states.
3. Differences:
Aspect             | NFA                                           | DFA
Ease of Simulation | Requires backtracking or parallel processing. | Easier to simulate directly.
Example:
NFA transition table:
State     | on 0   | on 1
0         | {0, 1} | -
1         | -      | {2}
2 (final) | -      | -
DFA Equivalent (via subset construction):
State       | on 0   | on 1
{0}         | {0, 1} | -
{0, 1}      | {0, 1} | {2}
{2} (final) | -      | -
Conclusion:
• Non-deterministic behavior in NFAs provides flexibility and simplicity in design but requires conversion to a DFA
for efficient computation.
• DFAs are deterministic and guarantee a unique path for every input string.
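As a hedged illustration of the Subset Construction Algorithm, the following Python sketch determinizes a small NFA without ε-moves; the dictionary encoding and the example NFA (strings ending in "01") are assumptions for this sketch.

from collections import deque

def nfa_to_dfa(nfa, start, finals, alphabet):
    # nfa[state][symbol] is a set of next states.
    start_set = frozenset([start])
    dfa, queue = {}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            nxt = frozenset(s for q in current for s in nfa.get(q, {}).get(sym, set()))
            dfa[current][sym] = nxt
            if nxt and nxt not in dfa:
                queue.append(nxt)
    # a DFA state is accepting if it contains any NFA final state
    dfa_finals = {S for S in dfa if S & finals}
    return dfa, dfa_finals

nfa = {0: {"0": {0, 1}, "1": {0}}, 1: {"1": {2}}}   # assumed example NFA
dfa, finals = nfa_to_dfa(nfa, 0, {2}, ["0", "1"])
print(len(dfa), "DFA states; accepting:", finals)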
6(a) Analysis of the Expression 10 # 3 @ 9 Using CFG and Semantic Rules
Given Expression:
10 # 3 @ 9
CFG Rules:
1. E → E # E
2. E → E @ E
3. E → num
Semantic Rules:
1. E → E1 # E2: evaluate E1 # E2 = (E1 + E2) mod 10.
2. E → E1 @ E2: evaluate E1 @ E2 = E1 × E2.
We construct a parse tree that represents the syntactic structure and annotate it with semantic information for evaluating 10 # 3 @ 9.
1. Parse Tree (left-to-right evaluation gives (10 # 3) @ 9):
            E
         ___|___
        /   |   \
       E    @    E
    ___|___      |
   /   |   \     9
  E    #    E
  |         |
  10        3
2. Semantic Annotation:
• E → 10: E.value = 10
• E → 3: E.value = 3
• E → 9: E.value = 9
• Inner node: E.value = 10 # 3 = (10 + 3) mod 10 = 3
• Root node: E.value = 3 @ 9 = 3 × 9 = 27
• Ensures operators (# and @) are applied in the correct order (left-to-right associativity).
• Interprets the operators:
o # as addition modulo 10.
o @ as multiplication.
Semantic analysis verifies the types and values during evaluation, ensuring meaningful computations.
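A small hedged Python check of this evaluation (the operator function names are assumptions matching the semantic rules above):

def hash_op(a, b):
    return (a + b) % 10       # E1 # E2 = (E1 + E2) mod 10

def at_op(a, b):
    return a * b              # E1 @ E2 = E1 × E2

# Left-to-right evaluation of 10 # 3 @ 9 -> (10 # 3) @ 9
print(at_op(hash_op(10, 3), 9))   # (13 mod 10) * 9 = 3 * 9 = 27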
Semantic analysis ensures that the program is meaningful and adheres to the language's semantic rules. It occurs after
syntax analysis and validates:
1. Type Checking: operands and operators must have compatible types.
2. Scope Resolution: variables must be declared before use.
3. Consistency Checks: function calls, array indices, and assignments must be used consistently.
Aspect          | Syntax Analysis                           | Semantic Analysis
Errors Detected | Missing semicolon, unmatched parentheses. | Type mismatches, undeclared variables.
Examples of semantic errors:
1. Undeclared Variable: using a variable x that was never declared.
2. Type Mismatch: assigning a string value to an int variable.
(i)
General pattern: a production A → A α / β is left-recursive; it is replaced by A → β A' and A' → α A' / ε.
Given rules:
1. A → B a / A c / d
(ii)
2. B → B b / A b / d
Transformed Grammar:
Eliminating the direct left recursion in each rule (A → A c and B → B b) gives:
Final Grammar:
1. A → B a A' / d A'
2. A' → c A' / ε
3. B → A b B' / d B'
4. B' → b B' / ε
Definition:
An annotated parse tree represents the syntactic structure of a program along with additional information derived
during compilation, such as types, values, and other semantic attributes.
1. Nodes:
o Represent grammar symbols (non-terminals and terminals).
2. Edges:
o Represent the application of production rules.
3. Attributes:
o Synthesized Attributes: Computed from child nodes to parent (e.g., values of expressions).
o Inherited Attributes: Passed from parent nodes to children (e.g., type information).
4. Root Node:
o The final attribute of the root often gives the overall result.
5. Leaf Nodes:
o Correspond to tokens; their attribute values come from the lexical analyzer (e.g., the value of a literal).
Significance: the annotated tree makes the flow of attribute values explicit, so later phases (type checking, code generation) can read results directly off the tree.
Example tree for x * (a + b):
        *
       / \
      x   +
         / \
        a   b
Annotations:
1. Synthesized Attributes:
o These are attributes that are computed bottom-up during the parsing process, meaning the value of a
non-terminal is computed based on its child nodes. The synthesized attribute is propagated up the parse
tree from the leaves to the root.
o Example: In arithmetic expressions, the value of an expression like E+TE + T is the sum of the values of
EE and TT, and this value is computed as a synthesized attribute.
2. Inherited Attributes:
o These are attributes that are computed top-down. The value of an inherited attribute depends on its
parent node or the context passed down to it. These attributes are typically used for context-sensitive
information, like type checking or scope resolution.
o Example: In a language where variables must be declared before use, an inherited attribute might be
passed down to check if a variable is declared within the correct scope.
E→E+T|T
T→T*F|F
F → id { F.val = lookup(id.name) }
• For E → E + T and T → T * F: these rules also involve synthesized attributes, where the result of a combined expression (like E + T) depends on the values computed from its child expressions (E.val and T.val).
• S-Attributed SDT: This type of SDT involves only synthesized attributes, where each attribute is computed
bottom-up.
• L-Attributed SDT: This type of SDT allows both synthesized and inherited attributes, but the inherited attributes
must be computed in a way that respects the left-to-right evaluation order.
Reasoning:
The semantic rules given are S-Attributed SDT because they only use synthesized attributes. All attributes, such as F.val
and E.val, are computed by combining values of their child nodes or using values from the symbol table, which are
computed bottom-up. There is no use of inherited attributes in the given rules.
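A hedged sketch of evaluating these synthesized attributes bottom-up over a parse tree (the node encoding and symbol-table contents are assumptions for the example):

SYMBOL_TABLE = {"a": 4, "b": 5}           # assumed lookup(id.name) results

def val(node):
    kind = node[0]
    if kind == "id":                       # F -> id { F.val = lookup(id.name) }
        return SYMBOL_TABLE[node[1]]
    if kind == "+":                        # E -> E + T { E.val = E1.val + T.val }
        return val(node[1]) + val(node[2])
    if kind == "*":                        # T -> T * F { T.val = T1.val * F.val }
        return val(node[1]) * val(node[2])

tree = ("*", ("id", "a"), ("+", ("id", "a"), ("id", "b")))   # a * (a + b)
print(val(tree))   # 4 * (4 + 5) = 36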
Three-address code (TAC) is a low-level representation used in compilers for intermediate code generation. It uses simple
instructions, each containing at most three operands (hence "three-address"). Below are common types of statements in
three-address code:
1. Assignment Statements:
o Example:
o x=a+b
2. Arithmetic Operations:
o Example:
o t1 = a + b
o t2 = t1 * c
3. Conditional Jumps:
o Example:
o if x > 10 goto L1
4. Unconditional Jumps:
o Example:
o goto L2
5. Function Calls:
o Example:
o t1 = call f, x, y
6. Return Statements:
o Example:
o return t1
7. Comparison Operations:
o Example:
o t1 = a < b
(c) Constructing a Non-Deterministic Finite Automaton (NFA) for Recognizing Strings Containing "001"
Problem:
We are tasked with constructing an NFA that recognizes all strings containing the substring "001" over the alphabet
Σ={0,1}\Sigma = \{0, 1\}.
1. States:
o q0: Initial state, where no part of "001" has been matched yet.
o q1: A single '0' has been matched.
o q2: The prefix "00" has been matched.
o q3: The substring "001" has been matched (accepting state).
2. Alphabet:
o Σ = {0, 1}
3. Transitions:
o From q0:
▪ On '0', go to q1.
o From q1:
▪ On '0', go to q2.
o From q2:
▪ On '1', go to q3.
o From q3:
▪ On '0' or '1', stay in q3 (since the substring "001" has already been matched).
4. Initial State:
o q0
5. Accepting State(s):
o q3
• Transitions:
o δ(q0, 0) = q1
o δ(q0, 1) = q0
o δ(q1, 0) = q2
o δ(q1, 1) = q0
o δ(q2, 1) = q3
o δ(q2, 0) = q2
o δ(q3, 0) = q3
o δ(q3, 1) = q3
State diagram: (q0) --0--> (q1) --0--> (q2) --1--> ((q3)), where q0 loops on '1', q1 falls back to q0 on '1', q2 loops on '0', and q3 loops on '0' and '1'.
Explanation:
• The NFA starts at q0, where no part of "001" has been matched yet.
• As it reads the input, it transitions through the states, eventually accepting strings that contain the substring "001".
• Once in q3, the automaton remains in the accepting state regardless of further input because "001" has been found.
This NFA correctly accepts all strings that contain "001" as a substring, as it only requires that "001" be matched at some
point in the input.
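A hedged Python simulation of this machine (the transition encoding is an assumption; the transition function above happens to be deterministic, so a single-state walk suffices):

DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q2", ("q1", "1"): "q0",
    ("q2", "0"): "q2", ("q2", "1"): "q3",
    ("q3", "0"): "q3", ("q3", "1"): "q3",
}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = DELTA[(state, symbol)]
    return state == "q3"          # accept iff "001" was seen

print(accepts("110010"))   # True  ("001" occurs)
print(accepts("0101"))     # False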
8. (a) Compute the Three-Address Code for the Following Expressions
(i) Expression: a = (x + y) * (y + z) + (x + y + z)
1. Compute x + y:
t1 = x + y
2. Compute y + z:
t2 = y + z
3. Multiply the results:
t3 = t1 * t2
4. Compute x + y again, then add z:
t4 = x + y
t5 = t4 + z
5. Finally, add the results of t3 and t5, and assign the sum to a:
a = t3 + t5
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = x + y
t5 = t4 + z
a = t3 + t5
(ii) Expression: if P < Q then 1 else 0
1. Test the condition:
if P < Q goto L1
2. Else-branch (executed when P ≥ Q, by fall-through):
t1 = 0
3. Skip over the then-branch:
goto L2
4. Then-branch:
L1: t1 = 1
5. L2: (end label)
Complete code:
if P < Q goto L1
t1 = 0
goto L2
L1: t1 = 1
L2:
Triples for the same expression a = (x + y) * (y + z) + (x + y + z):
1. Compute x + y:
( + , x , y ) --> t1
2. Compute y + z:
( + , y , z ) --> t2
3. Multiply the results:
( * , t1 , t2 ) --> t3
4. Compute x + y again:
( + , x , y ) --> t4
5. Add z:
( + , t4 , z ) --> t5
6. Add the two partial results into a:
( + , t3 , t5 ) --> a
Complete Triples:
( + , x , y ) --> t1
( + , y , z ) --> t2
( * , t1 , t2 ) --> t3
( + , x , y ) --> t4
( + , t4 , z ) --> t5
( + , t3 , t5 ) --> a
Peephole optimization involves making small local improvements in the intermediate code to reduce its size or improve
its efficiency. Common techniques include eliminating redundant expressions, constant folding, and combining
consecutive operations.
Let's take the following code snippets and apply three peephole optimization techniques:
t1 = a + b
t2 = t1 * c
t3 = t2 + d
t4 = t3 * 2
• We observe that t2 is computed as t1 * c, and then t3 is computed as t2 + d. The value of t2 is used only in t3
and can be combined directly into the expression for t3:
• t3 = (a + b) * c + d
• t4 = t3 * 2
t3 = (a + b) * c + d
t4 = t3 * 2
t1 = 5 * 2
t2 = t1 + 3
t3 = t2 * 4
• In this snippet, 5 * 2 is a constant expression that can be computed at compile time. Similarly, t2 = t1 + 3 can be
optimized by substituting the constant value of t1:
• t1 = 10
• t2 = 10 + 3
• t3 = (10 + 3) * 4
t1 = 10
t2 = 13
t3 = 52
t1 = a + b
t2 = t1 * 1
t3 = t2 + c
• The operation t2 = t1 * 1 is an identity operation (multiplying by 1). It can be eliminated entirely, simplifying the
code:
• t2 = a + b
• t3 = t2 + c
t2 = a + b
t3 = t2 + c
Summary of Peephole Optimizations:
1. Redundant Expression Elimination combines expressions that do not need intermediate variables, reducing the
overall code.
2. Constant Folding computes constant expressions at compile time, improving performance by removing
unnecessary calculations at runtime.
3. Identity Elimination removes operations that do not change the result, simplifying the code and making it more
efficient.
By applying these peephole optimizations, we reduce unnecessary operations, making the code more efficient both in
terms of execution time and resource utilization.
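As a hedged sketch of constant folding over straight-line three-address code (the tuple instruction format is an assumption for the example):

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def constant_fold(block):
    env, out = {}, []                      # env: temps known to be constant
    for dest, op, lhs, rhs in block:
        lhs, rhs = env.get(lhs, lhs), env.get(rhs, rhs)
        if isinstance(lhs, int) and isinstance(rhs, int):
            env[dest] = OPS[op](lhs, rhs)  # computable at compile time
            out.append((dest, "=", env[dest], None))
        else:
            out.append((dest, op, lhs, rhs))
    return out

block = [("t1", "*", 5, 2), ("t2", "+", "t1", 3), ("t3", "*", "t2", 4)]
for instr in constant_fold(block):
    print(instr)   # t1 = 10, t2 = 13, t3 = 52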
1. (a) Definitions
(i) Compiler:
A compiler is a program that translates the high-level source code written in a programming language (such as C, Java,
etc.) into machine code, assembly language, or an intermediate code that a computer's processor can understand and
execute. The compiler typically works in a multi-stage process to analyze the syntax and semantics of the source code,
and then it generates the corresponding machine-level code.
Example:
• A C compiler translates C programs into machine code that can be executed by a CPU.
(ii) Interpreter:
An interpreter is a type of program that directly executes instructions written in a programming or scripting language.
Unlike a compiler, which generates machine code all at once, an interpreter reads and executes the source code line-by-
line. It doesn't produce an intermediate machine code file.
Example:
• The Python interpreter executes Python code line-by-line when the program is run.
(iii) Assembler:
An assembler is a program that translates assembly language (a low-level programming language that uses mnemonics
for machine instructions) into machine code. It converts human-readable assembly instructions into the binary code that
the computer can execute.
Example:
• An assembler converts instructions like MOV AX, 5 into machine code instructions that are executed by the CPU.
A compiler performs several phases of analysis and synthesis to convert a source program into an executable machine
code. Here are the main phases:
1. Lexical Analysis:
o This phase breaks the source code into tokens. A lexical analyzer (lexer) reads the source code and divides it into basic elements (keywords, operators, identifiers, etc.).
o Output: Tokens like keywords (int, float), operators (+, -), and identifiers (x, y).
2. Syntax Analysis:
o This phase takes the tokens produced by the lexer and checks their syntactical correctness using the grammar of the programming language.
o The output of this phase is a parse tree or abstract syntax tree (AST).
3. Semantic Analysis:
o After syntax analysis, the compiler ensures that the program has valid semantics. For instance, it checks
for type mismatches, undeclared variables, and correct usage of operators.
o Output: Annotated syntax trees that reflect the correct semantic rules.
4. Intermediate Code Generation:
o In this phase, the compiler translates the AST or syntax tree into an intermediate code, which is independent of both the source language and the target machine.
5. Optimization:
o The optimization phase tries to improve the intermediate code to make it more efficient in terms of
execution time and memory usage. This includes removing redundant operations, simplifying
expressions, etc.
6. Code Generation:
o This phase converts the intermediate representation into the target machine code or assembly language,
specific to the architecture of the machine on which the program will run.
7. Code Optimization:
o This is an additional optimization phase focused on improving the efficiency of the generated machine
code.
8. Linking and Assembly:
o The final phase links all modules and generates the final executable code.
1. (c) Bootstrapping
Bootstrapping is the process of writing a compiler (or interpreter) for a programming language in that same language. It
is the process by which a simple, initial version of the compiler is developed, and then used to compile more complex
versions of itself.
Example:
• A simple compiler for a programming language is first written in a low-level language (e.g., C), and then it is used
to compile a more sophisticated version of itself.
Advantages of Bootstrapping:
• The language becomes capable of compiling itself, demonstrating its maturity and expressiveness.
• Improvements to the compiler (e.g., better optimizations) automatically benefit future versions of the compiler itself.
Given CFG:
6. P → QR
First() Function:
The First() function of a non-terminal is the set of terminals that appear at the beginning of any string derived from that
non-terminal.
• For W:
o Hence, First(W) = { i, e }.
• For Y:
• For T:
o Hence, First(T) = { -, V }.
• For U:
• For V:
• For P:
Follow() Function:
The Follow() function of a non-terminal is the set of terminals that can appear immediately to the right of that non-
terminal in some sentential form.
• For W:
• For Y:
• For T:
• For U:
• For V:
1. For each non-terminal, find the First() set for each of its productions.
2. Ensure that there are no conflicts: If two productions of a non-terminal start with the same terminal, the
grammar is not LL(1).
3. If a non-terminal can derive ε, check that the First() set of the ε-production and the Follow() set of the non-terminal do not overlap.
Applying these checks:
1. First(W) = { i, e }, so there is no conflict between W → iXTY and W → e.
2. First(Y) = { b, ε }, so we check whether Follow(Y) overlaps with First(Y); there is no conflict, as Follow(Y) = { $ }.
A LL(1) parser uses a predictive parsing technique that works from left to right on the input, and chooses the production
rules based on the lookahead symbol.
o If it’s a non-terminal, consult the parsing table using the non-terminal and lookahead symbol to decide
the production rule.
• If we match y with the lookahead symbol (the first symbol in the input yywy), we move to the next symbol.
13. (a) Canonical LR(0) Collection of Items for the Provided CFG
Grammar:
• S → T
• T → (F) / id
• F → T
Let's compute the Canonical LR(0) Collection of Items for this grammar. A canonical LR(0) item is a production with a dot (.) indicating the current position in the rule.
Step 1: Augment the grammar with a new start symbol S':
• S' → S
• S → T
• T → (F) / id
• F → T
We begin by creating the initial item set, I0, which contains the start item for the augmented grammar:
• S' → .S
Taking the closure of this item yields:
1. S' → .S
2. S → .T
3. T → .(F)
4. T → .id
5. F → .T
The closure operation expands the items by adding any new items that can be inferred from the current set of items:
1. For S' → .S: Add S → .T.
2. For S → .T: Add T → .(F) and T → .id.
3. For T → .(F): The dot precedes the terminal (, so no new items are added.
4. For T → .id: No further items are added because id is a terminal.
Step 4: Transition and item sets
Now we perform the transitions on various symbols (non-terminals and terminals) to create new item sets.
• Transition on S:
o I1: S' → S.
• Transition on T:
o I2: S → T.
• Transition on (:
o I3: T → (.F), together with the closure items F → .T, T → .(F), and T → .id.
• Transition on id:
o I4: T → id.
• I0:
o S' → .S
o S → .T
o T → .(F)
o T → .id
o F → .T
• I1:
o S' → S.
• I2:
o S → T.
• I3:
o T → (.F) (plus closure items)
• I4:
o T → id.
Analysis of States:
The items in each state represent the set of production rules and the position of the parsing head (.). Each state
describes the possible configurations during the parsing process. The collection of LR(0) items captures all the states of
the parser as it processes the input string.
13. (b) Construct the LR(0) Parsing Table
1. Action Table: This table contains actions (shift, reduce, accept, or error) based on the current state and the
lookahead symbol.
2. Goto Table: This table contains transitions to new states based on non-terminal symbols.
1. State I0:
o On id, shift to I4; on (, shift to I3.
2. State I1:
o On $, accept the input (since we reached the end of the production S' → S).
3. State I2:
o Reduce by S → T (in LR(0), a completed item reduces on every lookahead).
4. State I3:
o On id, shift to I4.
5. State I4:
o Reduce by T → id.
Action Table:
State | id            | (             | )             | $
I0    | Shift I4      | Shift I3      |               |
I1    |               |               |               | Accept
I2    | Reduce S → T  | Reduce S → T  | Reduce S → T  | Reduce S → T
I3    | Shift I4      |               |               |
I4    | Reduce T → id | Reduce T → id | Reduce T → id | Reduce T → id
Goto Table:
State | S  | T  | F
I0    | I1 | I2 |
I1    |    |    |
I2    |    |    |
I3    |    | I2 | I4
I4    |    |    |
The dot (.) symbol in an LR(0) item indicates the position of the parser's head within a production rule. It shows whether
the parser has processed part of the rule or is still waiting to process more symbols.
• Example:
o In S' → .S, the dot before S indicates that the parser is expecting the S non-terminal.
o In T → id., the dot after id indicates that the parser has successfully parsed id and is ready to reduce by T → id.
In LR parsing, we often augment the grammar by adding a new start symbol, S′S', to ensure that the parser reaches an
accept state when the input string is completely processed.
o S' → S
o S → T
o F → T
This grammar augmentation helps the parser identify when it has successfully parsed the entire input string and can
move to an accept state.
Conclusion
LR(0) parsing provides a systematic approach to constructing a parser for context-free grammars. The use of canonical
item sets, action and goto tables, and the dot symbol allows the parser to efficiently recognize and process input based
on the grammar's rules. By augmenting the grammar and following the state transitions, LR(0) parsers can handle many
types of programming languages with deterministic parsing behavior.
A context-free grammar (CFG) is said to be ambiguous if there exists at least one string that can be generated in multiple
ways, leading to more than one distinct parse tree or derivation for that string. Ambiguity in a CFG can make it difficult to
determine the intended structure or meaning of an expression. This issue can arise in languages where different
constructs may appear to be similar but should be interpreted differently based on context.
1. Multiple Parse Trees: A single input string may have more than one parse tree, causing confusion about the
meaning or structure of the program.
2. Unpredictable Parsing Behavior: An ambiguous grammar can lead to non-deterministic behavior in parsers, as
the parser may choose between multiple derivations.
3. Semantic Interpretation: Ambiguous grammars make it difficult to interpret or compute meanings because
different parse trees may correspond to different interpretations.
Example of Ambiguity:
Consider the grammar G:
1. A → A A
2. A → (A)
3. A → a
and the string a(a)aa.
Derivation 1 (grouping to the right):
o A ⇒ AA ⇒ aA ⇒ aAA ⇒ a(A)A ⇒ a(a)A ⇒ a(a)AA ⇒ a(a)aA ⇒ a(a)aa
Derivation 2 (grouping to the left):
o A ⇒ AA ⇒ Aa ⇒ AAa ⇒ Aaa ⇒ AAaa ⇒ A(A)aa ⇒ A(a)aa ⇒ a(a)aa
Since the string a(a)aa can be derived with two structurally different parse trees, the grammar G is ambiguous.
Conclusion:
The ambiguity arises because the concatenation rule A → AA allows multiple groupings of the substrings derived for a and (a). The different parse trees lead to different structural interpretations of the same string.
04. (b) Construct a Non-Deterministic Finite Automaton (NFA) for Strings Ending with '1'
We are tasked with constructing an NFA that recognizes the set of strings ending with the symbol '1', over the alphabet Σ = {0, 1}.
• Alphabet: Σ = {0, 1}
• Transitions:
o δ(q0, 0) = {q0}
o δ(q0, 1) = {q0, q1} (nondeterministic: on '1', the NFA may stay in q0 or guess that the input is ending)
o δ(q1, 0) = {q0}
o δ(q1, 1) = {q0}
State diagram: q0 loops on 0 and 1, with an additional transition to the accepting state q1 on 1.
• State q0 is the initial state, compatible with any string of 0s and 1s read so far.
• When we read a '1', the NFA may transition to state q1, which accepts strings that end with '1'.
Explanation:
• The NFA can stay in state q0 until the final '1' is encountered.
• On reading a '1', the NFA may move to state q1 and accepts the string if the input ends there.
04. (c) Construct a Deterministic Finite Automaton (DFA) for Strings Where the 3rd Symbol from the Right is 'a'
We are tasked with constructing a DFA that recognizes the set of strings in which the 3rd symbol from the right is 'a'.
• Alphabet: Σ = {a, b}
• Idea: a DFA cannot look ahead, so it must remember the last three symbols read; this requires one state for each of the 2^3 = 8 possible three-symbol windows.
• Transitions: q_xyz --s--> q_yzs for each symbol s ∈ {a, b} (the window shifts by one on every input symbol).
Explanation:
• Each state q_xyz records the last three symbols xyz (initially padded with b's).
• On each new symbol s, the DFA shifts its memory window from xyz to yzs.
• A state is accepting exactly when the first symbol x of its window, i.e., the 3rd symbol from the right so far, is 'a'.
An SLR(1) parser uses Simple LR(1) parsing, which constructs the parsing table based on first and follow sets, ensuring
that the lookahead symbol can uniquely guide the parser's decisions.
1. The First and Follow sets must not lead to any conflicts when constructing the parsing table.
2. For any state, if there is a reduction by a production in the action table, the lookahead symbol must not be in the
First set of any other production in that state.
1. Compute First and Follow Sets: For each non-terminal, calculate the First and Follow sets.
2. Construct the Action Table: Based on the First and Follow sets, determine shift, reduce, or accept actions for
each state and lookahead symbol.
3. Check for Conflicts: Ensure that there are no conflicts in the parsing table for a given state and lookahead
symbol.
Example:
For the grammar Z → T + F, check whether it is SLR(1) by calculating the First and Follow sets and ensuring no conflicts arise in the parsing table.
CLR(1) parsing (Canonical LR(1)) is a more powerful parsing technique compared to SLR(1). It resolves conflicts by using
context-sensitive information derived from the parsing stack and lookahead symbol.
1. The LR(1) parsing table must not contain any conflicts in the actions based on the lookahead symbol.
2. Canonical item sets are used for better conflict resolution by considering all possible shifts and reductions.
Steps for CLR(1) Parsing Table Construction:
1. Build LR(1) Items: Pair each LR(0) item with a lookahead symbol.
2. Compute the Canonical Collection: Apply closure and goto operations over the LR(1) items.
3. Construct the Parsing Table: Based on the Canonical Collection, create the action and goto tables.
4. Ensure No Conflicts: Check the action table for any conflicting shift/reduce or reduce/reduce entries.
Aspect                     | CLR(1)                                    | LR(0)
Parsing Power              | Can handle a broader class of grammars.   | Limited to simpler grammars.
Parsing Table Construction | Uses canonical item sets with lookahead.  | Relies only on simple state transitions.
Handling of Conflicts      | Can resolve conflicts using more context. | Cannot resolve conflicts without extra mechanisms.
CLR(1) parsing is more powerful than LR(0) as it can handle more complex grammars by using the lookahead symbol
effectively, while LR(0) is restricted to simpler grammars without lookahead.
LALR(1) parsing combines the simplicity of LR(1) with optimization techniques to handle larger grammars efficiently. It
involves merging similar states in the LR(1) parsing table that share the same set of core items.
1. The LALR(1) parser merges identical core states of LR(1) items, reducing the size of the parsing table.
2. There should be no shift/reduce or reduce/reduce conflicts after merging similar core states.
1. Build the CLR(1) states: Construct the full canonical collection of LR(1) item sets.
2. Merge core states: Merge states that have identical core items to reduce the number of states.
3. Ensure No Conflicts: After merging, ensure there are no conflicts in the LALR(1) parsing table.
LALR(1) Parsing plays a crucial role in compiler design because it combines the simplicity and efficiency of LR(1) parsing
with optimizations that reduce the size of the parsing table. This makes it an ideal choice for compilers handling large or
complex grammars without consuming excessive memory or processing time.
1. Table Size: LALR(1) keeps most of the parsing power of LR(1) while producing a table comparable in size to an SLR(1) table.
2. Practical Usage: While LR(1) can handle more complex grammars, its tables are often too large for practical use in real-world compilers. LALR(1) reduces this complexity, making it a better choice for many compilers, especially for programming languages that involve complex expressions and control flow.
3. Error Handling: LALR(1) parsers are less prone to conflicts than other parsing methods (like SLR(1)), offering
better error detection and more predictable parsing behavior.
4. Compatibility: LALR(1) parsing is widely supported in modern compilers. Many tools like Yacc and Bison use
LALR(1) parsing because it strikes a good balance between computational efficiency and parsing capability.
Real-World Example:
Compilers for C/C++ often use LALR(1) parsing techniques because the syntax of these languages is complex yet
manageable with LALR(1). This allows the compiler to handle expressions, loops, and function calls efficiently without
the overhead of a more complex parser like CLR(1).
1. Compact Parsing Table: By merging states with identical cores in the LR(1) item set, LALR(1) reduces the size of
the parsing table.
2. Predictive Parsing: LALR(1) has a single lookahead symbol for making parsing decisions, which simplifies error
recovery and reduces ambiguity during parsing.
3. Lower Memory Usage: Compared to CLR(1), LALR(1) consumes less memory, making it ideal for large
applications like compilers and interpreters.
Left Recursion occurs in a CFG when a non-terminal can derive a string starting with itself. This creates an infinite loop in
recursive descent parsers. We can eliminate left recursion by transforming the CFG using a standard method.
Given CFG:
1. S → S + A
2. S → A
Step-by-Step Transformation:
1. Identify the left recursion:
o S → S + A is left-recursive because the non-terminal S appears as the first symbol on the right side of its own production.
2. Remove left recursion by introducing a new non-terminal S' and transforming the grammar:
o Define S' to handle the recursion: S' → + A S' / ε.
Transformed Grammar:
S → A S'
S' → + A S' / ε
This transformation removes left recursion and makes the grammar suitable for top-down parsing.
To convert an NFA to a regular expression, we use the state elimination method, which iteratively eliminates states while
adjusting the transitions to preserve the language recognized by the NFA.
Example NFA:
Assume we have an NFA with states q0, q1, q2 and alphabet Σ = {0, 1}, where q0 moves to q1 on '0', q1 loops on '0' and '1', and q1 moves to the accepting state q2 on '1'.
Steps:
1. Eliminate intermediate states one at a time, replacing their incoming and outgoing transitions with edges labeled by regular expressions.
2. After eliminating states and adjusting the edges, the resulting regular expression is the combination of transitions that allows the NFA to accept the same strings.
After applying the state elimination method to the above NFA, the regular expression is:
0 (0 + 1)* 1
This regular expression matches any string that starts with '0', ends with '1', and can have any combination of '0' and '1' in the middle.
Code Optimization is a technique used in compilers to improve the performance of the generated code, making it more
efficient and faster without changing its output.
• Dead code refers to parts of the code that are never executed. This could be due to redundant calculations,
conditional branches that never execute, or variables that are never used.
• Dead code elimination is the process of removing such code to reduce program size and improve execution
time.
Example:
Consider the following code snippet:
int a = 5;
int b = 10;
int c = a + b; // c is never used afterwards
int d = 20;
The calculation of c is redundant because c is never used. By applying dead code elimination, the code can be simplified to:
int a = 5;
int b = 10;
int d = 20;
This reduces the execution time and improves resource utilization by removing unnecessary calculations.
Three Address Code (TAC) is a form of intermediate code in which each instruction performs a single operation. Each
operation has at most one operator and two operands.
Example: for the statement x = (b + c) / d:
t1 = b + c
t2 = t1 / d
x = t2
Example: for the statement y = b * c * d + y:
• Step 1: t1 = c * d
• Step 2: t2 = b * t1
• Step 3: y = t2 + y
t1 = c * d
t2 = b * t1
y = t2 + y
1. Loop Invariant Code Motion: Moving calculations that do not change within the loop outside the loop.
2. Strength Reduction: Replacing expensive operations like multiplication by a constant with simpler operations like
addition or bit shifts.
Optimization Steps:
1. Loop Invariant Code Motion: Move any subexpression whose operands do not change inside the loop (e.g., a product of loop-invariant variables) outside the loop, so it is computed once.
2. Strength Reduction: Replace the remaining multiplication by the loop index with an addition that updates a running value on each iteration.
This transformation reduces the number of computations inside the loop, improving performance by removing unnecessary multiplicative operations.
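Since the original loop is not shown, here is a hedged Python illustration of both techniques on an assumed loop that recomputes x * y and x * i on every iteration:

def original(x, y, n):
    out = []
    for i in range(n):
        out.append(x * y + x * i)     # x * y is loop-invariant; x * i grows linearly
    return out

def optimized(x, y, n):
    out = []
    invariant = x * y                 # loop-invariant code motion: hoisted once
    xi = 0                            # running value of x * i
    for i in range(n):
        out.append(invariant + xi)
        xi += x                       # strength reduction: add instead of multiply
    return out

assert original(3, 4, 5) == optimized(3, 4, 5)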
1. Step 1: t1 = b - c
2. Step 2: t2 = x * t1
3. Step 3: t3 = t2 / d
t1 = b - c
t2 = x * t1
t3 = t2 / d
result = t3
This results in a set of instructions that performs the desired computation in intermediate code, and can then be further
translated into target machine code.