COMPILER DESIGN
UNIT-3
Bottom-Up Parsing
(also known as Shift-Reduce Parsing):
○ Concept: This parsing strategy builds a parse tree from the
leaves (input tokens) up towards the root (the start symbol of
the grammar). It processes the input string from left to right,
trying to find substrings that match the right-hand side (RHS) of
a grammar production and then replacing them with the
corresponding left-hand side (LHS) non-terminal. This process is
called a reduction.
○ Analogy: Imagine trying to assemble a complex toy by starting
with individual small pieces and joining them together to form
larger sub-assemblies, eventually building the complete toy.
Reductions:
○ Concept: A reduction is the core operation in bottom-up parsing.
It involves identifying a handle (a specific substring in the
input/stack) that matches the right-hand side of a grammar
production, and then replacing that handle with the non-terminal
symbol on the left-hand side of that production.
○ Example: If you have a grammar production E -> E + T and
in your input/stack you see E + T, you would reduce E + T to
E. This is essentially building a subtree (the E + T part) and
attaching it to its parent node (E).
Handle Pruning:
○ Correction/Clarification: The term commonly used is "finding
a handle" and then performing a "reduction" on it. While
"pruning" literally means trimming or cutting away, in the
context of parsing, the act of identifying a handle and replacing it
effectively "prunes" the right-hand side symbols from the current
working string or stack, replacing them with a single non-
terminal. So, it's more about identifying and reducing a handle
rather than "pruning" in a sense of removal without replacement.
○ Handle Definition: A handle is a substring within the sentential
form (the string of terminals and non-terminals that represents the
current state of parsing) that satisfies two conditions:
■ It matches the right-hand side of a production rule in the
grammar.
■ Its reduction to the non-terminal on the left-hand side of
that production represents one step in the reverse of a
rightmost derivation.
Shift-Reduce Parser:
○ Concept: This is the most common type of bottom-up parser. It
operates using two primary actions, managed by a stack and an
input buffer:
■ Shift: The parser moves the next input symbol from the
input buffer onto the top of the stack. This action
corresponds to extending the current "prefix" of the parse
tree that is being built.
■ Reduce: When the parser identifies a handle at the top of
the stack, it performs a reduction. It pops the symbols
forming the handle from the stack and pushes the
corresponding non-terminal (LHS of the production rule)
onto the stack. This signifies the completion of a subtree.
○ Other Actions: Besides Shift and Reduce, a shift-reduce parser
also has:
■ Accept: If the stack contains only the start symbol and the
input buffer is empty, the parse is successful.
■ Error: If no Shift or Reduce action is possible, and the
input is not yet accepted, an error is detected.
In summary, a Shift-Reduce Parser is a Bottom-Up Parser that
incrementally builds a parse tree by repeatedly shifting input symbols onto a
stack and then reducing identified handles on the stack into their
corresponding non-terminals, until the start symbol is reached.
Shift-Reduce parsing is a powerful bottom-up parsing technique, but it faces
certain challenges, primarily related to ambiguity in the grammar, which can
lead to conflicts during the parsing process.
Problems Related to Shift-Reduce Parsing: Parsing
Conflicts
The main problems in Shift-Reduce parsing arise when the parser, at a given
state and with the next input symbol (lookahead), has more than one valid
action it can take. These situations are known as parsing conflicts:
1. Shift-Reduce Conflict (S/R Conflict):
○ Problem: The parser is in a state where it has a choice between
shifting the next input symbol onto the stack OR reducing a
handle on the stack using a grammar production rule.
○ Reason: This often happens with ambiguous grammars,
particularly concerning operator precedence and the "dangling
else" problem (where an else can belong to multiple if
statements). For example, in an arithmetic expression like id +
id * id, when the parser sees id + id on the stack and * as
the next input, it faces a dilemma:
■ Should it reduce id + id to an expression (e.g., E ->
E + E)? This would imply that + has higher precedence
than *.
■ Or should it shift the * onto the stack? This would imply
that * has higher precedence than +.
○ Resolution: For many programming languages, a default rule is
to shift in case of a Shift-Reduce conflict, respecting the common
"longest match" rule or specific operator precedence/associativity
rules.
2. Reduce-Reduce Conflict (R/R Conflict):
○ Problem: The parser is in a state where it has identified a handle
on the stack that matches the right-hand side (RHS) of two or
more different production rules, and it doesn't know which rule
to use for the reduction.
○ Reason: This indicates a more serious ambiguity in the grammar
itself. For example, if a grammar has rules like A -> alpha
and B -> alpha, and alpha is found on the stack, the parser
doesn't know whether to reduce alpha to A or to B.
○ Resolution: Reduce-Reduce conflicts are generally considered
harder to resolve and usually require rewriting the grammar to
eliminate the ambiguity, as a default action might lead to
incorrect parse trees.
These conflicts mean the grammar is ambiguous for the parser, meaning a
given input string could have more than one possible parse tree. More
advanced Shift-Reduce parsers (like LR parsers, such as SLR, LALR, and
Canonical LR) use techniques like lookahead symbols and parsing tables to
resolve some of these conflicts, but not all grammars can be parsed without
conflicts.
Operator Precedence Parser
The Operator Precedence Parser is a specific type of Shift-Reduce Parser
designed to handle a particular class of ambiguous grammars: those involving
arithmetic expressions where operator precedence and associativity rules are
crucial.
● Purpose: It's explicitly built to resolve the ambiguities (often leading to
Shift-Reduce conflicts) that arise from the precedence (e.g.,
multiplication before addition) and associativity (e.g., left-to-right for
subtraction) of operators.
● How it works: It uses a precedence table that defines
precedence relations (<⋅, ≐, $ \cdot >$) between pairs
of terminal symbols (operators and delimiters).
○ If the precedence of the current terminal on the stack is less than
the next input terminal (< \cdot), the parser shifts the input.
○ If the precedence is equal (\doteq), it shifts (e.g., between
parentheses).
○ If the precedence of the current terminal on the stack is greater
than the next input terminal (\cdot >), it means a handle has
been found, and the parser performs a reduction.
● Advantages: It's relatively simple to implement by hand and effective
for parsing arithmetic and Boolean expressions.
● Limitations:
○ It is not a general-purpose parser; it works only for "operator
grammars" which have strict restrictions (e.g., no ϵ-productions,
no two adjacent non-terminals on the RHS of a production).
○
○ It cannot handle all types of grammars or complex language
constructs beyond simple expressions.
○ It can struggle with unary operators unless specifically handled.
In essence, while general Shift-Reduce parsers face ambiguity problems
leading to conflicts, the Operator Precedence Parser is a specialized Shift-
Reduce technique that leverages explicit operator precedence and
associativity rules to deterministically parse a specific, but very common,
type of language construct: expressions.
LEADING and TRAILING
These are sets of terminal symbols used specifically in the construction of
Operator Precedence Parsers. They help define the precedence relations
between operators.
● LEADING(A): For a non-terminal A, LEADING(A) is the set of all
terminal symbols a such that a can be the firstterminal symbol in any
string that can be derived from A.
○ Example: If you have a rule E -> T + E and T can derive id,
then id would be in LEADING(E). If T can also derive (E),
then ( would also be in LEADING(E).
● TRAILING(A): For a non-terminal A, TRAILING(A) is the set of all
terminal symbols a such that a can be the lastterminal symbol in any
string that can be derived from A.
○ Example: If you have a rule E -> E + T and T can derive id,
then id would be in TRAILING(E). If T can also derive (E),
then ) would also be in TRAILING(E).
These sets are crucial for determining the precedence
relations (<⋅, ≐, ⋅>) between operators, which in turn guide
the Shift-Reduce decisions of an Operator Precedence Parser.
LR Parser
An LR Parser is a powerful type of bottom-up, shift-reduce parser. The
"LR" stands for:
● L: Scans the input string from Left to right.
● R: Constructs a Rightmost derivation in reverse.
● How it works: An LR parser operates using a stack, an input buffer,
and a parsing table. The parsing table has two main parts: an ACTION
table and a GOTO table.
○ Based on the current state (on top of the stack) and the next input
symbol (lookahead), the parser consults the ACTION table to
decide whether to:
■ Shift: Push the current input symbol onto the stack and
move to a new state.
■ Reduce: Pop a certain number of symbols from the stack
(the "handle"), and then consult the GOTO table with the
new top of the stack state and the non-terminal (LHS of the
production rule) to find the next state to push.
■ Accept: If the start symbol is on the stack and the input is
exhausted.
■ Error: If no valid action is found.
○ This process continues until the input is accepted or an error is
detected.
Types of LR Parsers
There are several types of LR parsers, distinguished by how they construct
their parsing tables and how much "lookahead" (number of input symbols
they peek at) they use to make decisions. The higher the power, the larger
and more complex the parsing tables tend to be:
1. LR(0) Parser: The simplest form, using zero lookahead. It often
suffers from Shift-Reduce or Reduce-Reduce conflicts, limiting the
grammars it can parse.
2. SLR(1) Parser (Simple LR): An improvement over LR(0) that uses
one symbol of lookahead to resolve some conflicts, particularly Shift-
Reduce conflicts, by checking the FOLLOW set of the non-terminal to
be reduced.
3. LALR(1) Parser (Look-Ahead LR): This is the most commonly used
LR parser in practice for programming languages. It has the same
number of states as an SLR(1) parser (making its tables reasonably
sized) but uses more sophisticated lookahead propagation to resolve
more conflicts than SLR(1), making it more powerful.
4. CLR(1) Parser (Canonical LR): The most powerful type of LR
parser. It can parse any LR(1) grammar but generates very large
parsing tables, making it less practical for direct implementation
compared to LALR(1).
Need for LR Parsers
LR parsers are widely preferred for parsing programming languages due to
several significant advantages:
1. Power and Generality: They can parse a very large class of context-
free grammars, including almost all grammars used for modern
programming languages, making them much more powerful than LL
(top-down) parsers or simpler Shift-Reduce parsers like Operator
Precedence Parsers.
2. Determinism and Efficiency: LR parsers are deterministic, meaning
they do not require backtracking. This allows them to parse input in
linear time, making them highly efficient.
3. Early Error Detection: An LR parser can detect a syntax error as soon
as it's theoretically possible to do so during a left-to-right scan of the
input string, which is crucial for providing good error messages to
programmers.
4. Automatic Construction: Parsing tables for LR parsers can be
automatically generated by parser generators (like Yacc/Bison),
significantly simplifying the compiler construction process compared to
manual parser writing.
5. Handles Ambiguity: While ambiguous grammars still cause conflicts,
LR parsers have well-defined mechanisms (like using lookahead) to
resolve common ambiguities (e.g., operator precedence and
associativity) by specifying conflict resolution rules.
LR(0) Item
An LR(0) item is a production rule with a special dot (.) placed somewhere
in its right-hand side (RHS). The dot indicates the position of the parser in
recognizing the production.
● Format: A -> α . β
○ A is a non-terminal (LHS).
○ α and β are strings of terminals and/or non-terminals.
○ The dot represents the current parsing position: α has already
been matched on the stack, and the parser expects to see β next in
the input.
● Purpose: Each LR(0) item represents a possible "state" or progress
point in the parser's recognition of a grammar production.
○ A -> . αβ: The parser is at the beginning of recognizing
production A.
○ A -> α . β: The parser has matched α and expects to match
β.
○ A -> αβ .: The parser has completely recognized the
production A (it's a reduce item).
Closure of Item Sets
The closure operation is a crucial step in constructing the states (sets of
LR(0) items) of an LR parser's finite automaton.
● Concept: Given a set of LR(0) items I, its closure, CLOSURE(I),
includes all items in I plus any additional items that must be
recognized as a consequence of items already in I.
● Algorithm:
1. Initialize CLOSURE(I) with all items in I.
2. If A -> α . B β is in CLOSURE(I) (meaning we expect to
see a non-terminal B next) and B -> γ is a production in the
grammar, then add the item B -> . γ to CLOSURE(I), if it's
not already there.
3. Repeat step 2 until no new items can be added to CLOSURE(I).
● Purpose: This ensures that for any given state, the set of items includes
all possible productions that the parser might be recognizing or
expecting to recognize from that point.
Construction of SLR Parsing Table
Constructing an SLR (Simple LR) parsing table involves several steps,
relying heavily on LR(0) items and their closures, along with the FOLLOW
sets of the grammar's non-terminals.
1. Augment the Grammar: Add a new start production S' -> S,
where S' is a new start symbol and S is the original start symbol. This
helps in defining the accept state.
2. Construct the Canonical Collection of LR(0) Item Sets (States):
○ Start with an initial state I₀ = CLOSURE({S' -> . S}).
○ Use the GOTO function to find transitions between states:
GOTO(I, X) is the closure of all items A -> α X . β that
are in I, where X is a grammar symbol (terminal or non-
terminal).
○ Repeatedly apply GOTO to existing item sets for all possible
grammar symbols until no new item sets are generated. Each
resulting set of items forms a state in the LR(0) automaton.
3. Construct the SLR Parsing Table:
○ The parsing table has two parts: an ACTION table and a GOTO
table.
○ ACTION Table (rows = states, columns = terminals):
■ Shift entries: If state I contains A -> α . a β (where
a is a terminal) and GOTO(I, a) = J, then set
ACTION[I, a] = shift J.
■ Reduce entries: If state I contains a complete item A ->
α . (where A ≠ S'), then for every terminal b in
FOLLOW(A), set ACTION[I, b] = reduce A ->
α. This is where SLR uses the FOLLOW set to resolve
potential conflicts.
■ Accept entry: If state I contains S' -> S ., then set
ACTION[I, $] = accept (where $ is the end-of-
input marker).
■ Any undefined entries indicate a syntax error.
○ GOTO Table (rows = states, columns = non-terminals):
■ For each state I and non-terminal A, if GOTO(I, A) =
J, then set GOTO[I, A] = J.
● Handling Conflicts: If an SLR parser encounters a Shift-Reduce or
Reduce-Reduce conflict while constructing the table, the grammar is
not an SLR grammar. LALR or Canonical LR parsers are more
powerful and can handle a larger class of grammars with fewer
conflicts.
Problems Related to SLR Parsers
While SLR (Simple LR) parsers are a significant improvement over LR(0) by
incorporating one-symbol lookahead, they are still not as powerful as desired
for many programming language grammars. Their main problem stems from
their reliance on the general FOLLOW sets for making reduce decisions.
● Over-reliance on FOLLOW Sets: When an SLR parser encounters a
complete item A -> α ., it determines if a reduction can be
performed by checking if the next input symbol (lookahead) is in
FOLLOW(A). The issue is that FOLLOW(A) contains all terminals that
can possibly follow A in any sentential form, not just in the context of
the current parsing state.
● Leads to Unnecessary Conflicts: This can lead to:
1. Shift-Reduce Conflicts: A state might contain both a shift item
(e.g., A -> α . aβ) and a reduce item (e.g., B -> γ .)
where a is in FOLLOW(B). Even if in that specific state a can
only follow A and not B, SLR's general FOLLOW set doesn't
differentiate, leading to a conflict.
2. Reduce-Reduce Conflicts: A state might contain two reduce
items (A -> α . and B -> β .) where FOLLOW(A) and
FOLLOW(B) overlap. If the lookahead is in the intersection of
these FOLLOW sets, the parser faces a conflict.
● Example: Consider a grammar that differentiates between list and
array based on the next symbol (e.g., list -> id ; and array
-> id [ ...). If both id ; . and id [ . can appear in a state,
SLR might get a conflict if FOLLOW(list) and FOLLOW(array)
overlap, even if the actual context of the id could uniquely determine
which production it belonged to.
Construction of Canonical LR(1) Parser
Canonical LR(1) parsers are the most powerful type of LR parser. They
overcome SLR's limitations by incorporating the lookahead directly into the
items themselves.
● LR(1) Item: An LR(1) item is of the form [A -> α . β, a],
where:
○ A -> α . β is an LR(0) item.
○ a is a lookahead terminal symbol. This a represents a terminal
that must follow A if the production A -> αβis used in a
rightmost derivation (or if A -> α . β is reduced to A).
● CLOSURE for LR(1) Items:
○ If [A -> α . B β, a] is an item in the set and B -> γ is a
production:
○ For every terminal b in FIRST(βa) (i.e., the first terminal of β
or a if β is empty), add the item [B -> . γ, b] to the set.
○ This more precise lookahead propagation is the key difference
from LR(0) closure.
● GOTO for LR(1) Items:
○ GOTO(I, X) works similarly to LR(0), but it applies to LR(1)
items. If [A -> α . X β, a] is in state I, then after shifting
X, the new item will be [A -> α X . β, a]. The lookahead
a is carried along.
● Table Construction:
○ Shift actions are determined similarly to SLR, based on GOTO
transitions.
○ Reduce actions: If a state I contains a complete item [A -> α
. , a], then the parser performs a reduction A -> α only if
the current input lookahead is precisely a. This precise lookahead
avoids the conflicts seen in SLR.
Problems Related to Canonical LR(1) Parsers (CLR)
Despite their theoretical power (they can parse any LR(1) grammar),
Canonical LR(1) parsers have a significant practical drawback:
● Too Many States and Large Tables: The fine-grained nature of LR(1)
items, where each item carries a specific lookahead symbol, leads to a
massive number of states in the LR(1) automaton.
○ For typical programming language grammars, the number of
states can be several times larger than for SLR or LALR parsers.
○ This results in very large parsing tables, which consume a lot of
memory and can be computationally expensive to generate and
use.
● Impracticality: Due to the size of the tables, CLR parsers are rarely
implemented directly in practice. They serve more as a theoretical
benchmark and a stepping stone to the more practical LALR(1)
parsers.
LALR(1) parsers are a compromise: they combine states from the CLR(1)
automaton that are identical except for their lookahead sets. This dramatically
reduces the number of states (making table sizes comparable to SLR) while
retaining most of the power of CLR(1) parsers.
LALR Parser
The LALR (Look-Ahead LR) parser is the most widely used type of LR
parser in practice, offering a practical compromise between the power of
Canonical LR(1) and the smaller table sizes of SLR(1).
● Concept: LALR(1) parsers are constructed by first building the full
Canonical LR(1) states. Then, it merges states that have identical
"core" LR(0) items but differ only in their lookahead sets. When
merging these states, their lookahead sets are combined.
● Construction Idea:
○ Identify LR(1) states that have the same set of LR(0) items (i.e.,
their "core" is identical).
○ Merge these states into a single new state in the LALR
automaton.
○ The lookahead sets of the merged items are combined (union
operation).
● Advantages:
○ Table Size: The merging process significantly reduces the
number of states and, consequently, the size of the parsing table,
making it comparable to that of an SLR parser. This is a major
practical advantage over CLR(1).
○ Power: LALR(1) retains most of the parsing power of CLR(1). It
can parse almost all grammars used for real-world programming
languages.
○ Efficiency: Like other LR parsers, it's deterministic and operates
in linear time.
Problems Related to LALR Parsers
While LALR(1) is very powerful and practical, the state-merging process can
sometimes introduce ambiguities that were not present in the original
Canonical LR(1) grammar.
● Introduced Reduce-Reduce Conflicts: This is the primary drawback
of LALR(1) compared to CLR(1). The merging of lookahead sets can
lead to a situation where a single state now contains two different
complete items ([A -> α ., a] and [B -> β ., a]) that both
want to reduce on the same lookahead symbol a. In the separate
CLR(1) states, these might have been distinct, but merging causes a
conflict.
○ Example (Conceptual): Imagine two CLR(1) states: State X:
{[A -> id ., x]} and State Y: {[B -> id .,
y]}. If SLR were used, id might cause a Reduce-Reduce
conflict if x and y were both in FOLLOW(A)and FOLLOW(B).
CLR(1) resolves this because the lookahead tells it whether to
reduce to A or B. In LALR(1),if State X and State Y have
the same LR(0) core (id .), they would be merged into a single
LALR(1) state.This new state would then contain {[A ->
id ., {x, y}]} and {[B -> id ., {x, y}]},
creating a Reduce-Reduce conflict for both x and y as
lookaheads.
● Cannot Handle All LR(1) Grammars: Because of these potentially
introduced Reduce-Reduce conflicts,LALR(1) parsers cannot parse all
LR(1) grammars that a CLR(1) parser could. However, the set of
grammars they can handle is still vast enough for practical compiler
design.
● No New Shift-Reduce Conflicts: It's important to note that LALR(1)
merging does not introduce new Shift-Reduce conflicts that were not
already present in the underlying CLR(1) states.
YACC (Yet Another Compiler Compiler)
YACC is a classic parser generator (a tool that automates parser
construction) that typically generates LALR(1) parsers.It's widely used to
implement the syntax analysis (parsing) phase of compilers and interpreters.
● Functionality:
○ Input: YACC takes a grammar specification (usually in a BNF-
like notation) and associated semantic actions (C or C++ code
snippets) as input. The grammar defines the syntax of the
language, and the semantic actions specify what to do when a
particular grammar rule is recognized (e.g., build a parse tree
node, perform type checking, generate intermediate code).
○ Output: YACC generates C (or C++) source code for an
LALR(1) parser. This generated code includes the parsing tables
(ACTION and GOTO) and a parser driver routine (e.g.,
yyparse()).
○ Integration: The generated parser works in conjunction with a
lexical analyzer (often generated by Lex or Flex) which
provides the tokens from the input stream.
● Relation to LALR: YACC's core mechanism is based on the LALR(1)
parsing algorithm. It builds the LR(0) item sets, computes lookahead
sets, and then performs state merging to construct the compact
LALR(1) parsing tables.
● Conflict Resolution in YACC:
○ When YACC encounters a Shift-Reduce or Reduce-Reduce
conflict in the grammar during table construction,it reports them
to the user.
○ For Shift-Reduce conflicts, YACC has a default policy to prefer
shifting. This often aligns with common language design
principles (e.g., longest match, higher operator precedence).
○ For Reduce-Reduce conflicts, YACC typically defaults to using
the rule that appears earlier in the grammar specification.
○ YACC also allows the user to explicitly define operator
precedence and associativity rules in the grammar file (using
%left, %right, %nonassoc directives) to help it resolve
conflicts related to expressions.
● Importance: YACC (and its GNU counterpart, Bison) revolutionized
compiler construction by automating the complex and error-prone task
of writing a parser, allowing developers to focus on the grammar rules
and semantic actions rather than the intricacies of parser state
management.