Bottom-Up LR Parsing
17-363/17-663: Programming Language Pragmatics
Reading: PLP section 2.3
Copyright © 2016 Elsevier
Prof. Jonathan Aldrich
Top-Down vs. Bottom-Up Parsing
• Top-Down/LL Parsing Intuition
program Start trying to parse a program
stmt_list $$$ Based
Start trying
on lookahead,
to parse arefine
program
to stmt_list
then to stmt stmt_list
stmt stmt_list $$$
Stack tracks predicted future parsing
...
• Bottom-Up/LR Parsing Intuition
read A Start by shifting a few tokens
stmt Reduce tokens to a stmt, thentotoa stmt_list
stmt, then a stmt_list
stmt_list Continue
Continue to
tokens
to shift
shift and
and reduce
reduce tokens
tokens
tokens to
to recognize
recognize another
another stmt
stmt
stmt_list read B Stack shows what constructs
stmt_list stmt have been recognized so far
Example Program and SLR(1) Grammar
read A
read B
sum := A + B
write sum
write sum / 2
Modeling a Parse with LR Items
• Initial parse state captured by an item
– includes start symbol, production, and current location
• What we see next might be inside stmt_list
– So we expand stmt_list and get a set of items:
Modeling a Parse with LR Items
• We can likewise expand stmt to get the item set:
• This is an SLR parser state
– We’ll call it state 0
Modeling a Parse with LR Items
• Our starting stack has state 0 on it:
0
• Input: read A read B …
• From state 0, we shift read onto the stack and
move to state 1:
0 read 1
• State 1 represents the following item:
Modeling a Parse with LR Items
• stack / item: 0 read 1
• input: A read B …
• From state 1, we shift id onto the stack
• stack / item: 0 read 1 id 1’
• input: read B …
• Now we reduce to stmt, and put stmt into the input
• stack / item: 0
• input: stmt read B …
Modeling a Parse with LR Items
• stack / item: 0
• input: stmt read B …
• We now shift stmt
• stack / item: 0 stmt 0’
• input: read B …
• Next we reduce to stmt_list
• stack / item: 0
• input: stmt_list read B …
Modeling a Parse with LR Items
• stack / item: 0
• input: stmt_list read B …
• Now we shift stmt_list
• stack / item: 0 stmt_list 2
• input: read B …
The Characteristic Finite State
Machine (CFSM)
There are also shift-reduce actions. So our states 0’, 1’ aren’t shown
here: they are “in between” states within a shift-reduce action
The CFSM as a Table
A Detailed Explanation of the CFSM
A Detailed Explanation of the CFSM
A Detailed Explanation of the CFSM
Exercise: LR Parsing
• Assume you are in parsing state 0
and the token stream is write sum / 2
• Show how the parse stack changes as the token
stream is consumed
• We’ll do the first action together
Parsing if-then-else Statements
• A famous parsing challenge (from Algol) involves if-
then-else, where else is optional:
stmt ::= if exp then stmt
| if exp then stmt else stmt
• Consider the phrase:
if exp then if exp then stmt else stmt
• Which then does the else belong to?
Shift/Reduce Conflicts
• This is a shift-reduce conflict
if exp then if exp then stmt . else stmt
• When the else appears
• we can shift, treating it as part of the inner if statement, or
• we can reduce the inner if statement,
treating the else as part of the outer if statement
• How to solve?
– Many existing tools prioritize shift over reduce
• This corresponds to the traditional solution to the if problem
Shift/Reduce Conflicts
• This is a shift-reduce conflict
if exp then if exp then stmt . else stmt
• When the else appears
• we can shift, treating it as part of the inner if statement, or
• we can reduce the inner if statement,
treating the else as part of the outer if statement
• How to solve?
– Many existing tools prioritize shift over reduce
– You can declare productions with precedence
• E.g. giving the if-then-else production higher precedence
than the if-then production
Shift/Reduce Conflicts
• This is a shift-reduce conflict
if exp then if exp then stmt . else stmt
• When the else appears
• we can shift, treating it as part of the inner if statement, or
• we can reduce the inner if statement,
treating the else as part of the outer if statement
• How to solve?
– Many existing tools prioritize shift over reduce
– You can declare productions with precedence
– Rewrite the grammar to make it LR(1)
An LR(0) If-Then-Else Grammar
stmt → balanced_stmt | unbalanced_stmt
balanced_stmt → if cond then balanced_stmt
else balanced_stmt
| other_stuff
unbalanced_stmt → if cond then stmt
| if cond then balanced_stmt
else unbalanced_stmt
Invariant: balanced_stmts may be inside unbalanced_stmts
– but not vice versa
Unfortunately this grammar is LR(0) but not LL(0)
– Have to use precedence in LL parsers
or custom code in a recursive-descent parser
Connections to Theory
• A scanner is a Deterministic Finite Automaton (DFA)
– it can be specified with a state diagram
• An LL or LR parser is a Pushdown Automaton (PDA)
– a PDA can be specified with a state diagram and a stack
• the state diagram looks just like a DFA state diagram, except the arcs
are labeled with <input symbol, top-of-stack symbol> pairs, and in
addition to moving to a new state the PDA has the option of pushing
or popping a finite number of symbols onto/off the stack
• For LL(1) parsers the state machine has only two states:
processing and accepted
• All the action is in the input symbol and top of stack
• LR(1) parsers are richer (and more expressive)
Error Reporting
• Error reporting is relatively simple
• If you get a token for which there’s no entry in the
current parsing state / top of stack element, signal an
error
• Can tell the user what tokens would be OK here
Error Recovery
• Nice to report more than one error to the user
• Rather than stopping after the first one
• Simple idea: Panic mode
• In C-like languages, semicolons are good recovery spots
• So on an error:
• read tokens until you get to a semicolon
• discard the parser’s stack (predictions in an LL parser, states in an LR
parser) until you come to a production that has a semicolon
• assume you’ve parsed the semicolon-containing construct,
and continue parsing
• There are ways to do substantially better – see the online
supplement to the textbook
Other Parsing Tools
• Generalized LR (GLR) parser generators
• Accept any grammar – even ambiguous ones!
• This can be good if you have grammars written by nonexperts, as in
SASyLF
• But for a compiler-writer it is dangerous—you may not even know
your grammar is ambiguous, and then your poor users get ambiguity
errors when the parser runs
• Works like an LR parser, but on ambiguity considers all
possible parses in parallel
• Still O(n) if the grammar is LR (or “close”)
Other Parsing Tools
• Parsing Expression Grammar (PEG) parser generators
• Sidestep ambiguity by always favoring the first production
• Same danger as GLR parsers – you may not know your
grammar is ambiguous
• Still used some in practice (e.g. in Python)
• About as efficient as LL or LR in practice
• Like LR, PEG grammars can be cleaner than LL grammars
• Requires extreme care to get right – must think algorithmically
instead of declaratively
• Guido van Rossum, the developer of Python, saw this as an advantage