Principles of Programming Languages
Syntax Analysis
Outline
Grammar
– Context-free grammar
– Derivation and Derivation Tree
Grammar for Arithmetic Expression
– Operation precedence and associativity
Syntax Analysis
Ambiguity in Grammar
Parser Construction
The Big Picture Again
source
code
Scanner Parser Opt1 Opt2 ... Optn
machine
Instruction Register Instruction code
Selection Allocation Scheduling
COMPILER
Syntax and Grammar
Syntax (programming language sense):
– Define structure of a program
– Not reflect the meaning (semantic) of the program
Grammar:
– Rule-based formalism to specify a language
syntax
Context-Free Grammar (CFG)
A kind of grammar
Not as complex as context-sensitive and
phase-structure grammar
More powerful than regular grammar
Formal Definition of CFG
• G = (VN ,VT,S, P)
• VN: finite set of nonterminal symbols
VT: finite set of tokens (VTVN=)
SVN: start symbol
P: finite set of rules (or productions) of
BNF (Backus – Naur Form) form A
(a)* where A VN, a(VTVN)
Example 1
G = ({exp,op,exp},{+,-,*,/,id})
exp exp op exp
exp id
op +|-|*|/
Derivation
• = uXv derives = uv if X-> is a production
Notation: (directly derive)
* ( ... | = )
+
Derivations: S + where consists of tokens only.
Sentential form: S + is a sentential form
Sentence: S * is a derivation is a sentence
Language: set of all sentences possibly derived
Example 1
exp exp op exp exp op id id op id
id + id
exp exp op exp id op exp id + exp
id + id
exp exp op exp exp op exp op exp
id op exp op exp id + exp op exp id +
exp * exp id + id * exp id + id * id
Example 3
exp exp op exp id op exp id + exp
id + id
exp exp op exp exp op id exp + id
id + id
Derivations as Trees
Internally, in the parser, derivations are implemented as trees
A convenient and natural way to represent a sequence of
derivations is a syntactic tree or parse tree
Example:
Expr
Expr Expr
Op
Number Identifier
*
34 ax
Example 4
exp
exp
Example 4
exp exp op exp
exp
exp op exp
Example 4
exp exp op exp id op exp
exp
exp op exp
id
Example 4
exp exp op exp id op exp id + exp
exp
exp op exp
id +
Example 4
exp exp op exp id op exp id + exp
id + id
exp
exp op exp
id + id
Classic Expression Grammar
exp exp + term | exp – term | term
term term * factor | term /factor |factor
factor ( exp ) | ID | INT
why is this classic expression grammar better
than the previously used one?
Operation Precedence
exp
exp op exp
id + exp op exp
id * id
Operation Precedence
exp
exp op exp
exp op exp * id
id + id
Operation Precedence
exp
exp + term
term term * factor
factor factor id
id id
Operation Precedence
(id+id) *id exp
term
term * factor
factor id
( exp )
Operator Associativity
exp
exp op exp
exp op exp + id
id + id
Operator Associativity
exp
exp op exp
id + exp op exp
id + id
Operator Associativity
exp
exp + term
factor
exp + term
id
term factor
factor id
id
Precedence and Associativity
When properly written, a grammar can
enforce operator precedence and
associativity as desired
Hands-on Excersice
Rewrite the grammar to fulfill the following
requirements:
– operator “*” takes lower precedence than “+”
– operator “-” is right-associativity
Syntactic Analysis
Lexical Analysis was about ensuring that we extract
a set of valid words (i.e., tokens/lexemes) from the
source code
But nothing says that the words make a coherent
sentence (i.e., program)
Syntactic Analysis
Example:
– “for while i == == == 12 + for ( abcd)”
– Lexer will produce a stream of tokens:
<TOKEN_FOR> <TOKEN_WHILE> <TOKEN_IDENT, “i”>
<TOKEN_COMPARE> <TOKEN_COMPARE> <TOKEN_COMPARE>
<TOKEN_NUMBER,”12”> <TOKEN_OP, “+”> <TOKEN_FOR>
<TOKEN_OPAREN> <TOKEN_ID, “abcd”> <TOKEN_CPAREN>
– But clearly we do not have a valid program
– This program is lexically correct, but syntactically
incorrect
A Grammar for Expressions
Expr Expr Op Expr
Expr Number | Identifier
Identifier Letter | Letter Identifier
Letter a-z
Op “+” | “-” | “*” | “/”
Number Digit Number | Digit
Digit 0|1|2|3|4|5|6|7|8|9
What is Parsing?
What we just saw is the process of, starting with the start
symbol and, through a sequence of rule derivation obtain a
string of terminal symbols
– We could generate all correct programs (infinite set though)
Parsing: the other way around
– Give a string of non-terminals, the process of discovering a
sequence of rule derivations that produce this particular
string
What is parsing
When we say we can’t parse a string, we mean
that we can’t find any legal way in which the string
can be obtained from the start symbol through
derivations
What we want to build is a parser: a program that
takes in a string of tokens (terminal symbols) and
discovers a derivation sequence, thus validating
that the input is a syntactically correct program
Ambiguity
We call a grammar ambiguous if a string of
terminal symbols can be reached by two
different derivation sequences
In other terms, a string can have more than
one parse tree
It turns out that our expression grammar is
ambiguous!
Let’s show that string 3*5+8 has two parse
trees
Ambiguity
Expr Expr
Expr Op Expr Expr Op Expr
Expr Op Expr + 8 3 * Expr Op Expr
3 * 5 5 + 8
“left parse-tree” “right parse-tree”
Problems with Ambiguity
The problem is that the syntax impacts meaning (for the later
stages of the compiler)
For our example string, we’d like to see the left tree because
we most likely want * to have a higher precedence than +
We don’t like ambiguity because it makes the parsers difficult to
design because we don’t know which parse tree will be
discovered when there are multiple possibilities
So we often want to disambiguate grammars
Problems with Ambiguity
It turns out that it is possible to modify grammars to make them
non-ambiguous
– by adding non-terminals
– by adding/rewriting production rules
In the case of our expression grammar, we can rewrite the
grammar to remove ambiguity and to ensure that parse trees
match our notion of operator precedence
– We get two benefits for the price of one
– Would work for many operators and many precedence relations
Non-Ambiguous Grammar
Expr Term | Expr + Term | Expr - Term
Term Term * Factor
| Term / Factor Expr
| Factor
Factor Number | Identifier Expr - Term
Example: 4*5+3-8*9 Expr + Term Factor
Term Factor Term * Factor
Term * Factor Number Factor Number
Factor Number
Number 3 9
Number 8
5
4
Non-Ambiguous Grammar
Expr Term | Expr + Term | Expr - Term
Term Term * Factor
| Term / Factor Expr
| Factor
Factor Number | Identifier Expr - Term
Example: 4*5+3-8*9 Expr + Term Factor
Term Factor Term * Factor
Term * Factor Number Factor Number
Factor Number
Number 3 9
Number 8
5
4
In-class Exercise
Consider the CFG:
S (L) | a
L L,S|S
Draw parse trees for:
(a, a)
(a, ((a, a), (a, a)))
In-class Exercise
S
Consider the CFG:
S (L) |
( L )
a
L L,S|S L , S
S a
Draw parse trees for:
(a, a) a
(a, ((a, a), (a, a)))
S
( L )
In-class Exercise
L , S
Consider the CFG: S ( L )
S (L) | a S
L ,
a
L L,S|S S ( L )
( L ) L , S
Draw parse trees for: S a
L , S
(a, a)
S a a
(a, ((a, a), (a, a)))
a
In-class Exercise
Write a CFG grammar for the language of
well-formed parenthesized expressions
– (), (()), ()(), (()()), etc.: OK
– ()), )(, ((()), (((, etc.: not OK
In-class Exercise
Write a CFG grammar for the language of
well-formed parenthesized expressions
– (), (()), ()(), (()()), etc.: OK
– ()), )(, ((()), (((, etc.: not OK
P () | PP | (P)
In-class Exercise
Is the following grammar ambiguous?
A A “and” A | “not” A | “0” | “1”
In-class Exercise
Is the following grammar ambiguous?
A A A “and” A | not A | 0 | 1 A
not A A and A
A and A not A 1
0 1 0
Another Example Grammar
ForStatement for “(“ StmtCommaList “;”
ExprCommaList “;” StmtCommaList ”)” “{“
StmtSemicList “}”
StmtCommaList | Stmt | Stmt “,” StmtCommaList
ExprCommaList | Expr | Expr “,” ExprCommaList
StmtSemicList | Stmt | Stmt “;” StmtSemicList
Expr . . .
Stmt . . .
Full Language Grammar Sketch
Program VarDeclList FuncDeclList
VarDeclList | VarDecl | VarDecl VarDeclList
VarDecl Type IdentCommaList “;”
IdentCommaList Ident | Ident “,” IdentCommaList
Type int | char | float
FuncDeclList | FuncDecl | FuncDecl FuncDeclList
FuncDecl Type Ident “(“ ArgList “)” “{“ VarDeclList StmtList
“}”
StmtList | Stmt | Stmt StmtList
Stmt Ident “=“ Expr “;” | ForStatement | ...
Expr ...
Ident ...
Real-world CFGs
Some sample grammars found on the Web
– LISP: 7 rules
– PROLOG: 19 rules
– Java: 30 rules
– C: 60 rules
– Ada: 280 rules
So What Now?
We want to write a compiler for a given language
We come up with a definition of the tokens embodied
in regular expressions
We build a lexer (see previous lecture)
We come up with a definition of the syntax embodied
in a context-free grammar
– not ambiguous
– enforces relevant operator precedences and associativity
Question: How do we build a parser?
How do we build a Parser?
This question could keep us busy for 1/2 semester in
a full-fledge compiler course
So we’re just going to see a very high-level view of
parsing
– If you go to graduate school you’ll most likely have an in-
depth compiler course with all the details
How do we build a Parser?
There are two approaches for parsing:
– Top-Down: Start with the start symbol and try to expand it
using derivation rules until you get the input source code
– Bottom-Up: Start with the input source code, consume
symbols, and infer which rules could be used
Note: this does not work for all CFGs
– CFGs much have some properties to be parsable with our
beloved parsing algorithms
Writing Parsers?
Nowadays one doesn’t really write parsers from scratch, but
one uses a parser generator (Yacc is a famous one)
token stream parse tree
Parser
compile time
compiler design time
grammar
specification Parser
Generator