KEMBAR78
(Week 4) Syntax Analysis (CFG) | PDF | Parsing | Mathematical Logic
0% found this document useful (0 votes)
29 views50 pages

(Week 4) Syntax Analysis (CFG)

Uploaded by

ikechukwujo45
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views50 pages

(Week 4) Syntax Analysis (CFG)

Uploaded by

ikechukwujo45
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 50

CSC 318 – COMPILER CONSTRUCTION I (2-UNITS)

Week Four:
Parsing (Syntax Analysis)
The CSC318 Team [2020|2021]
femi.odusote@covenantuniversity.edu.ng
Recommended Texts for this Course
1. Watson, Des (2017). A Practical Approach to Compiler Construction, Springer. [MainText]

2. Campbell, Bill; Iyer, Swami; and Akbal-Delibas, Bahar (2013). Introduction to Compiler
Construction in a Java World, Taylor & Francis Group.

3. Seidl, Helmut; Wilhelm, Reinhard; and Hack, Sebastian (2012). Compiler Design: Analysis and
Transformation, Springer

4. Grune, Dick; Reeuwijk, Kees van; Bal, Henri E; Jacobs, Ceriel J.H; and Langendoen, Koen (2012).
Modern Compiler Design, Second Edition, Springer.

5. Reis, Anthony Dos (2011). Compiler Construction Using Java, JavaCC, and Yacc, Wiley

2
Parsing
v Once we have identified the tokens in our program,
we then want to determine its syntactic structure.

v That is, we want to put the tokens together to make the larger
syntactic entities: expressions, statements, methods, and
classdefinitions.

v This process of determining the syntactic structure of


a program is called parsing.
3
v First, we wish to make sure the program is syntactically
valid - that it conforms to the Grammar that describes
its syntax.

v As the Parser parses the program, it should identify


syntax errors and report them and the line numbers
they appear on.

v Moreover, when the Parser does find a syntax error,it


should not just stop, but it should report the error and
gracefully recover so that it may go on looking for
additional errors.
4
v Second, the parser should produce some representation of the parsed
program that is suitable for semantic analysis.

v A BNF grammar describes the syntax of programs in a programming


language.

v The Abstract Syntax Tree, (AST) captures the essential syntax of our program.

v Why are we interested in a Tree representation for our program?


ü Because it is easier to analyze and decorate (with type information) a Tree
than it is to do the same with text.

v The AST makes that syntax, which is implicit in the program text,explicit.
ü This is the purpose of Parsing.
5
Context-Free Grammars and Languages
Backus–Naur Form (BNF) and Its Extensions
v The grammars that we use to describe programming languages are inherently recursive
and so are best described by what we call context-free grammars.

• For example, the context-free rule


Ø S ::= if (E) S
§ says that, if E is an expression and S is a statement,

Ø if (E) S (1)
§ is also a statement.
§ Or, we can read the rule as, a statement S can be written as an if, followed by a
parenthesized expression E and finally a statement S.
§ That is, we can read the “::=“ as ‘can be written as a' .

6
§ There are abbreviations possible in the notation.
§ For example, we can write

S ::= if (E) S | if (E) S else S


(2)
as shorthand for
S ::= if (E) S
S ::= if (E) S else S
(3)
That is,we can read the ‘|’ as ‘or’.

§ So, a statement S can be written either as an if followed by a parenthesized expression E, and a


statement S;

§ or as an if,followed by a parenthesized expression E,a statement S,an else,and another


statement S.
7
§ In our grammar, the square brackets indicate that a phrase is optional..

§ For example, (2) and (3) could have been written as


S ::= if (E) S [else S] (4)
which says an S may be written as if (E) S, optionally followed by else S.

§ Curly braces denote the Kleene closure, indicating that the phrase may appear zero or
more times
§ For example,

• E ::= T {+ T} (5)

which says that an expression E may be written as a term T , followed by zero or more
occurrences of + followed by a term T :
T
T+T
T+T+T
• . .. 8
§ Finally, one may use the alternation sign | inside
right-hand sides, using parentheses for grouping.

• For example,

E ::= T {(+ | -) T}

meaning that the additive operator may be either + or -, allowing for:

T+T-T+T
9
Grammar and the Language It Describes
• A context-free grammar is a tuple, G = (N,T,S,P ), where
• N is a set of non-terminal symbols, sometimes called non-terminals;
• T is a set of terminal symbols, sometimes called terminals;
• S ∈ N is a designated non-terminal, called the start symbol,; and
• P is a set of production rules, sometimes called productions or rules.
• For example, a context-free grammar that describes (very) simple arithmetic
expressions is G = (N, T, S, P ).
• where N = {E, T, F} is the set of non-terminals, T = {+, *, (, ), id} is the set of terminals,
S = E is the start symbol, and the Production set, P is is as follows:

• P = {E ::= E + T , E ::= T, T ::= T * F , T ::= F, F ::= (E), F ::= id}

10
• A little less formally, we can denote the same grammar simply
as a sequence of production rules.
• For example, the grammar below denotes the same grammar
as does the immediate preceding one given.

E ::= E + T
E ::= T
T ::= T * F
T ::= F
F ::= (E)
F ::= id
11
• We may surmise that the symbols (here E, T , and F ) that are defined by at least one production
rule are non-terminals.
• Conventionally, the first defined non-terminal (that is, E here) is our start symbol.
• Those symbols that are not defined (here +,*, (, ),and id) are terminals.
• The start symbol is important to us because it is from this symbol, using the production rules,
that can generate strings in a language.
• For example, because we designate E to be the start symbol in the grammar above,
we can record a sequence of applications of the production rules, starting from E to
the sentence id + id * id as follows:
E ⇒E + T
⇒T +T
⇒F + T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
12
• When a string of symbols derives another string of symbols in one step, that is, by a single
application of a production rule, we say that first string directly derives the second string.

• For example,
E directly derives
E+T
E + T directly derives
T+T
T + T directly derives
F + T. and so on . ..
• When one string can be rewritten as another string, using zero or more production
rules from the grammar, we say the first string derives the second string.
• For this derives relation, we usually use the symbol ⇒∗.
• For example,
E ⇒∗ E (in zero steps)
E ⇒∗ id + F * F
T + T ⇒∗ id + id * id 13
• We say the language L(G) that is described by a grammar G consists of all the strings
(sentences) comprised of only terminal symbols that can be derived from the start symbol.

• A little more formally, we can express the language L(G) for a grammar G with start symbol S
and terminals T as

• L(G) = {w | S ⇒∗ w and w ∈ T*}.

• For example, in our grammar above,

E ⇒∗ id + id * id
E ⇒∗ id
E ⇒∗ (id + id) * id
so, L(G) includes each of id + id * id,
id
(id + id) * id
• and infinitely more finite sentences. 14
Top-Down and Bottom-Up parsing Parsing
Approaches
• There are two major parsing approaches: top-down and bottom-up.
• In top-down parsing, you start with the start symbol, S and apply the production rules
until you arrive at the desired/specified string.
• In bottom-up parsing, you start with the specified string and reduce it to the start
symbol, S by applying the production rules backwards.
• As an example, let’s trace through the two approaches on this simple grammar that
recognizes strings consisting of any number of a’s followed by at least one (and
possibly more) b’s:
• S –> AB
• A –> aA | ε
• B –> b | bB
Here is a top-down parse of aaab. >>>

15
top-down parse
• We begin with the start symbol and at each step, expand one of the remaining
nonterminals by replacing it with the right side of one of its productions.
• We repeat until only terminals remain.
• The top-down parse produces a leftmost derivation of the sentence.
Start Symbol, S Rules:
AB S –> AB
aAB A –> aA
aaAB A –> aA
aaaAB A –> aA
aaaεB A –> ε
aaab B –> b

16
bottom-up parse
• A bottom-up parse works in reverse.
• We begin with the sentence of terminals and each step applies a production in reverse, replacing
a substring that matches the right side with the nonterminal on the left.
• We continue until we have substituted our way back to the start symbol.
The string • aaab The Rules:
(insert ε)
• aaaεb
A –> ε
• aaaAb
A –> aA
• aaAb
A –> aA
• aAb A –> aA
• Ab B –> b
• AB S –> AB
• S

17
Derivations (cont’d)
• We are interested in languages, those strings of terminals that can be derived from a grammar’s start symbol.
• There are two kinds of derivation that will be important to us when we go about parsing these languages: left-
most derivations and right-most derivations.
• A left-most derivation is a derivation in which at each step, the next string is derived by applying a production rule for
rewriting the left-most non-terminal.
• For example, we have already seen a left-most derivation of id + id * id:
E⇒E+ T
⇒T+T
⇒F+T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
• Here we have underlined the left-most non-terminal in each string of symbols in the derivation to show that it is indeed
a left-most derivation.

18
right-most derivation
• A right-most derivation is a derivation in which at each step, the next string is derived
by applying a production rule for rewriting the right-most non-terminal.

• For example, the right-most derivation of id + id * id would go as follows:


E ⇒E + T
⇒E + T * F
⇒ E + T * id
⇒ E + F * id
⇒ E + id * id
⇒ T + id * id
⇒ F + id * id
⇒ id + id * id

19
• We use the term sentential form to refer to any string of (terminal and non-terminal)
symbols that can be derived from
the start symbol.
• So, for example in the previous derivation,
E
E+T
E+T*F
E + T * id
E + F * id
E + id * id
T + id * id
F + id * id
id + id * id
are all sentential forms.

ØClearly, any sentential form consisting solely of terminal symbols is a sentence in the
language. 20
Parse tree
• An alternative representation of a derivation is
the parse tree, a tree that illustrates the
derivation and the structure of an input string
(at the leaves) from a start symbol (at the root).

• For example, Figure 1 shows the parse tree for


id + id * id.
21
A parse tree for id + id * id

22
Ambiguous Grammars and Unambiguous Grammars
• Given a grammar G, if there exists a sentence s in L(G) for which
there are more than one left-most derivations in G (or equivalently,
either more than one right-most derivations, or more than one
parse tree for s in G), we say that the sentence s is ambiguous.

• Moreover, if a grammar G describes at least one ambiguous


sentence, the grammar G is ambiguous (grammar).

• If there is no such sentence, that is, if every sentence is derivable by


a unique left-most (or right-most) derivation (or has a unique parse
tree), we say the grammar is unambiguous.

23
Example
• Consider the grammar, G:

E ::= E + E | E * E | (E) | id

and, consider the sentence form, id + id * id.

• One left-most derivation for this sentence is:


E ⇒E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
24
In-Class Exercise 4
• Produce another left-most derivation for the
same sentence above.

25
right-most derivations
• It is also the case that the sentence has two
right-most derivations in the grammar:
E ⇒E + E
⇒ E + E *E
⇒ E + E * id
⇒ E + id * id
⇒ id + id *id

27
In-Class Exercise 5
• Produce another right-most derivation for the
same sentence.

28
In-Class Exercise 6
• Produce the two parse trees for the sentence to
demonstrate ambiguity.

30
• Clearly, we would rather have unambiguous grammars
describe our programming language, because
ambiguity in the parsing can lead to ambiguity in
assigning semantics (meaning) to programs.

• For example, in id + id * id, which operation is applied


first: addition or multiplication?

• From our math days, we would like the multiplication


operator, * to be more binding than the addition
operator, +.
32
Conditional statements
• As another example, consider the grammar describing conditional statements

S ::= if (E) S
| if (E) S else S
|s
E ::= e

• Here, the token e represents an arbitrary expression and the s represents a (non- conditional)
statement.
• Consider the following sentence form (string of terminals): if (e) if (e) s else s

• If we look at this sentence carefully, we see that it nests one conditional statement within another.
• One might ask: To which 'if' does the 'else' belong?
• We cannot know; the sentence is ambiguous.

33
Two left-most derivations for a conditional
statement
• More formally, there exist two left- most derivations for this sentence:

Derivation 1: Derivation 2:

S ⇒ if (E) S else S S ⇒ if (E) S


⇒ if (e) S else S ⇒ if (e) S
⇒ if (e) if (E) S else S ⇒ if (e) if (E) S else S
⇒ if (e) if (e) S else S ⇒ if (e) if (e) S else S
⇒ if (e) if (e) s else S ⇒ if (e) if (e) s else S
⇒ if (e) if (e) s else s ⇒ if (e) if (e) s else s

34
Two parse trees for if (e) if (e) s else s

35
• In general, there is no algorithm that can tell us whether
or not an arbitrary grammar is ambiguous.

• That is, ambiguity is not decidable.

• But there are large classes of grammars (and so


programming languages) that both can be shown to be
decidable and for which efficient parsers can be
automatically constructed.

• These classes of grammars are large enough to


describe the many programming languages in use
today.
36
• Because of their recursive nature, parsing languages described by
context-free grammars requires the use of a pushdown stack.

• In general, we can parse any language that is described by a


context-free grammar but more often than not, the algorithms that do
so involve backtracking.

• But there are classes of grammars whose languages may be parsed


without backtracking; the parsing is said to be deterministic.

• At each step of the parsing algorithm, the next step may be


determined simply by looking at both the state of the pushdown
stack and the incoming symbol (as the input sentence is
scanned from left to right).
37
Deterministic Parsers categorized
• These deterministic parsers may be roughly divided into two categories:

Ø Top-down parsing:
§ where the parser begins at the start symbol, S and, step-by-step derives the
input sentence and producing either a parse tree or (more likely) an Abstract
Syntax Tree (AST) from the root down to its leaves.

Ø Bottom-up parsing:
§ where the parser begins at the input sentence and, scanning it from left to right,
applies the production rules, replacing righthand sides with lefthand sides, in a
step-by-step fashion for reducing the sentence to obtain the start symbol.
§ Here the Parse Tree or Abstract Syntax Tree is built from the bottom leaves up to the top
root.

38
Top-Down Deterministic Parsing
• There are two popular top-down deterministic parsing strategies:
1. Parsing by recursive descent and
2. LL(1) parsing.

• Both Parsers scan the input from left to right, looking at and scanning just one symbol at a time.

• In both cases, the parser starts with the grammar’s start symbol, S as an initial goal in the
parsing process;
§ that is, the goal is to scan the input sentence and parse the syntactic entity represented by the
start symbol.
• The start symbol is then rewritten, using a BNF rule replacing the symbol with the right-
hand side sequence of symbols.

39
Parsing
• Parsing = process of determining if a string of tokens can be
generated by a grammar

• For any CF grammar there is a parser that takes at most O(n3) time to
parse a string of n tokens

• Linear algorithms suffice for parsing prog. language source code

• Top-down parsing “constructs” a parse tree from the root to the leaves
• Bottom-up parsing “constructs” a parse tree from the leaves to the root
4400
Predictive Parsing
• Recursive descent parsing is a top-down parsing method
§ Every nonterminal has one (recursive) procedure responsible for parsing the
nonterminal’s syntactic category of input tokens
§ When a nonterminal has multiple productions, each production is
implemented in a branch of a selection statement based on input look-ahead
information

• Predictive parsing is a special form of recursive descent parsing


where we use one lookahead token to unambiguously determine the
parse operations.

41
41
Example Predictive Parser (Grammar)

type ® simple
| ^ id
| array [ simple ] of type
simple ® integer
| char
| num dotdot num

42
Example Predictive Parser (Program Code)
procedure match(t : token);
begin
if lookahead = t then procedure simple();
lookahead := nexttoken() begin
else error() if lookahead = ‘integer’ then
end; match(‘integer’)
else if lookahead = ‘char’ then
procedure type(); match(‘char’)
begin else if lookahead = ‘num’ then
if lookahead in { ‘integer’, ‘char’, ‘num’ } then match(‘num’);
simple() match(‘dotdot’);
else if lookahead = ‘^’ then match(‘num’)
match(‘^’); match(id) else error()
else if lookahead = ‘array’ then end;
match(‘array’); match(‘[‘); simple();
match(‘]’); match(‘of’); type()
else error()
end;
43
Example Predictive Parser (Execution Step 1)

Check lookahead
type()
and call match

match(‘array’)

Input: array [ num dotdot num ] of integer

lookahead
44
Example Predictive Parser (Execution Step 2)

type()

match(‘array’) match(‘[’)

Input: array [ num dotdot num ] of integer

lookahead
45
Example Predictive Parser (Execution Step 3)

type()

match(‘array’) match(‘[’) simple()

match(‘num’)

Input: array [ num dotdot num ] of integer

lookahead
46
Example Predictive Parser (Execution Step 4)

type()

match(‘array’) match(‘[’) simple()

match(‘num’) match(‘dotdot’)

Input: array [ num dotdot num ] of integer

lookahead
47
Example Predictive Parser (Execution Step 5)

type()

match(‘array’) match(‘[’) simple()

match(‘num’) match(‘dotdot’) match(‘num’)

Input: array [ num dotdot num ] of integer

lookahead
48
Example Predictive Parser (Execution Step 6)

type()

match(‘array’) match(‘[’) simple() match(‘]’)

match(‘num’) match(‘dotdot’) match(‘num’)

Input: array [ num dotdot num ] of integer

lookahead
49
Example Predictive Parser (Execution Step 7)

type()

match(‘array’) match(‘[’) simple() match(‘]’) match(‘of’)

match(‘num’) match(‘dotdot’) match(‘num’)

Input: array [ num dotdot num ] of integer

lookahead
50
Example Predictive Parser (Execution Step 8)

type()

match(‘array’) match(‘[’) simple() match(‘]’) match(‘of’) type()

match(‘num’) match(‘dotdot’) match(‘num’) simple()

match(‘integer’)

Input: array [ num dotdot num ] of integer

lookahead
LECTURE 4: FURTHER READING & ASSIGNMENT

Kindly study the Chapter4 in the Textbook, DES WATSON, on page


75-91, for further background understanding on the lecture topic.

49
Homework
1. Consider the grammar:
A →aAbA|c
Give a leftmost derivation of the string aacbcbc.
2. Prove which one of the following strings is recognized by this grammar.
S →aSbb|ε
i. aabb
ii. bb
iii. aabbbb
iv. aaaabb
3. Given the following expression grammar, produce a Syntax tree for 2+3*4.
Exp → Exp+Exp2
Exp → Exp-Exp2
Exp → Exp2
Exp2 → Exp2*Exp3
Exp2 → Exp2/Exp3
Exp2 → Exp3
Exp3 → num
Exp3 → (Exp)

52
52

You might also like