Compiler & Interpreter
By
Dr. Tariq R. Soomro
WEEK-2: LANGUAGE DESCRIPTION- SYNTACTIC STRUCTURE:
• Clear and complete descriptions of a language are needed by
programmers, implementers, and even language designers.
• The syntax of a language specifies how programs in the language
are built up.
• The semantics of the language specifies what programs mean.
• For example, dates are built up from digits represented by D and
the symbol / as follows:
DD/DD/DDDD
• According to this syntax, 01/02/2001 is a date.
• The day this date refers to is not identified by the syntax.
• In the United States, this date refer to January 2, 2001, but
elsewhere 01 is interpreted as the day and 02 as the month, so the
date refers to February 1, 2001.
• The same syntax therefore has different semantics in different
parts of the world.
Organization of Language Descriptions:
• Grammars are convenient enough for syntax that they are used
by all communities: programmers, implementers, designers.
• Semantic descriptions, however, are seldom both readable enough
to be suitable for a beginner and precise enough to specify a
language fully.
• Several distinct styles of language description have arisen to meet
the conflicting needs of readability and precision.
• Tutorials: A tutorial introduction is a guided tour of language. It
provide impressions of what the main constructs of the language
are and how they are meant to be used.
• The syntax and semantics are introduced gradually, as needed.
• Reference Manuals: A reference manual describing the syntax
and semantics of a language is traditionally organized around the
syntax of the language.
• Formal Definitions: A formal definition is a precise description of
the syntax and semantics of a language; it is aimed at specialists.
English descriptions leave room for conflicting interpretations, so
precise formal notations have been developed. The training and
effort needed to learn such notations are balanced by their
promise for clarifying particularly subtle point.
Expression Notations:
• Expression such as a+b*c have been in use for centuries and were
a starting point for design of programming languages.
• For example, a expression in Fortran can be written as:
(- b + √ b2 – 4 * a * c ) / (2 * a)
(- b + sqrt (b * b – 4.0 * a * c)) / (2.0 * a)
• Programming languages use a mix of infix, prefix, and postfix
notations. (Assignment)
• A binary operator is applied to two operands.
• In infix notation, a binary operator is written between its
operands, as in the expression a+b.
• Other alternative are prefix notation, in which the operator is
written first, as + a b, and postfix notation, in which the operator
is written last, as a b +.
• An expression can be enclosed within parentheses without
affecting its value.
• Expression E has the same value as (E), as a rule.
• Prefix and Postfix notations are sometimes called parenthesis-free
because as we shall see, the operands of each operator can be
found unambiguously, without the need for parentheses.
Prefix Notation:
• An expression in prefix notation is written as follows:
• The prefix notation for a constant or a variable is the constant or
variable itself.
• The application of an operator op to sub-expressions E1 and E2 is
written in prefix notation as op E1 E2.
• An advantage of prefix notation is that it is easy to decode during
a left-to-right scan of an expression.
• If a prefix expression begins with operator +, the next expression
after + must be the first operands of + and the expression after
that must be the second operand of +.
• For example, the sum of x and y is written in prefix notation as +
x y.
• The product of + x y and z is written as * + x y z.
• Thus + 20 30 equals to 50 and * + 20 30 60 = * 50 60 = 3000
• Or * 20 + 30 60 = * 20 90 = 1800
Postfix Notation:
• An expression in postfix notation is written as follows:
• The postfix notation for a constant or a variable is the constant or
variable itself.
• The application of an operator op to sub-expressions E1 and E2 is
written in postfix notation as E1 E2 op.
• An advantage of postfix notation is that they can be mechanically
evaluated wit the help of a stack data structure.
• For example, the sum of x and y is written in postfix notation as
x y +.
• The product of x y + and z is written as x y + z *.
• Thus 20 30 + equals to 50 and 20 30 + 60 * = 50 60* = 3000
• Or 20 30 60 + * = 20 90 * = 1800
Infix Notation:
• In infix notation, operators appear between their operands; +
appear between a and b in the sum a + b.
• An advantage of infix notation is that it is familiar and hence easy
to read.
• Infix notation comes with rules for precedence and associativity.
• How is an expression like a + b * c to be decoded?
• Is it the sum of a and b * c, or is it the product of a + b and c?
• The operator * usually takes its operands before + does.
• An operator at a higher precedence level takes its operands before
an operator at a lower precedence level.
• BODMAS rules is an example.
Mixfix Notation:
• Operations specified by a combination of symbols do not fit neatly
into the prefix, infix, postfix classification.
• For example the keywords, if, then, and else are used together in
the expression
if a > b then a else b
• The meaningful components of this expression are the condition
a>b and the expressions a and b.
• If a>b evaluates to true, then the value of the expression is a,
otherwise, it is b.
• When symbols or keywords appear interspersed with the
components of an expression, the operation will be said to be in
mixfix notation.
Abstract Syntax Trees:
• The abstract syntax of a language identifies the meaningful
components of each construct in the language.
• The prefix expression +ab, the infix expression a+b, and the
postfix expression ab+ all have the same meaningful components;
the operator + and the sub-expressions a and b.
• A corresponding tree representation is a better grammar can be
designed if the abstract syntax of a language is known before the
grammar is specified.
+
a b
• An operator and its operands are represented by a node and its
children.
• A tree consists of a node with k ≥ 0 trees as its children.
• When k = 0, a tree consists of just a node, with no children.
• A node with no children is called a leaf.
• The root of a tree is a node with no parent; that is, it is not a child
of any node.
Lexical Syntax:
• Keyword like if and symbol like <= are treated as units in a
programming language, just as words are treated as units in
English.
• The meaning of the word dote (love / admire) bears no relation to
the meaning of dot, despite the similarity of their written
representations.
• The two-characters symbol <= is treated as a unit in Pascal and C.
• It is distinct from the one-character < and =, which have different
meaning of their own.
• For example:
• <> in Pascal and != in C
• mod in Pascal and % in C etc.
• Grammars deal with units called tokens
• The syntax of a programming language is specified in terms of
units called tokens or terminals.
• A lexical syntax for language specifies the correspondence
between the written representation of the language and the tokens
or terminals in a grammar for the language.
• Alphabetic character sequences that are treated as units in a
language are called keywords.
• For example if and while are keywords in both Pascal and C.
• Keywords are reserve words and cannot be used as names.
• The actual character sequence used to write down an occurrence
of a token is called the spelling of that occurrence.
• Token will be written using bold font and spellings will be written
using a typewriter-like font; so the keyword “while” has spelling
while.
• Subscript can be used to distinguish between occurrences of a
token; the subscript might be the spelling for a token representing
a name, or the value for a token representing a number.
• Using token “name” for names and token “number” for integers,
the character sequence:
B*B–4*A*C
• is represented by the token sequence as:
nameB * nameB – number4 * nameA * nameC
• White space in the form of blank, tab, and newline character can
typically be inserted between tokens without changing the
meaning of a program.
• Similarly comments between tokens are ignored.
• Informal descriptions usually suffice for white space, comments
and the correspondence between tokens and their spellings, so
lexical syntax will not be formalized.
• Real numbers are a possible exception.
• The most complex rules in a lexical syntax are typically the ones
describing the syntax of real numbers, because parts of the syntax
are optional.
• The following some of the ways of writing the same number:
• 314.E-2 = 3.14 = 0.314E+1 = 0.313E1
• and leading 0 can sometimes be dropped as: .314E1
Context-Free Grammars:
• The concrete syntax of a language describes its written
representation, including lexical details such as the placement of
keywords and punctuation marks.
• Context-free grammars, or simply grammars, are a notation for
specifying concrete syntax.
• BNF, from Backus-Nour Form, is a one way of writing grammars.
• A grammar for a language imposes a hierarchical structure,
called a parse tree on programs in the language.
• The following is a parse tree for the string 3.14 in a language of
real numbers:
real number
integer-part fraction
digit digit fraction
digit
3 . 1 4
• The leaves at the bottom of a parse tree are labeled with terminals
or tokens like 3; tokens represent themselves.
• By contrast, the other nodes of a parse tree are labeled with non-
terminals like real-number and digit; non-terminal represent
language constructs.
• Each node in the parse tree is based on a production, a rule that
defines a non-terminal in terms of a sequence of terminals and
non-terminals.
• The root of the parse tree for 3.14 is based on the following
informally stated production:
• “A real number consists of an integer part, a point, and a fraction
part”.
• Together the tokens, the non-terminals, the productions, and a
distinguished non-terminal, called the starting non-terminal,
constitute a grammar for a language.
• The starting non-terminal may represent a portion of a complete
program when fragments of a programming language are studies.
• Both tokens and non-terminals are referred to as grammar
symbols, or simply symbols.
Definition of Context-Free Grammars:
• Given a set of symbols, a starting over the set is a finite sequence
of zero or more symbols from the set.
• The number of symbol in the sequence is said to be the length of
the string.
• The length of the string “teddy” is 5.
• An empty string is a string of length zero.
• A context-free grammar, or simply grammar, has four parts:
1) A set of tokens or terminal; these are the atomic symbols in
the language
2) A set of non-terminals; these are the variable representing
constructs in the language.
3) A set of rules called productions for identifying the
components of a construct. Each production has a non-
terminals as its left side, the symbol ::=, and a string over
the sets of terminals and non-terminals as its right side
4) A non-terminal chosen as the starting non-terminal; it
represents the main construct of the language.
• Unless otherwise stated, the production for the starting non-
terminal appear first.
BNF: Backus-Naur Form:
• The concept of a context-free grammar, consisting of terminals,
non-terminals, productions, and a string non-terminal, is
independent of the notation used to write grammars.
• BNF is one such notation, made popular by its use to organize the
report on the Algol-60 programming language.
Grammars for Expressions:
• A well-designed grammar can make it easy to pick out the
meaningful components of a construct.
• In other words, with a well-designed grammar, parse trees are
similar enough to abstract syntax trees that the grammar can be
used to organize a language description or a program that exploits
the syntax.
• An example of a program that exploits syntax is an “expression
evaluator” that analyzes and evaluates expressions.
• After expressions, the remaining syntax is often easy.
Variants of Grammars:
• The other ways of grammars are Extended BNF and Syntax
Charts.
• EBNF is an extension of BNF that allows lists and optional
elements to be specified. Lists or sequences of elements appear
frequently in the syntax of programming language. The appeal of
EBNF is convenience, not additional capability, since anything
that can be specified with EBNF can also be specified using BNF.
• Syntax charts are a graphical notation for grammars. They have
visual appeal; again, anything that can be specified using syntax
charts can also be specified using BNF.