Abstract Syntax Trees (AST)
Modern Compiler Design
        by David Galles
   University of San Francisco
        Abstract Syntax Trees (ASTs)
• Parse trees not only prove that a given string can
  be derived from a context-free grammar, they
  also give some indication of the meaning of the
  string.
• The parse tree for the expression id+id*id also
  provides the information about the execution of
  the operators
• Whether the execution is id+(id*id) or (id+id)*id
Parse tree
                   Parse tree
• What about the non-terminals E, T and F – are
  they required to convey the meaning of the
  string id+id*id?... NO
• The non-terminals are necessary to describe the
  grammar but not required when the syntax
  analysis (parsing) has been done successfully.
• We need a simplified version of the parse tree
  that conveys all of the structure and – meaning.
• This kind of simplified parse tree is known as an
  abstract syntax tree or AST.
         Abstract Syntax Tree (AST)
• An abstract syntax tree is a simplified parse tree
  which contains the structural information about
  the shape of the parse tree, without the non-
  terminals of the grammars.
• Non-terminals are required for parsing but they
  give no information about the meaning of the
  string (expression)
• Non-terminals are removed to produce an
  abstract syntax tree.
               ASTs - Examples
• ASTs for 3+4*5
• 3*(4+5)
                 Designing ASTs
• When designing an AST for a programming
  language, we need to decide what information
  from the parse tree is essential and what can be
  discarded.
• We need to structure the tree so that it is easy to
  work with the next stages of compilation.
• In general, an AST consists of a root and several
  children, as follows:
                 Designing ASTs
• The root of the tree contains information that
  describes the type of the tree – whether the tree
  represents an operator expression, an integer
  literal, an if statement, and so on.
• The children of the root are subtrees that describe
  the various components of the tree itself. For
  instance, the children of an operator expression
  describe the operands and the children of
  assignment statement describe the variable being
  modified and the value being assigned.
              ASTs for Expressions
• ASTs for expressions are fairly straight forward.
• Simple expressions (integers constants, identifiers
  etc.) can be represented by a single node.
• Binary operations are represented as root, which
  stores the operation, and a left and a right subtree,
  which corresponds to the left and right operands of
  the expression.
• What about parenthesis? Parenthesis give us
  grouping information – but that information is already
  contained within the structure of the tree.
• Thus we need not include parenthesis in an AST.
                ASTs for Expressions
• ASTs for: (3+4)*(5+6)
• AST for (4)
     • Integer_Literal(4)
                ASTs for Variables
• The method that we will use to describe
  structured variables as trees is as follows:
• Base variable: Base variable are simple variable,
  such as x and y, They are represented by a node
  that has a single child – identifier for the variable.
• Array variables: Array variables are used for
  array accesses, such as A[3] or A[4+5]. They are
  represented by a tree with a root of ‘ArrayVar’, a
  variable tree for the ‘base’ of the array and an
  expression tree for the index
               ASTs for Variables
• Class instance variables: Class instance variables
  are used for accessing variables inside of classes,
  such as x.y. They are represented by a tree with a
  root of ‘ClassVar’, a variable tree for the ‘base’ of
  the class instance variable and an identifier for
  the variable name of the class instance variable
  being accessed.
• Same strategy can be used in the ASTs for records
  or structs
               ASTs for Variables
• AST for variable x
• AST for class variable x.y
               ASTs for Variables
• AST for array variable x[4]
• AST for class variable x.w[4]
               ASTs for Variables
• AST for class variable y[3].z
                   ASTs for Variables
• Consider the following program:
  Class simpleClass{
  int a;
  int b;       }
  Class complexClass{
  int u;
  simpleClass v; }
  void main(){
  complexClass x;
  x = new complexClass();
  x.v = new simpleClass();
  x.v.a = 3; }
  • ASTs for x.v.a AND w.x.y.z
               ASTs for Variables
• AST for v.w[x.y].z
              ASTs for Statements
• AST for Assignment Statements
  – To represent an assignment statement, we need to
    keep track the l-value of the assignment statement
    and the r-value of the assignment statement.
  – Thus the AST for an assignment statement will have
    two children , a left subtree for the l-value of the
    assignment statement and a right subtree for the r-
    value side.
            ASTs for Statements
• AST for Assignment Statements
             AST for if Statements
• To represent an if statement, we need three
  pieces of information.
• The test for the id statement, the ‘then’ clause of
  the if statement and the ‘else’ clause of an if
  statement.
• The else subtree may be empty for an if
  statement that does not have an else.
          AST for while Statement
• While statements are similar to if statement.
  They require an expression tree for the loop test
  and a statement tree for the body of the loop as
  follows:
          AST for BlockStatement
• The AST for a block statement has a variable
  number of children – one for each statement in
  the block.
        AST for Variable Declaration
• We can represent any variable declaration as an
  AST with four children.
• The root contains a flag that signifies that the
  tree represents a variable declaration. The four
  children are as follows:
  – An identifier for the type of the variable.
  – An identifier for the name of the variable.
  – An identifier for the dimensionality of the variable.
    Non-array variable will have a dimensionality of 0.
  – An initial value for the variable (if any).
          AST for Variable Declaration
• If a declaration has no initial value, then the
  initial value child of null.
• For example: int x = 3 ; AST will be;
• int y[ ][ ];
        AST for Variable Declaration
• AST for x = 4 + y
          AST for Class Declaration
• class MyClass {
   int x;
   boolean B[ ];
}
                 ASTs Examples
• AST for x.y = u[v.w].z
                 ASTs Examples
• AST for: if (x>0) x=x-1; else x=x+1;
                ASTs Examples
• AST for: while(x>0){ x=x/2; y=y+1; }
       AST for Function Declaration
• To represent a function declaration we need to
  keep track of several pieces of information:
  – Return type of the function (Identifier).
  – Name of the function (Identifier).
  – List of formal parameters (each of which needs the
    same information as a variable declaration).
  – Function body (Block statement).
                 AST Summary
• An AST is a simplified parse tree which
  represents the structural information without
  non-terminals
• ASTs help to perform the task of next phases of
  the compiler
• All language structures can be represented by
  abstract syntax trees
• AST can be referred as Intermediate
  representation of the source program
              Three-Address Code
• Three-address code is a linearized representation
  of a syntax tree.
• In three-address code, there is at most one operator
  on the right side of an instruction; that is, no built-
  up arithmetic expressions are permitted.
• Thus a expression like x+y*z might be translated
  into the sequence of three-address instructions:
  t1 = y * z
  t2 = x + t 1
  t1 and t2 are compiler-generated temporary names
             Three-Address Code
• Three-address code is desirable for code
  generation.
• Three-address representation of the given AST is
  as follows.
   t1 = a + b
   t2 = c + d
   t3 = t 1 * t 2
         Addresses and Instructions
• Three-address code is built from two concepts:
  addresses and instructions.
• An address can be one of the following:
• A name. For convenience, we allow source-program
  names to appear as addresses in three-address code.
  In an implementation, a source name is replaced by a
  pointer to its symbol-table entry, where all
  information about the name is kept.
• A constant. In practice, a compiler must deal with
  many different types of constants and variables. Type
  conversions within expressions are considered.
        Addresses and Instructions
• A compiler-generated temporary. It is useful,
  especially in optimizing compilers, to create a
  distinct name each time a temporary is needed.
  These temporaries can be combined, if possible,
  when registers are allocated to variables.
        Addresses and Instructions
• Consider the statement: do i = i+1; while(a[i]<v);
• In both translations, the last instruction is a
  conditional jump to the first instruction.
• The multiplication i * 8 is appropriate for an
  array of elements that each take 8 units of space.
                   Quadruples
• In a compiler, these instructions can be
  implemented as objects or as records with fields for
  the operator and the operands. Three
  such representations are called quadruples, triples,
  and indirect triples.
• A quadruple (or just "quad') has four fields, which
  we call op, arg1, arg2, and result.
• Instructions with unary operators like x = minus y or
  x = y do not use arg2.
• Conditional and unconditional jumps put the target
  label in result.
                  Quadruples
• Three-address code for the assignment a = b * - c
  +b * - c ;
• Three-address code and quadruples are
  represented as:
                      Triples
• A triple has only three fields, which we call op,
  arg1, and arg2.
• Using triples, we refer to the result of an
  operation x op y by its position, rather than by
  an explicit temporary name.
                 Indirect triples
• Indirect triples consist of a listing of pointers to
  triples, rather than a listing of triples themselves.
• For example, let us use an array instruction to list
  pointers to triples in the desired order. Then, the
  triples in Fig. 6.11(b) can be represented as:
                    Summary
• An intermediate representation is typically some
  combination of a graphical notation and three-
  address code.
• As in syntax trees, a node in a graphical notation
  represents a construct; the children of a node
  represent its sub-constructs.
• Three address code takes its name from
  instructions of the form x = y op z, with at most
  one operator per instruction.
• There are additional instructions for control flow.