L02 Syntax
L02 Syntax
1
Programming Languages Syntax
     Computer languages must be precise:
       Both their form (syntax) and meaning (semantics) must be specified
        without ambiguity, so that both programmers and computers can tell
        what a program is supposed to do.
       Example: the syntax of Arabic numerals:
          A digit “is”: 0 |(or) 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
          A non_zero_digit “is” 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
          A natural_number (>0)“is”a non_zero_digit followed
           by other digits (a number that doesn’t start with 0) = the
           regular expression “non_zero_digit digit*”
     Specifying the syntax for programming languages has 2 parts:
      Regular Expressions (RE) and Context-Free Grammars
2
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
Regular Expressions
     A regular expression is one of the following:
       a character
       the empty string, denoted by ε
       two regular expressions concatenated
         E.g., letter letter
       two regular expressions separated by | (i.e., or),
         E.g., letter ( letter | digit )
       a regular expression followed by the Kleene star
       (concatenation of zero or more previous item)
         E.g., letter ( letter | digit )*
3
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
Regular Expressions
 RE example: the syntax of numeric constants can be
 defined with regular expressions:
A digit “is”     0|1|2|3|4|5|6|7|8|9
A number “is”    integer | real
An integer “is” digit digit*
A real “is”      integer exponent
                 | decimal ( exponent | ε )
A decimal “is”   digit* (.digit|digit.) digit*
An exponent “is” ( e | E ) ( + | - | ε ) integer
4
                  (c) Paul Fodor (CS Stony Brook) and Elsevier
Regular Expressions
     Regular expressions work well for defining tokens
       They are unable to specify nested constructs
        For example, a context free grammar in BNF
         form to define arithmetical expressions is:
    expr → id | number | - expr | ( expr ) | expr op expr
    op → + | - | * | /
          Same number of open and closed
           parenthesis cannot be represented by RE
5
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Chomsky Hierarchy
 Context Free Languages are strictly more powerful than
  Regular Expressions, BUT, Regular Expressions are way faster
  to recognize, so
   Regular Expressions are used to create tokens, the leafs of the
    syntax tree, while Context Free grammars build the syntax tree
 Chomsky Hierarchy:
   Type-3: Regular Languages (Regex) – implemented by Finite Automata
     (called Lexer, Scanner, Tokenizer)
    Type-2: Context-Free Languages - Pushdown Automata (called Parsers)
    Type-1: Context-Sensitive Language
    Type-0: Unrestricted Language - Turing Machine
 Types 0 and 1 are not for practical use in defining programming languages
 Type 2, for very restricted practical use (O(N3) in the worst case)
 Type 3 are fast (linear time to recognize tokens), but not expressive enough
6
  for most languages        (c) Paul Fodor (CS Stony Brook) and Elsevier
Context-Free Grammars (CFG)
     Backus–Naur Form (BNF) notation for CFG:
    expr → id | number | - expr | ( expr ) | expr op expr
    op → + | - | * | /
       Each of the rules in a CFG is known as a production
       The symbols on the left-hand sides of the productions are
       nonterminals (or variables)
     A CFG consists of:
       a set of terminals/tokens T (that cannot appear on the
        left-hand side of any production)
       a set of non-terminals N
       a non-terminal start symbol S, and
7      a set of productions
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
Context-Free Grammars (CFG)
     John Backus was the inventor of Fortran (won the
      ACM Turing Award in 1977)
     John Backus and Peter Naur used the BNF form for
      Algol
       Peter Naur also won the ACM Turing Award in 2005 for
       Report on the Algorithmic Language ALGOL 60
     BNF was named by Donald Knuth
8
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
Context-Free Grammars (CFG)
     The Kleene star * and meta-level parentheses of regular
      expressions do not change the expressive power of the
      notation
           id_list → id ( , id )*
      is shorthand for
           id_list → id id_list_tail
           id_list_tail → , id id_list_tail
           id_list_tail → ε
      or the left-recursive version
           id_list → id
           id_list → id_list , id
9
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
 Context-Free Grammars (CFG)
      From RE to BNF notation:
        Consider the RE: a*( b a* b )*
          Start with a*:
      As −> a As
          | ε
      Same with ( b a* b )*. It is:
      S −> b As b S
          | ε
      Now you concatenate them into a single non-terminal:
      G −> As S
10
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 Context-Free Grammars (CFG)
 Derivations and Parse Trees: A context-free grammar shows us how to
     generate a syntactically valid string of terminals
     1. Begin with the start symbol
     2. Choose a production with the start symbol on the left-hand side;
         replace the start symbol with the right-hand side of that production
     3. Now choose a nonterminal A in the resulting string, choose a
         production P with A on its left-hand side, and replace A with the
         right-hand side of P
        Repeat this process until no non-terminals remain
           The replacement strategy named right-most derivation chooses
            at each step to replace the right-most nonterminal with the right-
            hand side of some production
            o There are many other possible derivations, including left-most
               and options in between.
11
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
 Context-Free Grammars (CFG)
      Example: we can use our grammar for expressions to generate the
       string “slope * x + intercept”:
          expr ⇒ expr op expr
                                                        Grammar:
          ⇒ expr op id                                expr → id | number
          ⇒ expr + id                                    | - expr | ( expr )
          ⇒ expr op expr + id
                                                         | expr op expr
          ⇒ expr op id + id
          ⇒ expr * id + id                            op → + | - | * | /
          ⇒ id * id + id
          ⇒ id(slope)* id(x)+ id(intercept)
       Notes: The ⇒ metasymbol is often pronounced “derives”
        A series of replacement operations that shows how to derive a string of terminals
         from the start symbol is called a derivation
        Each string of symbols along the way is called a sentential form
        The final sentential form, consisting of only terminals, is called the yield of the
12       derivation             (c) Paul Fodor (CS Stony Brook) and Elsevier
Derivations and Parse Trees
  We can represent a derivation graphically as a parse tree
      The root of the parse tree is the start symbol of the
       grammar
      The leaves are its yield
      Each node with its
     children represent a
     production
  E.g., The parse tree for the
     expression grammar for
     3 + 4 * 5 is:
13
                        (c) Paul Fodor (CS Stony Brook) and Elsevier
Derivations and Parse Trees
  The example grammar is ambiguous (it can generate
     multiple parse trees for 3+4*5): one corresponds to
     3+(4*5) and one corresponds to (3+4)*5
                               Grammar:
                             expr → id | number
                                   | - expr | ( expr )
                                   | expr op expr
                             op → + | - | * | /
14
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
Context free grammars
  A better version of our expression grammar should include
     precedence and associativity:
           expr → term | expr add_op term
           term → factor | term mult_op factor
           factor → id | number | - factor | ( expr )
           add_op → + | -
           mult_op → * | /
15
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
Context free grammars
      Parse tree for expression grammar for 10 - 4 - 3
19
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  Construction of an NFA equivalent to a given regular
     expression: cases
20
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  Construction of an NFA equivalent to the regular
     expression d* ( .d | d. ) d*
21
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  Construction of an NFA equivalent to the regular
     expression d* ( .d | d. ) d*
22
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  From an NFA to a DFA:
      Reason: With no way to “guess” the right transition to take
      from any given state, any practical implementation of an
      NFA would need to explore all possible transitions
      concurrently or via backtracking
      We can instead build a DFA from that NFA:
        The state of the DFA after reading any input will be the set
         of states that the NFA might have reached on the same input
          Our example: Initially, before it consumes any input, the NFA
           may be in State 1, or it may make epsilon transitions to
           States 2, 4, 5, or 8
            o We thus create an initial State A for our DFA to represent
23
              this set: 1,2,4,5,8
                        (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
      On an input of d, our NFA may move from State 2 to State 3,
       or from State 8 to State 9.
        It has no other transitions on this input from any of the states
         in A.
        From State 3, however, the NFA may make epsilon transitions
         to any of States 2, 4, 5, or 8.
        We therefore create DFA State B: 2, 3, 4, 5, 8, 9
      On a .,our NFA may move from State 5 to State 6
        There are no other transitions on this input from any of the
         states in A, and there are no epsilon transitions out of State 6.
        We therefore create the singleton DFA State C: 6
      We continue the process until we find all the states and
       transitions in the DFA (it is a finite process –Why?)
24
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  The DFA equivalent to our previous NFA:
25
                  (c) Paul Fodor (CS Stony Brook) and Elsevier
Scanner code (usually realized as nested if/case statements)
      Suppose we are building an ad-hoc (hand-written) scanner for a
       Calculator:
       assign → :=
       plus → +
       minus → -
       times → *
       div → /
       lparen → (
       rparen → )
       id → letter ( letter | digit )*
       number → digit digit *
           | digit * ( . digit | digit . ) digit *
       comment → /* ( non-* | * non-/ )* */
                | // ( non-newline )* newline
26
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
      We read the characters one at a time with look-ahead
       skip any initial white space (spaces, tabs, and newlines)
     if cur_char ∈ {‘(’, ‘)’, ‘+’, ‘-’, ‘*’}
            return the corresponding single-character token
     if cur_char = ‘:’
            read the next character
            if it is ‘=’ then return assign else announce an error
     if cur_char = ‘/’
            peek at the next character
            if it is ‘*’ or ‘/’
              read additional characters until “*/” or newline
                    is seen, respectively
            jump back to top of code
     else return div
27
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
     if cur_char = .
         read the next character
         if it is a digit
                read any additional digits
                return number
         else announce an error
     if cur_char is a digit
         read any additional digits and at most one decimal point
         return number
     if cur_char is a letter
         read any additional letters and digits
         check to see whether the resulting string is read or
     write
         if so then return the corresponding token
         else return id
     else announce an error
28
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  Pictorial
     representation of
     a scanner for
     calculator
     tokens, in the
     form of a finite
     automaton
29
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  We run the machine over and over to get one
     token after another
      Nearly universal rule:
       always take the longest possible token from the input
        thus foobar is foobar and never f or foo or
        foob
       more to the point, 3.14159 is a real constant and
        never 3, ., and 14159
30
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
 The rule about longest-possible tokens means you return only
   when the next character can't be used to continue the current
   token
    the next character will generally need to be saved for the
     next token
  In some cases, you may need to peek at more than one
   character of look-ahead in order to know whether to
   proceed
    In Pascal, for example, when you have a 3 and you a see a dot
       do you proceed (in hopes of getting 3.14)? or
       do you stop (in fear of getting 3..5)? (declaration of
        arrays in Pascal, e.g., “array [1..6] of
31      Integer”)     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Scanning
  Writing a pure DFA as a set of nested case
   statements is a surprisingly useful
   programming technique
   use perl, awk, sed
  Table-driven DFA is what lex and
   scangen produce
   lex (flex) in the form of C code
   scangen in the form of numeric tables
32   and a separate driver
                (c) Paul Fodor (CS Stony Brook) and Elsevier
 Perl-style Regexp
      Learning by examples:
        abcd - concatenation
        a(b|c)d - grouping
        a(b|c)*d - can apply a number of repeats to char or group
           ? = 0-1
           * = 0-inf
           + = 1-inf
        [bc] - character class
        [a-zA-Z0-9_] - ranges
        . - matches any character.
        \a - alpha
        \d - numeric
        \w - word (alpha, num, _)
        \s - whitespace
33
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
 Perl-style Regexp
     Learning by examples:
       How do we write a regexp that matches
       floats?
       digit*(.digit|digit.)digit*
\d*(\.\d|\d \.)\d*
34
               (c) Paul Fodor (CS Stony Brook) and Elsevier
 Parsing
      The parser calls the scanner to get the tokens,
       assembles the tokens together into a syntax
       tree, and passes the tree (perhaps one
       subroutine at a time) to the later phases of the
       compiler (this process is called syntax-directed
       translation).
      Most use a context-free grammar (CFG)
35
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 Parsing
      It turns out that for any CFG we can create
      a parser that runs in O(n3) time (e.g.,
      Earley’s algorithm and the Cocke-Younger-
      Kasami (CYK) algorithm)
      O(n3) time is clearly unacceptable for a
        parser in a compiler - too slow even for a
        program of 100 tokens (~1,000,000
        cycles)
36
                  (c) Paul Fodor (CS Stony Brook) and Elsevier
 Parsing
      Fortunately, there are large classes of grammars
      for which we can build parsers that run in linear
      time
       The two most important classes are called
        LL and LR
         LL stands for Left-to-right, Leftmost derivation
           Leftmost derivation - work on the left side of the parse tree
         LR stands for Left-to-right, Rightmost derivation
           Rightmost derivation - work on the right side of the tree
       LL parsers are also called 'top-down', or 'predictive' parsers
       LR parsers are also called 'bottom-up', or 'shift-reduce' parsers
37
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
Top-down parsing (LL)
 Consider a grammar for a comma
 separated list of identifiers,
 terminated by a semicolon:
 id_list → id id_list_tail
 id_list_tail → , id id_list_tail
 id_list_tail → ;
43
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Example (the average program):
                  read A
                  read B
                  sum := A + B
                  write sum
                  write sum / 2 $$
      We keep a stack of non-terminals with the start
       symbol inserted
      We start at the top and predict needed productions on
       the basis of the current "left-most" non-terminal in
44
       the tree and the current input token
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Table-driven LL parsing: you have a big loop in
       which you repeatedly look up an action in a
       two-dimensional table based on current leftmost
       non-terminal and current input token
      The actions are:
       (1) match a terminal
       (2) predict a production
          OR
       (3) announce a syntax error
45
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      First, unfold the
      production rules to
      collect for each
      production the
      possible tokens that
      could start it
46
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
 Construct the prediction table:
     for each possible input token and
     the left-most nonterminal, what is
     the possible production rule that
     will be used?
      The non-terminal will be "used",
       while the RHS of the production is
       added to the stack.
47
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
 LL(1) parse table for parsing for
     calculator language
          read A
          read B
          sum := A + B
          write sum
          write sum / 2 $$
48
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
49
     (c) Paul Fodor (CS Stony Brook) and Elsevier
50
     (c) Paul Fodor (CS Stony Brook) and Elsevier
Parse tree for the average program
51
               (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Problems trying to make a grammar LL(1)
        left recursion
          example:
       id_list      → id_list , id
       id_list      → id
        we can get rid of all left recursion
         mechanically in any grammar
       id_list                      → id id_list_tail
       id_list_tail                 → , id id_list_tail
       id_list_tail                 → ε
52
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Problems trying to make a grammar LL(1)
        common prefixes
          example:
         stmt → id := expr
                  | id ( arg_list )
          we can eliminate left-factor mechanically =
           "left-factoring”
         stmt → id id_stmt_tail
         id_stmt_tail → := expr
                            | ( arg_list)
53
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Eliminating left recursion and common prefixes still does NOT
       make a grammar LL
        there are infinitely many non-LL LANGUAGES, and the
         mechanical transformations work on them just fine
      Problems trying to make a grammar LL(1)
        the"dangling else" problem prevents grammars from being
         LL(1) (or in fact LL(k) for any k)
          the following natural (Pascal) grammar fragment is ambiguous:
         stmt → if cond then_clause else_clause
                   | other_stuff
         then_clause → then stmt
         else_clause → else stmt | ε
         Example String: “if C1 then if C2 then S1 else S2”
         Ambiguity: the else can be paired with either if then!!!
54
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Desired effect: pair the else with the nearest then.
      The less natural grammar fragment:
       stmt → balanced_stmt | unbalanced_stmt
       balanced_stmt → if cond then balanced_stmt
                                   else balanced_stmt
                           | other_stuff
       unbalanced_stmt → if cond then stmt
                           | if cond then balanced_stmt
                                   else unbalanced_stmt
      A balanced_stmt is one with the same number of
       thens and elses.
      An unbalanced_stmt has more thens.
55
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      The usual approach, whether top-down OR
      bottom-up, is to use the ambiguous grammar
      together with a disambiguating rule that says:
       else goes with the closest then or
       more generally, the first of two possible
        productions is the one to predict (or reduce)
        stmt → if cond then_clause else_clause
              | other_stuff
        then_clause → then stmt
        else_clause → else stmt | ε
56
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      Better yet, languages (since Pascal) generally employ
       explicit end-markers, which eliminate this
       problem.
      In Modula-2, for example, one says:
           if A = B then
                if C = D then E := F end
           else
                G := H
           end
      Ada says 'end if'; other languages say 'fi'
57
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 LL Parsing
      One problem with end markers is that they tend to bunch up.
       In Pascal you say
           if A = B then …
           else if A = C then …
           else if A = D then …
           else if A = E then …
           else ...;
        With end markers this becomes
           if A = B then …
           else if A = C then …
           else if A = D then …
           else if A = E then …
           else ...;
           end; end; end; end; end; end; …
58
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      LR parsers are almost always table-driven:
       like a table-driven LL parser, an LR parser uses a
        big loop in which it repeatedly inspects a two-
        dimensional table to find out what action to take
       unlike the LL parser, however, the LR driver has
        non-trivial state (like a DFA), and the table is
        indexed by current input token and current
        state
         also the stack contains a record of what has been
59
         seen SO FAR (NOT what is expected)
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      LR keeps the roots of its partially
      completed subtrees on a stack
       When it accepts a new token from the scanner, it
        shifts the token into the stack
       When it recognizes that the top few symbols on
        the stack constitute a right-hand side, it reduces
        those symbols to their left-hand side by popping
        them off the stack and pushing the left-hand side
        in their place
60
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing                                          id_list → id id_list_tail
                                                     id_list_tail → , id id_list_tail
                                                     id_list_tail → ;
      Rightmost (canonical) derivation for the
      identifiers grammar:
61
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      LR(1) grammar for the calculator language:
62
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      Example (the average program):
               read A
               read B
               sum := A + B
               write sum
               write sum / 2                                       $$
63
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      When we begin execution, the parse stack is
      empty and we are at the beginning of the
      production for program:
        program → . stmt_list $$
       When augmented with a ., a production is called an LR item
       This original item (program → . stmt_list $$)
        is called the basis of the list.
64
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      Since the . in this item is immediately in front of
      a nonterminal—namely stmt_list —we
      may be about to see the yield of that
      nonterminal coming up on the input.
      program → . stmt_list $$
      stmt_list → . stmt_list stmt
      stmt_list → . stmt
65
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      Since stmt is a nonterminal, we may also be at
      the beginning of any production whose left-hand
      side is stmt:
      program → . stmt_list $$
      stmt_list → . stmt_list stmt
      stmt_list → . stmt
      stmt → . id := expr
      stmt → . read id
      stmt → . write expr
       The additional items to the basis are its closure.
66
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      Our upcoming token is a read
        Once we shift it onto the stack, we know we are
        in the following state:
       stmt → read . id
        This state has a single basis item and an empty closure—the .
        precedes a terminal.
      After shifting the A, we have:
       stmt → read id .
67
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      We now know that read id is the handle,
      and we must reduce.
       The reduction pops two symbols off the parse
       stack and pushes a stmt in their place
       Since one of the items in State 0 was
      stmt_list → . stmt
      we now have
      stmt_list → stmt .
      Again we must reduce: remove the stmt from
68
      the stack and push a stmt_list in its place.
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 LR Parsing
      Our new state:
      program → stmt_list . $$
      stmt_list → stmt_list . stmt
      stmt → . id := expr
      stmt → . read id
      stmt → . write expr
69
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
70
     (c) Paul Fodor (CS Stony Brook) and Elsevier
71
     (c) Paul Fodor (CS Stony Brook) and Elsevier
72
     (c) Paul Fodor (CS Stony Brook) and Elsevier
73
     (c) Paul Fodor (CS Stony Brook) and Elsevier
     Table entries indicate whether to shift (s), reduce (r), or shift and
     then reduce (b). The accompanying number is the new state
     when shifting, or the production that has been recognized when
     (shifting and) reducing
74
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
     Driver for a table-driven LR(1) parser
75
                 (c) Paul Fodor (CS Stony Brook) and Elsevier
76
     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Parsing summary
      A scanner is a DFA
       it can be specified with a state diagram
      An LL or LR parser is a PDA (push down automata)
       a PDA can be specified with a state diagram and a
        stack
         the state diagram looks just like a DFA state diagram,
          except the arcs are labeled with <input symbol,
          top-of-stack symbol> pairs, and in addition to
          moving to a new state the PDA has the option of pushing
          or popping a finite number of symbols onto/off the stack
       Early's algorithm does NOT use PDAs, but dynamic
77      programming    (c) Paul Fodor (CS Stony Brook) and Elsevier
 Actions
      We can run actions when a rule triggers:
      Often used to construct an AST for a
       compiler.
      For simple languages, can interpret code
       directly
      We can use actions to fix the Top-Down
       Parsing problems
78
                  (c) Paul Fodor (CS Stony Brook) and Elsevier
 Programming
      A compiler-compiler (or parser generator, compiler generator) is a
       programming tool that creates a parser, interpreter, or
       compiler from some form of formal description of a language
       and machine
        the input is a grammar (usually in BNF) of a programming
         language
        the generated output is the source code of a parser
      Examples of parser generators:
        classical parsing tools: lex, Yacc, bison, flex, ANTLR
        PLY: python implementation of lex and yacc
        Python TPG parser
        ANTLR for python
79
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
 Classic Parsing Tools
     lex - original UNIX Lexical analysis (tokenizing) generator
        create a C function that will parse input according to a set of regular
         expressions
     yacc -Yet Another Compiler Compiler (parsing)
        generate a C program for a parser from BNF rules
                                                                                          Parsed
     Input       yylex()                                                  yyparse()       Input
                                                token
      char        C code                       stream                        C code
     stream
              Lexical Analyzer                                               Parser
              (Tokenizer)
81
                           (c) Paul Fodor (CS Stony Brook) and Elsevier
 Lex and Yacc the big picture
82
          (c) Paul Fodor (CS Stony Brook) and Elsevier
 Lex Example
     /* lexer.l */
     %{
     #include “header.h”
     int lineno = 1;
     %}
     %%
     [ \t]* ; /* Ignore whitespace */
     \n { lineno++; }
     [0-9]+ { yylval.val = atoi(yytext);
                  return NUMBER; }
     [a-zA-Z_][a-zA-Z0-9_]* { yylval.name = strdup(yytext);
                                return ID; }
     \+ { return PLUS; }
     - { return MINUS; }
     \* { return TIMES; }
     \/ { return DIVIDE; }
     = { return EQUALS; }
     %%
83
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Yacc Example
     /* parser.y */
     %{
     #include “header.h”
     %}
     %union {
             char *name;
             int val;
     }
     %token PLUS MINUS TIMES DIVIDE EQUALS
     %token<name> ID;
     %token<val> NUMBER;
     %%
     start : ID EQUALS expr;
     expr : expr PLUS term
             | expr MINUS term
             | term
             ;
     ...
84
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 Bison Overview
                                                The programmer puts BNF rules and
     myparser.y
                                                token rules for the parser he wants in a
     BNF rules and actions for                  bison source file myparser.y
     your grammar.
                                                run bison to create a C program (*.tab.c)
                                                containing a parser function.
     > bison myparser.y                         The programmer must also supply a
                                                tokenizer named yylex( )
     myparser.tab.c                       yylex.c
     parser source code                   tokenizer function in C
                     myprog
                     executable program
85
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY
      PLY: Python Lex-Yacc = an implementation of
       lex and yacc parsing tools for Python by David
       Beazley: http://www.dabeaz.com/ply/
      A bit of history:
       Yacc : ~1973. Stephen Johnson (AT&T)
       Lex : ~1974. Eric Schmidt and Mike Lesk (AT&T)
       PLY: 2001
86
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY
      PLY is not a code generator
      PLY consists of two Python modules
       ply.lex = A module for writing lexers
            Tokens specified using regular expressions
            Provides functions for reading input text
       ply.yacc = A module for writing grammars
      You simply import the modules to use them
          The grammar must be in a file
87
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY
      ply.lex example:
       import ply.lex as lex
       tokens = [ ‘NAME’,’NUMBER’,’PLUS’,’MINUS’,’TIMES’,
        ’DIVIDE’, EQUALS’ ]
       t_ignore = ‘ \t’
       t_PLUS = r’\+’
       t_MINUS = r’-’
       t_TIMES = r’\*’
       t_DIVIDE = r’/’
       t_EQUALS = r’=’
       t_NAME = r’[a-zA-Z_][a-zA-Z0-9_]*’
       def t_NUMBER(t):
           r’\d+’
           t.value = int(t.value)
           return t
88
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY
      Two functions: input() and token()
       lex.lex() # Build the lexer
       ...
       lex.input("x = 3 * 4 + 5 * 6")
       while True:
            tok = lex.token()
            if not tok: break
# Use token
89
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY
     import ply.yacc as yacc
     import mylexer                                      # Import lexer information
     tokens = mylexer.tokens                             # Need token list
     def p_assign(p):
          '''assign : NAME EQUALS expr'''
     def p_expr(p):
          '''expr : expr PLUS term
                 | expr MINUS term
                 | term'''
     def p_term(p):
          '''term : term TIMES factor
                 | term DIVIDE factor
                 | factor'''
     def p_factor(p):
          '''factor : NUMBER'''
91
                         (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY Calculator Example
92
         (c) Paul Fodor (CS Stony Brook) and Elsevier
Build a parse tree using tuples
93
          (c) Paul Fodor (CS Stony Brook) and Elsevier
94
     (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY Precedence Specifiers
      Precedence Specifiers (most precedence at bottom):
          precedence = (
               ('left','PLUS','MINUS'),
               ('left','TIMES','DIVIDE'),
               ('nonassoc','UMINUS'),
          )
          def p_expr_uminus(p):
               'expr : MINUS expr %prec UMINUS'
               p[0] = -p[1]
          ...
95
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
 PLY Best Documentation
      Google Mailing list/group:
          http://groups.google.com/group/ply-hack
96
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
 TPG
      TGP is a lexical and syntactic parser generator
      for Python
       YACC is too complex to use in simple cases
        (calculators, configuration files, small
        programming languages, …)
       You can also add Python code directly into
        grammar rules and build abstract syntax trees
        while parsing
97
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 Python TPG Lexer
      Toy Parser Generator (TPG): http://cdsoft.fr/tpg
        Syntax:
           token <name> <regex> <function> ;
           separator <name> <regex>;
        Example:
           token integer '\d+' int;
           token float '\d+\.\d*|\.\d+' float;
           token rbrace '{';
           separator space '\s+';
98
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
 Python TPG Lexer
      Embed TPG in Python:
     import tpg
     class Calc:
      r"""
      separator spaces: '\s+' ;
      token number: '\d+' ;
      token add: '[+-]' ;
      token mul: '[*/]' ;
      """
     Try it in Python: download TGP from
99   http://cdsoft.fr/tpg
                    (c) Paul Fodor (CS Stony Brook) and Elsevier
 TPG example
       Defining the grammar:
         Non-terminal productions:
          START -> Expr ;
          Expr -> Term ( add Term )* ;
          Term -> Fact ( mul Fact )* ;
          Fact -> number | '\(' Expr '\)' ;
100
                 (c) Paul Fodor (CS Stony Brook) and Elsevier
 TPG example
      import tpg
      class Calc:
       r"""
       separator spaces: '\s+' ;
       token number: '\d+' ;
       token add: '[+-]' ;
       token mul: '[*/]' ;
       START -> Expr ;
       Expr -> Term ( add Term )* ;
       Term -> Fact ( mul Fact )* ;
       Fact -> number | '\(' Expr '\)' ;
101    """       (c) Paul Fodor (CS Stony Brook) and Elsevier
 TPG example
       Reading the input and returning values:
       separator spaces: '\s+' ;
       token number: '\d+' int ;
       token add: '[+-]' make_op;
       token mul: '[*/]' make_op;
         Transform tokens into defined operations:
            def make_op(s):
              return {
              '+': lambda x,y: x+y,
              '-': lambda x,y: x-y,
              '*': lambda x,y: x*y,
              '/': lambda x,y: x/y,
              }[s]
102
                        (c) Paul Fodor (CS Stony Brook) and Elsevier
  TPG example
 After a terminal symbol is recognized we will store it
  in a Python variable: for example to save a number in
  a variable n: number/n.
 Include Python code example:
  Expr/t -> Term/t ( add/op Term/f $t=op(t,f)$ )* ;
  Term/f -> Fact/f ( mul/op Fact/a $f=op(f,a)$ )* ;
  Fact/a -> number/a | '\(' Expr/a '\)' ;
 103
                   (c) Paul Fodor (CS Stony Brook) and Elsevier
import math                                     # Simple calculator calc.py
import operator
import string
import tpg
def make_op(s):
      return {
                '+': lambda x,y: x+y,
                '-': lambda x,y: x-y,
                '*': lambda x,y: x*y,
                '/': lambda x,y: x/y,
      }[s]
class Calc(tpg.Parser):
      r"""
      separator spaces: '\s+' ;
      token number: '\d+' int ;
      token add: '[+-]' make_op ;
      token mul: '[*/]' make_op ;
104   START/e -> Term/e ;
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
        Term/t -> Fact/t ( add/op Fact/f $ t = op(t,f) $ )* ;
        Fact/f -> Atom/f ( mul/op Atom/a $ f = op(f,a) $ )* ;
        Atom/a -> number/a | '\(' Term/a '\)' ;
        """
calc = Calc()
if tpg.__python__ == 3:
      operator.div = operator.truediv
      raw_input = input
105
                          (c) Paul Fodor (CS Stony Brook) and Elsevier
#!/usr/bin/env python
# Larger example: scientific_calc.py
import math
import operator
import string
import tpg
if tpg.__python__ == 3:
      operator.div = operator.truediv
      raw_input = input
def make_op(op):
      return {
          '+'    : operator.add,
          '-'    : operator.sub,
          '*'    : operator.mul,
          '/'    : operator.div,
          '%'    : operator.mod,
          '^'    : lambda x,y:x**y,
          '**'   : lambda x,y:x**y,
          'cos' : math.cos,
          'sin' : math.sin,
          'tan' : math.tan,
106       'acos': math.acos,
                               (c) Paul Fodor (CS Stony Brook) and Elsevier
         'asin': math.asin,
         'atan': math.atan,
         'sqr' : lambda x:x*x,
         'sqrt': math.sqrt,
         'abs' : abs,
         'norm': lambda x,y:math.sqrt(x*x+y*y),
      }[op]
class Calc(tpg.Parser, dict):
      r"""
         separator space '\s+' ;
         token pow_op    '\^|\*\*' $ make_op
         token add_op    '[+-]'           $ make_op
         token mul_op    '[*/%]'          $ make_op
         token funct1    '(cos|sin|tan|acos|asin|atan|sqr|sqrt|abs)\b' $ make_op
         token funct2    '(norm)\b' $ make_op
         token real      '(\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|\d+[eE][-+]?\d+'
                  $ float
         token integer   '\d+' $ int
         token VarId     '[a-zA-Z_]\w*'
          ;
107
                              (c) Paul Fodor (CS Stony Brook) and Elsevier
      START/e ->
              'vars'                               $ e=self.mem()
          |   VarId/v '=' Expr/e                   $ self[v]=e
          |   Expr/e
      ;
      Var/$self.get(v,0)$ -> VarId/v ;
      Expr/e -> Term/e ( add_op/op Term/t                         $ e=op(e,t)
                       )*
      ;
      Term/t -> Fact/t ( mul_op/op Fact/f                         $ t=op(t,f)
                       )*
      ;
      Fact/f ->
              add_op/op Fact/f                                    $ f=op(0,f)
          |   Pow/f
      ;
      Pow/f -> Atom/f ( pow_op/op Fact/e                          $ f=op(f,e)
                       )?
      ;
108
                            (c) Paul Fodor (CS Stony Brook) and Elsevier
            Atom/a ->
                    real/a
                |   integer/a
                |   Function/a
                |   Var/a
                |   '\(' Expr/a '\)'
            ;
            Function/y ->
                    funct1/f '\(' Expr/x '\)'                                   $ y = f(x)
                |   funct2/f '\(' Expr/x1 ',' Expr/x2 '\)'                      $ y = f(x1,x2)
            ;
      """
      def mem(self):
            vars = sorted(self.items())
            memory = [ "%s = %s"%(var, val) for (var, val) in vars ]
            return "\n\t" + "\n\t".join(memory)
109
                                 (c) Paul Fodor (CS Stony Brook) and Elsevier
print("Calc (TPG example)")
calc = Calc()
while 1:
      l = raw_input("\n:")
      if l:
         try:
              print(calc(l))
         except Exception:
              print(tpg.exc())
      else:
         break
110
                               (c) Paul Fodor (CS Stony Brook) and Elsevier
 AntLR
      ANother Tool for Language Recognition is an LL(k)
       parser and translator generator tool
       which can create
         lexers
         parsers
         abstract syntax trees (AST’s)
      in which you describe the language grammatically
      and in return receive a program that can recognize and
        translate that language
111
                       (c) Paul Fodor (CS Stony Brook) and Elsevier
      Tasks Divided
        Lexical Analysis (scanning)
        Semantic Analysis (parsing)
        Tree Generation
         Abstract Syntax Tree (AST) is a structure
           which keeps information in an easily
           traversable form (such as operator at a
           node, operands at children of the node)
          ignores form-dependent superficial details
        Code Generation
112
                     (c) Paul Fodor (CS Stony Brook) and Elsevier
  The Java Code
       The code to invoke the parser:
      import java.io.*;
      class Main {
          public static void main(String[] args) {
            try {
              // use DataInputStream to grab bytes
              MyLexer lexer = new MyLexer(
                    new DataInputStream(System.in));
              MyParser parser = new MyParser(lexer);
              int x = parser.expr();
              System.out.println(x);
            } catch(Exception e) {
              System.err.println("exception: "+e);
            }
          }
      }
113
                      (c) Paul Fodor (CS Stony Brook) and Elsevier
      Abstract Syntax Trees
       Abstract Syntax Tree: Like a parse tree, without
        unnecessary information
       Two-dimensional trees that can encode the structure of
        the input as well as the input symbols
       An AST for (3+4) might be represented as
114
                       (c) Paul Fodor (CS Stony Brook) and Elsevier