Adv.
Natural Language
                            Processing
                                                     Lecture 2 & 3
                                       Instructor: Dr. Muhammad Asfand-e-yar
MS(CS), Bahria University, Islamabad                                      Instructor: Dr. Muhammad Asfand-e-yar
      Previous Lecture
      • What is NLP?
      • Examples of NLP
      • What is Special about Languages?
      • Why it is hard for Computer to understand Human Language?
      • Where NLP lies?
      • Applications of NLP
MS(CS), Bahria University, Islamabad               Instructor: Dr. Muhammad Asfand-e-yar
      Today’s Lecture
      • Regular Expressions
      • Text Normalization
      • Edit Distance
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      Basic Text Processing
      Regular Expressions
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      ELIZA Application
              User:        You are like my father in some ways.
              ELIZA:       WHAT RESEMBLANCE DO YOU SEE
              User:        You are not very aggressive but I think you don’t want me to notice that.
              ELIZA:       WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
              User:        You don’t argue with me.
              ELIZA:       WHY DO YOU THINK I DON’T ARGUE WITH YOU
              User:        You are afraid of me.
              ELIZA:       DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
      The dialogue above is from ELIZA, an early natural language processing
      system that could carry on a limited conversation with a user by imitating the
      responses of a Rogerian psychotherapist
MS(CS), Bahria University, Islamabad                                             Instructor: Dr. Muhammad Asfand-e-yar
      ELIZA Application
      ELIZA is a surprisingly simple program that uses pattern matching to
      recognize phrases like
             “You are X”
      and translate them into suitable outputs like
             “What makes you think I am X?”.
      This simple technique succeeds in this domain because ELIZA doesn’t
      actually need to know anything to mimic a Rogerian psychotherapist.
      Eliza’s mimicry of human conversation was remarkably successful: many
      people who interacted with ELIZA came to believe that it really understood
      them and their problems, many continued to believe in ELIZA’s abilities even
      after the program’s operation was explained. Even today such chatbots are
      a fun diversion.
MS(CS), Bahria University, Islamabad                          Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Important tool for describing text patterns: the regular expression (RE)
      RE is used to specify strings that might be extracted from a document,
        For example; “You are X” in Eliza above, to defining strings like $199 or
        $24.99 for extracting tables of prices from a document.
      RE is used to turn to a set of tasks collectively called text normalization, in
      which plays an important part.
      Normalizing text means converting text to a more convenient, standard form.
         For example, most of what we are going to do with language relies on
         first separating out or tokenizing words from running text, the task of
         tokenization.
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expression (RE) is a standardized text search strings in
      computer science language for specifying text.
      RE is used in every computer language, word processor, and text
      processing tools like the Unix tools grep or Emacs.
      RE is an algebraic notation for characterizing a set of strings.
      RE is particularly useful for searching in texts, when we have a
      pattern to search for and/or a corpus of texts to search through.
MS(CS), Bahria University, Islamabad                   Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      The simplest kind of RE is a sequence of simple characters.
      To search for woodchuck, we type /woodchuck/.
         OR the expression /Buttercup/ matches any string containing
         the substring Buttercup; grep with that expression would
         return the line “I’m called little Buttercup”.
      The search string can consist of a single character (like /!/) or a
      sequence of characters (like /urgl/).
MS(CS), Bahria University, Islamabad                     Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      A formal language for specifying text strings
      How can we search for any of these?
            •   woodchuck
            •   woodchucks
            •   Woodchuck
            •   Woodchucks
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expressions are Case Sensitive;
      RE                                Example Patterns Matched
      /woodchucks/                      “interesting links to woodchucks and lemurs”
      /a/                               “Mary Ann stopped by Mona’s”
      /!/                               “You’ve left the burglar behind again!” said Nori
                              Some Simple regular expression search
MS(CS), Bahria University, Islamabad                                    Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expressions are Case Sensitive;
      RE                               Match                    Example Patterns
      /[wW]oodchuck/                   Woodchuck or woodchuck   “Woodchuck”
      /[abc]/                          ‘a’, ‘b’, or ‘c’         “Hast du gut gelernt?”
      /[1234567890]/                   any digit                “plenty cubes having 1”
                 The use of the brackets [ ] to specify a disjunction of characters.
      Therefore, it can also
      /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
      Means “any Capital Letter”
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expressions are Case Sensitive;
      RE                           Match                  Example Patterns
      /[A-Z]/                      an upper case letter   “we should call it ‘Gut Gemacht”
      /[a-z]/                      a lower case letter    “the Bean was impatient”
      /[0-9]/                      a single digit         “Kapital 1: Text Schrieben”
                 The use of the brackets [ ] to specify a disjunction of characters.
MS(CS), Bahria University, Islamabad                                   Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      The square braces can also be used to specify what a single
      character should not be used in a string by use of the caret ^.
      If the caret ^ is the first symbol after the open square brace [, the
      resulting pattern is negated.
            For example, the pattern /[^a]/ matches any single character (including
            special characters) except a.
      This is only true when the caret is the first symbol after the open
      square brace.
            If it occurs anywhere else, it usually stands for a caret symbol
MS(CS), Bahria University, Islamabad                               Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expressions with condition “Not a substring”;
      RE                               Match                          Example Patterns
      /[^A-Z]/                         not an upper case letter       “Am Himmel”
      /[^Ss]/                          neither ‘S’ nor ‘s’            “Hast du gut gelernt?”
      /[^\.]/                          not a period or dot            “our resident”
      /[e^]/                           either ‘e’ or ‘^’              “look up ^ now”
      /a^b/                            the patter ‘a^b’               “look up a^b now”
                                              Uses of caret/cap “^”
MS(CS), Bahria University, Islamabad                                      Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      How can we talk about optional elements, like an optional “s” in
      “woodchuck” and “woodchucks”?
      We can’t use the square brackets, because while the “[ ]” allow us to
      say “s or S”, they don’t allow us to say “s or nothing”.
      Therefore we use the question mark /?/, which means “the preceding
      character or nothing”.
      The question mark means that “zero or one instances of the previous
      character”.
MS(CS), Bahria University, Islamabad                       Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Regular Expressions with Optional Conditions”;
      RE                               Match                         Example Patterns
      /woodchucks?/                    Woodchuck or woodchucks       “woodchuck”
      /colou?r/                        color or colour               “colour”
                                           Uses of question mark “?”
      RE                           Match                            Example Patterns
      /beg.n/                      any character between beg “begin” , “beg’n” ,
                                   and n                            “begun”
                          Uses of the period or dot “.” to specify any character
MS(CS), Bahria University, Islamabad                                      Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      How can I distinguish between cat and dog?
      Since we can’t use the square brackets to search for “cat or dog”
         Why can’t we say /[catdog]/?
      We need a new operator, the disjunction operator, also called the pipe
      symbol |.
      The pattern /cat|dog/ matches
            • either the string cat
            • or the string dog
MS(CS), Bahria University, Islamabad                      Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Disjunction
      Woodchucks is another name for groundhog!
      • The pipe | for disjunction
        RE                              Example Pattern
        /groundhog|woodchuck/           “groundhog” or
                                        “woodchuck”
        /yours|mine/                    “yours" or “mine”
        /a|b|c/                         = [abc]
        /[gG]roundhog | [Ww]oodchuck/
MS(CS), Bahria University, Islamabad                        Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Sometimes we need to use the disjunction operator in the midst
      of a larger sequence.
      For example, suppose I want to search for information about pet
      fish for my cousin David. How can I specify both guppy and
      guppies?
      Is it possible to express the above as /guppy|ies/?
      No, because that would match only the strings guppy and ies.
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Precedence:
      This is because sequences like guppy take precedence over the
      disjunction operator |.
      To make the disjunction operator apply only to a specific pattern,
      then use the parenthesis operators ( and ).
      Therefore, the pattern /gupp(y|ies)/ would specify that we meant
      the disjunction only to apply to the suffixes y and ies.
MS(CS), Bahria University, Islamabad                   Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Operator Precedence
      This idea that one operator may take precedence over another, use parentheses to
      specify what we mean, is formalized by the operator precedence hierarchy for
      regular expressions.
      The following table gives the order of RE operator precedence, from highest
      precedence to lowest precedence.
                                 1     Parenthesis             ()
                                 2     Counters                * + ? {}
                                 3     Sequences and anchors   the ^my end$
                                 4     Disjunction             |
      Thus, because Counters have a higher precedence than Sequences
MS(CS), Bahria University, Islamabad                                          Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      For example; match repeated instances of a string.
         Column 1 Column 2 Column 3
      /Column [0-9]+ */
            Will not match any number of columns; instead, it will match a single column
            followed by any number of spaces.
            The star here applies only to the space that precedes it, not to the whole
            sequence.
      With the parentheses, the expression is;
      /(Column [0-9]+ *)*/
            to match the word Column, followed by a number and optional spaces, the
            whole pattern repeated any number of times.
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Construct a Finite State Machine for a language of certain
      sheep?
      The string that look like the following:
      baa
      baaa
      baaaa
      baaaaa
      …
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      The string that look like the following:
      baa
      baaa
      baaaa
      baaaaa
      …
                                                         Stephen C Kleene
      On the above string we will apply Kleene Closure property;
      baa* or ba+
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Kleene Closure property.
      baa* or ba+
      Therefore, it can be expressed as /baa*/ or /[baa]*/
      1. /baa*/ => means one b, one a followed by zero or more as
      2. /ba+/ => means one b followed by one a or more as
      3. /[baa]*/ => means zero or more bs or as (incorrect according to the
         given language)
      Therefore, according to language the correct expression will be /baa*/
MS(CS), Bahria University, Islamabad                      Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Kleene Closure property.
      baa* or ba+
      What will be the FSA machine?                 a
                                       b        a                ε
                      q0                   q1       q2                            q3
MS(CS), Bahria University, Islamabad                     Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Kleene Closure property.
      baa* or ba+
                                                             a
                                       b        a        a                ε
                      q0                   q1       q2       q3                            q4
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Kleene Closure property.
      baa* or ba+   L(M) = {baa, baaa, baaaa ,baaaaa ,baaaaaa, . . .}
                                                                  a
                                       b        a         a                    ε
                      q0                   q1        q2           q3                            q4
                                                b   b         b
                                   a                                           a, b
                                                    q5
MS(CS), Bahria University, Islamabad                                   Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expressions
      An FSA for the words for English numbers 1–99.
MS(CS), Bahria University, Islamabad                   Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expressions
      An FSA for the words for English numbers 1–99 in Dollars and Cents.
MS(CS), Bahria University, Islamabad                        Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expressions:
      RE=> ? , * , +
      Kleene *, Kleene +
    RE                      Matches                     Example Pattern
    /colou?r/               Optional previous char      color , colour
    /oo*h!/                 0 or more of previous char oh! , ooh! , oooh! , ooooh!
    /o+h!/                  1 or more of previous char oh! , ooh! , oooh! , ooooh!
    /baa*/                  b and 1 or more a           baa , baaa , baaaa , baaaaa
    /ba+/                   b and 1 or more a           baa , baaa , baaaa , baaaaa
MS(CS), Bahria University, Islamabad                                     Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expressions
            RE - Anchors => ^ , $
        RE                       Matches                           Example Pattern
        /^[A-Z]/                 in the beginning capital letter   “Palo Alto”
        /^[^A-Za-z]/             no alphabet in the beginning      “1” or “ ‘Hello’ ”
        /\.$/                    period “.” in the end             “The end.”
        /.$/                     any thing in the end              “The end?” or “The end!”
MS(CS), Bahria University, Islamabad                                     Instructor: Dr. Muhammad Asfand-e-yar
      Example “the”
      Find me all instances of the word “the” in a text.
            /the/
                Misses capitalized examples
            /[tT]he/
                won’t treat underscores and numbers as word (the or the25)
            /\b[tT]he\b/
                Incorrectly returns other or theology
            /[^a-zA-Z][tT]he[^a-zA-Z]/
      It is not being mentioned that “the” word should be in beginning of a line.
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Example “the”
      There are also two other anchors:        •   \b matches a word boundary
                                               •   \B matches a non-boundary
      For example, /\b99\b/ will match the string 99 in
         “There are 99 bottles of beer on the wall”
            (because 99 follows a space)
      But not 99 in
         “There are 299 bottles of beer on the wall”
            (since 99 follows a number).
      But it will match 99 in $99
         (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
MS(CS), Bahria University, Islamabad                               Instructor: Dr. Muhammad Asfand-e-yar
      Example “the”
      /[^a-zA-Z][tT]he[^a-zA-Z]/
      “the” word should be in beginning of a line or in end of a line
      /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
MS(CS), Bahria University, Islamabad                     Instructor: Dr. Muhammad Asfand-e-yar
      Example “the”
      The process we just went through was based on fixing two kinds of errors
            Matching strings that we should not have matched (there, then, other)
              False positives (Type I)
            Not matching things that we should have matched (The, the)
               False negatives (Type II)
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Example “the”
      In NLP we are always dealing with these kinds of errors.
      Reducing the error rate for an application often involves two
      antagonistic efforts:
            • Increasing accuracy or precision (minimizing false positives)
            • Increasing coverage or recall (minimizing false negatives).
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Complex Example
      Let’s try out a more significant example of the RE.
      For example; build an application to help a user buy a computer on the Web.
         • The user might want
         “any machine with more than 6 GHz and 500 GB of disk space for less
         than $1000”.
         • To do this kind of retrieval, initially analyze expressions like 6 GHz or
           500 GB or Mac or $999.99.
      In the rest of the section some simple regular expressions will be analyzed
      for this task.
MS(CS), Bahria University, Islamabad                           Instructor: Dr. Muhammad Asfand-e-yar
      Complex Example
      First, let’s complete RE for prices.
      RE for a dollar sign followed by a string of digits:
      /$[0-9]+/
            $ is for the end
      Decimal point and two digits afterwards
      /$[0-9]+\.[0-9][0-9]/
      This pattern only allows $199.99 but not $199
MS(CS), Bahria University, Islamabad                         Instructor: Dr. Muhammad Asfand-e-yar
      Complex Example
      Make the cents optional and to make sure a word boundary:
          /\b$[0-9]+(\.[0-9][0-9])?\b/
      How about specifications for processor speed?
      Here’s a pattern for that:
           /\b[0-9]+˽*(GHz|[Gg]igahertz)\b/
      Allowing optional fractions again “5.5 GB”;
            /\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
                                                             ˽ means White Spaces
MS(CS), Bahria University, Islamabad                      Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      RE /{3}/ means
      “exactly 3 occurrences of the previous character or expression”.
      Therefore,
      /a\.{24}z/
            will match a followed by 24 dots followed by z
      /a\.{24, 30}z/
            will match a followed by 24 dots OR upto 30 dots followed by z
      /a\.{24, }z/
            will match a followed by at least 24 dots followed by z
MS(CS), Bahria University, Islamabad                                         Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Substitutions
      To do substitutions put parentheses ( and ) around the first pattern and use
      the number operator \1 in the second pattern to refer back.
      Here’s how it looks:
         s/([0-9]+)/<\1>/
      The \1 will be replaced by whatever string matched the first item in
      parentheses.
      Therefore, this will match
         The bigger they were, the bigger they will be
      but not
         The bigger they were, the faster they will be
MS(CS), Bahria University, Islamabad                            Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Capture group
      The use of parentheses to store a pattern in memory is called a capture group.
      Every time a capture group is used (i.e., parentheses surround a pattern), the
      resulting match is stored in a numbered register.
      For example; match two different sets of parentheses, \2 means whatever matched
      the second capture group.
      Thus
              /the (.*)er they (.*), the \1er we \2/
                       The faster they ran, the faster we ran
      but not
                       The faster they ran, the faster we ate.
      Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Occasionally parentheses are used for grouping, but don’t want to capture the
      resulting pattern in a register.
      In that case we use a non-capturing group, which is specified by putting the
      commands ?: after the open parentheses, in the form (?: pattern ).
      For example;
              /(?:some|a few) (people|cats) like some \1/
      will match
              some cats like some people
      but not
              some people like some a few
      Hence the above example means that (?: some|a few) are not selected.
MS(CS), Bahria University, Islamabad                               Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
                 The faster they ran, the faster we ran
                 The faster they walk, the faster we walk
                 The slower they ran, the slower we ran
                 The slower they walk, the slower we walk
      How can you make a regular expression of above example?
                 s/The (.*)er they (.*), the \1 we \2/
MS(CS), Bahria University, Islamabad                        Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
                 The faster they ran, the slower we ran
                 The slower they ran, the faster we ran
      How can you make a regular expression of above example?
                 s/The (faster|slower) they (.*), the \1 we (.*)/
      OR
                 s/The (faster|slower) they ran, the \1 we ran/
MS(CS), Bahria University, Islamabad                            Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression (ELIZA)
      Substitutions and capture groups are very useful in implementing
      simple chatbots like ELIZA
                   User 1:             Men are all alike.
                   ELIZA1:             IN WHAT WAY
                   User 2:             They’re always bugging us about something or other.
                   ELIZA2:             CAN YOU THINK OF A SPECIFIC EXAMPLE
                   User 3:             Well, my friend made me come here.
                   ELIZA3:             YOUR FRIEND MADE YOU COME HERE
                   User 4:             He says I’m depressed much of the time.
                   ELIZA4:             I AM SORRY TO HEAR YOU ARE DEPRESSED
MS(CS), Bahria University, Islamabad                                               Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression (ELIZA)
                   User 1:             Men are all alike.
                   ELIZA1:             IN WHAT WAY
                   User 2:             They’re always bugging us about something or other.
                   ELIZA2:             CAN YOU THINK OF A SPECIFIC EXAMPLE
                   User 3:             Well, my friend made me come here.
                   ELIZA3:             YOUR FRIEND MADE YOU COME HERE
                   User 4:             He says I’m depressed much of the time.
                   ELIZA4:             I AM SORRY TO HEAR YOU ARE DEPRESSED
      ELIZA works by having a series or cascade of regular expression substitutions
      each of which matches and changes some part of the input lines.
      The first substitutions change all instances of my to YOUR, and I’m to YOU ARE,
      and so on.
MS(CS), Bahria University, Islamabad                                               Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression (ELIZA)
                   User 1:             Men are all alike.
                   ELIZA1:             IN WHAT WAY
                   User 2:             They’re always bugging us about something or other.
                   ELIZA2:             CAN YOU THINK OF A SPECIFIC EXAMPLE
                   User 3:             Well, my friend made me come here.
                   ELIZA3:             YOUR FRIEND MADE YOU COME HERE
                   User 4:             He says I’m depressed much of the time.
                   ELIZA4:             I AM SORRY TO HEAR YOU ARE DEPRESSED
                s/.* all .*/IN WHAT WAY/
                s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
                s/.* I’[mM] (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
                s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
MS(CS), Bahria University, Islamabad                                               Instructor: Dr. Muhammad Asfand-e-yar
     Regular Expression (ELIZA)
               User 1:       Men are all alike.
               ELIZA1:       IN WHAT WAY
               User 2:       They’re always bugging us about something or other.
               ELIZA2:       CAN YOU THINK OF A SPECIFIC EXAMPLE
               User 3:       Well, my friend made me come here.
               ELIZA3:       YOUR FRIEND MADE YOU COME HERE
               User 4:       He says I’m depressed much of the time.
               ELIZA4:       I AM SORRY TO HEAR YOU ARE DEPRESSED
                What will be the RE of given sentence in selected dialog?
                1) s/(my|your) friend made (me|you) come here/\1 friend made \2 come here/
                2) s/(my|your) (.*) (me|you) (.*)/\1 \2 \3 \4/
                                     Now more examples:
                                     User 6: Ahhh, my wrist watch made me crazy.
MS(CS), Bahria University, Islamabad                                       Instructor: Dr. Muhammad Asfand-e-yar
                                     User 7: Hmm, my car drive me cool.
      Regular Expression
      Lookahead
      Finally, there will be times to predict the future: lookahead in the text
      to see
            • if some pattern matches,
            • but not advance the match cursor,
            • Therefore, the pattern should occur
      These lookahead assertions make use of the “?” syntax that check in
      the previous section for non-capture groups.
      (?= pattern) is true if pattern occurs, but is zero-width, i.e. the match
      pointer doesn’t advance the cursor.
      (?! pattern) only returns true if a pattern does not match, but again is
      zero-width and doesn’t advance the cursor.
MS(CS), Bahria University, Islamabad                         Instructor: Dr. Muhammad Asfand-e-yar
      Regular Expression
      Negative Look ahead
      Negative lookahead is commonly used when parsing some complex
      pattern but want to rule out a special case.
      For example;
      to match, at the beginning of a line, any single word that doesn’t
      start with “Volcano”.
                                           RE: In the beginning
      Negative lookahead to do this:                                   Negative Look ahead
                                           RE: Look ahead
                 /(^?!Volcano)[A-Za-z]+/
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
      Look ahead:
      How to write a RE, if to match a word “But”, at the beginning of a
      line?
                            Solution:   /(^?=But)[A-Za-z]+/
MS(CS), Bahria University, Islamabad                          Instructor: Dr. Muhammad Asfand-e-yar
      Summary
      Regular expressions play a surprisingly large role
            • Sophisticated sequences of regular expressions are often the first
              model for any text processing text
      For many hard tasks, machine learning classifiers are used
            • But regular expressions are used as features in the classifiers
            • Can be very useful in capturing generalizations
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
    Exercise
    Write regular expressions for the following languages.
    1. the set of all alphabetic strings;
        /[A-Za-z]+/
     2. the set of all lower case alphabetic strings ending in a “b”;
             /[a-z]+b/
     3. the set of all strings from the alphabet “a, b” such that each a is
           immediately preceded by bs and immediately followed by a b;
             /b+ab/
             OR
             /bb*ab/
MS(CS), Bahria University, Islamabad                     Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
      Write the regular expression for the following FSA
         Solution:
                /aba(ab)* | ab(ab)*/
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
      Write the set of Strings for the following FSA
                                                    one
                   the                      one            very          old               sheep
        q0                         q1                q2             q3            q4                      q5
                             the
                                        Solution:
                                               the one very old sheep
                                               the very old sheep
MS(CS), Bahria University, Islamabad           the one old sheep          Instructor: Dr. Muhammad Asfand-e-yar
      Basic Text Processing
      Word tokenization
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
    Words and Corpora
     Corpus (plural = Corpora)
                A collection of written or spoken material stored on a computer and
                used to find out how language is used.
     Example:
     He stepped out into the hall, was delighted to encounter a water
     brother.
     This sentence has
     • 13 words if we don’t count punctuation marks as words,
     • 15 if we count punctuation.
           Whether we treat period (“.”), comma (“,”), and so on as words
           depends on the task.
     Punctuation is critical for finding boundaries of things (commas, periods,
     colons) and for identifying some aspects of meaning (question marks,
     exclamation
MS(CS),                        marks, quotation marks).
        Bahria University, Islamabad                            Instructor: Dr. Muhammad Asfand-e-yar
      Words and Corpora
      Spoken Sentences
      For example:
      An utterance is the spoken correlate of a sentence:
            I do uh main- mainly business data processing
      The given is the spoken sentence (i.e. utterance) and has two
      kinds of disfluencies.
            1. The broken-off word main- is called a fragment.
            2. Words like uh and um are called fillers or filled pauses.
      Should we consider these to be words?
      Again, it depends on the application.
            If we are building a speech transcription system, we might want to
            eventually strip out the disfluencies.
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Words and Corpora
      What is the reason to use fillers and fragments in spoken
      speech?
      Disfluencies like uh or um are actually helpful in speech
      recognition in predicting the upcoming word,
            because they may signal that the speaker is restarting the clause or idea,
      Therefore for speech recognition they are treated as regular
      words.
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Words and Corpora
      How about inflected forms like cats versus cat?
      These two words have the same lemma cat but are different
      wordforms.
      Lemma
      A lemma is a set of lexical forms having the same stem, the same
      major part-of-speech, and the same word sense.
      Wordform
      The wordform is the full inflected or derived form of the word.
      For morphologically complex languages like Arabic, we often need
      to deal with lemmatization.
MS(CS), Bahria University, Islamabad                    Instructor: Dr. Muhammad Asfand-e-yar
      Words and Corpora
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      Words and Corpora
      For many tasks in English, however, wordforms are sufficient
      How many words are there in English?
      To answer this question we need to word type distinguish two
      ways of talking about words.
      Types are the number of distinct words in a corpus; if the set of
      words in the vocabulary is V, the number of types is the word
      token vocabulary size |V|.
      Tokens are the total number N of running words.
MS(CS), Bahria University, Islamabad                    Instructor: Dr. Muhammad Asfand-e-yar
      How many words?
      I do uh main- mainly business data processing
         Fragments (i.e. main-), filled pauses (i.e. uh)
      Seuss’s cat in the hat is different from other cats!
        Lemma: same stem, part of speech, rough word sense
           cat and cats = same lemma
        Wordform: the full inflected surface form
           cat and cats = different wordforms
MS(CS), Bahria University, Islamabad                       Instructor: Dr. Muhammad Asfand-e-yar
       How many words?
    they lay back on the San Francisco grass and looked at the stars and their
   • Type: an element of the vocabulary.
   • Token: an instance of that type in running text.
   How many Tokens or types in given example?
       • 15 tokens (or 14)
14 => they lay back on the San Francisco grass and looked at the stars and their
       • 13 types (or 12) (or 11?)
13 => they lay back on the San Francisco grass and looked at the stars and their
12 => they lay back on the San Francisco grass and looked at the stars and their
11 => they lay back on the San Francisco grass and looked at the stars and their
 MS(CS), Bahria University, Islamabad                    Instructor: Dr. Muhammad Asfand-e-yar
      Text Normalization
      Every NLP task needs to do text normalization:
              1.     Segmenting/tokenizing words in running text
              2.     Normalizing word formats
              3.     Segmenting sentences in running text
MS(CS), Bahria University, Islamabad                           Instructor: Dr. Muhammad Asfand-e-yar
                                                Heap’s Law: VR(n) = knβ
      How many words?                           Heaps' law (also called Herdan's law) describes the
                                                number of distinct words in a document (or set of
                                                documents) as a function of the document length (so
      N = number of tokens                      called type-token relation).
      V = vocabulary = set of types             The formula is given:
                                                1) V is the number of distinct words in an instance
            |V| is the size of the vocabulary        text of size n.
                                                2) k and β are free parameters determined various
                                                     experiments.
                                                With English text corpora, typically k is between 10
                                                and 100, and β is between 0.4 and 0.6.
                                                     Tokens = N         Types = |V|
        Switchboard phone conversations              2.4 million        20 thousand
        Shakespeare                                  884,000            31 thousand
        Google N-grams
MS(CS), Bahria University, Islamabad                 1 trillion         13    million
                                                                        Instructor: Dr. Muhammad Asfand-e-yar
      The Tokenization in UNIX
      (Inspired by Ken Church’s UNIX for Poets.)
      Given a text file, output the word tokens and their frequencies
      tr -sc ’A-Za-z’ ’\n’ < shakes.txt     Change all non-alpha to newlines
           | sort              Sort in alphabetical order
           | uniq –c           Merge and count each type
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      The Tokenization in UNIX
      tr -sc ’A-Za-z’ ’\n’ < shakes.txt | head
                            The output of above command is:
                 THE
                 SONNETS
                 by
                 William
                 Shakespeare
                 From
                 fairest
                 creatures
                 We
                 ...
MS(CS), Bahria University, Islamabad                          Instructor: Dr. Muhammad Asfand-e-yar
     The Tokenization in UNIX
 Now that there is one word per line, it will sort in the lines, and pass
 them to uniq -c which will collapse and count them:
      tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
                                   1945 A
                                   72 AARON
                                   19 ABBESS
                                   25 Aaron
                                   6 Abate
                                   1 Abates
                                   5 Abbess
                                   6 Abbey
                                   3 Abbot
MS(CS), Bahria University, Islamabad                    Instructor: Dr. Muhammad Asfand-e-yar
                                   .... …
      The Tokenization in UNIX
      Merging upper and lower case
      tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c
      Sorting the counts
      tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r
                                       23243 the
                                       22225 i
                                       18618 and
                                       16339 to     The -n option to sort means to sort numerically
                                       15687 of     rather than alphabetically, and the -r option means
                                       12780 a      to sort in reverse order (highest-to-lowest)
                                       12163 you
                                       10839 my
                                       10005 in
MS(CS), Bahria University, Islamabad
                                       8954 d                                       Instructor: Dr. Muhammad Asfand-e-yar
      The Tokenization in UNIX
      The Unix command sequence removes all the numbers and
      punctuation, for most NLP applications it is required to keep
      these while tokenization.
      Punctuations are break off as a separate token;
            • commas are a useful piece of information for parsers.
            • periods help indicate sentence boundaries.
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Issues in Tokenization
            Finland’s capital             Finland Finlands Finland’s ?
            what’re, I’m, isn’t           What are, I am, is not
            Hewlett-Packard               Hewlett Packard ?
            state-of-the-art              state of the art ?
            Lowercase                     lower-case lowercase lower case ?
            San Francisco                 one token or two?
            m.p.h., PhD.                  ??
            555,550.50                    ??
            rock ‘n’ roll                 rock and roll (separate tokens), one token?
MS(CS), Bahria University, Islamabad                               Instructor: Dr. Muhammad Asfand-e-yar
     Issues in Tokenization
     A tokenizer can also be used to expand clitic contractions that are marked by
     apostrophes, for example,
         • what're to the two tokens what are,
         • we're to the two tokens we are.
     A clitic is a part of a word that can’t stand on its own, and can only occur when it is
     attached to another word.
     Some such contractions occur in other alphabetic languages, including articles and
     pronouns in French (j'ai, l'homme).
     Depending on the application, tokenization algorithms may also tokenize multiword
     expressions for example:
     New York or rock 'n' roll as a single token, which requires a multiword expression
     dictionary of some sort.
     Tokenization is thus intimately tied up with named entity detection, the task of detecting
     names,
MS(CS),           dates,
        Bahria University,     and organizations
                           Islamabad                                Instructor: Dr. Muhammad Asfand-e-yar
      Issues in Tokenization
      Penn Treebank tokenization
      Penn Treebank tokenization standard, used for the parsed corpora
      (treebanks) released by the Linguistic Data Consortium (LDC), the source of
      many useful datasets.
      This standard separates out
            • clitics (i.e. doesn’t becomes does plus n’t)
            • keeps hyphenated words together
            • and separates out all punctuation
   Input     “ The     San Francisco-based   restaurant ,   “ they   said   , “   does    n’t charge       $ 10 ” .
   Output
     .    “ The         San Francisco-based restaurant , “ they      said   , “ does     n’t   charge      $ 10 ” .
MS(CS), Bahria University, Islamabad                                              Instructor: Dr. Muhammad Asfand-e-yar
      Tokenization: Language Issues
      French
         L'ensemble  one token or two?      L'ensemble => means “the entire”
            • L ? L’ ? Le ?
            • Want l’ensemble to match with un ensemble
      German noun compounds are not segmented
        Lebensversicherungsgesellschaftsangestellter
        Lebens versicherungs gesellschafts angestellter
          means ‘life insurance company employee’
        German information retrieval needs compound splitter
MS(CS), Bahria University, Islamabad                         Instructor: Dr. Muhammad Asfand-e-yar
      Tokenization: Language Issues
      Chinese and Japanese no spaces between words:
            莎拉波娃现在居住在美国东南部的佛罗里达。
             莎拉波娃      现在              居住 在 美国         东南部       的 佛罗里达
             Sharapova now             lives in US     southeastern Florida
      Further complicated in Japanese, with multiple alphabets
      intermingled
            Dates/amounts in multiple formats
        フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
                       Katakana        Hiragana      Kanji   Romaji
       End-user can express query entirely in hiragana!
MS(CS), Bahria University, Islamabad                           Instructor: Dr. Muhammad Asfand-e-yar
      Word Tokenization in Chinese
      The example provided in previous slides are also called Word
      Segmentation
      Chinese words are composed of characters
            Characters are generally 1 syllable and 1 morpheme.
            Average word is 2.4 characters long.
      Standard baseline segmentation algorithm:
            Maximum Matching (also called Greedy or Max-Match)
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Maximum Matching
      Word Segmentation Algorithm
      Given a wordlist of any string as dictionary, and a string.
      1)       Start a pointer at the beginning of the string
      2)     Find the longest word in dictionary that matches the string
             starting at pointer
            1) Move the pointer over the word in string
            2) Go to 2
MS(CS), Bahria University, Islamabad                            Instructor: Dr. Muhammad Asfand-e-yar
      Maximum Matching
      Word Segmentation Algorithm
      function MAXMATCH(sentence, dictionary D) returns word sequence W
             if sentence is empty
                       return empty list
             for i length(sentence) downto 1
                       firstword = first i chars of sentence
                       remainder = rest of sentence
                       if InDictionary(firstword, D)
                               return list(firstword, MaxMatch(remainder,dictionary) )
             # no word was found, so make a one-character word
             firstword = first char of sentence
             remainder = rest of sentence
      return list (firstword, MaxMatch(remainder,dictionary D) )
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Max-Match Segmentation
      Thecatinthehat                         The cat in the hat
                                             The table down there
      Thetabledownthere
                                             Theta bled own there
      Doesn’t generally work in English!
      But works astonishingly well in Chinese Language
            莎拉波娃现在居住在美国东南部的佛罗里达。
            莎拉波娃      现在               居住 在 美国         东南部       的 佛罗里达
            Sharapova now              lives in US     southeastern Florida
      Modern probabilistic segmentation algorithms even better
MS(CS), Bahria University, Islamabad                              Instructor: Dr. Muhammad Asfand-e-yar
      Max-Match Segmentation
      Also works well in German Language
            Lebensversicherungsgesellschaftsangestellter
            Lebens            versicherungs   gesellschafts   angestellter
            life              insurance       company         employee
MS(CS), Bahria University, Islamabad                                   Instructor: Dr. Muhammad Asfand-e-yar
      Basic Text Processing
      Word Normalization and Stemming
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      Normalization
      Tokens can also be Normalized, and choose the Normalized form to
      match multiple forms
            Information Retrieval: indexed text & query terms must have same form.
                 For example              to match U.S.A. to USA and US
                                       or to match uh-huh to uhhuh
      We implicitly define equivalence classes of terms
            e.g., deleting periods in a term
      Alternative: asymmetric expansion:
            • Enter: window                   Search: window, windows
            • Enter: windows                  Search: Windows, windows, window
            • Enter: Windows                  Search: Windows
      Potentially more powerful, but less efficient
MS(CS), Bahria University, Islamabad                                      Instructor: Dr. Muhammad Asfand-e-yar
      Case folding
      Case folding is another form of Normalization.
        For example: every thing is matched to lower case.
      Applications like Information               Case folding is used for
      Retrieval: reduce all letters to               • Sentiment Analysis,
      lower case                                     • Machine Translation,
            • Normally users tend to use lower       • Information extraction
              case
            • Possible exception: upper case in
              mid-sentence?                       Case folding is helpful (US versus
                 • e.g., General Motors           us is important)
                 • Fed vs. fed
                 • SAIL vs. sail
MS(CS), Bahria University, Islamabad                                Instructor: Dr. Muhammad Asfand-e-yar
      Lemmatization
      Lemmatization: determines that two words have the same root,
      despite their surface differences
      Lemmatization is used to reduce variant forms to base form
            • am, are, is  be
            • car, cars, car's, cars'  car
      the boy's cars are different colors  the boy car be different color
      Lemmatization: have to find correct dictionary headword form
      Machine translation
            Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Morphology
      How Lemmatization is performed?
      Through Morphological Parsing
      Morphology is the way to build Morphemes:
            Morphemes means “The small meaningful units that make up words”
      Two types of Morphemes are:
            • Stems: The core meaning-bearing units
            • Affixes: Adding additional meaning of various kinds OR Bits and pieces that
              adhere to stems
                 • Often with grammatical functions
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Stemming
      Reduce terms to their stems in information retrieval
      Stemming is crude chopping of affixes
            • language dependent
            • e.g., automate(s), automatic, automation all reduced to automat.
      for example compressed                            for exampl compress 
      and compression are both                          and compress ar both 
      accepted as equivalent to                         accept as equival to 
      compress.                                         compress
      For stemming widely used algorithm is Porter’s Algorithm
MS(CS), Bahria University, Islamabad                             Instructor: Dr. Muhammad Asfand-e-yar
      Porter’s Algorithm
      The most common English stemmer.
      This was not the map we found in Billy Bones's chest, but an accurate
      copy, complete in all things-names and heights and soundings-with
      the single exception of the red crosses and the written notes.
      The Porter stemmer algorithm produces the following stemmer output.
      Thi wa not the map we found in Billi Bone s chest but an accur copi
      complet in all thing name and height and sound with the singl except
      of the red cross and the written note
MS(CS), Bahria University, Islamabad                      Instructor: Dr. Muhammad Asfand-e-yar
      Porter’s Algorithm
      The most common English stemmer.
         Step 1a                         Step 2 (for long stems)
              sses  ss                      ational ate
                 caresses  caress              relational  relate
              ies  i                        izer ize
                 ponies  poni                  digitizer  digitize
              ss  ss                        ator ate
                 caress  caress                operator  operate
              s ø                           …
                 cats    cat             Step 3 (for longer stems)
        Step 1b                              al  ø
              (*v*)ing  ø                      revival  reviv
                 walking  walk              able  ø
                 sing     sing                 adjustable  adjust
              (*v*)ed  ø                    ate  ø
                 plastered  plaster            activate  activ
              …
MS(CS), Bahria University, Islamabad
                                             …
                                                             Instructor: Dr. Muhammad Asfand-e-yar
      Morphology in a Corpus
      Why only strip –ing if there is a vowel?
      (*v*)ing  ø
              walking  walk
              sing      sing
      tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort –nr
      tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort –nr
                                       1312 King       548 being
                                       548 being       541 nothing
                                       541 nothing     152 something
                                       388 king        145 coming
                                       375 bring       130 morning
                                       358 thing       122 having
                                       307 ring        120 living
                                       152 something   117 loving
                                       145 coming      116 Being
MS(CS), Bahria University, Islamabad                                   Instructor: Dr. Muhammad Asfand-e-yar
                                       130 morning     102 going
      Dealing with complex morphology is
      sometimes necessary
      Some languages requires complex morpheme segmentation
            Turkish
            Uygarlastiramadiklarimizdanmissinizcasina
            `(behaving) as if you are among those whom we could not civilize’
      Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ +
      lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
MS(CS), Bahria University, Islamabad                                 Instructor: Dr. Muhammad Asfand-e-yar
      Basic Text Processing
      Sentence Segmentation and Decision
      Trees
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      Sentence Segmentation
      Sentence segmentation is another important step in text
      processing.
      The most useful cues for segmenting a text into sentences are
      punctuation, like periods, question marks, exclamation points.
      Question marks, exclamation points and Periods are relatively
      unambiguous markers of sentence boundaries.
           !, ? are relatively unambiguous
MS(CS), Bahria University, Islamabad                  Instructor: Dr. Muhammad Asfand-e-yar
      Sentence Segmentation
      !, ? are relatively unambiguous
      Period “.” is quite ambiguous
            • Sentence boundary
            • Abbreviations like Inc. or Dr.
            • Numbers like .02% or 4.3
      Sentence Segmentation/Tokenization works by:
      Build a binary classifier
            • Looks at a “.”
            • Decides EndOfSentence/NotEndOfSentence
            • Classifiers: hand-written rules, regular expressions, or machine-learning
MS(CS), Bahria University, Islamabad                                   Instructor: Dr. Muhammad Asfand-e-yar
      Determining if a word is end-of-sentence:
      a Decision Tree
MS(CS), Bahria University, Islamabad   Instructor: Dr. Muhammad Asfand-e-yar
      More sophisticated decision tree features
      Case of word with “.”: Upper, Lower, Cap, Number
      Case of word after “.”: Upper, Lower, Cap, Number
            For example: Dr. , Ph.D. , 50.5 , …
      Numeric features
            • Length of word with “.”
            • Probability (word with “.” occurs at end-of-s)
            • Probability (word after “.” occurs at beginning-of-s)
MS(CS), Bahria University, Islamabad                                  Instructor: Dr. Muhammad Asfand-e-yar
      Implementing Decision Trees
      A decision tree is just an if-then-else statement
      The interesting research is choosing the features
      Setting up the structure is often too hard to do by hand
            • Hand-building only possible for very simple features, domains
                 • For numeric features, it’s too hard to pick each threshold
            • Instead, structure usually learned by machine learning from a training
              corpus
MS(CS), Bahria University, Islamabad                                            Instructor: Dr. Muhammad Asfand-e-yar
      Exercise
      Implement the Max-Match algorithm in any Programming Language.
      Algorithm is provided in the slide 82 & 83.
      Septs to follow:
      1. Construct a dictionary as *.txt, list of words per line
      2. Pass a string, as given below, to the Max-Match function and
         identify the string.
            For example:
               Input:                  wecanonlyseeashortdistanceahead
               Output:                 we can only see a short distance ahead.
MS(CS), Bahria University, Islamabad                                         Instructor: Dr. Muhammad Asfand-e-yar
      Q&A
                                       That’s all for today’s Lecture
MS(CS), Bahria University, Islamabad            Instructor: Dr. Muhammad Asfand-e-yar