NATURAL LANGUAGE PROCESSING (BCS601)
Module 2, Chapter 3
Module 2: Word Level and Syntactic Analysis
• Word Level Analysis:
Regular Expressions
Finite-State Automata
Morphological Parsing
Spelling Error Detection and correction
Words and Word classes
Part-of-Speech Tagging
• Syntactic Analysis:
Context-free Grammar
Constituency
Parsing
Probabilistic Parsing
REGULAR EXPRESSIONS
• Regular expressions are a pattern-matching standard for string parsing and
replacement.
• They are a powerful way to find and replace strings that take a defined format.
• Regular expressions can be used to parse dates, URLs, email addresses, log
files, configuration files, command line switches, or programming scripts.
• They are useful tools for the design of language compilers and have been used in
NLP for tokenization, describing lexicons, morphological analysis, etc.
• The use of regular expressions in computer science was made popular by a Unix-based
editor, 'ed'.
• Perl was the first language that provided integrated support for regular expressions.
• Regular expressions were originally studied as a part of the theory of computation.
• They were first introduced by Kleene (1956).
REGULAR EXPRESSIONS contd..
• A regular expression is an algebraic formula whose value
is a pattern consisting of a set of strings, called the
language of the expression.
• The simplest kind of regular expression contains a single
symbol, e.g., /a/.
• A regular expression may specify a sequence of
characters.
Eg: /Madhwa/
REGULAR EXPRESSIONS contd..
• Character classes
Characters are grouped by putting them between square
brackets
• Eg: the pattern /[abcd]/ will match (any of) a, b, c, and d.
• /[0123456789]/ specifies any single digit.
• /[5-9]/ specifies any one of the characters 5, 6, 7, 8, or 9.
• /[m-p]/ specifies any one of the letters m, n, o, or p.
• The pattern /[^x]/ matches any single character except x.
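A minimal sketch of these character classes using Python's re module (the sample strings are made up for illustration):

import re
print(re.findall(r"[abcd]", "cab"))          # ['c', 'a', 'b']
print(re.findall(r"[0-9]", "Jan 10, 2006"))  # ['1', '0', '2', '0', '0', '6']
print(re.findall(r"[5-9]", "3 7 9"))         # ['7', '9']
print(re.findall(r"[m-p]", "lamp"))          # ['m', 'p']
print(re.findall(r"[^x]", "xox"))            # ['o']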
REGULAR EXPRESSIONS contd..
Pattern   Matches                                        Example
[abc]     any of a, b, and c                             'Refresher course will start tomorrow'
[A-Z]     any character between A and Z (ASCII order)    'The course will end on Jan. 10, 2006'
[^A-Z]    any character other than an uppercase letter   'TREC Conference'
[^abc]    anything other than a, b, and c                'TREC Conference'
[+*?.]    any of +, *, ?, or the dot                     '3 + 2 = 5'
[a^]      a or ^                                         '^ has three different uses'
• /?/ - A question mark makes the preceding character optional.
Eg: /supernovas?/ specifies both 'supernova' and 'supernovas'.
• The * operator, called the Kleene *, specifies zero or more occurrences of a preceding character or regular
expression.
Eg: /b*/ will match any string containing zero or more occurrences of 'b', i.e., it will match 'b', 'bb', or 'bbbb'.
• It will also match 'aaa', since that string contains zero occurrences of 'b'.
• /bb*/ matches a 'b' followed by zero or more 'b's.
This will match any of the strings 'b', 'bb', 'bbb', 'bbbb'.
• /[ab]*/ specifies zero or more 'a's or 'b's.
This will match strings like 'aa', 'bb', or 'abab'.
REGULAR EXPRESSIONS contd..
• The Kleene + provides a shorter notation to specify one or more
occurrences of a character.
Eg: /a+/ means one or more occurrences of 'a'.
• The caret (^) is used as an anchor to specify a match at the
beginning of a line.
• The dollar sign, $, is an anchor that is used to specify a match at
the end of the line.
Eg: /^The nature$/ will match lines consisting of the phrase 'The nature'.
• The dot (.) is called the wildcard character; it matches any single
character.
Eg: /.at/ matches any of the strings cat, bat, rat, gat, kat, mat, etc.
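The quantifiers, anchors, and wildcard above can be tried the same way (illustrative strings only):

import re
print(re.search(r"supernovas?", "two supernovas").group())  # supernovas
print(re.findall(r"bb*", "b abbb"))                         # ['b', 'bbb']
print(re.findall(r"a+", "caaandy"))                         # ['aaa']
print(bool(re.search(r"^The nature$", "The nature")))       # True: both anchors match
print(re.findall(r".at", "the cat sat on a mat"))           # ['cat', 'sat', 'mat']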
REGULAR EXPRESSIONS contd..
.    Matches any single character.
\n   Matches a newline character (or CR+LF combination).
\t   Matches a tab (ASCII 9).
\d   Matches a digit [0-9].
\D   Matches a non-digit.
\w   Matches an alphanumeric character.
\W   Matches a non-alphanumeric character.
\s   Matches a whitespace character.
\S   Matches a non-whitespace character.
\    Use \ to escape special characters. Eg: \. matches a dot, \* matches a *, and \\ matches a backslash.
• The wildcard symbol can be used for counting characters.
Eg: /.....berry/ matches ten-letter strings that end in 'berry'.
• It finds patterns like strawberry, sugarberry, and blackberry, but fails to match blueberry or hackberry.
• The pattern /blackberry|blackberries/ matches either 'blackberry' or 'blackberries'.
• A regular expression may contain symbol pairs.
For example, the regular expression /a:b/ represents a pair of strings.
The a:b relation is the cross product of the languages denoted by the expressions /a/ and /b/.
REGULAR EXPRESSIONS contd..
• Suppose we need to check if a string is an email address or not.
• An email address consists of a non-empty sequence of characters
followed by the 'at' symbol, @, followed by another non-empty
sequence of characters ending with a pattern like .xx, .xxx, .xxxx, etc.
• A regular expression for an email address is:
/^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}$/
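A sketch of this pattern in Python; the exact character classes and the {2,4} length bound on the final part are assumptions reconstructed from the description above:

import re
EMAIL = re.compile(r"^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}$")
for s in ["alice@example.com", "bob@site", "a.b-c@mail.co.in"]:
    print(s, bool(EMAIL.match(s)))
# alice@example.com True, bob@site False (no dot part), a.b-c@mail.co.in True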
FINITE-STATE AUTOMATA
• Consider a very simple machine with an input device, a processor, some
memory, and an output device.
• The machine starts in the initial state.
• It checks the input and goes to the next state, which is completely determined
by the prior state and the input.
• If all goes well, the machine reaches a final state and terminates.
• If the machine gets an input for which the next state is not specified, it
gets stuck in a non-final state: the machine has failed, or rejected, the
input.
• A general model of this type of machine is called a finite automaton:
'finite' because the number of states and the alphabet of input symbols are
finite;
'automaton' because the machine moves automatically.
• This type of machine is more commonly called a deterministic finite-state automaton.
FINITE-STATE AUTOMATA contd..
• A finite automaton has the following properties:
A finite set of states, one of which is designated the initial or
start state, and one or more of which are designated as the
final states.
A finite alphabet set Σ consisting of input symbols.
A finite set of transitions that specify for each state and each
symbol of the input alphabet, the state to which it next goes.
• A finite automaton can be deterministic or non-deterministic.
• In a non-deterministic automaton, more than one transition
out of a state is possible for the same input symbol
Example:
• Suppose Σ = {a, b, c} and the set of states = {q0, q1, q2, q3, q4}, with q0
being the start state and q4 the final state.
• Transition Rules:
From state q0, with input a, go to state q1.
From state q1, with input b, go to state q2.
From state q1, with input c, go to state q3.
From state q2, with input b, go to state q4.
From state q3, with input b, go to state q4.
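This automaton is small enough to encode directly; a minimal sketch as a transition table in Python:

# Transition table for the example: Σ = {a, b, c}, start q0, final q4.
transitions = {("q0", "a"): "q1", ("q1", "b"): "q2", ("q1", "c"): "q3",
               ("q2", "b"): "q4", ("q3", "b"): "q4"}

def accepts(s):
    state = "q0"
    for ch in s:
        if (state, ch) not in transitions:
            return False                # no transition: the machine rejects
        state = transitions[(state, ch)]
    return state == "q4"                # accept only if we end in the final state

print(accepts("abb"), accepts("acb"), accepts("ab"))  # True True False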
[Figure: morphological parsing in two steps — Step 1 splits the surface form into possible
morphemes (intermediate form); Step 2 maps the morphemes to the stem and morphological
features (lexical form).]
Morphology using FST..
• An FST-based morphological parser for singular and plural nouns
in English.
The plural form of regular nouns usually ends with -s or -es.
• A word ending in 's' need not necessarily be the plural form of a
word.
• One of the required translations is the deletion of the 'e'
when introducing a morpheme boundary.
This is required for words ending in xes, ses, zes (e.g., suffixes and
boxes).
[Figure: a transducer mapping English nouns to the intermediate form]
Morphology using FST..
• Possible sequences of states that the transducer undergoes,
given the surface forms birds and boxes as input.
[Figure: dotted pair notation — (1) an FSA recogniser for 'fox' (f o x); (2) a transducer
mapping l e s s +Adj +Comp to l e s s e r, with ε transitions.]
Morphology using FST..
• Develop a transducer that maps from the intermediate level to
the lexical level.
The input to the transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
The transducer maps all symbols of the stem to themselves and then
outputs N and sg.
• Regular noun stem + s, e.g., bird + s
The transducer maps all symbols of the stem to themselves, but
then outputs N and replaces PL with s.
• Singular irregular noun stem, e.g., goose
The transducer maps all symbols of the stem to themselves and
then outputs N and sg.
• Plural irregular noun stem, e.g., geese
The transducer maps the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then
adds N and PL.
• The mapping from State 1 to State 2, 3, or 4 is carried out with the
help of a transducer encoding a lexicon.
• The transducer implementing the lexicon maps the individual
regular and irregular noun stems to their correct noun stem,
replacing labels like 'regular noun form'.
The lexicon maps the surface form geese, which is an irregular noun, to
its correct stem goose:
• g:g e:o e:o s:s e:e
• Using a single transducer which combines the input and output tapes,
the surface word form birds will be mapped to bird + N + pl as
follows:
• b:b i:i r:r d:d ε:+N s:+pl
Each letter maps to itself, while ε maps to the morphological feature +N,
and s maps to the morphological feature +pl.
[Figure: a transducer mapping nouns to their stem and morphological features]
• The power of the transducer lies in the fact that the same
transducer can be used for analysis and generation.
That is, we can run it in the downward direction (input:
surface form and output: lexical form) or in the upward
direction.
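A toy illustration of this bidirectionality (not a real FST; the two hard-coded pair sequences stand in for the transducer, and the feature spellings +N/+PL are illustrative):

# Surface:lexical symbol pairs; 'ε' marks the empty string.
PAIRS = {
    "birds": [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"),
              ("ε", "+N"), ("s", "+PL")],
    "geese": [("g", "g"), ("e", "o"), ("e", "o"), ("s", "s"),
              ("e", "e"), ("ε", "+N"), ("ε", "+PL")],
}

def analyse(surface):       # upward direction: surface form -> lexical form
    return "".join(lex for _, lex in PAIRS[surface] if lex != "ε")

def generate(lexical):      # downward direction: lexical form -> surface form
    for surface in PAIRS:
        if analyse(surface) == lexical:
            return surface

print(analyse("birds"))        # bird+N+PL
print(generate("goose+N+PL"))  # geese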
SPELLING ERROR DETECTION AND CORRECTION
• In computer-based information systems, errors of typing and spelling
constitute a very common source of variation between strings.
• These errors have been widely investigated.
• All investigations agree that single character omission, insertion,
substitution, and reversal are the most common typing mistakes.
• As in Damerau (1964), over 80% of the errors were single-error
misspellings:
Substitution of a single letter,
Omission of a single letter,
Insertion of a single letter, and
Transposition of two adjacent letters.
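A sketch that enumerates these four single-error edits of a typed word; intersecting the result with a lexicon yields correction candidates (the lowercase alphabet is an assumption):

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_edits(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    omissions      = {L + R[1:] for L, R in splits if R}
    insertions     = {L + c + R for L, R in splits for c in ALPHABET}
    substitutions  = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    transpositions = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    return omissions | insertions | substitutions | transpositions

print("these" in single_edits("theses"))  # True: one omission apart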
• As in Shafer and Hardwick (1968), the most common type of single-character
error was substitution, followed by omission of a single letter
and then insertion of a single letter.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Optical character recognition (OCR) and other automatic reading devices
introduce errors of substitution, deletion, and insertion.
• OCR errors are usually grouped into five classes:
substitution, multi-substitution (or framing), space deletion, space insertion, and failures.
OCR substitution and multi-substitution errors are caused by visual similarity,
such as c→e, 1→l, r→n, and m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
The frequency and type of errors are characteristics of the particular device.
These errors can be corrected using 'context' or by using linguistic structures.
• Many approaches to speech recognition deal with strings of phonemes (or
symbols representing sounds), and attempt to match a spoken utterance
with a dictionary of known utterances.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Spelling errors are mainly phonetic, where the misspelled word is
pronounced in the same way as the correct word.
• Phonetic errors distort the word by more than a single insertion, deletion, or
substitution.
• Spelling errors are categorized into non-word errors and real-word errors.
When an error results in a word that does not appear in a given lexicon
or is not a valid orthographic word form, it is a non-word error.
• N-gram analysis and dictionary look-up are used to detect non-word errors.
A real-word error results in an actual word of the language.
• It occurs due to typographical mistakes or spelling errors.
• Real-word errors may cause local syntactic errors, global syntactic errors, semantic errors, or
errors at the discourse or pragmatic levels.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Spelling correction consists of detecting and correcting errors.
Error detection is the process of finding misspelled words, and error correction is the
process of suggesting correct words for a misspelled one.
These sub problems are addressed in two ways:
• Isolated-error detection and correction
• Context-dependent error detection and correction
In isolated-word error detection and correction, each word is checked separately,
independent of its context
The strategy of detecting whether or not a word is correct has a number of problems
associated with it.
• The strategy requires the existence of a lexicon containing all correct words. Such a lexicon
would take a long time to compile and occupy a lot of space.
• Some languages are highly productive. It is impossible to list all the correct words of such
languages.
• This strategy fails when a spelling error produces a word that belongs to the lexicon, e.g., when
'theses' is written in place of 'these'. Such an error is called a real-word error.
• The larger the lexicon, the more likely it is that an error goes undetected, because the chance
of a word being found is greater in a large lexicon.
Spelling correction algorithms..
• Context-dependent error detection and correction methods utilize the
context of a word to detect and correct errors.
This requires grammatical analysis and is thus more complex and language
dependent.
In these methods, the list of candidate words must first be obtained using
an isolated-word method before making a selection depending on the
context.
• The spelling correction algorithm has been broadly categorized by Kukich
(1992) as follows.
Minimum edit distance: It is the minimum number of operations
(insertions, deletions, or substitutions) required to transform one string
into another.
Similarity key technique: It changes the given string into a key such that
similar strings map to the same key.
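Soundex is a classic example of a similarity key; a simplified sketch (the letter groupings follow the usual Soundex scheme, and edge cases such as h/w handling are ignored):

GROUPS = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
          **dict.fromkeys("dt", "3"), "l": "4",
          **dict.fromkeys("mn", "5"), "r": "6"}

def similarity_key(word):
    word = word.lower()
    key, last = word[0].upper(), GROUPS.get(word[0], "")
    for ch in word[1:]:
        code = GROUPS.get(ch, "")
        if code and code != last:   # keep the first of any run of equal codes
            key += code
        last = code
    return (key + "000")[:4]        # pad or truncate to four characters

print(similarity_key("Robert"), similarity_key("Rupert"))  # R163 R163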
Spelling correction algorithms contd..
N-gram based techniques: These can be used for both non-word and real-word
error detection.
• In English, certain bi-grams and tri-grams of letters never occur or occur rarely.
This information can be used to handle non-word errors.
• Strings that contain these unusual n-grams can be identified as possible spelling
errors.
• N-gram techniques usually require a large corpus or dictionary as training data.
Neural nets: These have the ability to do associative recall based on incomplete and
noisy data.
• They can be trained to adapt to specific spelling error patterns.
• The drawback of neural nets is that they are computationally expensive.
Rule-based techniques: A set of rules (heuristics) derived from knowledge of
common spelling error patterns is used to transform misspelled words into valid
words.
• For example, a rule can be written for errors that occur from the letters ue being
typed as eu.
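A sketch of that ue/eu rule in action; the mini-lexicon is hypothetical:

LEXICON = {"question", "guest", "queue"}
RULES = [("eu", "ue")]  # common error pattern: 'ue' typed as 'eu'

def rule_correct(word):
    if word in LEXICON:
        return word
    for wrong, right in RULES:
        candidate = word.replace(wrong, right)
        if candidate in LEXICON:
            return candidate
    return None  # no rule produced a valid word

print(rule_correct("qeustion"))  # question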
Minimum Edit Distance
• Minimum edit distance is the minimum number of insertions, deletions, and substitutions
required to change one string into another.
E.g., the minimum edit distance between tutor and tumour is 2:
• substitute 'm' for the second 't' and insert 'u' before 'r'.
• Edit distance between two strings can be represented as a binary function.
Edit distance can be viewed as an alignment problem.
• By aligning two strings, the degree of match can be found.
• The alignment with the lowest cost gives the minimum edit distance.
• The following alignment between tutor and tumour has an edit distance of 2.
t u t o - r
t u m o u r
• Another possible alignment with edit distance 3.
t u t - o - r
t u - m o u r
• The best possible alignment corresponds to minimum edit distance.
• Dynamic programming algorithms can be quite useful for finding the minimum edit
distance between two sequences.
• They can be implemented by creating an edit distance matrix.
• This matrix has one row for each symbol in the source string and one column for
each symbol in the target string.
The (i, j)th cell in this matrix represents the distance between the first i characters of the
source and the first j characters of the target string.
The value in each cell is computed in terms of the three cells from which
it can be reached:
dist[i, j] = min( dist[i-1, j] + insert_cost,
                  dist[i-1, j-1] + subst_cost(source_i, target_j),
                  dist[i, j-1] + delete_cost )
• The substitution cost will be 0 if the ith character in the source matches the jth
character in the target.
Minimum Edit Distance algorithm
Input: Two strings, X and Y
Output: The minimum edit distance between X and Y
m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i, 0] ← i
for j = 0 to n do
    dist[0, j] ← j
for i = 1 to m do
    for j = 1 to n do
        dist[i, j] ← min { dist[i-1, j] + insert_cost,
                           dist[i-1, j-1] + subst_cost(X_i, Y_j),
                           dist[i, j-1] + delete_cost }
end
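A direct Python rendering of this algorithm, assuming unit insert/delete costs and a substitution cost of 0 on a match and 1 otherwise:

def min_edit_distance(source, target):
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                 # distance from prefix to empty string
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # insert
                             dist[i - 1][j - 1] + subst,  # substitute
                             dist[i][j - 1] + 1)          # delete
    return dist[m][n]

print(min_edit_distance("tutor", "tumour"))  # 2, matching the matrix below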
Computing Minimum Edit Distance
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2
Words and Word Classes
• Words are classified into categories called parts-of-speech,
also called word classes or lexical categories.
Common lexical categories are nouns and verbs.
Other lexical categories include adjectives, adverbs,
prepositions, and conjunctions.
Parts of speech (POS) tagging
• Parts of speech tagging is the process of assigning a part-of-speech (such as a noun,
verb, pronoun, preposition, adverb, or adjective) to each word in a sentence.
The input to a tagging algorithm is the sequence of words of a natural language
sentence and a specified tag set (a finite list of POS tags).
The output is a single best POS tag for each word.
Many words may belong to more than one lexical category.
• But only one of the possible meanings is used at a time.
No tagger is efficient enough to identify the correct lexical category of each word in
a sentence in every case.
• E.g., 'Soma is reading the book' and 'Rama booked the ticket'.
The collection of tags used by a particular tagger is called a tag set.
• Most POS tag sets make use of the same basic categories, i.e., noun, verb, adjective, and
preposition.
• Tag sets differ in how they define categories and how finely they divide words into categories.
• Most tag sets capture morpho-syntactic information such as singular/plural, number,
gender, tense, etc.
The number of tags used by different taggers varies.
POS methods
• Parts of speech tagging methods fall into 3 categories:
Rule based(linguistic)
Stochastic (data driven)
Hybrid.
• Rule Based Taggers
Use hand coded rules to assign tags to words. These rules use a lexicon
to obtain a list of candidate tags and then use rules to discard incorrect
tags.
• Stochastic (data driven)
Have data driven approaches in which frequency based information is
automatically derived from corpus and used to tag words.
They disambiguate words based on the probability that a word occurs
with a particular tag.
• Hybrid taggers combine both of above approaches.
Rule Based Taggers
• Rule Based Taggers have a 2-stage architecture:
The first stage is a dictionary look-up procedure, which returns a set of potential tags and
appropriate syntactic features for each word.
The second stage uses hand-coded rules to discard contextually illegitimate tags to get
a single part of speech for each word.
• Eg: the noun-verb ambiguity. Sentence: The show must go on.
• The potential tags for the word 'show' in this sentence are {VB, NN}.
The ambiguity is resolved using the rule:
IF the preceding word is a determiner THEN eliminate VB tags.
• This rule disallows a verb after a determiner; using this rule, the word 'show' is tagged
as a noun only.
• In addition to contextual information, many taggers use morphological
information to help in the disambiguation process.
E.g., IF the word ends in -ing and the preceding word is a verb THEN label it a verb (VB).
• Capitalization information can be utilized in the tagging of unknown nouns.
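A toy sketch of this two-stage architecture, using the determiner rule above (the mini-lexicon is hypothetical):

LEXICON = {"the": {"DT"}, "show": {"NN", "VB"}, "must": {"MD"},
           "go": {"VB"}, "on": {"IN"}}

def rule_tag(words):
    tags = []
    for i, w in enumerate(words):
        candidates = set(LEXICON[w])        # stage 1: dictionary look-up
        # Stage 2 rule: IF the preceding word is a determiner
        # THEN eliminate VB tags.
        if i > 0 and tags[-1] == "DT" and len(candidates) > 1:
            candidates -= {"VB"}
        tags.append(sorted(candidates)[0])
    return tags

print(rule_tag("the show must go on".split()))
# ['DT', 'NN', 'MD', 'VB', 'IN']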
Rule Based Taggers
• Rule-based taggers usually require supervised training. Instead, rules
can be induced automatically.
To induce rules, untagged text is run through a tagger.
The output is then manually corrected.
The corrected text is then submitted to the tagger, which learns correction
rules by comparing the two sets of data.
This process may be repeated several times.
• TAGGIT (Greene and Rubin 1971) was used for the initial tagging of the Brown Corpus (Francis and
Kucera 1982).
A rule-based system, it uses 3,300 disambiguation rules and was able to tag 77% of the words in
the Brown Corpus with their correct part of speech.
• Another rule-based tagger is ENGTWOL (Voutilainen 1995).
Rule Based Taggers
• Advantages:
Speed: rule-based taggers are deterministic, unlike stochastic taggers.
An argument against them is the skill and effort required in writing
disambiguation rules; however, stochastic taggers also require manual work if good
performance is to be achieved.
In the rule-based approach, time is spent in writing a rule set; in the
stochastic approach, time is spent developing restrictions on transitions and
emissions to improve tagger performance.
• Disadvantages:
A rule-based tagger is usable for only one language. Using it for another one requires a
rewrite of most of the programs.
Stochastic tagger
• The standard stochastic tagger is the HMM tagger algorithm.
• A Markov model applies the simplifying assumption that the probability of a chain of
symbols can be approximated in terms of its parts, or n-grams.
• The simplest n-gram model is the unigram model, which assigns the most likely tag to each token.
• The unigram model needs to be trained using a tagged training corpus before it can be used
to tag data.
• The most-likely-tag statistics are gathered over the corpus and used for tagging.
• The context used by the unigram tagger is the text of the word itself.
Ex: it will assign the tag JJ to each occurrence of 'fast', since 'fast' is more often used as
an adjective than as a noun, verb, or adverb.
She had a fast[noun]
Muslims fast[verb] during Ramadan
Those who were injured in the accident need to be helped fast[adverb]
• A bigram tagger uses the current word and the tag of the previous word in the tagging process.
E.g., it assigns the tag NN to 'fast' in 'She had a fast', as the tag sequence "DT NN" is more
likely than the tag sequence "DT JJ".
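A minimal unigram tagger sketch over a made-up tagged corpus, showing how 'fast' gets its most frequent tag regardless of context:

from collections import Counter, defaultdict

train = [("she", "PRP"), ("had", "VBD"), ("a", "DT"), ("fast", "JJ"),
         ("fast", "JJ"), ("fast", "NN"), ("car", "NN")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_tag(word):
    # The most likely tag seen in training, ignoring context.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print(unigram_tag("fast"))  # JJ: more often an adjective in the training data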
Stochastic tagger
• In general, an n-gram model considers the current word and the tags of the previous
n-1 words in assigning a tag to a word.
[Figure: the context considered by a tri-gram model — tokens w(n-2) w(n-1) w(n) w(n+1) with
tags t(n-2) t(n-1) t(n) t(n+1); the shaded area represents the context.]
E.g., the probability of the tag sequence DT NNP MD VB for 'The bird can fly'
can be computed as:
P(DT) × P(NNP|DT) × P(MD|NNP) × P(VB|MD) × P(the|DT) × P(bird|NNP) × P(can|MD) × P(fly|VB)
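A sketch of scoring that one tag sequence with bigram transition and emission tables; all probabilities here are made-up illustrations:

P_trans = {("<s>", "DT"): 0.5, ("DT", "NNP"): 0.3,
           ("NNP", "MD"): 0.2, ("MD", "VB"): 0.6}
P_emit = {("the", "DT"): 0.4, ("bird", "NNP"): 0.01,
          ("can", "MD"): 0.3, ("fly", "VB"): 0.1}

words = ["the", "bird", "can", "fly"]
tags = ["DT", "NNP", "MD", "VB"]

score, prev = 1.0, "<s>"
for w, t in zip(words, tags):
    score *= P_trans[(prev, t)] * P_emit[(w, t)]  # transition x emission
    prev = t
print(score)  # the product above; a real tagger maximizes this over all tag sequences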
Hybrid Taggers
• Hybrid approaches to tagging combine the features of both the rule-based and stochastic
approaches.
• They use rules to assign tags to words.
• Like stochastic taggers, this is a machine learning technique, and rules are
automatically induced from the data.
• Transformation-based learning (TBL) of tags, also known as Brill tagging, is an example of
a hybrid approach.
• Transformation-based error-driven learning is
used in a number of natural language learning problems, including part-of-speech tagging,
speech generation, and syntactic parsing.
It is a supervised learning algorithm.
The input to Brill's TBL tagging algorithm is a tagged corpus and a lexicon.
The initial-state annotator uses the lexicon to assign the most likely tag to each word as the start
state.
An ordered set of transformational rules is applied sequentially.
The rule that results in the most improved tagging is selected.
A manually tagged corpus is used as the reference for truth.
Hybrid Taggers contd..
The process is iterated until some
stopping criterion is reached, such as when
no significant improvement is achieved over
the previous iteration.
At each iteration, the transformation that
results in the highest score is selected.
The output of the algorithm is a ranked
list of learned transformations that
transform the initial tagging close to
the correct tagging.
New text can then be annotated by first
assigning the most frequent tag and then
applying the ranked list of learned
transformations in order.
TBL tagging algorithm
• Input: tagged corpus and lexicon (with most-frequent-tag information)
• Step 1: Label every word with its most likely tag (from the dictionary).
• Step 2: Check every possible transformation and select the one which most improves tagging.
• Step 3: Re-tag the corpus applying the rules.
• Repeat steps 2-3 until some stopping criterion is reached.
• RESULT: A ranked sequence of transformation rules.
• Each transformation is a pair: a re-write rule of the form t1 -> t2, and a contextual
condition.
• In order to limit the set of transformations, a small set of templates is constructed.
• Any allowable transformation is an instantiation of these templates.
TBL tagging algorithm contd..
• Examples of transformation templates and rules learned by the tagger —
change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The preceding (following) word is w.
3. The word two before (after) is w.
4. One of the preceding two words is w.
5. One of the two preceding (following ) words is tagged z.
6. The current word is w and the preceding (following) word is x.
7. One of the previous three words is tagged z.
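A sketch of applying one learned transformation (an instance of template 1, using the classic Brill rule: change NN to VB when the preceding word is tagged TO):

def apply_rule(tagged, frm, to, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == frm and out[i - 1][1] == prev_tag:
            out[i] = (word, to)   # re-write t1 -> t2 when the context matches
    return out

initial = [("to", "TO"), ("book", "NN"), ("a", "DT"), ("flight", "NN")]
print(apply_rule(initial, "NN", "VB", "TO"))
# [('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]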