
NATURAL LANGUAGE PROCESSING (BCS601)
Module 2, Chapter 3
Module 2: Word Level and Syntactic Analysis
• Word Level Analysis:
 Regular Expressions
 Finite-State Automata
 Morphological Parsing
 Spelling Error Detection and Correction
 Words and Word Classes
 Part-of-Speech Tagging
• Syntactic Analysis:
 Context-free Grammar
 Constituency
 Parsing
 Probabilistic Parsing
REGULAR EXPRESSIONS
• Regular expressions are a pattern-matching standard for string parsing and replacement.
• They are a powerful way to find and replace strings that take a defined format.
• Regular expressions can be used to parse dates, URLs, email addresses, log files, configuration files, command line switches, or programming scripts.
• They are useful tools for the design of language compilers and have been used in NLP for tokenization, describing lexicons, morphological analysis, etc.
• The use of regular expressions in computer science was made popular by the Unix-based editor 'ed'.
 Perl was the first language that provided integrated support for regular expressions.
• Regular expressions were originally studied as a part of the theory of computation.
• They were first introduced by Kleene (1956).
REGULAR EXPRESSIONS contd..
• A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings, called the language of the expression.
 The simplest kind of regular expression contains a single symbol, e.g., /a/
• A regular expression may specify a sequence of characters.
 Eg: /Madhwa/
REGULAR EXPRESSIONS contd..
• Character classes
 Characters are grouped by putting them between square brackets.
• Eg: the pattern /[abcd]/ will match any of a, b, c, and d.
• /[0123456789]/ specifies any single digit.
• /[5-9]/ specifies any one of the characters 5, 6, 7, 8, or 9.
• /[m-p]/ specifies any one of the letters m, n, o, or p.
• The pattern /[^x]/ matches any single character except x.
REGULAR EXPRESSIONS contd..
 Pattern   Matches                                        Example text
 [abc]     Any of a, b, or c                              'Refresher course will start tomorrow'
 [A-Z]     Any character between A and Z (ASCII order)   'the course will end on Jan. 10, 2006'
 [^A-Z]    Any character other than an uppercase letter  'TREC Conference'
 [^abc]    Anything other than a, b, or c                'TREC Conference'
 [+*?.]    Any of +, *, ?, or the dot                    '3 + 2 = 5'
 [a^]      a or ^                                        '^ has three different uses'
• /?/ - A question mark makes the preceding character optional.
 Eg: /supernovas?/ specifies both 'supernova' and 'supernovas'.
• The * operator, called the Kleene *, specifies zero or more occurrences of the preceding character or regular expression.
 Eg: /b*/ will match any string containing zero or more occurrences of 'b', i.e., it will match 'b', 'bb', or 'bbbb'.
• It will also match 'aaa', since that string contains zero occurrences of 'b'.
 /bb*/ matches a 'b' followed by zero or more 'b's.
• This will match any of the strings 'b', 'bb', 'bbb', 'bbbb'.
 /[ab]*/ specifies zero or more 'a's or 'b's.
• This will match strings like 'aa', 'bb', or 'abab'.
REGULAR EXPRESSIONS contd..
• The Kleene + provides a shorter notation to specify one or more occurrences of a character.
 /a+/ means one or more occurrences of 'a'.
• The caret (^) is used as an anchor to specify a match at the beginning of a line.
• The dollar sign, $, is an anchor that is used to specify a match at the end of the line.
 Eg: /^The nature$/ matches lines consisting of exactly the phrase 'The nature'.
• The dot (.) is called the wildcard character; it matches any single character.
 The wildcard expression /./ matches any single character.
 /.at/ matches any of the strings cat, bat, rat, gat, kat, mat, etc.
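To make these operators concrete, here is a small sketch using Python's re module; the patterns and test strings are taken from the examples above.

```python
import re

# Character classes, the optional marker, Kleene star, anchors, and the wildcard.
print(re.findall(r'[0-9]', 'Jan. 10, 2006'))                 # ['1', '0', '2', '0', '0', '6']
print(re.findall(r'supernovas?', 'a supernova'))              # ['supernova']: the s is optional
print(re.fullmatch(r'bb*', 'bbbb') is not None)               # True: one 'b' then zero or more
print(re.search(r'^The nature$', 'The nature') is not None)   # True: anchored whole-line match
print(re.findall(r'.at', 'the cat sat on a mat'))             # ['cat', 'sat', 'mat']
```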
REGULAR EXPRESSIONS contd..
 .    The dot matches any single character.
 \n   Matches a newline character (or CR+LF combination).
 \t   Matches a tab (ASCII 9).
 \d   Matches a digit [0-9].
 \D   Matches a non-digit.
 \w   Matches an alphanumeric character.
 \W   Matches a non-alphanumeric character.
 \s   Matches a whitespace character.
 \S   Matches a non-whitespace character.
 \    Use \ to escape special characters. Eg: \. matches a dot, \* matches a *, and \\ matches a backslash.
• The wildcard symbol can be used for counting characters.
 Eg: /.....berry/ matches ten-letter strings that end in berry.
• It finds patterns like strawberry, sugarberry, and blackberry but fails to match blueberry or hackberry.
• The pattern /blackberry|blackberries/ matches either 'blackberry' or 'blackberries'.
• A regular expression may contain symbol pairs.
 For example, the regular expression /a:b/ represents a pair of strings.
 The a:b relation is the cross product of the languages denoted by the expressions /a/ and /b/.
REGULAR EXPRESSIONS contd..
• Suppose we need to check whether a string is an email address or not.
• An email address consists of a non-empty sequence of characters followed by the 'at' symbol, @, followed by another non-empty sequence of characters ending with a pattern like .xx, .xxx, .xxxx, etc.
• A regular expression for email addresses (reconstructed from the description above) is:
 /^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}$/
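A hedged sketch of this pattern in Python; the exact 2-4 letter top-level-domain length is an assumption reconstructed from the '.xx, .xxx, .xxxx' description.

```python
import re

# Reconstructed email pattern: name characters, '@', domain characters,
# then a dot and a 2-4 letter top-level domain (an assumption).
EMAIL = re.compile(r'^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}$')

for s in ['spoorthi@example.com', 'no-at-symbol.com', 'a@b.co']:
    print(s, '->', bool(EMAIL.match(s)))  # True, False, True
```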
FINITE-STATE AUTOMATA
• Consider a very simple machine with an input device, a processor, some memory, and an output device.
• The machine starts in the initial state.
• It checks the input and goes to the next state, which is completely determined by the prior state and the input.
• If all goes well, the machine reaches a final state and terminates.
• If the machine gets an input for which no next state is specified, it gets stuck in a non-final state; the machine has failed and rejects the input.
• A general model of this type of machine is called a finite automaton:
 'finite' because the number of states and the alphabet of input symbols is finite;
 'automaton' because the machine moves automatically.
• This type of machine is more commonly called a deterministic finite-state automaton.
FINITE-STATE AUTOMATA contd..
• A finite automaton has the following properties:
 A finite set of states, one of which is designated the initial or
start state, and one or more of which are designated as the
final states.
 A finite alphabet set Σ consisting of input symbols.
 A finite set of transitions that specify for each state and each
symbol of the input alphabet, the state to which it next goes.
• A finite automaton can be deterministic or non-deterministic.
• In a non-deterministic automaton, more than one transition
out of a state is possible for the same input symbol
Example:
• Suppose Σ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4}, with q0 being the start state and q4 the final state.
• Transition Rules:
 From state q0 and with input a, go to state q1.
 From state q1 and with input b, go to state q2.
 From state q1 and with input c, go to state q3.
 From state q2 and with input b, go to state q4.
 From state q3 and with input b, go to state q4.

[Figure: A deterministic finite-state automaton (DFA)]
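The example automaton can be coded directly as a transition dictionary. This is a minimal sketch of the machine above, not a general FSA library.

```python
# Transition function of the example DFA, as (state, symbol) -> state.
DELTA = {('q0', 'a'): 'q1', ('q1', 'b'): 'q2', ('q1', 'c'): 'q3',
         ('q2', 'b'): 'q4', ('q3', 'b'): 'q4'}
START, FINALS = 'q0', {'q4'}

def accepts(string):
    """Run the DFA; reject if no transition is defined (the machine gets stuck)."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False
        state = DELTA[(state, symbol)]
    return state in FINALS

print(accepts('abb'))  # True:  q0 -a-> q1 -b-> q2 -b-> q4
print(accepts('acb'))  # True:  q0 -a-> q1 -c-> q3 -b-> q4
print(accepts('ab'))   # False: stops in non-final state q2
```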
FINITE-STATE AUTOMATA contd..
• Finite State automata have been used in a wide variety of areas, including
linguistics, electrical engineering, computer science, mathematics, and
logic.
• These are an important tool in computational linguistics and have been
used as a mathematical device to implement regular expressions.
• Any regular expression can be represented by a finite automaton and the
language of any finite automaton can be described by a regular expression.
• A Deterministic Finite Automaton (DFA) is defined as a 5-tuple
(Q, Σ, δ, s, F) consisting of
 A finite set Q (the set of states)
 A finite set of symbols Σ (the input alphabet)
 A transition function δ: Q × Σ → Q mapping the current state q ∈ Q and
input symbol a ∈ Σ to a new state δ(q, a)∈Q
 An initial state s ∈ Q (the start state)
 A set of accepting states F (the final states)
FINITE-STATE AUTOMATA contd..
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε}) to a subset of the power set of Q.
 That is, for each state, there can be more than one transition on a given symbol, each leading to a different state.
 Eg: there are two possible transitions from state q0 on input symbol a.
• A path is a sequence of transitions beginning with the start state.
 A path leading to one of the final states is a successful path.
• The FSAs encode regular languages.
 The language that an FSA encodes is the set of strings that can
be formed by concatenating the symbols along each
successful path.
• Input ac:
 q0 -a-> q2 -c-> q3
 The final state is not reached, so the string ac is not recognized by the automaton.
• Input acb:
 q0 -a-> q2 -c-> q3 -b-> q4 (final state)
 The string acb is accepted by the automaton.
• State Transition Table:
 It is a table showing what state (or states, in the case of a non-deterministic finite automaton) a finite-state machine will move to, based on the current state and the input.
• Rows represent states and columns represent input symbols.

 State       a    b    c
 Start: q0   q1   ϕ    ϕ
 q1          ϕ    q2   q3
 q2          ϕ    q4   ϕ
 q3          ϕ    q4   ϕ
 End: q4     ϕ    ϕ    ϕ
• Consider a language consisting of all strings containing only a's and b's and ending with baa.
 Regular expression: /(a|b)*baa$/
• Two automata that define the same language are said to be equivalent.
• An NFA can be converted to an equivalent DFA and vice versa.
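Because regular expressions and finite automata define the same languages, we can test the "ends in baa" language directly with Python's re instead of building the DFA by hand.

```python
import re

# Strings over {a, b} ending in 'baa'; fullmatch plays the role of ^...$.
BAA = re.compile(r'[ab]*baa')

for s in ['baa', 'abbaa', 'ababaa', 'baab', 'ba']:
    print(s, '->', bool(BAA.fullmatch(s)))  # True, True, True, False, False
```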
MORPHOLOGICAL PARSING
• Morphology is a sub-discipline of linguistics.
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• Morphemes are the smallest meaning-bearing units in a language.
 Eg: the word 'bread' consists of a single morpheme.
 'eggs' consists of two: the morpheme egg and the morpheme -s.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun stem 'egg'.
• Two broad classes of morphemes: stems and affixes.
MORPHOLOGICAL PARSING contd..
• The stem is the main morpheme, i.e., the morpheme that contains the central meaning.
• Affixes modify the meaning given by the stem.
 Affixes are divided into prefixes, suffixes, infixes, and circumfixes.
• Prefixes are morphemes which appear before a stem.
• Suffixes are morphemes applied to the end of the stem.
• Circumfixes are morphemes applied to both ends of the stem.
• Infixes are morphemes that appear inside a stem.
• Prefixes and suffixes are quite common in Urdu, Hindi, and English.
 The English word unhappy is composed of the stem, happy, and the prefix, un-.
 birds is composed of the stem, bird, and the suffix, -s.
MORPHOLOGICAL PARSING contd..
• Types of word formation
 Inflection, derivation, and compounding
 In inflection, a root word is combined with a grammatical morpheme to yield a word of the
same class as the original stem.
• Inflection is viewed as the process of adding very general meanings to existing words, not as
the creation of new words.
• Eg: in 'cat-s', 'talk-ed’
 Derivation combines a word stem with a grammatical morpheme to yield a word belonging to a different class.
• formation of the noun 'computation' from the verb 'compute’
• child – childhood (both are nouns, but with different meanings), do – undo (both are
verbs, but with opposite meanings), deny (verb) – denial (noun)-different class
• The formation of a noun from a verb or adjective is called nominalization
 compounding is the process of merging two or more words to form a new word
• Personal computer, desktop, overlook, girlfriend
• Morphological analysis and generation deal with inflection, derivation and
compounding process in word formation.
Some points about Morphological analysis
• Morphology knowledge is required to understand the syntactic and
semantic properties of new words.
• Morphological analysis and generation are essential to many NLP applications ranging from spelling correction to machine translation.
 In parsing, it helps to know the agreement features of words.
 In information retrieval, morphological analysis helps identify the
presence of a query word in a document.
• Morphological analysis and generation rely on two sources of
information:
 a dictionary of the valid lemmas of the language
• Lemma for studies, studying is study
 a set of inflection paradigms.
morphological parser
• Information sources used by a morphological parser
 Lexicon : A lexicon lists stems and affixes together with basic
information about them.
 Morphotactics: Deals with certain ordering among the
morphemes that constitute a word.
• For example, restlessness is a valid word in English but not
restnessless.
 Orthographic rules : These are spelling rules that specify the
changes that occur when two given morphemes combine.
• For example, the y→ier spelling rule changes easy to easier, not to 'easyer'.
Morphological analysis
• Morphological analysis can be avoided if an exhaustive lexicon
is available that lists features for all the word-forms of all the
roots.
 More memory is needed
 fails to show the relationship between different roots having
similar word forms.
 For morphologically complex languages, the number of possible word-forms may be theoretically infinite, so it is difficult to list them all.
• So, morphological parsing is necessary to reduce the word-
forms
Morphological analysis - stemmers
• Stemmers are the simplest morphological systems that collapse morphological variations of a given word (word-forms) to one lemma or stem.
 Used in information retrieval.
 They do not require a lexicon.
• They make use of rewrite rules of the form:
 ier → y (e.g., earlier → early)
 ing → ε (e.g., playing → play)
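A minimal sketch of such a rewrite-rule stemmer, assuming only the two rules shown above; real stemmers such as Porter's use many ordered rules with conditions.

```python
# Suffix rewrite rules: (suffix, replacement); '' plays the role of epsilon.
RULES = [('ier', 'y'), ('ing', '')]

def stem(word):
    """Apply the first rule whose suffix matches; otherwise return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(stem('earlier'))  # early
print(stem('playing'))  # play
print(stem('bread'))    # bread (no rule applies)
```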
Stemming algorithms
• Stemming algorithms work in two steps:
 (i)Suffix removal: This step removes predefined endings from words.
 (ii)Recoding: This step adds predefined endings to the output of the
first step
• These two steps can be performed sequentially in Lovins's stemmer and
simultaneously in Porter's stemmer
 Porter's algorithm reduces only suffixes; prefixes and compounds are
not reduced.
• two-level morphological model by Koskenniemi :
 Used for highly inflected languages
 A word is represented as a correspondence between its lexical-level form and its surface-level form.
• The surface level represents the actual spelling of the word
• the lexical level represents the concatenation of its constituent morphemes.
Morphological parsing
• Morphological parsing is viewed as a mapping from the surface
level into morpheme and feature sequences on the lexical level.
 Eg: the surface form 'playing' is represented in the lexical form as
play + V + PP
 The lexical form consists of the stem 'play' followed by the
morphological information +V +PP, which indicates that 'playing' is
the present participle form of the verb
 the surface form 'books' is represented in the lexical form as 'book +
N + PL', where the first component is the stem and the second
component (N + PL) is the morphological information, which tells us
that the surface level form is a plural noun.
finite-state transducer (FST)
• Morphology is implemented with a kind of finite-state automaton called a finite-state transducer (FST).
 An FST can be thought of as a two-tape automaton, which recognizes or generates a pair of strings.
 An FST passes over the input string, consuming input symbols from one tape and writing the corresponding output symbols to the other tape.
• A finite-state transducer is a 6 tuple (Σ1 , Σ2,Q,δ, S, F),
 Q is a finite set of states,
 Σ1 and Σ2 are input and output alphabets,
 δ is a function mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to a subset of the power set of Q,
 S⊆ Q is the set of initial states,
 F ⊆ Q is the set of final states.
• Transducers can be seen as automata with transitions labelled with symbols from Σ1 × Σ2, where Σ1, Σ2 are the input and output alphabets respectively.
• An FST will read a set of strings on the input tape and generates a set of relations on the
output tape. An FST can be thought of as a translator or relater between strings in a set.
FST contd..
• A transducer that accepts two input strings, hot and cat, and maps them
onto cot and bat respectively

• FSTs encode regular relations.


 Regular relation is the relation between regular languages.
 The regular language encoded on the upper side of an FST is called
upper language, and the one on the lower side is termed lower
language.
 If T is a transducer, and s is a string, then we use T(s) to represent the
set of strings encoded by T such that the pair (s, t) is in the relation.
 FSTs are closed under union, concatenation, composition, and Kleene closure.
 They are not closed under intersection and complementation.
Implementation of two-level morphology using FST

 Surface Form   Intermediate form              Lexical Form
                (Step 1: split word into       (Step 2: map morphemes to stem
                possible morphemes)            and morphological features)
 bird           bird                           bird +N +SG
 birds          bird + s                       bird +N +PL
 goose          goose                          goose +N +SG
 geese          geese                          goose +N +PL
 boxes          box + s                        box +N +PL
Two-step morphological parser

Morphological Parsing: Examples

 Input word   Output morphemes
 cats         cat +N +PL
 cat          cat +N +SG
 cities       city +N +PL
 walks        walk +V +3SG
 cook         cook +N +SG or cook +V
morphology using FST..
• Step 1: Split the word up into its possible components by considering spelling rules.
 E.g. bird + s out of birds, where + indicates a morpheme boundary.
 boxes splits into both boxe + s and box + s (the e is handled by a spelling rule).
 The output of this step is a concatenation of morphemes, i.e., stems and affixes.
• Eg: a transducer mapping lesser:
 the comparative form of the adjective less is lesser;
 ε is the empty string.
morphology using FST..
• Step 2: Use a lexicon to look up categories of the stems and
meaning of the affixes
 bird + s will be mapped to bird+ N+ PL
 box + s to box+ N + PL
 splitting boxes into boxe + s is an incorrect way of splitting
boxes, so it is discarded
• Two transducers are needed to implement both steps of morphology:
 one that maps the surface form to the intermediate form, and
 another that maps the intermediate form to the lexical form.
Example lexicon
f:f o:o x:x [fox +N +SG]
c:c a:a t:t [cat +N +SG]
g:g o:o o:o s:s e:e [goose +N +SG] or [goose +V]
g:g o:e o:e s:s e:e [goose +N +PL]
s:s h:h e:e e:e p:p [sheep +N +SG] or [sheep +N +PL]
m:m o:o u:u s:s e:e [mouse +N +SG]
m:m o:i u: s:c e:e [mouse +N +PL]
Some English Spelling Rules

 Consonant doubling   beg / begging / begged
 E deletion           make / making
 E insertion          watch / watches
 Y replacement        try / tries
 K insertion          panic / panicked
morphology using FST..
• An FST-based morphological parser for singular and plural nouns in English.
 The plural form of regular nouns usually ends with -s or -es.
• A word ending in 's' need not necessarily be the plural form of a word.
 One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
• This is required for words ending in xes, ses, zes (e.g., suffixes and boxes).

[Figure: a transducer mapping English nouns to the intermediate form]
morphology using FST..
• The figure shows the possible sequences of states that the transducer undergoes, given the surface forms birds and boxes as input.
Dotted Pair Notation
1) FSA recogniser for "fox":
 f o x

2) FST transducers for fox/fox, goose/geese, less/lesser:
 f:f o:o x:x
 g:g e:o e:o s:s e:e
 l e s s +Adj ε +Comp
 l e s s ε    e r
morphology using FST..
• Develop a transducer that maps from the intermediate level to the lexical level.
 The input to the transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
 The transducer maps all symbols of the stem to themselves and then outputs +N and +SG.
• Regular noun stem + s, e.g., bird + s
 The transducer maps all symbols of the stem to themselves, but then outputs +N and replaces the s with +PL.
• Singular irregular noun stem, e.g., goose
 The transducer maps all symbols of the stem to themselves and then outputs +N and +SG.
• Plural irregular noun stem, e.g., geese
 The transducer maps the irregular plural noun stem to the corresponding singular stem (e.g., geese to goose) and then adds +N and +PL.
• The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
• The transducer implementing the lexicon maps the individual regular and irregular noun stems to their correct noun stem, replacing labels like regular noun form.
 The lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose:
• g:g e:o e:o s:s e:e
 Using a single transducer which combines the input and output tapes, the surface word form birds will be mapped to bird +N +PL as follows:
• b:b i:i r:r d:d +N:ε +PL:s
 Each letter maps to itself, while the morphological feature +N maps to ε (the empty string) and +PL maps to s.
A transducer mapping nouns to their stem and morphological features

• The power of the transducer lies in the fact that the same
transducer can be used for analysis and generation.
 That is, we can run it in the downward direction (input:
surface form and output: lexical form) or in the upward
direction.
SPELLING ERROR DETECTION AND CORRECTION
• In computer-based information systems, errors of typing and spelling constitute a very common source of variation between strings.
• These errors have been widely investigated.
• All investigations agree that single character omission, insertion, substitution, and reversal are the most common typing mistakes.
• As in Damerau (1964), over 80% of the errors were single-error misspellings:
 substitution of a single letter,
 omission of a single letter,
 insertion of a single letter, and
 transposition of two adjacent letters.
• As in Shafer and Hardwick (1968), the most common type of single character error was substitution, followed by omission of a single letter and then insertion of a single letter.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Optical character recognition (OCR) and other automatic reading devices introduce errors of substitution, deletion, and insertion.
• OCR errors are usually grouped into five classes:
 substitution, multi-substitution (or framing), space deletion or insertion, and failures.
 OCR substitution and multi-substitution errors are caused by visual similarity, such as c→e, 1→l, r→n, and m→rn.
 Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
 The frequency and type of errors are characteristics of the particular device.
 These errors can be corrected using 'context' or by using linguistic structures.
• Many approaches to speech recognition deal with strings of phonemes (or symbols representing sounds), and attempt to match a spoken utterance with a dictionary of known utterances.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Spelling errors are mainly phonetic, where the misspelled word is pronounced in the same way as the correct word.
• Phonetic errors distort the word by more than a single insertion, deletion, or substitution.
• Spelling errors are categorized into non-word errors and real-word errors.
 When an error results in a word that does not appear in a given lexicon or is not a valid orthographic word form, it is a non-word error.
• n-gram analysis and dictionary look-up are used to detect non-word errors.
 A real-word error results in actual words of the language.
• It occurs due to typographical mistakes or spelling errors.
• They may cause local syntactic errors, global syntactic errors, semantic errors, or errors at discourse or pragmatic levels.
SPELLING ERROR DETECTION AND CORRECTION contd..
• Spelling correction consists of detecting and correcting errors.
 Error detection is the process of finding misspelled words and error correction is the
process of suggesting correct words to a misspelled one.
 These sub problems are addressed in two ways:
• Isolated-error detection and correction
• Context-dependent error detection and correction
 In isolated-word error detection and correction, each word is checked separately,
independent of its context
 The strategy of detecting whether or not a word is correct has a number of problems
associated with it.
• The strategy requires the existence of a lexicon containing all correct words. Such a lexicon
would take a long time to compile and occupy a lot of space.
• Some languages are highly productive. It is impossible to list all the correct words of such languages.
• This strategy fails when spelling error produces a word that belongs to the lexicon, e.g., when
'theses' is written in place of 'these'. Such an error is called a real-word error.
• The larger the lexicon, the more likely it is that an error goes undetected, because the chance
of a word being found is greater in a large lexicon.
Spelling correction algorithms..
• Context dependent error detection and correction methods, utilize the
context of a word to detect and correct errors.
 This requires grammatical analysis and is thus more complex and language
dependent.
 In these methods, the list of candidate words must first be obtained using
an isolated-word method before making a selection depending on the
context
• The spelling correction algorithm has been broadly categorized by Kukich
(1992) as follows.
 Minimum edit distance: It is the minimum number of operations
(insertions, deletions, or substitutions) required to transform one string
into another.
 Similarity key technique: It changes the given string into a key such that similar strings change into the same key.
Spelling correction algorithms contd..
 n-gram based techniques: These can be used for both non-word and real-word error detection.
• In English, certain bigrams and trigrams of letters never occur or occur rarely. This information can be used to handle non-word errors.
• Strings that contain these unusual n-grams can be identified as possible spelling errors.
• n-gram techniques usually require a large corpus or dictionary as training data.
 Neural nets: These have the ability to do associative recall based on incomplete and noisy data.
• They can be trained to adapt to specific spelling error patterns.
• The drawback of neural nets is that they are computationally expensive.
 Rule-based techniques: A set of rules (heuristics) derived from knowledge of common spelling error patterns is used to transform misspelled words into valid words.
• For example, a rule can be written for errors that occur when the letters ue are typed as eu.
Minimum Edit Distance
• Minimum edit distance is the number of insertions, deletions, and substitutions required to change one string into another.
 E.g. the minimum edit distance between tutor and tumour is 2:
• substitute 'm' for the second 't' and insert 'u' before 'r'.
• The edit distance between two strings can be represented as a binary function.
 Edit distance can be viewed as an alignment problem.
• By aligning two strings, the degree of match can be found.
• The following alignment between tutor and tumour has an edit distance of 2:
 t u t o - r
 t u m o u r
• Another possible alignment has edit distance 3:
 t u t - o - r
 t u - m o u r
• The best possible alignment corresponds to the minimum edit distance.
• Dynamic programming algorithms can be quite useful for finding the minimum edit distance between two sequences.
• It is implemented by creating an edit distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string.
 The (i,j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string.
 The value in each cell is computed in terms of the three cells from which it can be reached:

 dist[i,j] = min { dist[i-1,j] + insert_cost,
                   dist[i-1,j-1] + subst_cost(source_i, target_j),
                   dist[i,j-1] + delete_cost }

• The substitution cost will be 0 if the ith character in the source matches the jth character in the target.
Minimum Edit Distance algorithm
Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i,0] ← i
for j = 0 to n do
    dist[0,j] ← j
for i = 1 to m do
    for j = 1 to n do
        dist[i,j] ← min { dist[i-1,j] + insert_cost,
                          dist[i-1,j-1] + subst_cost(Xi, Yj),
                          dist[i,j-1] + delete_cost }
end
Computing Minimum Edit Distance

   #  t  u  m  o  u  r
 # 0  1  2  3  4  5  6
 t 1  0  1  2  3  4  5
 u 2  1  0  1  2  3  4
 t 3  2  1  1  2  3  4
 o 4  3  2  2  1  2  3
 r 5  4  3  3  2  2  2
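A direct Python rendering of the algorithm, with unit costs for all three operations; it reproduces the value 2 in the bottom-right cell of the matrix above.

```python
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # deleting all of the source prefix
        dist[i][0] = i
    for j in range(n + 1):          # inserting all of the target prefix
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # insert
                             dist[i - 1][j - 1] + subst,  # substitute
                             dist[i][j - 1] + 1)          # delete
    return dist[m][n]

print(min_edit_distance('tutor', 'tumour'))  # 2
```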
Words and Word Classes
• Words are classified into categories called part-of-speech
 Also called as word classes or lexical categories
 Common lexical categories are noun and verbs
 Other lexical categories include adjectives, adverbs,
prepositions and conjunctions.
Parts of speech (POS) tagging
• Parts of speech tagging is the process of assigning a part-of-speech (such as noun, verb, pronoun, preposition, adverb, or adjective) to each word in a sentence.
 The input to a tagging algorithm is the sequence of words of a natural language sentence and a specified tag set (a finite list of POS tags).
 The output is a single best POS tag for each word.
 Many words may belong to more than one lexical category.
• But only one of the possible meanings is used at a time.
 No tagger is efficient enough to identify the correct lexical category of each word in a sentence in every case.
• E.g. Soma is reading the book and Rama booked the ticket.
 The collection of tags used by a particular tagger is called a tag set.
• Most POS tag sets make use of the same basic categories, i.e., noun, verb, adjective, and preposition.
• Tag sets differ in how they define categories and how finely they divide words into categories.
• Most tag sets capture morpho-syntactic information such as singular/plural, number, gender, tense, etc.
 The number of tags used by different taggers varies.
POS methods
• Parts of speech tagging methods fall into 3 categories:
 Rule based(linguistic)
 Stochastic (data driven)
 Hybrid.
• Rule Based Taggers
 Use hand coded rules to assign tags to words. These rules use a lexicon
to obtain a list of candidate tags and then use rules to discard incorrect
tags.
• Stochastic (data driven)
 Have data driven approaches in which frequency based information is
automatically derived from corpus and used to tag words.
 They disambiguate words based on the probability that a word occurs
with a particular tag.
• Hybrid taggers combine both of above approaches.
Rule Based Taggers
• Rule Based Taggers have a 2-stage architecture:
 The first stage is a dictionary look-up procedure, which returns a set of potential tags and appropriate syntactic features for each word.
 The second stage uses hand-coded rules to discard contextually illegitimate tags to get a single part of speech for each word.
• Eg: the noun-verb ambiguity. Sentence: The show must go on.
• The potential tags for the word show in this sentence are {VB, NN}.
 The ambiguity is resolved using the rule:
 IF the preceding word is a determiner THEN eliminate the VB tag.
• This rule disallows a verb after a determiner; using this rule, the word show is tagged as a noun.
• In addition to contextual information, many taggers use morphological information to help in the disambiguation process.
 E.g. IF the word ends in -ing and the preceding word is a verb THEN label it a verb (VB).
• Capitalization information can be utilized in the tagging of unknown nouns.
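A toy sketch of the two-stage scheme, assuming a tiny hand-made lexicon; real systems use large lexicons and thousands of rules.

```python
# Stage 1: dictionary look-up of candidate tags; stage 2: contextual rules.
LEXICON = {'the': {'DT'}, 'show': {'NN', 'VB'}, 'must': {'MD'},
           'go': {'VB'}, 'on': {'IN'}}

def tag(words):
    tags = []
    for i, w in enumerate(words):
        candidates = set(LEXICON.get(w, {'NN'}))
        # Rule: IF the preceding word is a determiner THEN eliminate VB.
        if i > 0 and tags[i - 1] == 'DT' and len(candidates) > 1:
            candidates.discard('VB')
        tags.append(sorted(candidates)[0])
    return list(zip(words, tags))

print(tag(['the', 'show', 'must', 'go', 'on']))
# [('the', 'DT'), ('show', 'NN'), ('must', 'MD'), ('go', 'VB'), ('on', 'IN')]
```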
Rule Based Taggers
• Rule based taggers usually require supervised training. Instead, rules can be induced automatically:
 To induce rules, untagged text is run through a tagger.
 The output is then manually corrected.
 The corrected text is then submitted to the tagger, which learns correction rules by comparing the 2 sets of data.
 This process may be repeated several times.
• TAGGIT (Greene and Rubin 1971): initial tagging of the Brown Corpus (Francis and Kucera 1982).
 A rule based system: it uses 3,300 disambiguation rules and was able to tag 77% of the words in the Brown Corpus with their correct part of speech.
• Another rule based tagger is ENGTWOL (Voutilainen 1995).
Rule Based Taggers
• Advantages:
 Speed; more deterministic than stochastic taggers.
 The argument against them is the skill and effort required in writing disambiguation rules.
 Stochastic taggers also require manual work if good performance is to be achieved:
 in rule based tagging, time is spent writing a rule set;
 in stochastic tagging, time is spent developing restrictions on transitions and emissions to improve tagger performance.
• Disadvantages:
 Usable for only one language. Using it for another one requires a rewrite of most of the programs.
Stochastic tagger
• The standard stochastic tagger is the HMM tagger algorithm.
• A Markov model applies the simplifying assumption that the probability of a chain of symbols can be approximated in terms of its parts, or n-grams.
• The simplest n-gram model is the unigram model, which assigns the most likely tag to each token.
• The unigram model needs to be trained using a tagged training corpus before it can be used to tag data.
• The most-likely-tag statistics are gathered over the corpus and used for tagging.
• The context used by the unigram tagger is the text of the word itself.
 Ex: it will assign the tag JJ to each occurrence of fast, since fast is used as an adjective more often than as a noun, verb, or adverb:
 She had a fast [noun]
 Muslims fast [verb] during Ramadan
 Those who were injured in the accident need to be helped fast [adverb]
• A bigram tagger uses the current word and the tag of the previous word in the tagging process.
• For example, the tag sequence "DT NN" is more likely than the tag sequence "DT JJ".
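A minimal sketch of a unigram tagger trained on a tiny invented corpus; the (word, tag) pairs are made up to mirror the fast example.

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
train = [('she', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fast', 'JJ'),
         ('fast', 'JJ'), ('fast', 'NN'), ('fast', 'VB'), ('car', 'NN')]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_tag(word):
    """Assign the most frequent tag seen for this word in training."""
    return counts[word].most_common(1)[0][0] if word in counts else 'NN'

print(unigram_tag('fast'))  # 'JJ': its most frequent tag in this toy corpus
```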
Stochastic tagger
• In general, an n-gram model considers the current word and the tags of the previous n-1 words in assigning a tag to a word.
• The context considered by a trigram model:

 Tokens   wn-2   wn-1   wn   wn+1
 Tags     tn-2   tn-1   tn   tn+1   (the shaded area in the figure represents the context)

• In a bigram model, a tag is assigned to a word given the previous tag.
• The objective of a tagger is to assign a tag sequence to a given sentence.
• An HMM uses 2 layers of states:
 1) a visible layer corresponding to the input words,
 2) a hidden layer learnt by the system, corresponding to the tags.
• While tagging the input data, we observe only the words; the tags (states) are hidden.
 The states of the model are visible during training, not during the tagging task.
Stochastic tagger
• HMM makes use of lexical and bigram probabilities estimated over a tagged
corpus in order to compute the most likely tag sequence for each sentence
• One way to store the statistical information is to build a probability matrix.
 This contains both the probability that an individual word belongs to a word
class as well as the n-gram analysis
• eg: For a bigram model, the probability that a word of class X follows a word of class Y.
 This matrix is then used to drive the HMM tagger while tagging a unknown
text
• Given a sequence of words, the objective is to find the most probable tag sequence for the sentence.
• Let W be the sequence of words:
 W = w1, w2, ..., wn
• The task is to find the tag sequence
 T = t1, t2, ..., tn
• which maximizes P(T|W), i.e.,
 T' = argmaxT P(T|W)
Stochastic tagger contd..
• Applying Bayes' rule, P(T|W) can be estimated using the expression:
 P(T|W) = P(W|T) * P(T) / P(W)
• The probability of the word sequence, P(W), is dropped as it remains the same for each tag sequence. So the expression becomes:
 T' = argmaxT P(W|T) * P(T)
• Using the Markov assumption, the probability of a tag sequence can be estimated as the product of the probability of its constituent n-grams, i.e.,
 P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) * ... * P(tn|t1...tn-1)
• P(W|T) is the probability of seeing a word sequence, given a tag sequence.
 E.g. the probability of seeing "The egg is rotten" given 'DT NNP VB JJ'.
• The following two assumptions can be made:
 The words are independent of each other.
 The probability of a word is dependent only on its tag.
Stochastic tagger contd..
• Using these assumptions, we obtain:
 P(W|T) = P(w1|t1) * P(w2|t2) * ... * P(wn|tn) = ∏_{i=1..n} P(wi|ti)
 P(W|T) * P(T) = ∏_{i=1..n} P(wi|ti) * P(t1) * P(t2|t1) * P(t3|t1t2) * ... * P(tn|t1...tn-1)
• Approximating the tag history using only the two previous tags, the transition probability P(T) becomes:
 P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) * ... * P(tn|tn-2 tn-1)
• Hence, P(T|W) can be estimated as:
 P(W|T) * P(T) = ∏_{i=1..n} P(wi|ti) * P(t1) * P(t2|t1) * ∏_{i=3..n} P(ti|ti-2 ti-1)
• We estimate these probabilities from relative frequencies via maximum likelihood estimation:
 P(ti|ti-2 ti-1) = C(ti-2, ti-1, ti) / C(ti-2, ti-1)
 P(wi|ti) = C(wi, ti) / C(ti)
• where C(ti-2, ti-1, ti) is the number of occurrences of ti-2, ti-1 followed by ti.
• Advantages:
 Accurate and language independent.
• Disadvantage:
 Requires a manually tagged corpus.
• Example
 Consider the sentence:
 The/DT bird/NNP can/MD fly/VB
• Using the bigram approximation, the probability of this tagging can be computed as
 P = P(DT) * P(NNP|DT) * P(MD|NNP) * P(VB|MD) * P(the|DT) * P(bird|NNP) * P(can|MD) * P(fly|VB)
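A sketch of scoring this tag sequence under the bigram formula; the probability values below are invented for illustration, not estimates from a real corpus.

```python
# Invented bigram transition and lexical (emission) probabilities.
transition = {('<s>', 'DT'): 0.6, ('DT', 'NNP'): 0.3,
              ('NNP', 'MD'): 0.2, ('MD', 'VB'): 0.7}
emission = {('the', 'DT'): 0.5, ('bird', 'NNP'): 0.01,
            ('can', 'MD'): 0.3, ('fly', 'VB'): 0.05}

def score(words, tags):
    """Compute P(W|T) * P(T) under the bigram approximation."""
    p, prev = 1.0, '<s>'
    for w, t in zip(words, tags):
        p *= transition[(prev, t)] * emission[(w, t)]
        prev = t
    return p

print(score(['the', 'bird', 'can', 'fly'], ['DT', 'NNP', 'MD', 'VB']))
# 0.6*0.5 * 0.3*0.01 * 0.2*0.3 * 0.7*0.05 ≈ 1.89e-06
```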
Hybrid Taggers
• Hybrid approaches to tagging combine the features of both the rule based and stochastic approaches.
• They use rules to assign tags to words.
• Like the stochastic taggers, this is a machine learning technique, and the rules are automatically induced from the data.
• Transformation-based learning (TBL) of tags, also known as Brill tagging, is an example of the hybrid approach.
• Transformation-based error-driven learning:
 Used in a number of natural language learning problems, including parts-of-speech tagging, speech generation, and syntactic parsing.
 A supervised learning algorithm.
 The input to Brill's TBL tagging algorithm is a tagged corpus and a lexicon.
 The initial state annotator uses the lexicon to assign the most likely tag to each word as the start state.
 An ordered set of transformational rules is applied sequentially.
 The rule that results in the most improved tagging is selected.
 A manually tagged corpus is used as the reference for truth.
Hybrid Taggers contd..
 The process is iterated until some stopping criterion is reached, such as when no significant improvement is achieved over the previous iteration.
 At each iteration, the transformation that results in the highest score is selected.
 The output of the algorithm is a ranked list of learned transformations that transform the initial tagging close to the correct tagging.
 New text can then be annotated by first assigning the most frequent tag and then applying the ranked list of learned transformations in order.

[Figure: TBL learner]
TBL tagging algorithm
• Input: tagged corpus and lexicon (with most-frequent-tag information)
• Step 1: Label every word with its most likely tag (from the dictionary).
• Step 2: Check every possible transformation and select the one which most improves tagging.
• Step 3: Re-tag the corpus applying the rules.
• Repeat steps 2-3 until some stopping criterion is reached.
• RESULT: a ranked sequence of transformation rules.

• Each transformation is a pair: a re-write rule of the form t1 -> t2 and a contextual condition.
• In order to limit the set of transformations, a small set of templates is constructed.
• Any allowable transformation is an instantiation of these templates.
TBL tagging algorithm contd..
• Examples of transformation templates and rules learned by the tagger
 Change tag a to tag b when:
 1. The preceding (following) word is tagged z.
 2. The preceding (following) word is w.
 3. The word two before (after) is w.
 4. One of the preceding two words is w.
 5. One of the two preceding (following) words is tagged z.
 6. The current word is w and the preceding (following) word is x.
 7. One of the previous three words is tagged z.

 #   From  To   Contextual condition       Example
 1.  NN    VB   The previous tag is TO     to/TO fish/NN
 2.  JJ    RB   The previous tag is VBZ    runs/VBZ fast/JJ
Example for TBL tagging algorithm:
• Assume that in a corpus, fish is most likely to be a noun:
 P(NN|fish) = 0.91
 P(VB|fish) = 0.09
• Now consider the following two sentences and their initial tags:
 I/PRP like/VB to/TO eat/VB fish/NN
 I/PRP like/VB to/TO fish/NN
• Since the most likely tag for fish is NN, the tagger assigns this tag to the word in both sentences.
• In the second case, it is a mistake.
• After initial tagging, when the transformation rules are applied, the tagger learns a rule that applies exactly to this mistagging of fish:
 Change NN to VB if the previous tag is TO.
• As the contextual condition is satisfied, this rule will change fish/NN to fish/VB:
 like/VB to/TO fish/NN → like/VB to/TO fish/VB
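A minimal sketch of applying one learned transformation to an initial tagging, following the fish example; a full TBL learner would also generate, score, and rank candidate rules.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous word's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [('I', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('fish', 'NN')]
print(apply_rule(initial, 'NN', 'VB', 'TO'))
# [('I', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('fish', 'VB')]
```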
TBL tagging algorithm
• The algorithm can be made efficient by indexing the words in the training corpus by potential transformations.
• Most of the work in part-of-speech tagging has been done for English and some European languages.
• In other languages, part-of-speech tagging, and NLP research in general, is constrained by the lack of annotated corpora.
• A few part-of-speech tagging systems reported in recent years use morphological analyzers along with a tagged corpus.
Unknown words
• Unknown words do not appear in the dictionary or the training corpus.
 They create problems during tagging.
• Solutions:
 Assign the most frequent tag to the unknown word.
 Assume any POS: initialize unknown words with all open-class tags.
 Use morphological information, such as suffixes, to guess the possible tag for an unknown word.