NLP-Unit-III - Notes
Master of information technology (University of Mumbai)
NLP – UNIT -III
Word Classes and Part-of-Speech tagging(POS), Survey of POS tagsets,
Rule based approaches (ENGTWOL), Stochastic approaches (Probabilistic,
N-gram and HMM), TBL morphology, unknown word handling, Evaluation
metrics: Precision/Recall/F-measure, error analysis.
❖ Word Classes:
In part-of-speech tagging for English, we are given a text of English words and
we need to identify the part of speech of each word.
Example sentence: Learn NLP from Scaler
Learn -> VERB, NLP -> NOUN, from -> PREPOSITION, Scaler -> NOUN
Although it seems easy, Identifying the part of speech tags is much more
complicated than simply mapping words to their part of speech tags.
Words often have more than one POS tag. Let’s understand this by taking an easy
example.
In the sentences below, focus on the word "back":
The back door -> back is an ADJECTIVE
On my back -> back is a NOUN
Win the voters back -> back is an ADVERB
Promised to back the bill -> back is a VERB
The relationship of "back" with adjacent and related words in a phrase,
sentence, or paragraph changes its POS tag.
It is quite possible for a single word to have a different part-of-speech tag
in different sentences based on different contexts. That is why it is very
difficult to have a generic mapping for POS tags.
Word Classes
In grammar, a part of speech or part-of-speech (POS), also known as a word
class or grammatical category, is a category of words that have similar
grammatical properties.
The English language has four major word classes: Nouns, Verbs, Adjectives, and
Adverbs.
Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs,
pronouns, prepositions, conjunctions, interjections, numerals, articles, and
determiners.
These can be further categorized into open and closed classes.
Closed Class
Closed classes are those with a relatively fixed set of words; we rarely add
new words to these POS classes, such as prepositions. Closed-class words are
generally function words like of, it, and, or you, which tend to be very
short, occur frequently, and often have structuring uses in grammar.
Example of closed class-
Determiners: a, an, the Pronouns: she, he, I, others Prepositions: on, under,
over, near, by, at, from, to, with
Open Class
Open classes are mostly content-bearing, i.e., they refer to objects, actions,
and features; they are called open classes since new words are added all the
time. By contrast with closed classes, nouns, verbs, adjectives, and adverbs
belong to open classes; new nouns and verbs like iPhone or to fax are
continually being created or borrowed.
Example of open class-
Nouns: computer, board, peace, school Verbs: say, walk, run,
belong Adjectives: clean, quick, rapid, enormous Adverbs: quickly, softly,
enormously, cheerfully
Tagset
The problem is (as discussed above) that many words belong to more than one
word class.
To do POS tagging, a standard tagset needs to be chosen. We could pick a very
simple/coarse tagset such as Noun (NN), Verb (VB), Adjective (JJ), Adverb
(RB), etc.
But to make the tags less ambiguous, the commonly used set is the finer-grained
University of Pennsylvania "Penn Treebank tagset", having a total of 45 tags.
Tag    Description                     Example
CC     Coordinating conjunction        and, but, or
CD     Cardinal number                 one, two, three
DT     Determiner                      a, the
EX     Existential 'there'             there
FW     Foreign word                    mea culpa
IN     Preposition/sub-conj            of, in, by
JJ     Adjective                       yellow
JJR    Adjective, comparative          bigger
JJS    Adjective, superlative          wildest
LS     List item marker                1, 2, One
MD     Modal                           can, should
NN     Noun, singular or mass          llama
NNS    Noun, plural                    llamas
NNP    Proper noun, singular           IBM
NNPS   Proper noun, plural             Carolinas
PDT    Predeterminer                   all, both
POS    Possessive ending               's
PRP    Personal pronoun                I, you, he
PRP$   Possessive pronoun              your, one's
RB     Adverb                          quickly, never
RBR    Adverb, comparative             faster
RBS    Adverb, superlative             fastest
RP     Particle                        up, off
SYM    Symbol                          +, %, &
TO     "to"                            to
UH     Interjection                    ah, oops
VB     Verb, base form                 eat
VBD    Verb, past tense                ate
VBG    Verb, gerund                    eating
VBN    Verb, past participle           eaten
VBP    Verb, non-3sg present           eat
VBZ    Verb, 3sg present               eats
WDT    Wh-determiner                   which, that
WP     Wh-pronoun                      what, who
WP$    Possessive wh-pronoun           whose
WRB    Wh-adverb                       how, where
$      Dollar sign                     $
#      Pound sign                      #
"      Left quote                      ' or "
"      Right quote                     ' or "
(      Left parenthesis                [, (, {, <
)      Right parenthesis               ], ), }, >
,      Comma                           ,
.      Sentence-final punctuation      . ! ?
:      Mid-sentence punctuation        : ; ... -
❖ Part-of-Speech tagging(POS):
Part-of-speech tagging is the process of assigning a part of speech to each
word in a text. The input is a sequence 𝑥1,𝑥2,...,𝑥𝑛 of (tokenized) words, and
the output is a sequence 𝑦1,𝑦2,...,𝑦𝑛 of POS tags, each output 𝑦𝑖 corresponding
exactly to one input 𝑥𝑖.
Tagging is a disambiguation task; words are ambiguous, i.e., they have more
than one possible part of speech, and the goal is to find the correct tag for
the situation.
For example, a book can be a verb (book that flight) or a noun (hand me
that book).
The goal of POS tagging is to resolve these ambiguities, choosing the proper
tag for the context.
POS tagging Algorithms Accuracy:
The accuracy of existing state-of-the-art part-of-speech tagging algorithms
is extremely high: it can be as high as ~97%, which is also about the human
performance on this task, at least for English.
We’ll discuss algorithms/techniques for this task in the upcoming sections, but
first, let’s explore the task. Exactly how hard is it?
Let's consider one of the popular electronic collections of text samples, Brown
Corpus. It is a general language corpus containing 500 samples of English,
totaling roughly one million words.
In the Brown Corpus:
85-86% of words are unambiguous - they have only 1 possible POS tag
but 14-15% of words are ambiguous - they have 2 or more possible POS tags
Particularly ambiguous common words include that, back, down, put, and set.
The word back itself can have 6 different parts of speech (JJ, NN, VBP, VB, RP,
RB) depending on the context.
Nonetheless, many words are easy to disambiguate because their different tags
aren’t equally likely. For example, "a" can be a determiner or the letter "a",
but the determiner sense is much more likely.
This idea suggests a useful baseline: given an ambiguous word, choose the tag
which is most frequent for that word in the corpus.
This is the key idea behind the Most Frequent Class tagging baseline.
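As a rough illustration, here is a minimal Python sketch of such a most-frequent-tag baseline; the tiny tagged_corpus below is an invented example, not real Brown Corpus data:

from collections import Counter, defaultdict

# Toy labelled corpus of (word, tag) pairs (illustrative only).
tagged_corpus = [
    ("the", "DT"), ("back", "NN"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN"),
]

# Count how often each tag occurs with each word.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word.lower()][tag] += 1

def most_frequent_tag(word, default="NN"):
    # Baseline: return the tag seen most often with this word in training.
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

print(most_frequent_tag("back"))   # NN (most frequent tag for "back" in the toy corpus)
print(most_frequent_tag("plane"))  # unseen word, falls back to the default tag NN

Despite its simplicity, this baseline already gets a large fraction of tokens right, which is why it is the usual point of comparison for more sophisticated taggers.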
Let’s explore some common baseline and more sophisticated POS tagging
techniques.
❖ Survey of POS tagsets:
Several standard tagsets are in use. The 45-tag Penn Treebank tagset shown above is the most common for English; coarser sets such as the Universal POS tagset (about 17 tags) and finer-grained sets such as the original Brown Corpus tagset (87 tags) are also used, and the choice of tagset determines how much grammatical detail a tagger must disambiguate.
❖ Rule based approaches (ENGTWOL):
Rule-based tagging is the oldest tagging approach where we use contextual
information to assign tags to unknown or ambiguous words.
The rule-based approach uses a dictionary to get possible tags for tagging each
word. If the word has more than one possible tag, then rule-based taggers use
hand-written rules to identify the correct tag.
Since the rules are usually built manually, such taggers are also called
knowledge-driven taggers. The number of rules is limited, roughly around
1,000 for English.
One example of a rule is as follows:
Sample Rule: If an ambiguous word “X” is preceded by a determiner and followed
by a noun, tag it as an adjective;
A nice car: nice is an ADJECTIVE here.
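As a rough sketch (not the actual ENGTWOL/constraint-grammar formalism), the sample rule above could be written in Python as follows; the mini-lexicon and the ambiguity of "nice" are assumptions made purely for illustration:

# Hypothetical mini-lexicon: each word maps to its set of possible tags.
lexicon = {
    "a": {"DT"},
    "nice": {"JJ", "NN"},   # treated as ambiguous here for illustration
    "car": {"NN"},
}

def apply_det_noun_rule(words, tags):
    # Sample rule: if an ambiguous word is preceded by a determiner and
    # followed by a noun, tag it as an adjective.
    for i in range(1, len(words) - 1):
        if len(lexicon.get(words[i], set())) > 1:          # ambiguous word
            if tags[i - 1] == "DT" and "NN" in lexicon.get(words[i + 1], set()):
                tags[i] = "JJ"
    return tags

words = ["a", "nice", "car"]
tags = ["DT", "NN", "NN"]                 # initial (partly wrong) guesses
print(apply_det_noun_rule(words, tags))   # ['DT', 'JJ', 'NN']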
Limitations/Disadvantages of Rule-Based Approach:
● High development cost and high time complexity when applying to a large
corpus of text
● Defining a set of rules manually is an extremely cumbersome process and
is not scalable at all
❖ Stochastic approaches - Probabilistic:
Stochastic POS Tagger uses probabilistic and statistical information from the
corpus of labelled text (where we know the actual tags of words in the corpus)
to assign a POS tag to each word in a sentence.
This tagger can use techniques like Word frequency measurements and Tag
Sequence Probabilities. It can either use one of these approaches or a
combination of both. Let’s discuss these techniques in detail.
Word Frequency Measurements
The tag encountered most frequently in the corpus is the one assigned to
ambiguous words (words having 2 or more possible POS tags).
Let’s understand this approach using some example sentences :
Ambiguous Word = “play”
Sentence 1 : I play cricket every day. POS tag of play = VERB
Sentence 2 : I want to perform a play. POS tag of play = NOUN
The word frequency method will now check the most frequently used POS tag for
“play”. Let’s say this frequent POS tag happens to be VERB; then we assign the
POS tag of "play” = VERB
The main drawback of this approach is that it can yield invalid sequences of
tags.
Tag Sequence Probabilities
In this method, the best tag for a given word is determined by the probability
that it occurs with “n” previous tags.
Simply put, assume we have a new sequence of 4 words: w1 w2 w3 w4, and we need
to identify the POS tag of w4.
If n = 3, we will consider the POS tags of the 3 words prior to w4 in the
labelled corpus of text.
Let's say the POS tags are:
w1 = NOUN, w2 = VERB, w3 = DETERMINER
In short, N, V, D: NVD.
Then, in the labelled corpus of text, we will search for this NVD sequence.
Let's say we found 100 such NVD sequences. Out of these -
10 sequences have NOUN as the POS tag of the next word, and 90 sequences have
VERB as the POS tag of the next word.
Then the POS of the word w4 = VERB.
The main drawback of this technique is that sometimes the predicted sequence is
not grammatically correct.
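A minimal sketch of the counting behind this idea, assuming a hypothetical list of tag sequences taken from a labelled corpus:

from collections import Counter, defaultdict

# Hypothetical tag sequences from a labelled corpus (one list per sentence).
tag_sequences = [
    ["NOUN", "VERB", "DET", "NOUN"],
    ["NOUN", "VERB", "DET", "VERB"],
    ["NOUN", "VERB", "DET", "VERB"],
]

# For n = 3, count which tag follows each context of 3 previous tags.
next_tag_counts = defaultdict(Counter)
for tags in tag_sequences:
    for i in range(3, len(tags)):
        context = tuple(tags[i - 3:i])
        next_tag_counts[context][tags[i]] += 1

def predict_next_tag(context):
    # Pick the tag that most often followed this 3-tag context in training.
    counts = next_tag_counts.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

print(predict_next_tag(["NOUN", "VERB", "DET"]))   # VERB (2 of the 3 continuations)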
Now let’s discuss some properties and limitations of the Stochastic tagging
approach:
1. This POS tagging is based on the probability of the tag occurring (either
solo or in sequence)
2. It requires labelled corpus, also called training data in the Machine
Learning lingo
3. There would be no probability for the words that don’t exist in the
training data
4. It uses a different testing corpus (unseen text) other than the training
corpus
5. It is the simplest POS tagging because it chooses the most frequent tags
associated with a word in the training corpus
Transformation-Based Learning Tagger: TBL
Transformation-based tagging is the combination of Rule-based & stochastic
tagging methodologies.
In layman's terms:
The algorithm keeps searching for a new best set of transformation rules over
the labelled corpus given as input, until its accuracy on that corpus saturates.
Algorithm takes following Input:
● a tagged corpus
● a dictionary of words with the most frequent tags
Output: Sequence of transformation rules
Example of sample rule learned by this algorithm:
Rule: Change Noun(NN) to Verb(VB) when previous tag is To(TO)
E.g.: race has the following probabilities in the Brown corpus -
● Probability that the tag is NOUN given the word is race: P(NN | race) = 0.98
● Probability that the tag is VERB given the word is race: P(VB | race) = 0.02
Given sequence: is expected to race tomorrow
● First tag race with NOUN (since its probability of being NOUN is 98%)
● Then apply the above rule and retag the POS of race with VERB (since just
the previous tag before the “race” word is TO)
The Working of the TBL Algorithm
Step 1: Label every word with the most likely tag via lookup from the input
dictionary.
Step 2: Check every possible transformation & select one which most improves
tagging accuracy.
Similar to the above sample rule, other candidate (possibly worse)
transformation rules could be -
● Change Noun(NN) to Determiner(DT) when previous tag is To(TO)
● Change Noun(NN) to Adverb(RB) when previous tag is To(TO)
● Change Noun(NN) to Adjective(JJ) when previous tag is To(TO)
● etc…..
Step 3: Re-tag the corpus by applying the best transformation rule selected in
Step 2.
Repeat Steps 2 and 3 as many times as needed until accuracy saturates or you
reach some predefined accuracy cutoff.
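A highly simplified sketch of one learning iteration, restricted (for illustration) to candidate rules of the form "change NN to X when the previous tag is TO"; the toy corpus and the most-likely-tag dictionary are assumptions, not real Brown Corpus statistics:

# Toy gold-standard corpus: one list of (word, correct_tag) pairs per sentence.
gold = [[("expected", "VBN"), ("to", "TO"), ("race", "VB"), ("tomorrow", "NN")],
        [("the", "DT"), ("school", "NN"), ("opened", "VBD")]]

# Step 1: label every word with its most likely tag (here a fixed dictionary).
most_likely = {"expected": "VBN", "to": "TO", "race": "NN",
               "tomorrow": "NN", "the": "DT", "school": "NN", "opened": "VBD"}
current = [[(w, most_likely[w]) for w, _ in sent] for sent in gold]

def accuracy(tagged):
    pairs = [(t, g) for ts, gs in zip(tagged, gold)
             for (_, t), (_, g) in zip(ts, gs)]
    return sum(t == g for t, g in pairs) / len(pairs)

def apply_rule(tagged, new_tag):
    # Transformation: change NN to new_tag when the previous tag is TO.
    retagged = []
    for sent in tagged:
        sent = list(sent)
        for i in range(1, len(sent)):
            if sent[i][1] == "NN" and sent[i - 1][1] == "TO":
                sent[i] = (sent[i][0], new_tag)
        retagged.append(sent)
    return retagged

# Step 2: try every candidate transformation and keep the one that helps most.
candidates = ["VB", "DT", "RB", "JJ"]
best = max(candidates, key=lambda tag: accuracy(apply_rule(current, tag)))
print("learned rule: NN -> %s after TO" % best)   # NN -> VB after TO

# Step 3: re-tag the corpus with the learned rule, then repeat until accuracy saturates.
current = apply_rule(current, best)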
Advantages and Drawbacks of the TBL Algorithm
Advantages
● We can learn a small set of simple rules, and these rules are decent
enough for basic POS tagging
● Development, as well as debugging, is very easy in TBL because the
learned rules are easy to understand
● Complexity in tagging is reduced because, in TBL, there is a cross-
connection between machine-learned and human-generated rules
Drawbacks
Despite being a simple and somewhat effective approach to POS tagging, TBL has
major disadvantages.
● TBL algorithm training/learning time complexity is very high, and time
increases multi-fold when corpus size increases
● TBL does not provide tag probabilities
❖ Stochastic approaches – N-gram:
N-gram can be defined as the contiguous sequence of n items from a given sample
of text or speech. The items can be letters, words, or base pairs according to
the application. The N-grams typically are collected from a text or speech
corpus (A long text dataset).
N-gram Language Model:
An N-gram language model predicts the probability of a given N-gram within any
sequence of words in the language. A good N-gram model can predict the next
word in a sentence, i.e., the value of p(w|h), the probability of word w given
the history h of preceding words.
Examples of N-grams: unigrams ("This", "article", "is", "on", "NLP") or
bigrams ("This article", "article is", "is on", "on NLP").
Now, we will establish how to find the next word in a sentence using p(w|h),
where w is the candidate next word and h is the history (the preceding words).
For example, suppose we want to calculate the probability of the last word
being "NLP" given the previous words:

p(NLP | this article is on)

Generalizing, for a sentence W = w1 w2 ... wn we want:

P(W) = P(w1, w2, ..., wn), and in particular p(wn | w1, w2, ..., w(n-1))

But how do we calculate it? The answer lies in the chain rule of probability:

P(B | A) = P(A, B) / P(A)
∴ P(A, B) = P(A) · P(B | A)

Generalizing the above equation:

P(X1, X2, ..., Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) ... P(Xn | X1, X2, ..., X(n-1))
P(w1, w2, ..., wn) = ∏i P(wi | w1, w2, ..., w(i-1))

Simplifying the above formula using the Markov assumption:

P(wi | w1, w2, ..., w(i-1)) ≈ P(wi | w(i-k), ..., w(i-1))

For a unigram model:

P(w1, w2, ..., wn) ≈ ∏i P(wi)

For a bigram model:

P(wi | w1, w2, ..., w(i-1)) ≈ P(wi | w(i-1))
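A minimal sketch of estimating bigram probabilities by maximum likelihood from a tiny corpus (the two sentences are invented for illustration):

from collections import Counter

corpus = ["this article is on NLP", "this article is good"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()      # <s> marks the sentence start
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    # Maximum-likelihood estimate: P(word | prev) = count(prev word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("article", "this"))   # 2/2 = 1.0
print(p_bigram("on", "is"))          # 1/2 = 0.5

The same counting idea applies to POS tagging: an HMM tagger (next section) estimates tag-to-tag transition probabilities and tag-to-word emission probabilities from a labelled corpus in exactly this way.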
❖ Stochastic approaches – HMM:
A Markov chain is a mathematical model that represents a process where the system
transitions from one state to another. The transition assumes that the probability
of moving to the next state is solely dependent on the current state. Please
refer to the figure below for an illustration:
In the above figure, ‘a’, ‘p’, ‘i’, ‘t’, ‘e’, and ‘h’ represent the states, while
the numbers on the edges indicate the probability of transitioning from one state
to another. For example, the probability of transitioning from state ‘t’ to
states ‘i’, ‘a’, and ‘h’ are 0.3, 0.3, and 0.4, respectively.
The start state is a special state that represents the initial state of the
process, such as the start of a sentence.
Markov processes are commonly used to model sequential data, like text and speech.
For instance, if you want to build an application that predicts the next word in
a sentence, you can represent each word in a sentence as a state. The transition
probabilities can be learned from a corpus and represent the probability of
moving from the current word to the next word.
For example, the transition probability from the state ‘San’ to ‘Francisco’ will
be higher than the probability of transitioning to the state ‘Delhi’.
The Hidden Markov Model (HMM) is an extension of the Markov process used to model
phenomena where the states are hidden or latent, but they emit observations. For
instance, in a speech recognition system like a speech-to-text converter, the
states represent the actual text words to predict, but they are not directly
observable (i.e., the states are hidden). Rather, you only observe the speech
(audio) signals corresponding to each word and need to deduce the states using
the observations.
Similarly, in POS tagging, you observe the words in a sentence, but the POS tags
themselves are hidden. Thus, the POS tagging task can be modeled as a Hidden
Markov Model with the hidden states representing POS tags that emit observations,
i.e., words.
The hidden states emit observations with a certain probability. Therefore, Hidden
Markov Model has emission probabilities, which represent the probability that a
particular state emits a given observation. Along with the transition and initial
state probabilities, these emission probabilities are used to model HMMs.
The figure below illustrates the emission and transition probabilities for a
hidden Markov process with three hidden states and four observations.
HMMs are used with a variety of algorithms, including the Baum-Welch algorithm
for training and the Viterbi algorithm for decoding.
The Baum-Welch algorithm is an unsupervised learning algorithm that iteratively
adjusts the probabilities of events occurring in each state to fit the data
better.
The Viterbi algorithm is a dynamic programming algorithm that finds the most
likely sequence of hidden states given a sequence of observable events.
Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm used to determine the
most probable sequence of hidden states in a Hidden Markov Model (HMM) based on
a sequence of observations. It is a widely used algorithm in speech recognition,
natural language processing, and other areas that involve sequential data.
The algorithm works by recursively computing the probability of the most likely
sequence of hidden states that ends in each state for each observation.
At each time step, the algorithm computes the probability of being in each
state and emitting the current observation, based on the probabilities of being
in the previous states and making a transition to the current state.
Assuming we have an HMM with N hidden states and T observations, the Viterbi
algorithm can be summarized as follows:
1. Initialization: At time t=1, we set the probability of the most
likely path ending in state i, for each state i, to the product of the
initial state probability pi_i and the emission probability of the
first observation given state i. This is denoted by:
delta[1,i] = pi_i * b[i,1].
2. Recursion: For each time step t from 2 to T, and for each state i,
we compute the probability of the most likely path ending in state i
at time t by considering all possible paths that could have led to
state i. This probability is given by:
delta[t,i] = max_j(delta[t-1,j] * a[j,i] * b[i,t])
Here, a[j,i] is the probability of transitioning from state j to
state i, and b[i,t] is the probability of observing the t-th
observation given state i.
We also keep track of the most likely previous state that led to
the current state i, which is given by:
psi[t,i] = argmax_j(delta[t-1,j] * a[j,i])
3. Termination: The probability of the most likely path overall is
given by the maximum of the probabilities of the most likely paths
ending in each state at time T. That is, P* = max_i(delta[T,i]).
4. Backtracking: Starting from the state i* that gave the maximum
probability at time T, we recursively follow the psi values back to
time t=1 to obtain the most likely path of hidden states.
The Viterbi algorithm is an efficient and powerful tool that can handle long
sequences of observations using dynamic programming.
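The following is a compact Python sketch of the recursion described above for a toy two-state POS HMM; the states, probabilities, and observation words are invented assumptions, not trained values:

# Toy HMM: hidden states are POS tags, observations are words.
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}                 # initial state probabilities pi_i
trans_p = {"NN": {"NN": 0.3, "VB": 0.7},         # a[j][i] = P(state i | previous state j)
           "VB": {"NN": 0.8, "VB": 0.2}}
emit_p  = {"NN": {"flies": 0.4, "fast": 0.6},    # b[i][o] = P(observation o | state i)
           "VB": {"flies": 0.7, "fast": 0.3}}

def viterbi(observations):
    # Initialization: delta[1, i] = pi_i * b[i, o_1]
    delta = [{s: start_p[s] * emit_p[s].get(observations[0], 1e-12) for s in states}]
    psi = [{}]
    # Recursion: delta[t, i] = max_j delta[t-1, j] * a[j, i] * b[i, o_t]
    for t in range(1, len(observations)):
        delta.append({})
        psi.append({})
        for i in states:
            best_j = max(states, key=lambda j: delta[t - 1][j] * trans_p[j][i])
            delta[t][i] = (delta[t - 1][best_j] * trans_p[best_j][i]
                           * emit_p[i].get(observations[t], 1e-12))
            psi[t][i] = best_j
    # Termination: best final state; Backtracking: follow psi back to t = 1.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][last]

print(viterbi(["flies", "fast"]))   # (['VB', 'NN'], 0.1344) with these toy numbers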
Advantages of the Hidden Markov Model
One of the advantages of HMM is its ability to learn from data.
HMM can be trained on large datasets to learn the probabilities of certain events
occurring in certain states.
For example, HMM can be trained on a corpus of sentences to learn the probability
of a verb following a noun or an adjective.
Applications of the Hidden Markov Model
● Part-of-Speech (POS) Tagging
● Named Entity Recognition (NER)
● Speech Recognition
● Machine Translation
Limitations of the Hidden Markov Model
HMM assumes that the probability of an event occurring in a certain state is
fixed, which may not always be the case in real-world data. Additionally, HMM is
not well-suited for modeling long-term dependencies in language, as it only
considers the immediate past.
There are alternative models to HMM in NLP, including recurrent neural networks
(RNNs) and transformer models like BERT and GPT. These models have shown promising
results in a variety of NLP tasks, but they also have their own limitations and
challenges.
❖ TBL morphology:
Natural Language Processing (NLP) is one of the most rapidly growing areas of
research. The results of morphological analysis and morphological generation
are highly relevant in most NLP applications. Because morphological analysis
is a technique for recognising a word, its result can be employed at a later
stage. With this in mind, this section explains how morphological analysis and
generation can be seen as critical components of several NLP applications such
as spell checkers and machine translation.
Morphological analysis is a field of linguistics that studies the structure of
words. It identifies how a word is produced through the use of morphemes. A
morpheme is a basic unit of the English language. The morpheme is the smallest
element of a word that has grammatical function and meaning. Free morpheme and
bound morpheme are the two types of morphemes. A single free morpheme can
become a complete word.
For instance, a bus, a bicycle, and so forth. A bound morpheme, on the other
hand, cannot stand alone and must be joined to a free morpheme to produce a
word. ing, un, and other bound morphemes are examples.
Inflectional Morphology and Derivational Morphology are the two types of
morphology. Both of these types have their own significance in various areas
related to the Natural Language Processing.
• Morphological Analyzer
In inflected languages, words are formed through morphological processes such
as affixation. For example, by adding the suffix ‘-s’ to the verb ‘to dance’,
we form the third person singular ‘dances’.
A morphological analyser assigns the attributes of a given word by evaluating
what morphological processes the form has undergone. If you give it the word
‘bailaré’ in Spanish, it will tell you it is the first person, singular, simple
future, indicative form of the verb ‘bailar’.
Morphological Parsing
It is the process of determining the morphemes from which a given word is
constructed. Morphemes are the smallest meaningful units and cannot be divided
further. A morpheme can be a stem or an affix: the stem is the root word,
whereas an affix can be a prefix, suffix, or infix. For example -
Unsuccessful → un- (prefix) + success (stem) + -ful (suffix)
The order of morphemes also matters to the morphological parser. To design a
morphological parser we require three things: a lexicon, morphotactics, and
orthographic rules.
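As a hedged illustration of morphological parsing, here is a minimal Python sketch that strips a known prefix and suffix from a word; the affix lists are a tiny assumption standing in for a real lexicon and morphotactics:

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ful", "ness", "ing", "ed", "s"]

def parse(word):
    # Very rough morphological parse: returns (prefix, stem, suffix).
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(parse("unsuccessful"))   # ('un', 'success', 'ful')

A real analyser would also need orthographic rules (e.g. "tries" is "try" + "-s" with a spelling change) and morphotactic constraints on which affixes can combine.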
Types of Morphology:
• Inflectional Morphology
Inflectional morphology involves the modification of words
to indicate grammatical information such as tense, number, gender, case, and
comparison. Unlike derivational morphology, inflectional affixes do not change
the part of speech or the core meaning of a word. Instead, they provide
grammatical details that help us understand the relationships between words in
a sentence.
Inflectional morphology typically adds suffixes to words, although there are a
few cases where prefixes are used. These suffixes are often predictable and
follow specific patterns based on the grammatical category of the word. For
example, the plural marker "-s" is added to nouns to indicate more than one
(e.g., "cat" becomes "cats"), while the past tense marker "-ed" is added to
verbs to indicate a completed action in the past (e.g., "walk" becomes
"walked").
One important characteristic of inflectional morphology is that it is
obligatory in certain grammatical contexts. For instance, in English, verbs
must be inflected for tense to agree with the subject of a sentence. Similarly,
nouns must be inflected for number to match the quantity they represent. These
inflectional markers help us understand the syntactic structure of a sentence
and ensure grammatical accuracy.
Unlike derivational morphology, inflectional morphology does not typically
alter the stress or pronunciation of a word. Instead, it focuses on providing
grammatical information without changing the core meaning or part of speech.
Inflectional affixes are often considered to be more regular and predictable
compared to derivational affixes.
Inflection is the modification of a word to express different grammatical
categories. Inflectional morphology is the study of processes, including
affixation and vowel change, that distinguish word forms in certain
grammatical categories. Inflectional morphology consists of at least five
categories, as described in Language Typology and Syntactic Description:
Grammatical Categories and the Lexicon. As that text explains, derivational
morphology cannot be so easily categorized, because derivation isn't as
predictable as inflection. Examples: cats, men, etc.
• Derivational Morphology:
Derivational morphology refers to the process of creating new words by adding
affixes to a base or root word. These affixes can be prefixes (added at the
beginning of a word), suffixes (added at the end of a word), or infixes (added
within a word). The primary function of derivational morphology is to change
the meaning or part of speech of a word. For example, adding the prefix "un-"
to the word "happy" creates the word "unhappy," which conveys the opposite
meaning.
Derivational morphology allows for the expansion of vocabulary and the creation
of new words. It plays a crucial role in language development and allows
speakers to express nuanced meanings. Additionally, derivational affixes can
change the part of speech of a word. For instance, adding the suffix "-er" to
the verb "teach" transforms it into the noun "teacher."
Furthermore, derivational morphology often involves changes in the phonological
structure of a word. This means that the addition of affixes can alter the
pronunciation and stress patterns of the base word. For example, when the
suffix "-ity" is added to the adjective "able," the stress shifts from the
first syllable to the second, resulting in the word "ability."
Derivational morphology is a productive process, meaning that it can be applied
to a wide range of words and can create numerous new words. It allows for the
formation of complex words and contributes to the richness and flexibility of a
language.
Derivational morphology can also be defined as morphology that creates new
lexemes, either by changing the syntactic category (part of speech) of a base,
by adding substantial non-grammatical meaning, or both. On the one hand,
derivation may be
distinguished from inflectional morphology, which typically does not change
category but rather modifies lexemes to fit into various syntactic contexts;
inflection typically expresses distinctions like number, case, tense, aspect,
person, among others. On the other hand, derivation may be distinguished from
compounding, which also creates new lexemes, but by combining two or more bases
rather than by affixation, reduplication, subtraction, or internal modification
of various sorts. Although the distinctions are generally useful, in practice
applying them is not always easy.
Differences between inflectional and derivational morphology
Derivational and inflectional morphology are two fundamental processes in
language that contribute to the formation and structure of words. While
derivational morphology focuses on creating new words and changing their
meaning or part of speech, inflectional morphology provides grammatical
information without altering the core meaning or part of speech. Both processes
play crucial roles in language development, allowing for the expansion of
vocabulary, expression of nuanced meanings, and maintenance of grammatical
accuracy. Understanding the attributes of derivational and inflectional
morphology enhances our comprehension of how words are formed and how they
function within a language.
APPROACHES TO MORPHOLOGY:
• Morpheme-Based Morphology:
In this approach, words are analyzed as arrangements of morphemes (an
"item-and-arrangement" view).
• Lexeme-Based Morphology:
Lexeme-based morphology usually takes what is called an "item-and-process"
approach. Instead of analyzing a word form as a set of morphemes arranged in
sequence, a word form is said to be the result of applying rules that alter a
word form or stem in order to produce a new one.
• Word-Based Morphology:
Word-based morphology is usually a "word-and-paradigm" approach. The theory
takes paradigms as a central notion. Instead of stating rules to combine
morphemes into word forms or to generate word forms from stems, word-based
morphology states generalizations that hold between the forms of inflectional
paradigms.
Morphological Analysis using Paradigms:
Most NLP systems use simple linguistic theories for morphological analysis.
Words are related to each other by analogical rules and can be categorized
based on the pattern they fit into. This applies both to existing words and to
new ones. Applying a pattern different from the one that has historically been
used gives rise to a new word, for example older replacing elder.
Procedure And Algorithm:
A language expert provides tables of word forms for the entire language; the
roots follow the pattern implicit in the table for generating their word forms.

Algorithm: Forming a Paradigm Table
Purpose: To form a paradigm table from a word-forms table for a root
Input: Root r, word-forms table WFT (with labels for rows and columns)
Output: Paradigm table PT

1. Create an empty table PT of the same dimensionality, size and labels as the
   word-forms table WFT.
2. For every entry w in WFT do:
   if w = r then store (0, 0) in the corresponding position in PT
   else begin
      let i be the position of the first characters of w and r which are different;
      store (size(r) - i + 1, suffix(i, w)) at the corresponding position in PT
   end
3. Return PT
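A rough Python rendering of the algorithm above; the word-forms table for the root "play" is a hypothetical toy example:

def suffix_from(i, w):
    # suffix(i, w): the part of word form w from position i onwards (1-indexed).
    return w[i - 1:]

def form_paradigm_table(root, wft):
    # Build a paradigm table PT from a word-forms table WFT for a given root.
    pt = {}
    for label, w in wft.items():
        if w == root:
            pt[label] = (0, "")        # the notes store (0, 0) for the root itself
        else:
            # Find the first (1-indexed) position where w and the root differ.
            i = 1
            while i <= min(len(w), len(root)) and w[i - 1] == root[i - 1]:
                i += 1
            pt[label] = (len(root) - i + 1, suffix_from(i, w))
    return pt

# Hypothetical word-forms table for the root "play".
wft = {"base": "play", "3sg": "plays", "past": "played", "gerund": "playing"}
print(form_paradigm_table("play", wft))
# {'base': (0, ''), '3sg': (0, 's'), 'past': (0, 'ed'), 'gerund': (0, 'ing')}

The first element of each pair says how many characters to delete from the end of the root, and the second gives the suffix to add; together they let the same paradigm be reapplied to new roots that follow the pattern.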
APPLICATIONS OF MORPHOLOGICAL ANALYSIS :-
Text-to-Speech Synthesis
These days people depend heavily on technology such as computers and mobile
phones for their daily needs. However, people who are unfamiliar with
technology, and people with disabilities, face a lot of difficulty in
interfacing with these devices, which creates the need for text-to-speech
synthesis.
Morphological analysis can be used to reduce the size of the lexicon and also
plays an important role in determining the pronunciation of a homograph. It
also helps in handling schwa deletion. In the case of compound words, the
analyser can be used to segment a compound word into its basic form, and later
this basic root can be recombined using a morphological generator to produce
the result.
Machine Translation
Machine translation mainly helps people who belong to different communities and
want to interact with data present in different languages; machine translation
is one of the prominent solutions for this. For a few languages machine
translation systems have been developed, while for others the work is still
going on.
Without morphological analysis, we would need to store all the word forms,
which would increase the size of the database and take more time to search.
Another benefit of the analyser is that it provides information about the word
such as number and gender; this information can be used in the target language
to generate the correct form of the word.
Spell Checker
● A spell checker is an application used to identify whether a word has
been spelled correctly or not. Spell checker functionality can be
divided into two parts: spell-check error detection and spell-check
error correction. The error detection phase only detects the error,
while the error correction phase also provides suggestions to correct
the errors detected by the detection phase. One more advantage of using
a morphology-based spell checker is that it can handle the named-entity
problem: if a word is not included in the lexicon, it can be added
easily.
Search Engine
A search engine is an application program used to search for a particular
document over the internet using the World Wide Web. We provide an input to the
engine, and it returns results based on that input. Morphological analysis and
generation improve the results of a search engine. Suppose a word is provided
as input but is not present in the lexicon; this will affect the output. In
that case, morphological analysis of that word is done.
Morphology and Syntax
In syntax, the different semantic elements are expressed as separate and
independent words. Morphological and phonological components were defined
autonomously during the 1970s, and later work on syntax showed that syntactic
systems could handle much of the morphology. Researchers using syntax, in
which syntactic and morphological structures can be derived from semantic
representations, must account for word formation using two types of morphemes,
namely lexical morphemes and grammatical morphemes. English words are generally
composed of a stem and an optional set of affixes.
Syntactic processes
In machine translation between two languages, the database of words plays a
significant role. … Another important benefit is that morphological analysis
provides information about the word, such as number.
Morphological facts
We use morphology in our day-to-day language, for example forming plurals by
adding -s to a stem; whether this happens depends on the language. For example:
English: "I eat one melon a day."
Indonesian: "I eat two fruit melon every day."
Indonesian does not use morphological plurals in its wording.
Dependency trees
(Figure: dependency tree of a single sentence from John F. Kennedy's inaugural
speech.)
Morphological Analysis
There are two approaches to morphology: analytic and synthetic.
Analytic principles
Point 1
Forms with the same meaning and the same sound shape in all their occurrences
are instances of the same morpheme.
Point 2
Forms with the same meaning but different sound shapes may be instances of the
same morpheme if their distributions do not overlap.
Point 3
Not all morphemes are segmental.
Ex:
run / ran
speak / spoke
eat / ate
Here the verb and the past tense are not being segmented; rather, the main
point is to look at both the present and past tense forms of these verbs,
because it is the contrast between them that is important.
How to differentiate or break up words: an example from Aztec (Nahuatl),
spoken in Mexico:
ikalwewe 'his big house'
ikalsosol 'his old house'
ikalmeh 'his houses'
ikalci·n 'his little house'
Here ikal- means 'his house', -wewe means 'big', and so on.
❖ Unknown word handling:
Dealing with unknown words in natural language processing (NLP) can be a
challenging task, as it requires a combination of linguistic knowledge and
computational techniques. There are several approaches that can be used to
handle unknown words in NLP, such as:
1. Morphological analysis: This approach involves breaking down a word into its
smaller units, such as prefixes, suffixes, and roots, to identify its meaning.
This technique is particularly useful for languages with rich morphology, such
as German or Turkish.
2. Statistical methods: These methods use statistical models to predict the
meaning of unknown words based on their context. This approach is commonly used
in machine learning-based NLP systems.
3. Domain-specific dictionaries: In some cases, unknown words may be specific
to a particular domain, such as medical terminology or legal jargon. In such
cases, using a domain-specific dictionary can help in identifying the meaning
of the unknown word.
4. Word embeddings: Word embeddings are numerical representations of words that
capture their semantic and syntactic properties. These embeddings can be used
to find similarities between known and unknown words, and thus infer their
meaning.
5. Contextual clues: Sometimes, the context in which an unknown word appears
can provide clues about its meaning. For example, if the unknown word appears
in a sentence about cooking, it is likely related to food or ingredients.
In conclusion, dealing with unknown words in NLP requires a combination of
techniques and approaches. Each approach has its own strengths and limitations,
and the choice of method will depend on the specific task and language being
analyzed.
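In POS tagging specifically, a common trick for unknown words is to guess the tag from the word's shape or suffix. A minimal, hedged sketch follows; the suffix-to-tag heuristics are hand-made assumptions for illustration, not rules from any particular tagger:

import re

def guess_tag(word):
    # Hand-made shape/suffix heuristics for unknown English words (illustrative only).
    if re.fullmatch(r"[0-9][0-9.,]*", word):
        return "CD"        # looks like a number -> cardinal number
    if word[0].isupper():
        return "NNP"       # capitalized -> proper noun
    if word.endswith("ing"):
        return "VBG"       # gerund / present participle
    if word.endswith("ed"):
        return "VBD"       # past tense
    if word.endswith("ly"):
        return "RB"        # adverb
    if word.endswith("s"):
        return "NNS"       # plural noun
    return "NN"            # default: common noun

for w in ["blorfing", "glorped", "quickly", "Zyntrix", "42"]:
    print(w, "->", guess_tag(w))

Statistical taggers implement the same idea probabilistically, estimating P(tag | suffix, capitalization, ...) from the rare words in the training corpus.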
❖ Evaluation metrics:
You can’t train a good model if you don’t have the right evaluation metric, and
you can’t explain your model if you don’t understand the metric you’re using.
So, here’s a list of common metrics which are used for ML and NLP models, along
with their definitions and common applications. I’ve always had a difficult
time remembering these from charts and confusion matrices, so I thought a
verbal explanation might work better.
Accuracy
Denotes the fraction of times the model makes a correct prediction as compared
to the total predictions it makes. Best used when the output variable is
categorical or discrete. For example, how often a sentiment classification
algorithm is correct.
Precision
Evaluates the percentage of predicted positive cases that are actually
positive. Particularly helpful when correctly identifying positives matters
more than overall accuracy. For example, when identifying a cancer that is
prevalent 1% of the time, a model that always outputs "negative" will be 99%
accurate, yet it never makes a useful positive prediction (with no true
positives, its precision is zero by convention).
Recall
The percentage of actual positives that are correctly identified: true
positives versus combined true positives and false negatives. In the example
with a rare cancer that is prevalent 1% of the time, a model that makes totally
random (50/50) predictions will have about 50% accuracy, about 1% precision,
and about 50% recall.
F1 Score
Combines precision and recall to give a single metric — both completeness and
exactness. (2 * Precision * Recall) / (Precision + Recall). Used together with
accuracy, and useful in sequence-labeling tasks, such as entity extraction, and
retrieval-based question answering.
AUC
Area Under Curve; summarizes the trade-off between true positives and false
positives as the prediction threshold is varied. Used to measure the quality of
a model independent of the prediction threshold, and to find the optimal
prediction threshold for a classification task.
MRR
Mean Reciprocal Rank. Evaluates the responses retrieved given their probability
of being correct. The mean of the reciprocal of the ranks of the retrieved
results. Used heavily in all information-retrieval tasks, including article
search and e-commerce search.
MAP
Mean average precision, calculated across each retrieved result. Used in
information-retrieval tasks.
RMSE
Root mean squared error — very common way to capture a model’s performance in a
real-value prediction task. Good way to ask “How far off from the answer am I?”
Calculates the square root of the mean of the squared errors for each data
point. Used in numerical prediction — temperature, stock market price, position
in euclidean space…
MAPE
Mean absolute percentage error. Used when the output variable is a continuous
variable, and is the average of absolute percentage error for each data point.
Often used in conjunction with RMSE and to test the performance of regression
models.
BLEU
The cheese that tastes like it sounds. Also, bilingual evaluation understudy.
Captures the amount of n-gram overlap between the output sentence and the
reference ground truth sentence. Has many variants, and mainly used in machine
translation tasks. Has also been adapted to text to text tasks such as
paraphrase generation and summarization.
METEOR
Precision-based metric to measure quality of generated text. Sort of a more
robust BLEU. Allows synonyms and stemmed words to be matched with the reference
word. Mainly used in machine translation.
ROUGE
Like BLEU and METEOR, compares the quality of generated text to reference text.
Measures recall. Mainly used for summarization tasks where it's important to
evaluate how many words a model can recall (recall = % of true positives versus
combined true positives and false negatives).
Perplexity
Measures how confused an NLP model is, derived from cross-entropy in a next
word prediction task. Used to evaluate language models, and in language-
generation tasks, such as dialog generation.
Metrics specific to tasks:
● Sentiment analysis: Accuracy, precision, recall, F1-score for
positive/negative sentiment.
● Named entity recognition: F1-score for different entity types.
● Question answering: Accuracy, F1-score for answer selection and answer
generation.
❖ Evaluation metrics - Precision:
Definition: Ratio of correctly predicted positive instances to the total number
of predicted positive instances.
Pros: Useful for measuring the model’s ability to identify true positives.
Cons: May be sensitive to class imbalance, favouring models that predict the
majority class.
Precision can be seen as a measure of quality. Higher precision means that an
algorithm returns more relevant results than irrelevant ones.
Precision is defined as:

Precision = True Positives / (True Positives + False Positives)
❖ Evaluation metrics - Recall:
Definition: Ratio of correctly predicted positive instances to the total number
of actual positive instances.
Pros: Useful for measuring the model’s ability to capture all relevant positive
instances.
Cons: May be sensitive to class imbalance, favouring models that predict all
instances as positive.
Recall can be seen as a measure of quantity. High recall means that an
algorithm returns most of the relevant results.
Recall is defined as:

Recall = True Positives / (True Positives + False Negatives)
Recall in this context is also referred to as the true positive rate
or sensitivity, and precision is also referred to as positive predictive
value (PPV).
Other related measures used in classification include true negative rate
and accuracy. True negative rate is also called specificity.
Both precision and recall may be useful in cases where there is imbalanced
data. However, it may be valuable to prioritize one over the other in cases
where the outcome of a false positive or false negative is costly. For example,
in medical diagnosis, a false positive test can lead to unnecessary treatment
and expenses. In this situation, it is useful to value precision over recall.
In other cases, the cost of a false negative is high. For instance, the cost of
a false negative in fraud detection is high, as failing to detect a fraudulent
transaction can result in significant financial loss.
❖ Evaluation metrics - F-measure:
F1 Score: The F1 score is a commonly used evaluation metric that combines
precision and recall into a single value. It provides a balanced measure of
performance by taking into account both false positives and false negatives.
Evaluating an NLP system's F1 score helps determine its overall effectiveness.
A measure that combines precision and recall is the harmonic mean of precision
and recall, the traditional F-measure or balanced F-score:

F1 = 2 · Precision · Recall / (Precision + Recall)
This measure is approximately the average of the two when they are close, and
is more generally the harmonic mean, which, for the case of two numbers,
coincides with the square of the geometric mean divided by the arithmetic mean.
This is also known as the F1 measure, because recall and precision are evenly
weighted; in particular circumstances the F-score can be criticized due to its
bias as an evaluation metric.
It is a special case of the general Fβ measure (for non-negative real values
of β):

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

Two other commonly used F measures are the F2 measure, which weights recall
higher than precision, and the F0.5 measure, which puts more emphasis on
precision than recall.
The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures the
effectiveness of retrieval with respect to a user who attaches β times as much
importance to recall as precision". It is based on van Rijsbergen's
effectiveness measure:

E_α = 1 − 1 / (α/P + (1−α)/R)

the second term being the weighted harmonic mean of precision and recall with
weights (α, 1−α). Their relationship is Fβ = 1 − E_α, where α = 1 / (1 + β²).
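To make the formulas above concrete, here is a small Python sketch that computes precision, recall, F1, and Fβ from raw counts; the counts themselves are illustrative, not results from any real system:

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(p, r, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the F1 score.
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

# Illustrative counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20
p, r = precision(tp, fp), recall(tp, fn)
print(p, r)                   # 0.8 and ~0.667
print(f_beta(p, r))           # F1 ~ 0.727
print(f_beta(p, r, beta=2))   # F2 weights recall more heavily than precision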
❖ Error Analysis:
Error analysis in NLP involves examining the mistakes made by a natural language
processing (NLP) model to gain insights into its performance and identify areas
for improvement.
1. Error Types: Start by categorizing the types of errors your NLP model is
making. These can include syntactic errors (grammar-related), semantic errors
(meaning-related), and pragmatic errors (context-related). Understanding the
nature of errors helps in devising targeted solutions.
Imagine you have a sentiment analysis model that classifies movie reviews as
positive or negative. Here are examples of different error types:
Syntactic Error: "I didn't loved the movie" (incorrect verb form)
Semantic Error: "The movie was incredibly boring" (negative sentiment
misclassified as positive)
Pragmatic Error: "I expected more from this film, but it fell short" (contextual
nuances affecting sentiment classification)
2. Error Identification: Use evaluation metrics such as precision, recall, and
F1 score to quantify the model's performance. Identify instances where the model's
predictions differ from the ground truth or expected outputs.
Calculate precision, recall, and F1 score to quantify model performance. For
instance:
True Positive (TP): Correctly classified positive reviews
False Positive (FP): Negative reviews classified as positive
True Negative (TN): Correctly classified negative reviews
False Negative (FN): Positive reviews classified as negative
3. Error Patterns: Look for recurring patterns in the errors. For example, does
the model struggle with certain types of sentences or specific linguistic
constructs? Identifying these patterns can point towards underlying issues in
the model's architecture or training data.
Construct a confusion matrix to visualize error patterns. For sentiment
analysis the matrix has the following structure (rows are the actual labels,
columns the predicted labels):

                    Predicted Positive   Predicted Negative
Actual Positive            TP                   FN
Actual Negative            FP                   TN
4. Error Analysis Techniques:
• Confusion Matrix: Construct a confusion matrix to visualize the model's
  performance across different classes or categories. This visual
  representation helps in understanding which classes are frequently
  misclassified (a small code sketch follows this list).
• Error Visualization: Use tools like heatmaps or colored annotations to
  highlight errors in a dataset. This makes it easier to spot patterns.
• Error Annotation: Manually annotate a subset of errors to gain qualitative
  insights. This could involve analyzing the context of misclassifications
  or incorrect predictions.
• Error Decomposition: Break down errors into components such as token-level
  errors, entity-level errors, or sentence-level errors. This granularity
  helps in pinpointing where the model is faltering.
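A minimal sketch of building such a confusion matrix from gold labels and model predictions; the label lists below are made-up examples:

from collections import Counter

# Made-up gold labels and model predictions for a sentiment classifier.
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

# confusion[(gold_label, predicted_label)] = count
confusion = Counter(zip(gold, pred))

labels = ["pos", "neg"]
print("gold/pred", *labels)
for g in labels:
    print(g, *[confusion[(g, p)] for p in labels])
# Rows are gold labels, columns are predictions; off-diagonal cells are the errors.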
5. Root Cause Analysis: Once you've identified error patterns, delve deeper into
their root causes. Common causes include insufficient or biased training data,
model complexity, mismatched input-output representations, or linguistic
nuances the model has not learned.
6. Mitigation Strategies: Based on the error analysis, devise strategies to
mitigate the identified issues. This could involve augmenting the training data,
fine-tuning the model architecture, adjusting hyperparameters, or implementing
post-processing techniques.
7. Iterative Improvement: Error analysis is an iterative process. After
implementing mitigations, re-evaluate the model's performance and conduct
another round of error analysis to validate improvements and uncover new
challenges.
Example: Sentiment Analysis
Let's say you have a sentiment analysis model trained to classify movie reviews
as positive or negative. Let’s see how to perform error analysis:
1. Error Types:
- False Positives: Classifying a negative review as positive.
- False Negatives: Classifying a positive review as negative.
- Misclassified Neutral Sentences: In cases where the model struggles to
differentiate between positive and negative sentiment.
2. Error Identification:
Suppose your model misclassifies the review "The movie was great, but the
ending was disappointing" as negative. This would be a false negative error.
3. Error Patterns:
- The model might struggle with negation ("not good" being misclassified as
positive).
- It could also misclassify sarcasm or nuanced expressions ("I loved every
boring minute of it" being misclassified as negative).
4. Error Analysis Techniques:
- Confusion Matrix:
This matrix shows the distribution of correct and incorrect predictions
across positive and negative classes.
- Error Visualization:
Using colored annotations or highlighting, you can visualize
misclassifications in a dataset.
- Error Decomposition:
Break down errors into token-level errors (misclassified words), entity-
level errors (misclassified entities like names or locations), and sentence-
level errors (overall sentiment misclassification).
5. Root Cause Analysis:
- The model might not have been trained on enough diverse examples of negation
or sarcasm.
- The training data might be biased towards certain types of reviews, leading
to skewed predictions.
6. Mitigation Strategies:
- Augment the training data with more examples of negation, sarcasm, and
nuanced expressions.
- Fine-tune the model to handle such linguistic nuances better.
- Implement post-processing techniques to handle edge cases like negation.
7. Iterative Improvement:
After implementing mitigations, re-evaluate the model's performance using
metrics like accuracy, precision, recall, and F1 score. Conduct another round of
error analysis to validate improvements and uncover any new challenges.
In this example, error analysis helps in understanding why the sentiment analysis
model is making mistakes and guides improvements to enhance its accuracy and
robustness.