DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AIML)
ACADEMIC YEAR 2025-2026 (ODD SEMESTER)
BATCH: 2023-2027
YEAR/SEMESTER: III/V
AL3501 - NATURAL LANGUAGE PROCESSING
UNIT I INTRODUCTION
Origins and challenges of NLP – Language Modeling: Grammar-based LM,
Statistical LM - Regular Expressions, Finite-State Automata – English Morphology,
Transducers for lexicon and rules, Tokenization, Detecting and Correcting Spelling
Errors, Minimum Edit Distance
Introduction – Natural Language Processing
Humans communicate through natural language, either as text or speech.
For computers to interact with humans, they must understand the natural
languages that humans use.
Natural language processing is all about making computers learn,
understand, analyze, manipulate and interpret natural (human) languages.
NLP stands for Natural Language Processing, a field at the intersection of
computer science, linguistics, and artificial intelligence.
Processing of natural language is required when you want an intelligent
system such as a robot to act on your instructions, when you want to
hear a decision from a dialogue-based clinical expert system, and so on.
The ability of machines to interpret human language is now at the core of
many applications that we use every day: chatbots, email classification
and spam filters, search engines, grammar checkers, voice assistants, and
social language translators.
The input and output of an NLP system can be speech or written text.
1. Origins and Challenges of NLP
Origins: - Natural Language Processing (NLP) is an interdisciplinary field combining
linguistics, computer science, and artificial intelligence. It aims to enable machines to
understand, interpret, and generate human language.
Historical Development
1950s: Early machine translation (e.g., Georgetown-IBM Experiment).
1960s-70s: Rule-based AI systems like ELIZA and SHRDLU.
1970s–80s: Development of syntax-based parsers and context-free grammars
for sentence structure analysis.
1980s-90s: Rise of statistical NLP and probabilistic models (e.g., Brown Corpus,
HMMs).
2000s: Statistical machine learning methods (e.g., SVMs, MaxEnt models, CRFs)
become standard across NLP tasks.
2010s: Emergence of deep learning, word embeddings (Word2Vec, GloVe), and
neural sequence models.
2020s: Pretrained transformers (e.g., BERT, GPT) revolutionize NLP through
fine-tuning on massive datasets.
2. Key Milestones
- 1950: Turing’s “Computing Machinery and Intelligence”
- 1966: ALPAC Report criticized slow progress in MT
- Early 1990s: IBM researchers pioneer statistical MT (the IBM translation models)
- 2018+: BERT, GPT, and transformer-based NLP dominate
3. Challenges of NLP
a) Ambiguity
Lexical Ambiguity: A single word has multiple meanings. E.g., “bank”
(financial institution vs. river bank).
Syntactic Ambiguity: Sentence can be parsed in multiple valid ways. E.g., “I saw the
man with the telescope.”
Semantic Ambiguity: Multiple interpretations of meaning.
Pragmatic Ambiguity: Context-dependent understanding (sarcasm, politeness, etc.)
b) Resource Scarcity
Lack of annotated corpora for low-resource languages
Inconsistent or unavailable linguistic resources for many regional languages
c) Morphological Complexity
Especially in agglutinative and inflectional languages (e.g., Finnish, Tamil)
Words encode rich grammatical information requiring complex morphological
analysis
d) Contextual Understanding
Understanding real-world context, discourse, intent, and sentiment
Requires world knowledge and commonsense reasoning
e) Domain Adaptation
NLP models trained on general data often fail on specialized domains like legal,
medical, or scientific text
f) Code-Switching and Multi-linguality
Mixing of languages in the same utterance (common in informal texts)
Handling language identification, translation, and tagging becomes difficult
g) Computational Complexity
Parsing and deep learning-based models can be computationally expensive
Requires optimization and efficient model design
h) Evaluation and Benchmarks
Lack of universally accepted benchmarks for many NLP tasks
Difficulty in comparing systems across different languages and domains
4. Modern NLP Trends Addressing Challenges
Pretrained models like BERT, GPT, T5 address data scarcity via transfer learning
Multilingual models (e.g., mBERT, XLM-R) address low-resource challenges
Few-shot and zero-shot learning approaches
Use of massive corpora and unsupervised learning
2. Language Modeling
A. Grammar-Based Language Models
Based on formal grammar rules (e.g., Context-Free Grammars - CFG).
Define possible sentence structures using rules like:
o S -> NP VP
o VP -> V NP
Used in parsing to verify grammatical correctness.
B. Statistical Language Models
Assign probabilities to sequences of words.
N-gram models:
o Unigram: P(w1), Bigram: P(w2|w1), Trigram: P(w3|w1, w2)
Probabilities estimated using corpus frequency counts.
Applications: Speech recognition, text prediction, machine translation.
Introduction to Language Models
Language models (LMs) assign probabilities to sequences of words.
The simplest LMs are n-gram models.
An n-gram is a sequence of 'n' words (e.g., bigram: 'please turn', trigram: 'please turn
your').
These models help estimate the probability of a word based on previous words.
N-grams are simpler compared to neural LMs like RNNs and transformers.
Understanding N-Grams
P(w|h): Probability of word 'w' given history 'h'.
Estimate using frequency counts from a large corpus.
Challenges: Language is creative, new sentences constantly created.
Zero counts common even in large corpora like the web.
Need for Smarter Estimation
Exact sentence probabilities are hard to estimate due to data sparsity.
Use the chain rule: P(w1, w2, ..., wn) = P(w1)P(w2|w1)...P(wn|w1...wn-1).
Use n-gram approximation: P(wn|wn-1, ..., wn-N+1).
Markov assumption: Probability depends only on a limited history.
Bigram: P(wn|wn-1), Trigram: P(wn|wn-2, wn-1).
General N-Gram Equation
Approximate: P(wn|w1:n-1) ≈ P(wn|wn-1) for bigram, or P(wn|wn-2, wn-1) for trigram.
Compute sequence probability by multiplying conditional probabilities of words.
Maximum Likelihood Estimation (MLE)
Estimate probabilities from observed counts in a corpus.
For a bigram, the MLE estimate is P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1).
Example: If 'Chinese' occurs 400 times in 1,000,000 words, P(Chinese) = 0.0004.
MLE gives the parameter values that maximize the likelihood of the training data.
Example: Bigram Calculation
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Augment with <s> and </s> for sentence boundaries.
Compute bigram probabilities like P(am|I) = Count(I am)/Count(I).
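The counts and probabilities for this toy corpus can be reproduced in a few lines of Python. A minimal sketch (variable names are illustrative), using the MLE definition above:
from collections import Counter

# Toy corpus from the example above, with sentence-boundary markers
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = Count(prev, word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("I", "am"))      # Count(I am) = 2, Count(I) = 3 -> 0.667
print(bigram_prob("<s>", "I"))     # Count(<s> I) = 2, Count(<s>) = 3 -> 0.667
print(bigram_prob("Sam", "</s>"))  # Count(Sam </s>) = 1, Count(Sam) = 2 -> 0.5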
Real Data: Berkeley Restaurant Project
Corpus of restaurant queries from Berkeley dialogue system.
Examples: 'Can you tell me about any good Cantonese restaurants close by'.
Bigram grammar is sparse; many zero entries.
Normalize bigram counts using corresponding unigram totals.
Computing Sentence Probability
Example: P(I want Chinese food) = P(I|<s>) * P(want|I) * P(Chinese|want) * P(food|
Chinese).
Multiply relevant bigram probabilities together.
Practical Issues in Language Modeling
In practice, trigrams or 4-grams are more common than bigrams.
Use pseudo-words <s> for sentence start and </s> for sentence end.
Probabilities computed in log form to avoid underflow from multiplying small numbers.
Log probabilities help maintain numeric stability in computations.
3. Regular Expressions
Used to match patterns in text.
Syntax examples:
o .: Matches any character
o *: Zero or more repetitions
o [abc]: Matches a, b, or c
Applications:
o Tokenization
o Pattern matching (e.g., email validation)
o Lexical analysis
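These patterns can be tried directly with Python's re module. A small illustrative sketch (the email pattern is deliberately simplified, not a full validator):
import re

text = "Contact us at support@example.com or sales@example.org."
# Simplified email-like pattern: word characters, '@', domain, dot, letters
emails = re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text)
print(emails)  # ['support@example.com', 'sales@example.org']

# Crude regex tokenization: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", "Don't panic, it's fine!")
print(tokens)  # ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '!']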
4. Finite-State Automata (FSA)
Definition: - A computational model with states and transitions used to recognize patterns.
Components: - States (Q), Alphabet (Σ), Transition function (δ), Start state (q0),
Accept states (F)
Types: - Deterministic FSA (DFA) - Non-deterministic FSA (NFA)
Use Cases: - Recognizing regular languages - Implementing lexical analyzers in
compilers - Token recognition in NLP
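A DFA can be implemented directly as a transition table. A minimal sketch (states and alphabet chosen only for illustration) that accepts binary strings containing an even number of 1s:
# Q = {q0, q1}, alphabet = {'0', '1'}, start state q0, accept states F = {q0}
transitions = {
    ('q0', '0'): 'q0', ('q0', '1'): 'q1',
    ('q1', '0'): 'q1', ('q1', '1'): 'q0',
}

def accepts(string, start='q0', accept=('q0',)):
    state = start
    for symbol in string:
        state = transitions[(state, symbol)]  # transition function delta(state, symbol)
    return state in accept

print(accepts("1010"))  # True: two 1s
print(accepts("111"))   # False: three 1s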
5. English Morphology
Morphology: Study of the structure of words. - Morpheme: Smallest meaningful unit of a
language
Types of Morphemes: - Free morphemes (can stand alone): book, run - Bound
morphemes (must attach): -ing, -ed
Processes: - Inflection: run → runs (plural, tense, etc.) - Derivation: teach → teacher (new
word with a new meaning)
Applications: - Lemmatization - Stemming - Part-of-speech tagging
Introduction to Morphology
Morphology is the domain of linguistics that analyses the internal structure of words. It
involves morphological analysis—exploring how words are structured and composed of
smaller meaningful units called morphemes.
Morphemes and Word Examples
- Words are built up of minimal meaningful elements:
• played = play + ed
• cats = cat + s
• unfriendly = un + friend + ly
Types of Morphemes
- Stems: play, cat, friend
- Affixes: ed, s, un, ly
- Prefixes precede the stem (e.g., un-)
- Suffixes follow the stem (e.g., -ed, -s, -ly)
Stemming
- Stemming identifies the root/stem of a word by removing affixes:
• play = play
• replayed = re + play + ed
• computerized = comput + er + ize + d
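Splits like these can be approximated automatically with a stemmer; exact outputs depend on the algorithm and version. A short sketch using NLTK's Porter stemmer:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "replayed", "cats", "computerized"]:
    print(word, "->", stemmer.stem(word))
# e.g. playing -> play, cats -> cat (stems need not be dictionary words)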
Problems in Morphological Processing
Inflectional Morphology: constructs inflected forms from base forms using
inflectional affixes. It relates different forms of the same word.
• Lemma → Singular / Plural
• cat → cat / cats
• mouse → mouse / mice
Derivational Morphology: constructs new words from roots and derivational
affixes:
• inter + national = international
• international + ize = internationalize
• internationalize + ation = internationalization
Morphological Processes
- Simple processes concatenate morphs one by one:
• disagreements = dis + agree + ment + s
- More complex schemes include morphophonemic changes.
- Allomorphs: Alternative forms of a morpheme (e.g., plural morpheme in 'cats', 'dogs',
'dishes', 'oxen').
Morphological Typology
Typology classifies languages based on their morphological characteristics.
Isolating/Analytic Languages: Few morphemes per word (e.g., Chinese,
Vietnamese).
Synthetic Languages: Multiple morphemes per word:
• Agglutinative: Each morpheme represents one function (e.g., Tamil,
Japanese).
• Fusional: Morphemes may represent multiple functions (e.g., Latin,
German).
Concatenative Languages: Morphs are linked sequentially.
Nonlinear Languages: Morphemes may change tones or templates (e.g.,
Semitic languages).
Analytic vs Synthetic Languages
Analytic Languages: Minimal inflection, rely on word order and auxiliary
words
Synthetic Languages: Use affixes extensively.
Agglutinative Languages: Use distinct prefixes, suffixes, and infixes.
Fusional Languages: Inflectional categories are fused, making root
extraction difficult.
6. Transducers for Lexicon and Rules
Finite-State Transducers (FST): - FSTs map between input and output strings. - Useful in
morphological analysis and generation.
Example: - Input: “run+PAST” - Output: “ran”
Structure: - Combines lexical lookup with morphological rules
7. Tokenization
Definition: - Process of splitting text into tokens (words, punctuation, etc.)
Challenges: - Handling punctuation, contractions, and special symbols - Language-specific
rules
Tools: - NLTK, spaCy, regex-based tokenizers
Applications: - Text preprocessing - Information retrieval - Machine translation
Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence.
This will allow you to work with smaller pieces of text that are still relatively coherent
and meaningful even outside of the context of the rest of the text.
It’s your first step in turning unstructured data into structured data, which is easier to
analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
Tokenizing by Word
Words are like the atoms of natural language. They’re the smallest unit of meaning that
still makes sense on its own.
Tokenizing your text by word allows you to identify words that come up particularly
often.
For example, if you were analyzing a group of job ads, then you might find that the word
“Python” comes up often.
That could suggest high demand for Python knowledge, but you’d need to look
deeper to know more.
Tokenizing by Sentence
When you tokenize by sentence, you can analyze how those words relate to one
another and see more context.
Are there a lot of negative words around the word “Python” because the hiring manager
doesn’t like Python?
Are there more terms from the domain of herpetology than the domain of software
development, suggesting that you may be dealing with an entirely different kind of
python than you were expecting?
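Word and sentence tokenization of this kind can be done with NLTK's tokenizers. A minimal sketch (the example text is illustrative):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Python is in demand. Many job ads mention Python explicitly!"
print(sent_tokenize(text))  # ['Python is in demand.', 'Many job ads mention Python explicitly!']
print(word_tokenize(text))  # ['Python', 'is', 'in', 'demand', '.', 'Many', ...]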
8. Detecting and Correcting Spelling Errors
Types of Errors: - Non-word errors: e.g., “speling” - Real-word errors: e.g., “their” instead
of “there”
Detection Techniques: - Dictionary lookup - Context-based analysis (n-gram models) - Edit
distance filtering
Correction Methods: - Candidate generation + ranking - Noisy channel model
1. Introduction to Spelling Error Detection and Correction
Spelling correction is a core task in Natural Language Processing (NLP).
It involves identifying misspelled words and replacing them with likely correct
alternatives.
Used in search engines, word processors, OCR, and speech recognition post-processing.
2. Types of Spelling Errors
Non-word Errors: Misspelled words not in the vocabulary (e.g., 'recieve' for 'receive').
Real-word Errors: Correct words used incorrectly in context (e.g., 'there' instead of
'their').
3. Detection of Errors
Basic method: Check each word against a dictionary or vocabulary list.
Real-word error detection is more complex and context-dependent.
4. Correction Techniques
Edit Distance (Levenshtein Distance): Measures the number of insertions, deletions, and
substitutions required to transform one word into another.
Candidates within a maximum edit distance are considered possible corrections.
Damerau-Levenshtein Distance: Also considers transpositions (e.g., 'adn' → 'and').
5. Candidate Generation
Generate all strings that are one or two edits away from the error word.
Common edits include: deletion, insertion, substitution, and transposition.
Filter candidates that exist in the vocabulary.
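Candidate generation of this kind can be sketched in a few lines, in the spirit of Peter Norvig's well-known spelling corrector (the vocabulary below is a stand-in for a real word list):
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # All strings one edit away: deletions, transpositions, substitutions, insertions
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

vocabulary = {"and", "an", "add", "receive", "the"}  # illustrative word list
print(edits1("adn") & vocabulary)  # {'and', 'an', 'add'} -- candidates kept after filtering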
6. Candidate Ranking
Use language models to score and rank candidates.
Noisy Channel Model: Estimates the probability of the intended word given the observed
error.
Bayesian formulation: P(correct|error) ∝ P(error|correct) * P(correct).
7. Noisy Channel Model
P(correct word | observed word) ∝ P(observed word | correct word) × P(correct word).
P(observed word | correct word): Modeled using error statistics (e.g., keyboard typos).
P(correct word): Modeled using unigram or bigram language models.
8. Contextual Spelling Correction
Uses surrounding words to resolve real-word errors.
Example: 'He lost his patients.' vs 'He lost his patience.'
N-gram models or neural language models can determine the most probable contextually
appropriate correction.
9. Applications
Word processors (e.g., MS Word spell check).
Search engines (e.g., Google’s 'Did you mean?').
OCR text post-processing.
Autocorrect features in mobile devices.
9. Minimum Edit Distance
Definition: - Minimum number of operations (insertions, deletions, substitutions) required
to transform one string into another.
Operations: - Insert a character - Delete a character - Substitute one character for another
Algorithm: - Levenshtein distance using dynamic programming - Time Complexity: O(m ×
n), where m and n are string lengths
Example: “kitten” → “sitting”
- kitten → sitten (substitute ‘k’ with ‘s’)
- sitten → sittin (substitute ‘e’ with ‘i’)
- sittin → sitting (insert ‘g’)
- Distance = 3
Applications: - Spell checkers - Search engines - Machine translation
Introduction to Minimum Edit Distance
Minimum Edit Distance is a measure of the similarity between two strings.
It quantifies how many edit operations are required to transform one string into another.
Used in spelling correction, DNA sequence comparison, speech recognition, and
plagiarism detection.
Types of Edit Operations
Insertion: Adding a character.
Deletion: Removing a character.
Substitution: Replacing one character with another.
Transposition (in some variants): Swapping adjacent characters (used in Damerau-
Levenshtein Distance).
Levenshtein Distance
Most common form of Minimum Edit Distance.
Allows insertions, deletions, and substitutions.
Each operation typically has a cost of 1.
The distance is the minimal total cost to transform one string into another.
Wagner-Fischer Algorithm
Dynamic programming algorithm to compute Levenshtein Distance.
Uses a matrix to store distances between prefixes of the source and target words.
Builds up solution from base cases (empty strings) to full strings.
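A compact implementation of the Wagner-Fischer algorithm with unit costs (a sketch; a weighted variant would only change the cost constants):
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # d[i][j] = edit distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of the source
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of the target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3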
Example Calculation
Source: kitten, Target: sitting
Operations: kitten → sitten (substitution), sitten → sittin (substitution), sittin → sitting
(insertion)
Minimum Edit Distance = 3
Weighted Edit Distance
Allows different costs for different operations (e.g., substitutions are costlier than
insertions).
Useful when modeling real-world errors like keyboard typos.
Helps prioritize more likely errors based on empirical data.
Applications in NLP
Spelling correction: Find dictionary word with smallest edit distance to misspelled word.
Machine Translation: Evaluate similarity between hypothesis and reference translations.
Speech Recognition: Compare predicted and actual transcriptions.
Information Retrieval: Fuzzy matching of user queries.
Extensions and Variants
Damerau-Levenshtein Distance: Includes transpositions.
Hamming Distance: Only substitutions allowed (equal-length strings).
Jaro-Winkler Distance: Used in record linkage and name matching.
Limitations
Does not consider phonetic similarity or semantics.
Computationally intensive for large-scale comparisons.
Real-word error correction often requires contextual analysis beyond simple edit
distance.
UNIT II: WORD LEVEL ANALYSIS
Unsmoothed N-grams, Evaluating N-grams, Smoothing, Interpolation and Backoff –
Word Classes, Part-of-Speech Tagging, Rule-based, Stochastic and Transformation-
based tagging, Issues in PoS tagging – Hidden Markov and Maximum Entropy
models.
1. Unsmoothed N-grams
• Unsmoothed N-gram models estimate the probability of a word given the preceding
N−1 words.
• They assume a fixed window context and are trained by counting occurrences in a
corpus.
• These models assign zero probability to unseen sequences.
• A bigram model estimates P(wi | wi−1) using: Count(wi−1, wi)/Count(wi−1).
Introduction to N-grams
An N-gram is a contiguous sequence of 'n' items from a given sample of text or speech.
These items can be phonemes, syllables, letters, words, or base pairs according to the
application.
In NLP, N-grams are usually sequences of words.
For example, for the sentence 'I love NLP':
- Unigrams: ['I', 'love', 'NLP']
- Bigrams: ['I love', 'love NLP']
- Trigrams: ['I love NLP']
Definition of Unsmoothed N-grams
Unsmoothed N-gram models estimate the probability of a word given its history
based solely on the frequency counts in a training corpus.
These models do not adjust for unseen word sequences, assigning zero probability
to any sequence not encountered during training.
How Unsmoothed N-gram Models Work
The probability of a word sequence is computed as the product of the conditional
probabilities of each word given its preceding words.
Bigram Model: P(w1, w2, w3) ≈ P(w1) * P(w2|w1) * P(w3|w2)
General N-gram approximation: P(w1, ..., wn) ≈ ∏(P(wi | wi−(n−1), ..., wi−1))
Maximum Likelihood Estimation (MLE)
MLE is used to calculate N-gram probabilities in unsmoothed models.
For bigrams: P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
For trigrams: P(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)
Example:
- Sentence: 'the cat sat'
- P(cat | the) = Count(the cat) / Count(the)
Limitations of Unsmoothed N-gram Models
Zero probabilities: If a word sequence never appears in the training data, it receives
a probability of zero.
Data sparsity: Most word combinations do not appear in a corpus, even large ones.
Lack of generalization: Cannot infer probabilities for unseen combinations.
Poor performance on rare or novel constructions.
Importance of Smoothing
Smoothing techniques address the limitations of unsmoothed models.
They redistribute probability mass from seen to unseen N-grams.
Common smoothing methods: Laplace, Add-k, Good-Turing, Kneser-Ney.
Applications of N-gram Models
Predictive text input and autocomplete.
Speech recognition and synthesis.
Machine translation.
Spell correction and information retrieval.
2. Evaluating N-grams
Evaluating N-gram models is essential to understanding how well they perform in predicting
natural language. Two standard metrics are used: Perplexity and Cross-Entropy.
Perplexity measures how well a model predicts a sample. Lower perplexity indicates
better prediction capability. It is the exponentiation of the average negative log
probability per word.
Cross-Entropy quantifies the average number of bits needed to encode a word in the
test corpus using the language model. It is directly related to perplexity and is
minimized when the model's distribution matches the true distribution.
Evaluating N-grams typically involves training on one corpus and testing on another.
A key challenge is that even well-trained models might struggle with data that is not
represented in the training corpus. This is where smoothing and backoff techniques
are often introduced. Evaluating models helps in tasks like machine translation,
speech recognition, and predictive text input.
• Evaluation uses held-out data to measure model performance.
• Key metrics:
- Perplexity: Measures model surprise; lower is better.
- Cross-Entropy: Average bits required to encode data.
• A good model assigns high probability to likely sequences.
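The relationship between cross-entropy and perplexity can be checked numerically. A small sketch with made-up conditional probabilities for a five-word test sentence:
import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]   # illustrative P(w_i | history) values
cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)   # average bits per word
perplexity = 2 ** cross_entropy                                  # PP = 2^H
print(round(cross_entropy, 2), round(perplexity, 2))
# Lower cross-entropy (hence lower perplexity) means the model is less "surprised" by the data.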
3. Smoothing Techniques
Smoothing techniques address the zero-probability problem in N-gram models by
adjusting the estimated probabilities for unseen events.
Common methods include:
Add-One (Laplace) Smoothing: Adds 1 to every count; simple but can distort
probability distributions.
Add-k Smoothing: Generalizes Laplace by adding a small value k, such as 0.1, to
reduce overcorrection.
Good-Turing Discounting: Adjusts estimates based on the frequency of frequencies
(how many items occur once, twice, etc.).
Kneser-Ney Smoothing: A more refined method that considers the diversity of
contexts a word appears in, often used in real applications.
Interpolation and Backoff: Used to balance probabilities from higher-order and
lower-order N-grams.
These techniques ensure better model performance in real-world scenarios where language is
highly variable.
Smoothing techniques are essential in natural language processing, especially in N-gram
language models, to address the problem of data sparsity. In real-world language data, it's
common to encounter sequences of words (n-grams) that never appear in the training corpus.
Without smoothing, these unseen sequences would be assigned a probability of zero, which
is highly problematic in tasks like speech recognition, machine translation, and text
prediction.
Smoothing reallocates some probability mass from observed n-grams to unseen ones,
ensuring that even novel sequences receive a small but non-zero probability.
Why Smoothing is Important
Handles unseen n-grams: Prevents zero probabilities for valid but unseen word
combinations.
Improves generalization: Helps models perform better on real-world data not present
in the training set.
Enables realistic language generation: Avoids breaking a model’s output due to
impossible word transitions.
Common Smoothing Techniques
Add-One (Laplace) Smoothing
Adds one to every n-gram count.
Simple and intuitive.
Useful for small vocabularies but overestimates rare events in large corpora.
Add-k (Additive) Smoothing
Generalizes Laplace by adding a smaller constant (e.g., 0.1 or 0.01) instead
of 1.
Reduces overestimation but still suffers in sparse datasets.
Good-Turing Smoothing
Adjusts probabilities based on the frequency of frequencies (e.g., how often
n-grams with frequency 1 occur).
Transfers probability mass from seen to unseen events.
Especially useful when many low-frequency events exist.
Kneser-Ney Smoothing
Among the most advanced and effective.
Focuses not just on word frequency but also how diverse the contexts are
in which a word appears.
Used in state-of-the-art language models.
Backoff Models
If a higher-order n-gram is unseen, the model “backs off” to a lower-order
n-gram (e.g., from trigram to bigram).
Helps in estimating probabilities more reliably when data is sparse.
Interpolation Models
Combine multiple n-gram models (e.g., unigram, bigram, trigram) using
weighted averaging.
Each level contributes to the final probability.
Offers smoother transitions across varying contexts.
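Add-one and add-k smoothing are easy to express in code. A minimal sketch on a toy corpus (corpus, counts, and vocabulary are illustrative):
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)            # vocabulary size

def addk_bigram(prev, word, k=1.0):
    # Add-k estimate: (Count(prev, word) + k) / (Count(prev) + k * V); k = 1 is Laplace
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(addk_bigram("the", "cat"))   # seen bigram: (2 + 1) / (3 + 6) = 0.333
print(addk_bigram("cat", "on"))    # unseen bigram: (0 + 1) / (2 + 6) = 0.125, not zero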
3. Interpolation and Backoff
These are probability estimation techniques used when certain N-gram sequences are
missing in the training data.
Backoff: When an N-gram is not found, the model “backs off” to a lower-order model
(e.g., from trigram to bigram).
Interpolation: A weighted average of different N-gram probabilities is used
regardless of whether higher-order data is present.
Example for linear interpolation (the interpolated estimate P̂ combines all three orders, with λ1 + λ2 + λ3 = 1):
P̂(wi | wi-2, wi-1) = λ1 * P(wi | wi-2, wi-1) + λ2 * P(wi | wi-1) + λ3 * P(wi)
Backoff methods are often combined with discounting (like Katz backoff) to ensure the
total probability sums to 1. Interpolation techniques, especially linear interpolation, are
widely used due to their simplicity and robustness in handling sparse data.
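Linear interpolation itself is just a weighted sum. A tiny sketch (the λ weights and probabilities are illustrative; the weights must sum to 1):
def interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # P_hat(w | w-2, w-1) = lambda1 * P_tri + lambda2 * P_bi + lambda3 * P_uni
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# An unseen trigram (MLE = 0) still gets a non-zero interpolated probability:
print(interpolated(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # 0.3*0.2 + 0.1*0.05 = 0.065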
Interpolation and Backoff
• Advanced smoothing approaches.
• Interpolation: Combines different order probabilities.
• Backoff: Uses lower-order model if higher-order fails.
4. Word Classes
Words in a language are grouped into classes based on their syntactic and grammatical
roles. These are often referred to as parts of speech (PoS).
Open classes: Include Nouns, Verbs, Adjectives, Adverbs – capable of expanding.
Closed classes: Include Pronouns, Prepositions, Conjunctions – relatively fixed
and do not admit new entries often.
Identifying word classes helps in tagging, parsing, machine translation, and text-
to-speech systems. The classification can be context-dependent. For instance, the
word “run” can be a noun or a verb depending on its usage in the sentence.
Words grouped by syntactic role (e.g., NN, VB).
Open (nouns) vs. closed classes (prepositions).
Essential for parsing and tagging.
5. Part-of-Speech Tagging
PoS tagging is the process of labeling each word in a sentence with its correct part of
speech. It is crucial for parsing and semantic analysis. PoS tagging helps determine the
grammatical structure of sentences and the syntactic relationships between words.
It relies on lexicons (word-to-tag mappings) and contextual rules or statistical
models.
The Penn Treebank tag set is a standard resource with 45 tags (36 word-class
tags plus punctuation tags) covering nouns, verbs, adjectives, and more.
Challenges in PoS tagging include dealing with ambiguous words, unknown
words, and contextual dependencies.
Part-of-Speech Tagging
• Assigns syntactic category to tokens.
• Uses predefined tagsets like Penn Treebank.
• Challenges include ambiguity and unknown words.
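NLTK ships a pretrained tagger that uses the Penn Treebank tagset. A quick sketch (the example sentence and the tags shown in the comment are illustrative):
import nltk
nltk.download('averaged_perceptron_tagger')   # one-time download of the pretrained tagger
from nltk import pos_tag

print(pos_tag("The cat can run fast".split()))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('can', 'MD'), ('run', 'VB'), ('fast', 'RB')]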
6. Rule-Based PoS Tagging
Rule-based PoS tagging uses hand-crafted rules and dictionaries to assign tags. These
rules rely on linguistic knowledge and word context.
Example rules:
If the word ends in “-ing” and is preceded by “is,” it’s likely a verb.
If a word follows “a” or “the,” it’s likely a noun.
Advantages:
High accuracy when rules are well-defined.
Easy to interpret.
Disadvantages:
Hard to scale and maintain for large corpora.
Requires extensive manual effort.
One well-known rule-based tagger is the Brill tagger, which uses transformation rules to
correct initial tags.
Rule-based PoS Tagging
• Uses hand-crafted rules.
• Example: '-ed' ending suggests VBN.
• Brill Tagger uses transformation rules.
7. Stochastic PoS Tagging
Stochastic or statistical PoS tagging uses probability distributions derived from annotated
corpora.
Approaches:
Unigram tagger: Assigns most frequent tag for each word.
Bigram/trigram tagger: Considers tag(s) of previous word(s).
Hidden Markov Models (HMMs): Use transition and emission probabilities to
determine the most probable sequence of tags.
Advantages:
Robust against ambiguity.
Learns patterns from large data.
Drawbacks:
Requires large annotated datasets.
May fail with rare or unseen words.
Stochastic PoS Tagging
• Learns from annotated corpora.
• Unigram/Bigram/Trigram taggers use frequency.
• Struggles with unknown words and limited context.
8. Transformation-Based Tagging
This hybrid method, also known as Brill Tagging, begins with an initial tagging (usually
from a unigram tagger) and learns correction rules from training data.
Example transformation:
Change tag from “NN” to “VB” if the word follows “to”.
Advantages:
Combines the interpretability of rule-based systems with the adaptability of statistical
methods.
Fewer rules can achieve high accuracy.
Transformation-based tagging is especially useful in resource-limited scenarios where
building large statistical models is impractical.
Transformation-based Tagging
• Hybrid of rule and probabilistic methods.
• Starts with baseline, applies learned rules.
• Good accuracy and interpretability.
9. Issues in PoS Tagging
PoS tagging faces several real-world challenges:
Ambiguity: A word can belong to multiple classes (e.g., “can” as verb or noun).
Unknown words: Neologisms, typos, or foreign terms can be unrecognized.
Domain variation: Words may behave differently across domains (e.g., legal,
medical).
Multiword expressions: Phrases like “in spite of” or “as well as” may be hard to tag
correctly.
Overcoming these issues requires robust models, rich training data, and possibly contextual
embeddings like BERT.
Issues in PoS Tagging
• Ambiguity, unknown words, context dependencies.
• Multiword expressions complicate tagging.
10. Hidden Markov Models (HMMs)
HMMs are probabilistic models used for sequential data like text. In PoS tagging:
States represent tags.
Observations represent words.
Key components:
Transition probabilities: P(tag_i | tag_i-1)
Emission probabilities: P(word | tag)
The Viterbi algorithm is used to find the most likely tag sequence. HMMs assume:
Markov property (future depends only on present state).
Observations depend only on the current state.
While effective, HMMs are being replaced by neural sequence models in modern NLP.
• States = Tags, Observations = Words.
• Uses transition and emission probabilities.
• Viterbi algorithm finds most likely tag sequence.
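The Viterbi recursion can be written compactly. A toy sketch with a two-tag HMM whose probability tables are invented purely for illustration:
def viterbi(words, tags, start_p, trans_p, emit_p):
    # trellis[i][t] = (best probability of ending in tag t at word i, previous tag)
    trellis = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-8), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 1e-8), p)
                for p in tags)
            column[t] = (prob, prev)
        trellis.append(column)
    # Backtrace from the highest-probability final tag
    best = max(tags, key=lambda t: trellis[-1][t][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = trellis[i][best][1]
        path.append(best)
    return list(reversed(path))

tags = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"dogs": 0.5, "run": 0.1}, "VB": {"dogs": 0.05, "run": 0.5}}
print(viterbi(["dogs", "run"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB']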
11. Maximum Entropy Models
MaxEnt models are discriminative classifiers that predict the best tag based on rich features:
Surrounding words
Previous tags
Prefixes/suffixes
Capitalization
They work by selecting the probability distribution with the maximum entropy (i.e., least
assumptions) that still fits the training data.
MaxEnt models can use diverse features and are often preferred over HMMs in PoS tagging
due to their flexibility. They form the basis for logistic regression and are used in
conditional random fields (CRFs) and deep learning-based taggers.
Maximum Entropy Models
• Contextual probabilistic models.
• Feature-based, flexible, no independence assumptions.
• Used in OpenNLP, Stanford NLP.
UNIT III SYNTACTIC ANALYSIS
Context-Free Grammars, Grammar rules for English, Treebanks, Normal Forms for
grammar – Dependency Grammar – Syntactic Parsing, Ambiguity, Dynamic
Programming parsing – Shallow parsing – Probabilistic CFG, Probabilistic CYK,
Probabilistic Lexicalized CFGs - Feature structures, Unification of feature structures.
1. Context-Free Grammars (CFGs)
A CFG is defined by 4-tuple G = (N, Σ, P, S):
N – set of non-terminal symbols
Σ – set of terminal symbols (words)
P – set of production rules (e.g., S → NP VP)
S – start symbol (usually S)
➤ Example:
S → NP VP
NP → Det N
VP → V NP
Det → "the" | "a"
N → "dog" | "cat"
V → "chased" | "saw"
➤ Generated Sentence:
S → NP VP
→ Det N VP
→ "the" "dog" VP
→ "the" "dog" V NP
→ "the" "dog" "chased" Det N
→ "the dog chased a cat"
2. Grammar Rules for English
English syntax is flexible but structured. Rules include:
Recursive rules: Allow nested constructs (e.g., relative clauses).
Coordination: "The boy and the girl".
Complex VP rules: "He might have been seen".
➤ Sample Rule:
S → NP VP
VP → V | V NP | V PP
PP → P NP
NP → Det N | NP PP
3. Treebanks
A treebank is a corpus of syntactically annotated sentences.
➤ Example
(Penn Treebank format):
(S
(NP (DT The) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat)))))
➤ Parse Tree (Text-based):
S
├── NP
│   ├── DT "The"
│   └── NN "cat"
└── VP
    ├── VBD "sat"
    └── PP
        ├── IN "on"
        └── NP
            ├── DT "the"
            └── NN "mat"
4. Normal Forms for Grammar
➤ Chomsky Normal Form (CNF)
All rules in the form: A → BC or A → a
Useful for CYK parsing.
➤ Greibach Normal Form (GNF)
Rules of the form: A → a α (a is a terminal, α is non-terminal string).
5. Dependency Grammar
Dependency parsing focuses on binary grammatical relations between words.
➤ Example:
Sentence: The cat sat on the mat.
Dependencies:
sat → root
cat → subject of sat
on → modifier of sat
mat → object of on
➤ Diagram:
sat (root)
├── cat (subject)
└── on (modifier)
    └── mat (object of on)
6. Syntactic Parsing
Parsing = mapping a sentence into a syntactic structure.
➤ Parsing Approaches:
Top-down parsing: Begins from the start symbol and applies productions to match
the input.
Bottom-up parsing: Starts from input and builds up to the start symbol.
Earley parser: Combines top-down and bottom-up; works on any CFG.
CYK parser: Uses CNF; dynamic programming-based.
7. Ambiguity in Parsing
Multiple parse trees possible for a sentence.
➤ Example:
Sentence: I saw the man with the telescope.
Interpretation 1: I used a telescope to see.
Interpretation 2: The man has the telescope.
Each yields a different parse tree:
PP attached to VP vs. NP.
8. Dynamic Programming Parsing
Used to avoid redundant computations in parsing.
➤ CYK Algorithm:
Bottom-up parser for CNF.
Builds parse table for substrings of increasing length.
Steps:
1. Fill table for single words.
2. Combine substrings using production rules.
3. Final cell (top-right) contains parse if sentence is in grammar.
9. Shallow Parsing (Chunking)
Shallow parsing identifies chunks like NP, VP, etc., but doesn’t build full trees.
➤ Example:
Sentence: He saw a large dog in the park.
Chunks:
[NP He]
[VP saw]
[NP a large dog]
[PP in the park]
Used in:
Named Entity Recognition
Relation Extraction
10. Probabilistic CFG (PCFG)
Extends CFG with rule probabilities.
➤ Example:
S → NP VP [1.0]
VP → V NP [0.6]
VP → V [0.4]
Probability of derivation = product of rule probabilities.
Used to resolve ambiguity by selecting the most likely parse tree.
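NLTK can also parse with rule probabilities attached; its ViterbiParser returns the most probable tree. A sketch extending the earlier toy grammar with illustrative probabilities (each left-hand side's rules must sum to 1):
import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [1.0]
VP -> V NP [0.6]
VP -> V [0.4]
Det -> 'the' [0.5] | 'a' [0.5]
N -> 'dog' [0.5] | 'cat' [0.5]
V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the dog chased a cat".split()):
    print(tree)   # the most probable parse, annotated with its probability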
11. Probabilistic CYK Parsing
Combines PCFG with CYK algorithm.
Parses in CNF.
Each table entry stores best probability + backpointer.
Final parse = tree with highest probability.
Used in syntax-based machine translation and speech systems.
12. Probabilistic Lexicalized CFGs
Lexicalized grammars include headwords with rules.
➤ Example:
VP(head=gave) → V(head=gave) NP(head=book) PP(head=to)
Helps model:
Subcategorization (e.g., verbs that take PP or not).
Lexical preferences.
Used in:
Collins Parser
Parser training from Treebanks
13. Feature Structures
A way to represent additional information like gender, number, tense, etc.
➤ Example:
[Category: Noun
Number: Plural
Gender: Neuter]
Helps enforce constraints (e.g., subject-verb agreement).
14. Unification of Feature Structures
Unification = merging two compatible feature structures.
➤ Example:
FS1: [Number: Singular, Person: 3]
FS2: [Number: Singular]
→ Unified: [Number: Singular, Person: 3]
Conflict: FS1: [Number: Singular], FS2: [Number: Plural] → Unification fails.
Used in formalisms:
HPSG (Head-Driven Phrase Structure Grammar)
LFG (Lexical Functional Grammar)
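NLTK's feature structures implement exactly this unification operation. A short sketch of the example above:
import nltk

fs1 = nltk.FeatStruct(Number='Singular', Person=3)
fs2 = nltk.FeatStruct(Number='Singular')
print(fs1.unify(fs2))    # [ Number = 'Singular', Person = 3 ]

fs3 = nltk.FeatStruct(Number='Plural')
print(fs1.unify(fs3))    # None -- unification fails on the conflicting Number values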
UNIT IV SEMANTICS AND PRAGMATICS
Requirements for representation, First-Order Logic, Description Logics – Syntax-Driven
Semantic analysis, Semantic attachments – Word Senses, Relations between Senses,
Thematic Roles, selectional restrictions – Word Sense Disambiguation, WSD using
Supervised, Dictionary & Thesaurus, Bootstrapping methods – Word Similarity using
Thesaurus and Distributional methods.
1. Requirements for Semantic Representation
Effective semantic representations should be:
Unambiguous – Each expression has a single meaning.
Compositional – Meaning derived systematically from parts.
Expressive – Able to represent quantification, negation, modality, and belief states.
Canonical – Equivalent expressions yield the same logical form.
These representations serve as bridges between syntax and deeper meaning in NLP
applications like QA and inference.
2. First-Order Logic (FOL)
FOL enables precise semantic representation using variables, predicates, and quantifiers.
➤ Example:
“Every student loves some language”
∀x (Student(x) → ∃y (Language(y) ∧ Loves(x,y)))
FOL supports inferences and querying, essential for semantic parsing and knowledge-based
systems.
3. Description Logics
A family of logics optimized for ontology and knowledge-based representation.
Allows class hierarchy modeling (e.g., Language ⊑ HumanCommunication).
Used extensively with standards like OWL in the Semantic Web.
Ideal for controlled domains such as healthcare or legal knowledge representation.
4. Syntax-Driven Semantic Analysis
Also known as Syntax-Directed Translation, this approach attaches semantic rules to CFG
productions.
➤ Example:
S → NP VP
S.sem = VP.sem(NP.sem)
Where semantic functions compose meaning bottom-up along parse trees. Connects syntax
and semantics in a structured way.
5. Semantic Attachments
Attachments augment grammar rules with semantic construction logic.
➤ Example:
VP → V NP
VP.sem = λx. V.sem (x, NP.sem)
This allows dynamic construction of logical forms during parsing.
6. Word Senses
Language is inherently ambiguous: words can have multiple meanings.
Polysemy: Related senses of the same word (e.g., “bank” as a financial institution vs. the building that houses it).
Homonymy: Unrelated senses (e.g., “bat” as a mammal/used in sports).
Resources like WordNet group senses into synsets and encode relationships like synonymy
and hyponymy.
7. Relations Between Word Senses
Common lexical relationships include:
Synonymy (e.g., “big” ≈ “large”)
Antonymy (e.g., “happy” vs. “sad”)
Hyponymy/Hypernymy (e.g., “dog” is a kind of “animal”)
Meronymy (part-whole) (e.g., “wheel” is part of “car”)
These relations are used in search, analogy, and inference tasks.
8. Thematic Roles & Selectional Restrictions
Roles describe how entities participate in events:
Agent: Doer (e.g., “John” in “John ate an apple”)
Theme: Entity affected by the action (e.g., “apple”)
Instrument, Location, Experiencer, etc.
Selectional restrictions place semantic constraints on roles:
“eat ⊢ object [+edible]”
Violations indicate semantic anomalies (e.g., “He ate a car”).
9. Word Sense Disambiguation (WSD)
WSD determines correct sense of a word in context:
➤ Methods:
1. Supervised Learning
o Training classifiers (Naïve Bayes, SVM) on manually annotated corpora.
o Features: Surrounding words, POS tags, collocations.
2. Lesk & Dictionary-Based Methods
o Compare overlap between sense definitions (from WordNet glosses) and
context.
o Extensions use expanded lexical resources or gloss vectors.
3. Bootstrapping & Semi-Supervised Methods
o Start with seed examples, use Yarowsky’s one sense per
collocation/discourse property.
o Expand labeled data via confident predictions.
WSD underpins tasks like translation, information extraction, and semantic search.
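NLTK includes a simplified Lesk implementation that compares context words with WordNet glosses. A brief sketch (the selected synset depends on gloss overlap, so treat the output as indicative):
import nltk
nltk.download('wordnet')
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank", pos="n")
print(sense, "-", sense.definition())   # the WordNet synset Lesk selects for 'bank' here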
10. Word Similarity: Lexicon-Based & Distributional
Methods for quantifying how semantically similar two words are:
Thesaurus-based (e.g., WordNet): use path length or ontology depth.
o Measures: Path similarity, Wu-Palmer, Resnik.
Distributional Methods:
o Build word vectors from large corpora (e.g., Word2Vec, GloVe).
o Similarity measured via cosine distance.
o Captures context-based meaning and conceptual similarity.
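Thesaurus-based measures are available off the shelf; WordNet similarity can be computed in NLTK as follows (a sketch, with the intuitive expectation noted in the comments):
from nltk.corpus import wordnet as wn

dog, cat, car = wn.synset('dog.n.01'), wn.synset('cat.n.01'), wn.synset('car.n.01')
print(dog.path_similarity(cat), dog.wup_similarity(cat))   # relatively high: close in the hierarchy
print(dog.path_similarity(car), dog.wup_similarity(car))   # lower: distant concepts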
Summary (Topic: Key Concepts)
Semantic Representation: Compositional, logical, expressive meaning
FOL & Description Logics: Quantification, inference, ontology modeling
Syntax-Driven Semantics & Attachments: Mapping syntax to semantic forms
Word Senses & Lexical Relations: Polysemy, WordNet synsets and semantic relations
Thematic Roles & Restrictions: Conceptual roles and semantic constraints
WSD Methods: Supervised, dictionary, bootstrapping methods
Word Similarity: Lexicon-based vs. distributional similarity
UNIT V DISCOURSE ANALYSIS AND LEXICAL RESOURCES
Discourse segmentation, Coherence – Reference Phenomena, Anaphora Resolution
using Hobbs and Centering Algorithm – Co-reference Resolution – Resources: Porter
Stemmer, Lemmatizer, Penn Treebank, Brill's Tagger, WordNet, PropBank, FrameNet,
Brown Corpus, British National Corpus (BNC).
1. Discourse Analysis Overview
Discourse is structured communication through written or spoken language, analyzing how
language is used beyond individual sentences.
Discourse Segmentation: Dividing a text into meaningful units (e.g., sentences,
paragraphs, topics).
Applications: Summarization, dialogue systems, sentiment analysis.
Examples of Segmentation:
"John went to the store. He bought apples. Later, he baked a pie."
Segments:
1. John went to the store.
2. He bought apples.
3. Later, he baked a pie.
2. Coherence and Reference Phenomena
Coherence:
Ensures logical flow and connection between parts of discourse.
Local Coherence: Between adjacent segments.
Global Coherence: Across the full discourse.
Devices for Achieving Coherence:
Referential Devices: Pronouns (e.g., He → John).
Lexical Cohesion: Repetition/synonyms (e.g., challenge ↔ difficulties).
Conjunctions/Discourse Markers: However, therefore, etc.
Ellipsis: Omitting repeated elements for brevity.
3. Reference Phenomena
Anaphora: Refers back.
o "John came. He was tired." → He → John.
Cataphora: Refers forward.
o "Before he arrived, John was expected."
Exophora: Refers to something external.
o "Look at that!"
Endophora: Within the discourse (includes anaphora and cataphora).
4. Anaphora Resolution
Identifying what a pronoun or reference refers to.
Hobbs’ Algorithm:
Syntactic approach.
Parses trees to find the most likely antecedent using traversal.
Centering Algorithm:
Focuses on discourse centers.
Forward-looking centers (Cf): New referents.
Backward-looking center (Cb): Previous topic.
Maintains topic continuity for coherence.
5. Coreference Resolution
Identifying all expressions that refer to the same entity.
Types:
Anaphora: "John was late. He apologized."
Cataphora: "When he arrived, John was shocked."
Split Antecedents: "John met Mary. They left."
Exophora: Refers to outside the text.
Resolution Steps:
1. Identify mentions (noun phrases).
2. Extract features (gender, number, etc.).
3. Form candidate chains.
4. Select best match using rules or ML models.
6. NLP Resources
Porter Stemmer: Rule-based suffix stripping algorithm, e.g., "playing" → "play"
Lemmatizer: Reduces words to their dictionary form, context-aware, e.g., "better" → "good"
Penn Treebank: Annotated corpus with POS tags and syntactic parse trees.
Brill's Tagger: Rule-based POS tagger with transformation rules.
WordNet: Lexical database with synsets, hypernyms, hyponyms.
PropBank: Annotated corpus with predicate-argument structures.
FrameNet: Database of semantic frames capturing contextual meaning.
Brown Corpus: Annotated 1M-word English corpus across genres.
British National Corpus (BNC): 100M-word corpus of modern British English.
7. Install NLP Tools with NLTK (Python)
Install with pip:
pip install nltk
Then download the required corpora and models:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('brown')
Discourse Segmentation
Discourse segmentation is the process of dividing a text or discourse into meaningful units
such as sentences, paragraphs, or subtopics. These segments help in understanding the
structure and flow of the discourse. It is crucial in applications like summarization,
sentiment analysis, and dialogue systems.
Coherence and Reference Phenomena
Coherence refers to the logical and semantic connections that make a discourse meaningful
and interpretable.
Reference Phenomena involve the use of linguistic elements (like pronouns) that refer to
other parts of the discourse for maintaining coherence. These include:
Anaphora: Refers to expressions that depend on preceding text for their meaning (e.g.,
"John is here. He is happy").
Cataphora: Refers to expressions that rely on succeeding text (e.g., "Before he
arrived, John was expected").
Anaphora Resolution
Anaphora resolution is the task of determining the antecedent (the word or phrase to which
an anaphor refers).
Two notable algorithms used for this are:
1. Hobbs' Algorithm:
A syntactic approach for pronoun resolution.
Traverses a parse tree of the sentence and searches for antecedents in a systematic
manner.
Steps involve walking up and down the parse tree to find the most likely antecedent.
2. Centering Algorithm:
Focuses on local coherence in a discourse.
Uses centers (entities that a discourse is about) to resolve references:
Forward-looking centers: Potential referents in the next utterance.
Backward-looking centers: Referents carried over from the previous utterance.
Attempts to maintain continuity by preferring references to entities that were
prominent in the previous utterance.
Coreference Resolution
Coreference resolution is the broader task of identifying when multiple expressions in a text
refer to the same entity. It includes:
Resolving anaphors (e.g., "he" in "John said he would come").
Linking nominal references (e.g., "the professor" and "Dr. Smith").
Resources for NLP
1. Porter Stemmer: A widely used algorithm for stemming (reducing words to their root
form).
2. Lemmatizer: Normalizes words to their dictionary form (e.g., "running" → "run").
3. Penn Treebank: A linguistic corpus annotated with part-of-speech tags and syntactic
trees.
4. Brill's Tagger: A rule-based part-of-speech tagger.
5. WordNet: A lexical database of English words grouped into synsets with semantic
relationships.
6. PropBank: A corpus annotated with predicate-argument structures for semantic role
labeling.
7. FrameNet: A database of semantic frames for modeling word meaning and context.
8. Brown Corpus: A standard corpus for linguistic analysis, containing diverse text
genres.
9. British National Corpus (BNC): A 100-million-word corpus of British English,
covering spoken and written text.
DISCOURSE MEANING:
Discourse refers to communication through spoken or written language. It is a formal term
that encompasses the ways in which language is used to convey ideas, engage in dialogue,
and construct meaning within a particular context.
DISCOURSE SEGMENTATION
Discourse segmentation is the process of dividing a discourse into smaller units, such as
sentences, clauses, paragraphs, or topics, to understand its structure and meaning more
effectively. These units, called discourse segments, help in identifying boundaries and
organizing the information for easier interpretation.
Types of Discourse Segmentation
1. Sentence-Level Segmentation: Splits a discourse into individual sentences.
2. Topic-Based Segmentation: Divides the discourse into sections based on the theme
or subject.
3. Dialogue Segmentation: Segments dialogues into conversational turns or speaker
utterances.
Example of Discourse Segmentation
Input Text:
"John went to the store. He bought some apples. Later, he decided to bake a pie."
Segmented Discourse:
1. Sentence 1: "John went to the store."
2. Sentence 2: "He bought some apples."
3. Sentence 3: "Later, he decided to bake a pie."
Each segment conveys a meaningful unit of information, contributing to the overall
coherence of the discourse.
Coherence
Coherence refers to the logical flow and connection between segments in a discourse. A
coherent discourse enables the reader or listener to understand the relationships between
ideas and follow the progression of thought.
Types of Coherence
1. Local Coherence: Connections between adjacent sentences or segments.
2. Global Coherence: Logical consistency across the entire discourse.
Devices for Achieving Coherence
1. Referential Devices
Definition: Use of pronouns or phrases to refer back to earlier elements in a discourse.
Example:
Text: John bought a new car. He is very happy with it.
Explanation:
He refers back to John.
It refers back to a new car.
These pronouns maintain coherence by avoiding repetition.
2. Lexical Cohesion
Definition: Repetition of key terms or use of synonyms to link ideas.
Example:
Text: The project was challenging. Despite the difficulties, the team embraced the
challenge and completed it successfully.
Explanation:
Challenging and challenge are repetitions of a key term.
Difficulties is a near-synonym, reinforcing the theme of overcoming obstacles.
3. Conjunctions and Discourse Markers
Definition: Words or phrases that indicate relationships between sentences or ideas,
such as addition, contrast, cause, or sequence.
Example:
Text: The team worked hard. However, the project was delayed due to unforeseen
circumstances.
Explanation:
However signals a contrast between the team's effort and the project's
delay, maintaining coherence.
4. Ellipsis
Definition: Omitting repeated words or phrases to avoid redundancy while ensuring
meaning remains clear.
Example:
Text: James likes pizza, and Sarah does too.
Explanation:
The phrase likes pizza is omitted after Sarah does, as it is implied.
This omission avoids redundancy while maintaining clarity and coherence.
Example of Coherence
Coherent Text:
"Anna loves gardening. She spends her weekends planting flowers and vegetables. Her
garden is admired by her neighbors."
Coherence Explanation:
o The subject (Anna and her gardening activities) is maintained throughout.
o Pronoun "she" refers back to "Anna", ensuring cohesion.
The sentences logically follow each other, expanding on Anna’s gardening.
Incoherent Text:
"Anna loves gardening. The weather was cold last week. Many people travel during
holidays."
Lack of Coherence:
o There is no logical connection between the sentences.
o The topic shifts abruptly from gardening to weather to holidays.
Example of Topic-Based Segmentation and Coherence
Input Text:
"The weather was sunny and warm, perfect for a picnic. Sarah decided to invite her friends
for a day at the park. They brought sandwiches, drinks, and games. Later, they enjoyed
playing Frisbee and relaxing under the trees."
Segmented and Coherent Text:
1. Segment 1: Setting the Scene
"The weather was sunny and warm, perfect for a picnic."
2. Segment 2: Planning the Activity
"Sarah decided to invite her friends for a day at the park."
3. Segment 3: Enjoying the Picnic
"They brought sandwiches, drinks, and games. Later, they enjoyed playing
Frisbee and relaxing under the trees."
Coherence Explanation:
Each segment builds on the previous one, maintaining a logical flow from the weather
to Sarah’s decision and finally to the picnic activities.
REFERENCE PHENOMENA
Reference phenomena are linguistic mechanisms used to link words or expressions within
a discourse, enabling coherence and meaning. References connect elements in the text
(referred to as referents) to their mentions (referred to as anaphors or cataphors).
Types of Reference Phenomena
1. Anaphora:
Refers back to an earlier expression in the discourse.
Example: "John arrived late. He apologized."
He refers back to John.
2. Cataphora:
Refers forward to an expression introduced later in the discourse.
Example: "Before he spoke, James took a deep breath."
He refers to James, which appears later in the sentence.
3. Exophora:
Refers to something outside the discourse, often in the surrounding physical context.
Example: "Look at that!"
That refers to something visible in the environment.
4. Endophora:
Refers to elements within the discourse. Includes both anaphora and cataphora.
Example
When she saw him, Mary smiled at John.
"She" is anaphoric, referring back to "Mary."
"Him" is cataphoric, referring forward to "John."
Anaphora Resolution
Anaphora resolution involves identifying the antecedent (the referent) for a given anaphor.
1. Hobbs’ Algorithm
Hobbs' algorithm is a syntactic approach for resolving pronominal anaphora. It is
computationally efficient and widely used in natural language processing (NLP).
Steps of Hobbs’ Algorithm:
1. Parse the sentences to generate syntactic trees.
2. Start at the anaphor and traverse the syntactic tree in a breadth-first manner.
3. Check each potential antecedent for compatibility (e.g., gender, number,
syntactic role).
4. Select the most suitable antecedent.
Example:
Input Text: "John left early because he was tired."
o Anaphor: he.
o Antecedent: John.
o Hobbs’ algorithm identifies John as the referent by traversing the tree and
matching syntactic and semantic features.
Centering Algorithm
The Centering Algorithm is a technique used to ensure coherence in a discourse by
analyzing how entities (like people, objects, or ideas) are referenced across sentences. It
helps identify the most important subject (called the center) of a sentence and how it
connects to the next sentence.
Steps in the Centering Algorithm
1. Identify Forward-Looking Centers (Cf):
List all entities in the current sentence.
Example: "John went to the park. He played soccer there."
In Sentence 1: Forward-looking centers = {John, park}.
2. Find the Backward-Looking Center (Cb):
Determine the most important entity from the previous sentence that connects to
the current sentence.
Example:
In Sentence 2: He refers to John, making John the backward-looking center.
3. Rank the Forward-Looking Centers (Cf):
Rank the entities based on their importance (subjects > objects > others).
Typically, subjects (e.g., "John") are ranked higher than objects (e.g., "the
park").
4. Check Transitions for Coherence:
Coherence in discourse depends on how the backward-looking center (Cb)
changes across sentences. The goal is to minimize unnecessary changes.
Example of Centering Algorithm in Action
Text: "Mary went to the store. She bought apples. The apples were fresh."
Step-by-Step Process:
1. Sentence 1: "Mary went to the store."
Forward-looking centers (Cf) = {Mary, store}.
No backward-looking center (Cb) since this is the first sentence.
2. Sentence 2: "She bought apples."
Forward-looking centers (Cf) = {She (Mary), apples}.
Backward-looking center (Cb) = Mary (connected via She).
3. Sentence 3: "The apples were fresh."
Forward-looking centers (Cf) = {apples}.
Backward-looking center (Cb) = apples (connected to the previous mention of
apples).
Coherence:
The backward-looking center shifts smoothly from Mary to apples as the discourse
progresses, maintaining logical flow and coherence.
COREFERENCE RESOLUTION
Coreference Resolution is the task of identifying when different words or phrases in a
text refer to the same entity. It is a key aspect of natural language understanding and is
widely used in applications like chatbots, machine translation, and summarization.
Types of Coreference
1. Anaphora: Refers to an earlier entity in the text.
o Example: "John arrived late. He apologized."
He refers to John.
2. Cataphora: Refers to an entity mentioned later in the text.
o Example: "When he arrived, John was surprised to see everyone waiting."
He refers to John, mentioned later.
3. Split Antecedent Coreference: Refers to a combination of two or more entities.
o Example: "John met Sarah. They had lunch together."
They refers to John and Sarah.
4. Exophora: Refers to something outside the text, often in the physical context.
o Example: "Look at that!"
That refers to something observable in the environment.
Steps in Coreference Resolution
1. Identify Mentions:
o Locate all noun phrases or pronouns that could refer to an entity.
o Example: "John loves his dog."
Mentions: John, his, dog.
2. Extract Features:
o Analyze features like gender, number, semantics, and syntactic position.
o Example:
John (male, singular) matches he (male, singular).
3. Create Candidate Chains:
o Group potential references into chains based on matching features.
o Example:
Chain: {John, he}.
4. Resolve Coreference:
o Use algorithms or rules to determine the correct antecedent for each pronoun
or noun phrase.
Algorithms for Coreference Resolution
1. Rule-Based Methods:
o Use hand-crafted rules based on linguistic knowledge.
o Example:
A pronoun like he typically refers to the nearest preceding male noun
phrase.
2. Machine Learning Approaches:
o Use supervised learning to train models on annotated datasets.
o Features include syntactic roles, word embeddings, and distance between
mentions.
3. Neural Network Models:
o Leverage deep learning to capture complex relationships.
o Example: Transformers like BERT can model contextual relationships in
coreference tasks.
Examples of Coreference
Example 1: Simple Coreference
Text: "Alice visited the park. She enjoyed the walk."
Coreference: Alice ↔ She.
Example 2: Complex Coreference
Text: "The team worked hard. Their efforts paid off in the end."
Coreference: The team ↔ Their.
Example 3: Split Antecedent
Text: "John greeted Mary. They went to the café."
Coreference: John + Mary ↔ They.
Applications of Co-reference Resolution
1. Chatbots and Virtual Assistants:
o Enable contextual understanding to maintain coherent conversations.
o Example: "What is Elon Musk's company? Tell me more about it."
Resolves it to Elon Musk's company.
2. Machine Translation:
o Ensures proper pronoun reference in target languages.
3. Text Summarization:
o Helps group mentions of the same entity to create concise summaries.
4. Question Answering Systems:
o Resolves references in follow-up questions.
o Example: "Who won the race? Was he happy?"
RESOURCES
In the context of Natural Language Processing (NLP), resources refer to tools, datasets,
and frameworks that help in various tasks like text analysis, machine learning, and
language understanding. These resources provide foundational data and methods for
processing, analyzing, and understanding language.
1. Porter Stemmer
What It Is:
A stemming algorithm that reduces words to their base or root form by removing
common suffixes.
Why It’s Used:
Helps in text normalization by treating words like "running" and "runner" as the
same root word, "run."
Example:
o Input: "playing, played, playful"
o Output: "play"
2. Lemmatizer
What It Is:
A tool that converts words to their dictionary base form (lemma), considering the
word’s meaning and context.
Why It’s Used:
Unlike stemming, it ensures that the base form is a real word and more linguistically
accurate.
Example:
o Input: "better"
o Output: "good"
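The difference between stemming and lemmatization is easy to see side by side. A short sketch with NLTK (the pos argument tells the lemmatizer the word class):
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print(stemmer.stem("playing"))                   # 'play'  -- crude suffix stripping
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  -- dictionary-aware lemma
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'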
3. Penn Treebank
What It Is:
A dataset containing annotated syntactic structures and part-of-speech tags for
English text.
Why It’s Used:
Provides a standardized corpus for training and testing NLP models.
Example Annotation:
o Sentence: "The dog barked loudly."
o POS Tags: [The/DT dog/NN barked/VBD loudly/RB]
4. Brill's Tagger
What It Is:
A rule-based part-of-speech (POS) tagger developed by Eric Brill.
Why It’s Used:
Assigns grammatical tags to words based on contextual rules.
Example:
o Sentence: "She runs fast."
o POS Tags: She/PRP runs/VBZ fast/RB
5. WordNet
What It Is:
A lexical database for the English language that groups words into synsets (synonym
sets) and provides relationships like hypernyms (broader terms) and hyponyms
(specific terms).
Why It’s Used:
Supports tasks like word sense disambiguation and semantic analysis.
Example:
o Word: "car"
o Synsets: {automobile, motorcar}
o Hypernym: "vehicle"
6. PropBank
What It Is:
A corpus annotated with information about verb arguments and their roles in
sentences (semantic role labeling).
Why It’s Used:
Enables understanding of sentence semantics by identifying "who did what to
whom."
Example:
o Sentence: "John gave Mary a book."
o Roles: John (giver), Mary (recipient), book (object)
7. FrameNet
What It Is:
A database that groups words into semantic frames—concepts that capture the
relationships between words in context.
Why It’s Used:
Helps in understanding the broader context of sentences.
Example:
o Frame: Commerce_buy
o Sentence: "She bought a car from the dealer."
o Roles: Buyer (She), Goods (car), Seller (dealer)
8. Brown Corpus
What It Is:
One of the first large, annotated corpora of English text, covering diverse genres like
fiction, news, and academic writing.
Why It’s Used:
Provides a standard dataset for linguistic analysis and training language models.
Example:
o Contains over 1 million words categorized into genres like news and editorials.
9. British National Corpus (BNC)
What It Is:
A 100-million-word text corpus that represents modern British English from various
contexts like books, conversations, and broadcasts.
Why It’s Used:
Useful for understanding language trends, dialects, and usage patterns in British
English.
Example:
o Provides frequency counts of word usage, e.g., "colour" is more common than
"color."
Install NLTK (Natural Language Toolkit)
NLTK is one of the most popular libraries in Python for NLP. It includes a variety of
useful modules for tasks like tokenization, stemming, lemmatization, POS tagging, and
more. To install NLTK:
Steps:
1. Open a terminal or command prompt.
2. Run the following command to install NLTK via pip:
pip install nltk
3. After installing NLTK, you can use the following code to download the datasets you
need (like WordNet, Brown Corpus, etc.):
import nltk
nltk.download('all')
4. Alternatively, you can download specific datasets individually:
nltk.download('wordnet')
nltk.download('brown')
Note: The nltk.download('all') command will download all datasets, which can take considerable
time and disk space. If you only need specific resources, you can download them individually as
shown above.