DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AIML)
ACADEMIC YEAR 2025-2026 (ODD SEMESTER)
BATCH: 2023-2027
YEAR/SEMESTER: III/V
AL3501 - NATURAL LANGUAGE PROCESSING
UNIT I INTRODUCTION
Origins and challenges of NLP – Language Modeling: Grammar-based LM,
Statistical LM - Regular Expressions, Finite-State Automata – English Morphology,
Transducers for lexicon and rules, Tokenization, Detecting and Correcting Spelling
Errors, Minimum Edit Distance
Introduction – Natural Language Processing
Humans communicate through natural language, either as text or speech.
For computers to interact with humans, they must understand the natural
languages that humans use.
Natural language processing is all about making computers learn,
understand, analyze, manipulate and interpret natural (human) languages.
NLP stands for Natural Language Processing, a field at the intersection of
computer science, linguistics, and artificial intelligence.
Processing of natural language is required when you want an intelligent
system such as a robot to act on your instructions, when you want to
hear a decision from a dialogue-based clinical expert system, and so on.
The ability of machines to interpret human language is now at the core of
many applications that we use every day: chatbots, email classification
and spam filters, search engines, grammar checkers, voice assistants, and
social language translators.
The input and output of an NLP system can be speech or written text.
1. Origins and Challenges of NLP
Origins: - Natural Language Processing (NLP) is an interdisciplinary field combining
linguistics, computer science, and artificial intelligence. It aims to enable machines to
understand, interpret, and generate human language.
Historical Development
1950s: Early machine translation (e.g., Georgetown-IBM Experiment).
1960s-70s: Rule-based AI systems like ELIZA and SHRDLU.
1970s–80s: Development of syntax-based parsers and context-free grammars
for sentence structure analysis.
1980s-90s: Rise of statistical NLP and probabilistic models (e.g., Brown Corpus,
HMMs).
2000s: Statistical machine learning methods (e.g., SVMs, MaxEnt models, CRFs)
become standard across NLP tasks.
2010s: Emergence of deep learning, word embeddings (Word2Vec, GloVe), and
neural sequence models.
2020s: Pretrained transformers (e.g., BERT, GPT) revolutionize NLP through
fine-tuning on massive datasets.
2. Key Milestones
- 1950: Turing’s “Computing Machinery and Intelligence”
- 1966: ALPAC Report criticized slow progress in MT
- Early 1990s: IBM researchers pioneer statistical MT (the IBM translation models)
- 2018+: BERT, GPT, and transformer-based NLP dominate
3. Challenges of NLP
a) Ambiguity
Lexical Ambiguity: A single word has multiple meanings. E.g., “bank”
(financial institution vs. river bank).
Syntactic Ambiguity: Sentence can be parsed in multiple valid ways. E.g., “I saw the
man with the telescope.”
Semantic Ambiguity: Multiple interpretations of meaning.
Pragmatic Ambiguity: Context-dependent understanding (sarcasm, politeness, etc.)
b) Resource Scarcity
Lack of annotated corpora for low-resource languages
Inconsistent or unavailable linguistic resources for many regional languages
c) Morphological Complexity
Especially in agglutinative and inflectional languages (e.g., Finnish, Tamil)
Words encode rich grammatical information requiring complex morphological
analysis
d) Contextual Understanding
Understanding real-world context, discourse, intent, and sentiment
Requires world knowledge and commonsense reasoning
e) Domain Adaptation
NLP models trained on general data often fail on specialized domains like legal,
medical, or scientific text
f) Code-Switching and Multi-linguality
Mixing of languages in the same utterance (common in informal texts)
Handling language identification, translation, and tagging becomes difficult
g) Computational Complexity
Parsing and deep learning-based models can be computationally expensive
Requires optimization and efficient model design
h) Evaluation and Benchmarks
Lack of universally accepted benchmarks for many NLP tasks
Difficulty in comparing systems across different languages and domains
4. Modern NLP Trends Addressing Challenges
Pretrained models like BERT, GPT, T5 address data scarcity via transfer learning
Multilingual models (e.g., mBERT, XLM-R) address low-resource challenges
Few-shot and zero-shot learning approaches
Use of massive corpora and unsupervised learning
2. Language Modeling
A. Grammar-Based Language Models
Based on formal grammar rules (e.g., Context-Free Grammars - CFG).
Define possible sentence structures using rules like:
o S -> NP VP
o VP -> V NP
Used in parsing to verify grammatical correctness.
B. Statistical Language Models
Assign probabilities to sequences of words.
N-gram models:
o Unigram: P(w1), Bigram: P(w2|w1), Trigram: P(w3|w1, w2)
Probabilities estimated using corpus frequency counts.
Applications: Speech recognition, text prediction, machine translation.
Introduction to Language Models
Language models (LMs) assign probabilities to sequences of words.
The simplest LMs are n-gram models.
An n-gram is a sequence of 'n' words (e.g., bigram: 'please turn', trigram: 'please turn
your').
These models help estimate the probability of a word based on previous words.
N-grams are simpler compared to neural LMs like RNNs and transformers.
Understanding N-Grams
P(w|h): Probability of word 'w' given history 'h'.
Estimate using frequency counts from a large corpus.
Challenges: Language is creative, new sentences constantly created.
Zero counts common even in large corpora like the web.
Need for Smarter Estimation
Exact sentence probabilities are hard to estimate due to data sparsity.
Use the chain rule: P(w1, w2, ..., wn) = P(w1)P(w2|w1)...P(wn|w1...wn-1).
Use n-gram approximation: P(wn|wn-1, ..., wn-N+1).
Markov assumption: Probability depends only on a limited history.
Bigram: P(wn|wn-1), Trigram: P(wn|wn-2, wn-1).
General N-Gram Equation
Approximate: P(wn|w1:n-1) ≈ P(wn|wn-1) for bigram, or P(wn|wn-2, wn-1) for trigram.
Compute sequence probability by multiplying conditional probabilities of words.
Maximum Likelihood Estimation (MLE)
Estimate probabilities from observed counts in a corpus.
For a bigram, the MLE estimate is P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1).
Example: If 'Chinese' occurs 400 times in 1,000,000 words, P(Chinese) = 0.0004.
MLE gives the parameter values that maximize the likelihood of the training data.
Example: Bigram Calculation
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Augment with <s> and </s> for sentence boundaries.
Compute bigram probabilities like P(am|I) = Count(I am)/Count(I).
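The counts and probabilities for this toy corpus can be reproduced in a few lines of Python. A minimal sketch (variable names are illustrative), using the MLE definition above:
from collections import Counter

# Toy corpus from the example above, with sentence-boundary markers
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = Count(prev, word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("I", "am"))      # Count(I am) = 2, Count(I) = 3 -> 0.667
print(bigram_prob("<s>", "I"))     # Count(<s> I) = 2, Count(<s>) = 3 -> 0.667
print(bigram_prob("Sam", "</s>"))  # Count(Sam </s>) = 1, Count(Sam) = 2 -> 0.5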
Real Data: Berkeley Restaurant Project
Corpus of restaurant queries from Berkeley dialogue system.
Examples: 'Can you tell me about any good Cantonese restaurants close by'.
Bigram grammar is sparse; many zero entries.
Normalize bigram counts using corresponding unigram totals.
Computing Sentence Probability
Example: P(I want Chinese food) = P(I|<s>) * P(want|I) * P(Chinese|want) * P(food|
Chinese).
Multiply relevant bigram probabilities together.
Practical Issues in Language Modeling
In practice, trigrams or 4-grams are more common than bigrams.
Use pseudo-words <s> for sentence start and </s> for sentence end.
Probabilities computed in log form to avoid underflow from multiplying small numbers.
Log probabilities help maintain numeric stability in computations.
3. Regular Expressions
Used to match patterns in text.
Syntax examples:
o .: Matches any character
o *: Zero or more repetitions
o [abc]: Matches a, b, or c
Applications:
o Tokenization
o Pattern matching (e.g., email validation)
o Lexical analysis
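These patterns can be tried directly with Python's re module. A small illustrative sketch (the email pattern is deliberately simplified, not a full validator):
import re

text = "Contact us at support@example.com or sales@example.org."
# Simplified email-like pattern: word characters, '@', domain, dot, letters
emails = re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text)
print(emails)  # ['support@example.com', 'sales@example.org']

# Crude regex tokenization: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", "Don't panic, it's fine!")
print(tokens)  # ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '!']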
4. Finite-State Automata (FSA)
Definition: - A computational model with states and transitions used to recognize patterns.
Components: - States (Q), Alphabet (Σ), Transition function (δ), Start state (q0),
Accept states (F)
Types: - Deterministic FSA (DFA) - Non-deterministic FSA (NFA)
Use Cases: - Recognizing regular languages - Implementing lexical analyzers in
compilers - Token recognition in NLP
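A DFA can be implemented directly as a transition table. A minimal sketch (states and alphabet chosen only for illustration) that accepts binary strings containing an even number of 1s:
# Q = {q0, q1}, alphabet = {'0', '1'}, start state q0, accept states F = {q0}
transitions = {
    ('q0', '0'): 'q0', ('q0', '1'): 'q1',
    ('q1', '0'): 'q1', ('q1', '1'): 'q0',
}

def accepts(string, start='q0', accept=('q0',)):
    state = start
    for symbol in string:
        state = transitions[(state, symbol)]  # transition function delta(state, symbol)
    return state in accept

print(accepts("1010"))  # True: two 1s
print(accepts("111"))   # False: three 1s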
5. English Morphology
Morphology: Study of the structure of words. - Morpheme: Smallest meaningful unit of a
language
Types of Morphemes: - Free morphemes (can stand alone): book, run - Bound
morphemes (must attach): -ing, -ed
Processes: - Inflection: run → runs (plural, tense, etc.) - Derivation: teach → teacher (new
word with a new meaning)
Applications: - Lemmatization - Stemming - Part-of-speech tagging
Introduction to Morphology
Morphology is the domain of linguistics that analyses the internal structure of words. It
involves morphological analysis—exploring how words are structured and composed of
smaller meaningful units called morphemes.
Morphemes and Word Examples
- Words are built up of minimal meaningful elements:
• played = play + ed
• cats = cat + s
• unfriendly = un + friend + ly
Types of Morphemes
- Stems: play, cat, friend
- Affixes: ed, s, un, ly
- Prefixes precede the stem (e.g., un-)
- Suffixes follow the stem (e.g., -ed, -s, -ly)
Stemming
- Stemming identifies the root/stem of a word by removing affixes:
• play = play
• replayed = re + play + ed
• computerized = comput + er + ize + d
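Splits like these can be approximated automatically with a stemmer; exact outputs depend on the algorithm and version. A short sketch using NLTK's Porter stemmer:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "replayed", "cats", "computerized"]:
    print(word, "->", stemmer.stem(word))
# e.g. playing -> play, cats -> cat (stems need not be dictionary words)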
Problems in Morphological Processing
Inflectional Morphology: constructs inflected forms from base forms using
inflectional affixes. It relates different forms of the same word.
• Lemma → Singular / Plural
• cat → cat / cats
• mouse → mouse / mice
Derivational Morphology: constructs new words from roots and derivational
affixes:
• inter + national = international
• international + ize = internationalize
• internationalize + ation = internationalization
Morphological Processes
- Simple processes concatenate morphs one by one:
• disagreements = dis + agree + ment + s
- More complex schemes include morphophonemic changes.
- Allomorphs: Alternative forms of a morpheme (e.g., plural morpheme in 'cats', 'dogs',
'dishes', 'oxen').
Morphological Typology
Typology classifies languages based on their morphological characteristics.
Isolating/Analytic Languages: Few morphemes per word (e.g., Chinese,
Vietnamese).
Synthetic Languages: Multiple morphemes per word:
• Agglutinative: Each morpheme represents one function (e.g., Tamil,
Japanese).
• Fusional: Morphemes may represent multiple functions (e.g., Latin,
German).
Concatenative Languages: Morphs are linked sequentially.
Nonlinear Languages: Morphemes may change tones or templates (e.g.,
Semitic languages).
Analytic vs Synthetic Languages
Analytic Languages: Minimal inflection, rely on word order and auxiliary
words
Synthetic Languages: Use affixes extensively.
Agglutinative Languages: Use distinct prefixes, suffixes, and infixes.
Fusional Languages: Inflectional categories are fused, making root
extraction difficult.
6. Transducers for Lexicon and Rules
Finite-State Transducers (FST): - FSTs map between input and output strings. - Useful in
morphological analysis and generation.
Example: - Input: “run+PAST” - Output: “ran”
Structure: - Combines lexical lookup with morphological rules
7. Tokenization
Definition: - Process of splitting text into tokens (words, punctuation, etc.)
Challenges: - Handling punctuation, contractions, and special symbols - Language-specific
rules
Tools: - NLTK, spaCy, regex-based tokenizers
Applications: - Text preprocessing - Information retrieval - Machine translation
Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence.
This will allow you to work with smaller pieces of text that are still relatively coherent
and meaningful even outside of the context of the rest of the text.
It’s your first step in turning unstructured data into structured data, which is easier to
analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
Tokenizing by Word
Words are like the atoms of natural language. They’re the smallest unit of meaning that
still makes sense on its own.
Tokenizing your text by word allows you to identify words that come up particularly
often.
For example, if you were analyzing a group of job ads, then you might find that the word
“Python” comes up often.
That could suggest high demand for Python knowledge, but you’d need to look
deeper to know more.
Tokenizing by Sentence
When you tokenize by sentence, you can analyze how those words relate to one
another and see more context.
Are there a lot of negative words around the word “Python” because the hiring manager
doesn’t like Python?
Are there more terms from the domain of herpetology than the domain of software
development, suggesting that you may be dealing with an entirely different kind of
python than you were expecting?
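Word and sentence tokenization of this kind can be done with NLTK's tokenizers. A minimal sketch (the example text is illustrative):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Python is in demand. Many job ads mention Python explicitly!"
print(sent_tokenize(text))  # ['Python is in demand.', 'Many job ads mention Python explicitly!']
print(word_tokenize(text))  # ['Python', 'is', 'in', 'demand', '.', 'Many', ...]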
8. Detecting and Correcting Spelling Errors
Types of Errors: - Non-word errors: e.g., “speling” - Real-word errors: e.g., “their” instead
of “there”
Detection Techniques: - Dictionary lookup - Context-based analysis (n-gram models) - Edit
distance filtering
Correction Methods: - Candidate generation + ranking - Noisy channel model
1. Introduction to Spelling Error Detection and Correction
Spelling correction is a core task in Natural Language Processing (NLP).
It involves identifying misspelled words and replacing them with likely correct
alternatives.
Used in search engines, word processors, OCR, and speech recognition post-processing.
2. Types of Spelling Errors
Non-word Errors: Misspelled words not in the vocabulary (e.g., 'recieve' for 'receive').
Real-word Errors: Correct words used incorrectly in context (e.g., 'there' instead of
'their').
3. Detection of Errors
Basic method: Check each word against a dictionary or vocabulary list.
Real-word error detection is more complex and context-dependent.
4. Correction Techniques
Edit Distance (Levenshtein Distance): Measures the number of insertions, deletions, and
substitutions required to transform one word into another.
Candidates within a maximum edit distance are considered possible corrections.
Damerau-Levenshtein Distance: Also considers transpositions (e.g., 'adn' → 'and').
5. Candidate Generation
Generate all strings that are one or two edits away from the error word.
Common edits include: deletion, insertion, substitution, and transposition.
Filter candidates that exist in the vocabulary.
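Candidate generation of this kind can be sketched in a few lines, in the spirit of Peter Norvig's well-known spelling corrector (the vocabulary below is a stand-in for a real word list):
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # All strings one edit away: deletions, transpositions, substitutions, insertions
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

vocabulary = {"and", "an", "add", "receive", "the"}  # illustrative word list
print(edits1("adn") & vocabulary)  # {'and', 'an', 'add'} -- candidates kept after filtering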
6. Candidate Ranking
Use language models to score and rank candidates.
Noisy Channel Model: Estimates the probability of the intended word given the observed
error.
Bayesian formulation: P(correct|error) ∝ P(error|correct) * P(correct).
7. Noisy Channel Model
P(correct word | observed word) ∝ P(observed word | correct word) × P(correct word).
P(observed word | correct word): Modeled using error statistics (e.g., keyboard typos).
P(correct word): Modeled using unigram or bigram language models.
8. Contextual Spelling Correction
Uses surrounding words to resolve real-word errors.
Example: 'He lost his patients.' vs 'He lost his patience.'
N-gram models or neural language models can determine the most probable contextually
appropriate correction.
9. Applications
Word processors (e.g., MS Word spell check).
Search engines (e.g., Google’s 'Did you mean?').
OCR text post-processing.
Autocorrect features in mobile devices.
9. Minimum Edit Distance
Definition: - Minimum number of operations (insertions, deletions, substitutions) required
to transform one string into another.
Operations: - Insert a character - Delete a character - Substitute one character for another
Algorithm: - Levenshtein distance using dynamic programming - Time Complexity: O(m ×
n), where m and n are string lengths
Example: “kitten” → “sitting”
- kitten → sitten (substitute ‘k’ with ‘s’)
- sitten → sittin (substitute ‘e’ with ‘i’)
- sittin → sitting (insert ‘g’)
- Distance = 3
Applications: - Spell checkers - Search engines - Machine translation
Introduction to Minimum Edit Distance
Minimum Edit Distance is a measure of the similarity between two strings.
It quantifies how many edit operations are required to transform one string into another.
Used in spelling correction, DNA sequence comparison, speech recognition, and
plagiarism detection.
Types of Edit Operations
Insertion: Adding a character.
Deletion: Removing a character.
Substitution: Replacing one character with another.
Transposition (in some variants): Swapping adjacent characters (used in Damerau-
Levenshtein Distance).
Levenshtein Distance
Most common form of Minimum Edit Distance.
Allows insertions, deletions, and substitutions.
Each operation typically has a cost of 1.
The distance is the minimal total cost to transform one string into another.
Wagner-Fischer Algorithm
Dynamic programming algorithm to compute Levenshtein Distance.
Uses a matrix to store distances between prefixes of the source and target words.
Builds up solution from base cases (empty strings) to full strings.
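A compact implementation of the Wagner-Fischer algorithm with unit costs (a sketch; a weighted variant would only change the cost constants):
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # d[i][j] = edit distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of the source
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of the target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3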
Example Calculation
Source: kitten, Target: sitting
Operations: kitten → sitten (substitution), sitten → sittin (substitution), sittin → sitting
(insertion)
Minimum Edit Distance = 3
Weighted Edit Distance
Allows different costs for different operations (e.g., substitutions are costlier than
insertions).
Useful when modeling real-world errors like keyboard typos.
Helps prioritize more likely errors based on empirical data.
Applications in NLP
Spelling correction: Find dictionary word with smallest edit distance to misspelled word.
Machine Translation: Evaluate similarity between hypothesis and reference translations.
Speech Recognition: Compare predicted and actual transcriptions.
Information Retrieval: Fuzzy matching of user queries.
Extensions and Variants
Damerau-Levenshtein Distance: Includes transpositions.
Hamming Distance: Only substitutions allowed (equal-length strings).
Jaro-Winkler Distance: Used in record linkage and name matching.
Limitations
Does not consider phonetic similarity or semantics.
Computationally intensive for large-scale comparisons.
Real-word error correction often requires contextual analysis beyond simple edit
distance.
UNIT II: WORD LEVEL ANALYSIS
Unsmoothed N-grams, Evaluating N-grams, Smoothing, Interpolation and Backoff –
Word Classes, Part-of-Speech Tagging, Rule-based, Stochastic and Transformation-
based tagging, Issues in PoS tagging – Hidden Markov and Maximum Entropy
models.
1. Unsmoothed N-grams
• Unsmoothed N-gram models estimate the probability of a word given the preceding
N−1 words.
• They assume a fixed window context and are trained by counting occurrences in a
corpus.
• These models assign zero probability to unseen sequences.
• A bigram model estimates P(wi | wi−1) using: Count(wi−1, wi)/Count(wi−1).
Introduction to N-grams
An N-gram is a contiguous sequence of 'n' items from a given sample of text or speech.
These items can be phonemes, syllables, letters, words, or base pairs according to the
application.
In NLP, N-grams are usually sequences of words.
For example, for the sentence 'I love NLP':
- Unigrams: ['I', 'love', 'NLP']
- Bigrams: ['I love', 'love NLP']
- Trigrams: ['I love NLP']
Definition of Unsmoothed N-grams
Unsmoothed N-gram models estimate the probability of a word given its history
based solely on the frequency counts in a training corpus.
These models do not adjust for unseen word sequences, assigning zero probability
to any sequence not encountered during training.
How Unsmoothed N-gram Models Work
The probability of a word sequence is computed as the product of the conditional
probabilities of each word given its preceding words.
Bigram Model: P(w1, w2, w3) ≈ P(w1) * P(w2|w1) * P(w3|w2)
General N-gram approximation: P(w1, ..., wn) ≈ ∏(P(wi | wi−(n−1), ..., wi−1))
Maximum Likelihood Estimation (MLE)
MLE is used to calculate N-gram probabilities in unsmoothed models.
For bigrams: P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
For trigrams: P(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)
Example:
- Sentence: 'the cat sat'
- P(cat | the) = Count(the cat) / Count(the)
Limitations of Unsmoothed N-gram Models
Zero probabilities: If a word sequence never appears in the training data, it receives
a probability of zero.
Data sparsity: Most word combinations do not appear in a corpus, even large ones.
Lack of generalization: Cannot infer probabilities for unseen combinations.
Poor performance on rare or novel constructions.
Importance of Smoothing
Smoothing techniques address the limitations of unsmoothed models.
They redistribute probability mass from seen to unseen N-grams.
Common smoothing methods: Laplace, Add-k, Good-Turing, Kneser-Ney.
Applications of N-gram Models
Predictive text input and autocomplete.
Speech recognition and synthesis.
Machine translation.
Spell correction and information retrieval.
2. Evaluating N-grams
Evaluating N-gram models is essential to understanding how well they perform in predicting
natural language. Two standard metrics are used: Perplexity and Cross-Entropy.
Perplexity measures how well a model predicts a sample. Lower perplexity indicates
better prediction capability. It is the exponentiation of the average negative log
probability per word.
Cross-Entropy quantifies the average number of bits needed to encode a word in the
test corpus using the language model. It is directly related to perplexity and is
minimized when the model's distribution matches the true distribution.
Evaluating N-grams typically involves training on one corpus and testing on another.
A key challenge is that even well-trained models might struggle with data that is not
represented in the training corpus. This is where smoothing and backoff techniques
are often introduced. Evaluating models helps in tasks like machine translation,
speech recognition, and predictive text input.
• Evaluation uses held-out data to measure model performance.
• Key metrics:
- Perplexity: Measures model surprise; lower is better.
- Cross-Entropy: Average bits required to encode data.
• A good model assigns high probability to likely sequences.
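The relationship between cross-entropy and perplexity can be checked numerically. A small sketch with made-up conditional probabilities for a five-word test sentence:
import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]   # illustrative P(w_i | history) values
cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)   # average bits per word
perplexity = 2 ** cross_entropy                                  # PP = 2^H
print(round(cross_entropy, 2), round(perplexity, 2))
# Lower cross-entropy (hence lower perplexity) means the model is less "surprised" by the data.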
3. Smoothing Techniques
Smoothing techniques address the zero-probability problem in N-gram models by
adjusting the estimated probabilities for unseen events.
Common methods include:
Add-One (Laplace) Smoothing: Adds 1 to every count; simple but can distort
probability distributions.
Add-k Smoothing: Generalizes Laplace by adding a small value k, such as 0.1, to
reduce overcorrection.
Good-Turing Discounting: Adjusts estimates based on the frequency of frequencies
(how many items occur once, twice, etc.).
Kneser-Ney Smoothing: A more refined method that considers the diversity of
contexts a word appears in, often used in real applications.
Interpolation and Backoff: Used to balance probabilities from higher-order and
lower-order N-grams.
These techniques ensure better model performance in real-world scenarios where language is
highly variable.
Smoothing techniques are essential in natural language processing, especially in N-gram
language models, to address the problem of data sparsity. In real-world language data, it's
common to encounter sequences of words (n-grams) that never appear in the training corpus.
Without smoothing, these unseen sequences would be assigned a probability of zero, which
is highly problematic in tasks like speech recognition, machine translation, and text
prediction.
Smoothing reallocates some probability mass from observed n-grams to unseen ones,
ensuring that even novel sequences receive a small but non-zero probability.
Why Smoothing is Important
Handles unseen n-grams: Prevents zero probabilities for valid but unseen word
combinations.
Improves generalization: Helps models perform better on real-world data not present
in the training set.
Enables realistic language generation: Avoids breaking a model’s output due to
impossible word transitions.
Common Smoothing Techniques
Add-One (Laplace) Smoothing
Adds one to every n-gram count.
Simple and intuitive.
Useful for small vocabularies but overestimates rare events in large corpora.
Add-k (Additive) Smoothing
Generalizes Laplace by adding a smaller constant (e.g., 0.1 or 0.01) instead
of 1.
Reduces overestimation but still suffers in sparse datasets.
Good-Turing Smoothing
Adjusts probabilities based on the frequency of frequencies (e.g., how often
n-grams with frequency 1 occur).
Transfers probability mass from seen to unseen events.
Especially useful when many low-frequency events exist.
Kneser-Ney Smoothing
Among the most advanced and effective.
Focuses not just on word frequency but also how diverse the contexts are
in which a word appears.
Used in state-of-the-art language models.
Backoff Models
If a higher-order n-gram is unseen, the model “backs off” to a lower-order
n-gram (e.g., from trigram to bigram).
Helps in estimating probabilities more reliably when data is sparse.
Interpolation Models
Combine multiple n-gram models (e.g., unigram, bigram, trigram) using
weighted averaging.
Each level contributes to the final probability.
Offers smoother transitions across varying contexts.
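Add-one and add-k smoothing are easy to express in code. A minimal sketch on a toy corpus (corpus, counts, and vocabulary are illustrative):
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)            # vocabulary size

def addk_bigram(prev, word, k=1.0):
    # Add-k estimate: (Count(prev, word) + k) / (Count(prev) + k * V); k = 1 is Laplace
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(addk_bigram("the", "cat"))   # seen bigram: (2 + 1) / (3 + 6) = 0.333
print(addk_bigram("cat", "on"))    # unseen bigram: (0 + 1) / (2 + 6) = 0.125, not zero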
3. Interpolation and Backoff
These are probability estimation techniques used when certain N-gram sequences are
missing in the training data.
Backoff: When an N-gram is not found, the model “backs off” to a lower-order model
(e.g., from trigram to bigram).
Interpolation: A weighted average of different N-gram probabilities is used
regardless of whether higher-order data is present.
Example for linear interpolation (the interpolated estimate P̂ combines all three orders, with λ1 + λ2 + λ3 = 1):
P̂(wi | wi-2, wi-1) = λ1 * P(wi | wi-2, wi-1) + λ2 * P(wi | wi-1) + λ3 * P(wi)
Backoff methods are often combined with discounting (like Katz backoff) to ensure the
total probability sums to 1. Interpolation techniques, especially linear interpolation, are
widely used due to their simplicity and robustness in handling sparse data.
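Linear interpolation itself is just a weighted sum. A tiny sketch (the λ weights and probabilities are illustrative; the weights must sum to 1):
def interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # P_hat(w | w-2, w-1) = lambda1 * P_tri + lambda2 * P_bi + lambda3 * P_uni
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# An unseen trigram (MLE = 0) still gets a non-zero interpolated probability:
print(interpolated(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # 0.3*0.2 + 0.1*0.05 = 0.065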
Interpolation and Backoff
• Advanced smoothing approaches.
• Interpolation: Combines different order probabilities.
• Backoff: Uses lower-order model if higher-order fails.
4. Word Classes
Words in a language are grouped into classes based on their syntactic and grammatical
roles. These are often referred to as parts of speech (PoS).
Open classes: Include Nouns, Verbs, Adjectives, Adverbs – capable of expanding.
Closed classes: Include Pronouns, Prepositions, Conjunctions – relatively fixed
and do not admit new entries often.
Identifying word classes helps in tagging, parsing, machine translation, and text-
to-speech systems. The classification can be context-dependent. For instance, the
word “run” can be a noun or a verb depending on its usage in the sentence.
Words grouped by syntactic role (e.g., NN, VB).
Open (nouns) vs. closed classes (prepositions).
Essential for parsing and tagging.
5. Part-of-Speech Tagging
PoS tagging is the process of labeling each word in a sentence with its correct part of
speech. It is crucial for parsing and semantic analysis. PoS tagging helps determine the
grammatical structure of sentences and the syntactic relationships between words.
It relies on lexicons (word-to-tag mappings) and contextual rules or statistical
models.
The Penn Treebank tag set is a standard resource with 45 tags (36 word-class
tags plus punctuation tags) covering nouns, verbs, adjectives, and more.
Challenges in PoS tagging include dealing with ambiguous words, unknown
words, and contextual dependencies.
Part-of-Speech Tagging
• Assigns syntactic category to tokens.
• Uses predefined tagsets like Penn Treebank.
• Challenges include ambiguity and unknown words.
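NLTK ships a pretrained tagger that uses the Penn Treebank tagset. A quick sketch (the example sentence and the tags shown in the comment are illustrative):
import nltk
nltk.download('averaged_perceptron_tagger')   # one-time download of the pretrained tagger
from nltk import pos_tag

print(pos_tag("The cat can run fast".split()))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('can', 'MD'), ('run', 'VB'), ('fast', 'RB')]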
6. Rule-Based PoS Tagging
Rule-based PoS tagging uses hand-crafted rules and dictionaries to assign tags. These
rules rely on linguistic knowledge and word context.
Example rules:
If the word ends in “-ing” and is preceded by “is,” it’s likely a verb.
If a word follows “a” or “the,” it’s likely a noun.
Advantages:
High accuracy when rules are well-defined.
Easy to interpret.
Disadvantages:
Hard to scale and maintain for large corpora.
Requires extensive manual effort.
One well-known rule-based tagger is the Brill tagger, which uses transformation rules to
correct initial tags.
Rule-based PoS Tagging
• Uses hand-crafted rules.
• Example: '-ed' ending suggests VBN.
• Brill Tagger uses transformation rules.
7. Stochastic PoS Tagging
Stochastic or statistical PoS tagging uses probability distributions derived from annotated
corpora.
Approaches:
Unigram tagger: Assigns most frequent tag for each word.
Bigram/trigram tagger: Considers tag(s) of previous word(s).
Hidden Markov Models (HMMs): Use transition and emission probabilities to
determine the most probable sequence of tags.
Advantages:
Robust against ambiguity.
Learns patterns from large data.
Drawbacks:
Requires large annotated datasets.
May fail with rare or unseen words.
Stochastic PoS Tagging
• Learns from annotated corpora.
• Unigram/Bigram/Trigram taggers use frequency.
• Struggles with unknown words and limited context.
8. Transformation-Based Tagging
This hybrid method, also known as Brill Tagging, begins with an initial tagging (usually
from a unigram tagger) and learns correction rules from training data.
Example transformation:
Change tag from “NN” to “VB” if the word follows “to”.
Advantages:
Combines the interpretability of rule-based systems with the adaptability of statistical
methods.
Fewer rules can achieve high accuracy.
Transformation-based tagging is especially useful in resource-limited scenarios where
building large statistical models is impractical.
Transformation-based Tagging
• Hybrid of rule and probabilistic methods.
• Starts with baseline, applies learned rules.
• Good accuracy and interpretability.
9. Issues in PoS Tagging
PoS tagging faces several real-world challenges:
Ambiguity: A word can belong to multiple classes (e.g., “can” as verb or noun).
Unknown words: Neologisms, typos, or foreign terms can be unrecognized.
Domain variation: Words may behave differently across domains (e.g., legal,
medical).
Multiword expressions: Phrases like “in spite of” or “as well as” may be hard to tag
correctly.
Overcoming these issues requires robust models, rich training data, and possibly contextual
embeddings like BERT.
Issues in PoS Tagging
• Ambiguity, unknown words, context dependencies.
• Multiword expressions complicate tagging.
10. Hidden Markov Models (HMMs)
HMMs are probabilistic models used for sequential data like text. In PoS tagging:
States represent tags.
Observations represent words.
Key components:
Transition probabilities: P(tag_i | tag_i-1)
Emission probabilities: P(word | tag)
The Viterbi algorithm is used to find the most likely tag sequence. HMMs assume:
Markov property (future depends only on present state).
Observations depend only on the current state.
While effective, HMMs are being replaced by neural sequence models in modern NLP.
• States = Tags, Observations = Words.
• Uses transition and emission probabilities.
• Viterbi algorithm finds most likely tag sequence.
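The Viterbi recursion can be written compactly. A toy sketch with a two-tag HMM whose probability tables are invented purely for illustration:
def viterbi(words, tags, start_p, trans_p, emit_p):
    # trellis[i][t] = (best probability of ending in tag t at word i, previous tag)
    trellis = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-8), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 1e-8), p)
                for p in tags)
            column[t] = (prob, prev)
        trellis.append(column)
    # Backtrace from the highest-probability final tag
    best = max(tags, key=lambda t: trellis[-1][t][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = trellis[i][best][1]
        path.append(best)
    return list(reversed(path))

tags = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"dogs": 0.5, "run": 0.1}, "VB": {"dogs": 0.05, "run": 0.5}}
print(viterbi(["dogs", "run"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB']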
11. Maximum Entropy Models
MaxEnt models are discriminative classifiers that predict the best tag based on rich features:
Surrounding words
Previous tags
Prefixes/suffixes
Capitalization
They work by selecting the probability distribution with the maximum entropy (i.e., least
assumptions) that still fits the training data.
MaxEnt models can use diverse features and are often preferred over HMMs in PoS tagging
due to their flexibility. They form the basis for logistic regression and are used in
conditional random fields (CRFs) and deep learning-based taggers.
Maximum Entropy Models
• Contextual probabilistic models.
• Feature-based, flexible, no independence assumptions.
• Used in OpenNLP, Stanford NLP.
UNIT III SYNTACTIC ANALYSIS
Context-Free Grammars, Grammar rules for English, Treebanks, Normal Forms for
grammar – Dependency Grammar – Syntactic Parsing, Ambiguity, Dynamic
Programming parsing – Shallow parsing – Probabilistic CFG, Probabilistic CYK,
Probabilistic Lexicalized CFGs - Feature structures, Unification of feature structures.
1. Context-Free Grammars (CFGs)
A CFG is defined by 4-tuple G = (N, Σ, P, S):
N – set of non-terminal symbols
Σ – set of terminal symbols (words)
P – set of production rules (e.g., S → NP VP)
S – start symbol (usually S)
➤ Example:
S → NP VP
NP → Det N
VP → V NP
Det → "the" | "a"
N → "dog" | "cat"
V → "chased" | "saw"
➤ Generated Sentence:
S → NP VP
→ Det N VP
→ "the" "dog" VP
→ "the" "dog" V NP
→ "the" "dog" "chased" Det N
→ "the dog chased a cat"
2. Grammar Rules for English
English syntax is flexible but structured. Rules include:
Recursive rules: Allow nested constructs (e.g., relative clauses).
Coordination: "The boy and the girl".
Complex VP rules: "He might have been seen".
➤ Sample Rule:
S → NP VP
VP → V | V NP | V PP
PP → P NP
NP → Det N | NP PP
3. Treebanks
A treebank is a corpus of syntactically annotated sentences.
➤ Example
(Penn Treebank format):
(S
(NP (DT The) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat)))))
➤ Parse Tree (Text-based):
S
├── NP
│   ├── DT "The"
│   └── NN "cat"
└── VP
    ├── VBD "sat"
    └── PP
        ├── IN "on"
        └── NP
            ├── DT "the"
            └── NN "mat"
4. Normal Forms for Grammar
➤ Chomsky Normal Form (CNF)
All rules in the form: A → BC or A → a
Useful for CYK parsing.
➤ Greibach Normal Form (GNF)
Rules of the form: A → a α (a is a terminal, α is non-terminal string).
5. Dependency Grammar
Dependency parsing focuses on binary grammatical relations between words.
➤ Example:
Sentence: The cat sat on the mat.
Dependencies:
sat → root
cat → subject of sat
on → modifier of sat
mat → object of on
➤ Diagram:
sat (root)
├── cat (subject)
└── on (modifier)
    └── mat (object of on)
6. Syntactic Parsing
Parsing = mapping a sentence into a syntactic structure.
➤ Parsing Approaches:
Top-down parsing: Begins from the start symbol and applies productions to match
the input.
Bottom-up parsing: Starts from input and builds up to the start symbol.
Earley parser: Combines top-down and bottom-up; works on any CFG.
CYK parser: Uses CNF; dynamic programming-based.
7. Ambiguity in Parsing
Multiple parse trees possible for a sentence.
➤ Example:
Sentence: I saw the man with the telescope.
Interpretation 1: I used a telescope to see.
Interpretation 2: The man has the telescope.
Each yields a different parse tree:
PP attached to VP vs. NP.
8. Dynamic Programming Parsing
Used to avoid redundant computations in parsing.
➤ CYK Algorithm:
Bottom-up parser for CNF.
Builds parse table for substrings of increasing length.
Steps:
1. Fill table for single words.
2. Combine substrings using production rules.
3. Final cell (top-right) contains parse if sentence is in grammar.
9. Shallow Parsing (Chunking)
Shallow parsing identifies chunks like NP, VP, etc., but doesn’t build full trees.
➤ Example:
Sentence: He saw a large dog in the park.
Chunks:
[NP He]
[VP saw]
[NP a large dog]
[PP in the park]
Used in:
Named Entity Recognition
Relation Extraction
10. Probabilistic CFG (PCFG)
Extends CFG with rule probabilities.
➤ Example:
S → NP VP [1.0]
VP → V NP [0.6]
VP → V [0.4]
Probability of derivation = product of rule probabilities.
Used to resolve ambiguity by selecting the most likely parse tree.
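NLTK can also parse with rule probabilities attached; its ViterbiParser returns the most probable tree. A sketch extending the earlier toy grammar with illustrative probabilities (each left-hand side's rules must sum to 1):
import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [1.0]
VP -> V NP [0.6]
VP -> V [0.4]
Det -> 'the' [0.5] | 'a' [0.5]
N -> 'dog' [0.5] | 'cat' [0.5]
V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the dog chased a cat".split()):
    print(tree)   # the most probable parse, annotated with its probability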
11. Probabilistic CYK Parsing
Combines PCFG with CYK algorithm.
Parses in CNF.
Each table entry stores best probability + backpointer.
Final parse = tree with highest probability.
Used in syntax-based machine translation and speech systems.
12. Probabilistic Lexicalized CFGs
Lexicalized grammars include headwords with rules.
➤ Example:
VP(head=gave) → V(head=gave) NP(head=book) PP(head=to)
Helps model:
Subcategorization (e.g., verbs that take PP or not).
Lexical preferences.
Used in:
Collins Parser
Parser training from Treebanks
13. Feature Structures
A way to represent additional information like gender, number, tense, etc.
➤ Example:
[Category: Noun
Number: Plural
Gender: Neuter]
Helps enforce constraints (e.g., subject-verb agreement).
14. Unification of Feature Structures
Unification = merging two compatible feature structures.
➤ Example:
FS1: [Number: Singular, Person: 3]
FS2: [Number: Singular]
→ Unified: [Number: Singular, Person: 3]
Conflict: FS1: [Number: Singular], FS2: [Number: Plural] → Unification fails.
Used in formalisms:
HPSG (Head-Driven Phrase Structure Grammar)
LFG (Lexical Functional Grammar)
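NLTK's feature structures implement exactly this unification operation. A short sketch of the example above:
import nltk

fs1 = nltk.FeatStruct(Number='Singular', Person=3)
fs2 = nltk.FeatStruct(Number='Singular')
print(fs1.unify(fs2))    # [ Number = 'Singular', Person = 3 ]

fs3 = nltk.FeatStruct(Number='Plural')
print(fs1.unify(fs3))    # None -- unification fails on the conflicting Number values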
UNIT IV SEMANTICS AND PRAGMATICS
Requirements for representation, First-Order Logic, Description Logics – Syntax-Driven
Semantic analysis, Semantic attachments – Word Senses, Relations between Senses,
Thematic Roles, selectional restrictions – Word Sense Disambiguation, WSD using
Supervised, Dictionary & Thesaurus, Bootstrapping methods – Word Similarity using
Thesaurus and Distributional methods.
1. Requirements for Semantic Representation
Effective semantic representations should be:
Unambiguous – Each expression has a single meaning.
Compositional – Meaning derived systematically from parts.
Expressive – Able to represent quantification, negation, modality, and belief states.
Canonical – Equivalent expressions yield the same logical form.
These representations serve as bridges between syntax and deeper meaning in NLP
applications like QA and inference.
2. First-Order Logic (FOL)
FOL enables precise semantic representation using variables, predicates, and quantifiers.
➤ Example:
“Every student loves some language”
∀x (Student(x) → ∃y (Language(y) ∧ Loves(x,y)))
FOL supports inferences and querying, essential for semantic parsing and knowledge-based
systems.
3. Description Logics
A family of logics optimized for ontology and knowledge-based representation.
Allows class hierarchy modeling (e.g., Language ⊑ HumanCommunication).
Used extensively with standards like OWL in the Semantic Web.
Ideal for controlled domains such as healthcare or legal knowledge representation.
4. Syntax-Driven Semantic Analysis
Also known as Syntax-Directed Translation, this approach attaches semantic rules to CFG
productions.
➤ Example:
S → NP VP
S.sem = VP.sem(NP.sem)
Where semantic functions compose meaning bottom-up along parse trees. Connects syntax
and semantics in a structured way.
5. Semantic Attachments
Attachments augment grammar rules with semantic construction logic.
➤ Example:
VP → V NP
VP.sem = λx. V.sem (x, NP.sem)
This allows dynamic construction of logical forms during parsing.
6. Word Senses
Language is inherently ambiguous: words can have multiple meanings.
Polysemy: Related senses of the same word (e.g., “bank” as a financial institution vs. the building that houses it).
Homonymy: Unrelated senses (e.g., “bat” as a mammal/used in sports).
Resources like WordNet group senses into synsets and encode relationships like synonymy
and hyponymy.
7. Relations Between Word Senses
Common lexical relationships include:
Synonymy (e.g., “big” ≈ “large”)
Antonymy (e.g., “happy” vs. “sad”)
Hyponymy/Hypernymy (e.g., “dog” is a kind of “animal”)
Meronymy (part-whole) (e.g., “wheel” is part of “car”)
These relations are used in search, analogy, and inference tasks.
8. Thematic Roles & Selectional Restrictions
Roles describe how entities participate in events:
Agent: Doer (e.g., “John” in “John ate an apple”)
Theme: Entity affected by the action (e.g., “apple”)
Instrument, Location, Experiencer, etc.
Selectional restrictions place semantic constraints on roles:
“eat ⊢ object [+edible]”
Violations indicate semantic anomalies (e.g., “He ate a car”).
9. Word Sense Disambiguation (WSD)
WSD determines correct sense of a word in context:
➤ Methods:
1. Supervised Learning
o Training classifiers (Naïve Bayes, SVM) on manually annotated corpora.
o Features: Surrounding words, POS tags, collocations.
2. Lesk & Dictionary-Based Methods
o Compare overlap between sense definitions (from WordNet glosses) and
context.
o Extensions use expanded lexical resources or gloss vectors.
3. Bootstrapping & Semi-Supervised Methods
o Start with seed examples, use Yarowsky’s one sense per
collocation/discourse property.
o Expand labeled data via confident predictions.
WSD underpins tasks like translation, information extraction, and semantic search.
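NLTK includes a simplified Lesk implementation that compares context words with WordNet glosses. A brief sketch (the selected synset depends on gloss overlap, so treat the output as indicative):
import nltk
nltk.download('wordnet')
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank", pos="n")
print(sense, "-", sense.definition())   # the WordNet synset Lesk selects for 'bank' here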
10. Word Similarity: Lexicon-Based & Distributional
Methods for quantifying how semantically similar two words are:
Thesaurus-based (e.g., WordNet): use path length or ontology depth.
o Measures: Path similarity, Wu-Palmer, Resnik.
Distributional Methods:
o Build word vectors from large corpora (e.g., Word2Vec, GloVe).
o Similarity measured via cosine distance.
o Captures context-based meaning and conceptual similarity.
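Thesaurus-based measures are available off the shelf; WordNet similarity can be computed in NLTK as follows (a sketch, with the intuitive expectation noted in the comments):
from nltk.corpus import wordnet as wn

dog, cat, car = wn.synset('dog.n.01'), wn.synset('cat.n.01'), wn.synset('car.n.01')
print(dog.path_similarity(cat), dog.wup_similarity(cat))   # relatively high: close in the hierarchy
print(dog.path_similarity(car), dog.wup_similarity(car))   # lower: distant concepts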
Summary (Topic: Key Concepts)
Semantic Representation: Compositional, logical, expressive meaning
FOL & Description Logics: Quantification, inference, ontology modeling
Syntax-Driven Semantics & Attachments: Mapping syntax to semantic forms
Word Senses & Lexical Relations: Polysemy, WordNet synsets and semantic relations
Thematic Roles & Restrictions: Conceptual roles and semantic constraints
WSD Methods: Supervised, dictionary, bootstrapping methods
Word Similarity: Lexicon-based vs. distributional similarity
UNIT V DISCOURSE ANALYSIS AND LEXICAL RESOURCES
Discourse segmentation, Coherence – Reference Phenomena, Anaphora Resolution
using Hobbs and Centering Algorithm – Co-reference Resolution – Resources: Porter
Stemmer, Lemmatizer, Penn Treebank, Brill's Tagger, WordNet, PropBank, FrameNet,
Brown Corpus, British National Corpus (BNC).
1. Discourse Analysis Overview
Discourse is structured communication through written or spoken language, analyzing how
language is used beyond individual sentences.
Discourse Segmentation: Dividing a text into meaningful units (e.g., sentences,
paragraphs, topics).
Applications: Summarization, dialogue systems, sentiment analysis.
Examples of Segmentation:
"John went to the store. He bought apples. Later, he baked a pie."
Segments:
1. John went to the store.
2. He bought apples.
3. Later, he baked a pie.
2. Coherence and Reference Phenomena
Coherence:
Ensures logical flow and connection between parts of discourse.
Local Coherence: Between adjacent segments.
Global Coherence: Across the full discourse.
Devices for Achieving Coherence:
Referential Devices: Pronouns (e.g., He → John).
Lexical Cohesion: Repetition/synonyms (e.g., challenge ↔ difficulties).
Conjunctions/Discourse Markers: However, therefore, etc.
Ellipsis: Omitting repeated elements for brevity.
3. Reference Phenomena
Anaphora: Refers back.
o "John came. He was tired." → He → John.
Cataphora: Refers forward.
o "Before he arrived, John was expected."
Exophora: Refers to something external.
o "Look at that!"
Endophora: Within the discourse (includes anaphora and cataphora).
4. Anaphora Resolution
Identifying what a pronoun or reference refers to.
Hobbs’ Algorithm:
Syntactic approach.
Parses trees to find the most likely antecedent using traversal.
Centering Algorithm:
Focuses on discourse centers.
Forward-looking centers (Cf): New referents.
Backward-looking center (Cb): Previous topic.
Maintains topic continuity for coherence.
5. Coreference Resolution
Identifying all expressions that refer to the same entity.
Types:
Anaphora: "John was late. He apologized."
Cataphora: "When he arrived, John was shocked."
Split Antecedents: "John met Mary. They left."
Exophora: Refers to outside the text.
Resolution Steps:
1. Identify mentions (noun phrases).
2. Extract features (gender, number, etc.).
3. Form candidate chains.
4. Select best match using rules or ML models.
6. NLP Resources
Porter Stemmer: Rule-based suffix stripping algorithm, e.g., "playing" → "play"
Lemmatizer: Reduces words to their dictionary form, context-aware, e.g., "better" → "good"
Penn Treebank: Annotated corpus with POS tags and syntactic parse trees.
Brill's Tagger: Rule-based POS tagger with transformation rules.
WordNet: Lexical database with synsets, hypernyms, hyponyms.
PropBank: Annotated corpus with predicate-argument structures.
FrameNet: Database of semantic frames capturing contextual meaning.
Brown Corpus: Annotated 1M-word English corpus across genres.
British National Corpus (BNC): 100M-word corpus of modern British English.
7. Install NLP Tools with NLTK (Python)
Install with pip:
pip install nltk
Then download the required corpora and models:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('brown')
Discourse Segmentation
Discourse segmentation is the process of dividing a text or discourse into meaningful units
such as sentences, paragraphs, or subtopics. These segments help in understanding the
structure and flow of the discourse. It is crucial in applications like summarization,
sentiment analysis, and dialogue systems.
Coherence and Reference Phenomena
Coherence refers to the logical and semantic connections that make a discourse meaningful
and interpretable.
Reference Phenomena involve the use of linguistic elements (like pronouns) that refer to
other parts of the discourse for maintaining coherence. These include:
Anaphora: Refers to expressions that depend on preceding text for their meaning (e.g.,
"John is here. He is happy").
Cataphora: Refers to expressions that rely on succeeding text (e.g., "Before he
arrived, John was expected").
Anaphora Resolution
Anaphora resolution is the task of determining the antecedent (the word or phrase to which
an anaphor refers).
Two notable algorithms used for this are:
1. Hobbs' Algorithm:
A syntactic approach for pronoun resolution.
Traverses a parse tree of the sentence and searches for antecedents in a systematic
manner.
Steps involve walking up and down the parse tree to find the most likely antecedent.
2. Centering Algorithm:
Focuses on local coherence in a discourse.
Uses centers (entities that a discourse is about) to resolve references:
Forward-looking centers: Potential referents in the next utterance.
Backward-looking centers: Referents carried over from the previous utterance.
Attempts to maintain continuity by preferring references to entities that were
prominent in the previous utterance.
Coreference Resolution
Coreference resolution is the broader task of identifying when multiple expressions in a text
refer to the same entity. It includes:
Resolving anaphors (e.g., "he" in "John said he would come").
Linking nominal references (e.g., "the professor" and "Dr. Smith").
Resources for NLP
1. Porter Stemmer: A widely used algorithm for stemming (reducing words to their root
form).
2. Lemmatizer: Normalizes words to their dictionary form (e.g., "running" → "run").
3. Penn Treebank: A linguistic corpus annotated with part-of-speech tags and syntactic
trees.
4. Brill's Tagger: A rule-based part-of-speech tagger.
5. WordNet: A lexical database of English words grouped into synsets with semantic
relationships.
6. PropBank: A corpus annotated with predicate-argument structures for semantic role
labeling.
7. FrameNet: A database of semantic frames for modeling word meaning and context.
8. Brown Corpus: A standard corpus for linguistic analysis, containing diverse text
genres.
9. British National Corpus (BNC): A 100-million-word corpus of British English,
covering spoken and written text.
DISCOURSE MEANING:
Discourse refers to communication through spoken or written language. It is a formal term
that encompasses the ways in which language is used to convey ideas, engage in dialogue,
and construct meaning within a particular context.
DISCOURSE SEGMENTATION
Discourse segmentation is the process of dividing a discourse into smaller units, such as
sentences, clauses, paragraphs, or topics, to understand its structure and meaning more
effectively. These units, called discourse segments, help in identifying boundaries and
organizing the information for easier interpretation.
Types of Discourse Segmentation
1. Sentence-Level Segmentation: Splits a discourse into individual sentences.
2. Topic-Based Segmentation: Divides the discourse into sections based on the theme
or subject.
3. Dialogue Segmentation: Segments dialogues into conversational turns or speaker
utterances.
Example of Discourse Segmentation
Input Text:
"John went to the store. He bought some apples. Later, he decided to bake a pie."
Segmented Discourse:
1. Sentence 1: "John went to the store."
2. Sentence 2: "He bought some apples."
3. Sentence 3: "Later, he decided to bake a pie."
Each segment conveys a meaningful unit of information, contributing to the overall
coherence of the discourse.
Coherence
Coherence refers to the logical flow and connection between segments in a discourse. A
coherent discourse enables the reader or listener to understand the relationships between
ideas and follow the progression of thought.
Types of Coherence
1. Local Coherence: Connections between adjacent sentences or segments.
2. Global Coherence: Logical consistency across the entire discourse.
Devices for Achieving Coherence
1. Referential Devices
Definition: Use of pronouns or phrases to refer back to earlier elements in a discourse.
Example:
Text: John bought a new car. He is very happy with it.
Explanation:
He refers back to John.
It refers back to a new car.
These pronouns maintain coherence by avoiding repetition.
2. Lexical Cohesion
Definition: Repetition of key terms or use of synonyms to link ideas.
Example:
Text: The project was challenging. Despite the difficulties, the team embraced the
challenge and completed it successfully.
Explanation:
Challenging and challenge are repetitions of a key term.
Difficulties is a near-synonym, reinforcing the theme of overcoming obstacles.
3. Conjunctions and Discourse Markers
Definition: Words or phrases that indicate relationships between sentences or ideas,
such as addition, contrast, cause, or sequence.
Example:
Text: The team worked hard. However, the project was delayed due to unforeseen
circumstances.
Explanation:
However signals a contrast between the team's effort and the project's
delay, maintaining coherence.
4. Ellipsis
Definition: Omitting repeated words or phrases to avoid redundancy while ensuring
meaning remains clear.
Example:
Text: James likes pizza, and Sarah does too.
Explanation:
The phrase likes pizza is omitted after Sarah does, as it is implied.
This omission avoids redundancy while maintaining clarity and coherence.
Example of Coherence
Coherent Text:
"Anna loves gardening. She spends her weekends planting flowers and vegetables. Her
garden is admired by her neighbors."
Coherence Explanation:
o The subject (Anna and her gardening activities) is maintained throughout.
o Pronoun "she" refers back to "Anna", ensuring cohesion.
The sentences logically follow each other, expanding on Anna’s gardening.
Incoherent Text:
"Anna loves gardening. The weather was cold last week. Many people travel during
holidays."
Lack of Coherence:
o There is no logical connection between the sentences.
o The topic shifts abruptly from gardening to weather to holidays.
Example of Topic-Based Segmentation and Coherence
Input Text:
"The weather was sunny and warm, perfect for a picnic. Sarah decided to invite her friends
for a day at the park. They brought sandwiches, drinks, and games. Later, they enjoyed
playing Frisbee and relaxing under the trees."
Segmented and Coherent Text:
1. Segment 1: Setting the Scene
"The weather was sunny and warm, perfect for a picnic."
2. Segment 2: Planning the Activity
"Sarah decided to invite her friends for a day at the park."
3. Segment 3: Enjoying the Picnic
"They brought sandwiches, drinks, and games. Later, they enjoyed playing
Frisbee and relaxing under the trees."
Coherence Explanation:
Each segment builds on the previous one, maintaining a logical flow from the weather
to Sarah’s decision and finally to the picnic activities.
REFERENCE PHENOMENA
Reference phenomena are linguistic mechanisms used to link words or expressions within
a discourse, enabling coherence and meaning. References connect elements in the text
(referred to as referents) to their mentions (referred to as anaphors or cataphors).
Types of Reference Phenomena
1. Anaphora:
Refers back to an earlier expression in the discourse.
Example: "John arrived late. He apologized."
He refers back to John.
2. Cataphora:
Refers forward to an expression introduced later in the discourse.
Example: "Before he spoke, James took a deep breath."
He refers to James, which appears later in the sentence.
3. Exophora:
Refers to something outside the discourse, often in the surrounding physical context.
Example: "Look at that!"
That refers to something visible in the environment.
4. Endophora:
Refers to elements within the discourse. Includes both anaphora and cataphora.
Example
When she saw him, Mary smiled at John.
"She" is anaphoric, referring back to "Mary."
"Him" is cataphoric, referring forward to "John."
Anaphora Resolution
Anaphora resolution involves identifying the antecedent (the referent) for a given anaphor.
1. Hobbs’ Algorithm
Hobbs' algorithm is a syntactic approach for resolving pronominal anaphora. It is
computationally efficient and widely used in natural language processing (NLP).
Steps of Hobbs’ Algorithm:
1. Parse the sentences to generate syntactic trees.
2. Start at the anaphor and traverse the syntactic tree in a breadth-first manner.
3. Check each potential antecedent for compatibility (e.g., gender, number,
syntactic role).
4. Select the most suitable antecedent.
Example:
Input Text: "John left early because he was tired."
o Anaphor: he.
o Antecedent: John.
o Hobbs’ algorithm identifies John as the referent by traversing the tree and
matching syntactic and semantic features.
Centering Algorithm
The Centering Algorithm is a technique used to ensure coherence in a discourse by
analyzing how entities (like people, objects, or ideas) are referenced across sentences. It
helps identify the most important subject (called the center) of a sentence and how it
connects to the next sentence.
Steps in the Centering Algorithm
1. Identify Forward-Looking Centers (Cf):
List all entities in the current sentence.
Example: "John went to the park. He played soccer there."
In Sentence 1: Forward-looking centers = {John, park}.
2. Find the Backward-Looking Center (Cb):
Determine the most important entity from the previous sentence that connects to
the current sentence.
Example:
In Sentence 2: He refers to John, making John the backward-looking center.
3. Rank the Forward-Looking Centers (Cf):
Rank the entities based on their importance (subjects > objects > others).
Typically, subjects (e.g., "John") are ranked higher than objects (e.g., "the
park").
4. Check Transitions for Coherence:
Coherence in discourse depends on how the backward-looking center (Cb)
changes across sentences. The goal is to minimize unnecessary changes.
Example of Centering Algorithm in Action
Text: "Mary went to the store. She bought apples. The apples were fresh."
Step-by-Step Process:
1. Sentence 1: "Mary went to the store."
Forward-looking centers (Cf) = {Mary, store}.
No backward-looking center (Cb) since this is the first sentence.
2. Sentence 2: "She bought apples."
Forward-looking centers (Cf) = {She (Mary), apples}.
Backward-looking center (Cb) = Mary (connected via She).
3. Sentence 3: "The apples were fresh."
Forward-looking centers (Cf) = {apples}.
Backward-looking center (Cb) = apples (connected to the previous mention of
apples).
Coherence:
The backward-looking center shifts smoothly from Mary to apples as the discourse
progresses, maintaining logical flow and coherence.
COREFERENCE RESOLUTION
Coreference Resolution is the task of identifying when different words or phrases in a
text refer to the same entity. It is a key aspect of natural language understanding and is
widely used in applications like chatbots, machine translation, and summarization.
Types of Coreference
1. Anaphora: Refers to an earlier entity in the text.
o Example: "John arrived late. He apologized."
He refers to John.
2. Cataphora: Refers to an entity mentioned later in the text.
o Example: "When he arrived, John was surprised to see everyone waiting."
He refers to John, mentioned later.
3. Split Antecedent Coreference: Refers to a combination of two or more entities.
o Example: "John met Sarah. They had lunch together."
They refers to John and Sarah.
4. Exophora: Refers to something outside the text, often in the physical context.
o Example: "Look at that!"
That refers to something observable in the environment.
Steps in Coreference Resolution
1. Identify Mentions:
o Locate all noun phrases or pronouns that could refer to an entity.
o Example: "John loves his dog."
Mentions: John, his, dog.
2. Extract Features:
o Analyze features like gender, number, semantics, and syntactic position.
o Example:
John (male, singular) matches he (male, singular).
3. Create Candidate Chains:
o Group potential references into chains based on matching features.
o Example:
Chain: {John, he}.
4. Resolve Coreference:
o Use algorithms or rules to determine the correct antecedent for each pronoun
or noun phrase.
Algorithms for Coreference Resolution
1. Rule-Based Methods:
o Use hand-crafted rules based on linguistic knowledge.
o Example:
A pronoun like he typically refers to the nearest preceding male noun
phrase.
2. Machine Learning Approaches:
o Use supervised learning to train models on annotated datasets.
o Features include syntactic roles, word embeddings, and distance between
mentions.
3. Neural Network Models:
o Leverage deep learning to capture complex relationships.
o Example: Transformers like BERT can model contextual relationships in
coreference tasks.
Examples of Coreference
Example 1: Simple Coreference
Text: "Alice visited the park. She enjoyed the walk."
Coreference: Alice ↔ She.
Example 2: Complex Coreference
Text: "The team worked hard. Their efforts paid off in the end."
Coreference: The team ↔ Their.
Example 3: Split Antecedent
Text: "John greeted Mary. They went to the café."
Coreference: John + Mary ↔ They.
Applications of Co-reference Resolution
1. Chatbots and Virtual Assistants:
o Enable contextual understanding to maintain coherent conversations.
o Example: "What is Elon Musk's company? Tell me more about it."
Resolves it to Elon Musk's company.
2. Machine Translation:
o Ensures proper pronoun reference in target languages.
3. Text Summarization:
o Helps group mentions of the same entity to create concise summaries.
4. Question Answering Systems:
o Resolves references in follow-up questions.
o Example: "Who won the race? Was he happy?"
RESOURCES
In the context of Natural Language Processing (NLP), resources refer to tools, datasets,
and frameworks that help in various tasks like text analysis, machine learning, and
language understanding. These resources provide foundational data and methods for
processing, analyzing, and understanding language.
1. Porter Stemmer
What It Is:
A stemming algorithm that reduces words to their base or root form by removing
common suffixes.
Why It’s Used:
Helps in text normalization by treating words like "running" and "runner" as the
same root word, "run."
Example:
o Input: "playing, played, playful"
o Output: "play"
2. Lemmatizer
What It Is:
A tool that converts words to their dictionary base form (lemma), considering the
word’s meaning and context.
Why It’s Used:
Unlike stemming, it ensures that the base form is a real word and more linguistically
accurate.
Example:
o Input: "better"
o Output: "good"
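The difference between stemming and lemmatization is easy to see side by side. A short sketch with NLTK (the pos argument tells the lemmatizer the word class):
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print(stemmer.stem("playing"))                   # 'play'  -- crude suffix stripping
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  -- dictionary-aware lemma
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'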
3. Penn Treebank
What It Is:
A dataset containing annotated syntactic structures and part-of-speech tags for
English text.
Why It’s Used:
Provides a standardized corpus for training and testing NLP models.
Example Annotation:
o Sentence: "The dog barked loudly."
o POS Tags: [The/DT dog/NN barked/VBD loudly/RB]
4. Brill's Tagger
What It Is:
A rule-based part-of-speech (POS) tagger developed by Eric Brill.
Why It’s Used:
Assigns grammatical tags to words based on contextual rules.
Example:
o Sentence: "She runs fast."
o POS Tags: She/PRP runs/VBZ fast/RB
5. WordNet
What It Is:
A lexical database for the English language that groups words into synsets (synonym
sets) and provides relationships like hypernyms (broader terms) and hyponyms
(specific terms).
Why It’s Used:
Supports tasks like word sense disambiguation and semantic analysis.
Example:
o Word: "car"
o Synsets: {automobile, motorcar}
o Hypernym: "vehicle"
6. PropBank
What It Is:
A corpus annotated with information about verb arguments and their roles in
sentences (semantic role labeling).
Why It’s Used:
Enables understanding of sentence semantics by identifying "who did what to
whom."
Example:
o Sentence: "John gave Mary a book."
o Roles: John (giver), Mary (recipient), book (object)
7. FrameNet
What It Is:
A database that groups words into semantic frames—concepts that capture the
relationships between words in context.
Why It’s Used:
Helps in understanding the broader context of sentences.
Example:
o Frame: Commerce_buy
o Sentence: "She bought a car from the dealer."
o Roles: Buyer (She), Goods (car), Seller (dealer)
8. Brown Corpus
What It Is:
One of the first large, annotated corpora of English text, covering diverse genres like
fiction, news, and academic writing.
Why It’s Used:
Provides a standard dataset for linguistic analysis and training language models.
Example:
o Contains over 1 million words categorized into genres like news and editorials.
9. British National Corpus (BNC)
What It Is:
A 100-million-word text corpus that represents modern British English from various
contexts like books, conversations, and broadcasts.
Why It’s Used:
Useful for understanding language trends, dialects, and usage patterns in British
English.
Example:
o Provides frequency counts of word usage, e.g., "colour" is more common than
"color."
Install NLTK (Natural Language Toolkit)
NLTK is one of the most popular libraries in Python for NLP. It includes a variety of
useful modules for tasks like tokenization, stemming, lemmatization, POS tagging, and
more. To install NLTK:
Steps:
1. Open a terminal or command prompt.
2. Run the following command to install NLTK via pip:
pip install nltk
3. After installing NLTK, you can use the following code to download the datasets you
need (like WordNet, Brown Corpus, etc.):
import nltk
nltk.download('all')
4. Alternatively, you can download specific datasets individually:
nltk.download('wordnet')
nltk.download('brown')
Note: The nltk.download('all') command will download all datasets, which can take considerable
time and disk space. If you only need specific resources, you can download them individually as
shown above.