NLP ASSIGNMENT NOTES
UNIT 1
1. Comparison: Boolean Model vs Vector Space Model vs
Probabilistic Model (for NLP & IR)
🔷 Boolean Model
Basic Idea: Documents and queries are represented as a set of terms
(present or not).
Operations: Uses Boolean logic (AND, OR, NOT).
Strengths:
o Simple to implement.
o Fast for small datasets.
o Precise matching when query and document terms align exactly.
Weaknesses:
o No partial matching – a document either matches the query or it does not.
o No ranking of results.
o Doesn’t consider term frequency or document length.
Example Use: Early search engines and digital libraries.
🔷 Vector Space Model (VSM)
Basic Idea: Documents and queries are represented as vectors in a
multi-dimensional space (each dimension = term).
Similarity Measure: Cosine similarity is often used.
Strengths:
o Supports ranking of documents based on similarity.
o Handles partial matches and term weighting (like TF-IDF).
o Simple mathematical foundation.
Weaknesses:
o Ignores relationships between terms (no semantics).
o Assumes independence between terms.
Example Use: Search engines using TF-IDF scoring.
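To make the ranking idea concrete, here is a small, hedged sketch using scikit-learn's TfidfVectorizer and cosine similarity (assuming scikit-learn is installed; the toy documents and query are invented purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query (illustrative only)
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "the stock market fell today"]
query = ["cat on a mat"]

# Build TF-IDF vectors: each dimension corresponds to one term
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")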
🔷 Probabilistic Model
Basic Idea: Assigns a probability that a document is relevant to a
given query.
Models: Includes Binary Independence Model (BIM), BM25, Language
Models.
Strengths:
o Considers uncertainty and estimates relevance.
o Can adapt and learn from user feedback.
o Supports probabilistic ranking.
Weaknesses:
o Computationally more complex.
o Requires training data or relevance judgments.
Example Use: Modern IR systems like Google ranking algorithms, ad
retrieval.
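For the probabilistic side, the BM25 scoring formula can be sketched directly in Python; this is a minimal illustration with invented toy data and the common default parameters k1 = 1.5 and b = 0.75, not a production ranking function:

import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query using the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)                              # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (tf * (k1 + 1)) / denom
    return score

corpus = [["the", "cat", "sat"], ["dogs", "bark", "loudly"], ["the", "cat", "and", "the", "dog"]]
query = ["cat", "dog"]
for doc in corpus:
    print(doc, round(bm25_score(query, doc, corpus), 3))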
2. Classical NLP Models: Rule-based vs Statistical vs Information
Retrieval Models
🔹 Rule-Based Models
How They Work:
o Use manually written linguistic rules.
o Depend on syntax, lexicons, and grammar rules.
Strengths:
o Transparent and explainable.
o Precise for controlled language environments.
Weaknesses:
o Hard to scale across domains and languages.
o Rigid and brittle – small changes break the system.
Examples:
o Grammar checkers (like Grammarly).
o Early NER systems using regular expressions.
🔹 Statistical Models
How They Work:
o Use data-driven approaches based on probability.
o Learn from annotated corpora (training data).
Types:
o N-gram models, Hidden Markov Models (HMM),
Conditional Random Fields (CRF).
Strengths:
o Scalable and domain-adaptive.
o More robust than rule-based models.
Weaknesses:
o Require large labeled datasets.
o May lack transparency in decision-making.
Examples:
o POS tagging (HMM).
o Named Entity Recognition (CRF).
🔹 Information Retrieval (IR) Models
How They Work:
o Match documents to queries using keyword overlap or statistical
scoring (like TF-IDF).
Strengths:
o Fast and efficient on large-scale corpora.
o Good for unstructured document retrieval.
Weaknesses:
o Lack deep understanding of language (semantic gap).
Examples:
o Search engines (Lucene, Solr).
o Document recommendation systems.
3. Probabilistic Graphical Models (PGMs) and Their Role in NLP
🔷 Definition:
PGMs are frameworks that use graph structures to represent and reason about uncertain variables.
They combine graph theory and probability theory.
🔷 Two Main Types:
1. Bayesian Networks (Directed Acyclic Graphs)
o Represent causal relationships.
o Use conditional probabilities.
2. Markov Random Fields (Undirected Graphs)
o Represent symmetric relationships.
o Capture dependencies without direction.
🔷 Importance in NLP Tasks:
POS Tagging: Hidden Markov Models (HMMs) estimate tag sequences.
NER / Chunking: CRFs (Conditional Random Fields) predict label sequences with context.
Topic Modeling: LDA (Latent Dirichlet Allocation) identifies hidden topics using a generative Bayesian model.
Parsing: Probabilistic CFGs (Context-Free Grammars) are used to assign tree structures.
Semantic Role Labeling: Graphical models capture dependencies between sentence elements.
🔷 Advantages:
Handle uncertainty and hidden variables.
Model dependencies and structure in language.
Scalable to large datasets.
Provide probabilistic predictions useful in many NLP applications.
🔷 Challenges:
Require training data and parameter estimation.
Inference can be computationally expensive.
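As one concrete illustration of how a directed PGM supports sequence labelling, the sketch below hand-specifies a tiny HMM and runs the Viterbi algorithm to recover the most probable tag sequence; all probabilities are invented for illustration only:

# Toy HMM for POS tagging: states are tags, observations are words.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
          "VERB": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}

def viterbi(words):
    """Return the most probable tag sequence for the observed words."""
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}]
    for w in words[1:]:
        col = {}
        for s in states:
            prob, prev = max(
                (V[-1][ps][0] * trans_p[ps][s] * emit_p[s].get(w, 1e-6), ps)
                for ps in states)
            col[s] = (prob, V[-1][prev][1] + [s])
        V.append(col)
    return max(V[-1].values(), key=lambda x: x[0])[1]

print(viterbi(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']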
UNIT 2
1. Key Differences between Phonetics and Phonology
Definition: Phonetics is the study of the physical sounds of speech; phonology is the study of how sounds function and are organized in a language.
Focus: Phonetics covers physical articulation, acoustic properties, and perception; phonology covers abstract, rule-based sound patterns and systems in a language.
Units: Phonetics deals with phones (actual spoken sounds); phonology with phonemes (distinct sound units that can change meaning).
Scientific Area: Phonetics is more related to physics, biology, and physiology; phonology to linguistics and mental representation.
Example: [p] and [pʰ] are phonetically different (with/without aspiration); /p/ and /b/ are phonemes in English – "pat" vs "bat".
Applications: Phonetics is used in speech synthesis, recognition, and forensic linguistics; phonology in linguistic analysis, language teaching, and computational linguistics.
2. Definition of Morphology and Word Formation Processes
🔹 Morphology:
The study of the structure and formation of words.
Focuses on morphemes – the smallest meaningful units in a language.
🔹 Word Formation Processes (with examples):
1. Derivation:
o Adding prefixes/suffixes to form new words.
o happy → unhappiness (prefix: un-, suffix: -ness)
2. Inflection:
o Changes a word’s form to express tense, number, case, etc.
o walk → walked (past tense), cat → cats (plural)
3. Compounding:
o Joining two or more words to form a new word.
o tooth + brush → toothbrush
4. Conversion (Zero Derivation):
o Changing the word class without changing the form.
o run (verb) → a run (noun)
5. Clipping:
o Shortening a longer word.
o advertisement → ad
6. Blending:
o Combining parts of two words.
o breakfast + lunch → brunch
7. Acronyms and Initialisms:
o Forming words from initials.
o NASA (National Aeronautics and Space Administration)
8. Reduplication:
o Repeating part/all of a word.
o bye-bye, tick-tock
3. Role of Finite-State Transducers (FSTs) in Morphological Analysis
🔹 What is an FST?
A Finite-State Transducer is a type of automaton that maps
between two levels of representation, e.g., surface form ↔ lexical
form.
Used to model morphological rules in computational linguistics.
🔹 How FSTs Work in Morphology:
FSTs take an input word and break it down into root + affix(es).
They can also generate surface forms from lexical entries.
🔹 Example:
Suppose we have the word: "talked"
Lexical form: talk + PAST
Surface form: talked
✅ An FST can:
Analyze: From talked → talk + PAST
Generate: From talk + PAST → talked
🔹 Practical Scenarios:
1. Spell-checkers:
o Recognize and correct inflected and derived forms of words.
2. Text-to-speech systems:
o Accurately pronounce morphologically complex words.
3. Search engines:
o Perform stemming and lemmatization to match different word
forms.
4. Machine Translation:
o Accurately translate morphologically rich languages.
5. Language learning apps:
o Teach correct word forms and conjugations.
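A real FST toolkit (e.g., foma or HFST) compiles such rules into a transducer; the sketch below only imitates the analyze/generate mapping described above with a tiny hypothetical lexicon and the regular "-ed" past-tense rule:

# Minimal sketch of the analyze/generate mapping an FST performs.
# The lexicon and the single rule are hypothetical and cover only regular past tense.
LEXICON = {"talk", "walk", "jump"}

def generate(lexical):
    """Map a lexical form like 'talk+PAST' to its surface form 'talked'."""
    root, _, feature = lexical.partition("+")
    if root in LEXICON and feature == "PAST":
        return root + "ed"
    return root

def analyze(surface):
    """Map a surface form like 'talked' back to 'talk+PAST'."""
    if surface.endswith("ed") and surface[:-2] in LEXICON:
        return surface[:-2] + "+PAST"
    if surface in LEXICON:
        return surface
    return None

print(generate("talk+PAST"))   # talked
print(analyze("talked"))       # talk+PAST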
UNIT 3
1. Tokenization in NLP
✅ Definition:
Tokenization is the process of breaking down a text into smaller
units called tokens (e.g., words, phrases, symbols).
✅ Steps Involved:
1. Input Text Processing:
o Raw text is taken as input for processing.
2. Language Identification (optional but useful):
o Detect the language to apply proper tokenization rules
(since rules vary by language).
3. Sentence Segmentation:
o Divide the text into individual sentences.
o Example: "Hello world! How are you?" → ["Hello world!",
"How are you?"]
4. Word Tokenization:
o Each sentence is broken into words/tokens.
o Example: "Hello world!" → ["Hello", "world", "!"]
5. Punctuation Handling:
o Decide whether to keep/remove punctuation as separate
tokens.
6. Special Handling (e.g., contractions):
o "I'm" → ["I", "am"] (depending on the tokenizer)
7. Language-Specific Rules:
o For agglutinative languages (e.g., Turkish), sub-word or
morpheme tokenization may be applied.
✅ Importance:
Acts as the first step in most NLP pipelines.
Essential for:
o Text analysis
o Information retrieval
o Sentiment analysis
o Machine translation
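The sentence segmentation and word tokenization steps can be tried with NLTK (a sketch assuming the 'punkt' tokenizer models have been downloaded):

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello world! How are you?"
sentences = sent_tokenize(text)           # ['Hello world!', 'How are you?']
tokens = [word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)                             # [['Hello', 'world', '!'], ['How', 'are', 'you', '?']]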
2. PoS Tagging: Rule-Based vs Stochastic vs Lexical
Rule-Based: Uses hand-crafted linguistic rules to assign PoS tags. Example: if a word ends with "ly", tag it as an adverb. Effective in low-resource settings and for languages with rich morphology.
Stochastic (Statistical): Uses probabilistic models like HMMs and CRFs, based on word and tag frequencies. Example: in "The dog barks.", there is a high chance that "dog" is a noun. Effective when trained on large, labeled corpora.
Lexical (Dictionary-Based): Assigns tags based on dictionaries/lexicons such as WordNet or large tagged corpora. Effective when working with standard vocabulary and only limited context is needed.
✅ Examples:
Rule-Based:
o "If previous word is 'to' and current word is verb → mark
as base form of verb"
Stochastic:
o Trained on a corpus: in "flies like ...", the model can infer whether "flies" is a noun or a verb from the surrounding context.
Lexical:
o "run" is both a verb and a noun. Dictionary helps list all
possible tags, but context is not used.
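A quick way to see stochastic tagging in practice is NLTK's pretrained tagger (a sketch assuming the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded; exact tags may vary by model version):

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time
from nltk import word_tokenize, pos_tag

sentence = "The quick brown fox jumps over the lazy dog."
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]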
3. Named Entity Recognition (NER)
✅ Definition:
NER is a subtask of NLP that identifies and classifies named entities
in text into predefined categories such as:
Person names
Organizations
Locations
Dates
Quantities
Monetary values
Events, etc.
✅ Example:
"Apple Inc. announced a new product in California on March 21,
2023."
NER Output:
Apple Inc. → Organization
California → Location
March 21, 2023 → Date
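The example output above can be reproduced approximately with spaCy's pretrained English pipeline (a sketch assuming the en_core_web_sm model is installed; spaCy uses labels such as ORG, GPE, and DATE):

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. announced a new product in California on March 21, 2023.")
for ent in doc.ents:
    print(ent.text, "→", ent.label_)
# e.g. Apple Inc. → ORG, California → GPE, March 21, 2023 → DATE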
✅ Applications:
Business: Extracting company names, financial info, and executive movements from reports.
Journalism: Quickly tagging people, places, and events in articles.
Healthcare: Identifying diseases, drug names, and patient details from clinical notes.
Legal: Extracting case numbers, defendant/plaintiff names, and dates from documents.
Social Media: Identifying trending people, brands, and places in tweets/posts.
UNIT 4
1. What is Syntactic Parsing?
Syntactic parsing (also called syntax analysis) is the process of analyzing
a sentence to reveal its grammatical structure, often represented as a
parse tree.
It helps determine how words relate to each other in a sentence.
Outputs phrase structure or dependency relations.
🟩 2. Why is Syntactic Parsing Important?
Understands sentence structure for deeper NLP tasks.
Aids in:
o Machine Translation
o Information Extraction
o Question Answering
o Grammar Checking
🟩 3. Types of Parsing
✅ A. Constituency Parsing (Phrase Structure Parsing)
Breaks sentence into nested constituents (NP, VP, PP).
Based on Context-Free Grammar (CFG).
Example:
(S
  (NP The quick brown fox)
  (VP jumps
      (PP over
          (NP the lazy dog))))
✅ B. Dependency Parsing
Focuses on word-to-word relationships.
Each word is a node, and arcs represent dependencies (subject, object,
etc.)
Example (Dependency Tree):
jumps → root
fox → subject of jumps
over → modifier of jumps
dog → object of over
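The same head-dependent relations can be inspected with spaCy (again assuming en_core_web_sm is installed; spaCy's label names, e.g. nsubj and pobj, differ slightly from the informal labels above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    # token.dep_ is the dependency label, token.head is the governing word
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
# e.g. fox --nsubj--> jumps,  over --prep--> jumps,  dog --pobj--> over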
🟩 4. Parsing Techniques
✅ A. Rule-Based Parsing
Uses hand-written grammar rules (e.g., CFGs)
Parses sentence according to production rules.
✅ Transparent and explainable
❌ Not robust to real-world noisy text
✅ B. Statistical Parsing
Learns parsing from treebank datasets (annotated corpora)
Uses:
o PCFG (Probabilistic CFG): Assigns probabilities to CFG rules.
o Chart Parsers, CKY Algorithm
✅ C. Transition-Based Parsing (for Dependency Parsing)
Builds parse tree incrementally using actions (SHIFT, REDUCE)
Efficient and used in real-time NLP systems
✅ D. Neural Parsing
Uses neural networks (e.g., LSTMs, Transformers)
Learns from large corpora
Highly accurate and generalizable
Libraries:
spaCy
Stanza (Stanford NLP)
Benepar (Berkeley Neural Parser)
AllenNLP
🟩 5. Treebanks – Role in Parsing
Treebanks are annotated corpora with syntactic parse trees.
Used for training and evaluating parsers.
Examples:
Penn Treebank (for constituency parsing)
Universal Dependencies (for dependency parsing)
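A sample of the Penn Treebank ships with NLTK, so treebank trees can be inspected directly (a sketch assuming the 'treebank' corpus has been downloaded):

import nltk
# nltk.download('treebank')  # one-time download of the sample corpus
from nltk.corpus import treebank

tree = treebank.parsed_sents()[0]   # first annotated sentence as a Tree object
print(treebank.words()[:10])        # the raw tokens of the corpus
tree.pretty_print()                 # draw the constituency parse as ASCII art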
🟩 6. Example Grammar and CFG Parsing
Grammar Rules (simple CFG):
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
Sentence:
the dog chased the cat
Parse Tree:
(S
  (NP (Det the) (N dog))
  (VP (V chased)
      (NP (Det the) (N cat))))
🟩 7. Applications of Syntactic Parsing
Grammar correction (e.g., Grammarly)
Voice assistants: understanding commands
Machine translation: structural disambiguation
Question answering systems
Summarization
🟩 1. What is a Linguistically Annotated Corpus?
A linguistically annotated corpus is a large collection of text
that has been tagged or marked up with linguistic information
such as:
Part-of-Speech (PoS) tags
Syntax (parse trees)
Semantics (meaning)
Named Entities
Morphological information
Coreference links
Dependency relations
💡 Think of it as text plus expert linguistic labels for machines to learn from.
🟩 2. Purpose of Annotated Corpora
Training and evaluating NLP models.
Studying language structure.
Developing tools like:
o Part-of-speech taggers
o Parsers
o Named Entity Recognizers
o Machine Translation systems
🟩 3. Common Types of Annotations
PoS Tagging: Labels each word with its part of speech. Example: dog/NN, run/VB
Morphological: Includes the root word, tense, number, etc. Example: running → run + V + ing
Syntactic: Phrase structure or dependency trees. Example: NP → Det + Adj + Noun
Semantic: Meaning-based roles or senses. Example: "bank" → financial vs river
NER (Named Entities): Tags names of persons, places, organizations, etc. Example: Obama → PERSON
Coreference: Links pronouns to their antecedents. Example: She = Mary
🟩 4. Popular Linguistically Annotated Corpora
Penn Treebank: PoS tags and phrase structure (CFG trees); used for syntax parsing.
Universal Dependencies (UD): Multilingual dependency annotation; used for cross-lingual NLP.
Brown Corpus: One of the first annotated corpora; used for general linguistic research.
OntoNotes: Syntax, semantics, coreference, and NER; used for semantic role labeling.
SemCor: Word sense disambiguation annotations; used for lexical semantics.
🟩 5. Example: Annotated Sentence (from Penn Treebank)
Raw Sentence:
"The quick brown fox jumps over the lazy dog."
Annotated with PoS:
[The/DT quick/JJ brown/JJ fox/NN]NP
[jumps/VBZ]VP
[over/IN the/DT lazy/JJ dog/NN]PP
Parse Tree (simplified):
(S
  (NP The/DT quick/JJ brown/JJ fox/NN)
  (VP jumps/VBZ
      (PP over/IN the/DT lazy/JJ dog/NN)))
🟩 6. Benefits in NLP
Improves model accuracy via supervised learning.
Provides ground truth for evaluation.
Enables complex tasks like:
o Coreference Resolution
o Semantic Role Labeling
o Machine Translation
🟩 7. How They're Created
Manual Annotation: By linguists or trained annotators.
Semi-Automated Tools: Annotators use tools like Brat, Prodigy,
WebAnno.
Crowdsourcing: Amazon Mechanical Turk for large datasets.
🟩 8. Tools to Use Annotated Corpora
NLTK (Python): Comes with corpora like Treebank, Brown
spaCy: Pretrained models from annotated corpora
Stanza: Accesses Universal Dependencies
1. Treebanks and Their Role in Syntactic Parsing
✅ What is a Treebank?
A treebank is a linguistically annotated corpus where each
sentence is paired with a syntactic parse tree that shows its
grammatical structure.
The parse tree is created using a grammar (usually a Context-Free
Grammar).
✅ Types of Treebanks:
Constituency Treebanks: Show how words group into constituents
(phrases).
Dependency Treebanks: Show head-dependent relations between
words.
✅ Examples:
Penn Treebank (most famous English treebank using phrase structure
trees)
Universal Dependencies (UD) for multiple languages using
dependency parsing
✅ Role in Syntactic Parsing:
Used to train and evaluate parsers.
Helps NLP models learn grammatical structures and patterns.
Crucial for applications like:
o Machine translation
o Grammar correction
o Question answering
2. Statistical Parsing vs. Probabilistic CFGs (PCFGs)
Definition: Statistical parsing is general parsing using machine-learned models; a PCFG is a CFG where each production rule has an associated probability.
Basis: Statistical parsing is data-driven and learned from treebanks; PCFGs are based on probabilistic rules derived from corpus frequencies.
Output: Both produce the most probable parse tree for a sentence; a PCFG derives it from rule probabilities.
Flexibility: Statistical parsers can use rich features and context; PCFGs are limited to rules and probabilities.
Examples: Statistical parsing – neural parsers, transition-based parsers; PCFG – the Inside-Outside algorithm, CKY with probabilities.
Use Case: Statistical parsing for real-time parsing in large applications; PCFGs for structured prediction in small grammar-based systems.
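A toy PCFG can be written and parsed with NLTK's ViterbiParser; in the sketch below the rule probabilities are invented and must sum to 1 for each left-hand side:

import nltk

# Toy probabilistic CFG; the probabilities are illustrative only.
pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [1.0]
VP -> V NP [1.0]
Det -> 'the' [0.6] | 'a' [0.4]
N -> 'dog' [0.5] | 'cat' [0.5]
V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse(['the', 'dog', 'chased', 'a', 'cat']):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability under the grammar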
3. Create Grammar Rules Using Context-Free Grammar (CFG)
✅ CFG Basics:
A Context-Free Grammar is defined by:
o A set of non-terminals (e.g., S, NP, VP)
o A set of terminals (e.g., words)
o Production rules (e.g., S → NP VP)
o A start symbol (usually S)
✅ Sample Grammar for a Subset of English:
S → NP VP
NP → Det N | Det Adj N | Pronoun
VP → V NP | V
Det → "a" | "the"
N → "dog" | "cat" | "boy"
Adj → "happy" | "angry"
V → "chased" | "saw"
Pronoun → "he" | "she"
✅ Example Sentence:
"the happy dog chased a cat"
✅ Parse Tree (Structure):
(S
  (NP (Det the) (Adj happy) (N dog))
  (VP (V chased)
      (NP (Det a) (N cat))))
✅ Parsing Using Python (with NLTK):
import nltk
from nltk import CFG
# Define the grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N | Pronoun
VP -> V NP | V
Det -> 'a' | 'the'
N -> 'dog' | 'cat' | 'boy'
Adj -> 'happy' | 'angry'
V -> 'chased' | 'saw'
Pronoun -> 'he' | 'she'
""")
# Create the parser
parser = nltk.ChartParser(grammar)
# Input sentence
sentence = ['the', 'happy', 'dog', 'chased', 'a', 'cat']
# Parse the sentence
for tree in parser.parse(sentence):
    tree.pretty_print()
UNIT 5
1. First-Order Logic (FOL) and Description Logics (DLs)
✅ First-Order Logic (FOL)
Definition: A formal system used to express statements with
quantifiers, predicates, and logical connectives.
Components:
o Constants: specific entities (e.g., John)
o Variables: general placeholders (e.g., x)
o Predicates: properties/relations (e.g., Loves(John, Mary))
o Quantifiers:
Universal: ∀x (for all)
Existential: ∃x (there exists)
o Logical connectives: AND (∧), OR (∨), NOT (¬), IMPLIES (→)
Importance in Semantic Analysis:
o Captures meaning using formal logic.
o Enables inference and reasoning.
o Used in question answering, knowledge representation,
and semantic parsing.
✅ Description Logics (DLs)
Definition: A subset of FOL focused on concepts (classes), roles
(relationships), and individuals.
Used primarily in ontology languages like OWL (Web Ontology
Language).
Example:
o Concept: Person
o Role: hasChild
o Assertion: Person ⊑ ∃hasChild.Person (every person has a child
who is also a person)
Importance in Semantic Analysis:
o Facilitates semantic web, ontology building, and knowledge
graphs.
o Balances expressiveness and computational tractability.
o Powers tools like reasoners (e.g., Pellet, HermiT) to infer new
knowledge.
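FOL formulas like the ones above can be written and inspected with NLTK's logic package; the sketch below only parses and prints formulas (actual theorem proving would need an external prover such as Prover9):

from nltk.sem import Expression

read_expr = Expression.fromstring

# A ground atom and a universally quantified implication
# (the implication mirrors the DL axiom Person ⊑ ∃hasChild.Person)
loves = read_expr('Loves(John, Mary)')
rule = read_expr('all x.(Person(x) -> exists y.(hasChild(x, y) & Person(y)))')

print(loves)          # Loves(John,Mary)
print(rule.free())    # set() – no free variables, so the formula is closed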
2. Report on Thematic Roles and Selectional Restrictions
✅ Thematic Roles (Semantic Roles)
Define the relationship between a verb and its arguments.
Common Roles:
o Agent: The doer of the action (John in “John kicked the ball”)
o Theme: The entity affected (the ball)
o Experiencer: One who feels or perceives (Mary in “Mary felt
cold”)
o Instrument: Means by which action is performed (knife in “cut
with a knife”)
o Location, Goal, Source, etc.
✅ Selectional Restrictions
Definition: Constraints that verbs place on their arguments based on
semantic compatibility.
Examples:
o eat expects an edible object: ✔️“eat an apple”, ❌“eat a table”
o drive expects a vehicle: ✔️“drive a car”, ❌“drive a banana”
✅ Importance:
Ensures grammatical and semantic validity.
Helps in word sense disambiguation.
Enhances machine understanding of meaning by filtering
implausible combinations.
3. Word Sense Disambiguation (WSD)
✅ Definition:
The process of identifying which sense (meaning) of a word is used
in a sentence when the word has multiple meanings.
✅ Example:
Word: “bank”
o “I deposited money at the bank.” → financial institution
o “He sat by the river bank.” → river edge
✅ WSD Approaches:
1. Knowledge-Based:
o Use dictionaries/ontologies like WordNet
o Example: Lesk Algorithm (overlap of dictionary definitions and context); see the sketch after this list.
2. Supervised Learning:
o Train classifiers (e.g., SVM, Naive Bayes) on labeled corpora.
o Requires annotated data.
3. Unsupervised Learning:
o Use clustering on word contexts.
o Doesn’t require labeled data.
4. Neural Approaches:
o Contextual embeddings (e.g., BERT, ELMo) capture word
meaning in context.
o Example: Fine-tuned BERT model on WSD datasets.
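The knowledge-based approach (1 above) can be tried with NLTK's implementation of the Lesk algorithm over WordNet (a sketch assuming the 'wordnet' and 'punkt' resources have been downloaded; Lesk is a rough heuristic, so the chosen senses are not always correct):

import nltk
# nltk.download('wordnet'); nltk.download('punkt')  # one-time
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent1 = word_tokenize("I deposited money at the bank.")
sent2 = word_tokenize("He sat by the river bank.")

print(lesk(sent1, "bank", "n"))   # a WordNet synset, ideally the financial-institution sense
print(lesk(sent2, "bank", "n"))   # ideally the sloping-land sense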
✅ Applications:
Machine Translation (choose correct word in target language)
Information Retrieval (more accurate search results)
Text Mining (correct extraction of entities or topics)
PoS Tagging
🟩 1. What is PoS Tagging?
Part-of-Speech (PoS) tagging is the process of assigning a grammatical
category (tag) to each word in a sentence based on its context.
✅ Example:
Sentence:
"The quick brown fox jumps over the lazy dog."
Tagged Output:
The/DT
quick/JJ
brown/JJ
fox/NN
jumps/VBZ
over/IN
the/DT
lazy/JJ
dog/NN
Here, NN = Noun, JJ = Adjective, VBZ = Verb (3rd person singular present),
etc.
🟩 2. Common PoS Tags (from Penn Treebank)
NN: Noun, singular (dog, table)
NNS: Noun, plural (cats, trees)
VB: Verb, base form (run, go)
VBD: Verb, past tense (ate, went)
VBG: Verb, gerund/present participle (running)
JJ: Adjective (quick, lazy)
RB: Adverb (quickly, silently)
IN: Preposition/subordinating conjunction (on, over)
DT: Determiner (the, a)
PRP: Personal pronoun (he, she)
🟩 3. Approaches to PoS Tagging
✅ A. Rule-Based Tagging
Uses handcrafted linguistic rules to determine tags.
Example: If a word ends in “-ing”, tag as VBG.
✅ Advantage: Transparent, interpretable
❌ Limitation: Not scalable, brittle
✅ B. Stochastic (Statistical) Tagging
Based on probability of tag sequences.
Uses models like:
o Hidden Markov Models (HMM)
o Maximum Entropy Models
Considers the likelihood of a tag given previous tags (n-grams).
Example:
If "can" is preceded by a noun and followed by a verb, it’s likely a modal
verb.
✅ C. Lexical/Dictionary-Based Tagging
Uses pre-tagged corpora and dictionaries.
Assigns the most frequent tag based on corpus data.
✅ D. Machine Learning and Deep Learning Methods
Use models like:
o CRF (Conditional Random Fields)
o RNNs / LSTMs
o Transformer-based models (BERT, RoBERTa)
Fine-tuned on large corpora (e.g., Universal Dependencies)
✅ High accuracy, context-aware tagging
❌ Require large datasets and computational resources
🟩 4. Importance of PoS Tagging
Syntax analysis: Helps in parsing and grammar checking
WSD (Word Sense Disambiguation): Aids in deciding meaning
NER (Named Entity Recognition): Improves entity recognition
Information Retrieval & Extraction: Enhances search precision
Speech Synthesis & Translation: Guides prosody and structure
🟩 5. Tools for PoS Tagging
NLTK (Python)
spaCy
Stanford NLP
Flair
BERT-based taggers
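Most of these tools expose tagging in a couple of lines; for example, spaCy gives both the coarse universal PoS (token.pos_) and the fine-grained Penn Treebank tag (token.tag_), assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(f"{token.text:<6} {token.pos_:<5} {token.tag_}")
# e.g. The DET DT, quick ADJ JJ, jumps VERB VBZ, ...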