Natural Language Processing (NLP) - Complete Comprehensive Notes
Table of Contents
1. Introduction to NLP
2. Components of NLP
3. Steps in NLP Processing
4. Applications of NLP
5. Word Structure and Morphology
6. NLTK and Text Processing
7. Syntax and Parsing
8. Ambiguity in NLP
9. Algorithms and Models
10. Knowledge Bottlenecks
11. Advanced Topics
Introduction to NLP
Natural Language Processing (NLP) is the study and engineering of
computational systems that can analyze, understand, generate, and
interact using human language. It bridges unstructured linguistic signals
—text and speech—and structured machine representations by
combining computational linguistics with statistical and neural machine
learning.
Key Objectives
Making computers learn, understand, analyze, manipulate and
interpret natural (human) languages
Enable human-computer interaction through natural language
Part of Computer Science, Human Languages/Linguistics, and
Artificial Intelligence
NLP Pipeline Architecture
Complete Pipeline Flow:
text
Raw Text → Normalization → Tokenization → Linguistic Analysis →
Representations → Model → Output
Detailed Examples for Each Stage:
Normalization: Lowercasing, Unicode NFKC
Tokenization: Whitespace splitting, WordPiece
Linguistic Analysis: POS tagging, dependency parsing
Representations: TF-IDF, BERT embeddings
Model: CRF, Transformer
Output: Labels, summary, translation
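As a minimal illustration of the stages above, the sketch below runs normalization, tokenization, POS tagging, and a simple bag-of-words representation using NLTK and the standard library. The example sentence is invented, the Model and Output stages are omitted, and the NLTK resources "punkt" and "averaged_perceptron_tagger" are assumed to be downloaded.
python
# A minimal, illustrative pass over the first pipeline stages.
import unicodedata
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

raw_text = "Cafe\u0301s in New York serve great coffee."

# Normalization: Unicode NFKC composition plus lowercasing
normalized = unicodedata.normalize("NFKC", raw_text).lower()

# Tokenization: rule-based word tokens
tokens = word_tokenize(normalized)

# Linguistic analysis: POS tagging
tagged = nltk.pos_tag(tokens)

# Representation: a simple bag-of-words count vector
bow = Counter(tokens)

print(tagged)
print(bow)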
Historical Evolution
1. Rule-based Systems: Grammars, lexicons
2. Statistical NLP: N-grams, HMMs/CRFs, PCFGs
3. Neural NLP: Attention-based Transformers, large-scale pretraining
Components of NLP
1. Natural Language Understanding (NLU)
Function: Transforms human language into machine-readable
format
Tasks: Extract keywords, emotions, relations, semantics
Complexity: Harder than NLG
2. Natural Language Generation (NLG)
Function: Converts computerized data into natural language
Components:
Text planning
Sentence planning
Text realization
Steps in NLP Processing
1. Lexical Analysis
Primary Phase: Scans text as character stream
Functions:
Converts characters into meaningful lexemes
Divides text into paragraphs, sentences, words
Key Concepts:
Lexeme: Basic unit of meaning (individual or multiword)
Examples:
Individual: "talk" → "talks", "talked", "talking"
Multiword: "speak up", "pull through"
2. Syntactic Analysis (Parsing)
Purpose: Check grammar and word arrangements
Functions:
Show relationships among words
Reject grammatically incorrect sentences
Example: "The school goes to boy" → Rejected by English
syntactic analyzer
3. Semantic Analysis
Focus: Literal meaning of words, phrases, sentences
Functions:
Meaning representation
Reject semantically incorrect phrases
Examples:
Rejects: "hot ice-cream"
Passes syntax but fails semantics: "Manhattan calls out to
Dave"
4. Discourse Integration
Context Dependency: Considers preceding and following
sentences
Example:
"Manhattan speaks to all its people"
"It calls out to Dave"
→ "It" refers to Manhattan
5. Pragmatic Analysis
Purpose: Re-interpret what was said vs. what was meant
Requirements: Real-world knowledge
Example: "Manhattan speaks to all its people" → Metaphor for
emotional connection
Applications of NLP
1. Sentiment Analysis (Opinion Mining)
Purpose: Identify emotional tone (positive, negative, neutral)
Applications: Customer sentiment, brand reputation
Data Sources: Emails, reviews, social media, surveys
Technology: Machine learning text analysis models
2. Machine Translation (MT)
Challenges:
No equivalent words across languages
Multiple word meanings
Idiom translation
Solutions: Corpus statistical and neural techniques
Example: Handling linguistic typology differences
3. Text Extraction
Information Types: Entity names, locations, quantities
Industries: Healthcare, finance, e-commerce
Benefits: Process unstructured data efficiently
4. Text Classification
Also Known As: Text tagging, text categorization
Process: Categorize text into organized groups
Business Value: Automated insights, process automation
5. Speech Recognition (ASR)
Alternative Names: Computer speech recognition, Speech-to-Text (STT)
Applications by Industry:
Automotive: Voice-activated navigation
Technology: Virtual assistants (Siri, Alexa, Google Assistant)
Healthcare: Medical dictation
Sales: Call center transcription, AI chatbots
6. Chatbots
Function: Automated conversation systems
Benefits: 24/7 customer service, quick assistance
Types: Pre-scripted or AI-generated responses
7. Email Filtering
Evolution: From simple spam filters to intelligent categorization
Example: Gmail's categorization (main, social, promotional)
8. Search Autocorrect and Autocomplete
Functions:
Suggest probable search keywords
Correct typing errors
Improve search accuracy
Word Structure and Morphology
Tokens and Tokenization
Tokens: Syntactic words with independent roles
Examples:
"newspaper" → compound word with derivational structure
"won't" → "will" + "not" (two syntactic words)
Lexemes and Lemmas
Lexeme: Set of alternative forms expressing same concept
Lemma: Citation form of lexeme
Operations:
Inflection: Convert word to other forms (mouse → mice)
Derivation: Transform to morphologically related lexeme
(receive → receiver, reception)
Morphemes
Definition: Minimal meaningful elements of words
Types:
Stems: Core meaning (play, cat, friend)
Affixes: Modify meaning
Prefixes: Precede stem (un-)
Suffixes: Follow stem (-ed, -s, -ly)
Morphological Processes
Inflectional Morphology
Purpose: Different forms of same word
Examples:
cat → cats
mouse → mice
Derivational Morphology
Purpose: Create new words from roots
Examples:
inter + national = international
international + ize = internationalize
internationalize + ation = internationalization
Morphophonemic Changes
Allomorphs: Alternative forms of morphemes
Examples: Plural morpheme
-s in "cats", "dogs"
-es in "dishes"
-en in "oxen"
Morphological Typology
1. Isolating/Analytic Languages
Characteristics: Few morphemes per word
Examples: Chinese, Vietnamese, Thai, English
2. Synthetic Languages
Agglutinative: One function per morpheme
Examples: Korean, Japanese, Finnish, Tamil
Fusional: Multiple functions per morpheme
Examples: Arabic, Czech, Latin, German
3. Word Formation Processes
Concatenative: Morphemes linked sequentially
Nonlinear: Structural components merge non-sequentially
NLTK and Text Processing
Introduction to NLTK
Natural Language Toolkit - Pedagogical and prototyping Python
library
Key Features:
Tokenizers, stemmers, lemmatizers
POS taggers, chunkers, parsers
Corpus readers, evaluation metrics
Access to corpora (Gutenberg, Brown, movie_reviews)
Basic NLTK Workflow
python
# Installation and setup
import nltk
nltk.download() # opens the interactive downloader; pass a name (e.g. "punkt") to fetch a single resource
1. Tokenization
Sentence Tokenization
python
import nltk
nltk.download("punkt")  # tokenizer models used by sent_tokenize and word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn.
"""
sentences = sent_tokenize(example_string)
Output:
text
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn."]
Word Tokenization
python
words = word_tokenize(example_string)
Output:
text
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training',
'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', ...]
2. Stop Words Filtering
python
import nltk
nltk.download("stopwords")  # stop word lists used below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)
stop_words = set(stopwords.words("english"))
filtered_list = []
for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_list.append(word)
Results:
Original: ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
Filtered: ['Sir', ',', 'protest', '.', 'merry', 'man', '!']
Content vs. Context Words
Content Words: Information about topics and sentiment
Context Words: Information about writing style
3. Stemming
python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
string_for_stemming = "The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do."
words = word_tokenize(string_for_stemming)
stemmed_words = [stemmer.stem(word) for word in words]
Stemming Results:
| Original Word | Stemmed Version |
|---------------|-----------------|
| 'Discovery' | 'discoveri' |
| 'discovered' | 'discov' |
| 'discoveries' | 'discoveri' |
| 'Discovering' | 'discov' |
4. Part-of-Speech (POS) Tagging
Parts of Speech Categories
| Part of Speech | Role | Examples |
|----------------|------|----------|
| Noun | Person, place, or thing | mountain, bagel, Poland |
| Pronoun | Replaces a noun | you, she, we |
| Adjective | Describes what a noun is like | efficient, windy, colorful |
| Verb | Action or state of being | learn, is, go |
| Adverb | Modifies verb, adjective, or adverb | efficiently, always, very |
| Preposition | Shows relationship between noun/pronoun and another word | from, about, at |
| Conjunction | Connects words or phrases | so, because, and |
| Interjection | Exclamation | yay, ow, wow |
python
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
sagan_quote = "If you wish to make an apple pie from scratch, you must first invent the universe."
words_in_sagan_quote = word_tokenize(sagan_quote)
pos_tags = nltk.pos_tag(words_in_sagan_quote)
Output:
text
[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'),
('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'),
(',', ','), ('you', 'PRP'), ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'),
('the', 'DT'), ('universe', 'NN'), ('.', '.')]
5. Lemmatization
python
import nltk
nltk.download("wordnet")  # WordNet data used by the lemmatizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Advanced Lemmatization:
python
lemmatizer.lemmatize("worst") # Output: 'worst'
lemmatizer.lemmatize("worst", pos="a") # Output: 'bad'
6. Chunking
Purpose: Identify phrases (groups of words functioning as single units)
Noun Phrase Examples
"A planet"
"A tilting planet"
"A swiftly tilting planet"
python
import nltk
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
tags = nltk.pos_tag(words_quote)
# Define chunk grammar
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
Output Tree Structure:
text
(S
It/PRP
's/VBZ
(NP a/DT dangerous/JJ business/NN)
,/,
Frodo/NNP
,/,
going/VBG
out/RP
your/PRP$
(NP door/NN)
./.)
7. Chinking
Purpose: Exclude patterns from chunks (opposite of chunking)
python
grammar = """
Chunk: {<.*>+}
}<JJ>{"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
8. Named Entity Recognition (NER)
python
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(tags)
Output:
text
(S
It/PRP
's/VBZ
a/DT
dangerous/JJ
business/NN
,/,
(PERSON Frodo/NNP)
,/,
going/VBG
out/RP
your/PRP$
door/NN
./.)
Binary NER (without entity type specification):
python
tree = nltk.ne_chunk(tags, binary=True)
Syntax and Parsing
Grammar Types in NLP
1. Context-Free Grammar (CFG)
Structure: Rules for forming well-structured sentences
Language Patterns: SVO (Subject-Verb-Object), SOV, OSV
2. Constituency Grammar (Phrase Structure
Grammar)
Focus: Phrase and clause structure
Examples: NP (Noun Phrase), PP (Prepositional Phrase), VP (Verb
Phrase)
3. Dependency Grammar
Focus: Grammatical relations between individual words
Structure: Network of relations rather than recursive structure
Features: Labeled relations between words
Parsing Process
Definition: Determining syntactic structure of text by analyzing
constituent words based on underlying grammar.
Example Grammar Rules
text
sentence → noun_phrase verb_phrase
noun_phrase → determiner noun
verb_phrase → verb noun_phrase
determiner → 'the', 'a', 'an'
noun → 'Tom', 'apple'
verb → 'ate'
Parse Tree Structure
Root: sentence
Non-terminals: noun_phrase, verb_phrase (intermediate nodes)
Terminals: 'Tom', 'ate', 'an', 'apple' (leaves)
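A short sketch of these rules in NLTK's CFG notation follows; one extra rule (NP → N) is added so the bare noun 'Tom' can form a noun phrase on its own, which the rules above leave implicit.
python
import nltk

# The toy rules above in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N | N
VP -> V NP
DT -> 'the' | 'a' | 'an'
N -> 'Tom' | 'apple'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["Tom", "ate", "an", "apple"]):
    tree.pretty_print()  # prints the parse tree with the sentence symbol at the root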
Syntactic Structure Representations
1. Dependency Parsing
Philosophy: Connect head word with dependents in phrase
Structure: Directed (asymmetric) connections
Components:
Words as vertices
Directed arcs as binary relations (head to dependent)
Each word depends on exactly one parent
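For a quick look at labeled head-dependent arcs in practice, the following sketch uses spaCy rather than NLTK; it assumes spaCy and its small English model en_core_web_sm are installed.
python
# Labeled head-dependent arcs with spaCy (not part of NLTK).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tom ate an apple")

# Each word has exactly one head, reached via a labeled, directed arc.
for token in doc:
    print(f"{token.text:8} --{token.dep_}--> {token.head.text}")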
2. Phrase Structure Parsing
Approach: Partition sentences into constituents
Method: Recursive partitioning into phrases (NP, VP, PP)
Traditional: Derives from sentence diagrams
Constituency Tree vs. Dependency Tree
Dependency structures explicitly
represent:
Head-dependent relations (directed arcs)
Functional categories (arc labels)
Possibly some structural categories (POS)
Phrase structure explicitly represents:
Phrases (non-terminal nodes)
Structural categories (non-terminal labels)
Possibly some functional categories (grammatical functions)
Treebanks: Data-Driven Approach
Definition: Linguistically annotated corpus including syntactic analysis
beyond POS tagging
Key Features:
Collection of sentences with complete syntax analysis
Human expert judgment for most plausible analysis
Consistent treatment across related grammatical phenomena
No explicit grammar rules provided
Benefits:
1. Solves Grammar Problem: Syntactic analysis directly given
2. Solves Probability Problem: Supervised learning for scoring
functions
Analysis Types:
Dependency Analysis: Favored for free word order languages
(Czech, Turkish)
Phrase Structure Analysis: Used for long-distance dependencies
(English, French)
Parsing Algorithms
Key Concepts:
Derivation: Sequence of steps to derive string from grammar
Sentential Form: Each line in derivation sequence
Rightmost Derivation: Expand rightmost nonterminal at each
step
Algorithm Types:
CKY Parsing: Requires CNF grammar, fills triangular chart bottom-up
Earley Parsing: Handles arbitrary CFGs with dotted rules
Neural Parsers: Use encoders for contextual token vectors
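The toy CKY recognizer sketched below (grammar and sentence invented, assumed already in Chomsky Normal Form) shows the bottom-up chart filling; a real parser would also keep back-pointers to recover trees and rule probabilities to rank them.
python
# A bare-bones CKY recognizer for a toy CNF grammar.
from itertools import product

binary_rules = {("NP", "VP"): {"S"}, ("DT", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexical_rules = {"Tom": {"NP"}, "ate": {"V"}, "an": {"DT"}, "apple": {"N"}}

def cky_recognize(words):
    n = len(words)
    # chart[i][j] holds the nonterminals that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, set()))
    for span in range(2, n + 1):              # fill the chart bottom-up by span length
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # try every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return "S" in chart[0][n]

print(cky_recognize("Tom ate an apple".split()))  # True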
Ambiguity in NLP
Core Challenge: Ambiguity drives the fundamental difficulty in
language understanding
Types of Ambiguity
1. Lexical Ambiguity
Definition: Word has multiple senses
Example: "bank" = financial institution vs. river edge
Resolution: Word Sense Disambiguation using context and world
knowledge
2. Syntactic Ambiguity
Definition: Same sequence yields multiple parse trees
Example: "I saw the man with a telescope"
Did seeing happen with telescope?
Did the man have the telescope?
Resolution: Probabilistic or neural parsers score structures
3. Semantic Ambiguity
Definition: Sentence-meaning level ambiguity
Example: "Visiting relatives can be boring"
Act of visiting them is boring
Relatives who visit are boring
Requirements: Selectional preferences and event structure
4. Pragmatic Ambiguity
Definition: Depends on context, intention, social norms
Example: "Can you pass the salt?" (request, not ability query)
Resolution: Speech act recognition
5. Referential/Anaphoric Ambiguity
Definition: Pronouns/descriptions have multiple candidates
Example: "Alice told Jane that she would win" (who is "she"?)
Resolution: Coreference resolution and entity tracking
Mathematical Approaches to Ambiguity
Bayesian Scoring
For syntactic ambiguity: P(T|x) ∝ P(x|T)P(T)
Viterbi Algorithm
Used in HMMs for optimal tag sequence selection
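A compact Viterbi sketch for a two-tag toy HMM is shown below; the transition and emission probabilities are invented purely for illustration, whereas a real tagger estimates them from a treebank.
python
# Viterbi decoding over a toy two-state HMM.
import math

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"fish": 0.6, "swim": 0.4},
          "VERB": {"fish": 0.3, "swim": 0.7}}

def viterbi(words):
    # V[t][s] = best log-probability of any tag sequence ending in state s at position t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p[s][words[t]])
            back[t][s] = prev
    # Trace back-pointers from the best final state
    best = max(V[-1], key=V[-1].get)
    tags = [best]
    for t in range(len(words) - 1, 0, -1):
        best = back[t][best]
        tags.append(best)
    return list(reversed(tags))

print(viterbi(["fish", "swim"]))  # ['NOUN', 'VERB'] with these toy numbers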
Persistent Challenges
Rare senses
Long-distance dependencies
Idioms and sarcasm
Under-specified references
Need for external knowledge and discourse modeling
Algorithms and Models
Algorithmic Families by Task and Data
1. Rule-Based Methods
Components: Finite-state transducers, handcrafted grammars
Advantages: Interpretability, precision in constrained domains
Applications: Tokenization, morphology, parsing
2. Statistical Learning
Models:
N-gram language models
HMMs and CRFs (tagging, segmentation)
PCFGs (parsing with chart algorithms)
Decoding: Dynamic programming
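As a small example of the statistical family, the sketch below builds an add-one (Laplace) smoothed bigram language model over an invented three-sentence corpus; real systems use much larger corpora and better smoothing.
python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]
sents = [["<s>"] + line.split() + ["</s>"] for line in corpus]

unigrams = Counter(w for sent in sents for w in sent)
bigrams = Counter(pair for sent in sents for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))  # 0.3 : frequently seen after "the"
print(bigram_prob("cat", "dog"))  # ~0.11 : unseen, but non-zero thanks to smoothing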
3. Neural Models
Evolution:
RNNs and LSTMs (sequences)
CNNs (character/subword features)
Attention mechanisms (long-range dependencies)
Transformer architecture (self-attention only)
Knowledge Bottlenecks
The Knowledge Gap
Definition: Gap between what model encodes vs. what robust language
understanding requires
Problem Areas:
Text alone lacks full commonsense and world knowledge
Labeled data is costly and uneven across domains/languages
Distribution shift causes failures when test differs from training
Classic Error Examples:
Pronoun Resolution: "The trophy doesn't fit in the suitcase
because it is too small" (it = suitcase)
Temporal Reasoning: Understanding time relationships
Spatial Relations: Understanding spatial concepts
Procedural Knowledge: Understanding processes
Strategies to Address Bottlenecks:
1. Pretraining
Massive corpora for linguistic and factual regularities
2. Retrieval Augmentation
Ground generation in up-to-date sources
3. Knowledge Integration
Structured knowledge graphs (Wikidata)
Differentiable memory systems
4. Weak Supervision
Data programming to expand coverage
5. Instruction Tuning
Preference optimization for better task intent following
Advanced Topics
Word-Level Analysis
Text Normalization (Task-Dependent
Decisions):
Lowercasing: Affects proper nouns
Unicode Normalization: NFKC harmonizes similar characters
Punctuation Handling: Differs between IR and sentiment tasks
Number Handling: Map digits to placeholders
Contractions: Expansion improves parsing
Stopword Removal: Reduces noise for bag-of-words, may harm
generation
Stemming vs. Lemmatization:
Stemming: Aggressive suffix chopping (may over-stem)
Lemmatization: Uses vocabulary and POS for accurate lemmas
Tokenization Approaches:
Whitespace → Rule-based → Subword (BPE/WordPiece)
Trade-offs: OOV handling vs. morphological fidelity
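To make the subword idea concrete, here is a toy sketch of the BPE training loop (repeatedly merge the most frequent adjacent symbol pair); the word frequencies are classic illustrative counts, not real corpus statistics.
python
from collections import Counter

# word (as a tuple of symbols, with an end-of-word marker) -> frequency
vocab = {("l", "o", "w", "</w>"): 5,
         ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6,
         ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # a handful of merges; real vocabularies use tens of thousands
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print("merged:", pair)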
Edit Distance
Levenshtein Distance:
Counts insertions, deletions, substitutions (unit costs)
Example: "kitten" → "sitting" = 3 edits
1. k → s (substitute)
2. e → i (substitute)
3. Insert g at end
Damerau-Levenshtein Distance:
Adds transposition of adjacent characters
Applications:
Spell correction
DNA sequence comparison
String similarity tasks
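A standard dynamic-programming sketch of Levenshtein distance with unit costs, reproducing the kitten → sitting example above:
python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3, matching the example above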
Spelling Correction
Two-Phase Process:
1. Detection: Is token erroneous?
2. Correction: Which candidate is intended?
Error Types:
Nonword Errors: Easy detection with lexicon
Real-word Errors: Require contextual modeling ("peace" vs
"piece")
Candidate Generation Methods:
Edit neighborhoods (distance 1-2)
Keyboard adjacency graphs
Phonetic hashing (Soundex, Metaphone)
Morphological variants
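A short sketch of edit-distance-1 candidate generation in the style of Norvig's well-known spell corrector; the lexicon here is a tiny stand-in for a real word list.
python
import string

lexicon = {"peace", "piece", "speak", "spell", "spelling"}

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    # keep only candidates that are actual lexicon entries
    return edits1(word) & lexicon

print(candidates("peice"))  # {'piece', 'peace'} - context must pick between them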
POS Tagging Details
Word Classes:
Open Classes: Nouns, verbs, adjectives, adverbs
Closed Classes: Prepositions, determiners, conjunctions,
pronouns, particles
Tagset Examples:
Penn Treebank: NN, NNS, NNP, VB, VBD, VBG, JJ, RB, IN, DT, PRP
Universal Dependencies: NOUN, VERB, ADJ, ADV, ADP, PRON,
DET
Applications:
Supporting parsing
Lemmatization (verb vs noun lemmas)
Downstream tasks like NER
Modern System Patterns
1. Retrieval-Augmented Generation (RAG)
Dense retrievers fetch knowledge
Generate grounded responses
2. Tool-Use Agents
Combine language models with deterministic tools
Ensure reliable actions
Evaluation Considerations
Task-Specific Metrics:
Classification/Extraction: Accuracy, F1-score
Generation: BLEU, ROUGE, BERTScore
Production: Human preference, task success metrics, latency,
cost
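A quick worked example of precision, recall, and F1 for a toy extraction task (the gold and predicted entity sets are invented for illustration):
python
gold = {"Paris", "Alice", "Acme Corp"}
pred = {"Paris", "Alice", "London"}

tp = len(gold & pred)                      # true positives
precision = tp / len(pred)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67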
Deployment Considerations:
Multilingual coverage
Fairness and bias audits
Privacy for sensitive data
Robustness to adversarial prompts/noisy inputs
Summary
This comprehensive guide covers all essential aspects of Natural
Language Processing, from basic concepts and preprocessing techniques
to advanced parsing algorithms and modern neural approaches. The field
continues to evolve rapidly, with current research focusing on large
language models, multimodal systems, and addressing the persistent
challenges of ambiguity, knowledge integration, and robust
understanding across diverse domains and languages.
The integration of statistical methods with neural architectures,
combined with massive pretraining and retrieval-augmented approaches,
represents the current state-of-the-art, while traditional rule-based and
statistical methods remain important for specific applications and
understanding fundamental principles.