CCS369 TEXT AND SPEECH ANALYSIS L T P C
2 0 2 3
COURSE OBJECTIVES:
Understand natural language processing basics
Apply classification algorithms to text documents
Build question-answering and dialogue systems
Develop a speech recognition system
Develop a speech synthesizer
UNIT I NATURAL LANGUAGE BASICS 6
Foundations of natural language processing – Language Syntax and
Structure- Text Preprocessing and Wrangling – Text tokenization –
Stemming – Lemmatization – Removing stop-words – Feature Engineering
for Text representation – Bag of Words model- Bag of N-Grams model –
TF-IDF model
UNIT II TEXT CLASSIFICATION 6
Vector Semantics and Embeddings -Word Embeddings - Word2Vec model
– Glove model – FastText model – Overview of Deep Learning models –
RNN – Transformers – Overview of Text summarization and Topic Models
UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS 9
Information retrieval – IR-based question answering – knowledge-based
question answering – language models for QA – classic QA models –
chatbots – Design of dialogue systems – evaluating dialogue systems
UNIT IV TEXT-TO-SPEECH SYNTHESIS 6
Overview. Text normalization. Letter-to-sound. Prosody, Evaluation.
Signal processing - Concatenative and parametric approaches, WaveNet
and other deep learning-based TTS systems
UNIT V AUTOMATIC SPEECH RECOGNITION 6
Speech recognition: Acoustic modelling – Feature Extraction - HMM, HMM-DNN
systems
30 PERIODS
UNIT I NATURAL LANGUAGE BASICS
Foundations of natural language processing – Language Syntax and Structure- Text
Preprocessing and Wrangling – Text tokenization – Stemming – Lemmatization – Removing
stop-words – Feature Engineering for Text representation – Bag of Words model- Bag of N-
Grams model – TF-IDF model.
INTRODUCTION TO LANGUAGE ANALYSIS
What is Text and Speech Analysis? [2M]
Text and speech analysis is the process of examining written words (text) and spoken
words (speech) to understand their meaning, emotions, and patterns. It is used in
applications like chatbots, voice assistants, sentiment analysis, and fraud detection.
How Do We Analyze Words and Speech? [2M]
1. Text Analysis (Written Language)
o Computers scan words, sentences, and paragraphs to find meaning,
emotions, and key topics.
o Example: A company reads customer reviews to see if people are happy
or unhappy with a product.
2. Speech Analysis (Spoken Language)
o Computers listen to how we speak—our words, tone, and speed—to
understand what we mean.
o Example: Voice assistants (like Siri or Alexa) recognize our voice
commands and respond.
Why is Text and Speech Analysis Important? [2M]
Text and speech analysis helps us make sense of words—whether written or
spoken—to understand people’s emotions, opinions, and trends. It has many
real-world applications that improve businesses, healthcare, security, and
accessibility.
1.1 Foundations of Natural Language Processing (NLP)
1. Introduction to NLP
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that
focuses on enabling computers to understand, interpret, and generate human language.
NLP bridges linguistics and computer science to facilitate seamless human-computer
interactions through text and speech.
1.1 Importance of NLP
NLP is crucial for various real-world applications, including:
Machine Translation: Google Translate, DeepL.
Speech Recognition: Siri, Alexa, Google Assistant.
Text Summarization: Automated news summarization.
Sentiment Analysis: Understanding customer reviews and social media trends.
Chatbots & Virtual Assistants: Automating customer support services.
NLP leverages computational linguistics, machine learning (ML), and deep learning
(DL) to process large volumes of natural language data.
2. Core Components of NLP
2.1 Text Processing Techniques
1. Tokenization: Splitting text into smaller units, typically words or sentences.
For example:
o Input: "Natural Language Processing is exciting!"
o Tokenized: ["Natural", "Language", "Processing", "is", "exciting", "!"]
2. Stemming & Lemmatization: Reducing words to their root forms.
o Stemming (Porter Stemmer): "running" → "run"
o Lemmatization (WordNet Lemmatizer): "better" → "good"
3. Stopword Removal: Removing common words such as "the," "is," and "and."
o Example: "The cat is sitting on the mat." → ["cat", "sitting", "mat"]
4. Part-of-Speech (POS) Tagging: Assigning grammatical categories (noun,
verb, adjective, etc.).
o Example: "The quick brown fox jumps over the lazy dog."
o POS Tags: [("The", DET), ("quick", ADJ), ("brown", ADJ), ("fox",
NOUN), ("jumps", VERB), ...]
2.2 Syntax and Parsing
Dependency Parsing: Establishes relationships between words.
Constituency Parsing: Breaks sentences into hierarchical structures using
syntax trees.
2.3 Semantic Analysis
Word Sense Disambiguation (WSD): Differentiating word meanings based on
context (e.g., "bank" as a financial institution vs. riverbank).
Named Entity Recognition (NER): Identifying proper nouns like names of
people, places, and organizations (e.g., "Barack Obama" → PERSON).
Sentiment Analysis: Classifying text as positive, negative, or neutral based on
emotions.
3. Speech Processing in NLP
Speech processing involves techniques to convert spoken language into text and vice
versa.
3.1 Automatic Speech Recognition (ASR)
Converts speech into text using models like Hidden Markov Models (HMMs)
and deep neural networks.
Example: Dictation software and voice assistants.
3.2 Text-to-Speech (TTS) Synthesis
Converts text into spoken words using models like Tacotron and WaveNet.
Used in audiobooks and voice response systems.
3.3 Prosody Analysis
Studies pitch, rhythm, and intonation to enhance naturalness in synthesized
speech.
4. Machine Learning & Deep Learning in NLP
4.1 Traditional Machine Learning Methods
1. Naïve Bayes Classifier: Used for spam filtering.
2. Support Vector Machines (SVMs): Effective for text classification.
3. Hidden Markov Models (HMMs): Applied in speech recognition and POS
tagging.
4.2 Deep Learning-Based NLP
1. Recurrent Neural Networks (RNNs) & LSTMs: Handle sequential data for
tasks like machine translation and speech recognition.
2. Transformers (BERT, GPT-4):
o BERT (Bidirectional Encoder Representations from
Transformers): Pre-trained model for context-aware language
understanding.
o GPT (Generative Pre-trained Transformer): Used for text generation
and conversational AI.
5. Challenges in NLP
1. Ambiguity: Words with multiple meanings (e.g., "apple" as fruit or company).
2. Data Sparsity: Insufficient labeled data for low-resource languages.
3. Context Understanding: Difficulty in grasping sarcasm, idioms, and
metaphors.
4. Bias in NLP Models: AI models may inherit biases from training data.
5. Multilingual Processing: Challenges in handling various languages and
dialects.
6. Future Trends in NLP
1. Few-shot and Zero-shot Learning: Training models with minimal examples.
2. Multimodal NLP: Combining text with images, video, and audio.
3. Explainable AI in NLP: Enhancing transparency in language models.
4. Conversational AI Evolution: Improvements in human-like chatbot
interactions.
The field of NLP is evolving rapidly with advancements in deep learning, enabling
more accurate and human-like language understanding. From text processing to speech
recognition and machine learning, NLP continues to impact various industries,
including healthcare, finance, and customer service. Future research aims to address
challenges like bias and context understanding while improving the efficiency and
accessibility of NLP applications.
1.2 Natural Language Processing (NLP) Pipeline
Natural Language Processing (NLP) is a branch of artificial intelligence that enables
computers to understand and process human language. It involves various
components that help break down, analyze, and extract meaning from text and speech.
Below are the key components of NLP explained in detail.
1️. Tokenization: Breaking Text into Smaller Units
Definition:
Tokenization is the process of splitting text into smaller units called tokens, which can
be words, phrases, or sentences.
Types of Tokenization:
Word Tokenization – Splitting text into individual words.
o Example: "ChatGPT helps analyze text." → ["ChatGPT", "helps",
"analyze", "text", "."]
Sentence Tokenization – Dividing text into sentences.
o Example: "AI is powerful. It is changing industries." → ["AI is
powerful.", "It is changing industries."]
Applications:
✔ Search Engines – Google tokenizes search queries to retrieve relevant results.
✔ Chatbots – AI assistants use tokenization to understand user messages.
✔ Text Summarization – Helps break down large texts into smaller sections.
2️. Morphological Analysis: Stemming & Lemmatization
Definition:
Morphological analysis studies the structure of words and converts them into their
base form.
Two Key Processes:
🔹 Stemming – Removes prefixes and suffixes from words to get the root form.
Example: "Running" → "Run", "Happily" → "Happi"
Algorithmic approach, sometimes leading to incorrect word forms.
🔹 Lemmatization – Converts words to their proper base form using a dictionary-
based approach.
Example: "Running" → "Run", "Better" → "Good"
More accurate than stemming, used in AI writing assistants.
Applications:
✔ Search Optimization – Ensures search results include variations of a word (e.g.,
"run" and "running").
✔ AI Writing Tools – Grammarly uses lemmatization for grammar correction.
✔ Text Classification – Improves categorization of emails, reviews, and documents.
3️. Part-of-Speech (POS) Tagging: Assigning Grammatical Roles
Definition:
POS tagging assigns words into categories like nouns, verbs, adjectives, etc., to
understand sentence structure.
Example:
"The quick brown fox jumps over the lazy dog."
"Fox" (Noun)
"Jumps" (Verb)
"Quick" (Adjective)
Applications:
✔ Grammar Correction Tools – Used in tools like Grammarly.
✔ Machine Translation – Google Translate uses POS tagging for accurate
translations.
✔ Speech Recognition – Helps convert spoken words into structured text.
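A minimal POS-tagging sketch using NLTK is shown below; it assumes the nltk package is installed and the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded, and the exact tags can vary slightly with the tagger model.
Program
import nltk
from nltk import pos_tag, word_tokenize
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(pos_tag(tokens))
Sample output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]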
4️. Named Entity Recognition (NER): Identifying Important Names
Definition:
NER detects proper nouns such as names, locations, organizations, dates, and more
within text.
Example:
Sentence: "Elon Musk founded SpaceX in 2002."
Person: Elon Musk
Organization: SpaceX
Date: 2002
Applications:
✔ News & Financial Analysis – Extracts important entities from articles and reports.
✔ Customer Service – Identifies customer names and locations for better assistance.
✔ Healthcare Records – Detects patient names, medications, and conditions in
medical reports.
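A minimal NER sketch using spaCy (this assumes the en_core_web_sm model is installed; the entity labels depend on the model used):
Program
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002.")
for ent in doc.ents:
    print(ent.text, "→", ent.label_)
Expected output (approximately):
Elon Musk → PERSON
SpaceX → ORG
2002 → DATE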
5️. Parsing (Syntactical Analysis): Understanding Sentence Structure
Definition:
Parsing analyzes how words are structured in a sentence and how they relate to each
other.
Types of Parsing:
Dependency Parsing – Determines relationships between words.
Constituency Parsing – Breaks sentences into hierarchical structures.
Example:
"She bought a new car."
"She" → Subject
"Bought" → Verb
"Car" → Object
Applications:
✔ Voice Assistants – Helps Siri and Alexa understand complex commands.
✔ Legal Document Processing – Analyzes contracts and legal texts for clarity.
✔ Language Learning Apps – Helps users understand sentence structure.
Eg: Sentence – I drive a Car to my College.
6️. Semantic Analysis: Understanding Meaning in Context
Definition:
Semantic analysis focuses on understanding the meaning and relationships between
words in a sentence.
Key Concepts:
Word Sense Disambiguation – Determines the correct meaning of a word
based on context.
o Example: "I saw a bat at night." – (Bat = animal, not baseball bat)
Sentiment Analysis – Identifies emotions (happy, sad, angry) in text.
o Example: "The movie was amazing!" (Positive sentiment)
Applications:
✔ Chatbots & AI Assistants – Detects intent behind user messages.
✔ Social Media Monitoring – Brands analyze tweets and reviews to measure public
opinion.
✔ Customer Feedback Analysis – Detects dissatisfaction in product reviews.
7️. Coreference Resolution: Linking Words to Their Meaning
Definition:
Coreference resolution identifies when different words refer to the same entity within
a text.
Example:
"John bought a car. He loves it."
"He" refers to John.
"It" refers to the car.
Applications:
✔ Text Summarization – Ensures clarity in automated summaries.
✔ AI Story Comprehension – Helps AI understand narratives in books and articles.
✔ Question-Answering Systems – Enhances responses by identifying references.
Each component of NLP plays a crucial role in enabling AI to process human language
effectively. From breaking text into words (tokenization) to detecting emotions
(semantic analysis), these techniques help power real-world applications like search
engines, chatbots, virtual assistants, and machine translation tools. As NLP
technology advances, it continues to revolutionize how humans interact with machines,
making communication more natural and efficient.
1️.3️ Types of Natural Language Processing
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that
enables machines to understand, interpret, and generate human language. NLP can be
categorized into different types based on its applications and functions.
1. Syntax-Based NLP
Syntax refers to the structure of sentences, including word order and grammatical
relationships.
1.1. Part-of-Speech (POS) Tagging
Identifies the grammatical category of each word in a sentence (noun, verb,
adjective, etc.).
Example: "The dog runs fast."
o POS tags:
The (Determiner)
dog (Noun)
runs (Verb)
fast (Adverb)
1.2. Parsing (Syntactic Analysis)
Analyzes the grammatical structure of sentences using syntax trees.
Example: "The boy eats an apple."
Tree Structure: The boy eats an apple
                S
          ______|______
         NP            VP
        /  \          /  \
      Det   N        V    NP
       |    |        |   /  \
      The  boy     eats Det   N
                         |    |
                         an  apple
S (Sentence): The root of the tree, representing the entire sentence.
NP (Noun Phrase): Contains a Determiner (Det) and a Noun (N).
"The boy" → "The" (Det) + "boy" (N)
VP (Verb Phrase): Contains a Verb (V) and another Noun Phrase (NP).
"eats an apple" → "eats" (V) + "an apple" (NP)
Second NP (Noun Phrase): Contains Determiner (Det) + Noun (N).
"an apple" → "an" (Det) + "apple" (N)
1.3. Sentence Segmentation
Splits text into meaningful sentences.
Example:
o Input: "Hello! How are you? I’m fine."
o Output: ["Hello!", "How are you?", "I’m fine."]
1.4. Word Tokenization
Splits sentences into words.
Example:
o Input: "NLP is amazing!"
o Output: ["NLP", "is", "amazing", "!"]
2. Semantics-Based NLP
Semantics refers to the meaning of words, phrases, and sentences.
2.1. Named Entity Recognition (NER)
Identifies proper names in text (people, places, organizations, dates, etc.).
Example:
o Input: "Elon Musk founded SpaceX in 2002."
o Output: {"Person": "Elon Musk", "Organization": "SpaceX", "Date":
"2002"}
2.2. Word Sense Disambiguation (WSD)
Determines the correct meaning of a word based on context.
Example: "He went to the bank to withdraw money."
o The word "bank" refers to a financial institution, not a riverbank.
2.3. Semantic Role Labeling (SRL)
Identifies the roles of words in a sentence (who did what to whom).
Example:
o Input: "John gave Mary a book."
o Output: {Agent: "John", Action: "gave", Recipient: "Mary", Object:
"book"}
2.4. Coreference Resolution
Identifies when different words refer to the same entity.
Example: "John said he is happy."
o Output: {John = he}
3. Pragmatics-Based NLP
Pragmatics deals with meaning in context, considering speaker intentions and
conversational flow.
3.1. Sentiment Analysis (Opinion Mining)
Determines the sentiment (positive, negative, neutral) in text.
Example:
o Input: "The movie was fantastic!"
o Output: Positive
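One simple, lexicon-based way to score sentiment is NLTK's VADER analyzer. The sketch below is illustrative only (it assumes the vader_lexicon resource has been downloaded, and the scores shown are approximate):
Program
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was fantastic!"))
# e.g. {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.6}  → positive overall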
3.2. Text Summarization
Generates a short summary of a longer document.
Example:
o Input: "Artificial intelligence is transforming industries by automating
tasks, improving efficiency, and enabling new technologies."
o Output: "AI is revolutionizing industries through automation and
efficiency."
3.3. Speech Acts Analysis
Analyzes intent behind sentences (request, command, question, etc.).
Example: "Could you open the window?"
o Intent: Request
3.4. Conversational AI (Chatbots & Virtual Assistants)
Enables human-like interactions with machines.
Example: ChatGPT, Siri, Alexa
4. Discourse-Based NLP
Discourse analysis focuses on understanding language beyond individual sentences.
4.1. Topic Modeling
Identifies topics from large text datasets.
Example: Analyzing news articles and identifying topics like politics, sports,
technology, health.
4.2. Text Coherence and Cohesion Analysis
Ensures logical connections between sentences.
Example:
o "Tom bought a car. The vehicle is red."
o Recognizing that "the vehicle" refers to "the car".
4.3. Question Answering (QA)
Extracts direct answers from text.
Example:
o Question: "Who founded Microsoft?"
o Answer: "Bill Gates and Paul Allen."
5. Speech and Audio-Based NLP
5.1. Speech Recognition (Automatic Speech Recognition - ASR)
Converts spoken language into text.
Example: Google Voice, Siri, Dictation software
5.2. Text-to-Speech (TTS)
Converts text into human-like speech.
Example: AI-generated voice assistants
5.3. Emotion Recognition in Speech
Detects emotions from voice signals.
Example: "I am very happy!" → Emotion: Joy
6. Application-Based NLP
6.1. Machine Translation (MT)
Translates text between languages.
Example: Google Translate, DeepL
6.2. Optical Character Recognition (OCR)
Converts scanned images of text into machine-readable text.
Example: Scanning handwritten documents and converting them into text
6.3. Fake News Detection
Identifies misinformation using NLP techniques.
Example: Detecting AI-generated vs. real news articles.
6.4. Plagiarism Detection
Finds duplicate or copied content.
Example: Turnitin, Copyscape
NLP is a vast field covering everything from basic syntax processing to advanced
AI-powered applications. Whether analyzing sentence structure, understanding
meaning, or enabling human-like interactions, NLP is transforming how machines
process human language.
1.4 Types of NLP
Syntax-Based NLP
NLP (Natural Language Processing)
├── 1. Syntax-Based NLP (Structure)
│ ├── POS Tagging
│ ├── Parsing (Syntactic Analysis)
│ ├── Sentence Segmentation
│ ├── Word Tokenization
├── 2. Semantics-Based NLP (Meaning)
│ ├── Named Entity Recognition (NER)
│ ├── Word Sense Disambiguation (WSD)
│ ├── Semantic Role Labeling (SRL)
│ ├── Coreference Resolution
├── 3. Pragmatics-Based NLP (Context & Intent)
│ ├── Sentiment Analysis
│ ├── Text Summarization
│ ├── Speech Acts Analysis
│ ├── Conversational AI (Chatbots, Assistants)
├── 4. Discourse-Based NLP (Text Flow)
│ ├── Topic Modeling
│ ├── Text Coherence & Cohesion Analysis
│ ├── Question Answering (QA)
├── 5. Speech & Audio-Based NLP (Spoken Language)
│ ├── Speech Recognition (ASR)
│ ├── Text-to-Speech (TTS)
│ ├── Emotion Detection in Speech
└── 6. Application-Based NLP (Practical Uses)
├── Machine Translation (MT)
├── Optical Character Recognition (OCR)
├── Fake News Detection
├── Plagiarism Detection
1.4 Levels of Natural Language Processing
Natural Language Processing (NLP) consists of multiple levels, each handling different
aspects of language processing. These levels help machines understand and generate
human language more effectively.
1. Phonetic & Phonological Level (Sound Processing)
This level deals with speech sounds and their patterns, mainly relevant in speech
recognition and text-to-speech (TTS) systems.
Example:
"Hello" → /həˈloʊ/ (Phonetic transcription)
Recognizing accents and pronunciation differences.
Applications:
Speech-to-Text (STT)
Voice Assistants (Siri, Alexa)
Emotion Detection in Speech
2. Morphological Level (Word Formation)
Morphology studies how words are formed from smaller units called morphemes
(the smallest meaning-bearing units of language).
Example:
"Unhappiness" → "un-" (prefix) + "happy" (root) + "-ness" (suffix)
"Running" → "run" + "-ing" (present continuous)
Applications:
Lemmatization (convert "running" → "run")
Stemming (reduce words to their base form, e.g., "happily" → "happy")
Spell Checking
3. Lexical Level (Word Meaning)
This level focuses on individual words and their meanings. It involves word sense
disambiguation (WSD) and POS tagging.
Example:
"Bank" → Can mean a financial institution or a riverbank (WSD helps
determine the correct meaning).
POS Tagging:
o "The dog runs fast."
o The (Det) | dog (N) | runs (V) | fast (Adv)
Applications:
Named Entity Recognition (NER)
Synonym & Antonym Identification
Word Sense Disambiguation
4. Syntactic Level (Grammar & Sentence Structure)
This level analyzes the grammatical structure of sentences. It ensures proper
sentence construction using syntax rules.
Example (Parsing a Sentence):
Sentence: "The boy eats an apple."
                S
          ______|______
         NP            VP
        /  \          /  \
      Det   N        V    NP
       |    |        |   /  \
      The  boy     eats Det   N
                         |    |
                         an  apple
Applications:
Grammar Checkers (e.g., Grammarly)
Sentence Parsing
POS Tagging
5. Semantic Level (Meaning of Sentences)
This level deals with the meaning of words and sentences, considering context and
relationships.
Example:
"He went to the bank."
o Is it a financial bank or riverbank? (Semantic analysis helps
determine this.)
Semantic Role Labeling (SRL):
o "John gave Mary a book."
o {Agent: "John", Action: "gave", Recipient: "Mary", Object: "book"}
Applications:
Named Entity Recognition (NER)
Text Summarization
Machine Translation
6. Pragmatic Level (Context & Intent)
This level analyzes context, speaker intention, and implied meaning.
Example:
"Could you pass me the salt?"
o Literal Meaning: A yes/no question.
o Pragmatic Meaning: A polite request.
Sentiment Analysis:
o "This movie is amazing!" → Positive Sentiment
o "I hate this product!" → Negative Sentiment
Applications:
Chatbots & Virtual Assistants
Sentiment Analysis (Customer Reviews)
Fake News Detection
7. Discourse Level (Text Coherence & Flow)
This level ensures that multiple sentences in a text are logically connected.
Example (Coreference Resolution):
"John went to the store. He bought some milk."
o "He" refers to "John".
Text Summarization:
o "AI is transforming industries by automating tasks, improving
efficiency, and enabling new technologies."
o Summary: "AI revolutionizes industries through automation."
Applications:
Question Answering (QA)
Document Summarization
Text Coherence Analysis
8. Computational Level (AI & NLP Models)
This level involves AI-driven NLP techniques like Machine Learning (ML) and
Deep Learning (DL).
Example:
Transformers & Neural Networks (GPT, BERT) process and generate
human-like text.
Google Translate uses AI to improve translations.
Applications:
AI Chatbots (ChatGPT, Bard)
Neural Machine Translation
Text Generation
Level | Focus | Example | Applications
1. Phonetic | Speech sounds & pronunciation | "Hello" → /həˈloʊ/ | Speech-to-Text, Voice Assistants
2. Morphological | Word formation & structure | "Unhappiness" → "un-" + "happy" + "-ness" | Lemmatization, Stemming, Spell Check
3. Lexical | Word meaning & POS tagging | "Bank" → financial/riverside meaning | NER, Word Sense Disambiguation
4. Syntactic | Grammar & sentence structure | Parsing sentences | Grammar Checkers, Sentence Parsing
5. Semantic | Meaning interpretation | "John gave Mary a book" (SRL) | NER, Machine Translation
6. Pragmatic | Context & intent analysis | "Could you pass the salt?" (Request) | Chatbots, Sentiment Analysis
7. Discourse | Text coherence & flow | "He bought milk." (Who is "he"?) | QA, Text Summarization
8. Computational | AI-driven NLP models | GPT, BERT for text generation | AI Chatbots, Neural MT, Deep Learning
1.5 Language Syntax and Structure
Introduction to Syntax
Syntax is the set of rules that govern the arrangement of words to form meaningful
sentences in a language. In English, word order plays a crucial role in determining
meaning.
For example, consider these two sentences:
The cat chased the mouse.
The mouse chased the cat.
Even though the same words are used, the meaning changes based on the order. This
shows how syntax is essential for comprehension.
Introduction to Language Structure
Language structure refers to the way sounds, words, and sentences are organized to
create meaning. Linguists analyze this structure through different levels of language,
each contributing to how we communicate effectively. These levels include
phonology, morphology, syntax, semantics, and pragmatics.
Each level builds upon the previous one, creating a hierarchical structure from
simple sounds to complex meanings. Here’s an overview:
1️. Phonology (Sound System of Language)
Phonology studies the sounds in a language and their organization. The smallest unit
in phonology is a phoneme (a sound that changes meaning).
Example:
Minimal Pairs (English): /pɪn/ (pin) vs. /bɪn/ (bin)
Phonological Rules: In English, the plural -s sounds different in different
words:
o cats → /s/
o dogs → /z/
o buses → /ɪz/
2. Morphology (Word Structure)
Morphology deals with morphemes, the smallest units of meaning.
Example (Morpheme Breakdown):
Unhappiness → un- (prefix) + happy (root) + -ness (suffix)
Walked → walk (root) + -ed (past tense marker)
Types of Morphemes:
Free morphemes: Can stand alone (e.g., book, run).
Bound morphemes: Must attach to other words (-ing, re-)
3️. Syntax (Sentence Structure)
Syntax governs word order and sentence structure to create grammatically correct
sentences.
Subject (NP) → "The cat"
Verb (VP) → "sleeps"
Syntax Tree Structure (Phrase Structure Rules)
A syntax tree visually represents how a sentence is structured hierarchically.
Example (Basic Sentence Structure):
"The cat sleeps."
         S
        / \
      NP   VP
     /  \    \
   DT    N    V
    |    |    |
  The   cat  sleeps
S (Sentence)
NP (Noun Phrase) → "The cat"
o DT (Determiner) → "The"
o N (Noun) → "cat"
VP (Verb Phrase) → "sleeps"
o V (Verb) → "sleeps"
Example 2: Complex Sentence ("The small cat sleeps on the mat.")
                  S
           _______|________
          NP               VP
        /  |  \           /  \
      DT  ADJ  N         V    PP
       |   |   |         |   /  \
      The small cat    sleeps P    NP
                              |   /  \
                             on  DT   N
                                  |   |
                                 the  mat
S (Sentence)
NP (Noun Phrase) → "The small cat"
VP (Verb Phrase) → "sleeps on the mat"
o V (Verb) → "sleeps"
o PP (Prepositional Phrase) → "on the mat"
P (Preposition) → "on"
NP (Noun Phrase) → "the mat"
Sentence Types Based on Syntax
1. Simple Sentence: One independent clause.
o "She runs."
2. Compound Sentence: Two independent clauses joined by a conjunction.
o "She runs, and he walks."
3. Complex Sentence: One independent clause + one dependent clause.
o "She runs because she is late."
4️. Semantics (Meaning of Words and Sentences)
Semantics focuses on literal and implied meanings.
Example:
Synonyms: happy = joyful
Antonyms: hot vs. cold
Ambiguity:
o "She saw the man with binoculars." (Who has the binoculars?)
5️. Pragmatics (Context and Social Meaning)
Pragmatics studies meaning in context.
Example:
Indirect Speech Act:
o "Can you pass the salt?" (Not about ability, but a request.)
Conversational Implicature:
o A: "Did you like the food?"
o B: "It was very colorful." (Avoiding direct opinion.)
Syntax Parsing Techniques (With Examples & Tree Structures)
1️. POS (Part-of-Speech) Tagging
Each word is labeled with its grammatical category.
Example:
"The quick brown fox jumps over the lazy dog."
The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ
dog/NN
DT: Determiner
JJ: Adjective
NN: Noun
VBZ: Verb
2️. Shallow Parsing (Chunking)
Groups words into phrases (without deep syntax).
Example (Chunked Sentence):
"The quick brown fox jumps over the lazy dog."
[NP The quick brown fox] [VP jumps] [PP over] [NP the lazy dog]
NP (Noun Phrase): "The quick brown fox"
VP (Verb Phrase): "jumps"
PP (Prepositional Phrase): "over"
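A minimal shallow-parsing (chunking) sketch using NLTK's RegexpParser; the noun-phrase grammar here is a simplified assumption, not a full chunker:
Program
import nltk
from nltk import pos_tag, word_tokenize, RegexpParser
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog."))
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # NP = optional determiner + adjectives + noun(s)
chunker = RegexpParser(grammar)
print(chunker.parse(tagged))          # prints a tree with the NP chunks grouped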
3️. Constituency Parsing (Tree Representation)
Analyzes sentence structure using phrase rules.
Example (Tree for "The cat sleeps.")
         S
        / \
      NP   VP
     /  \    \
   DT    N    V
    |    |    |
  The   cat  sleeps
4️. Dependency Parsing (Word Relationships)
Analyzes dependencies between words.
Example (Dependency Graph for "She loves dogs.")
loves
/ \
She dogs
Root: loves (main verb)
Subject: She → linked to loves
Object: dogs → linked to loves
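A minimal dependency-parsing sketch using spaCy (assumes en_core_web_sm is installed; the dependency labels follow spaCy's scheme):
Program
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She loves dogs.")
for token in doc:
    print(token.text, token.dep_, "→ head:", token.head.text)
Expected output (approximately):
She nsubj → head: loves
loves ROOT → head: loves
dogs dobj → head: loves
. punct → head: loves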
Concept | Description | Example
Phonology | Sound structure | N/A
Morphology | Word formation | "unhappy" → "un-" + "happy"
Syntax | Sentence structure | (See syntax trees above)
Semantics | Meaning interpretation | "She saw the man with binoculars" (ambiguous)
Pragmatics | Context-based meaning | "Can you pass the salt?" (request, not question)
1️.6️ Text Preprocessing and Wrangling
Text preprocessing and wrangling are fundamental steps in Natural Language
Processing (NLP) that involve cleaning, structuring, and normalizing text data before
it is used in machine learning or deep learning models. Since raw text data often
contains noise, inconsistencies, and irrelevant information, preprocessing ensures that
the data is more structured, meaningful, and optimized for analysis. These steps
include lowercasing, removing punctuation, stopwords, stemming, lemmatization,
tokenization, handling missing values, and feature extraction, among others.
1️. Text Preprocessing: The Foundation of Clean Data
Text preprocessing consists of basic cleaning operations that remove unwanted
elements from raw text while preserving meaningful information. This process is
crucial in preparing the data for computational analysis, as machine learning models
work best when the input data is in a uniform format. Below are the essential steps
involved in text preprocessing:
A. Lowercasing
One of the first steps in text preprocessing is converting all text to lowercase. This
ensures that words like "Machine" and "machine" are treated as the same word rather
than different entities. This is particularly important in tasks like information retrieval
and text classification, where case differences can lead to unnecessary complexity.
Example:
"The BOY plays Soccer." → "the boy plays soccer."
By standardizing text to lowercase, models can avoid duplication and reduce
feature space, making processing more efficient.
B. Removing Punctuation
Punctuation marks such as ., !?; do not typically contribute to the meaning of a
sentence when processing text data for machine learning. Removing punctuation
helps in creating a more consistent text format, reducing unnecessary variations in
the dataset.
Example:
"Hello! How are you?" → "Hello How are you"
However, in some NLP applications like sentiment analysis, punctuation (e.g.,
exclamation marks) can carry emotional meaning. Therefore, punctuation removal
should be carefully considered based on the specific NLP task.
C. Removing Special Characters and Numbers
Special characters like @, #, $, % and numbers can be irrelevant or misleading in
many NLP applications unless explicitly required. For instance, a financial dataset
might retain numbers, whereas a sentiment analysis model may not need them.
Example:
"Order #123 is confirmed! Call 555-0199." → "Order is confirmed Call"
This step eliminates noise and makes textual data more readable and structured.
D. Removing Stopwords
Stopwords are commonly used words such as "is", "the", "and", "on" that do not
contribute significant meaning in NLP models. Removing stopwords can reduce
dimensionality and enhance processing speed. However, in some contexts, retaining
them can be useful.
Example:
"The cat is sitting on the table." → "cat sitting table"
By filtering out stopwords, we retain only the most informative parts of the sentence.
E. Tokenization
Tokenization is the process of breaking text into individual components, such as
words (word tokenization) or sentences (sentence tokenization). This step allows NLP
models to process text efficiently by treating each word as a distinct entity.
Example:
"The boy plays soccer." → ["The", "boy", "plays", "soccer"]
Tokenization is fundamental for word frequency analysis, sentiment detection, and
machine translation tasks.
F. Stemming & Lemmatization
Stemming and Lemmatization are techniques used to reduce words to their root
forms, helping models understand similar words in a standardized way.
Stemming: This method truncates words to their base form by removing
suffixes. However, it may not always return a valid word.
Example: "playing" → "play", "running" → "run"
Lemmatization: This technique reduces words to their dictionary form,
making them meaningful and grammatically correct.
Example: "better" → "good", "geese" → "goose"
Lemmatization is preferred over stemming in many NLP applications as it produces
more accurate root words.
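Putting steps A–F together, the sketch below shows one possible preprocessing pipeline using NLTK (it assumes the punkt, stopwords, and wordnet resources have been downloaded; real pipelines vary by task):
Program
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
def preprocess(text):
    text = text.lower()                                   # A. lowercasing
    text = re.sub(r'[^a-z\s]', ' ', text)                 # B, C. drop punctuation, digits, symbols
    tokens = word_tokenize(text)                          # E. tokenization
    stops = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stops]        # D. stop-word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]      # F. lemmatization
print(preprocess("The BOY plays Soccer! Order #123 is confirmed."))
# e.g. ['boy', 'play', 'soccer', 'order', 'confirmed']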
Text Wrangling in NLP: A Detailed Overview
1️. Introduction to Text Wrangling
Text wrangling is the process of cleaning, transforming, and restructuring raw text
data to make it suitable for analysis and modeling. Unlike basic text preprocessing,
which focuses on removing unwanted elements, text wrangling deals with handling
inconsistencies, missing values, standardizing formats, and transforming text
into structured representations for better model performance.
Text wrangling is crucial for Natural Language Processing (NLP) because real-
world text data is often messy, containing misspellings, abbreviations, duplicate
information, inconsistent formatting, and missing values. Proper wrangling
ensures that the text is structured, consistent, and ready for downstream NLP
applications like chatbots, sentiment analysis, machine translation, and text
summarization.
2️. Key Steps in Text Wrangling
A. Handling Missing or Incomplete Data
Missing text data is common in large datasets, often appearing as "NULL", "N/A",
"—", or "" (empty string). Handling missing data correctly prevents models from
learning biased or incorrect patterns.
Approaches to Handle Missing Data:
1. Remove Rows with Missing Values (if the dataset is large and missing values
are minimal).
2. Fill Missing Values with Defaults (e.g., replacing "NULL" with "Unknown").
3. Use Contextual Imputation (e.g., filling missing gender based on a person’s
name).
Example:
Name Age City Profession
Alice 25 NYC Engineer
Bob 30 NULL Doctor
Charlie 28 LA NULL
After Filling Missing Values:
Name Age City Profession
Alice 25 NYC Engineer
Bob 30 Unknown Doctor
Charlie 28 LA Unspecified
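A minimal sketch of the fill-with-defaults strategy using pandas (pandas is an assumed library choice here; the notes do not prescribe one):
Program
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["NYC", None, "LA"],
    "Profession": ["Engineer", "Doctor", None],
})
df["City"] = df["City"].fillna("Unknown")                  # replace missing city
df["Profession"] = df["Profession"].fillna("Unspecified")  # replace missing profession
print(df)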
B. Correcting Spelling Errors & Typos
Spelling errors can introduce noise into text analysis. Spelling correction can be
achieved using:
Edit Distance (Levenshtein Distance): Measures how many edits (insertions,
deletions, replacements) are needed to correct a word.
Dictionary-based Lookup: Maps misspelled words to correct ones.
Spell-checking Libraries (e.g., SymSpell, Hunspell, PySpellChecker)
Example:
"I recieved teh mesage yestarday" → "I received the message yesterday"
C. Standardizing Text Formatting
Text formatting inconsistencies (e.g., different date formats, casing, abbreviations)
can create ambiguity in NLP tasks. Standardization ensures uniformity across the
dataset.
Common Formatting Issues & Fixes:
1. Case Normalization: Convert text to lowercase.
o "Machine Learning" → "machine learning"
2. Date Format Standardization: Convert multiple date formats to a standard
one.
o "12/03/2025" → "March 12, 2025"
3. Abbreviation Expansion: Replace abbreviations with full words.
o "Dr." → "Doctor", "NYC" → "New York City"
Example:
Raw Data: "NYC is gr8! I'll b thr @ 5pm."
Cleaned Data: "New York City is great! I will be there at 5 PM."
D. Removing Duplicates and Irrelevant Data
In large datasets, duplicate entries or irrelevant information can reduce model
efficiency.
1. Removing Duplicate Rows
o If a dataset contains repeated tweets, reviews, or articles, duplicate
rows should be removed.
2. Eliminating Unnecessary Text Data
o Example: Removing boilerplate text like "Copyright 2025" from
multiple documents.
Example:
ID Review
1 "This product is amazing!"
2 "This product is amazing!" (Duplicate)
After Removal:
ID Review
1 "This product is amazing!"
E. Handling Contractions and Informal Language
Contractions (e.g., "can't", "I'm", "won't") should be expanded into their full forms to
improve text clarity.
Example:
"I'm gonna buy it" → "I am going to buy it"
Similarly, informal words and internet slang can be standardized:
"u r amazing" → "you are amazing"
F. Named Entity Recognition (NER) for Identifying Key Elements
Named Entity Recognition (NER) identifies names, locations, organizations, and
dates in text. This is useful for customer feedback analysis, financial data
processing, and news categorization.
Example: “Apple Inc. was founded by Steve Jobs in California."
o Entities Extracted: "Apple Inc." (Organization), "Steve Jobs"
(Person), "California" (Location)
G. Part-of-Speech (POS) Tagging for Grammatical Structuring
POS tagging labels each word with its grammatical role (noun, verb, adjective). This
is essential for syntactic analysis and machine translation.
Example:
Sentence: "The boy plays soccer."
POS Tags: The (Det), boy (N), plays (V), soccer (N)
H. Lemmatization & Stemming for Text Normalization
Both techniques reduce words to their base form:
Stemming: Chops off suffixes but may not return real words (e.g., "running"
→ "run").
Lemmatization: Converts words to dictionary forms (e.g., "better" → "good",
"geese" → "goose").
Example:
"She is walking happily" → Stemming: "She is walk happi"
"She is walking happily" → Lemmatization: "She is walk happy"
3️. Final Step: Converting Text into Features for Machine Learning
After wrangling, text data is converted into numerical features:
Bag of Words (BoW): Represents text as word frequency vectors.
TF-IDF (Term Frequency-Inverse Document Frequency): Assigns weight
to words based on importance.
Word Embeddings (Word2Vec, GloVe, BERT): Transforms words into
vector representations for deep learning.
Example:
"king" - "man" + "woman" = "queen" (Word2Vec analogy)
Text wrangling transforms messy, inconsistent text data into structured,
standardized, and meaningful representations for NLP tasks. Without proper
wrangling, machine learning models may struggle with noisy or biased data,
reducing accuracy and efficiency.
1️.7️ Text tokenization
1️. Introduction to Text Tokenization
Tokenization is a fundamental step in Natural Language Processing (NLP) that
involves breaking down a large body of text into smaller units called tokens. These
tokens can be words, subwords, sentences, or characters, depending on the type of
tokenization used. The primary goal of tokenization is to structure and segment text
data so that it can be analyzed by computational models.
In NLP, tokenization serves as the foundation for various applications, including text
classification, sentiment analysis, information retrieval, machine translation, and
chatbots. Without proper tokenization, text data remains unstructured, making it
difficult for algorithms to process and extract meaningful insights.
2️. Importance of Tokenization in NLP
Tokenization is crucial in NLP for several reasons:
Facilitates Text Processing: Converts unstructured text into manageable
components for analysis.
Reduces Computational Complexity: Breaks down long texts into smaller
units, making data easier to process.
Enables Feature Extraction: Allows NLP models to analyze words, phrases,
and context.
Enhances Machine Learning Models: Helps in the creation of word
embeddings, n-grams, and frequency-based representations.
3️. Types of Tokenization
Tokenization can be categorized into different levels based on the granularity of the
tokens.
A. Word Tokenization
Word tokenization splits text into individual words or subwords. This is one of the
most commonly used tokenization techniques in NLP.
Example:
Input: "The quick brown fox jumps over the lazy dog."
Word Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the",
"lazy", "dog"]
Word tokenization may pose challenges in handling contractions, punctuation, and
compound words. For example, "don't" might be split into ["don", "'t"], which
requires additional processing.
Challenges in Word Tokenization:
Multi-word Expressions: "New York" should be treated as one unit, not
["New", "York"].
Hyphenated Words: "state-of-the-art" might be incorrectly split.
Contractions: "shouldn't" might be incorrectly tokenized as ["shouldn", "'t"].
B. Sentence Tokenization (Segmentation)
Sentence tokenization breaks text into sentences rather than words. This is useful for
text summarization, dialogue systems, and syntactic parsing.
Example:
Input: "Hello! How are you? I hope you're doing well."
Sentence Tokens:
1. "Hello!"
2. "How are you?"
3. "I hope you're doing well."
Sentence tokenization can be challenging due to:
Abbreviations: "Dr. Smith is here." might be mistakenly split after "Dr.".
C. Subword Tokenization
Subword tokenization is particularly useful for handling morphologically complex
words, rare words, and misspellings. This is widely used in modern NLP models
like BERT, GPT, and WordPiece Tokenization.
Example:
Input: "unhappiness"
Subword Tokens (WordPiece-style): ["un", "happiness"]
Subword Tokens (Byte-Pair Encoding - BPE, with a different learned vocabulary): ["un", "hap", "pi", "ness"]
Advantages of Subword Tokenization:
Reduces the vocabulary size in NLP models.
Handles unseen words by breaking them into known subwords.
Works well in multi-lingual NLP models.
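A toy sketch of subword splitting in the spirit of WordPiece, using a hand-made vocabulary and greedy longest-match-first segmentation (real tokenizers learn the vocabulary from data and mark word-internal pieces, e.g. with "##"):
Program
def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # shrink the window until a known subword is found
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:              # nothing matched: fall back to a single character
            tokens.append(word[start])
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens
vocab = {"un", "happi", "ness", "happiness"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happiness']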
D. Character Tokenization
Character tokenization splits text at the character level, which is useful for language
modeling, spelling correction, and speech recognition.
Example:
Input: "ChatGPT"
Character Tokens: ["C", "h", "a", "t", "G", "P", "T"]
Advantages:
Helps in handling misspellings and noisy text (e.g., social media data).
Useful in low-resource languages where word boundaries are unclear.
Challenges:
It loses contextual meaning since individual characters carry limited
information.
4️. Tokenization Methods and Tools
There are several methods and libraries available for tokenization. Some of the most
commonly used tokenization techniques are:
A. Rule-Based Tokenization
This approach relies on predefined rules and regular expressions to split text.
Example: Using Python's re.split() method:
import re
text = "Hello! How are you? I'm fine."
tokens = re.split(r'[.!?]', text)  # split at sentence-ending punctuation
print(tokens)
Output: ["Hello", " How are you", " I'm fine", ""]
Limitations:
Struggles with complex contractions and abbreviations.
B. Tokenization Using NLP Libraries
Several NLP libraries provide efficient tokenization methods:
1. NLTK (Natural Language Toolkit)
o Provides word_tokenize() and sent_tokenize() functions.
o Supports handling contractions and punctuation.
Program
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')  # tokenizer models used by word_tokenize and sent_tokenize
text = "ChatGPT is amazing! It helps with NLP tasks."
print(word_tokenize(text))  # word tokens
print(sent_tokenize(text))  # sentence tokens
Output: ['ChatGPT', 'is', 'amazing', '!', 'It', 'helps', 'with', 'NLP', 'tasks', '.']
['ChatGPT is amazing!', 'It helps with NLP tasks.']
2. spaCy
Faster and more accurate than NLTK for tokenization.
Uses pretrained models for efficient segmentation.
Program
# Install first (from the command line):
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I'm learning NLP with spaCy.")
print([token.text for token in doc])  # ["I", "'m", "learning", "NLP", "with", "spaCy", "."]
Output: ['I', "'m", "learning", "NLP", "with", "spaCy", "."]
Explanation:
1. I → The pronoun "I" is correctly identified as a separate token.
2. 'm → The contraction "I'm" is split into "I" and "'m" (which represents
"am").
3. learning → The verb "learning" is treated as a single token.
4. NLP → "NLP" is recognized as a separate token.
5. with → The preposition "with" is also tokenized separately.
6. spaCy → "spaCy" is correctly identified as one token, keeping the
capitalization.
7. . → The period (.) is treated as a separate token.
Why Use spaCy for Tokenization?
Handles contractions (I'm → ["I", "'m"]) better than some basic tokenizers.
Efficient for named entity recognition (NER), part-of-speech tagging, and
dependency parsing.
1️.8 Stemming
1️. Introduction to Stemming
Stemming is a fundamental text preprocessing technique in Natural Language
Processing (NLP) used to reduce inflected or derived words to their root or base
form. The process of stemming removes suffixes, prefixes, or other affixes from
words to obtain a common base form, known as the word stem.
Stemming is widely used in applications such as information retrieval, text mining,
search engines, and text classification to improve efficiency by reducing vocabulary
size.
Example:
Original words: "running", "runs", "runner", "ran"
After stemming: "run"
By converting words to their stems, NLP models can recognize that different forms of
a word refer to the same concept.
2️. Importance of Stemming in NLP
Stemming plays a crucial role in reducing redundancy in textual data and improving
computational efficiency. Some key benefits include:
Enhancing Information Retrieval:
o Search engines can retrieve relevant documents even if users type
variations of a keyword.
o Example: Searching for "connect" will also return results containing
"connected", "connecting", etc.
Reducing Vocabulary Size:
o Stemming converts words to their base forms, reducing the total
number of unique words in a dataset.
o Example: "studies", "studying", "study" → "studi" (fewer variations to
process).
Improving Text Classification:
o Stemming allows NLP models to generalize over different word forms.
o Example: "argue", "argued", "arguing" all reduce to "argu", allowing
classifiers to treat them as the same word.
3️. Types of Stemming Algorithms
There are several stemming algorithms, each with different approaches to word
reduction.
A. Porter’s Stemmer (Most Commonly Used)
The Porter Stemmer, developed by Martin Porter in 1980, is one of the most
widely used stemming algorithms. It applies a set of rules-based suffix stripping
techniques to convert words to their stems.
Example:
Word Stem (Porter Algorithm)
Connection Connect
Connecting Connect
Connected Connect
Happily Happili
Studying Studi
Python Program using NLTK Porter Stemmer:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "runs", "runner", "studying", "studies", "happiness"]
for word in words:
    print(f"{word} → {ps.stem(word)}")
Output:
running → run
runs → run
runner → runner
studying → studi
studies → studi
happiness → happi
Limitation: The stem "studi" for "studying" and "studies" may not always be
meaningful.
B. Lancaster Stemmer (More Aggressive than Porter)
The Lancaster Stemmer is an aggressive stemming algorithm that reduces words
even further than the Porter Stemmer, sometimes over-stemming words to very
short roots.
Example:
Word Stem (Lancaster Algorithm)
Connection connect
Connecting connect
Connected connect
Happiness happy
Studies study
Python Program using NLTK Lancaster Stemmer:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
words = ["running", "runs", "runner", "studying", "studies", "happiness"]
for word in words:
    print(f"{word} → {ls.stem(word)}")
Output:
running → run
runs → run
runner → run
studying → study
studies → study
happiness → happy
Limitation: The Lancaster Stemmer is often too aggressive, leading to stems that
may not be recognizable.
C. Snowball Stemmer (Improved Version of Porter’s Stemmer)
The Snowball Stemmer (also known as Porter2 Stemmer) is a more advanced
version of the Porter Stemmer, with improvements in handling different languages.
Python Program using Snowball Stemmer:
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("english")
words = ["running", "runs", "runner", "studying", "studies", "happiness"]
for word in words:
    print(f"{word} → {ss.stem(word)}")
Output:
running → run
runs → run
runner → runner
studying → studi
studies → studi
happiness → happi
Advantage: Works with multiple languages like Spanish, French, German, etc.
4️. Stemming vs. Lemmatization
While stemming and lemmatization both reduce words to their base forms, they
differ in their approach:
Feature | Stemming | Lemmatization
Definition | Removes prefixes and suffixes using rules. | Converts words to their base form using a dictionary.
Approach | Uses heuristics (rules) to chop word endings. | Uses linguistic knowledge and context.
Speed | Faster (rule-based). | Slower (requires dictionary lookup).
Output | Can produce non-real words. | Always results in meaningful words.
Example | "running" → "run", "happiness" → "happi" | "running" → "run", "happiness" → "happiness"
Use Case | Search engines, indexing, and fast text processing. | NLP applications needing grammatical correctness, such as chatbots and machine translation.
When to Use Stemming vs. Lemmatization?
Use Stemming when speed is crucial and accuracy is less important (e.g.,
search engines).
Use Lemmatization when word meaning and grammatical correctness are
necessary (e.g., chatbots, text summarization).
5️. Applications of Stemming in NLP
Stemming is widely used in various NLP applications:
Application | Role of Stemming
Search Engines | Allows retrieval of documents containing different word forms (e.g., "running" and "run").
Text Classification | Reduces vocabulary size by grouping words with the same root together.
Spam Detection | Helps identify variations of spam words.
Speech Recognition | Converts spoken words into base forms for better accuracy.
Chatbots | Recognizes different forms of words to improve user responses.
6️. Challenges and Limitations of Stemming
Despite its advantages, stemming has some challenges:
Over-stemming: Some algorithms reduce words too much, making them
unrecognizable.
o Example: "University" → "Univers" (Porter Stemmer)
Under-stemming: Some words are not reduced enough, missing relationships
between words.
o Example: "running" → "run" but "runner" → "runner"
Lack of Context Awareness: Stemming does not consider the grammatical
role of a word in a sentence.
o Example: "better" does not stem to "good" (which Lemmatization can
handle).
Stemming is a powerful NLP preprocessing technique that helps simplify text
data, reduce redundancy, and improve computational efficiency. While different
stemming algorithms exist, each has strengths and weaknesses depending on the use
case. For applications where speed matters, stemming is preferred. However, if
accuracy and context are crucial, lemmatization is the better choice.
1️.9 Lemmatization
1. Introduction to Lemmatization
Lemmatization is a text preprocessing technique in Natural Language Processing
(NLP) that reduces words to their base or dictionary form, known as the lemma.
Unlike stemming, which simply removes suffixes based on rules, lemmatization
considers the context and meaning of a word, ensuring that the output is a valid
word.
Example:
Word Variant | Lemma
Running | Run
Studies | Study
Flies | Fly
Happier | Happy
Better | Good
Lemmatization is widely used in chatbots, search engines, text summarization,
machine translation, and question-answering systems, where grammatical
correctness and word meaning are crucial.
2️. How Lemmatization Works
Lemmatization relies on linguistic knowledge and morphological analysis,
considering a word’s Part of Speech (POS) to determine its correct base form. The
process follows these steps:
Steps in Lemmatization:
1. Tokenization: Breaking text into individual words.
2. POS Tagging: Identifying the grammatical category of each word (noun, verb,
adjective, etc.).
3. Dictionary Lookup: Finding the base form of the word in a lexicon.
4. Lemmatization Output: Converting words to their valid base form.
3️. Importance of Lemmatization in NLP
Reduces Vocabulary Size → Helps NLP models understand variations of the
same word.
Improves Search Accuracy → Search engines retrieve more relevant results
by mapping different forms of a word to the same root.
Enhances Text Processing → Helps in text classification, sentiment
analysis, and chatbot responses.
Supports Machine Learning Models → Reduces noise in text data,
improving model efficiency.
4️. Lemmatization vs. Stemming: Key Differences
Feature | Lemmatization | Stemming
Approach | Uses a dictionary and context-based rules. | Uses rule-based suffix stripping.
Output | Produces real words. | May produce meaningless words.
Speed | Slower (more accurate). | Faster (less accurate).
Example | "Running" → "Run" | "Running" → "Run"
Use Case | Chatbots, summarization, NLP tasks needing accuracy. | Search engines, indexing, fast text processing.
5️. Lemmatization Implementation in Python
Lemmatization can be implemented using NLTK (WordNet Lemmatizer) and
Spacy
a) Lemmatization using NLTK WordNetLemmatizer
Program
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # lexical database used by the lemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "happier", "better", "studies"]
for word in words:
    print(f"{word} → {lemmatizer.lemmatize(word)}")  # default assumes noun form
Output:
running → running
flies → fly
happier → happier
better → better
studies → study
Here, "running" did not change because lemmatization needs the correct Part of
Speech (POS).
b) Lemmatization with POS Tagging in NLTK
To improve accuracy, we specify the POS tag ('v' for verbs, 'a' for adjectives):
Program
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
words = [("running", wordnet.VERB), ("flies", wordnet.NOUN),
         ("happier", wordnet.ADJ), ("better", wordnet.ADJ)]
for word, pos in words:
    print(f"{word} → {lemmatizer.lemmatize(word, pos=pos)}")
Output:
running → run
flies → fly
happier → happy
better → good
"better" is correctly lemmatized to "good" because it is an irregular adjective.
c) Lemmatization using spaCy (more advanced)
Program
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She is running faster and happier than before.")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
Output:
She → she
is → be
running → run
faster → fast
and → and
happier → happy
than → than
before → before
spaCy automatically detects POS tags, making it more efficient than NLTK in many
cases.
6️. Applications of Lemmatization in NLP
Application | Role of Lemmatization
Search Engines | Retrieves relevant documents by mapping different word forms to a common lemma.
Chatbots & Virtual Assistants | Improves understanding of user queries by reducing words to their base form.
Text Summarization | Helps generate concise summaries by normalizing words.
Machine Translation | Assists in translating words accurately by recognizing their base form.
Named Entity Recognition (NER) | Enhances entity recognition by considering different word forms.
7️. Challenges and Limitations of Lemmatization
Despite its advantages, lemmatization has some challenges:
Computational Cost: Requires dictionary lookup, making it slower than
stemming.
POS Tag Dependency: Requires accurate Part of Speech (POS) tagging for
best results.
Language-Specific: Lemmatization is language-dependent and requires
different lexicons for different languages.
Handling Irregular Forms: Some irregular words (e.g., "went" → "go")
may not always be handled correctly.
Lemmatization is a powerful NLP technique that enhances text preprocessing by
reducing words to meaningful root forms while preserving grammatical accuracy.
Compared to stemming, it provides better results but requires more computation.
For applications like search engines, chatbots, and machine translation,
lemmatization is preferred over stemming for its accuracy.
1️.1️0 Removing stop-words
1️. Introduction to Stop Words
Stop words are common words in a language that do not carry significant meaning
and are often removed during text preprocessing in Natural Language Processing
(NLP). These words include articles, pronouns, prepositions, conjunctions, and
other frequently occurring words such as:
Examples of English Stop Words: the, is, in, at, which, and, to, from, he, she,
it, on, a, an, that
Why Remove Stop Words?
1. Reduces Text Size → Eliminates unnecessary words, making text
processing more efficient.
2. Improves Computational Speed → Less data to analyze means faster
machine learning models.
3. Enhances Meaningful Word Extraction → Focuses on important words
rather than common filler words.
4. Increases Accuracy in NLP Tasks → Helps in search engines, text
classification, and sentiment analysis by eliminating noise.
2️. When Should We Keep Stop Words?
While stop words are usually removed, there are cases where keeping them is useful:
Sentiment Analysis → Words like not, never, and don’t are crucial for
detecting negative sentiment.
Text Generation (Chatbots, Translation, Summarization) → Stop words
help maintain fluency and natural sentence structure.
Named Entity Recognition (NER) → Some stop words may be part of
proper nouns (e.g., The White House).
3️. Implementing Stop Word Removal in Python
Stop words can be removed using NLTK (Natural Language Toolkit) or spaCy.
a) Removing Stop Words Using NLTK
Program
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
text = "This is an example showing stop words removal in NLP."
# Tokenization
words = word_tokenize(text)
# Removing Stop Words
filtered_words = [word for word in words if word.lower() not in
stopwords.words('english')]
print ("Original Words:", words)
print ("Filtered Words (After Stop Word Removal):", filtered_words)
Output:
Original Words: ['This', 'is', 'an', 'example', 'showing', 'stop', 'words', 'removal', 'in',
'NLP', '.']
Filtered Words (After Stop Word Removal): ['example', 'showing', 'stop', 'words',
'removal', 'NLP', '.']
The stop words "This", "is", "an", "in" have been removed, leaving only
meaningful words.
b. Removing Stop Words Using spaCy
Program
import spacy
nlp = spacy.load("en_core_web_sm")
text = "This is an example showing stop words removal in NLP."
doc = nlp(text)
filtered_tokens = [token.text for token in doc if not token.is_stop]
print ("Filtered Words:", filtered_tokens)
Output
Filtered Words: ['example', 'showing', 'stop', 'words', 'removal', 'NLP', '.']
spaCy automatically detects stop words and removes them efficiently.
4. Creating a Custom Stop Word List
In some cases, the default stop words may not be suitable, so custom stop word lists
can be used.
a) Custom Stop Words in NLTK
Program
custom_stopwords = set(stopwords.words('english'))
custom_stopwords.update(["example", "showing"])  # Adding custom words to remove
filtered_words = [word for word in words if word.lower() not in custom_stopwords]
print("Custom Filtered Words:", filtered_words)
Output:
Custom Filtered Words: ['stop', 'words', 'removal', 'NLP', '.']
The words "example" and "showing" were manually added to the stop word list and
removed.
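b) Custom Stop Words in spaCy (illustrative sketch)
spaCy keeps its stop-word list in nlp.Defaults.stop_words. The short sketch below is one possible way to add custom entries; it assumes the en_core_web_sm model is installed, and the added words ("example", "showing") are the same illustrative choices used above.
Program
import spacy
nlp = spacy.load("en_core_web_sm")
# Mark extra words as stop words for this pipeline
for word in ["example", "showing"]:
    nlp.Defaults.stop_words.add(word)
    nlp.vocab[word].is_stop = True
doc = nlp("This is an example showing stop words removal in NLP.")
filtered = [token.text for token in doc if not token.is_stop]
print("Custom Filtered Words:", filtered)
# Expected: ['stop', 'words', 'removal', 'NLP', '.']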
5. Applications of Stop Word Removal in NLP
Application – Role of Stop Word Removal
Search Engines (Google, Bing, etc.) – Improves search results by ignoring common words.
Text Classification (Spam Detection, News Categorization) – Focuses on key terms instead of filler words.
Sentiment Analysis (Product Reviews, Tweets) – Helps extract meaningful words but may retain certain stop words.
Speech Recognition – Reduces noise in transcriptions by filtering unnecessary words.
Machine Learning Models – Reduces dimensionality, improving efficiency and accuracy.
6. Challenges and Considerations in Stop Word Removal
While removing stop words improves text analysis, some challenges exist:
1. Loss of Important Information → Removing words like not, never, nor can
change the meaning of a sentence.
o Example: "I do not like this product."
o Without "not," it becomes "I do like this product.", changing the
sentiment completely.
2. Language-Specific Stop Words → Different languages have different stop
words (e.g., French, Spanish).
o Need to use language-specific stop word lists for multilingual NLP
tasks.
3. Domain-Specific Stop Words → In specialized fields (medical, legal,
scientific), common words may not be stop words.
o Example: In medical NLP, words like "patient", "disease", "treatment"
should not be removed.
4. Contextual Importance → Some stop words may be significant in Named
Entity Recognition (NER) tasks.
o Example: "The United Nations" → Removing "The" could lead to
incorrect recognition.
Stop word removal is a fundamental step in text preprocessing for NLP, reducing
unnecessary words and improving model performance. However, it should be used
carefully depending on the task.
For search engines and classification: Removing stop words improves
efficiency.
For sentiment analysis and chatbots: Some stop words should be retained.
For domain-specific applications: Custom stop word lists are essential.
1.11 Feature Engineering for Text Representation
1. Introduction to Feature Engineering in NLP
Feature Engineering is the process of transforming raw text into meaningful
numerical representations that machine learning models can understand. In Natural
Language Processing (NLP), text must be converted into features before applying
algorithms for tasks such as sentiment analysis, text classification, and machine
translation.
Why is Feature Engineering Important?
1. Improves Model Performance → Better features lead to better predictions.
2. Reduces Dimensionality → Removes unnecessary data, making models
efficient.
3. Enhances Interpretability → Converts text into structured formats, aiding
analysis.
4. Optimizes Learning Algorithms → Helps models learn patterns from textual
data.
2. Types of Text Representation Techniques
There are several ways to represent text as numerical data. These include:
A. Basic Text Representations
1. Bag of Words (BoW)
2. Term Frequency-Inverse Document Frequency (TF-IDF)
3. Word Embeddings (Word2Vec, GloVe, FastText)
4. Character and Subword Embeddings
5. Sentence and Document Representations (Doc2Vec, Transformers)
3. Basic Text Representation Techniques
A. Bag of Words (BoW) Model
The Bag of Words (BoW) model represents text as a frequency-based vector,
ignoring word order.
Example:
Text:
"The cat sat on the mat."
"The dog sat on the floor."
Word     Sentence 1   Sentence 2
the          2            2
cat          1            0
dog          0            1
sat          1            1
on           1            1
mat          1            0
floor        0            1
Limitations:
Ignores word meaning and order.
Large vocabulary size leads to sparse vectors
Implementation using Scikit-Learn:
from sklearn.feature_extraction.text import CountVectorizer
texts = ["The cat sat on the mat.", "The dog sat on the floor."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Output:
['cat' 'dog' 'floor' 'mat' 'on' 'sat' 'the']
[[1 0 0 1 1 1 2]
 [0 1 1 0 1 1 2]]
Each column corresponds to one vocabulary word (index 0 = cat, 1 = dog, 2 = floor, 3 = mat, 4 = on, 5 = sat, 6 = the), and each row is the word-count vector of one sentence.
B. Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF weights words based on their importance in a document relative to a corpus.
Formula:
TF (Term Frequency) = (Number of times a word appears in a document) /
(Total words in the document).
IDF (Inverse Document Frequency) = log (N / (df + 1)), where N is the total
number of documents and df is the number of documents containing the word.
🔹 Advantages:
✔ Reduces the importance of common words.
✔ Highlights important keywords.
Implementation using Scikit-Learn:
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["The cat sat on the mat.", "The dog sat on the floor."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Step-by-Step Execution:
1. Tokenization & Feature Extraction
The TfidfVectorizer splits the text into words (tokens), converts them to
lowercase, removes punctuation, and calculates TF-IDF scores.
The unique words across both sentences:
['cat', 'dog', 'floor', 'mat', 'on', 'sat', 'the']
2. Calculating TF-IDF Values:
TF (Term Frequency) → How often a word appears in a sentence.
IDF (Inverse Document Frequency) → Measures importance across all
sentences.
TF-IDF Score → Computed using the formula:
TF-IDF = TF × log(N / df)
where N is the total number of documents, and df is the number of documents containing the word.
3. Final TF-IDF Matrix Output (values rounded to four decimal places):
['cat' 'dog' 'floor' 'mat' 'on' 'sat' 'the']
[[0.4455 0.     0.     0.4455 0.3170 0.3170 0.6340]
 [0.     0.4455 0.4455 0.     0.3170 0.3170 0.6340]]
Note that scikit-learn uses a smoothed IDF and L2-normalizes each row vector, so the exact values differ slightly from the plain TF × log(N/df) formula.
4. Advanced Feature Representation Techniques
A. Word Embeddings (Word2Vec, GloVe, FastText)
Unlike BoW and TF-IDF, word embeddings represent words in a continuous vector space, capturing their meaning and relationships.
Technique – Features
Word2Vec – Learns word meanings based on context (CBOW & Skip-Gram).
GloVe – Captures global word co-occurrence statistics.
FastText – Works with subword information, useful for morphologically rich languages.
Example:
"King" - "Man" + "Woman" ≈ "Queen" (Word2Vec captures word relationships!)
B. Character and Subword Embeddings (FastText, Byte Pair Encoding)
Character-based models work well for handling misspellings and unknown words.
Example:
✔ "run" → "running", "runner" → Captures morphological variations.
C. Sentence and Document Representations
1. Doc2Vec → Learns vector representations of entire documents.
2. Transformer Models (BERT, GPT, T5) → Use contextual embeddings to generate deep semantic understanding.
5. Feature Engineering for NLP Machine Learning Models
Feature Type – Best for
BoW, TF-IDF – Traditional ML models (Naive Bayes, SVM, Logistic Regression)
Word2Vec, GloVe, FastText – Deep Learning (LSTMs, CNNs)
BERT, GPT, Transformers – Advanced NLP (Chatbots, Summarization, Sentiment Analysis)
6. Challenges in Feature Engineering for Text
Curse of Dimensionality → Large vocabulary = Sparse matrices.
Loss of Semantic Meaning → Simple models (BoW, TF-IDF) don’t capture
meaning.
Computational Cost → Deep learning models require high resources.
Handling OOV (Out-of-Vocabulary) Words → Word2Vec & FastText mitigate
this with subword embeddings.
Feature Engineering in NLP transforms text into numerical representations that
machine learning models can process. Choosing the right representation depends
on the use case:
✔ BoW & TF-IDF → Good for basic NLP tasks.
✔ Word Embeddings → Capture word relationships and semantics.
✔ Transformers (BERT, GPT) → Best for advanced NLP applications.
1.12 Bag of Words (BoW)
The Bag of Words (BoW) model is a widely used technique in Natural Language
Processing (NLP) for converting textual data into numerical representations. It helps
machine learning models process and analyze text without understanding the actual
meaning of words.
The fundamental idea behind BoW is to treat a text as a collection of individual words,
ignoring grammar and word order but keeping track of word frequency. This means that
two sentences with the same words but in different orders will have identical BoW
representations.
The BoW process begins with tokenization, where a given text is broken down into
individual words. These words are then used to build a vocabulary containing all
unique words from the dataset.
Once the vocabulary is created, each sentence or document is represented as a word
frequency matrix, where each row corresponds to a document, and each column
corresponds to a unique word from the vocabulary. The values in the matrix indicate
how many times a word appears in a particular document.
For example, consider two simple sentences: “The cat sat on the mat.” and “The dog
sat on the floor.”. The vocabulary derived from these sentences consists of the unique
words: [‘cat’, ‘dog’, ‘floor’, ‘mat’, ‘on’, ‘sat’, ‘the’]. The first sentence is then
represented as [1, 0, 0, 1, 1, 1, 2], while the second sentence is [0, 1, 1, 0, 1, 1, 2], where
the numbers indicate word occurrences.
While BoW is simple and effective for tasks like text classification, spam detection,
and sentiment analysis, it has limitations. One major drawback is that it ignores word
order, meaning phrases like “I love NLP” and “NLP love I” will be represented
identically. Additionally, BoW does not capture the meaning or relationships between
words, leading to high-dimensional and sparse matrices when dealing with large
vocabularies.
Despite its limitations, BoW is still a foundational technique in text analysis and is often
enhanced using more advanced methods like TF-IDF (Term Frequency-Inverse
Document Frequency), which assigns importance to words based on their frequency
across multiple documents.
Modern alternatives such as Word Embeddings (Word2Vec, GloVe, and BERT)
have further improved text representation by capturing semantic meanings and
contextual relationships between words. Nonetheless, BoW remains a crucial stepping
stone in the field of NLP, offering a simple yet powerful way to transform raw text into
structured data for computational processing.
The Bag of Words (BoW) model is one of the most fundamental techniques in Natural
Language Processing (NLP) for text representation. It converts textual data into a
numerical format that can be used for machine learning models.
BoW treats text as a "bag" of words, ignoring grammar and word order but
preserving the frequency of words. This approach is widely used in applications like
text classification, sentiment analysis, and spam detection.
How Does the Bag of Words Model Work?
The BoW model follows these key steps:
Step 1: Tokenization
Break down the text into individual words (tokens).
Example sentences:
Sentence 1: "The cat sat on the mat."
Sentence 2: "The dog sat on the floor."
Tokenized Words: ['The', 'cat', 'sat', 'on', 'the', 'mat', 'dog', 'floor']
Step 2: Create a Vocabulary
Identify all the unique words (features) across all documents.
The vocabulary from both sentences:
['cat', 'dog', 'floor', 'mat', 'on', 'sat', 'the']
Step 3: Create a Word Frequency Matrix
Each row represents a document (sentence), and each column represents a word from
the vocabulary. The values are word counts.
             cat   dog   floor   mat   on   sat   the
Sentence 1    1     0      0      1     1     1     2
Sentence 2    0     1      1      0     1     1     2
1 means the word appears once in that sentence.
0 means the word does not appear in that sentence.
The word "the" appears twice in both sentences.
Advantages of BoW
Simple and effective for text representation.
Works well with traditional machine learning algorithms (e.g., Naïve Bayes,
SVM).
Can be easily implemented using Python libraries like sklearn.
Limitations of BoW
Ignores word order (e.g., "I love NLP" and "NLP love I" are treated as identical).
Fails to capture meaning or relationships between words.
Leads to a large sparse matrix (many zeroes when handling big vocabularies).
Bag of Words in Python
You can implement BoW using Scikit-learn (CountVectorizer).
Program:
from sklearn.feature_extraction.text import CountVectorizer
texts = ["The cat sat on the mat.", "The dog sat on the floor."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'floor' 'mat' 'on' 'sat' 'the']
print(X.toarray())  # Word frequency matrix
Output:
['cat' 'dog' 'floor' 'mat' 'on' 'sat' 'the']
[[1 0 0 1 1 1 2]
[0 1 1 0 1 1 2]]
Extensions of BoW
TF-IDF (Term Frequency-Inverse Document Frequency): Improves BoW
by weighting words based on their importance.
Word Embeddings (Word2Vec, GloVe, BERT): Capture word meanings, context, and relationships.
Problems and Solutions in the Bag of Words (BoW) Model
The Bag of Words (BoW) model is a simple yet widely used text representation
technique in Natural Language Processing (NLP). However, it has several limitations
that affect its performance in real-world applications. Below are some of the major
problems associated with BoW, along with their respective solutions.
1. Ignores Word Order and Context
Problem:
BoW treats a sentence as an unordered collection of words, ignoring their sequence and
structure. This means that "The cat chased the dog" and "The dog chased the cat"
will have the same representation, even though they convey different meanings.
Solution:
Use N-grams: Instead of considering single words (unigrams), N-grams capture word
sequences (e.g., bigrams: "chased the", "the dog") to preserve some context.
Use Word Embeddings (Word2Vec, GloVe, BERT): These models encode word
meaning and relationships based on large text corpora, allowing words with similar
meanings to have similar numerical representations.
2. High Dimensionality and Sparsity
Problem:
For large text datasets, BoW creates a huge vocabulary, leading to high-dimensional
feature vectors with many zero values. This results in sparse matrices, making
computation inefficient and increasing memory usage.
Solution:
Feature Selection Techniques: Reduce dimensionality by selecting the most
relevant words based on term frequency, information gain, or mutual
information.
Dimensionality Reduction Methods: Use Principal Component Analysis
(PCA) or Latent Semantic Analysis (LSA) to reduce the number of features.
Use TF-IDF instead of Raw Counts: TF-IDF (Term Frequency-Inverse
Document Frequency) assigns weight to words based on importance rather than
raw frequency, reducing the impact of common words.
3. Fails to Capture Semantic Meaning
Problem:
BoW treats words independently, ignoring their meaning or relationships. For example,
"great" and "excellent" have similar meanings but are treated as completely different
features.
Solution:
Use Word Embeddings: Approaches like Word2Vec, FastText, GloVe, and
BERT represent words in a continuous vector space, capturing semantic
relationships between words.
Use Concept-based Models: Topic modeling techniques such as Latent
Dirichlet Allocation (LDA) group words into topics, improving understanding
of the text.
4. Assigns Equal Importance to All Words
Problem:
BoW does not differentiate between important and unimportant words. Common
words like "the," "is," and "on" may appear frequently, dominating the word count
while providing little useful information.
Solution:
Use TF-IDF: It reduces the weight of frequently occurring words and increases
the importance of words unique to a document.
Remove Stopwords: Eliminate common words such as "the," "is," and "in" that
do not add significant meaning. Libraries like NLTK and spaCy provide
predefined lists of stopwords.
5. Struggles with Out-of-Vocabulary (OOV) Words
Problem:
BoW can only recognize words that were present in the training dataset. If new words
appear in test data, they are ignored or represented as zero, reducing model
performance.
Solution:
Use Word Embeddings: Pre-trained embeddings (Word2Vec, GloVe) capture
relationships between words, making models more robust to new words.
Use Character-level Models: Models like FastText can generate embeddings
for unseen words based on subwords and character n-grams.
6. Poor Performance on Large Datasets
Problem:
As the dataset grows, the vocabulary size increases, leading to longer processing
times, more memory usage, and higher computational costs.
Solution:
Apply Feature Engineering Techniques: Reduce the number of words using
stemming, lemmatization, or frequency-based filtering.
Use Hashing Trick: Instead of storing full vocabulary, hashing vectorizers
convert words into numerical indices, reducing storage requirements.
Use Pre-trained Language Models: Advanced transformer-based models like
BERT, GPT, and XLNet handle large text corpora more efficiently.
1.13 Bag of N-grams Model
The Bag of N-grams (BoN) model is an extension of the Bag of Words (BoW) model
used in Natural Language Processing (NLP) to represent text data. While BoW treats
text as an unordered collection of words, BoN captures sequences of words (N-
grams) to preserve some context and improve performance in text classification,
sentiment analysis, and other NLP tasks.
What is an N-gram?
An N-gram is a contiguous sequence of N words from a given text. The value of N
determines how many words are grouped together:
Unigram (1-gram): Single words (e.g., "The", "cat", "sat").
Bigram (2-gram): Two-word sequences (e.g., "The cat", "cat sat").
Trigram (3-gram): Three-word sequences (e.g., "The cat sat", "cat sat on").
N-gram (N > 3): Larger sequences (e.g., "The cat sat on", "cat sat on the").
Unlike BoW, which loses word order, BoN captures word dependencies by
considering adjacent words in the text.
How Does the Bag of N-grams Model Work?
Step 1: Tokenization
The input text is split into individual words or tokens.
Example Sentence:
➡ "The cat sat on the mat."
Step 2: Generate N-grams
For different values of N, the following N-grams are extracted:
Unigrams: ["The", "cat", "sat", "on", "the", "mat"]
Bigrams: ["The cat", "cat sat", "sat on", "on the", "the mat"]
Trigrams: ["The cat sat", "cat sat on", "sat on the", "on the mat"]
Step 3: Build the Vocabulary
A vocabulary is created containing all unique N-grams from the dataset.
Step 4: Convert Text into a Numerical Representation
Each text document is represented as a numerical vector indicating the presence or frequency of each N-gram in the vocabulary.
Example: Bag of N-grams in Python.
from sklearn.feature_extraction.text import CountVectorizer
# Example sentences
texts = ["The cat sat on the mat.", "The dog sat on the floor."]
# Create a CountVectorizer for bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))  # Bigrams only
X = vectorizer.fit_transform(texts)
# Display the N-gram vocabulary and feature matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
Output:
['cat sat' 'dog sat' 'on the' 'sat on' 'the cat' 'the dog' 'the floor' 'the mat']
[[1 0 1 1 1 0 0 1]
 [0 1 1 1 0 1 1 0]]
Here, each bigram is treated as a separate feature, and the frequency of its occurrence
in the text is counted.
Advantages of the Bag of N-grams Model
Preserves Some Word Order: Unlike BoW, N-grams retain local word
sequences, which improves context understanding.
Improves Accuracy in NLP Tasks: Helps in sentiment analysis, text
classification, and machine translation by considering word dependencies.
Better Handling of Polysemy: Words with multiple meanings (e.g., "bank" in
"river bank" vs. "money bank") are distinguished through context.
Disadvantages of the Bag of N-grams Model
Increased Dimensionality: Adding more N-grams results in a larger feature
space, making computation expensive.
Sparse Representations: Many N-grams appear only once, leading to sparse
vectors that impact model efficiency.
Limited Context Beyond N: N-grams capture only local word relationships
and do not consider long-range dependencies.
Solutions to Improve N-grams
Use TF-IDF Instead of Raw Counts: Reduces the effect of common N-
grams by assigning weights.
Apply Feature Selection: Remove low-frequency N-grams to reduce
dimensionality.
Use Word Embeddings: Methods like Word2Vec, FastText, and BERT capture semantic relationships and long-range dependencies better.
The Bag of N-grams model is an improvement over Bag of Words because it
preserves some word order and improves text representation. However, it suffers
from high dimensionality and sparsity issues. While useful for text classification
and sentiment analysis, it is often combined with TF-IDF or word embeddings for
better results in modern NLP applications.
1.14 TF-IDF (Term Frequency-Inverse Document Frequency)
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure
used in Natural Language Processing (NLP) to evaluate the importance of a word in
a document relative to a collection (corpus) of documents. Unlike the Bag of Words
(BoW) model, which considers only word frequency, TF-IDF assigns weights to words
based on their importance, making it more effective in text mining, information
retrieval, and search engines.
Formula
TF-IDF is computed as the product of two components:
1. Term Frequency (TF): Measures how often a term appears in a document.
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
2. Inverse Document Frequency (IDF): Measures how important or rare a term is across all documents.
IDF(t) = log(N / DF(t))
where:
N = Total number of documents in the corpus
DF(t) = Number of documents containing the term t
TF-IDF Score
TF-IDF(t, d) = TF(t, d) × IDF(t)
Example Calculation
Corpus of Three Documents
1. D1: "The cat sat on the mat"
2. D2: "The dog sat on the mat"
3. D3: "The cat slept on the bed"
Step 1: Compute TF for the word "cat"
D1: TF("cat") = 1/6 ≈ 0.167
D2: TF("cat") = 0/6 = 0
D3: TF("cat") = 1/6 ≈ 0.167
Step 2: Compute IDF for "cat"
"cat" appears in 2 out of 3 documents.
IDF("cat") = log(3/2) ≈ 0.176
Step 3: Compute TF-IDF
D1: TF-IDF("cat") = 0.167 × 0.176 ≈ 0.0294
D2: TF-IDF("cat") = 0 × 0.176 = 0
D3: TF-IDF("cat") = 0.167 × 0.176 ≈ 0.0294
Since "cat" appears in multiple documents, its importance is slightly reduced.
Applications of TF-IDF
1. Search Engines (Google, Bing) – Ranking documents based on query
relevance.
2. Keyword Extraction – Identifying important words in a document.
3. Text Classification – Feature extraction for machine learning models.
4. Plagiarism Detection – Identifying unusual word distributions in texts.
5. Chatbots and NLP Tasks – Finding key terms in user queries.
UNIT II TEXT CLASSIFICATION
Vector Semantics and Embeddings -Word Embeddings - Word2Vec model – Glove model –
FastText model – Overview of Deep Learning models – RNN – Transformers – Overview of Text
summarization and Topic Models
1. Vector Semantics and Embeddings
Vector semantics is a method in Natural Language Processing (NLP) used to represent words as
vectors (numerical arrays) in a continuous vector space. The idea is to place similar words close
together in that space based on their meanings or usage in context.
To represent the meaning of words as vectors (lists of numbers) so that computers can understand
relationships between words.
Computers understand numbers, not words. So, we convert words into numbers to perform machine
learning tasks like classification, translation, or sentiment analysis.
Example Words:
king, queen, man, woman
Let’s say we have the following (sample) word embeddings:
Word Vector Representation
king [0.8, 0.65, 0.1]
queen [0.82, 0.7, 0.12]
man [0.7, 0.5, 0.05]
woman [0.75, 0.55, 0.07]
king - man + woman = ?
[0.8, 0.65, 0.1] - [0.7, 0.5, 0.05] + [0.75, 0.55, 0.07] = [0.85, 0.7, 0.12] ≈ queen
The result is very close to the vector for "queen".
This shows the model understands relationships like:
King is to man as queen is to woman
These relationships are captured mathematically in the vector space.
1.1. Why Do We Use Vectors for Words?
Let’s say we have the words: “king”, “queen”, “apple”.
We want the computer to know that “king” and “queen” are related (royalty), but
“king” and “apple” are not.
By using vector semantics:
Example: Word Vectors
Suppose we assign these simple vectors:
king → [0.8, 0.2, 0.9]
queen → [0.82, 0.25, 0.88]
apple → [0.1, 0.9, 0.2]
We can now measure how close the words are using math like cosine similarity.
Closer vectors = similar meaning
Distant vectors = unrelated words
“king” and “queen” will be close together in vector space.
“king” and “apple” will be far apart.
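As a small illustration, the sketch below computes cosine similarity for the toy vectors above using NumPy (the vectors are the same made-up numbers used in this example, not real trained embeddings).
Program
import numpy as np
vectors = {
    "king":  np.array([0.8, 0.2, 0.9]),
    "queen": np.array([0.82, 0.25, 0.88]),
    "apple": np.array([0.1, 0.9, 0.2]),
}
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine(vectors["king"], vectors["queen"]))  # ≈ 0.999 → very similar
print(cosine(vectors["king"], vectors["apple"]))  # ≈ 0.39  → not related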
1.2. Before Embeddings: One-Hot Encoding
In early NLP, we used One-Hot Encoding:
Each word had a long vector with all zeros and a one in one position.
Example for vocabulary = [“king”, “queen”, “apple”]:
“king” → [1, 0, 0]
“queen” → [0, 1, 0]
“apple” → [0, 0, 1]
Problem: No relation between “king” and “queen.” They look totally different!
Word embeddings are a way to represent words as numbers (vectors). This helps the computer
understand the meaning of words based on their context. Each word is represented by a list of
numbers (a vector) in a way that similar words have similar vectors.
1.3. Why Do We Need Word Embeddings?
In traditional methods, each word was represented as a unique index, like [1, 0, 0] for
“dog” and [0, 1, 0] for “cat”. But this doesn't capture meaning.
Word embeddings, on the other hand, capture the meaning and relationships between
words.
1.4. How Does It Work?
Word embeddings use context to learn the meaning of words.
For example, if we use the sentence:
"The cat sits on the mat."
Words like "cat" and "mat" are related, so their embeddings will be closer together in
the vector space.
Words like "cat" and "apple" will be far apart because they are not related in
meaning.
Example of Word Embeddings
Let’s take a small example with 3 words: “dog”, “cat”, and “apple”. We’ll represent
them as vectors (numbers).
Word Vector Representation
dog [0.9, 0.2, 0.3]
cat [0.8, 0.3, 0.4]
apple [0.1, 0.9, 0.5]
Now, look at the vectors:
dog and cat have similar vectors, meaning they are related (both are animals).
apple has a very different vector, showing that it’s not related to "dog" and
"cat".
1.5. Why Are Word Embeddings Useful?
1. Semantic meaning: Words with similar meanings are closer together in the vector
space.
2. Text classification: Helps computers classify texts better (e.g., positive or negative
reviews).
3. Translation: Helps translate words between languages (e.g., “dog” in English and
“perro” in Spanish will be close in vector space).
1.6. Popular Word Embedding Models
A. Word2Vec (by Google)
Learns from word context.
Two methods:
o CBOW: Predicts a word from surrounding words.
o Skip-Gram: Predicts surrounding words from one word.
Example sentence:
→ “The cat sits on the mat.”
CBOW: Input = [“The”, “cat”, “on”, “the”, “mat”], Predict = “sits”
B. GloVe (by Stanford)
Uses word co-occurrence counts: How often words appear together.
Example:
If “ice” and “cold” often appear together, their vectors will be similar.
C. FastText (by Facebook)
Breaks words into subwords or character n-grams.
Example:
“playing” → “play”, “lay”, “aying” ...
Helps in:
Handling unknown or rare words.
Learning meaning from parts of words.
1.7. Limitations of Basic Embeddings
Problem – Description – Example
Same vector for all meanings – Confusing for the model – "bank" (money) and "bank" (river) have one vector
Can carry biases – Models may learn stereotypes – "man" + "computer" = "programmer", "woman" + "computer" = "homemaker"
1.8. Modern Approach: Contextual Embeddings (BERT)
Now we use models like BERT that create different vectors for the same word based
on the sentence.
Example:
Sentence 1: “She sat by the bank of the river.”
Sentence 2: “He went to the bank to deposit money.”
BERT gives different embeddings for “bank” in each sentence.
Model – How It Works – Special Feature
Word2Vec – Uses context – Simple, fast, good for small data
GloVe – Word co-occurrence – Good at capturing global relationships
FastText – Subword-based – Handles rare/misspelled words
BERT – Context-aware embeddings – Different meaning → different vectors
2. Word Embeddings
Word Embeddings are a technique in Natural Language Processing (NLP) to convert
words into numerical vectors so that machines can understand and process human
language.
Word embeddings represent words as numbers based on their meanings and context.
Unlike one-hot encoding (where each word is just a 1 or 0), word embeddings capture:
Semantic meaning (meaning of the word)
Syntactic relationships (how it is used in grammar)
Similarity between words
2.1. What is One-Hot Encoding?
One-Hot Encoding is a basic method to represent categorical data (like words) in a
numerical format that machines can understand.
In NLP, One-Hot Encoding is used to represent words as binary vectors.
Example:
Let’s say we have a vocabulary of 4 words:
Vocabulary = ["cat", "dog", "apple", "banana"]
Now assign each word an index:
Word – Index
cat – 0
dog – 1
apple – 2
banana – 3
One-Hot Encoded Vectors:
Word – One-Hot Vector
cat – [1, 0, 0, 0]
dog – [0, 1, 0, 0]
apple – [0, 0, 1, 0]
banana – [0, 0, 0, 1]
Each vector is the same length as the vocabulary size (4 in this case), and only one
element is 1, the rest are 0.
Characteristics of One-Hot Encoding:
Feature – Description
Sparse – Most values are 0, only one is 1
No semantic meaning – "cat" and "dog" are unrelated numerically
Easy to implement – Simple to use and understand
No similarity info – Does not show relationships between words
Limitations of One-Hot Encoding:
1. High dimensionality:
o If your vocabulary has 10,000 words, each word will have a 10,000-
length vector.
2. No context or meaning:
o "cat" and "dog" are both animals, but the vectors [1, 0, 0, 0, …] and [0, 1, 0, 0, …] are totally different.
o So, semantic similarity is not captured.
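The sketch below is a minimal, plain-Python illustration of one-hot encoding for the small vocabulary used above.
Program
vocabulary = ["cat", "dog", "apple", "banana"]
def one_hot(word, vocab):
    vector = [0] * len(vocab)        # start with all zeros
    vector[vocab.index(word)] = 1    # set a single 1 at the word's index
    return vector
for word in vocabulary:
    print(word, "->", one_hot(word, vocabulary))
# cat -> [1, 0, 0, 0], dog -> [0, 1, 0, 0], apple -> [0, 0, 1, 0], banana -> [0, 0, 0, 1]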
Summary of One-Hot Encoding:
One-Hot Encoding represents words as binary vectors.
Each vector is mostly 0s, with a single 1 at the position of that word.
It's easy and fast but doesn't capture meaning or relationships between
words.
Used in basic NLP pipelines and before more advanced models like
Word2Vec or embeddings.
2.2. Why Do We Need Word Embeddings?
Earlier, computers used one-hot vectors, where each word had a unique vector. For
example:
"cat" = [1, 0, 0]
"dog" = [0, 1, 0]
"apple" = [0, 0, 1]
Problem: These vectors don't show any relationship or meaning. "cat" and "dog" are both
animals, but this method treats them as completely different.
Word embeddings solve this by placing words in a vector space where similar words are
closer together.
Example: Word Embeddings
Word Vector Representation
king [0.7, 0.9, 0.2]
queen [0.72, 0.91, 0.25]
man [0.6, 0.8, 0.1]
woman [0.62, 0.81, 0.15]
king - man + woman ≈ queen
o [0.7, 0.9, 0.2] - [0.6, 0.8, 0.1] + [0.62, 0.81, 0.15] ≈ [0.72, 0.91, 0.25]
→ queen
This shows the relationship and meaning is captured by the numbers!
2.3. Popular Word Embedding Models
a. Word2Vec
Developed by Google.
Based on the idea: “You shall know a word by the company it keeps.”
Two architectures:
o CBOW (Continuous Bag of Words): Predicts a word from
surrounding words.
o Skip-Gram: Predicts surrounding words from a word.
Example: Sentence: "The dog chased the cat"
Skip-Gram: “dog” → predicts "the", "chased", etc.
b. GloVe (Global Vectors)
Developed by Stanford.
Uses word co-occurrence matrix across the whole corpus.
Combines local context and global statistics.
Example: If "ice" and "cold" appear often together, their vectors will be close.
c. FastText
Developed by Facebook.
Unlike Word2Vec, it breaks words into sub-word units (n-grams).
Handles rare and misspelled words better.
Example:
Word: "running" → broken into "run", "unn", "nni", "nin", "ing"
Good for understanding morphology (word structure)
2.4. Applications of Word Embeddings
Use Case – Description
Sentiment Analysis – Detect whether a review is positive or negative
Machine Translation – Translate between languages like English → Spanish
Text Classification – Categorize emails as spam or not spam
Chatbots & Virtual Assistants – Understand user queries and respond
Search Engines – Show relevant results even if keywords differ
2.5. Challenges in Word Embeddings
1. Bias in Data: If training data contains stereotypes, embeddings may reflect
them.
2. Out-of-Vocabulary (OOV) Words: Words not seen during training won't
have vectors (FastText helps here).
3. Context Ignorance: Traditional embeddings like Word2Vec do not capture
the context of a word in a sentence. For example:
o “I went to the bank to withdraw money.”
o “The boat reached the bank of the river.”
Both use “bank,” but mean different things — solved better by newer models like BERT.
Summary of Word Embeddings
Word embeddings represent words as dense vectors of real numbers.
Similar words have similar vectors.
Word2Vec, GloVe, and FastText are common methods.
Embeddings are used in NLP tasks like translation, classification, sentiment
analysis.
They improve over traditional methods by capturing semantic and syntactic
meaning.
3. Word2Vec model
Word2Vec is a popular technique to generate word embeddings — i.e., to convert words
into dense numerical vectors that capture semantic meaning and contextual similarity.
Developed by Google in 2013, Word2Vec learns word associations from a large text
corpus and places words with similar meanings closer in vector space.
Words that appear in similar contexts have similar meanings.
Example:
Words like king, queen, prince, and royal often appear in similar contexts, so their vectors
will be close to each other in the embedding space.
Word2Vec Architectures: CBOW and Skip-Gram
Word2Vec uses two main model architectures to learn word embeddings:
3.1. Continuous Bag of Words (CBOW)
CBOW tries to predict the target word (center word) using its context words (surrounding
words).
Example:
Sentence:
"The cat sat on the mat."
Suppose we’re focusing on the word "sat" and using a window size of 2 (two words
before and after):
Context words: ["The", "cat", "on", "the"]
Target word: "sat"
CBOW Input → ["The", "cat", "on", "the"]
CBOW Output → "sat"
How CBOW Works Internally:
1. Input layer: Converts each context word into a one-hot encoded vector.
2. Projection layer: A shared hidden layer that maps each input vector to an
embedding.
3. Averaging: Averages all embeddings of the context words.
4. Output layer: Predicts the target word using SoftMax.
CBOW Characteristics:
Faster to train
Works well for frequent words
Context → Target
3.2. Skip-Gram
Skip-Gram tries to predict context words given the center word (opposite of CBOW).
Example:
Sentence:
"The cat sat on the mat."
Target word = "sat"
With window size = 2, the model tries to predict:
["The", "cat", "on", "the"]
Skip-Gram Input → "sat"
Skip-Gram Output → ["The", "cat", "on", "the"]
How Skip-Gram Works Internally:
1. Input layer: One-hot encoded vector of the center word.
2. Hidden layer: Maps input to the embedding.
3. Output layer: Predicts the probability of each context word using softmax.
4. Trains multiple pairs: (“sat”, “cat”), (“sat”, “on”), etc.
Skip-Gram Characteristics:
Works well with rare words
More accurate for large datasets
Target → Context
3.3. CBOW vs Skip-Gram – Comparison Table
Feature – CBOW – Skip-Gram
Task – Predict target word – Predict context words
Input – Context words – Center (target) word
Output – Target word – Context words
Speed – Faster – Slower (but more accurate)
Works well with – Frequent words – Rare words
Data efficiency – Less – More (uses more training pairs)
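The following is a minimal training sketch using the gensim library (parameter names assume gensim 4.x; the toy corpus is made up for illustration). The sg flag switches between the two architectures described above.
Program
from gensim.models import Word2Vec
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "floor"],
    ["the", "cat", "chased", "the", "dog"],
]
# sg=0 -> CBOW (context predicts target); sg=1 -> Skip-Gram (target predicts context)
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
print(cbow.wv["cat"][:5])               # first 5 dimensions of the learned vector for "cat"
print(skipgram.wv.most_similar("cat"))  # words nearest to "cat" in the vector space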
3.4. What Does Word2Vec Learn?
Word2Vec does not memorize words. Instead, it learns:
Words that appear in similar environments have similar meanings
Distances between word vectors reflect semantic closeness
Example Similar Words:
man ↔ boy
France ↔ Paris
walk ↔ run
3.5. Applications of Word2Vec
Application Description
Sentiment Analysis Understand tone of reviews, tweets, etc.
Machine Translation Translate text across languages
Question-Answering Systems Match questions with relevant answers
Chatbots Understand and generate human-like responses
Text Classification Categorize documents/emails/news
4. GloVe model
GloVe is a word embedding algorithm developed by researchers at Stanford University. It
provides a way to convert words into meaningful numeric vectors.
Unlike Word2Vec, which learns embeddings by predicting neighboring words in a sentence,
GloVe learns from word co-occurrence statistics across the whole corpus — combining the
global matrix factorization approach with local context.
GloVe: Global Vectors for Word Representation
4.1. Why GloVe?
GloVe was introduced to fix a limitation in Word2Vec:
Word2Vec uses local context only (based on a sliding window).
GloVe uses global co-occurrence: how frequently words appear together in the entire
dataset.
This allows GloVe to capture fine-grained semantic meaning, including word analogies and
relationships.
Imagine you're trying to understand the meaning of words based on how they appear together
in big collections of text — like how you learn a new word by seeing it used in different
sentences.
GloVe does exactly that, in 3 easy steps:
Step 1: Count How Words Appear Together
GloVe reads a huge amount of text and keeps track of how often each word appears with
every other word.
Example:
If the word “ice” often appears near “cold”, GloVe notes that down.
If “ice” rarely appears near “hot”, it notes that too.
Step 2: Use the Counts to Understand Meaning
GloVe doesn’t just memorize the counts.
It tries to figure out relationships between words using math.
For example:
If "man" and "woman" both appear with similar words (like "person", "human"), their
meanings are related.
If "king" and "queen" appear with words like "royal", "palace", "crown", their meaning is
also similar — but also differs by gender.
GloVe captures this difference and similarity in numbers (vectors).
Step 3: Turn Words into Vectors (Numbers with Meaning)
In the end, each word becomes a vector — a list of numbers like this:
"apple" → [0.12, -0.23, 0.65, ..., 0.02]
"fruit" → [0.11, -0.22, 0.64, ..., 0.01]
Words that mean similar things have vectors that are close together.
GloVe can capture relationships like: "king" - "man" + "woman" ≈ "queen"
4.2. Real-World Applications of GloVe
Chatbots and virtual assistants
Sentiment analysis
Named entity recognition (NER)
Machine translation
Document similarity and clustering
4.3. What Is a Word Co-occurrence Table?
A word co-occurrence table (or matrix) tells us how often words appear near
each other in a sentence or document.
Example Sentence: “I like deep learning"
Now, let's understand who appears next to whom.
Let’s say we use a context window = 1
That means we only care about the 1 word before and 1 word after a given
word.
Step-by-Step: Word Neighbours
Word Left Neighbour Right Neighbour
I — like
like I deep
deep like learning
learning deep —
Now, we count how many times each word appears next to the others.
4.4. Co-occurrence Table (Matrix)
Word        I   like   deep   learning
I           0    1      0       0
like        1    0      1       0
deep        0    1      0       1
learning    0    0      1       0
This table helps GloVe understand:
Which words appear together a lot (strong connection).
Which words never appear together (weak or no connection).
So, it helps computers learn word meanings from patterns of usage.
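The sketch below is a minimal illustration of how such window-1 co-occurrence counts can be collected in plain Python (GloVe itself then fits word vectors to these counts, which is not shown here).
Program
from collections import defaultdict
tokens = "I like deep learning".split()
window = 1
counts = defaultdict(int)
for i, word in enumerate(tokens):
    # look at neighbours within the window on both sides
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            counts[(word, tokens[j])] += 1
for (w1, w2), c in sorted(counts.items()):
    print(w1, w2, c)
# e.g. "like" co-occurs once with "I" and once with "deep", matching the table above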
5. FastText model
FastText is a library and model developed by Facebook’s AI Research (FAIR) for:
1. Word embeddings (like Word2Vec),
2. Text classification (faster and more efficient than traditional models).
Unlike Word2Vec, which treats words as atomic entities, FastText represents words as
collections of character n-grams, allowing it to better understand morphology and handle out-
of-vocabulary (OOV) words.
5.1. Core Concept
1. Subword Embeddings
Each word is represented by a bag of character n-grams.
Example:
For the word where, with n-grams of length 3 (trigrams), FastText represents:
<wh, whe, her, ere, re>
Where < and > are used to signify word boundaries.
So, the embedding for where is computed as the sum of the embeddings of all its n-grams +
the word itself.
5.2. Why Subword Units?
Handles morphologically rich languages better (e.g., Turkish, Finnish).
Improves performance on rare or unseen words.
Learns better semantic relationships due to character-level features.
5.3. FastText Architecture
For Word Embeddings (unsupervised):
FastText is based on the Skip-Gram model:
Predict context words given a target word.
Instead of using just the word embedding, FastText uses the average of all its subword
embeddings.
For Text Classification (supervised):
FastText represents a sentence as a bag of n-gram embeddings and then:
Averages the vectors,
Applies a linear classifier for prediction.
FastText is extremely fast and memory-efficient, making it suitable for mobile and
low-resource environments.
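As a small illustration of the subword idea, the sketch below uses gensim's FastText implementation (parameter names assume gensim 4.x; the two-sentence corpus is made up). Note how a vector can still be produced for "jogging", a word that never appears in the training data.
Program
from gensim.models import FastText
corpus = [
    ["the", "runner", "was", "running", "fast"],
    ["she", "runs", "every", "morning"],
]
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)
# "jogging" is out-of-vocabulary, but FastText builds its vector from
# character n-grams such as <jo, jog, ogg, ggi, gin, ing, ng>
print(model.wv["jogging"][:5])
print(model.wv.similarity("running", "runs"))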
5.4. Advantages of FastText
Feature – Description
OOV Word Handling – Can infer vectors for unseen words using subwords.
Morphology-aware – Effective in morphologically complex languages.
Efficient and Scalable – Handles large datasets quickly.
Compact Models – Quantization reduces model size significantly.
Multi-lingual Support – Pre-trained vectors available in 157+ languages.
6. Overview of Deep Learning models
Text classification is when you train a model to label text — like:
Spam or Not Spam
Positive or Negative
News Topic: Politics, Sports, Tech, etc.
Deep learning helps by learning patterns in the words and their context, better than traditional
rule-based or bag-of-words approaches.
Common Deep Learning Models for Text Classification
1. Feedforward Neural Networks (FNN / MLP)
Idea: Each word/sentence is converted into a fixed-size vector, and passed through dense layers.
✅ Simple and fast
❌ Ignores word order, lacks context understanding
✅ Good for small-scale or basic classification tasks
2. Convolutional Neural Networks (CNN)
Idea: Just like how CNNs find patterns in images, they find key phrases in text (like “not good”
or “extremely helpful”).
✅ Detects important n-grams
✅ Works well with short texts (tweets, reviews)
❌ Doesn’t fully understand long-term word order
3. Recurrent Neural Networks (RNN)
Idea: Reads the text word by word, keeping memory of past words — like a person reading a
sentence.
✅ Keeps word order
❌ Slow to train, hard to capture long-term dependencies
4. LSTM (Long Short-Term Memory) / GRU (Gated Recurrent Unit)
Improved RNNs that can remember context over longer text.
✅ Good for longer sentences and capturing meaning from sequence
✅ Better than basic RNN
❌ Still slower than some newer models
5. BiLSTM / BiGRU (Bidirectional)
Reads text from both directions — left to right and right to left.
✅ Understands full sentence context
✅ Great for sentiment, intent classification
❌ Heavier than one-direction models
6. Attention Mechanism
Helps the model focus on important words (like “not” in “not happy”).
✅ Adds interpretability (we can see what the model focused on)
✅ Boosts performance when combined with LSTM/GRU
❌ Needs more computation
7. Transformer
A game-changer! Doesn't read word-by-word, but processes all words at once with self-
attention to understand relationships.
✅ Fast, parallel, very accurate
✅ Best for large-scale and multilingual tasks
✅ Used as the base for BERT, GPT, etc.
8. BERT (Bidirectional Encoder Representations from Transformers)
Pre-trained by Google. It understands context deeply because it reads text in both directions.
✅ Amazing accuracy for many text tasks
✅ Can be fine-tuned on your own dataset
✅ Handles sarcasm, nuance, and context
❌ Slower to train, large model size
9. DistilBERT, ALBERT, RoBERTa, etc.
Lighter or smarter versions of BERT with tweaks for speed, accuracy, or size.
✅ Faster and smaller
✅ Still powerful
✅ Great for mobile or low-resource use
10. XLNet, GPT, T5, etc.
More advanced transformers that learn from huge corpora and can perform multiple tasks
(classification, translation, summarization).
✅ GPT is better at generation
✅ T5 can treat everything as text-to-text
❌ Need lots of data and compute if training from scratch
7. RNN
7.1. Introduction
Recurrent Neural Networks (RNNs) are a special class of artificial neural networks
designed for modeling sequential data. Unlike traditional feedforward neural networks,
RNNs have a form of memory that allows them to retain information about previous inputs
in the sequence, making them suitable for tasks like language modeling, speech recognition,
time-series prediction, and text classification.
RNNs process sequences step-by-step while maintaining a hidden state vector that stores
contextual information. This design enables RNNs to learn dependencies and patterns across
time or sequence steps, which is crucial in natural language processing (NLP).
7.2. Architecture of RNN
An RNN cell processes the input sequentially and maintains a hidden state h_t that evolves as it encounters each input x_t. The basic flow is:
At each time step t:
h_t = tanh(W_xh · x_t + W_hh · h_(t-1) + b_h)
y_t = W_hy · h_t + b_y
where:
x_t: input at time step t
h_t: hidden state at time step t
y_t: output at time step t
W_xh, W_hh, W_hy: weight matrices
b_h, b_y: bias terms
tanh: non-linear activation function
7.3. Working Example
Suppose you are analyzing a sentence:
"I love deep learning"
Each word is passed one-by-one:
x1 : “I”
x2 : “love”
x3 :“deep”
x4: “learning”
The RNN processes each word while updating its memory (hidden state). This allows
it to recognize that “love” is likely positive because it remembers “I” and its context.
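The sketch below is a minimal NumPy forward pass of the vanilla RNN equations above, run over four toy input vectors standing in for "I", "love", "deep", "learning" (the weights are random and untrained, so the output only demonstrates the recurrence, not a real prediction).
Program
import numpy as np
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 2   # arbitrary toy sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)
x_seq = rng.normal(size=(4, input_dim))        # stand-ins for the four word vectors
h = np.zeros(hidden_dim)
for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden state carries context forward
y = W_hy @ h + b_y                             # output after the last word
print(y)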
7.4. Key Features of RNN
Feature – Description
Sequence Modeling – Maintains temporal order in data.
Memory of Past – Stores previous input information.
Shared Weights – Same weights used across all time steps.
Contextual Awareness – Learns meaning based on sequence.
7.5. Limitations of RNNs
Despite their theoretical power, basic RNNs suffer from major limitations:
Vanishing Gradient Problem
Gradients shrink during backpropagation through time (BPTT).
Model struggles to learn long-term dependencies.
Exploding Gradient Problem
Gradients grow uncontrollably and destabilize training.
Sequential Processing
Cannot easily parallelize across time steps → slow training.
7.6. RNN Variants
To overcome the above issues, several enhanced RNN variants were introduced:
a) Long Short-Term Memory (LSTM)
Introduced memory cells and gates: input, forget, and output gates.
Remembers long-range dependencies effectively.
b) Gated Recurrent Unit (GRU)
Simpler than LSTM with fewer gates (update and reset).
Faster and often equally effective.
c) Bidirectional RNN (BiRNN)
Processes sequence forward and backward.
Useful when full context of the sentence is important.
d) Attention Mechanism
Allows model to focus on important parts of input sequence.
Improves performance in longer sequences.
7.7. Applications of RNN
Domain – Use Case
Text – Sentiment analysis, text classification, language modeling
Speech – Speech recognition, voice-to-text
Finance – Time-series forecasting
Cognitive – Sequence prediction, pattern recognition
Music – Music generation
Robotics – Action sequence modeling
8. Transformers
8.1. What are Transformers?
Transformers are a deep learning architecture. They revolutionized NLP by eliminating the
need for recurrence (used in RNNs/LSTMs), and instead relying entirely on attention
mechanisms to model relationships between words in a sequence — regardless of their
distance.
Transformers allow for parallel processing of sequences, better long-term dependency
modeling, and have become the backbone of models like BERT, GPT, T5, RoBERTa, and
more.
8.2. Key Idea
Instead of processing inputs sequentially like RNNs, Transformers process the entire input at
once, using self-attention to determine which parts of the input are most relevant to each word.
For example: In the sentence “The animal didn’t cross the road because it was tired”, the model
must understand that “it” refers to “animal”. Transformers can learn this via attention
mechanisms.
8.3. Transformer Architecture
Main Components:
a. Input Embedding + Positional Encoding
Input tokens (words) are embedded into vectors.
Since Transformers process inputs in parallel (not sequentially), they need
positional encoding to understand the order of words.
b. Encoder–Decoder Structure
Encoder: Reads and encodes the input sequence.
Decoder: Takes the encoder’s output and generates the target sequence (e.g.,
in translation).
In classification tasks (like sentiment analysis), only the encoder is usually used.
8.4. Self-Attention Mechanism
Key Formula:
For each word, the model computes three vectors:
Q (Query)
K (Key)
V (Value)
Self-attention computes:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
where d_k is the dimension of the key vectors. This helps each word in a sentence attend to all other words, learning context-sensitive representations.
Example:
In “The bank raised interest rates”, the word bank may attend to interest, rates, to
understand its meaning (finance-related, not riverbank).
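The following minimal NumPy sketch applies the scaled dot-product attention formula above to a toy sentence of five tokens (random embeddings and random projection matrices, chosen only to show the shapes and the softmax step).
Program
import numpy as np
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 8
X = rng.normal(size=(seq_len, d_model))     # one embedding per token
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)                                        # how much each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
output = weights @ V                                                   # context-aware representation of each token
print(weights.round(2))   # each row sums to 1
print(output.shape)       # (5, 8)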
8.5. Multi-Head Attention
Instead of one attention calculation, the model runs multiple attention heads
in parallel, each learning different relationships.
Then, it concatenates and projects the outputs.
8.6. Layer Normalization and Feedforward Networks
After multi-head attention, a feedforward network is applied to each token
separately.
Layer normalization and residual connections help with stable training.
8.7. Variants and Popular Transformer Models
Model – Description
BERT – Bidirectional Encoder; great for classification, Q&A
GPT – Decoder-only; excels at text generation
RoBERTa – Robust version of BERT
DistilBERT – Lightweight BERT
T5 – Text-to-text model for multi-task NLP
XLNet – Permutation-based pretraining, better than BERT in some tasks
Vision Transformers (ViT) – Transformer applied to image patches
9. Overview of Text summarization and Topic Models
9.1. What is Text Summarization?
Text summarization is the process of generating a shorter, condensed version of a text
while retaining its essential meaning. The goal is to provide a summary that captures
the most important information in a text, making it easier to understand and process.
There are two main types of summarization:
1. Extractive Summarization:
Extracts portions (sentences, phrases, etc.) directly from the input text and
assembles them into a summary. The focus is on selecting the most important
or representative parts.
2. Abstractive Summarization:
Involves generating entirely new sentences that convey the main idea of the
original text. It uses techniques like paraphrasing and generating novel
sentences. This is closer to human-style summarization.
9.2. How it Works
1. Extractive Summarization
Key steps:
1. Sentence ranking: Sentences are ranked based on their importance
using methods like TF-IDF, cosine similarity, or machine learning
models.
2. Sentence selection: The top-ranked sentences are selected and
concatenated to form a summary.
Techniques:
o TF-IDF (Term Frequency-Inverse Document Frequency): Weights
the importance of words based on how often they appear in the
document and how rare they are in the whole corpus.
o Graph-based Algorithms (e.g., TextRank): Builds a graph where
each node is a sentence, and edges represent the similarity between
sentences. The most central nodes (sentences) are chosen for the
summary.
2. Abstractive Summarization
Key steps:
1. Text understanding: The model reads and understands the full
content of the text, capturing its meaning.
2. Generation: The model generates a concise version of the text,
rephrasing sentences or generating entirely new sentences.
Techniques:
o Sequence-to-Sequence Models: These models use encoder-decoder
architectures (like RNNs, LSTMs) to learn the relationship between
the input (full text) and the output (summary).
o Transformers: Modern models like BERT, T5, and BART have
revolutionized abstractive summarization. They leverage attention
mechanisms and large pre-trained models to generate human-like
summaries.
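To make the extractive approach above concrete, the following is a minimal sketch that scores sentences by the sum of their TF-IDF weights and keeps the top two (a simplification of the ranking step described earlier, not a full TextRank implementation; the example sentences are made up).
Program
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
sentences = [
    "Text summarization produces a shorter version of a document.",
    "Extractive methods select the most important sentences directly.",
    "Abstractive methods generate new sentences to convey the main idea.",
    "The weather was pleasant yesterday.",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)
scores = np.asarray(X.sum(axis=1)).ravel()     # score = sum of TF-IDF weights per sentence
top = np.argsort(scores)[::-1][:2]             # indices of the two highest-scoring sentences
summary = [sentences[i] for i in sorted(top)]  # keep original sentence order
print(" ".join(summary))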
9.3. Popular Models for Text Summarization
1. TF-IDF and Graph-based Methods (Extractive)
o Simple, easy to implement.
o Can be less flexible and produce less coherent summaries.
2. RNN, LSTM-based Models (Abstractive)
o Powerful, but can struggle with long-term dependencies and generate
incoherent summaries.
3. Transformer-based Models (e.g., BERT, T5, BART, GPT)
o State-of-the-art for both extractive and abstractive tasks.
o Pre-trained models fine-tuned on summarization tasks, like BART for
abstractive summarization.
4. Pointer-Generator Networks (Abstractive)
o Combines extractive and abstractive techniques.
o Can copy words directly from the source text while also generating new
words.
9.4. Applications of Text Summarization
News aggregation: Summarizing news articles to provide quick overviews.
Scientific research: Generating abstracts from long research papers.
Customer reviews: Summarizing product reviews for easy understanding.
Legal documents: Summarizing lengthy contracts and legal documents.
10. Topic Models
10.1. What are Topic Models?
Topic modeling is a technique used to discover the latent themes or topics within a
collection of documents. It helps in identifying the underlying topics that a set of
documents is about, without requiring labeled data. This is particularly useful when
dealing with large amounts of unstructured text data.
10.2. How Topic Modeling Works
Topic modeling algorithms aim to uncover groups of words (topics) that frequently
occur together in a collection. Each document is considered a mixture of topics, and
each topic is represented as a probability distribution over words.
10.3. Common Topic Modeling Algorithms
10.3.1. Latent Dirichlet Allocation (LDA)
LDA is the most widely used topic modeling technique. It assumes that each document
is a mixture of topics, and each topic is a distribution over words.
How it works:
1. Each document is represented as a probabilistic mixture of topics.
2. Each word in the document is assigned to a topic based on its
probability.
3. Topics are represented as word distributions, and documents as mixtures
of these topics.
Steps:
1. Initialization: Assign random topics to words in the corpus.
2. Iterative Update: For each word, update its topic assignment based on
the topic distribution in the document and the word distribution across
the entire corpus.
3. Convergence: Repeat until the model converges.
Output: The model outputs two things:
1. The topic-word distribution: The most probable words for each topic.
2. The document-topic distribution: The proportion of each topic in each
document.
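A minimal LDA sketch using scikit-learn is shown below (the four-document corpus is made up; with n_components=2 the model is asked to find two topics, roughly "finance" and "sports" here).
Program
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
docs = [
    "the stock market and economy are growing",
    "investors buy shares in the stock market",
    "the football team won the match yesterday",
    "the coach praised the players after the match",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)              # document-topic distribution
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):    # topic-word distribution
    top_words = [words[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {k}: {top_words}")
print(doc_topics.round(2))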
10.3.2. Non-Negative Matrix Factorization (NMF)
NMF is an alternative to LDA for topic modeling. It factorizes the document-
term matrix into two non-negative matrices: one representing topics and the
other representing the distribution of topics across documents.
Key Difference from LDA:
o NMF can be seen as a more deterministic model compared to LDA's
probabilistic approach.
10.3.3. Latent Semantic Analysis (LSA)
LSA uses singular value decomposition (SVD) to reduce the dimensions of
the document-term matrix. It identifies latent semantic structures in the data,
helping to group similar documents together.
Difference from LDA:
o LSA is more about dimensionality reduction, while LDA focuses on
topic inference.
10.4. Applications of Topic Modeling
Content Recommendation: Analyzing articles, books, or blogs to recommend
related content based on topics.
Customer Feedback: Discovering common topics in customer reviews,
helping businesses understand customer concerns.
Document Organization: Grouping articles, papers, or legal documents by
topic.
News Categorization: Automatically categorizing news articles into topics like
politics, technology, or sports.
UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS
Information retrieval – IR-based question answering – knowledge-based question answering –
language models for QA – classic QA models – chatbots – Design of dialogue systems – evaluating
dialogue systems
1.Information retrieval
1.1. Introduction to Information Retrieval
Information Retrieval (IR) is the science of searching for information in documents, searching for
documents themselves, and also searching for metadata that describe data, and for databases of
texts, images, or sounds.
IR primarily deals with unstructured or semi-structured data such as natural language text, unlike
database systems which handle structured data.
Example: A user types a query into Google like "best tourist places in Kerala". The IR system
must retrieve and rank the most relevant documents/web pages from a massive corpus.
1.2. Architecture of an IR System
An IR system generally includes the following major components:
a) Document Collection
A large set of documents (e.g., articles, web pages, emails) from which information is
to be retrieved.
b) Text Preprocessing
Preprocessing transforms raw text into a more analyzable format.
Tokenization: Splitting text into individual words or phrases (tokens).
Stop Word Removal: Removing frequent words like "the", "is", "in", etc.,
which do not carry significant meaning.
Stemming and Lemmatization: Reducing words to their base or root form.
o Stemming: “running” → “run”
o Lemmatization: “better” → “good”
c) Indexing
Indexing organizes data for efficient searching.
Inverted Index is a key data structure: it maps each word to the list of
documents that contain it.
Speeds up retrieval by avoiding full-document scans.
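A minimal sketch of such an inverted index in Python (the toy documents are illustrative):

from collections import defaultdict

docs = {
    1: "electric cars use lithium ion batteries",
    2: "battery performance depends on climate",
    3: "gasoline cars have combustion engines",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted_index[token].add(doc_id)     # map each word to the documents containing it

print(sorted(inverted_index["cars"]))         # -> [1, 3]
print(sorted(inverted_index["battery"]))      # -> [2]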
1.3. Querying
a) Query Formulation
Users may use keywords or natural language.
Complex queries may include Boolean operators (AND, OR, NOT).
b) Query Processing
The system analyzes the query, expands it (optional), and matches it to documents using
the index.
Query Expansion: Synonyms or related terms are added to improve recall.
Example: Query “car” → expanded to “automobile”, “vehicle”.
1.4. Document Representation Models
a) Boolean Model
Documents and queries are sets of terms.
Uses logical operators:
o AND → Intersection
o OR → Union
o NOT → Difference
Results are binary: relevant or not relevant.
b) Vector Space Model (VSM)
Documents and queries are represented as vectors in a multi-dimensional space.
Relevance is measured using cosine similarity.
c) Probabilistic Models
Estimate the probability that a document is relevant to a query.
Example: BM25 ranking function.
1.5. Ranking of Retrieved Documents
After retrieval, documents are ranked based on their relevance scores.
a) Term Frequency-Inverse Document Frequency (TF-IDF)
TF (Term Frequency): Frequency of term in a document.
IDF (Inverse Document Frequency): Measures how rare a term is.
Formula:
TF-IDF = TF × log(N / df)
where N is the total number of documents and df is the number of documents containing the term.
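A quick worked sketch of this formula in Python (the counts are illustrative; real systems vary in the logarithm base and in smoothing):

import math

N = 3          # total number of documents in the collection
tf = 2         # the term appears twice in this document
df = 1         # the term appears in only one document

tfidf = tf * math.log(N / df)
print(round(tfidf, 3))   # -> 2.197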
b) Cosine Similarity [2M]
Measures angle between query and document vectors.
Value ranges from 0 (no similarity) to 1 (identical direction).
Cosine Similarity is a way to measure how similar two things are (like a document
and a search query) by looking at the angle between them, not how long they are.
Think of each document or query as an arrow (vector).
If the arrows point in the same direction, they are very similar.
If they point in different directions, they are less similar.
Formula
Cosine Similarity = (A · B) / (∥A∥ × ∥B∥)
Where:
A⋅B is the dot product of the two vectors.
∥A∥ and ∥B∥ are the lengths (magnitudes) of the vectors.
Cosine Similarity Values
1 → Exactly the same
0.5 → Somewhat similar
0 → Completely different
Example
If your query is:
"Smartphone with good camera"
And a document talks about:
"Best smartphone with high-quality camera and battery life"
Then both the query and document have similar words (smartphone, camera),
so the cosine similarity will be high.
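A minimal Python sketch of the formula over toy term-count vectors (the vocabulary and counts are illustrative):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# vocabulary: [smartphone, good, camera, battery]
query = [1, 1, 1, 0]
doc   = [1, 0, 1, 1]

print(round(cosine_similarity(query, doc), 2))   # high score: both share "smartphone" and "camera"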
c) BM25 (Best Match 25)
A probabilistic ranking function.
Takes term frequency saturation and document length normalization into
account.
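A small sketch of the per-term BM25 score under the usual formulation (the constants k1 and b and the toy numbers are illustrative assumptions):

import math

def bm25_term_score(tf, df, N, doc_len, avg_len, k1=1.5, b=0.75):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)                   # rarity of the term
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))  # saturation + length normalization
    return idf * norm

# toy numbers: term occurs 3 times in a 100-word doc, appears in 10 of 1000 docs, average length 120
print(round(bm25_term_score(tf=3, df=10, N=1000, doc_len=100, avg_len=120), 3))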
1.6. Evaluation Metrics in IR
To evaluate the effectiveness of an IR system, several performance metrics are used:
a) Precision
Proportion of retrieved documents that are relevant.
Precision= Relevant Documents Retrieved / Total Documents Retrieved
b) Recall
Proportion of relevant documents that were retrieved.
Recall= Relevant Documents Retrieved / Total Relevant Documents
c) F1 Score
Harmonic mean of precision and recall.
F1 = (2 × Precision × Recall) / (Precision + Recall)
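A quick worked sketch of the three metrics with illustrative counts (the system retrieved 8 documents, 6 of them relevant, out of 10 relevant documents overall):

relevant_retrieved = 6
retrieved = 8
relevant_total = 10

precision = relevant_retrieved / retrieved           # 0.75
recall = relevant_retrieved / relevant_total         # 0.60
f1 = 2 * precision * recall / (precision + recall)   # ~0.67

print(round(precision, 2), round(recall, 2), round(f1, 2))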
1.7. Applications of Information Retrieval
Search Engines (Google, Bing)
Digital Libraries (IEEE, ACM, PubMed)
E-commerce (Product search on Amazon)
Legal/Medical Document Search
Question Answering Systems
1.8. Challenges in IR
Ambiguity in Language: Same word, different meanings (polysemy), or
different words, same meaning (synonymy).
Scalability: Handling web-scale data.
User Intent: Interpreting vague or unclear queries.
Ranking Accuracy: Delivering the most relevant content at the top.
Information Retrieval is a foundational technology enabling systems to fetch relevant
data efficiently. With advancements in machine learning and natural language
processing, modern IR systems are becoming increasingly intelligent and user-centric,
significantly enhancing search quality across applications.
※ RANKING TECHNIQUE PROBLEMS
Search Query Example: "Electric car battery performance"
Three Sample Documents
Doc   | Content
Doc A | "Electric cars are popular due to their low emissions. They use lithium-ion batteries."
Doc B | "Battery performance in electric vehicles depends on climate, usage, and charge cycles."
Doc C | "Gasoline cars are still widely used. Performance of internal combustion engines is high."
1. TF-IDF Ranking
TF-IDF scores terms that are frequent in a document but rare in the entire
collection.
Term        | TF in A | TF in B | TF in C
electric    | 1       | 1       | 0
car(s)      | 1       | 0       | 1
battery     | 1       | 1       | 0
performance | 0       | 1       | 1
Doc B has 2 strong terms ("battery", "performance"), both likely to be rare in the
collection.
TF-IDF ranks:
1st – Doc B, 2nd – Doc A, 3rd – Doc C
2. Cosine Similarity
Each document and the query are treated as vectors based on shared words.
Doc A has: ["electric", "cars", "batteries"]
Doc B has: ["battery", "performance", "electric", "vehicles"]
Doc C has unrelated terms.
Vector representation (simplified):
Let’s say the query vector is:
[electric=1, car=1, battery=1, performance=1]
Approximate similarity:
Doc A: 2 matching terms → moderate angle → score ≈ 0.5
Doc B: 3 matching terms → smaller angle → score ≈ 0.75
Doc C: 1 match only → large angle → score ≈ 0.2
Cosine Similarity ranks:
1st – Doc B, 2nd – Doc A, 3rd – Doc C
3. BM25 Ranking
BM25 adjusts for:
Term frequency saturation (penalizes over-repetition),
Document length (normalizes longer docs),
Rarity of terms (like TF-IDF).
Let’s assume:
Doc A is short,
Doc B is medium-length,
Doc C is also short but unrelated.
BM25 scores (simplified):
Doc A: Mentions “electric” and “battery” → score ≈ 2.1
Doc B: Uses 3 main query terms in relevant context → score ≈ 2.8
Doc C: Only "performance" matches → score ≈ 1.0
BM25 ranks:
1st – Doc B, 2nd – Doc A, 3rd – Doc C
Final Ranking Summary (per technique)
Rank TF-IDF Cosine Similarity BM25
1st Doc B Doc B Doc B
2nd Doc A Doc A Doc A
3rd Doc C Doc C Doc C
In this real-time example:
Doc B consistently ranks highest because it directly discusses battery
performance in electric vehicles, matching the query strongly.
Doc A is related but doesn’t address "performance".
Doc C talks about gasoline cars, so it ranks lowest.
2. IR-based question answering
2.1. Introduction
IR-based Question Answering (IR-QA) is a system that aims to provide direct answers
to user queries by retrieving and extracting relevant information from large text corpora,
such as documents, web pages, or news articles. Unlike traditional Information
Retrieval, which only returns documents, IR-based QA attempts to find the most
relevant answer to a specific natural language question by combining IR techniques
with answer extraction mechanisms.
2.2. Working Architecture of IR-Based QA
The architecture of an IR-based QA system typically involves three major stages:
a) Question Processing
This phase interprets the input question. It involves several steps:
Tokenization: Breaking the question into words.
Stop-word removal: Removing common unimportant words.
POS tagging: Understanding the grammatical structure.
NER (Named Entity Recognition): Identifying the type of entity expected
(person, place, date, etc.).
Question Classification: Understanding if the question is asking for a fact,
reason, definition, etc.
Example: For the question "Who invented the telephone?", the system identifies the
expected answer type as Person, and keywords are "invented", "telephone".
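A minimal sketch of this question-processing step using spaCy (an assumed tool choice; it requires the en_core_web_sm model to be installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who invented the telephone?")

keywords = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
print(keywords)                                        # e.g. ['invent', 'telephone']
print([(ent.text, ent.label_) for ent in doc.ents])    # named entities, if any

# a crude answer-type heuristic: "Who" questions expect a PERSON
if doc[0].lower_ == "who":
    print("Expected answer type: PERSON")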
b) Document or Passage Retrieval
In this step, the system searches a large corpus to retrieve the most relevant
documents or passages based on the query terms. It uses traditional IR methods like:
TF-IDF (Term Frequency-Inverse Document Frequency)
BM25 (Best Matching 25)
Cosine Similarity
Inverted Indexing
These techniques score and rank documents by matching query terms to document
terms. The top N documents or passages are selected for the next stage.
c) Answer Extraction
Once relevant texts are retrieved, the system needs to extract the actual answer. It
searches within the passages for specific phrases or named entities that match the
expected answer type. Techniques like pattern matching, regex, part-of-speech tagging,
or pre-trained models are used.
2.3. Simple Example of IR-Based QA
Question: "Who is the founder of Microsoft?"
Step 1: Question Processing
Keywords: "founder", "Microsoft"
Expected answer type: Person
Step 2: Document Retrieval
Top retrieved sentences:
o "Microsoft was founded by Bill Gates and Paul Allen in 1975."
o "Bill Gates co-founded Microsoft Corporation."
Step 3: Answer Extraction
Based on question type (Person) and presence in text → Extract: "Bill Gates"
2.4. Techniques Used in IR-Based QA
Technique Purpose
TF-IDF Weighs importance of words across documents
BM25 Ranks documents using improved TF weighting
Cosine Similarity Measures angle between query and document vectors
NER Extracts specific entities (person, date, location)
POS tagging Helps locate noun phrases or answers grammatically
These techniques are essential in identifying the most appropriate texts from a large
corpus.
2.5. Real-Life Example
Question: "What is the capital of France?"
The query is transformed into keywords: "capital", "France".
IR engine retrieves a sentence like: "Paris is the capital and largest city of
France."
Answer extractor identifies "Paris" as a location → Final Answer: Paris
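Putting the three stages together, here is a toy sketch of the flow with TF-IDF retrieval and a crude capitalized-word heuristic in place of real answer extraction (scikit-learn is an assumed library choice; a production system would use NER here):

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany.",
    "France is famous for its cuisine and wine.",
]
question = "What is the capital of France?"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)
q_vector = vectorizer.transform([question])

scores = cosine_similarity(q_vector, doc_vectors)[0]
best = corpus[scores.argmax()]                       # passage retrieval
print(best)

candidates = re.findall(r"\b[A-Z][a-z]+\b", best)    # crude answer extraction
answer = [c for c in candidates if c.lower() not in question.lower()]
print(answer[0] if answer else "No answer found")    # -> "Paris"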
2.6. Advantages of IR-Based QA
Scalability – Can work with very large document sets or the entire web.
Speed – Fast response due to efficient indexing.
Simplicity – Easy to implement using existing IR libraries like Lucene or
Elasticsearch.
Flexibility – Can handle different types of questions and corpora.
2.7. Limitations of IR-Based QA
Shallow Understanding – Only retrieves information, doesn't
understand complex semantics.
No Reasoning – Cannot perform multi-hop reasoning or inference.
Fails with Implicit Answers – Cannot answer if the answer is not
directly stated.
Overdependence on Keyword Match – Misses paraphrased content if
exact words aren't present.
2.8. Applications of IR-Based QA
Web Search Engines (e.g., Google Snippets)
Chatbots and Virtual Assistants (e.g., Alexa, Siri)
Customer Service QA Systems
FAQ Automation
2.9. Difference Between IR-Based and Knowledge-Based QA
Feature IR-Based QA Knowledge-Based QA
Source Text documents Structured databases or graphs
Output Extracted answer span Exact answer from data
Example Search result from Wikipedia Answer from DBpedia or Wikidata
Approach Statistical retrieval Logical reasoning and querying
2.10. Conclusion
IR-based QA systems provide a foundational approach to answering natural language
questions using text retrieval techniques. Though they may not generate deep or multi-
step answers, they are extremely useful for fact-based, direct-answer questions over
large corpora. With the integration of NLP and machine learning, these systems can
now extract more accurate and meaningful answers, making them a crucial part of
modern QA applications.
3. Knowledge-based question answering
Knowledge-Based Question Answering (KB-QA) is a subfield of QA systems that
focuses on answering user questions using structured knowledge representations,
such as knowledge graphs, ontologies, or relational databases, instead of plain text
documents. Unlike IR-based QA, which retrieves relevant documents or text passages,
KB-QA aims to return precise answers by querying a formal knowledge base.
Goal: Retrieve exact answers (like "Paris") from structured data, not just related
documents or sentences.
3.1. What Is a Knowledge Base?
A knowledge base (KB) is a structured repository of facts, usually stored as triples
(subject, predicate, object) in a graph format.
Example triple: (Microsoft, Founder, Bill Gates)
Popular public knowledge bases include:
DBpedia (structured data from Wikipedia)
Wikidata
Freebase
YAGO
Google Knowledge Graph
3.2. Architecture of Knowledge-Based QA System
A typical KB-QA system includes the following major components:
a) Question Understanding
Converts natural language questions into a logical query or semantic
representation.
NLP tools like NER, POS tagging, and dependency parsing are used.
b) Entity Linking
Identifies and links entities (names, places, etc.) in the question to
corresponding nodes in the knowledge graph.
Example: "Who is the CEO of Tesla?" → Entity detected: Tesla, mapped to Tesla Inc.
in KB.
c) Query Generation (Semantic Parsing)
Transforms the question into a query language like SPARQL, which is used to
query knowledge graphs.
SPARQL Query
SELECT ?person WHERE {
  dbr:Tesla_Inc. dbo:keyPerson ?person .
}
Query Breakdown:
dbr:Tesla_Inc. refers to the resource for Tesla Inc. in the DBpedia namespace.
dbo:keyPerson is a property that indicates a significant individual associated
with a company (e.g., founder, CEO, etc.).
?person is a variable that will capture the output — the key person related to
Tesla Inc.
Expected Output
?person
---------------------------
http://dbpedia.org/resource/Elon_Musk
So, the answer is: Elon Musk
He is listed as the key person associated with Tesla Inc. in DBpedia.
d) Answer Retrieval
Executes the SPARQL query on the KB to fetch the exact answer.
Post-processing might involve re-ranking or filtering to choose the most
relevant result.
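A hedged sketch of executing such a query from Python with the SPARQLWrapper package (an assumed choice; the endpoint, the resource IRI, and the property name follow DBpedia conventions and may need adjusting on the live endpoint):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")    # public DBpedia endpoint
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person WHERE {
        <http://dbpedia.org/resource/Tesla,_Inc.> dbo:keyPerson ?person .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()                      # runs the query over the network

for row in results["results"]["bindings"]:
    print(row["person"]["value"])                       # e.g. http://dbpedia.org/resource/Elon_Musk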
3.3. Simple Example
Question: "Who is the founder of Facebook?"
Step 1: Detect entity → Facebook
Step 2: Understand relation → "founder"
Step 3: Generate query → Search for triples where Facebook is subject and predicate
is "founder"
Query Result: → Mark Zuckerberg
Final Answer: Mark Zuckerberg
3.4. Real-World Example Using Wikidata
Question: "What is the capital of Germany?"
Entity: Germany → Q183 (Wikidata ID)
Property: Capital → P36
SPARQL Query:
SELECT ?capitalLabel WHERE {
  wd:Q183 wdt:P36 ?capital .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
Output: Berlin.
3.5. Key Features of KB-QA
Feature Description
Structured Answers Based on factual triples, not text
Precision Provides exact answer (no paraphrasing)
Explainability Easy to trace the source of an answer
Query Language Uses SPARQL or SQL to extract data
3.6. Advantages of KB-QA
High Accuracy: Precise answers from trusted data.
Explainable: Easy to verify how the answer was derived.
No Ambiguity: Uses structured relationships, not text similarity.
Efficient Querying: Uses graph traversal and indexing.
3.7. Limitations of KB-QA
Limited Coverage: Cannot answer questions if the KB lacks data.
Entity Linking Errors: Mistakes in identifying entities lead to wrong answers.
Hard to Handle Complex Questions: Multi-hop or ambiguous questions are
difficult.
Requires Structured Data: Not suitable for open-ended or subjective
questions.
3.8. Comparison: IR-Based QA vs KB-QA
Feature IR-Based QA Knowledge-Based QA
Source Text corpus Knowledge graph
Output Extracted sentence or passage Exact fact/triple
Accuracy Approximate High precision
Speed Fast Slower for complex queries
Example Retrieve Wikipedia page Query Wikidata or DBpedia
3.9. Applications of KB-QA
Search Engines: Google’s “Knowledge Panel”
Digital Assistants: Siri, Google Assistant, Alexa
Chatbots: Banking and travel assistants
Academic QA: Biomedical QA over PubMed KB
E-commerce: QA over product knowledge graphs
3.10. Conclusion
Knowledge-based Question Answering is an advanced method for returning factual,
direct answers from structured data sources. By converting user questions into graph
queries, KB-QA provides highly accurate and explainable responses. While
powerful, it is limited by the coverage and completeness of the underlying knowledge
base. KB-QA systems play a critical role in today’s intelligent agents and digital
assistants, enabling them to provide reliable, fact-based answers in real-time.
4. Language Models for QA
4.1. Introduction
A Language Model (LM) is a machine learning model that understands and generates
human language. In Question Answering (QA), language models are used to interpret
user questions, understand context, and generate or extract answers from a given
knowledge source like text, documents, or a database.
Traditional QA systems used rule-based or keyword-matching techniques, but modern
QA systems use deep learning-based language models for better accuracy and
contextual understanding.
4.2. What Is a Language Model?
A language model assigns probabilities to sequences of words. It can:
Predict the next word (e.g., "The sun is ___")
Fill in blanks (e.g., "The capital of France is ___")
Understand sentence meaning and context.
Answer questions using trained knowledge.
4.3. Role of Language Models in QA
Language models help in:
Understanding the question (e.g., classifying it as 'what', 'who', 'when', etc.)
Extracting answers from unstructured text (extractive QA)
Generating answers in natural language (generative QA)
Handling ambiguity and context through deep learning.
4.4. Types of Language Models for QA
Model Type            | Description                                      | Example Models
Statistical LMs       | Use probability to predict the next word        | N-gram models
Neural LMs            | Use neural networks to capture context          | RNN, LSTM
Transformer-based LMs | Use attention mechanisms for deep understanding | BERT, GPT, T5, RoBERTa
4.5. Major Language Models Used in QA
a) BERT (Bidirectional Encoder Representations from Transformers)
Trained on large corpus like Wikipedia and Books Corpus
Used in extractive QA: identifies answer span from context
Strength: Understands both left and right context
Example:
Question: "Who is the president of the United States?"
Context: "Joe Biden is the 46th president of the United States."
Answer: "Joe Biden"
b) GPT (Generative Pretrained Transformer)
Used in generative QA: forms full-sentence answers
Strength: Good for conversational agents and open-domain QA
Example:
Question: "What is the capital of Italy?"
Answer (generated): "The capital of Italy is Rome."
c) T5 (Text-To-Text Transfer Transformer)
Converts all NLP tasks into text generation problems
Used for QA, translation, summarization, etc.
Example:
Input: question: Who wrote Hamlet? context: Hamlet was written by Shakespeare.
Output: William Shakespeare
4.6. Architecture of LM-Based QA System
1. Input Layer: Natural language question (e.g., “Where is Eiffel Tower?”)
2. Encoding: Input is tokenized and passed through layers of the model
3. Contextual Understanding: Model attends to relevant context (e.g., travel
article)
4. Answer Prediction:
o Extractive QA: Selects a span from context
o Generative QA: Forms answer from scratch
5. Output Layer: Final answer
4.7. Tasks in LM-Based QA
Task                     | Description                               | Example
Open-Domain QA           | Answer any question using web-scale data  | "What is AI?"
Reading Comprehension QA | Given a passage, find the answer span     | Answering based on a story
Closed-Book QA           | No context given; the model uses memory   | "Who discovered gravity?"
Conversational QA        | Multi-turn dialogue                       | Chatbots, Virtual Assistants
4.8. Real-World Applications
1. Google Search Snippets – BERT improves snippet answers
2. Siri, Alexa, Google Assistant – Powered by language models
3. Chatbots in healthcare and banking – LM-based QA for FAQs
4. Educational Tools – Instant answers and explanations
5. Customer Support Bots – Understand queries and provide automated answers
4.9. Advantages of LM-Based QA
Contextual Understanding – Can handle ambiguity
High Accuracy – Especially with models like BERT and GPT
Scalable – Can be trained on millions of documents
Versatile – Used in many types of QA (factual, opinion, multi-turn)
4.10. Limitations
Requires huge data and compute to train
May hallucinate answers in generative QA
Biases in Training Data may reflect in answers
Interpretability is challenging
Sample Question and Answer Flow
Question: “What is the currency of Japan?”
Context: “Japan uses the Yen as its official currency.”
Answer (BERT): Yen
5.Classic QA Models
5.1. What is a Question Answering (QA) System? [2M]
A QA system is a computer program that gives direct answers to questions asked in
natural language (like English). Instead of giving a full document like a search engine,
it tries to give just the answer.
5.2. What are Classic QA Models? [2M]
Classic QA models are the older methods used for answering questions before deep
learning tools such as BERT and ChatGPT emerged. They relied on simple techniques
such as:
Rules
Search engines
Basic grammar tools
Templates
These systems were used mostly for fact-based questions like:
Who is the president of the USA?
What is the capital of Japan?
When was the telephone invented?
5.3. How do Classic QA Models Work?
They usually follow three simple steps:
1. Understand the question – Find what the user wants.
2. Find related documents – Search for useful text.
3. Extract the answer – Pick the best sentence or word as the answer.
5.4. Types of Classic QA Models
A. Rule-Based QA Models
These use if-then rules to find the answer.
Useful when sentences follow a known pattern.
Example:
Question: Who discovered gravity?
Sentence: "Isaac Newton discovered gravity."
→ The rule is: look for a name after the word "discovered"
Answer: Isaac Newton
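A toy sketch of that rule in Python using a regular expression (the pattern is illustrative and only handles this simple sentence shape):

import re

sentence = "Isaac Newton discovered gravity."

# rule: capture the capitalized name that appears immediately before "discovered"
match = re.search(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)\s+discovered", sentence)
if match:
    print(match.group(1))   # -> "Isaac Newton"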
B. IR-Based QA (Information Retrieval)
Works like Google Search.
It searches for documents that match the question, then finds the answer from
those documents.
Example:
Question: What is the capital of France?
Text found: "Paris is the capital of France."
Answer: Paris
C. Template-Based QA
Uses fixed question formats (templates) and fills them with answers from a
database.
Example:
Template: "Who is the president of [Country]?"
Question: "Who is the president of India?"
The system finds: India → Droupadi Murmu
Answer: Droupadi Murmu
D. Shallow NLP + Machine Learning
Uses tools like:
o Part-of-speech tagging (nouns, verbs, etc.)
o Named Entity Recognition (NER) to find names, places, dates
A simple machine learning model (like Decision Tree or SVM) is used to pick
the answer.
Example:
Question: When did World War 2 end?
Text: "World War 2 ended in 1945."
The system finds the date using NER
Answer: 1945
Real-Life Example
Q: What is the largest planet in the solar system?
Classic QA process:
1. It searches documents or Wikipedia pages.
2. Finds a sentence: “Jupiter is the largest planet in the solar system.”
3. Extracts answer: Jupiter
5.5. Popular Classic QA Systems
System What It Did
ELIZA Early chatbot with basic conversation rules (1960s)
START First system to answer full English questions (MIT)
TREC QA A test system to evaluate QA performance
5.6. Advantages of Classic QA Models
Simple to build
Fast and rule-based
Works well for specific domains (like weather, travel, etc.)
Easy to debug (you can see how it got the answer)
5.7. Limitations
Can’t handle complex or confusing questions
Needs hand-written rules or templates
Doesn’t understand deep meanings
Not good for general topics
Summary
Classic QA models are the first systems to answer natural language questions.
They use rules, templates, and document search.
While simple, they helped build today’s smart AI tools.
They’re still useful in areas with fixed types of questions.
6. Chatbots
6.1. Introduction: What is a Chatbot?
A chatbot is a software application designed to simulate conversation with human
users, especially over the internet. It uses either predefined rules or artificial
intelligence (AI) to understand the user's questions and provide relevant responses in
natural language.
Chatbot = "Chat" (conversation) + "Bot" (robot/program)
Chatbots are used in many fields such as:
Customer service
Healthcare
E-commerce
Education
Banking
6.2. How Does a Chatbot Work?
The basic working of a chatbot involves the following steps:
Step 1: User Input
The user types or speaks a message.
Step 2: Input Processing
The chatbot uses Natural Language Processing (NLP) to understand the text.
This includes:
o Identifying the intent (What is the user asking?)
o Extracting entities (Names, dates, locations)
Step 3: Response Generation
Based on the input, the bot either selects a predefined answer or generates one
using AI.
Step 4: User Response
The bot sends the response back to the user in text (or voice).
6.3. Types of Chatbots
A. Rule-Based Chatbots
These bots follow fixed rules or decision trees.
If the user's message matches a certain pattern, the bot gives a predefined
answer.
No learning ability.
Example:
User: "Hi"
Bot: "Hello! How can I help you?"
Advantages:
Simple to build
Predictable behavior
Limitations:
Can only answer limited types of questions
Gets confused if question format changes
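A tiny sketch of a rule-based bot in Python: each rule maps a fixed pattern to a fixed reply, with a fallback when nothing matches (the patterns and replies are illustrative, and the substring matching is deliberately naive):

rules = {
    "hi": "Hello! How can I help you?",
    "hours": "We are open from 9 AM to 6 PM.",
    "bye": "Goodbye! Have a nice day.",
}

def reply(message):
    text = message.lower()
    for pattern, response in rules.items():
        if pattern in text:                      # first matching rule wins
            return response
    return "Sorry, I did not understand that."   # rule-based bots fail outside their rules

print(reply("Hi"))                       # -> "Hello! How can I help you?"
print(reply("What are your hours?"))     # -> "We are open from 9 AM to 6 PM."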
B. AI-Based Chatbots
These bots use machine learning and NLP to understand and respond.
They can learn from previous conversations.
Better at handling free-flow, human-like conversations.
Example:
User: "Book a table for 2 people at 7 PM"
Bot: "Sure! Your table has been booked for 7 PM."
Advantages:
Can handle many types of questions
Learns and improves over time
Limitations:
Needs training data
More complex to build
6.4. Architecture / Components of a Chatbot
A chatbot typically includes:
Component Function
User Interface (UI) The platform where users type/speak (e.g., website, app, WhatsApp)
NLP Engine Understands the user's language and intent
Intent Recognizer Identifies what the user wants (e.g., book ticket, ask time)
Entity Extractor Extracts keywords (date, place, number)
Dialogue Manager Maintains the flow of conversation
Response Generator Creates or selects an appropriate reply
Backend/Database Stores information (e.g., booking details, FAQ data)
6.5. Real-World Examples of Chatbots
Chatbot Purpose
Amazon Alexa / Google Assistant Voice-based personal assistant
Siri (Apple) Voice assistant for iOS users
Zomato / Swiggy bots Food ordering support
Banking bots (HDFC EVA) Balance inquiry, transaction info
Educational bots (Duolingo) Language learning via conversation
6.6. Sample Conversation
Zomato/Swiggy Chatbot – Sample Food Ordering Conversation
(Detailed)
User:
Hi, I’m hungry. Can you suggest something good?
Bot (Step 1 - Greeting + Intent Recognition):
Hi there! 😊 I’d love to help you with that.
Are you in the mood for something spicy, sweet, or maybe a full meal?
📌 [Intent Detected: Food Recommendation]
User:
Spicy food sounds good!
Bot (Step 2 - Suggest Items Based on Preference):
Yum! Here are some spicy dishes near you:
1. Paneer Tikka – ₹150 (from Spice Villa)
2. Chicken Biryani – ₹220 (from Biryani Express)
3. Spicy Schezwan Noodles – ₹180 (from Wok n Roll)
Would you like to order any of these?
📌 [Entity Detected: Taste Preference = Spicy]
User:
Yes, I’ll go with Chicken Biryani.
Bot (Step 3 - Confirm Dish Selection):
Great choice! 🍗 One plate of Chicken Biryani from Biryani Express – ₹220.
Should I proceed to checkout or would you like to add more items?
📌 [Entity Extracted: Dish = Chicken Biryani]
📌 [Restaurant = Biryani Express]
User:
Add a soft drink too.
Bot (Step 4 - Add-On Handling):
Done! 🥤 Added one soft drink – ₹40.
Order Summary:
Chicken Biryani – ₹220
Soft Drink – ₹40
Total: ₹260
Shall I proceed to payment?
User:
Yes, go ahead.
Bot (Step 5 - Order Placement):
Redirecting to payment... 💳
✅ Payment Successful!
🎉 Your order has been placed!
Estimated delivery time: 35 minutes
You can track your order live from the app. Enjoy your meal! 😋
NLP Breakdown of Conversation
Turn Intent Entities Extracted
1 Greet / Ask for Recommendation -
2 Food Preference Taste = Spicy
3 Dish Selection Dish = Chicken Biryani
4 Add Item Item = Soft Drink
5 Payment Confirmation -
Behind the Scenes Components:
Component Role
NLP Engine Detects intent & extracts entities (dish name, preferences)
Dialogue Manager Manages turn-by-turn flow and remembers context
Menu API Fetches real-time dishes from restaurants
Payment Gateway Handles order payment securely
Order Tracking System Provides delivery updates
6.7. Advantages of Chatbots
24/7 availability: No human needed at all times
Instant replies: Fast support
Scalable: Can handle thousands of users at once
Cost-effective: Reduces need for human staff
Improves customer satisfaction with quick service
6.8. Limitations of Chatbots
Limited understanding in some cases (e.g., sarcasm, complex language)
Fails with unexpected questions (especially rule-based ones)
Need frequent updates to stay relevant
May sound robotic if not well designed
6.9. Tools and Platforms to Build Chatbots
Tool Type
Dialogflow (Google) Cloud NLP-based platform
Rasa Open-source framework
IBM Watson Assistant Enterprise-level bot builder
Microsoft Bot Framework Supports multi-channel chatbots
ManyChat / Chatfuel For Facebook Messenger bots
6.10. Use Cases of Chatbots
Domain Use
Education Quiz bots, tutor bots
Banking Transaction info, balance check
Healthcare Symptom checkers, appointment booking
E-commerce Order tracking, product recommendations
Tourism Travel bookings, virtual guides
7. Design of dialogue systems
A Dialogue System (also called a conversational agent) is an intelligent system that
interacts with users using natural language — either spoken or written — to
accomplish specific tasks or provide information.
7.1. What is a Dialogue System?
A dialogue system simulates human conversation. It can be found in:
Chatbots (e.g., Zomato, Swiggy)
Virtual Assistants (e.g., Siri, Alexa)
Voice-enabled customer support
Smart appliances and cars
Purpose:
Answering queries
Completing tasks (e.g., bookings, shopping)
Having conversations (social, entertainment, therapy)
7.2. Components of a Dialogue System
Component                               | Function
1. Input Module                         | Takes user input in text or speech. If speech: uses Automatic Speech Recognition (ASR) to convert it to text.
2. Natural Language Understanding (NLU) | Extracts meaning: detects intent and entities.
3. Dialogue Manager (DM)                | Controls the flow. Decides what to do next using state tracking and dialogue policy.
4. Natural Language Generation (NLG)    | Converts the system response (internal logic) into human-readable text.
5. Output Module                        | Delivers the response via text, or Text-To-Speech (TTS) if it is a voice bot.
Example Flow:
User: “Book a flight from Chennai to Delhi on May 10”
→ ASR → Text
→ NLU → Intent: Flight Booking, Entities: Chennai, Delhi, May 10
→ Dialogue Manager → Ask for passenger name
→ NLG → "Can you please provide the passenger’s name?"
7.3. Types of Dialogue Systems
Type          | Description                                    | Example
Task-Oriented | Goal is to complete a task (booking, ordering) | Zomato bot, airline booking
Chat-Oriented | Casual or social conversation                  | ChatGPT, Replika
Hybrid        | Combines both capabilities                     | Alexa, Google Assistant
7.4. Dialogue System Architectures
A. Rule-Based Systems
Handcrafted rules
Best for small domains
Easy to implement but not scalable
B. Frame-Based (Slot-Filling)
Collects pieces of information (slots) like name, location, date
Used in most commercial chatbots
C. Machine Learning-Based
Uses models like:
o RNNs, Transformers for understanding
o Reinforcement Learning for policy decisions
Dynamic, learns from data
D. End-to-End Neural Dialogue Systems
Use deep learning to map input directly to output
Based on Seq2Seq, Transformers, or GPT models
7.5. Dialogue Manager – The Brain of the System
The Dialogue Manager has two parts:
➤ A. Dialogue State Tracker:
Keeps track of current conversation state
Stores filled slots, previous turns
➤ B. Dialogue Policy:
Decides what to do next
Can be:
o Rule-based (IF this THEN that)
o ML-based (Reinforcement Learning)
7.6. Design Steps of a Dialogue System
1. Define Domain
e.g., Food delivery, airline booking, campus FAQs
2. List Use Cases & Intents
e.g., "Order food", "Track delivery", "Cancel order"
3. Define Slots (Entities)
e.g., Food item, address, time
4. Create Dialogue Flow / Story Paths
Design how a conversation unfolds using flowcharts
5. Build NLU Components
Use tools like Rasa, spaCy, BERT to extract intents/entities
6. Implement Dialogue Manager
Manages states, handles logic, guides interaction
7. Generate Responses (NLG)
Use templates or generate with AI (e.g., "Your pizza will arrive in 30
minutes!")
8. Test and Train
Use conversations logs, user testing
7.7. Example – Food Ordering Dialogue System
User: I want to order a burger.
Bot: Great! Would you like Veg or Chicken?
User: Chicken.
Bot: How many?
User: Two.
Bot: Okay! 2 Chicken burgers added to your cart. Proceed to checkout?
Slots:
Dish: Burger
Type: Chicken
Quantity: 2
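A minimal sketch of the slot-filling logic behind this conversation (slot names and prompts are illustrative; a real dialogue manager would also perform intent and entity detection):

slots = {"dish": None, "type": None, "quantity": None}
prompts = {
    "dish": "What would you like to order?",
    "type": "Veg or Chicken?",
    "quantity": "How many?",
}

def next_prompt(state):
    for slot, value in state.items():
        if value is None:
            return prompts[slot]           # ask for the first missing slot
    return "Okay! {quantity} {type} {dish}(s) added. Proceed to checkout?".format(**state)

slots["dish"] = "burger"
print(next_prompt(slots))                  # -> "Veg or Chicken?"
slots["type"] = "chicken"
slots["quantity"] = 2
print(next_prompt(slots))                  # -> confirmation message once every slot is filled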
7.8. Evaluation of Dialogue Systems
Metric Meaning
Task Completion Rate Was the user's goal achieved?
Turn Success Rate Were responses accurate and helpful?
User Satisfaction Score Based on user feedback
Error Rate Number of misinterpretations
Dialogue Efficiency How many turns to complete the task?
7.9. Tools to Build Dialogue Systems
Tool Features
Rasa (Open-source) Python-based, customizable
Dialogflow (Google) UI-based, easy to use
Microsoft Bot Framework Integrates with Azure
Amazon Lex Used for Alexa
GPT APIs Powerful for open-domain conversation
A dialogue system is an intelligent interface that processes natural language and
responds in a conversational manner. It is a combination of multiple NLP, ML, and
design components and requires structured planning, training, and testing to ensure a
natural and helpful user experience.
8. Evaluating Dialogue Systems
8.1. Introduction
Dialogue systems are interactive systems designed to converse with users and
perform tasks.
Evaluation ensures these systems work correctly, efficiently, and offer a good
user experience.
A good evaluation framework measures both technical accuracy and human
satisfaction.
8.2. Types of Evaluation Approaches
A. Objective Evaluation (Quantitative)
Based on measurable, system-centric metrics.
Suitable for task-oriented dialogue systems.
Key Metrics:
Task Completion Rate: % of successful tasks completed (e.g., bookings done).
Slot-Filling Accuracy: How correctly the system extracts key values (e.g., date,
destination).
Intent Recognition Accuracy: Whether the bot correctly understands the user’s
intent.
Dialogue Efficiency: Measures how few turns are needed to complete a task.
Response Time: Average time the bot takes to respond.
B. Subjective Evaluation (Qualitative)
Based on human feedback and perception.
Useful for evaluating user experience and naturalness.
Key Aspects:
User Satisfaction: Overall experience; often rated on a scale (e.g., 1 to 5).
Naturalness: How human-like the conversation feels.
Engagement: User interest and retention during the conversation.
Ease of Use: Simplicity and intuitiveness of interaction.
Coherence: Logical flow and consistency in responses.
C. Mixed Evaluation (Hybrid)
Combines both automated logs and human feedback.
Helps in more realistic and comprehensive analysis.
Techniques Used:
A/B Testing: Compare two versions of a bot with different users.
Simulated Users: Use programmed agents to test large-scale interactions.
Dialogue Log Analysis: Review conversations for errors, abandonment, or
repetition.
8.3. Evaluation Metrics in Detail
Metric                | Description                                                    | Example
Task Completion Rate  | Goal achieved or not                                           | 85 out of 100 flight bookings successful → 85%
Slot-Filling Accuracy | Correct extraction of user data                                | Location = "Mumbai", Date = "May 1st"
Intent Recognition    | Correct classification of user intent                          | "I want to book a room" → Intent = RoomBooking
Dialogue Efficiency   | Turns needed to finish the task                                | 5 turns is better than 12 for the same task
Response Time         | Time taken by the system to reply                              | 1.2 seconds is considered fast
BLEU/ROUGE Score      | Compares the generated response to a reference human response | Used for open-domain chatbots
8.4. Tools Used for Evaluation
Tool/Platform Purpose
Rasa X Dialogue monitoring and improvement
Dialogflow Built-in analytics and intent tracking
Simulated Users Automated evaluation at scale
A/B Testing Tools Compare different versions of the dialogue system
Manual Review Human evaluation of naturalness and correctness
8.5. Example: Food Ordering Bot Evaluation
Conversation:
User: I want pizza
Bot: Veg or non-veg?
User: Veg
Bot: How many?
User: Two
Bot: Great! Your order for 2 veg pizzas has been placed.
Evaluation Results:
Task Completion: Yes
Slot Accuracy: 100%
Average Response Time: 1.2 seconds
User Satisfaction: 4.5 / 5
Turns Taken: 4 (Efficient)
8.6. Challenges in Dialogue System Evaluation
Open-ended conversations are hard to quantify.
High BLEU score doesn’t always equal better conversation quality.
Users often give ambiguous or unexpected inputs.
Difficult to measure long-term engagement or learning-based improvements.
8.7. Conclusion
Dialogue system evaluation is essential to ensure the system is both technically
sound and user-friendly.
Using a combination of objective and subjective evaluations gives a complete
picture.
Continuous testing with real users, simulated agents, and analytics tools helps
improve system performance and adapt to user needs.
Ultimately, a well-evaluated dialogue system leads to higher satisfaction,
trust, and real-world success.
PREVIOUS YEAR QUESTIONS WITH ANSWERS
How will you apply NLP technique for relevant information retrieval
from web?
Natural Language Processing (NLP) plays a vital role in enabling intelligent systems to
search, extract, and present relevant information from vast sources on the web.
Traditional keyword-based search engines often return a broad range of results, many
of which may not align with the user's actual intent. NLP improves this process by
understanding the meaning behind user queries, processing large volumes of
unstructured data, and delivering contextually relevant results.
1. Understanding User Query Semantics
The first step in information retrieval (IR) using NLP is to understand the user's query
in natural language. Instead of treating queries as a bag of keywords, modern systems
use:
Tokenization: Breaking the query into meaningful words or phrases.
Part-of-Speech (POS) Tagging: Identifying the grammatical structure of the
query.
Named Entity Recognition (NER): Detecting important entities like people,
places, organizations, and dates (e.g., "latest research on COVID-19 vaccines").
Intent Detection: Using machine learning to determine what the user actually
wants (e.g., looking for reviews vs. purchasing information).
By extracting syntactic and semantic structure, the system better understands whether
a user is asking for definitions, comparisons, recent news, or specific products.
2. Query Expansion and Reformulation
Users often provide incomplete or vague queries, such as “top universities”. NLP can
improve such queries using:
Synonym Expansion: Adding similar or related terms (e.g., "top universities"
→ "best colleges", "ranking of institutions").
Spelling Correction: Fixing typos using edit distance or context-aware
correction.
Contextual Rewriting: Reformulating ambiguous queries into more specific
ones using user history or conversational context.
Example:
User Query: “AI in education”
→ NLP expands it to: “Artificial Intelligence applications in the field of education,
including personalized learning, grading systems, and tutoring.”
3. Information Extraction from Web Content
Once the query is interpreted, NLP is applied to web pages or documents to extract
relevant content:
Web Crawling: Automated bots (spiders) gather web pages.
Text Preprocessing: Cleaning the content (removing HTML, scripts, stop
words).
Document Representation: Documents are converted into structured forms
using:
o TF-IDF (Term Frequency-Inverse Document Frequency)
o Word Embeddings (Word2Vec, BERT)
These help in matching documents to queries based on semantic similarity, not just
word overlap.
4. Relevance Ranking Using NLP Models
NLP helps rank retrieved documents using models such as:
Cosine Similarity: Measures how close the query and document vectors are.
BM25: A probabilistic retrieval function that considers term frequency and
document length.
Transformer-based models (like BERT): These use deep learning to rank
documents based on sentence-level context.
Example:
User query: “How does climate change affect ocean life?”
The system identifies semantic matches even if the document says: “Marine
biodiversity is declining due to rising sea temperatures and ocean acidification.”
5. Summarization and Snippet Generation
Once relevant documents are found, NLP generates summaries or snippets to present
useful content concisely:
Extractive Summarization: Picks key sentences from the original text.
Abstractive Summarization: Generates new sentences that summarize the
content, similar to how a human would.
This helps users quickly judge if the document is relevant without reading the whole
thing.
6. Multilingual and Voice-Based Retrieval
NLP also enables:
Cross-lingual Retrieval: Query in English can retrieve documents in other
languages (using translation models).
Speech-to-Text Integration: Voice assistants (Alexa, Siri) use NLP to convert
spoken queries into text and retrieve answers.
Example:
Spoken Query: “Show me recent news about earthquakes in Japan.”
→ Transcribed, processed using NLP, and relevant articles are shown.
7. Applications in Real-World Systems
System NLP Contribution
Google Search BERT for semantic understanding and ranking
ChatGPT / Bing AI Conversational query handling, summarization
Academic Search Semantic Scholar uses NLP to match research queries
E-commerce Product search, filters, and recommendations using NLP
NLP techniques have transformed information retrieval from keyword matching to
contextual understanding. By processing both the user’s intent and the document
content at a semantic level, NLP allows systems to retrieve and rank information that is
truly relevant. This leads to faster, smarter, and more accurate results, especially in
domains such as search engines, digital assistants, e-learning platforms, and research
portals.
Develop an NLP System for (Online Banking Systems) with diagrammatic representation
1. Introduction
Natural Language Processing (NLP) is a critical technology that allows machines to understand,
interpret, and respond to human language. In online banking systems, NLP helps in improving
customer service, enhancing security, automating support queries, and offering
personalized financial advice.
An NLP-powered banking system enables users to interact through text or voice interfaces,
simplifying operations like checking account balances, reporting lost cards, applying for loans,
or getting fraud alerts — all via natural conversation.
2. Objectives of NLP in Online Banking
Automate Customer Support: Reducing reliance on human agents.
Enhance User Experience: Quick, natural language-based interactions.
Detect Fraud: Identifying suspicious behavior through language patterns.
Provide Personalized Services: Recommend services based on user inquiries.
Improve Accessibility: Voice-based banking for visually impaired users.
3. Architecture of the NLP System
The NLP system for online banking consists of several key modules arranged in a
pipeline.
Block Diagram Representation
+----------------------------+
|       User Interface       |
|      (Chat/Voice App)      |
+-------------+--------------+
              |
              v
+----------------------------+
|     Text Preprocessing     |
|   (Cleaning, Tokenizing)   |
+-------------+--------------+
              |
              v
+----------------------------+
|  Intent Recognition Model  |
| (BERT, Transformer Models) |
+-------------+--------------+
              |
              v
+----------------------------+
|     Entity Extraction      |
|  (Accounts, Dates, etc.)   |
+-------------+--------------+
              |
              v
+----------------------------+
|      Dialogue Manager      |
|     (Session Tracking)     |
+-------------+--------------+
              |
              v
+----------------------------+
|  Banking Backend Systems   |
|     (Databases, APIs)      |
+-------------+--------------+
              |
              v
+----------------------------+
|     Response Generator     |
|  (Template or NLG Model)   |
+-------------+--------------+
              |
              v
+----------------------------+
|       User Feedback        |
+----------------------------+
4. Detailed Workflow Explanation
4.1. User Interface
Users interact via chatbots, voice assistants, or mobile/web applications.
Inputs can be text or speech (speech-to-text conversion if needed).
4.2. Text Preprocessing
Noise Removal: Remove special characters, HTML tags.
Tokenization: Split sentences into words.
Stop Word Removal: Eliminate common words like "is", "the", etc.
Stemming/Lemmatization: Reduce words to their root form.
4.3. Intent Recognition
Understand the purpose of the user's message.
Example Intents:
o Check Balance
o Report Lost Card
o Request Mini Statement
Model Used: Pre-trained transformer models like BERT, fine-tuned on
banking-specific datasets.
4.4. Entity Extraction
Extract key information from the user's message.
Entities:
o Account Number, Transaction ID, Date, Amount.
Tools:
o spaCy Named Entity Recognition (NER)
o CRF (Conditional Random Fields) models.
4.5. Dialogue Management
Context tracking: Remember previous questions or slots (e.g., if the user says
"block my card", the system asks "which card?").
Tools:
o Rasa Core
o Dialogflow Contexts
4.6. Backend Banking System Integration
Securely connect with banking servers to retrieve real-time data.
Actions:
o Fetch balances.
o Block a card.
o Record a complaint.
Technologies:
o REST APIs, gRPC, OAuth 2.0 Authentication.
4.7. Response Generation
Reply appropriately using:
o Template-Based Responses (e.g., "Your account balance is ₹XX.XX")
o Natural Language Generation (NLG) models for dynamic responses.
4.8. Feedback Loop
Collect user ratings, correction feedback, etc.
Retrain the NLP models periodically to improve performance.
5. Technologies Used
Module Technology
Preprocessing spaCy, NLTK
Intent Recognition BERT, RoBERTa
Entity Recognition spaCy, custom CRF models
Dialogue Management Rasa, Dialogflow, Botpress
Voice Processing Google Speech API, Whisper (OpenAI)
Backend API REST APIs, OAuth 2.0, SSL Encryption
Hosting AWS Lambda, Azure Bot Services
6. Advantages of NLP-Based Banking Systems
24/7 Availability: No downtime, instant responses.
Operational Cost Savings: Reduces need for human customer support.
Enhanced Customer Satisfaction: Quick and accurate query handling.
Fraud Prevention: Early fraud detection through suspicious text patterns.
Accessibility: Enables users to interact hands-free (voice).
7. Conclusion
An NLP system in online banking revolutionizes customer interaction by providing
intelligent, efficient, and personalized service. By integrating advanced NLP models,
secure backend APIs, and dialogue management systems, banks can automate a wide
range of tasks while maintaining high standards of security and user satisfaction. As
artificial intelligence and machine learning techniques evolve, the role of NLP in
banking will become even more significant, paving the way for smarter, conversational,
and proactive banking experiences.
UNIT IV TEXT-TO-SPEECH SYNTHESIS
Overview. Text normalization. Letter-to-sound. Prosody, Evaluation. Signal processing -
Concatenative and parametric approaches, WaveNet and other deep learning-based TTS
systems
Overview of Text-to-Speech (TTS) Synthesis
Definition:
Text-to-Speech (TTS) synthesis is the process of automatically converting written
text into spoken words. It combines linguistics, signal processing, and machine
learning to generate speech that sounds human-like.
Real-Time Example:
Google Maps Navigation:
When you type a destination, it says: “Turn right after 300 meters.” This voice
is generated by a TTS system.
Assistive Technology:
TTS helps visually impaired users to read books, websites, or messages
aloud.
1.Text Normalization
Definition:
Text normalization is the process of preparing raw text by converting non-standard
words (NSWs) like numbers, dates, abbreviations, and symbols into a standard,
readable format for speech synthesis.
Steps Involved:
Numerical Conversion:
“23” → “twenty-three”
Abbreviation Expansion:
“Dr.” → “Doctor”
Date Format:
“12/04/2025” → “twelfth of April twenty twenty-five”
Currency/Symbol Handling:
“$100” → “one hundred dollars”
Real-Time Example:
Text: “Dr. Smith’s appointment is on 12/04/2025 at 10:30 AM.”
Normalized Text: “Doctor Smith’s appointment is on twelfth of April twenty
twenty-five at ten thirty AM.” This step is crucial to avoid robotic or
misinterpreted speech.
Key Operations in Text Normalization
Text normalization means cleaning, expanding, and converting written text into a
standard format that a TTS system can read aloud correctly.
Let’s explore the most important operations one by one:
1. Number Expansion
What it does:
Converts numerals into words.
Example:
Input: "I have 2 apples"
Output: "I have two apples"
Special cases:
Years → 1999 → "nineteen ninety-nine"
Decimals → 3.14 → "three point one four"
Phone numbers → 9876543210 → "nine eight seven six..."
2.Date, Time, and Currency Normalization
What it does:
Expands dates, times, and money into spoken form.
Examples:
Date: "10/04/2025" → "tenth of April twenty twenty-five"
Time: "3:15 PM" → "three fifteen p m"
Currency: "Rs. 500" → "five hundred rupees"
3.Abbreviation and Acronym Expansion
What it does:
Expands abbreviations into full words or spell them out.
Examples:
"Dr." → "Doctor"
"NASA" → "N A S A" (spelled out)
"etc." → "et cetera"
4.Symbol and Unit Expansion
What it does:
Converts symbols and units into spoken words.
Examples:
"%" → "percent"
"kg" → "kilograms"
"&" → "and"
"@" → "at"
5.Punctuation Handling
What it does:
Removes or interprets punctuation for proper phrasing and intonation.
Examples:
"Hello, John!" → Comma helps add a pause.
"Wait... what?" → Ellipsis might affect rhythm/intonation.
Note: Punctuation is not spoken, but influences how speech sounds.
6.Lowercasing (Optional)
What it does:
Converts capital letters to lowercase to avoid confusion.
Example:
"HELLO" → "hello" (unless it's an acronym like NASA)
7.Homograph Disambiguation (Context-based)
What it does:
Handles words that are spelled the same but pronounced differently based on context.
Example:
"I read a book" (past tense: /rɛd/)
"I like to read" (present tense: /riːd/)
This step is usually partially handled during L2S (Letter-to-Sound).
Operation Example Input Normalized Output
Number Expansion "He has 3 dogs" "He has three dogs"
Date/Time/Currency "10/04/2025" "Tenth of April twenty twenty-five"
Abbreviation Expansion "Dr. Smith" "Doctor Smith"
Symbol/Unit Expansion "80 kg" "Eighty kilograms"
Punctuation Handling "Hi, Bob!" (used for pauses, not spoken)
Lowercasing "HELLO" "hello"
Homograph Handling "read" /rɛd/ or /riːd/ based on context
Real-World Example (Full Text):
Input:
"Dr. A.P.J. Abdul Kalam was born on 15/10/1931. He spent Rs. 200 on 3 books."
Normalized:
"Doctor A P J Abdul Kalam was born on fifteenth of October nineteen thirty-one. He
spent two hundred rupees on three books."
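A small sketch of two of these operations in Python, abbreviation expansion and number expansion (the num2words package and the abbreviation table are assumptions for the example; a full normalizer would also handle dates, currency word order, and symbols):

import re
from num2words import num2words

ABBREVIATIONS = {"Dr.": "Doctor", "Rs.": "rupees"}   # toy lookup table

def normalize(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # expand every standalone integer into words
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(normalize("Dr. Smith spent Rs. 200 on 3 books."))
# -> "Doctor Smith spent rupees two hundred on three books."
# (a real system would also reorder the currency to "two hundred rupees")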
2. Letter-to-Sound
2.1. Definition
Letter-to-Sound (L2S) Conversion is the process of converting written characters
(graphemes) into spoken sounds (phonemes).
It plays a critical role in text-to-speech (TTS) systems, enabling them to pronounce
words correctly, especially in languages with non-phonetic spelling (like English).
Objective:
Given an input word (text), the system outputs a phonetic transcription, which tells
the speech synthesizer how to pronounce it.
2.2. Importance in TTS Pipeline
This step comes after Text Normalization and before prosody modeling in a typical
TTS pipeline.
Raw Text → Text Normalization → Letter-to-Sound → Prosody → Signal
Generation → Speech
2.3. Real-Time Example
Word L2S Output (Phonemes) Spoken as
"cat" /k/ /æ/ /t/ “cat”
"schedule" /ˈʃɛd.juːl/ or /ˈskɛd.juːl/ “shed-yule” or “sked-yule”
"ghoti" ??? Tricky! (English is irregular)
2.4. Why Is It Challenging?
Irregular spellings in English (e.g., “colonel” → /ˈkɝː.nəl/)
Homographs (e.g., “lead” as a verb vs noun)
Loanwords (e.g., “croissant”)
New or unseen words (e.g., brand names, URLs)
2.5. Techniques Used for L2S Conversion
A. Rule-Based Approach
Uses handwritten linguistic rules to map sequences of letters to sequences of
phonemes.
How It Works:
Apply rules based on spelling patterns and contextual position of letters.
Example Rules:
“c” before “e”, “i”, or “y” → /s/ (as in “cent”, “circle”)
“gh” at the end of a word → silent (“though”)
Example:
Word: “phone”
Apply rule: “ph” → /f/
Output phonemes: /f/ /oʊ/ /n/
Pros:
Easy to understand and control
Works well for known regular patterns
Cons:
Breaks with exceptions
Cannot handle ambiguous or foreign words
High effort to create and maintain
B. Dictionary Lookup
Uses a precompiled pronunciation dictionary like the CMU Pronouncing
Dictionary.
If the word exists in the dictionary, return its phonetic transcription.
Pros:
Accurate for known words
Fast and reliable
Cons:
Fails with unseen or new words
Cannot scale for all languages/domains
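A minimal sketch of dictionary lookup with the CMU Pronouncing Dictionary via NLTK (an assumed tool choice; it requires nltk.download('cmudict') to have been run):

import nltk
from nltk.corpus import cmudict

pron = cmudict.dict()   # word -> list of ARPAbet phoneme sequences

for word in ["cat", "phone", "gnocchi"]:
    if word in pron:
        print(word, "->", pron[word][0])     # first listed pronunciation
    else:
        print(word, "-> not in dictionary; fall back to a G2P model")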
C. Statistical Models / Data-Driven Approach
Trains a model (like decision trees, HMMs) on a dataset of word → phoneme
pairs.
Learns the probabilistic mappings of graphemes to phonemes.
Example:
Word: “tough” → Model learns that “gh” at end = /f/
Pros:
Can handle unseen words better than rules
Works well with small datasets
Cons:
Struggles with long context or complex words
D. Neural Network-Based / Deep Learning Models
Uses sequence-to-sequence models (LSTM, Transformer) to learn how to
pronounce words.
Treats G2P as a translation task:
input sequence = graphemes → output sequence = phonemes
Example:
Input: “knight” → Output: /n/ /aɪ/ /t/
Model learns:
“k” is silent when followed by “n”
“igh” → /aɪ/
Popular Architectures:
LSTM (Long Short-Term Memory)
Transformer-based models
Tacotron (also includes prosody)
Pros:
Highly accurate
Learns complex patterns and long dependencies
Can generalize to new, unseen words
Cons:
Requires large training datasets
Computationally expensive
Hard to debug (black box)
2.6. Application Areas
Domain Role of L2S
Virtual Assistants Ensures correct pronunciation of names, commands
Navigation Systems Pronounce street names or numbers accurately
Language Learning Helps learners hear correct phonemes
Screen Readers Pronounce content for visually impaired users
Smart Devices For reading messages, notifications aloud
2.7. Hybrid Approaches
Modern TTS systems often use hybrid techniques:
Dictionary lookup first
If word not found, fallback to neural G2P model
Combine with rules for known edge cases
2.8. Diagram – G2P Conversion in TTS Pipeline
Input Text: "The chef cooked gnocchi."
↓ Text Normalization
"The chef cooked gnocchi."
↓ G2P / L2S
"The /ðə/ chef /ʃɛf/ cooked /kʊkt/ gnocchi /ˈnɒki/"
↓ Prosody & Timing
↓ Acoustic Modeling
↓ Vocoder
→ 🔊 Synthesized Speech
Technique       | Description                    | Best For
Rule-Based      | Fixed rules (e.g., regex)      | Small systems, low-resource setups
Dictionary      | Lookup table of word → phoneme | Known words, fast applications
Statistical     | Probabilistic modeling         | Medium-scale TTS, limited data
Neural Networks | Seq2Seq with DL models         | Large, robust, multi-language TTS
What Are Loanwords?
Loanwords are words that a language borrows from another language, usually
without translating them.
They are adopted into the vocabulary, often keeping their original spelling or
pronunciation — or both!
Example: English borrows “ballet” from French.
Why Do Loanwords Exist?
Languages borrow words due to:
Trade and globalization
Technology and innovation
Cultural influence (music, food, fashion, etc.)
Real-World Examples of Loanwords in English
Loanword Origin Language Meaning / Context
Ballet French Classical dance
Tsunami Japanese Giant sea wave
Cliché French Overused idea or phrase
Pizza Italian Flatbread with toppings
Karaoke Japanese Singing to background music
Déjà vu French Feeling of familiarity
Safari Swahili (via Arabic) Journey or expedition
Robot Czech From the play R.U.R. by Karel Čapek
Kindergarten German “Children’s garden”; preschool
Why Are Loanwords Challenging in TTS?
1. Unfamiliar Pronunciations:
TTS systems may mispronounce words from other languages.
2. Non-phonetic Spelling:
Some loanwords retain foreign spelling rules.
3. Accent & Style Sensitivity:
Should "croissant" be pronounced as /ˈkwɑːsɒ̃/ (French-style) or /krəˈsɑnt/
(Anglicized)?
Example Challenges in TTS
Word: "Ballet"
Spelling suggests /bal-et/
But correct is /bæˈleɪ/ (from French)
Word: "Fajita"
Looks like /fa-jee-ta/
Actual: /fəˈhiːtə/ (Spanish pronunciation)
3. Prosody
3.1. What is Prosody?
Prosody refers to the rhythm, stress, and intonation of speech — basically, how we
say things, not just what we say.
It adds emotion, emphasis, and natural flow to speech, making it sound more
human-like and expressive.
3.2. Why is Prosody Important in TTS?
Without prosody:
Speech sounds robotic, flat, or unnatural.
Emotions, intentions, and question vs. statement differences are lost.
With good prosody:
Speech sounds natural and easy to understand.
Listeners can grasp mood, meaning, and emphasis.
3.3. Key Components of Prosody
Component – Description – Example
Pitch (F0) – Highness or lowness of voice (intonation) – “Is it raining?” ↑
Duration – Length of sounds or pauses – “Wait... what?” (pause adds meaning)
Stress – Emphasis on certain syllables or words – “I didn’t say she stole it.” (meaning changes based on stress)
Rhythm – Pattern of syllables, pauses, and speed – “She sells sea shells...”
Intonation – Rise and fall of pitch across a sentence – “He’s going home.” vs. “He’s going home?”
3.4. Real-Life Examples
Example 1: Neutral vs Emotional
Flat TTS: “I am happy to see you.” (monotone)
Natural TTS with prosody: “I’m happy to see you!” (excited tone, stress on
“happy”)
Example 2: Question vs Statement
“She’s coming.” (↘️ falling intonation → statement)
“She’s coming?” (↗️ rising intonation → question)
Example 3: Sarcasm
Prosody helps detect sarcasm even when words stay the same:
o “Great, just what I needed.” (sarcastic tone)
3.5. How is Prosody Modeled in TTS?
Two key methods:
1. Rule-Based (Traditional)
Uses linguistic rules to control pitch, stress, and duration.
Based on punctuation, word type, sentence structure.
Limited naturalness.
2. Data-Driven (Modern TTS)
Uses machine learning or deep learning (e.g., Tacotron, FastSpeech) to
learn prosody patterns from real human speech.
Can model emotion, style, speaker identity, and natural variations.
3.6. Prosody in TTS Pipeline:
Text → Text Normalization → Phoneme Conversion → Prosody Modeling →
Acoustic Features → Speech Signal
The prosody modeling stage adds:
Pitch contours
Durations of phonemes
Pauses
Intonation patterns
These are passed to neural vocoders like WaveNet to produce expressive audio.
3.7. Evaluation of Prosody
Since prosody affects naturalness, we evaluate it using:
Subjective Tests like MOS (Mean Opinion Score)
ABX tests to compare prosodic differences between TTS systems
Objective metrics like F0 prediction error, duration RMSE
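As an illustration of such objective metrics, here is a small sketch for F0 prediction error (RMSE over voiced frames); treating F0 = 0 as "unvoiced" and skipping those frames is a simplifying assumption, not a fixed standard.

```python
import numpy as np

def f0_rmse(f0_ref, f0_pred):
    """RMSE between reference and predicted pitch contours (Hz).
    Frames where either contour is unvoiced (F0 == 0) are skipped."""
    f0_ref, f0_pred = np.asarray(f0_ref, float), np.asarray(f0_pred, float)
    voiced = (f0_ref > 0) & (f0_pred > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_pred[voiced]) ** 2)))

print(f0_rmse([120, 0, 130, 140], [118, 0, 135, 150]))  # ~6.6 Hz
```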
Summary
Aspect Explanation
Definition Rhythm, stress, and intonation in speech
Importance Makes speech expressive, human-like
Components Pitch, stress, duration, rhythm, intonation
Modeling Methods Rule-based, statistical, or deep learning models
Real-Time Examples Questions, sarcasm, excitement, emotion
Final Example
Input Text:
“Can you help me with this?”
TTS with good prosody:
Rising pitch at end (indicating a question)
Emphasis on “help”
Natural pause before “with this”
4. Evaluation
4.1. Definition:
Text-to-Speech (TTS) evaluation is the process of assessing the quality,
intelligibility, and naturalness of the synthesized speech produced by a TTS system.
It helps answer questions like:
“Does the voice sound human-like?”
“Is the speech clear and easy to understand?”
“How accurate is the pronunciation, pitch, and timing?”
4.2. Purpose of Evaluation:
To compare different TTS models
To detect errors or artifacts
To improve training and tuning
To ensure user satisfaction
4.3. Types of Evaluation Methods
TTS Evaluation is broadly divided into two categories:
A. Objective Evaluation
Objective methods use quantitative mathematical metrics — no human listeners
involved.
These metrics focus on comparing the synthetic output to real human speech.
1. Mel Cepstral Distortion (MCD)
Purpose: Measures how spectrally close the synthesized speech is to a natural
recording.
Formula (simplified):
MCD = (10 / ln 10) · √( 2 · Σ_d ( mc_d(target) − mc_d(synthesized) )² )
Lower MCD = better quality
Measures differences in mel-cepstral coefficients
Real-time Example:
If Google TTS has an MCD of 3.5 dB and a new model has 4.1 dB, Google TTS is
more spectrally accurate.
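A small sketch of how MCD could be computed from two already time-aligned mel-cepstral sequences is shown below (in practice the frames are usually aligned with dynamic time warping first, which is omitted here).

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstral sequences of
    shape (frames, coefficients); the 0th (energy) coefficient is excluded,
    as is common practice."""
    mc_ref, mc_syn = np.asarray(mc_ref), np.asarray(mc_syn)
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```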
2. Root Mean Square Error (RMSE)
Purpose: Measures prediction accuracy in features like:
Pitch (F0)
Duration
Energy
Lower RMSE = Better alignment with target
Real-time Example:
If predicted pitch varies too much from real pitch, the TTS may sound robotic or
unnatural.
3. Word Error Rate (WER)
Purpose: Measures intelligibility – how many words are misunderstood by an ASR
(Automatic Speech Recognition) system.
Formula: WER = (S + D + I) / N
Where:
S = Substitutions
D = Deletions
I = Insertions
N = Total words in reference
Real-time Example:
If synthesized audio is: "The cat set on the mat" instead of "The cat sat on the mat" →
1 substitution → WER = 1/6 = 16.67%
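WER can be computed with a word-level edit distance; a minimal sketch that reproduces the example above:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat set on the mat"))  # 0.1667 = 16.67%
```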
B. Subjective Evaluation
Subjective methods involve human listeners rating or comparing TTS audio.
These are considered the gold standard because they reflect real human perception.
1. Mean Opinion Score (MOS)
Purpose: Measures overall naturalness of the voice.
Scale:
Score Meaning
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad
Real-time Example:
System Average MOS Score
Amazon Polly 4.2
Google TTS 4.5
Old TTS Model 3.6
Humans clearly preferred Google TTS for its more natural sound.
2. AB or ABX Testing
Purpose: Compares two or more systems directly.
A and B are two synthesized versions.
X is a reference or random unknown sample.
Listeners are asked:
o “Which sounds better: A or B?”
o “Is X closer to A or B?”
Real-time Example:
Participants compare:
A: Amazon Polly
B: Google TTS
They may listen blind and rate preference.
4.4. When Are These Methods Used?
Phase Evaluation Type Used
During training Objective (MCD, RMSE)
Final tuning Subjective (MOS, ABX)
Product testing Mix of both (to ensure robustness)
4.5. Why Both Are Needed?
Objective methods are fast and automatic — good for large-scale testing.
Subjective methods capture real human feelings — essential for quality.
4.6. Conclusion:
TTS Evaluation is critical to ensure that synthesized speech:
Sounds natural and human-like
Is intelligible and accurate
Meets the expectations of end users
Modern TTS research combines deep learning with human-in-the-loop testing to
improve both metrics and real-world experience.
5. Signal Processing – Concatenative and Parametric Approaches
Signal processing in TTS refers to the methods used to convert linguistic and
acoustic features (like pitch, phonemes, stress) into an actual speech waveform that
can be played through speakers.
It’s the final stage of a TTS system — after the text has been:
1. Normalized
2. Translated into phonemes
3. Prosody (intonation, stress, rhythm) is modeled.
Text-to-Speech Pipeline
Text → Text Normalization → Phoneme Conversion → Prosody Modeling → Signal
Processing → Speech Output.
What Signal Processing Does:
1. Synthesizes the final sound
2. Maintains clarity, pitch, tone, and duration
3. Adds human-like qualities to the generated voice
Types of Signal Processing Approaches:
Approach Type Description
1. Concatenative Joins pre-recorded speech pieces
2. Parametric Generates speech from parameters
3. Neural Uses deep learning to generate realistic speech
Concatenative vs Parametric Synthesis
These are two traditional methods of generating speech waveforms from processed
text and phonetic information. Each has its own mechanism, strengths, and
limitations.
1. Concatenative Synthesis
Definition:
Concatenative synthesis involves stitching together small units of pre-recorded
human speech to produce full sentences.
How It Works:
1. A large speech corpus is recorded and stored. It contains units like:
o Phonemes
o Diphones (half of one phoneme to half of the next)
o Syllables or words
2. At runtime:
o The input text is converted to phonemes
o A unit selection algorithm chooses the best-matching recorded
segments
o These units are concatenated (joined) to form the speech output
Real-Time Example:
Text: "Thank you"
"Thank" → matched to diphone segment from database
"you" → matched similarly
Both segments are joined: "Thank" + "you" → synthesized audio
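A toy sketch of the joining step is shown below; it assumes the unit-selection stage has already returned the recorded segments as NumPy arrays, and uses a short linear crossfade to reduce artifacts at the joins.

```python
import numpy as np

def concatenate_units(units, sr=16000, crossfade_ms=10):
    """Join pre-recorded unit waveforms with a short linear crossfade at each boundary.
    `units` is a list of 1-D numpy arrays (the selected diphone/syllable recordings);
    each unit is assumed to be longer than the crossfade region."""
    fade = int(sr * crossfade_ms / 1000)
    out = units[0].astype(float)
    for unit in units[1:]:
        unit = unit.astype(float)
        ramp = np.linspace(0.0, 1.0, fade)
        overlap = out[-fade:] * (1 - ramp) + unit[:fade] * ramp  # smooth the join
        out = np.concatenate([out[:-fade], overlap, unit[fade:]])
    return out
```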
Types of Concatenative Synthesis:
Type – Description
Unit Selection – Chooses the most natural segments from a large database
Diphone Synthesis – Uses all possible phoneme transitions (~1500 diphones in English)
Pros:
High naturalness (because real human voice is used)
Excellent intonation and emotion, if matched correctly
Cons:
Requires huge storage
Mismatch artifacts at joins (e.g., robotic breaks)
Difficult to customize or change voice characteristics
Use Cases:
Early TTS systems (e.g., older IVR systems, GPS)
Systems that demand natural voice without deep learning
2. Parametric Synthesis
Definition:
In parametric synthesis, speech is not recorded, but generated mathematically
using a small set of parameters like pitch, duration, spectral envelope, etc.
How It Works:
1. A statistical model (like Hidden Markov Models – HMM or Deep Neural
Networks) learns how to predict:
o Pitch (F₀)
o Spectral shape (formants, envelope)
o Duration of phonemes
2. These parameters are fed into a vocoder (like STRAIGHT, WORLD) that
synthesizes the speech waveform from scratch.
Real-Time Example:
Text: "Hello"
Model predicts:
o Pitch = 110 Hz
o Duration = 0.5s
o Formant info = vector values
Vocoder generates a smooth waveform using these.
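A toy sketch of generating audio purely from predicted parameters is given below; real vocoders such as STRAIGHT or WORLD use far richer parameters (spectral envelope, aperiodicity), so this only illustrates the idea of turning pitch and duration numbers into a waveform.

```python
import numpy as np

def toy_parametric_synth(pitch_hz=110.0, duration_s=0.5, sr=16000):
    """Toy 'vocoder': turn two predicted parameters (pitch, duration) into a waveform."""
    t = np.arange(int(sr * duration_s)) / sr
    # a pulse-like source at the requested pitch, with a gentle amplitude envelope
    source = np.sign(np.sin(2 * np.pi * pitch_hz * t))
    envelope = np.hanning(len(t))
    return 0.3 * source * envelope
```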
Pros:
Lightweight – no big databases
Easy to control voice tone, speed, speaker identity
Can generate emotion, accents, or even whispers
Cons:
Less natural than concatenative (because of smooth averaging)
Speech may sound buzzy, muffled, or robotic.
Use Cases:
Voice assistants before 2016 (e.g., early Siri, old TTS readers)
Embedded devices with low memory (like digital toys)
Concatenative vs Parametric – Quick Comparison
Feature Concatenative Parametric
Based on Recorded segments Mathematical parameters
Naturalness Very high Moderate
Flexibility Low (fixed speaker) High (customizable)
Memory usage High Low
Emotion control Difficult Easy
Synthesis speed Fast Faster
Concatenative Synthesis:
[Text] → [Phonemes] → [Segment Selection] → [Join Units] → [Speech
Output]
Parametric Synthesis:
[Text] → [Phonemes] → [Statistical Model] → [Parameters] →
[Vocoder] → [Speech Output]
6. WaveNet
WaveNet is a deep neural network architecture developed by DeepMind (Google)
that generates raw audio waveforms. It can create very realistic human speech by
modeling the probability of each audio sample, one at a time.
Instead of selecting recorded speech (like concatenative) or generating from parameters
(like parametric), WaveNet learns to model raw audio sample-by-sample using deep
learning.
How WaveNet Works – Step-by-Step:
Step 1: Training Phase
Input: A large dataset of human speech
The model learns:
o Acoustic patterns
o How one sound follows another
o The relationship between phonemes, pitch, intonation, etc.
Step 2: Modeling the Speech Signal
Predicts the next sample value based on previous values:
P(x_t | x_1, x_2, ..., x_(t−1))
Uses causal convolutions (only looks at past, not future samples)
Uses dilated convolutions to expand its receptive field without adding
complexity
Step 3: Generation
During synthesis, the model predicts sample-by-sample in real-time to
generate smooth, realistic speech.
Architecture Breakdown
Components:
1. Input Encoding:
o Text → phonemes or linguistic features
o Converted into conditioning features
2. Dilated Causal Convolutions:
o Predict next sample while preserving time order
o Expands receptive field exponentially
3. Residual & Skip Connections:
o Allow training of deep networks
o Prevent vanishing gradients
4. Softmax Output:
o Predicts the probability of the next sample from 256 possible values
(8-bit μ-law encoding)
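The 8-bit μ-law encoding mentioned above can be sketched as follows; these are the standard μ-law companding equations, and the 256 integer codes are the output space the softmax layer predicts over.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress samples in [-1, 1] with the mu-law curve and quantize to 256 levels."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # integers 0..255

def mu_law_decode(codes, mu=255):
    """Invert the quantization back to samples in [-1, 1]."""
    compressed = 2 * (codes.astype(float) / mu) - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu
```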
Applications
Product Uses WaveNet?
Google Assistant Yes
Google Translate voice Yes
YouTube Auto Captions (TTS) Yes
Text-to-speech readers Some newer versions
WaveNet Model Diagram:
[ Linguistic Features]
↓
Conditioning Stack
↓
Dilated Causal Conv 1
Dilated Causal Conv 2……….
Dilated Causal Conv n
↓
Softmax Output Layer
↓
[ Audio Sample t]
Example – Speech Generation
Text: “Hello”
Linguistic model converts to phonemes: /h/ /ə/ /l/ /oʊ/
Features are encoded
WaveNet generates samples:
x₁ = 0.01, x₂ = 0.05, ..., x₁₆₀₀₀ = -0.03
These 16000 samples = 1 second of natural-sounding “Hello”
Advantages:
Unmatched naturalness of speech
Captures long-term dependencies in voice
Can model emotion, prosody, accents naturally
Disadvantages:
Extremely slow: predicts one sample at a time
Needs GPU acceleration for real-time performance
Large model size, not suitable for small devices directly
7. Other Deep Learning Models in TTS Systems
While WaveNet generates raw audio waveforms, it needs a mel-spectrogram as
input — a representation of audio over time (like a heatmap).
So, other TTS systems evolved to handle the full pipeline, from text to mel-
spectrogram, which is then passed to a vocoder (like WaveNet).
These systems are usually divided into two parts:
Stage What It Does Example Models
Acoustic Model Text → Spectrogram Tacotron, FastSpeech
Vocoder Spectrogram → Waveform WaveNet, WaveGlow
1. Tacotron (Google)
What it does:
Converts text/phonemes into a mel-spectrogram
Then uses WaveNet or other vocoders to generate audio
How it works:
1. Input: Text (or phonemes)
2. Encoder processes text sequence
3. Decoder (with attention mechanism) outputs mel-spectrogram
4. Vocoder (like WaveNet) generates final waveform
Real-world example:
Text: “How are you?”
Tacotron generates spectrogram
WaveNet converts to human voice: → “How are you?” in smooth, natural tone
Pros:
High naturalness
Learns intonation and rhythm (prosody)
Cons:
Slow due to autoregressive nature (generates step-by-step)
Sensitive to long texts
2. Tacotron 2
An improved version by Google, combining:
Tacotron 2 for text → mel-spectrogram
WaveNet as vocoder
Improvements:
Better attention mechanism
More stable and less robotic
Produces realistic emotional speech
3. FastSpeech (by Microsoft)
What it does:
Speeds up Tacotron by generating outputs in parallel
Predicts durations of each phoneme and generates full spectrogram at once
Architecture:
1. Input: Phonemes
2. Duration predictor decides how long each phoneme should last
3. Spectrogram generator (non-autoregressive)
4. Vocoder (like Parallel WaveGAN) generates waveform
Pros:
Very fast (real-time or faster)
High-quality audio
More stable for long sentences
Cons:
Slightly lower quality than Tacotron + WaveNet combo (but very close)
4. FastSpeech 2
Upgrades:
Adds pitch and energy prediction (for emotion & expressiveness)
Even more controllable and better naturalness
5. Glow-TTS
Based on normalizing flows (probability-based models)
Generates mel-spectrograms directly using invertible transformations
Very fast and accurate
6. VITS (Variational Inference Text-to-Speech) by Kakao Brain
Why it’s special:
End-to-End model: No separate acoustic model and vocoder
Combines variational autoencoders, normalizing flows, and GANs
Learns everything together (text → waveform)
Pros:
Extremely natural
No need for external vocoder
Real-time performance
7. YourTTS, StyleTTS, Bark, and Others
These models focus on:
Voice cloning (speaking in someone’s voice)
Style transfer (changing emotion, tone)
Multilingual or zero-shot TTS (generate speech from any speaker/language
without retraining)
Comparison Table
Model Type Speed Quality Use Case
Tacotron Autoregressive Slow Very High Research, studios
Tacotron 2 Improved Medium Very High Google Assistant
FastSpeech Non-autoregressive Fast High Real-time devices
FastSpeech 2 + Pitch/Energy Very Fast Very High Emotion control
Glow-TTS Flow-based Fast High Lightweight apps
VITS End-to-end, GAN-based Fast Very High Advanced research
Real-Time Example: Google Assistant
1. Text: “Turn on the lights”
2. Tacotron 2 generates mel-spectrogram
3. WaveNet (or HiFi-GAN) converts it to waveform
4. Speaker hears: “Turning on the lights now.”
Diagram: Typical Deep Learning TTS Pipeline
[ Text]
↓
[ Phoneme Embedding]
↓
[ Tacotron / FastSpeech]
↓
[ Mel Spectrogram]
↓
[ WaveNet / HiFi-GAN / VITS]
↓
[ Speech Output]
Modern TTS systems use deep learning to create realistic and expressive speech
WaveNet revolutionized vocoding
Tacotron → FastSpeech → VITS represent evolution in speed, quality, and end-to-
end learning
UNIT V AUTOMATIC SPEECH RECOGNITION
Speech recognition: Acoustic modelling – Feature Extraction - HMM, HMM-DNN systems
1. Introduction to Speech Recognition
Speech recognition is the process of converting a speech signal into a sequence of words
by a computer system. It is a core component of many AI applications such as virtual
assistants, transcription tools, and voice-controlled devices.
A standard speech recognition system involves several critical stages:
1. Feature Extraction – Converting raw speech waveform into a set of
meaningful acoustic features.
2. Acoustic Modeling – Mapping extracted features to speech units like
phonemes.
3. Language Modeling – Modeling the probability of sequences of words.
4. Decoding – Searching the most probable word sequence by combining acoustic
and language models.
This answer focuses on the first two stages in detail.
1.1. Feature Extraction (Signal → Features)
The goal of feature extraction is to transform the raw audio waveform into a compact
and robust representation that retains speech-specific information and discards noise
and irrelevant variability.
Steps in Feature Extraction:
1. Pre-emphasis:
o A high-pass filter is applied to amplify high frequencies.
o Equation: y(t)=x(t)−αx(t−1), where α ≈ 0.95
2. Framing:
o Speech is non-stationary but quasi-stationary over short intervals
(~25ms).
o The signal is divided into overlapping frames (e.g., 25 ms frames with a
10 ms shift, so adjacent frames overlap).
3. Windowing:
o A window function (typically Hamming) is applied to each frame to
minimize spectral leakage.
4. Fast Fourier Transform (FFT):
o Converts the time-domain signal into frequency domain.
5. Mel Filterbank Processing:
o Filters spaced according to the Mel scale simulate the human ear’s
frequency sensitivity.
o Emphasizes perceptually important frequencies.
6. Logarithmic Compression:
o Log of the filterbank energies is taken to simulate human loudness
perception.
7. Discrete Cosine Transform (DCT):
o DCT is applied to decorrelate the features.
o Retain the first 12–13 coefficients (plus energy term).
8. Delta and Delta-Delta Features:
o Compute temporal derivatives (velocity and acceleration) to capture
dynamic information.
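The steps above correspond closely to standard MFCC extraction. A hedged sketch using the librosa library (assumed to be installed) is given below; the frame and hop sizes are chosen to match the ~25 ms frame / 10 ms shift values mentioned earlier.

```python
import numpy as np
import librosa  # assumed available: pip install librosa

def extract_features(wav_path, sr=16000):
    """Return 39-dimensional MFCC + delta + delta-delta features per ~25 ms frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.95 * y[:-1])               # pre-emphasis, alpha = 0.95
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),        # 25 ms frames
                                hop_length=int(0.010 * sr))   # 10 ms shift
    delta = librosa.feature.delta(mfcc)                       # velocity
    delta2 = librosa.feature.delta(mfcc, order=2)             # acceleration
    return np.vstack([mfcc, delta, delta2]).T                 # shape: (frames, 39)
```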
2. Acoustic Modeling
Acoustic modeling is the process of mapping audio feature vectors (extracted from
speech signals) to linguistic units (such as phonemes or sub-phonemes). It estimates
the likelihood of observing a sequence of features given a hypothesized sequence of
spoken units.
2.1. Role of Acoustic Models in Speech Recognition
Acoustic models form the core of Automatic Speech Recognition (ASR) systems.
They provide the statistical relationship between spoken sounds and their feature
representations by answering:
🔸 What is the probability of observing a feature vector, given a phonetic unit
(like a phoneme or state)?
2.2. Types of Acoustic Models
A. GMM-HMM (Gaussian Mixture Model - Hidden Markov Model)
Structure:
HMM models temporal sequence of speech using states and transitions.
GMM models the probability distribution of acoustic features for each HMM
state.
Key Components:
Component – Description
States – Represent sub-phonemes (e.g., 3 states per phoneme)
Transition Probabilities – Probability of moving between states
Emission Probabilities – Modeled using GMMs: p(x_t | s_j) = Σ_m c_(j,m) · N(x_t ; μ_(j,m), Σ_(j,m))
Limitations:
GMMs assume features are Gaussian-distributed, which isn’t always true.
Poor at modeling complex data structures or long context dependencies.
B. HMM-DNN (Hybrid Deep Neural Network - HMM)
Motivation:
To overcome GMM limitations, Deep Neural Networks (DNNs) replace GMMs for
modeling the emission probabilities.
Working Principle:
DNN models posterior probabilities of HMM states: P(s∣x)
During decoding, Bayes’ rule converts it to a scaled likelihood: P(x∣s) ∝ P(s∣x) / P(s)
Architecture:
Input: Feature vectors (e.g., MFCCs, filterbanks, fMLLR), often with spliced
context (±5 frames).
Hidden Layers: Multiple nonlinear layers (ReLU/tanh) to learn deep features.
Output Layer: Nodes = number of HMM states (senones).
Loss Function: Cross-entropy using state-level labels from forced alignments.
Advantages:
Captures complex, nonlinear relationships.
Leverages large-scale data.
Improves accuracy significantly compared to GMM-HMM.
2.3. Temporal Modeling in Acoustic Models
Model – Purpose
HMM – Models sequence and duration of speech units
TDNN (Time Delay Neural Network) – Captures temporal context explicitly by design
LSTM/BLSTM – Handles long-term dependencies in audio; suitable for sequence-to-sequence learning
CNNs – Capture local spectro-temporal patterns
Transformers (in end-to-end models) – Capture global context using attention mechanisms
2.4. Training of Acoustic Models
Alignment:
Needed for supervised training.
GMM-HMM provides frame-level alignment between audio and phonetic
units (called senones).
🔹 Training Algorithms:
Baum-Welch (EM): For GMM-HMM
Stochastic Gradient Descent / Adam: For DNN-HMM
Discriminative Training (e.g., sMBR, MMI): To directly optimize
recognition accuracy
2.5. Advanced Acoustic Modeling Techniques
Model Description
TDNN-HMM Contextual modeling with fixed time delays; used in Kaldi
LSTM-HMM Memory-based modeling for long speech dependencies
CNN-HMM Effective for learning local time-frequency structures
BLSTM/TDNN-F State-of-the-art hybrid models used in production systems
2.6. Tools and Toolkits
Toolkit – Purpose
Kaldi – Advanced acoustic model development (supports GMM, DNN, TDNN, LSTM, chain models)
HTK – Classical GMM-HMM modeling
ESPnet / Fairseq – End-to-end speech modeling frameworks
2.7. Conclusion
Acoustic modeling plays a vital role in ASR systems by providing statistical mappings
from feature vectors to phonetic units. With the evolution from GMM-HMM to
DNN-HMM and context-aware architectures like TDNN and LSTM, modern
acoustic models have significantly improved the performance and robustness of speech
recognition, especially in noisy and variable environments.
3. Hidden Markov Model (HMM)
3.1. Introduction
A Hidden Markov Model (HMM) is a statistical model used to represent sequences
of observable events that are generated by a sequence of internal (hidden) states. In
speech recognition, HMMs are used to model the temporal variability of speech by
aligning sequences of acoustic features to phonetic units.
3.2. Why Use HMMs in Speech Recognition?
Speech is time-varying and sequential in nature.
Different speakers utter the same word with variability in speed and
pronunciation.
HMMs handle sequence modeling, temporal alignment, and probabilistic
decoding, making them ideal for speech applications.
3.3. Components of an HMM
An HMM is defined by the 5-tuple:
λ = (Q, A, B, π, O)
Component Description
Q Set of N states
A State transition probability matrix
B Observation probability distribution
π Initial state distribution
O Sequence of observations (e.g., MFCC feature vectors)
3.4. HMM Operation in Speech Recognition
Given a sequence of acoustic observations (features), HMM helps in:
A. Evaluation (Likelihood Computation):
Given model λ and observation sequence O, compute:
P(O∣λ)
Algorithm: Forward algorithm
B. Decoding (Best State Sequence):
Find the most likely state sequence for a given observation:
Q* = argmax_Q P(Q∣O, λ)
Algorithm: Viterbi algorithm
C. Training (Parameter Estimation):
Adjust A, B, π to maximize likelihood:
λ* = argmax_λ P(O∣λ)
Algorithm: Baum-Welch algorithm (an instance of Expectation-
Maximization)
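A compact NumPy sketch of the Viterbi decoding step (problem B above) is given below; it works in log probabilities for numerical stability, and the transition, emission, and initial-state scores are assumed to be precomputed.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence for an HMM.
    log_A: (N, N) log transition matrix, log_B: (T, N) log emission scores per frame,
    log_pi: (N,) log initial-state probabilities."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A           # (N, N): from-state x to-state
        backptr[t] = np.argmax(scores, axis=0)           # best predecessor for each state
        delta[t] = scores[backptr[t], np.arange(N)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                        # trace back through the pointers
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```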
3.5. Example: Word Modeling Using HMMs
A word like “SIX” is decomposed into phonemes: /s/ /ih/ /k/ /s/
Each phoneme is modeled by a 3-state HMM
The word model is a concatenation of phoneme HMMs
3.6. HMM vs Other Models
Aspect HMM DNN-HMM Hybrid End-to-End
Assumes temporal structure Yes Yes No (learns directly)
Requires alignments Yes Yes Not always
Probabilistic modeling Yes Yes Yes
Sequence learning Markov Deep Neural Net Attention/CTC/RNN-T
3.7. Conclusion
HMMs are fundamental in modeling the sequential and probabilistic nature of
speech. Although now often integrated with Deep Neural Networks (DNNs) for hybrid
systems (DNN-HMM), the HMM component remains critical for sequence decoding,
alignment, and state transition modeling. It enables speech recognizers to align
variable-length audio inputs with linguistic units effectively and is still a backbone for
modern hybrid ASR systems.
4. HMM-DNN Systems
1. Introduction
The Hybrid HMM-DNN model is an advanced acoustic modeling approach that
combines the temporal modeling power of Hidden Markov Models (HMMs) with
the classification power of Deep Neural Networks (DNNs). This hybrid approach
has become a standard in state-of-the-art ASR systems, replacing the older GMM-
HMM models.
2. Background
A. Traditional HMM-GMM Systems:
HMMs model the sequence of speech.
GMMs estimate the likelihood of acoustic features per state.
Limitations:
o GMMs assume Gaussian distribution.
o Poor performance with high-dimensional or complex data.
o Limited ability to exploit contextual information.
B. Need for DNNs:
DNNs model complex, non-linear relationships in data.
Can use large amounts of data and long temporal context effectively.
3. Hybrid HMM-DNN Architecture
Basic Concept:
Use a DNN to estimate posterior probabilities of HMM states given acoustic
feature vectors.
HMM handles temporal sequence modeling, while DNN handles frame-
level classification.
How It Works:
1. Input feature vectors x_t (e.g., MFCC, fMLLR, or filterbank features).
2. DNN estimates:
P(s_t | x_t) (posterior probability of HMM state given input)
3. Use Bayes’ Rule to convert posterior to likelihood: P(x_t | s_t) ∝ P(s_t | x_t) / P(s_t)
4. Integrate into HMM for decoding using Viterbi algorithm.
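Step 3 can be sketched in a few lines of code; the state priors P(s) are typically estimated from the frequency of each senone in the training alignments (an assumption here, since the notes do not fix how they are obtained).

```python
import numpy as np

def posteriors_to_loglikelihoods(log_posteriors, state_priors, eps=1e-10):
    """Convert DNN state posteriors to scaled log-likelihoods for HMM decoding:
    log P(x|s) ~ log P(s|x) - log P(s).
    log_posteriors: (frames, states) log-softmax outputs; state_priors: (states,)."""
    return log_posteriors - np.log(np.asarray(state_priors) + eps)
```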
4. DNN Structure in HMM-DNN
Layer Purpose
Input Layer Receives context window of features (e.g., ±5 frames of MFCCs)
Hidden Layers Deep layers with ReLU or tanh activations learn complex patterns
Output Layer One node per senone (tied HMM state); uses softmax
Loss Function Cross-entropy based on frame-level labels
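A minimal PyTorch sketch of such a network is shown below; the layer sizes and the number of senones are illustrative assumptions, not values from the notes.

```python
import torch
import torch.nn as nn

class SenoneClassifier(nn.Module):
    """Feed-forward acoustic model sketch: spliced feature frames in, senone scores out.
    Sizes are illustrative (13 MFCCs x 11-frame context window, 2000 senones)."""
    def __init__(self, feat_dim=13, context=5, hidden=1024, num_senones=2000):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)    # +/-5 frames of spliced context
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_senones),       # one output node per senone
        )

    def forward(self, x):                         # x: (batch, in_dim)
        return self.net(x)                        # logits; train with nn.CrossEntropyLoss

model = SenoneClassifier()
logits = model(torch.randn(4, 13 * 11))           # 4 spliced frames -> (4, 2000) senone scores
```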
5. Training Process
A. Label Preparation:
Use GMM-HMM system to perform forced alignment → get frame-to-state
alignments (senones).
B. Supervised Training:
Train DNN to classify each frame to its corresponding senone.
C. Decoding:
Combine DNN outputs (converted to likelihoods) with language model and
HMM topology to decode speech.
6. Advantages of HMM-DNN Systems
Feature – Benefit
Discriminative Training – Directly minimizes classification error
Context-Dependent Modeling – Uses multiple frames for robust decisions
Deep Feature Learning – Learns hierarchical features automatically
Improved Accuracy – Outperforms GMM-HMM, especially in noisy/real-world scenarios
7. Enhancements of the Basic HMM-DNN System
Model Variant – Description
TDNN-HMM – Time-Delay Neural Networks; efficient modeling of temporal context
LSTM-HMM / BLSTM-HMM – Recurrent models capture long-term dependencies
CNN-HMM – Capture local spectro-temporal patterns effectively
TDNN-F – Factorized TDNN; efficient training with low resource usage
8. Toolkits Supporting HMM-DNN
Toolkit Features
Kaldi Industry-standard, supports TDNN, LSTM, and chain models
HTK Traditional HMM-GMM; extensions support DNNs
ESPnet End-to-end + hybrid capabilities
PyTorch/Keras For custom DNN model training and experimentation
9. Applications
Voice Assistants (e.g., Google Assistant, Alexa)
Dictation and Transcription
Call Center Analytics
Real-time Captioning
Voice-Controlled Interfaces
10. Conclusion
The HMM-DNN hybrid model is a powerful approach that leverages the temporal
structure of HMMs and the deep learning capabilities of DNNs. It represents a major
leap over traditional GMM-based systems and forms the backbone of many modern
ASR systems. With extensions like TDNN, LSTM, and CNN-HMM, these models
achieve high accuracy even in challenging real-world scenarios.