UNIT-1
Review of Chomsky’s Hierarchy of Languages
According to Chomsky’s hierarchy, grammars are classified into four types based on their
generative power:
🔹 Type 0: Unrestricted Grammar
• Also known as Recursively Enumerable grammar.
• Recognized by a Turing Machine.
• It includes all formal grammars.
➤ Grammar Rule Format:
α → β
Where:
• α ∈ (V ∪ T)* V (V ∪ T)* (i.e., at least one variable must appear on the left side)
• β ∈ (V ∪ T)*
V: Variables
T: Terminals
✔ Example:
Sab → ba
A → S
Here, Variables = {S, A}, Terminals = {a, b}
🔹 Type 1: Context-Sensitive Grammar
• Generates Context-Sensitive Languages.
• Recognized by a Linear Bounded Automaton (LBA).
• Must satisfy Type 0 conditions.
➤ Grammar Rule Format:
α → β where |α| ≤ |β| and α ≠ ε
• The length of the RHS must be greater than or equal to the LHS.
• The left side cannot be empty.
✔ Example:
S → AB
AB → abc
B → b
🔹 Type 2: Context-Free Grammar
• Generates Context-Free Languages.
• Recognized by a Pushdown Automaton.
• Must satisfy Type 1 conditions.
➤ Grammar Rule Format:
A → γ
Where:
• Left-hand side is a single variable
• No restriction on the right-hand side (γ ∈ (V ∪ T)*)
✔ Example:
S → AB
A → a
B → b
🔹 Type 3: Regular Grammar
• Generates Regular Languages.
• Recognized by a Finite Automaton (DFA/NFA).
• Most restricted form of grammar.
➤ Grammar Rule Formats:
Right-regular:
V → T / TV
Left-regular:
V → T / VT
Where V is a variable, T is a terminal.
✔ Example (Strict Regular Grammar):
S → a
✔ Example (Extended Regular Grammar):
S → ab
An extended regular grammar allows a string of terminals (T*) together with at most one variable, in either right- or left-regular position.
🔁 Summary: Inclusion of Language Classes
Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursively Enumerable
Regular Expressions (RegEx)
A Regular Expression is a pattern used to match character combinations in strings. They're
powerful tools used in searching, matching, and replacing text.
Regular expressions are used in programming languages like JavaScript, Python, Java, C++, and
many more, as well as in tools like text editors, command-line utilities (grep, sed), and regex-
based form validations.
Regular Expressions (RegEx) are sequences of characters that define a search pattern. They are
used to match, locate, and manage text. They form the basis of pattern matching in many
programming languages and are fundamental in lexical analysis.
🔹 Applications of Regular Expressions:
• Lexical Analysis in compilers
• Pattern matching in text processing
• Input validation (emails, passwords, phone numbers)
• Search and Replace operations
• Text mining and NLP tasks
🔹 Regular Expressions in Formal Language Theory:
In theoretical computer science:
• Regular expressions describe Regular Languages
• Recognized by Finite Automata (FA)
• Part of Chomsky Hierarchy (Type 3)
🔹 Basic Symbols in RegEx:
Symbol   Meaning
a        Matches the character a
.        Matches any single character
^        Matches the beginning of a string
$        Matches the end of a string
*        0 or more occurrences
+        1 or more occurrences
?        0 or 1 occurrence (optional)
|        Alternation (matches either pattern)
()       Grouping
[]       Character class
{n}      Exactly n repetitions
{n,}     n or more repetitions
{n,m}    Between n and m repetitions
🔹 Shorthand Character Classes:
Symbol   Matches
\d       Any digit [0–9]
\D       Any non-digit
\w       Word character [a-zA-Z0-9_]
\W       Non-word character
\s       Whitespace
\S       Non-whitespace
🔹 Examples:
Pattern      Matches
a*           "" (empty string), a, aa, aaa
[a-z]        Any lowercase letter
[A-Za-z]+    One or more letters
^\d{3}$      Exactly three digits
(ab)+        ab, abab, ababab
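A few of these patterns tried out with Python's built-in re module (an illustrative sketch only):

import re

# ^\d{3}$ : exactly three digits (fullmatch anchors the whole string)
print(bool(re.fullmatch(r"\d{3}", "123")))      # True
print(bool(re.fullmatch(r"\d{3}", "12a")))      # False
# [A-Za-z]+ : one or more letters
print(re.findall(r"[A-Za-z]+", "abc 123 def"))  # ['abc', 'def']
# (ab)+ : one or more repetitions of "ab"
print(bool(re.fullmatch(r"(ab)+", "ababab")))   # True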
🔹 Limitations:
• Cannot handle nested or recursive patterns (e.g., balanced parentheses)
• Limited to regular languages only
🧠 Summary:
• Regular expressions define regular languages
• Recognized by Finite Automata
• Useful in both theoretical CS and practical programming
🔹 Finite Automata (FA)
Finite Automata are abstract machines used to model computation. They process strings over an
input alphabet and accept or reject them based on whether they follow specific rules.
➤ Deterministic Finite Automaton (DFA) — In Detail
✅ Characteristics:
• At any point in time, the machine is in exactly one state.
• For each symbol in the alphabet, there is exactly one transition from a state.
• No room for guessing or ambiguity.
🔍 DFA Formal Definition Recap:
DFA = (Q, Σ, δ, q₀, F)
Where:
• Q = Set of states (finite)
• Σ = Input alphabet (set of symbols)
• δ = Transition function: Q × Σ → Q
• q₀ = Initial (start) state (q₀ ∈ Q)
• F = Set of accepting (final) states (F ⊆ Q)
🧠 Example: DFA for strings ending in 01 over Σ = {0,1}
States:
• q0: Start
• q1: Seen 0
• q2: Seen 0 followed by 1 (Accept)
Transition Table:
State 0 1
q0 q1 q0
q1 q1 q2
q2 q1 q0
Final state = q2 ✅
So: "1001" → q0 → q0 → q1 → q2 (ACCEPT)
➤ Non-Deterministic Finite Automaton (NFA) — In Detail
✅ Characteristics:
• Can move to multiple states for a single input symbol.
• Can have ε-transitions (i.e., move without consuming input).
• Easier to design (especially for complex patterns), but non-deterministic.
• Cannot directly be implemented as-is — needs to be converted to DFA.
🔍 NFA Formal Definition:
NFA = (Q, Σ, δ, q₀, F)
Where:
• δ = Q × (Σ ∪ {ε}) → 2^Q (transition function gives a set of possible states)
🧠 Example: Accept strings containing substring "ab"
States:
• q0: Start
• q1: Seen 'a'
• q2: Seen "ab" (Accept)
Transitions:
• q0 → q0 on a or b (loop)
• q0 → q1 on a
• q1 → q2 on b
Any string that reaches q2 is accepted ✅
🔁 DFA vs NFA: Comparison
Feature                 DFA                            NFA
Transitions per input   One per state/input            Zero, one, or multiple
Epsilon transitions     ❌ Not allowed                 ✅ Allowed
Implementation          Easy                           Harder (needs conversion)
Design simplicity       Harder for complex languages   Easier for complex patterns
Power (language)        Same (regular languages)       Same (equivalent power)
📌 Key Point: Any NFA can be converted to an equivalent DFA using the subset construction
algorithm.
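A minimal Python sketch of the subset construction, applied to a version of the "contains ab" NFA above (self-loops on q2 are assumed here so the machine keeps accepting once "ab" has been seen; ε-transitions are omitted for brevity):

# NFA transitions: (state, symbol) -> set of possible next states
nfa = {
    ('q0', 'a'): {'q0', 'q1'}, ('q0', 'b'): {'q0'},
    ('q1', 'b'): {'q2'},
    ('q2', 'a'): {'q2'},       ('q2', 'b'): {'q2'},
}
alphabet = {'a', 'b'}
start = frozenset({'q0'})

dfa = {}                # (subset of NFA states, symbol) -> subset of NFA states
seen = {start}
worklist = [start]
while worklist:
    current = worklist.pop()
    for sym in alphabet:
        # the DFA target state is the union of all NFA moves from the current subset
        target = frozenset(s for q in current for s in nfa.get((q, sym), set()))
        dfa[(current, sym)] = target
        if target not in seen:
            seen.add(target)
            worklist.append(target)

accepting = [subset for subset in seen if 'q2' in subset]
print(len(seen), "DFA states,", len(accepting), "accepting")  # 4 DFA states, 2 accepting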
🔹 Beyond Finite Automata (Non-Finite Automata)
Some languages (especially non-regular) can't be recognized by FA. We need more powerful
machines.
➤ Pushdown Automaton (PDA)
🧠 Why PDAs?
Finite Automata can’t handle nested structures like a^n b^n (equal a's followed by equal b's).
But PDAs can — by using a stack!
✅ Characteristics:
• Like an NFA but with a stack (LIFO memory).
• Can push, pop, or peek symbols on the stack.
• Recognizes context-free languages (like palindromes, balanced parentheses).
📌 Real-world Example:
A compiler checking for matching brackets in code {[()()]} — PDA handles it.
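The stack idea can be sketched in a few lines of Python; this bracket checker illustrates the push/pop mechanism rather than a formal PDA definition:

def balanced(s):
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)                      # push an opening bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                      # the popped symbol must match
    return not stack                              # accept only if the stack is empty

print(balanced("{[()()]}"))  # True
print(balanced("{[(])}"))    # False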
➤ Turing Machine
The most powerful computational model in formal language theory.
✅ Characteristics:
• Has a tape (infinite memory) and head that can read/write/move.
• Can simulate any algorithm.
• Recognizes recursively enumerable languages (Type-0 in Chomsky hierarchy).
Turing Machine vs FA:
Feature          Finite Automata     Turing Machine
Memory           None                Infinite tape
Computation      Pattern matching    General computation
Language class   Regular             Recursively enumerable
🔍 Applications:
• Theoretical limits of computation
• Language design
• Problem solvability (halting problem, etc.)
🔹 Finite State Transducers (FST)
A Finite State Transducer (FST) is a computational model that extends the concept of a Finite
Automaton (FA) by producing output as it processes an input. FSTs are essential in areas such as
Natural Language Processing (NLP), speech processing, and morphological analysis.
✅ What is an FST?
An FST is essentially a Finite Automaton with the added functionality of generating output while
reading an input string.
Just like a Finite Automaton, an FST:
• Has states.
• Has transitions based on input symbols.
However, the key difference is that FSTs produce output on each transition, which means they can
map an input string to an output string.
🔍 Formal Definition of FST:
An FST is formally defined as a 7-tuple:
FST = (Q, Σ, Γ, δ, ω, q₀, F)
Where:
• Q: A finite set of states.
• Σ: The input alphabet (set of input symbols).
• Γ: The output alphabet (set of output symbols).
• δ: The transition function: δ: Q × Σ → Q. This defines the next state given the current state and an input symbol.
• ω: The output function: ω: Q × Σ → Γ*. This defines the output string produced during a transition.
• q₀: The start state (the state at the beginning of the computation).
• F: A set of final (accepting) states.
🔍 How Does an FST Work?
• The FST processes an input string from left to right.
• On each input symbol, the FST:
◦ Moves to a new state based on the current state and the input symbol (using the
transition function δ).
◦ Produces output based on the current state and input symbol (using the output
function ω).
• The process continues until all input symbols are consumed.
• If the FST reaches an accepting state, the output string is considered valid.
🧠 Example of FST:
Consider a simple FST for a morphological analysis task where it converts words to their root
form and adds grammatical features.
• Input: cats
• Output: cat + N + Plural
Here, the FST would:
• Read the string "cats".
• Identify the root word "cat" and the suffix "-s" as plural.
• Output cat + N + Plural to represent the root word along with its grammatical
features.
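As a minimal Python sketch of these mechanics (a deliberately simple Mealy-style transducer that outputs the one's complement of a binary string, rather than a full morphological analyzer), the transition function δ and output function ω can be plain dictionaries:

delta = {('q0', '0'): 'q0', ('q0', '1'): 'q0'}   # transition function δ
omega = {('q0', '0'): '1', ('q0', '1'): '0'}     # output function ω

def transduce(s, state='q0'):
    out = []
    for ch in s:
        out.append(omega[(state, ch)])   # emit output on this transition
        state = delta[(state, ch)]       # move to the next state
    return "".join(out)

print(transduce("1001"))  # "0110"

A real morphological FST such as the "cats" example works the same way, just with many more states and with outputs like "cat", "+N", "+Plural" attached to its transitions.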
🔁 Types of Transducers:
1. Moore Machine:
◦ In a Moore machine, the output depends only on the current state.
◦ The output is associated with states rather than transitions.
2. Mealy Machine:
◦ In a Mealy machine, the output depends on both the current state and the input
symbol.
◦ The output is associated with transitions rather than states.
✨ Applications of FST in NLP:
1. Morphological Analysis:
◦ FSTs are widely used in morphological analyzers to break down words into their
root forms and grammatical features (e.g., running → run + V +
Progressive).
2. Speech Recognition:
◦ FSTs can help map spoken language to text by producing a sequence of output
phonemes or words.
3. Transliteration:
◦ FSTs can be used to convert words from one script to another, such as converting
Romanized Hindi into Devanagari script.
4. Spelling Correction:
◦ In spelling correction, FSTs can map misspelled words to their corrected forms by
applying predefined rules.
🔍 Natural Language and Linguistics – In Detail
1⃣ Natural Language
Definition:
Natural language refers to the languages humans use for everyday communication. Unlike
programming languages, which are artificially created for machines, natural languages evolve
naturally within communities over time. These languages can be spoken, written, or signed.
Examples include English, Hindi, Tamil, Japanese, and many more.
Key Characteristics:
• Ambiguity: Words and phrases in natural languages often have multiple meanings
depending on the context.
◦ Example: The word "bank" could mean a financial institution or the side of a river.
• Context-dependence: The meaning of a sentence can change based on the situation or
background.
◦ Example: The sentence "He saw her duck" could mean he observed her pet duck or
that she lowered her head.
• Evolving Nature: Natural languages constantly evolve with new words, slang, and usages
being created. This dynamic nature allows them to adapt to new societal contexts.
2⃣ Linguistics
Definition:
Linguistics is the scientific study of language. It focuses on understanding how languages are
structured, how we comprehend and produce language, and how language evolves over time.
Linguistics serves as the foundation for Natural Language Processing (NLP), which is essential
for making computers understand and interact with human languages.
Branches of Linguistics:
1. Phonetics:
◦ Focuses on the physical sounds of speech, how they are produced, transmitted, and
perceived.
◦ Example: The difference between the "p" in "spin" (unaspirated) and "pin"
(aspirated).
2. Phonology:
◦ Studies how sounds function within particular languages and the rules for sound
combinations.
◦ Example: The combination "ng" is never at the start of a word in English (e.g., it
appears in "ring" but not "ngot").
3. Morphology:
◦ Deals with the structure of words and how words are formed from smaller units of
meaning, called morphemes.
◦ Example: The word "unhappiness" is made up of three morphemes: "un-" (pre x
meaning negation), "happy" (root), and "-ness" (suf x meaning noun).
4. Syntax:
◦ Concerned with the structure of sentences and how words combine to form
grammatically correct sentences.
◦ Example: "She is eating an apple" vs. "Is apple she an eating?" (incorrect word
order).
5. Semantics:
◦ Studies the meaning of words, phrases, and sentences.
◦ Example: "Cats chase mice" vs. "Mice chase cats" — same words but different
meanings.
6. Pragmatics:
◦ Examines how context influences the meaning of language in real-world scenarios.
◦ Example: "Can you pass the salt?" is a question, but pragmatically, it’s a request for
action.
7. Discourse Analysis:
◦ Focuses on how larger stretches of language (e.g., conversations, paragraphs) are
structured and how they maintain coherence.
◦ Example: Analyzing how a story is structured or how speakers change topics during
a conversation.
Importance in NLP:
Understanding the branches of linguistics is crucial for Natural Language Processing (NLP), a
field focused on enabling computers to process and understand human language. Each branch of
linguistics corresponds to different NLP tasks, such as:
• Phonetics and Phonology: Speech recognition and synthesis.
• Morphology: Word segmentation and stemming.
• Syntax and Semantics: Parsing, part-of-speech tagging, and semantic analysis.
• Pragmatics and Discourse: Understanding context, dialogue systems, and conversational
agents.
🔹 Syntax and Structure
🔸 What is Syntax?
Syntax is the set of rules that govern the structure of sentences in a language. It defines how
words combine to form meaningful phrases and sentences.
🔹 In simple terms:
Syntax = Grammar rules + Sentence patterns
🔸 Why is Syntax Important?
In Natural Language Processing (NLP), understanding syntax helps computers:
• Parse a sentence (break it into parts)
• Identify grammatical roles (subject, verb, object, etc.)
• Disambiguate meanings (e.g., “Flying planes can be dangerous”)
• Perform tasks like machine translation, chatbots, text summarization, etc.
🔸 Syntax Structures
1. Phrase Structure (Constituency Grammar)
This structure breaks a sentence into nested sub-phrases or constituents.
🔹 Example Sentence:
"The cat sat on the mat."
We can break this as:
[S
[NP The cat]
[VP sat
[PP on
[NP the mat]
]
]
]
Legend:
• S: Sentence
• NP: Noun Phrase
• VP: Verb Phrase
• PP: Prepositional Phrase
This hierarchy forms a syntax tree, or parse tree.
2. Dependency Grammar
Instead of using nested phrases, dependency grammar focuses on word-to-word relationships.
🔹 In the same sentence:
• "sat" is the root verb.
• "cat" is the subject of "sat".
• "on" is a preposition dependent on "sat".
• "mat" is the object of "on".
This creates a dependency tree where each word is connected directly to another word it depends
on.
🔸 Grammar Rules and Syntax in NLP
➤ CFG (Context-Free Grammar)
Used to define syntactic rules in formal language theory.
Rules have the form: A → α
• A is a non-terminal (e.g., S, NP, VP)
• α is a sequence of terminals and/or non-terminals
🔹 Example Rules:
S → NP VP
NP → Det N
VP → V NP
Det → the | a
N → cat | mat
V → sat | saw
This set of rules helps generate or parse valid sentences.
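A small sketch using NLTK to parse with exactly these rules (the sentence "the cat saw a mat" is used here because the rule set has no PP rule, so "sat on the mat" cannot be derived from it):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'mat'
V -> 'sat' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'cat', 'saw', 'a', 'mat']):
    print(tree)
# (S (NP (Det the) (N cat)) (VP (V saw) (NP (Det a) (N mat))))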
🔸 Syntax Trees (Parse Trees)
A syntax tree is a visual representation of the syntactic structure of a sentence according to a
grammar.
🔹 Example:
For sentence: “The dog barked.”
Tree:
        S
       / \
     NP   VP
    /  \    \
  Det   N    V
   |    |    |
  The  dog  barked
🔸 Applications of Syntax in NLP
Application           Role of Syntax
Machine Translation   Maintains sentence structure in the other language
Grammar Checking      Identifies syntactic errors
Question Answering    Helps identify subject, object, etc.
Text Summarization    Understands clause hierarchy and importance
3⃣ Syntax and Structure
Definition:
Syntax is the branch of linguistics that deals with the structure of sentences. It focuses on the rules
and principles that govern the way words combine to form grammatically correct sentences in a
given language. Syntax examines the relationships between different elements of a sentence, such
as subject, verb, object, and how these components follow specific word order patterns.
Key Aspects of Syntax:
1. Word Order: Syntax dictates the order in which words should appear in a sentence to
maintain grammaticality.
◦ Example (English): “She eats an apple” vs. “Eats she an apple” (incorrect word
order).
2. Syntactic Categories: Words in a language can be classified into categories based on their
function in the sentence. These categories include:
◦ Nouns (person, place, thing)
◦ Verbs (actions or states)
◦ Adjectives (describe nouns)
◦ Adverbs (modify verbs, adjectives, or other adverbs)
◦ Prepositions (show relationships between words, e.g., in, on, at)
3. Sentence Types:
◦ Declarative: Statements (e.g., “I am a student”)
◦ Interrogative: Questions (e.g., “Are you a student?”)
◦ Imperative: Commands (e.g., “Please sit down.”)
◦ Exclamatory: Expressing strong feelings (e.g., “What a beautiful day!”)
4. Phrases: A phrase is a group of words that work together to convey a single idea. Phrases
can be categorized as:
◦ Noun Phrase (NP): Consists of a noun and its modifiers. (e.g., “the big dog”)
◦ Verb Phrase (VP): Contains the main verb and its auxiliaries. (e.g., “has been
running”)
◦ Prepositional Phrase (PP): Begins with a preposition and includes its object. (e.g.,
“in the park”)
5. Syntactic Trees: Syntax often uses tree structures (also called parse trees) to represent
sentence structure, showing how words and phrases are hierarchically arranged.
Importance in NLP:
In Natural Language Processing (NLP), syntax is essential for tasks such as:
• Sentence Parsing: Identifying the syntactic structure of a sentence.
• Part-of-Speech Tagging: Assigning the correct grammatical category to each word (e.g.,
verb, noun).
• Machine Translation: Translating sentences from one language to another while
maintaining grammaticality.
4⃣ Representation of Meaning
Definition:
The representation of meaning in language refers to how the meaning of words, phrases, and
sentences is captured, understood, and processed. Understanding meaning is essential for any
language model or system that aims to interact with humans in a natural way.
Types of Meaning:
1. Lexical Meaning: The meaning of individual words. This can be determined from a
dictionary definition.
◦ Example: The word "dog" refers to a domesticated carnivorous mammal.
2. Compositional Meaning: The meaning of larger linguistic units (like phrases or sentences)
based on the meanings of their parts.
◦ Example: “Black cat” means a cat that is black in color, where “black” modifies the
noun “cat.”
3. Contextual Meaning: The meaning that arises from the context in which a word or sentence
is used.
◦ Example: “He’s a hotshot.” In context, it means someone who is very skilled or
successful, but “hotshot” in a different context could refer to a small, fast-moving
object.
4. Ambiguity in Meaning:
◦ Lexical Ambiguity: When a word has multiple meanings.
▪ Example: "Lead" can refer to a metal or to guide.
◦ Syntactic Ambiguity: When a sentence has more than one possible syntactic
interpretation.
▪ Example: “I saw the man with the telescope.” This can mean either:
1. The man had a telescope.
2. I used a telescope to see the man.
5. Semantic Roles: Each part of a sentence typically plays a specific role in conveying
meaning, such as:
◦ Agent: The doer of an action (e.g., in “John kicked the ball,” John is the agent).
◦ Theme: The entity that is affected by the action (e.g., in “John kicked the ball,” the
ball is the theme).
◦ Goal: The recipient of an action or the destination (e.g., in “She gave him the book,”
him is the goal).
Importance in NLP:
In Natural Language Processing (NLP), representing meaning is crucial for tasks like:
• Word Sense Disambiguation: Determining the correct meaning of a word based on
context.
• Machine Translation: Ensuring accurate translation of meaning from one language to
another.
• Information Retrieval: Matching queries with relevant documents by understanding the
meaning behind words and sentences.
Lexical and Semantic Models
🔸 A. Lexical Semantics
Lexical Semantics is the study of how words convey meaning, and how they relate to one another
in a language.
✅ 1. Lexeme
• A lexeme is the abstract unit of meaning underlying different word forms.
• Examples:
◦ Lexeme: run
◦ Word forms: run, runs, ran, running
• Lexeme ≠ word — a lexeme groups all inflected forms.
✅ 2. Word Sense
• Many words are polysemous — they have multiple meanings or senses.
• Word Sense Disambiguation (WSD) is the task of determining which sense of a word is
used in a given context.
• Example:
◦ “He sat by the bank” → River side
◦ “He went to the bank to get cash” → Financial institution
✅ 3. Word Relationships (Lexical Relations):
Type        Description                                          Example
Synonymy    Words with similar meanings                          happy ↔ joyful
Antonymy    Words with opposite meanings                         hot ↔ cold
Hyponymy    A word whose meaning is included in another (IS-A)   Car is a hyponym of Vehicle
Hypernymy   The more general category                            Animal is a hypernym of Dog
Meronymy    Part-whole relationship                              Wheel is a meronym of Car
Homonymy    Same spelling/sound but unrelated meaning            Bat (animal) and Bat (cricket)
Polysemy    One word, multiple related meanings                  Paper (material, essay, newspaper)
✅ 4. Thesauri and Lexical Databases
• WordNet: A lexical database grouping words into sets of synonyms (synsets) with semantic
relationships.
◦ You can find synonyms, antonyms, hyponyms, and hypernyms using WordNet.
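A short sketch of querying WordNet through NLTK (assuming the wordnet data has been downloaded):

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Synsets (groups of synonyms) for "car", with their dictionary-style glosses
for syn in wn.synsets('car')[:2]:
    print(syn.name(), "-", syn.definition())

# Hypernyms (more general terms) of the first sense of "car"
print(wn.synsets('car')[0].hypernyms())   # e.g. [Synset('motor_vehicle.n.01')]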
🔸 B. Semantic Models
Semantic models are computational methods used to represent meaning in text and words.
✅ 1. Bag of Words (BoW)
• Idea: Represent text as an unordered collection of words.
• Each word gets a frequency count; grammar and order are ignored.
• Example:
◦ Sentence 1: “The cat sat on the mat.”
◦ Sentence 2: “Mat the on sat cat the.”
◦ BoW sees both as identical.
Pros:
• Simple and easy to implement.
Cons:
• Ignores grammar, context, and word order.
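A minimal sketch with scikit-learn's CountVectorizer (assuming scikit-learn is installed) showing that the two sentences above receive identical vectors:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "Mat the on sat cat the."]
bow = CountVectorizer()
vectors = bow.fit_transform(docs).toarray()

print(bow.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(vectors)                      # both rows are identical: word order is ignored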
✅ 2. TF-IDF (Term Frequency - Inverse Document Frequency)
• Goal: Highlight important words in a document, downweight common words.
• TF: How often a word appears in a document.
• IDF: How rare the word is across all documents.
Use Case: Improves relevance in search engines.
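A minimal TfidfVectorizer sketch (scikit-learn assumed; it applies smoothing and length normalization, so the values differ slightly from the textbook formula):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(weights.toarray().round(2))
# Words unique to one document ("sat", "dog") get a higher IDF than
# words shared by both documents ("cat").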
✅ 3. Word Embeddings
• Words are represented as vectors in a high-dimensional space.
• Semantically similar words have closer vector positions.
• Models:
◦ Word2Vec (Google)
◦ GloVe (Stanford)
◦ FastText (Facebook)
• Example:
◦ vec("king") - vec("man") + vec("woman") ≈ vec("queen")
Advantage: Captures relationships like analogies, semantic proximity, contextual use.
✅ 4. Contextual Embeddings
• Advanced models that generate different word vectors depending on the context.
• Useful for solving word sense disambiguation.
• Models:
◦ BERT (Bidirectional Encoder Representations from Transformers)
◦ GPT (Generative Pretrained Transformer)
◦ ELMo (Embeddings from Language Models)
Example:
• “She went to the bank to deposit money.”
• “The bank was flooded after the storm.”
➡ BERT assigns different vectors for “bank” in each sentence.
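A hedged sketch of inspecting this with the Hugging Face transformers library (it downloads the pretrained bert-base-uncased model; the helper function below is illustrative, not a fixed recipe):

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("She went to the bank to deposit money.")
v2 = bank_vector("The bank was flooded after the storm.")
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1.0: different vectors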
✅ 5. Semantic Parsing
• Converts natural language into a formal representation of meaning.
• Often results in logical expressions, graphs, or other structures.
• Used in chatbots, question answering, machine translation.
📌 Real-life Applications
Area                  Use Case
Search Engines        Ranking pages by TF-IDF or embeddings
Chatbots              Understanding user queries with embeddings
Machine Translation   Contextual embedding for accurate translation
Sentiment Analysis    Capturing emotion behind text
Question Answering    Mapping question to semantic form
📚 Text Corpora
🔸 A. What is a Corpus?
• A corpus (plural: corpora) is a large, structured collection of texts used for linguistic
analysis and training NLP models.
• It may include written texts, spoken language transcriptions, dialogues, or social media
posts.
📌 Think of it as the “data backbone” for most NLP applications.
🔸 B. Types of Corpora
• Monolingual Corpus: Texts in a single language. Example: Brown Corpus (English)
• Multilingual Corpus: Texts in multiple languages without translation alignment. Example: Leipzig Corpus
• Parallel Corpus: Texts in two or more languages with sentence-by-sentence translation. Example: Europarl Corpus
• Comparable Corpus: Same topic in different languages, but not sentence-aligned. Example: News articles from various countries
• Annotated Corpus: Corpus enriched with metadata or linguistic tags. Example: Penn Treebank (POS tags, syntactic structure)
• Spoken Corpus: Transcriptions of spoken language. Example: Switchboard, Spoken BNC
• Social Media Corpus: Tweets, forums, comments, etc. Example: Twitter Sentiment Corpus
🔸 C. Annotations in Corpora
Annotations enhance raw text by adding linguistic information, such as:
• POS Tags (Part-of-Speech): Identify word classes. Example: “The/DT cat/NN sat/VBD”
• Syntactic Trees: Show sentence structure. Example: (NP (DT The) (NN cat))
• Named Entity Recognition (NER): Identify entities like names, places, dates. Example: “Google/ORG launched in 1998/DATE”
• Semantic Roles: Label who did what to whom. Example: Agent: John, Action: bought, Theme: book
🔸 D. Uses of Text Corpora in NLP
1. Training Language Models (e.g., GPT, BERT)
2. Grammar and Syntax Learning
3. Statistical Analysis of Word Usage
4. Machine Translation Systems
5. Sentiment and Emotion Analysis
6. Speech Recognition and Generation
7. Chatbots and Dialogue Systems
8. Lexicon and Thesaurus Construction
🔸 E. Popular Corpora in NLP
• Brown Corpus: First million-word electronic corpus of American English
• Penn Treebank: Annotated with POS tags and syntactic structure
• WordNet: Lexical database that can also be used as a corpus
• COCA (Corpus of Contemporary American English): Modern American English usage
• Europarl Corpus: European Parliament proceedings (parallel corpus)
• Wikipedia Dumps: Used in many modern NLP tasks
• Twitter Sentiment Corpus: Useful for sentiment analysis
🔸 F. Building Your Own Corpus
Steps:
1. Collect raw text (web scraping, APIs, documents, etc.)
2. Preprocess: Clean (remove HTML, symbols), tokenize, normalize.
3. Annotate: Add linguistic features manually or with NLP tools.
4. Store and Index: Use formats like JSON, XML, or plain text.
5. Analyze/Use in ML models, rule-based systems, etc.
🔸 G. Challenges with Text Corpora
• Bias: Reflects cultural, gender, or societal biases.
• Domain-specificity: A corpus may not generalize well.
• Licensing: Many corpora are not free for commercial use.
• Annotation errors: Human annotation can introduce inconsistencies.
🧠 Summary
• A corpus is essential for training, evaluating, and improving NLP systems.
• It can be general-purpose or domain-specific, raw or annotated.
• High-quality corpora lead to better-performing language models.
UNIT -2
Natural Language Processing (NLP): Text
Wrangling and Pre-processing
Introduction
In Natural Language Processing (NLP), text wrangling and pre-processing are essential steps to
prepare raw text data for analysis, model training, and machine learning applications. These steps
help convert unstructured text into a structured format, making it easier for algorithms to extract
meaningful patterns.
1. Text Wrangling
Text wrangling (also known as text cleaning) involves handling raw text data to make it suitable for
further processing. It includes:
Key Steps in Text Wrangling
1. Removing Unwanted Characters
◦ Remove special characters (@, #, $, %, &, *, etc.)
◦ Remove punctuations (.,?!:;() etc.)
◦ Remove numerical values if they are not useful
2. Handling Case Sensitivity
◦ Convert all text to lowercase to avoid duplication issues.
Example: "NLP is Amazing" → "nlp is amazing"
3. Removing Extra Spaces & Whitespace Characters
◦ Extra spaces, tabs, and newlines (\t, \n) can be removed for consistency.
4. Handling Encoding Issues
◦ Convert text to UTF-8 format to avoid encoding mismatches.
2. Text Pre-processing
Pre-processing is the next step after text wrangling, which prepares text for analysis or machine
learning models.
Key Steps in Text Pre-processing
1. Tokenization
◦ Splitting text into smaller units called tokens (words, sentences, or subwords).
◦ Example:
Text: "I love NLP!"
◦ Tokenized: ['I', 'love', 'NLP', '!']
2. Stopword Removal
◦ Stopwords are common words (e.g., "the", "is", "in", "and") that do not add much
meaning to the text.
◦ Example:
Before: "This is a great NLP tutorial."
◦ After: "great NLP tutorial"
3. Stemming & Lemmatization
◦ Stemming: Reduces words to their base/root form using simple heuristics.
Example: "running" → "run", "easily" → "easili"
◦ Lemmatization: Converts words to their base dictionary form using linguistic
knowledge.
Example: "running" → "run", "better" → "good"
4. Part-of-Speech (POS) Tagging
◦ Assigns grammatical labels (noun, verb, adjective, etc.) to each word.
◦ Example: "The dog barks" → [('The', 'DT'), ('dog',
'NN'), ('barks', 'VBZ')]
5. Named Entity Recognition (NER)
◦ Identifies proper nouns such as names, organizations, locations, etc.
◦ Example: "Google was founded by Larry Page" → ['Google'
(ORG), 'Larry Page' (PERSON)]
6. Text Normalization
◦ Converts text into a standard format:
▪ Expand contractions: "I'm" → "I am"
▪ Correct spelling: "recieve" → "receive"
▪ Normalize slang: "u" → "you"
7. Vectorization (Feature Extraction)
◦ Converts text into numerical form for machine learning models.
◦ Techniques:
▪ Bag of Words (BoW)
▪ TF-IDF (Term Frequency-Inverse Document Frequency)
▪ Word Embeddings (Word2Vec, GloVe, BERT, etc.)
Key Points for Exams
• Text wrangling cleans raw text by removing noise, unwanted characters, and formatting
issues.
• Pre-processing prepares text for analysis by tokenization, stopword removal, stemming,
lemmatization, and feature extraction.
• Tokenization splits text into meaningful units (words, sentences).
• Stopwords are removed to reduce computational complexity.
• Stemming vs. Lemmatization: Stemming is faster but less accurate; lemmatization is more
precise.
• NER and POS tagging help in understanding the grammatical structure and named entities
in text.
• Vectorization techniques (BoW, TF-IDF, Word2Vec, etc.) convert text into numerical data
for models.
Tokenization in NLP
What is Tokenization?
Tokenization is the process of breaking down a text into smaller units (tokens), such as words,
sentences, or subwords. It is the first step in text pre-processing for NLP tasks like text analysis,
machine learning, and deep learning.
Types of Tokenization
1. Word Tokenization
◦ Splits text into individual words.
◦ Example:
Input: "Natural Language Processing is amazing!"
◦ Output: ['Natural', 'Language', 'Processing', 'is',
'amazing', '!']
2. Sentence Tokenization
◦ Splits text into sentences based on punctuation (e.g., ., ?, !).
◦ Example:
Input: "NLP is fun. I love learning it!"
◦ Output: ["NLP is fun.", "I love learning it!"]
3. Subword Tokenization (Used in deep learning models like BERT, GPT)
◦ Splits words into meaningful subwords to handle out-of-vocabulary (OOV) words.
◦ Example:
"unhappiness" → ["un", "happiness"]
Why is Tokenization Important?
• Converts unstructured text into structured format.
• Helps in removing stopwords, stemming, and lemmatization.
• Used in feature extraction methods like TF-IDF and word embeddings.
• Essential for NLP models like chatbots, search engines, and sentiment analysis.
Tokenization in Python (Code Examples)
1. Using NLTK (Natural Language Toolkit)
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "NLP is powerful! It helps machines understand human
language."
# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)
# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)
Output:
Word Tokenization: ['NLP', 'is', 'powerful', '!', 'It',
'helps', 'machines', 'understand', 'human', 'language', '.']
Sentence Tokenization: ['NLP is powerful!', 'It helps
machines understand human language.']
2. Using spaCy (More Efficient for Large Text)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Tokenization is the first step in NLP. It splits text into words and sentences."
doc = nlp(text)
# Word Tokenization
print("Word Tokenization:", [token.text for token in doc])
# Sentence Tokenization
print("Sentence Tokenization:", [sent.text for sent in doc.sents])
3. Using Hugging Face’s Tokenizer (For Deep Learning)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization is crucial for NLP!"
tokens = tokenizer.tokenize(text)
print("Subword Tokenization:", tokens)
Output:
Subword Tokenization: ['token', '##ization', 'is', 'crucial',
'for', 'nl', '##p', '!']
Explanation:
• "tokenization" is split into "token" and "##ization" because the model
recognizes "token" as a common word.
• "nlp" is split into "nl" and "##p" as a subword.
Key Points for Exams
• Tokenization is the first step in NLP text processing.
• Types of Tokenization: Word Tokenization, Sentence Tokenization, and Subword
Tokenization.
• NLTK and spaCy are common Python libraries for tokenization.
• Deep learning models (like BERT, GPT) use subword tokenization to handle unknown
words.
• Tokenization helps in text analysis, sentiment analysis, and chatbot development.
Removing Unwanted Tokens in NLP
What are Unwanted Tokens?
Unwanted tokens are elements in text data that do not contribute meaningful information to NLP
tasks. These include:
• Special characters (@, #, $, %, &)
• Punctuation (.,?!:;()[])
• Numbers (123, 45.67)
• Extra whitespace (" NLP is great ")
• Stopwords (the, is, and, in, to, etc.)
• HTML tags (<p>, <div>)
• Emojis and symbols (😊 , ✔, 🚀 )
1. Removing Special Characters & Punctuation
Python Example using Regex (re module)
import re
text = "Hello!! Welcome to NLP. Let's learn & explore?"
clean_text = re.sub(r'[^\w\s]', '', text)  # Remove special characters and punctuation
print(clean_text)
Output:
Hello Welcome to NLP Lets learn explore
2. Removing Numbers
text = "NLP has 100 techniques and 50+ models."
clean_text = re.sub(r'\d+', '', text) # Remove digits
print(clean_text)
Output:
NLP has techniques and models.
3. Removing Extra Whitespaces
text = " NLP is awesome! "
clean_text = " ".join(text.split()) # Remove extra spaces
print(clean_text)
Output:
NLP is awesome!
4. Removing Stopwords (Using NLTK)
Stopwords are common words that do not add much meaning to a sentence.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "This is an amazing NLP tutorial for beginners!"
words = text.split()
filtered_text = " ".join([word for word in words if
word.lower() not in stop_words])
print(filtered_text)
Output:
amazing NLP tutorial beginners!
5. Removing HTML Tags
from bs4 import BeautifulSoup
text = "<p>This is an <b>NLP</b> tutorial.</p>"
clean_text = BeautifulSoup(text, "html.parser").get_text()
print(clean_text)
Output:
This is an NLP tutorial.
6. Removing Emojis and Symbols
import emoji
text = "NLP is awesome! 😊 🚀 "
clean_text = emoji.replace_emoji(text, replace="")
print(clean_text)
Output:
NLP is awesome!
Key Points for Exams
• Unwanted tokens include punctuation, numbers, stopwords, special characters, HTML
tags, and emojis.
• Regex (re.sub) is useful for removing punctuation and numbers.
• NLTK stopwords help filter out common words that do not add meaning.
• BeautifulSoup is used for removing HTML tags.
• Emoji library helps remove or replace emojis in text.
Corrections, Stemming, and Normalization in NLP
1. Text Corrections
Text correction is the process of fixing spelling errors, typos, and grammatical mistakes in text. It is
crucial for improving text quality before further NLP processing.
Types of Text Corrections:
1. Spell Checking: Identifies and corrects misspelled words.
2. Grammatical Corrections: Fixes grammar mistakes.
3. Word Substitution: Suggests correct words for typos.
Python Example: Spell Checking with TextBlob
from textblob import TextBlob
text = "NLP is amzng and very usful for data anlysis."
corrected_text = TextBlob(text).correct()
print(corrected_text)
Output:
NLP is amazing and very useful for data analysis.
Using pyspellchecker for Faster Spell Checking
from spellchecker import SpellChecker
spell = SpellChecker()
text = "Ths is a smple NLP tst."
words = text.split()
corrected_words = [spell.correction(word) for word in words]
corrected_text = " ".join(corrected_words)
print(corrected_text)
Output:
This is a sample NLP test.
2. Stemming
Stemming is the process of reducing words to their root form by removing suffixes. It is a quick
but sometimes inaccurate approach.
Example of Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "flies", "happily", "studies"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'happili', 'studi']
• "running" → "run" ✅
• " ies" → " i" ❌ (Incorrect due to over-stemming)
• "studies" → "studi" ❌ (Incorrect)
Other Stemming Algorithms
1. Porter Stemmer: Most common and fast.
2. Snowball Stemmer: Advanced version of Porter Stemmer.
3. Lancaster Stemmer: More aggressive and removes more characters.
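A short sketch comparing these stemmers side by side (all three classes live in nltk.stem; Lancaster generally strips the most aggressively):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["running", "flies", "happily", "studies"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word), "|", lancaster.stem(word))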
3. Normalization
Text normalization is the process of converting words into a standard format to ensure
consistency in NLP tasks.
Key Normalization Techniques:
1. Lowercasing: Converts all text to lowercase.
text = "Natural Language Processing"
print(text.lower())
Output: "natural language processing"
2. Removing Special Characters & Punctuation:
import re
text = "Hello!!! NLP is great."
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)
Output: "Hello NLP is great"
3. Expanding Contractions: Converts short forms to full forms.
from contractions import fix
text = "I'll go to the park. It's amazing!"
print(fix(text))
Output: "I will go to the park. It is amazing!"
4. Lemmatization (Better than Stemming): Converts words to their dictionary form.
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "happily", "studies"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
Output: ['running', 'fly', 'happily', 'study']
◦ "flies" → "fly" ✅ (Correct compared to stemming)
◦ "studies" → "study" ✅ (Correct)
Key Points for Exams
• Corrections: Fix spelling and grammatical mistakes using tools like TextBlob and
pyspellchecker.
• Stemming: Reduces words to root form but may lead to incorrect results (flies →
fli).
• Normalization: Standardizes text (lowercasing, punctuation removal, contractions,
lemmatization).
• Lemmatization is better than stemming as it gives meaningful words.
Parsing the Text in NLP
Parsing is the process of analyzing the structure of a sentence to understand its meaning and
grammatical structure. It includes Part of Speech (POS) tagging and Probabilistic Parsing.
1. Part of Speech (POS) Tagging
What is POS Tagging?
POS tagging assigns grammatical categories (such as noun, verb, adjective, etc.) to each word in a
sentence.
Example POS Tags:
POS Tag   Meaning
NN        Noun (e.g., cat, car)
VB        Verb (e.g., run, eat)
JJ        Adjective (e.g., beautiful, large)
RB        Adverb (e.g., quickly, silently)
PRP       Pronoun (e.g., he, she, they)
POS Tagging in Python
Using NLTK (Natural Language Toolkit):
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
text = "John is playing football in the park."
words = nltk.word_tokenize(text) # Tokenization
pos_tags = nltk.pos_tag(words) # POS tagging
print(pos_tags)
Output:
[('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'),
('football', 'NN'), ('in', 'IN'), ('the', 'DT'), ('park',
'NN'), ('.', '.')]
POS Tagging Using Spacy (More Efficient)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "John is playing football in the park."
doc = nlp(text)
for token in doc:
    print(token.text, ":", token.pos_)
Output:
John : PROPN
is : AUX
playing : VERB
football : NOUN
in : ADP
the : DET
park : NOUN
. : PUNCT
2. Probabilistic Parsing
What is Probabilistic Parsing?
Probabilistic parsing assigns the most likely grammatical structure to a sentence based on
probability. It helps in handling ambiguous sentences.
Types of Parsing:
1. Constituency Parsing (Phrase Structure Parsing)
◦ Breaks the sentence into noun phrases (NP), verb phrases (VP), etc.
2. Dependency Parsing
◦ Identifies dependencies between words (e.g., subject-verb relationship).
Probabilistic Context-Free Grammar (PCFG)
A Probabilistic Context-Free Grammar (PCFG) assigns probabilities to different grammatical
rules.
Example grammar:
S → NP VP [0.9]
NP → Det N [0.5] | N [0.5]
VP → V NP [0.7] | V [0.3]
Det → 'the' [1.0]
N → 'dog' [0.5] | 'cat' [0.5]
V → 'chased' [1.0]
This means:
• S → NP VP happens 90% of the time.
• NP → Det N happens 50% of the time.
• N → 'dog' and N → 'cat' both happen 50% of the time.
Probabilistic Parsing Using NLTK
import nltk
grammar = nltk.PCFG.fromstring("""
S -> NP VP [0.9]
NP -> Det N [0.5] | N [0.5]
VP -> V NP [0.7] | V [0.3]
Det -> 'the' [1.0]
N -> 'dog' [0.5] | 'cat' [0.5]
V -> 'chased' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
sentence = ['the', 'dog', 'chased', 'the', 'cat']
for tree in parser.parse(sentence):
    print(tree)
Output (Parse Tree):
(S
(NP (Det the) (N dog))
(VP (V chased) (NP (Det the) (N cat))))
Dependency Parsing Using Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The dog chased the cat."
doc = nlp(text)
for token in doc:
print(f"{token.text} --> {token.dep_} -->
{token.head.text}")
Output:
The --> det --> dog
dog --> nsubj --> chased
chased --> ROOT --> chased
the --> det --> cat
cat --> dobj --> chased
. --> punct --> chased
• nsubj (Nominal Subject): "dog" is the subject of "chased."
• dobj (Direct Object): "cat" is the object of "chased."
• det (Determiner): "the" is linked to nouns.
Key Points for Exams
• POS Tagging assigns parts of speech to words.
• NLTK and Spacy are commonly used for POS tagging.
• Probabilistic Parsing helps resolve ambiguities in sentence structure.
• PCFG (Probabilistic Context-Free Grammar) assigns probabilities to grammar rules.
• Constituency Parsing breaks sentences into phrases (NP, VP, etc.).
• Dependency Parsing identifies grammatical relationships between words.
Shallow, Dependency, and Constituency Parsing in
NLP
Parsing is the process of analyzing the structure of a sentence to understand its syntactic and
semantic meaning. It is broadly classi ed into three types:
1. Shallow Parsing (Chunking)
2. Dependency Parsing
3. Constituency Parsing
1. Shallow Parsing (Chunking)
What is Shallow Parsing?
Shallow Parsing, also known as Chunking, groups words into phrases (like noun phrases, verb
phrases) without fully analyzing the entire sentence structure. It does not form a complete parse tree
but identifies key phrases.
Example:
Sentence: "The quick brown fox jumps over the lazy dog."
Shallow Parsing Output:
• [The quick brown fox] (Noun Phrase - NP)
• [jumps] (Verb Phrase - VP)
• [over the lazy dog] (Prepositional Phrase - PP)
Shallow Parsing Using NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "The quick brown fox jumps over the lazy dog"
words = nltk.word_tokenize(sentence) # Tokenization
pos_tags = nltk.pos_tag(words) # POS Tagging
# Define chunk grammar
grammar = "NP: {<DT>?<JJ>*<NN>}" # NP = Determiner (DT) +
Adjective (JJ) + Noun (NN)
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)
tree.pretty_print()
Output:
(S
(NP The/DT quick/JJ brown/JJ fox/NN)
jumps/VBZ
over/IN
(NP the/DT lazy/JJ dog/NN))
✅ Key Points:
• Faster than full parsing
• Extracts key phrases (noun, verb, etc.)
• Useful for Named Entity Recognition (NER) and Information Extraction
2. Dependency Parsing
What is Dependency Parsing?
Dependency Parsing identifies relationships between words in a sentence. Each word (except the
root) depends on another word, forming a dependency tree.
Example:
Sentence: "The cat chases the mouse."
Dependency Relations:
• cat → subject of chases
• mouse → object of chases
• the → determiner (modifies "cat" and "mouse")
Dependency Parsing Using Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = "The cat chases the mouse."
doc = nlp(sentence)
for token in doc:
print(f"{token.text} --> {token.dep_} -->
{token.head.text}")
Output:
The --> det --> cat
cat --> nsubj --> chases
chases --> ROOT --> chases
the --> det --> mouse
mouse --> dobj --> chases
. --> punct --> chases
✅ Key Points:
• Each word depends on a head word
• Represents grammatical structure
• Used in relation extraction, question answering, machine translation
3. Constituency Parsing
What is Constituency Parsing?
Constituency Parsing breaks sentences into constituents (phrases) based on a hierarchical
structure, forming a tree structure (Parse Tree).
Example:
Sentence: "The cat chases the mouse."
Parse Tree:
S
/ \
NP VP
/ / \
Det V NP
| | / \
The chases Det N
| |
the mouse
• S → Sentence
• NP (Noun Phrase) → The cat
• VP (Verb Phrase) → chases the mouse
• Det (Determiner) → The
Constituency Parsing Using NLTK
import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'The' | 'the'
Adj -> 'lazy'
N -> 'cat' | 'mouse'
V -> 'chases'
""")
parser = nltk.ChartParser(grammar)
sentence = ['The', 'cat', 'chases', 'the', 'mouse']
for tree in parser.parse(sentence):
    tree.pretty_print()
Output:
S
/ \
NP VP
/ / \
Det V NP
| | / \
The chases Det N
| |
the mouse
✅ Key Points:
• Breaks sentences into phrases (NP, VP, etc.)
• Used in syntax analysis, grammar checking, sentence understanding
Comparison of Parsing Techniques
Feature          Shallow Parsing                 Dependency Parsing                       Constituency Parsing
Output           Phrases (NP, VP)                Dependency Tree                          Parse Tree
Complexity       Fast & Simple                   Medium                                   High
Use Cases        Chunking, POS tagging           Relation Extraction, Semantic Analysis   Syntax Checking, AI Chatbots
Example Output   "The cat" (NP), "chases" (VP)   cat --> nsubj --> chases                 Full Sentence Parse Tree
Exam-Oriented Key Points
1. Shallow Parsing
◦ Also called Chunking
◦ Groups words into phrases (NP, VP, etc.)
◦ Used in NER, POS tagging, Information Extraction
2. Dependency Parsing
◦ Identifies word dependencies
◦ Outputs ROOT word & dependencies
◦ Used in Relation Extraction, Machine Translation
3. Constituency Parsing
◦ Breaks sentences into phrases (NP, VP, etc.)
◦ Forms Parse Tree (Hierarchical Structure)
◦ Used in Grammar Checking, AI Chatbots
UNIT-3
Text Corpus
✅ 1. Definition
A text corpus is a large and structured collection of texts used for natural language processing
(NLP) tasks such as text classification, machine translation, and sentiment analysis. It serves as the
foundational dataset for training machine learning models and analyzing language trends.
2. Types of Text Corpus
Different types of corpora exist based on the language, content, and purpose.
1. Monolingual Corpus
• Contains text in only one language.
• Used for training language models in that language.
• Example: Wikipedia Corpus (English), Hindi News Articles.
• Use Case: Chatbots, Sentiment Analysis, Spell Checking.
2. Multilingual Corpus
• Contains text in multiple languages.
• Used for cross-lingual research and translation models.
• Example: EuroParl Corpus, UN Documents in different languages.
• Use Case: Machine Translation (e.g., Google Translate).
3. Parallel Corpus
• A special type of multilingual corpus where the same text is available in multiple
languages, side by side.
• Used for training translation models.
• Example: TED Talks Transcripts, UN Parallel Corpus.
• Use Case: Automatic Translation Tools like DeepL, Google Translate.
4. Specialized Corpus (Domain-Specific Corpus)
• Focuses on a specific industry or topic (Medical, Legal, Finance, etc.).
• Used for industry-specific NLP models.
• Example: PubMed (Medical Corpus), Legal Corpus.
• Use Case: Medical chatbots, Legal Document Analysis, Financial Trend Prediction.
5. General Corpus
• Contains a wide variety of topics (news, books, blogs, etc.).
• Used for training general-purpose NLP models.
• Example: Google Books Corpus, Common Crawl (web pages).
• Use Case: AI Assistants like ChatGPT, Speech-to-Text Models.
3. Importance of a Text Corpus
✔ Trains AI models for text understanding and generation.
✔ Improves machine translation and speech recognition.
✔ Used in linguistic research to study language patterns.
✔ Helps in text classification, chatbots, and voice assistants.
4. Preprocessing a Text Corpus
Before using a text corpus, preprocessing is essential to clean and standardize text.
✔ Tokenization – Splitting text into words or sentences.
✔ Lowercasing – Convert all text to lowercase for consistency.
✔ Stopword Removal – Removing frequently occurring but unimportant words (e.g., "is", "the",
"and").
✔ Stemming & Lemmatization – Reducing words to their root or base form (e.g., "running" →
"run").
✔ Removing Punctuation & Special Characters – Eliminating symbols like @, #, $, !.
✔ Vectorization – Converting words into numerical values using techniques like Bag of Words
(BoW), TF-IDF, Word2Vec, GloVe.
5. Summary Table
Type of Corpus   Description                         Example                      Use Case
Monolingual      Text in one language                Wikipedia (English)          Chatbots, Sentiment Analysis
Multilingual     Text in multiple languages          UN Documents                 Cross-language NLP, Translation
Parallel         Same text translated side-by-side   TED Talks Corpus             Machine Translation
Specialized      Domain-specific text                PubMed (Medical)             Medical & Legal NLP
General          Covers multiple topics              Google Books, Common Crawl   AI Assistants, Speech Recognition
6. Key Takeaways
✔ A text corpus is a collection of texts used for NLP and machine learning.
✔ Types include Monolingual, Multilingual, Parallel, Specialized, and General corpora.
✔ Famous corpora include Brown Corpus, Google Books Ngram, and Common Crawl.
✔ Preprocessing is crucial for preparing text before using it in NLP models.
Bag of words
✅ 1. Definition
The Bag of Words (BoW) Model is a simple and widely used technique for representing text in
numerical form. It ignores grammar and word order but keeps track of word frequency in a
document.
It is called "Bag of Words" because it treats text as an unordered "bag" of words, only focusing on
how many times each word appears.
2. Working of Bag of Words Model
Step 1: Collect Text Data
• Example:
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Step 2: Tokenization (Convert Sentences into Words)
• Remove punctuation and split text into words:
["I", "love", "playing", "football"]
["Football", "is", "a", "great", "sport"]
Step 3: Create a Vocabulary
A vocabulary is a list of all unique words in the dataset:
["I", "love", "playing", "football", "is", "a", "great",
"sport"]
Step 4: Create Word Frequency Vectors
Each document is converted into a vector where each column represents a word from the
vocabulary, and the values indicate how often the word appears.
Word I love playing football is a great sport
Doc 1 1 1 1 1 0 0 0 0
Doc 2 0 0 0 1 1 1 1 1
• 1 means the word is present in the document.
• 0 means the word is absent.
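A minimal sketch reproducing this idea with scikit-learn's CountVectorizer (assumed installed); note that its default tokenizer lowercases text and drops one-letter words such as "I" and "a", so the vocabulary is slightly smaller than the hand-built one above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love playing football.", "Football is a great sport."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['football' 'great' 'is' 'love' 'playing' 'sport']
print(bow.toarray())   # one row per document, one count per vocabulary word
# [[1 0 0 1 1 0]
#  [1 1 1 0 0 1]]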
3. Advantages of Bag of Words
✔ Simple and easy to implement.
✔ Works well for small datasets.
✔ Good for text classification, spam filtering, and sentiment analysis.
4. Disadvantages of Bag of Words
❌ Ignores word meaning – "good" and "not good" are treated as separate words.
❌ Ignores word order – "Apple is red" and "Red is apple" have the same vector.
❌ Leads to large feature space – If the vocabulary is large, it creates very large vectors.
5. Applications of Bag of Words
✔ Text Classification – Spam detection, sentiment analysis.
✔ Topic Modeling – Finding the main topics in a document.
✔ Information Retrieval – Search engines rank documents based on keyword frequency.
6. Summary Table
Feature        Description
Definition     A simple way to represent text as word frequency vectors.
How It Works   Creates a vocabulary, then counts word occurrences.
Pros           Simple, effective for small datasets, good for text classification.
Cons           Ignores meaning, ignores word order, large feature space.
Applications   Spam filtering, sentiment analysis, search engines.
7. Key Takeaways
✔ BoW represents text as word frequency vectors, ignoring word order and meaning.
✔ It is widely used for text classification, spam filtering, and search engines.
✔ It struggles with large vocabulary sizes and doesn't capture word relationships.
Bag of N-Grams Model
✅ 1. Definition
The Bag of N-Grams Model is an extension of the Bag of Words (BoW) model that considers
word sequences (N-Grams) instead of individual words. This helps capture the context and order
of words to some extent, making it better than BoW.
2. What is an N-Gram?
An N-Gram is a sequence of N consecutive words from a given text.
Example Sentence: "I love playing
N-Gram Type
football"
Unigram
"I", "love", "playing", "football"
(N=1)
Bigram (N=2) "I love", "love playing", "playing football"
Trigram (N=3) "I love playing", "love playing football"
• Unigrams (N=1) work like Bag of Words (BoW).
• Bigrams (N=2) help capture word order and context.
• Trigrams (N=3) or higher further improve meaning representation.
3. Working of Bag of N-Grams Model
Step 1: Collect Text Data
• Example:
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Step 2: Tokenization (Break text into N-Grams)
For bigrams (N=2), we get:
• Document 1 → ["I love", "love playing", "playing football"]
• Document 2 → ["Football is", "is a", "a great", "great sport"]
Step 3: Create a Vocabulary
["I love", "love playing", "playing football", "Football is",
"is a", "a great", "great sport"]
Step 4: Create N-Gram Frequency Vectors
N-Gram   I love   love playing   playing football   Football is   is a   a great   great sport
Doc 1    1        1              1                  0             0      0         0
Doc 2    0        0              0                  1             1      1         1
• 1 means the bigram is present in the document.
• 0 means the bigram is absent.
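A minimal scikit-learn sketch restricted to bigrams via ngram_range=(2, 2); as above, one-letter words are dropped by the default tokenizer, so the extracted bigrams differ slightly from the hand-built table:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love playing football.", "Football is a great sport."]
bigram_vec = CountVectorizer(ngram_range=(2, 2))   # bigrams only
bigrams = bigram_vec.fit_transform(docs)

print(bigram_vec.get_feature_names_out())
# ['football is' 'great sport' 'is great' 'love playing' 'playing football']
print(bigrams.toarray())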
4. Advantages of Bag of N-Grams Model
✔ Captures context – Unlike BoW, it preserves some word order.
✔ Improves text classification – Helps in sentiment analysis, spam detection.
✔ Better than BoW for understanding meaning.
5. Disadvantages of Bag of N-Grams Model
❌ Higher dimensionality – More features than BoW, making it computationally expensive.
❌ Still limited context – Cannot understand sentence meaning fully.
❌ Sparse Data Issue – Many N-Grams may appear only once, leading to many zero values in
vectors.
6. Applications of Bag of N-Grams Model
✔ Sentiment Analysis – "not good" vs. "good" (BoW fails here).
✔ Spam Detection – Common spam word sequences are captured.
✔ Speech Recognition – Predicting the next word in a sentence.
✔ Plagiarism Detection – Identifies similar phrases in documents.
7. Summary Table
Feature        Description
Definition     A model that represents text using word sequences (N-Grams).
How It Works   Splits text into N-word sequences and creates frequency vectors.
Pros           Preserves some context, improves classification accuracy.
Cons           High dimensionality, still lacks deep meaning understanding.
Applications   Sentiment analysis, spam detection, speech recognition.
8. Key Takeaways
✔ N-Grams capture some word order, making it better than BoW.
✔ Bigram and Trigram models improve accuracy in text classification.
✔ More N-Grams mean higher complexity and sparse data issues.
TF-IDF (Term Frequency - Inverse Document
Frequency)
✅ 1. Definition
The TF-IDF Model is a statistical measure used to evaluate how important a word is in a
document relative to a collection of documents (corpus). Unlike Bag of Words (BoW) and N-
Grams, TF-IDF considers both frequency and significance, reducing the impact of commonly
used words.
2. Why TF-IDF?
In BoW and N-Grams, common words (e.g., "the", "is", "and") appear frequently, making them
seem important when they are not. TF-IDF solves this by:
✔ Giving high weight to important words (e.g., "football", "Python")
✔ Reducing weight of frequently occurring words (e.g., "the", "is", "and")
3. Components of TF-IDF
In the standard formulation, TF-IDF combines two quantities:
• Term Frequency (TF): TF(t, d) = (number of times term t appears in document d) / (total number of terms in d).
• Inverse Document Frequency (IDF): IDF(t) = log(N / number of documents containing t), where N is the total number of documents.
The TF-IDF score of a term in a document is the product: TF-IDF(t, d) = TF(t, d) × IDF(t).
4. Working of TF-IDF Model
Example Dataset
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Document 3: "I love watching football matches."
5. Advantages of TF-IDF Model
✔ Filters out common words – Reduces the impact of stop words (e.g., "is", "the").
✔ Improves document relevance – Highlights important words.
✔ Better than BoW & N-Grams – Considers both word frequency and importance.
6. Disadvantages of TF-IDF Model
❌ Does not capture meaning – Cannot understand synonyms (e.g., "happy" vs. "joyful").
❌ Ignores word order – "playing football" and "football playing" are treated the same.
❌ High dimensionality – Large corpora create very big matrices.
7. Applications of TF-IDF Model
✔ Search Engines (Google, Bing) – Ranks relevant documents.
✔ Spam Detection – Identifies spam messages.
✔ Keyword Extraction – Finds important words in text.
✔ Recommender Systems – Suggests articles based on keywords.
8. Summary Table
Feature        Description
Definition     A statistical measure that finds important words in a document.
How It Works   Calculates TF and IDF, then multiplies them.
Pros           Reduces stop words, improves relevance.
Cons           Ignores meaning, large feature space.
Applications   Search engines, spam detection, keyword extraction.
9. Key Takeaways
✔ TF-IDF reduces the weight of common words and highlights unique words.
✔ It is widely used in search engines, spam detection, and keyword extraction.
✔ Still has limitations, such as ignoring synonyms and word order.
Word2Vec Model
✅ 1. Introduction
The Word2Vec Model is a deep learning-based approach used for generating word embeddings,
representing words as dense numerical vectors in a continuous space. Unlike traditional models
like Bag of Words (BoW) and TF-IDF, which treat words independently, Word2Vec captures
semantic relationships between words.
2. Why Use Word2Vec?
Older models like BoW and TF-IDF have significant limitations:
❌ Do not understand word meanings (e.g., "king" and "queen" are unrelated).
❌ Ignore relationships between words (e.g., "Paris" is related to "France").
❌ Create high-dimensional, sparse vectors that are computationally inefficient.
Word2Vec overcomes these issues by learning compact, meaningful word representations
where:
✔ Similar words have closer vector representations in the space.
✔ It preserves semantic relationships (e.g., "King - Man + Woman ≈ Queen").
3. How Does Word2Vec Work?
Word2Vec is trained using a neural network and operates using two primary architectures:
1⃣ Continuous Bag of Words (CBOW)
• Predicts the target word based on surrounding words.
• Example: Given "I __ playing football", the model predicts "love".
• Works well for frequent words and is computationally faster.
2⃣ Skip-Gram Model
• Predicts context words given a target word.
• Example: Given "love", the model predicts ["I", "playing", "football"].
• Works better for rare words and captures complex relationships.
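A small sketch of training both architectures with the gensim library (the toy corpus is illustrative; real embeddings need millions of sentences):

from gensim.models import Word2Vec

sentences = [["i", "love", "playing", "football"],
             ["football", "is", "a", "great", "sport"],
             ["i", "love", "watching", "football", "matches"]]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv["football"][:5])                  # first 5 vector dimensions
print(skipgram_model.wv.most_similar("football", topn=2)) # nearest words in the learned space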
4. Word Representation in Word2Vec
Word Vector Representation (Simplified)
King [0.2, 0.8, 0.5, 0.9, 0.1]
Queen [0.3, 0.7, 0.6, 0.8, 0.2]
Man [0.1, 0.9, 0.4, 0.7, 0.3]
Woman [0.2, 0.8, 0.5, 0.6, 0.4]
💡 Key Property:
🔹 King - Man + Woman ≈ Queen (Captures gender relationship).
🔹 Paris - France + Italy ≈ Rome (Shows country-capital relationship).
5. Advantages of Word2Vec
✔ Captures word meanings and relationships effectively.
✔ Reduces high-dimensional vectors to lower, dense representations.
✔ Enhances performance in NLP tasks such as sentiment analysis and translation.
6. Disadvantages of Word2Vec
❌ Requires large datasets for effective learning.
❌ Ignores word order and syntax, focusing only on word relationships.
❌ Training can be computationally expensive for extensive corpora.
7. Applications of Word2Vec
✔ Search Engines – Enhances keyword matching and ranking.
✔ Chatbots & Virtual Assistants – Improves understanding of user queries.
✔ Machine Translation – Provides better word mappings for translations.
✔ Recommendation Systems – Suggests relevant content based on word similarity.
8. Quick Summary
Feature        Description
Definition     A deep learning model that learns word meanings using dense vectors.
How It Works   Uses CBOW and Skip-Gram to predict word relationships.
Pros           Captures word semantics, low-dimensional, and efficient.
Cons           Requires large data, ignores syntax, and is computationally expensive.
Applications   Search engines, chatbots, machine translation.
9. Key Takeaways
✔ Word2Vec generates meaningful word embeddings by learning word relationships.
✔ CBOW predicts words from context, while Skip-Gram predicts context from words.
✔ It is widely used in NLP applications like chatbots, search engines, and translation.
FastText Model
✅ 1. Introduction
FastText is a word embedding model developed by Facebook's AI Research (FAIR). It improves
upon Word2Vec and GloVe by representing words as subword n-grams, allowing it to handle out-
of-vocabulary (OOV) words and capture morphological structures.
Unlike Word2Vec and GloVe, which treat words as atomic units, FastText breaks words into
smaller character-based n-grams, making it highly effective for inflected languages and
misspelled words.
2. Why Use FastText?
🚀 Limitations of Word2Vec & GloVe:
• ❌ Cannot handle out-of-vocabulary (OOV) words (new or rare words).
• ❌ Fails to capture internal word structure (e.g., prefixes, suffixes).
✅ FastText Advantages:
• Breaks words into character n-grams, making it better at handling unseen words.
• Works well for morphologically rich languages (e.g., German, Finnish, Hindi).
3. How FastText Works?
Step 1: Break Words into Subword Units (N-Grams)
Each word is decomposed into character-level n-grams (subwords).
🔹 Example: "Apple" (n=3, Trigrams)
• <Ap, App, ppl, ple, le>
🔹 Example: "Running"
• Run, unn, nni, nin, ing
These subword representations allow FastText to learn similarities between words with similar
structures (e.g., "running" and "runner").
Step 2: Compute Word Embeddings
1⃣ Each word's embedding is computed by adding the embeddings of its subwords.
2⃣ This allows the model to generate vectors for unseen words by summing the n-grams of
similar known words.
💡 Example:
Even if the model has never seen "Playful," it can still derive its meaning from "Play" + "ful"
(which exist in its training data).
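A small sketch with gensim's FastText implementation showing how an unseen word still gets a vector from its subword n-grams (toy corpus, illustrative only):

from gensim.models import FastText

sentences = [["running", "runner", "runs", "play"],
             ["playing", "played", "player", "run"]]

# min_n / max_n set the character n-gram sizes used as subwords
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "playful" never appeared in training, but its character n-grams overlap
# with "playing" and "played", so FastText can still produce a vector for it
print(model.wv["playful"][:5])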
4. Comparison of Word Embedding Models
Model      Handles OOV Words?   Captures Morphology?   Training Speed   Memory Usage
Word2Vec   ❌ No                ❌ No                   🚀 Fast          ✅ Low
GloVe      ❌ No                ❌ No                   ⚡ Medium        ⚠ High
FastText   ✅ Yes               ✅ Yes                  ⏳ Slower        ⚠ High
✔ FastText is superior when dealing with new words, spelling errors, or rich languages.
5. Advantages of FastText
✔ Handles Out-of-Vocabulary (OOV) words dynamically.
✔ Understands word morphology (prefixes, suffixes, roots).
✔ Effective for spelling mistakes (e.g., "colour" vs. "color").
✔ Useful for non-English languages with complex word structures.
✔ Can classify words, sentences, and documents (used in NLP classification tasks).
6. Disadvantages of FastText
❌ Computationally expensive (higher memory usage).
❌ Training is slower than Word2Vec due to subword decomposition.
❌ Not always necessary for simple English texts where Word2Vec/GloVe suffice.
7. Applications of FastText
✔ Spell Correction & Autocomplete – Recognizes spelling mistakes.
✔ Chatbots & NLP Assistants – Understands unseen words better.
✔ Multilingual Text Processing – Works well with complex languages.
✔ Sentiment Analysis – Improves accuracy in understanding word variations.
✔ Document Classification – Used in detecting spam, fake news, and topic categorization.
8. Quick Summary
Feature        Description
Definition     A word embedding model that represents words using subword n-grams.
How It Works   Breaks words into subword units and computes embeddings.
Pros           Handles OOV words, captures word structure, good for multilingual NLP.
Cons           High computational cost, slower than Word2Vec.
Applications   Text classification, NLP assistants, spell correction, sentiment analysis.
9. Key Takeaways
✔ FastText solves the problem of OOV words by using subword n-grams.
✔ It is useful for morphologically complex languages.
✔ Slower and memory-intensive, but highly effective for NLP applications.
Building a Text Classifier
✅ 1. Introduction
A Text Classifier is a machine learning model that categorizes text into predefined classes (e.g.,
spam detection, sentiment analysis, topic classification). It converts raw text into numerical features
and applies classification algorithms to predict the category.
🔹 Example Applications:
✔ Spam vs. Non-Spam Email Classification
✔ Sentiment Analysis (Positive/Negative/Neutral)
✔ News Categorization (Sports, Politics, Technology)
2. Steps to Build a Text Classifier
Step 1: Collect and Preprocess the Data
📌 Data Sources: Web Scraping, APIs, Databases, CSV Files
📌 Preprocessing Steps:
✔ Remove punctuation & special characters
✔ Convert to lowercase
✔ Tokenization (splitting text into words)
✔ Stopword Removal (removing words like “the”, “is”)
✔ Stemming/Lemmatization (reducing words to root form)
Step 2: Convert Text into Numerical Features
Since ML models work with numbers, we must convert text into vectors using:
Technique             Description
Bag of Words (BoW)    Counts word frequency in text.
TF-IDF                Weights words based on importance.
Word2Vec / FastText   Learns word relationships using embeddings.
Example using TF-IDF:
Text 1: "Machine learning is powerful."
Text 2: "Deep learning improves AI."
TF-IDF assigns weights to "Machine", "Deep", "learning", etc.
Step 3: Choose a Classification Algorithm
📌 Common Machine Learning Models for Text Classification:
✔ Naïve Bayes – Works well for spam filtering, sentiment analysis.
✔ Logistic Regression – Good for binary classification.
✔ SVM (Support Vector Machine) – Handles large feature spaces.
✔ Random Forest – Uses decision trees for better accuracy.
✔ Deep Learning (LSTMs, CNNs, Transformers) – Best for complex NLP tasks.
🔹 Example: Using Naïve Bayes for spam classification
Input: "Win a free iPhone now!"
Model predicts: SPAM
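A minimal sketch of such a spam classifier with scikit-learn (the tiny hand-labelled dataset is purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["Win a free iPhone now!", "Meeting at 5 PM tomorrow",
         "Claim your free prize money", "Please review the attached report",
         "Free entry to the weekly lottery", "Lunch with the team on Friday"]
labels = ["spam", "not spam", "spam", "not spam", "spam", "not spam"]

# TF-IDF feature extraction followed by a Naïve Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Win a free holiday now!"]))  # expected: ['spam']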
Step 4: Train & Evaluate the Model
🔹 Split Data into Train & Test Sets (e.g., 80% Train, 20% Test)
🔹 Train the Model using labeled data.
🔹 Evaluate Performance using metrics:
Metric Description
Accuracy % of correct predictions.
Precision How many predicted positives are actually positive?
Recall How many actual positives were correctly classified?
F1-Score Balance between Precision & Recall.
Step 5: Deploy & Optimize
✔ Deploy as a REST API or integrate with apps.
✔ Optimize using hyperparameter tuning (adjusting model parameters).
✔ Use real-time data to improve classification accuracy.
3. Advantages of Text Classification
✔ Automates text categorization tasks.
✔ Saves time in customer support & moderation.
✔ Scalable for handling large datasets.
✔ Works well with AI assistants & chatbots.
4. Disadvantages
❌ Sensitive to noise (spelling errors, slang).
❌ Data imbalance issues (some categories may have fewer samples).
❌ Computationally expensive for deep learning models.
5. Applications of Text Classification
✔ Spam Detection – Gmail filters spam emails.
✔ Sentiment Analysis – Analyzing customer reviews.
✔ Topic Categorization – Sorting news articles.
✔ Toxic Comment Detection – Social media moderation.
6. Summary
Step Description
1. Preprocessing Clean text, remove stopwords, tokenize.
2. Feature Extraction Convert text into numerical vectors (BoW, TF-IDF, Word2Vec).
3. Choose Model Use Naïve Bayes, SVM, or deep learning.
4. Train & Evaluate Use accuracy, precision, recall.
5. Deploy Deploy as API, optimize performance.
7. Key Takeaways
✔ Text classification automates categorizing textual data.
✔ TF-IDF, BoW, and embeddings help in feature extraction.
✔ Naïve Bayes, SVM, and deep learning models are used for classification.
✔ Evaluation metrics ensure accuracy & reliability.
Text Similarity and Document Similarity Measures
✅ 1. Introduction
Text similarity measures how closely related two pieces of text are. It is essential in search
engines, plagiarism detection, chatbots, and recommendation systems.
👉 Example: "Machine learning is great" vs. "Deep learning is amazing" → These sentences have
some similarity but are not identical.
2. Types of Text Similarity
Type                  Description                                                Example
Lexical Similarity    Compares word occurrences in both texts.                   "Hello world" & "Hello there" → similar due to "Hello".
Semantic Similarity   Measures meaning-based similarity (word relationships).    "I love dogs" & "I adore puppies" → different words but same meaning.
3. Text Similarity Methods
1⃣ Jaccard Similarity (Lexical-Based)
📌 Measures overlap between two sets of words.
2⃣ Cosine Similarity (Vector-Based)
📌 Measures angle between text vectors (range: 0 to 1).
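A small sketch of both measures in Python (Jaccard on word sets, cosine on TF-IDF vectors; the two sentences are the examples from the introduction):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "Machine learning is great"
b = "Deep learning is amazing"

# Jaccard similarity: size of the word-set overlap divided by the union
set_a, set_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(set_a & set_b) / len(set_a | set_b)
print("Jaccard:", round(jaccard, 2))

# Cosine similarity: angle between the TF-IDF vectors of the two sentences
vectors = TfidfVectorizer().fit_transform([a, b])
print("Cosine :", round(cosine_similarity(vectors[0], vectors[1])[0][0], 2))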
3⃣ TF-IDF Similarity
📌 Weighs words based on importance in a document.
📌 Common in search engines & keyword extraction.
✅ Example:
✔ "Arti cial Intelligence is the future"
✔ "The future of AI is bright"
✔ TF-IDF assigns higher weights to important words like "Arti cial Intelligence".
🔹 Pros: Helps in search ranking.
🔹 Cons: Doesn't consider word meaning.
4⃣ Word Embeddings (Word2Vec, GloVe, FastText)
📌 Captures semantic relationships between words.
📌 Similar words have closer vector representations.
✅ Example:
✔ "King" - "Man" + "Woman" = "Queen"
✔ "Car" is closer to "Vehicle" than "Apple".
🔹 Pros: Handles synonyms & context.
🔹 Cons: Needs a large dataset to train.
4. Document Similarity Methods
1⃣ Cosine Similarity for Documents
✔ Compares document vectors using TF-IDF or Word2Vec.
✔ Used in news categorization & plagiarism detection.
✅ Example:
✔ Article 1: "Space exploration is exciting."
✔ Article 2: "NASA launches new space mission."
✔ Cosine Similarity = 0.75 (High Similarity)
2⃣ Latent Semantic Analysis (LSA)
📌 Identifies hidden topics in text.
📌 Uses Singular Value Decomposition (SVD) for dimensionality reduction.
✅ Example: Group similar documents based on themes like "Technology" or "Sports".
🔹 Pros: Reduces noise.
🔹 Cons: Computationally expensive.
5. Applications of Text Similarity
✔ Plagiarism Detection – Checks content similarity.
✔ Search Engines – Google ranks pages based on similarity.
✔ Chatbots – Detects similar queries.
✔ Document Clustering – Groups similar research papers.
✔ Recommender Systems – Suggests similar articles or movies.
6. Summary
Method              Type                 Best For
Jaccard Similarity  Lexical              Short text, simple matching.
Cosine Similarity   Vector-Based         Long text, document similarity.
TF-IDF Similarity   Weighted Frequency   Search engines, keyword ranking.
Word Embeddings     Semantic             Capturing meaning, chatbots.
LSA                 Topic Modeling       Document clustering, NLP tasks.
7. Key Takeaways
✔ Text similarity helps in plagiarism detection, chatbots, & search engines.
✔ Jaccard & Cosine Similarity are fast & effective for lexical comparisons.
✔ TF-IDF is great for ranking important words.
✔ Word Embeddings capture semantic relationships.
Building a Text Classifier
✅ 1. Introduction
A Text Classifier automatically assigns labels/categories to text based on its content.
✔ Used in spam detection, sentiment analysis, topic classification, and chatbot intent
detection.
👉 Example:
• Spam Email Detection → "Congratulations! You've won a prize!" → Spam
• Sentiment Analysis → "This product is amazing!" → Positive Sentiment
2. Steps to Build a Text Classifier
Step 1⃣ : Data Collection
📌 Gather labeled text data (training dataset) with categories.
✅ Example:
Text Category
"I love this movie!" Positive
"This product is terrible." Negative
"Win a free iPhone now!" Spam
Step 2⃣ : Text Preprocessing
📌 Clean the text to remove noise and make it machine-readable.
✔ Lowercasing – Convert text to lowercase.
✔ Removing Punctuation – "Hello, World!" → "Hello World"
✔ Stopword Removal – Remove common words like "the", "is", "in".
✔ Tokenization – Split text into words ("I love NLP" → [I, love, NLP]).
✔ Lemmatization/Stemming – Convert words to their base form (running → run).
✅ Example (Before & After Preprocessing)
💬 Before: "I really love playing football!!!"
💬 After: "love play football"
Step 3⃣ : Feature Extraction (Convert Text into Numbers)
📌 Convert text into numerical representations for the model to process.
✔ Bag of Words (BoW) – Counts word occurrences.
✔ TF-IDF (Term Frequency-Inverse Document Frequency) – Weighs important words.
✔ Word Embeddings (Word2Vec, GloVe, FastText) – Captures word meaning.
✅ Example:
• BoW: "I love AI" → [1, 1, 1, 0, 0]
• TF-IDF: "I love AI" → [0.3, 0.5, 0.7, 0, 0]
• Word Embeddings: "I love AI" → [0.23, 0.45, -0.67, ...]
Step 4⃣ : Choose a Classification Model
📌 Use a Machine Learning algorithm to train the classifier.
Algorithm                                 Best For                                 Pros
Naïve Bayes                               Sentiment analysis, spam detection       Fast & simple
Logistic Regression                       Binary classification (Spam/Not Spam)    Good for small datasets
Support Vector Machine (SVM)              Text classification                      Works well with high-dimensional data
Random Forest                             Multi-category classification            Reduces overfitting
Deep Learning (LSTM, CNN, Transformers)   Advanced NLP tasks                       Captures context & sequence
✅ Example:
✔ Naïve Bayes Classifier (for Spam Detection):
• "Win a free iPhone" → Spam
• "Meeting at 5 PM" → Not Spam
Step 5⃣ : Model Training & Evaluation
📌 Train the model using a dataset and check accuracy.
✔ Split Data – 80% for training, 20% for testing.
✔ Performance Metrics:
• Accuracy – Correct predictions out of total predictions.
• Precision – True Positives / (True Positives + False Positives).
• Recall – How many actual positives were correctly predicted.
✅ Example Evaluation Results:
✔ Accuracy: 85%
✔ Precision: 90% (Fewer false positives)
✔ Recall: 80% (Some false negatives)
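A small sketch of computing these metrics with scikit-learn, using illustrative labels and predictions (1 = spam, 0 = not spam):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))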
3. Applications of Text Classification
✔ Spam Filtering – Gmail detects spam emails.
✔ Sentiment Analysis – Reviews classified as Positive/Negative.
✔ News Categorization – Sports, Politics, Tech, etc.
✔ Chatbot Intent Detection – "Order a pizza" → Food Order Category.
4. Summary
Step                    Description
Data Collection         Gather labeled text data.
Preprocessing           Clean and prepare text for analysis.
Feature Extraction      Convert text into numerical format.
Model Selection         Choose Naïve Bayes, SVM, Deep Learning, etc.
Training & Evaluation   Train model and check accuracy.
5. Key Takeaways
✔ Text classification helps in spam detection, sentiment analysis, and chatbots.
✔ Preprocessing improves accuracy by removing noise.
✔ TF-IDF & Word Embeddings help in better representation.
✔ Naïve Bayes, SVM, and Deep Learning are commonly used classifiers.
UNIT-4
Semantic Analysis: Word Sense Disambiguation
✅ 1. Introduction
📌 Semantic Analysis focuses on understanding the meaning of words and sentences in context.
📌 Word Sense Disambiguation (WSD) is the process of determining the correct meaning of a
word in a given context.
👉 Example:
• "I went to the bank to withdraw money." → Bank = Financial Institution
• "I sat on the river bank to relax." → Bank = River Edge
Here, the word "bank" has multiple meanings, and WSD helps determine the correct one.
2. Why is WSD Important?
✔ Improves NLP Applications – Used in chatbots, search engines, and translation systems.
✔ Reduces Ambiguity – Helps in accurate meaning extraction from text.
✔ Enhances Machine Learning Models – Provides contextual understanding for AI.
3. Approaches to Word Sense Disambiguation
1⃣ Knowledge-Based Approaches (Uses Dictionaries & Ontologies)
📌 Uses predefined lexical databases like WordNet.
✔ Lesk Algorithm
• Compares dictionary definitions (glosses) of words.
• Example:
◦ Bank (Financial): "A place where money is kept."
◦ Bank (River): "The land beside a river."
◦ If the sentence contains "money", it selects the first meaning.
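A small sketch of the Lesk algorithm using NLTK's built-in implementation (assumes the punkt and wordnet resources have been downloaded; the simplified algorithm does not always pick the intuitively correct sense):

import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# One-time setup: nltk.download("punkt"); nltk.download("wordnet")
sentence = "I went to the bank to withdraw money"
sense = lesk(word_tokenize(sentence), "bank")

print(sense)               # the WordNet synset chosen for "bank"
print(sense.definition())  # its dictionary gloss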
✔ Path-Based Methods
• Uses semantic relationships in WordNet.
• Measures distance between words in a semantic graph.
2⃣ Supervised Learning Approaches (Uses Labeled Data)
📌 Requires training data with correct word meanings.
✔ Decision Trees, Naïve Bayes, SVM
• Example: Train a classifier with sentences containing "bank" and their correct meanings.
✔ Limitation: Needs large labeled datasets, which can be costly to create.
3⃣ Unsupervised Learning Approaches (Uses Context Clustering)
📌 Clusters words based on their usage in different sentences.
✔ Example:
• The word "bank" appears in finance-related texts and geography-related texts.
• The model learns that "bank" in financial texts means financial institution, while in
geography texts it means river edge.
✔ Latent Semantic Analysis (LSA), Neural Networks, Word Embeddings (Word2Vec, BERT)
are commonly used.
4. Applications of WSD
✔ Machine Translation – Improves accuracy by translating words correctly.
✔ Search Engines – Helps return relevant search results.
✔ Speech Recognition – Corrects homophones like "write" vs. "right".
✔ Chatbots & Virtual Assistants – Enhances conversational AI understanding.
5. Summary Table
Method                  Type               Pros                        Cons
Lesk Algorithm          Knowledge-Based    Simple, uses dictionaries   Limited accuracy
Path-Based Methods      Knowledge-Based    Uses semantic structure     Requires a lexical database
Supervised Learning     Machine Learning   High accuracy               Needs labeled data
Unsupervised Learning   Machine Learning   Learns from raw text        Computationally expensive
6. Key Takeaways
✔ WSD helps AI understand words in the right context to reduce ambiguity.
✔ Lexical (dictionary-based), supervised, and unsupervised methods are used for WSD.
✔ Applications include Google Search, ChatGPT, Speech Assistants, and Translation Systems.
Named Entity Recognition (NER)
✅ 1. Introduction
📌 Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that
identifies and categorizes proper names in text.
📌 It extracts entities like Person, Organization, Location, Date, Time, etc.
👉 Example:
• "Elon Musk founded SpaceX in 2002, headquartered in California."
◦ Elon Musk → Person
◦ SpaceX → Organization
◦ 2002 → Date
◦ California → Location
2. Why is NER Important?
✔ Improves Information Extraction – Helps in news classification, chatbots, and search
engines.
✔ Enhances Data Organization – Useful for sorting and indexing large text corpora.
✔ Strengthens AI Applications – Helps machine learning models understand structured data.
3. Types of Named Entities
✔ Person (PER): Identifies individuals.
✔ Organization (ORG): Recognizes companies, institutions.
✔ Location (LOC): Detects countries, cities, landmarks.
✔ Date & Time (DATE, TIME): Extracts temporal information.
✔ Monetary Values (MONEY): Identifies prices, salaries.
✔ Percentages (PERCENT): Finds percentage-based data.
✔ Miscellaneous (MISC): Captures other entity types like product names, events.
4. Approaches to NER
1⃣ Rule-Based Methods
📌 Uses regular expressions and pattern-matching rules.
✔ Example:
• If a word starts with a capital letter and follows "Mr." or "Dr." → Person Name.
✔ Limitation: Struggles with new or unseen entities.
2⃣ Machine Learning-Based Methods
📌 Uses labeled training data to identify entity patterns.
✔ Common algorithms:
• Conditional Random Fields (CRF)
• Hidden Markov Models (HMM)
• Support Vector Machines (SVM)
✔ Limitation: Needs large, well-annotated datasets.
3⃣ Deep Learning-Based Methods
📌 Uses Neural Networks & Word Embeddings.
✔ Common models:
• BiLSTM + CRF
• Transformers (BERT, RoBERTa, GPT-based models)
✔ Advantage: Learns context and works well with unstructured data.
✔ Limitation: Computationally expensive.
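A minimal sketch using spaCy's pre-trained statistical NER model (assumes en_core_web_sm has been installed with "python -m spacy download en_core_web_sm"):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002, headquartered in California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. Elon Musk -> PERSON, 2002 -> DATE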
5. Applications of NER
✔ Search Engines – Google uses NER to highlight key results.
✔ Healthcare NLP – Identifies diseases, drugs from medical texts.
✔ Financial Analysis – Extracts companies and financial terms from reports.
✔ Chatbots & Virtual Assistants – Recognizes names, dates, and locations for personalized
responses.
6. Summary Table
Method             Type                 Pros             Cons
Rule-Based         Pattern Matching     Simple, quick    Struggles with unseen words
Machine Learning   Statistical Models   More adaptable   Needs labeled data
Deep Learning      Neural Networks      Context-aware    Requires high computing power
7. Key Takeaways
✔NER extracts and categorizes names, places, dates, and organizations.
✔ Methods include Rule-Based, Machine Learning, and Deep Learning approaches.
✔ It enhances search engines, finance, healthcare, and AI assistants.
Topic Modeling
✅ 1. Introduction
📌 Topic Modeling is an unsupervised learning technique used to discover hidden themes in
large collections of text.
📌 It groups words that frequently appear together into topics without needing labeled data.
👉 Example:
• News Articles Dataset:
◦ Topic 1: ["election," "vote," "candidate," "government"] → Politics
◦ Topic 2: ["goal," "match," "team," "player"] → Sports
2. Why is Topic Modeling Important?
✔ Summarizes Large Text Data – Helps in document classification and search engines.
✔ Extracts Meaningful Insights – Identifies main themes in social media, news, and research
papers.
✔ Improves Text Analysis – Helps in sentiment analysis and customer feedback categorization.
3. Common Topic Modeling Techniques
1⃣ Latent Dirichlet Allocation (LDA)
📌 Most popular method that assumes each document contains multiple topics.
✔ Works on probability distribution of words across topics.
✔ Example:
• A movie review may contain "acting," "script," "director" (Topic 1: Film Industry) and
"cinematography," "camera," "effects" (Topic 2: Technical Aspects).
✔ Limitation: Needs tuning for the correct number of topics.
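A minimal sketch of LDA with scikit-learn (the three toy documents and the choice of two topics are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the election vote and the candidate for government",
        "the team scored a goal in the football match",
        "voters chose a new government in the election"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the highest-weighted words for each discovered topic
words = counts.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")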
2⃣ Latent Semantic Indexing (LSI)
📌 Uses Singular Value Decomposition (SVD) to find patterns in word usage.
✔ Helps reduce noise in text data.
✔ Example:
• Words like "car," "vehicle," and "automobile" belong to the same topic even if they
aren’t used together frequently.
✔ Limitation: Doesn't handle polysemy (words with multiple meanings) well.
3⃣ Non-Negative Matrix Factorization (NMF)
📌 Decomposes a document-word matrix into topic-based matrices.
✔ Used in recommendation systems and document clustering.
✔ Example:
• Amazon reviews may contain topics related to "quality," "delivery," "price," and
"customer service."
✔ Limitation: Less interpretable than LDA.
4. Applications of Topic Modeling
✔ News Categorization – Groups articles into business, sports, entertainment, etc.
✔ Customer Feedback Analysis – Identifies major issues from reviews.
✔ Healthcare NLP – Extracts medical conditions and symptoms from patient records.
✔ Legal & Academic Research – Helps in summarizing long reports.
5. Summary Table
Method   Type                   Pros                    Cons
LDA      Probabilistic          Finds coherent topics   Requires tuning
LSI      SVD-Based              Handles synonyms        Struggles with polysemy
NMF      Matrix Factorization   Good for clustering     Less interpretable
6. Key Takeaways
✔ Topic Modeling finds hidden themes in large text datasets.
✔ LDA, LSI, and NMF are widely used techniques.
✔ Applications include search engines, news aggregation, and customer feedback analysis.
Latent Semantic Indexing (LSI)
✅ 1. Introduction
📌 Latent Semantic Indexing (LSI) is a mathematical technique used for dimensionality
reduction in text analysis.
📌 It captures relationships between words and documents by mapping them to a lower-
dimensional space.
👉 Example:
• Words "car," "vehicle," and "automobile" may not appear together but belong to the
same semantic topic.
• LSI groups them under a single concept (e.g., "transportation").
2. Why is LSI Important?
✔ Improves Search Accuracy – Finds relevant documents even if exact keywords aren’t used.
✔ Handles Synonyms Well – Recognizes related words even if they differ in spelling.
✔ Reduces Noise in Text Data – Helps filter out irrelevant terms and improve topic coherence.
3. How LSI Works?
📌 LSI is based on Singular Value Decomposition (SVD), which decomposes a document-term
matrix into three smaller matrices:
👉 Steps:
1⃣ Create a Term-Document Matrix (TDM) – Represents word occurrences in documents.
2⃣ Apply SVD – Breaks TDM into smaller matrices to capture important features.
3⃣ Reduce Dimensionality – Eliminates noise while preserving key topics.
4⃣ Retrieve Topics – Identifies relationships between words and documents.
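A minimal sketch of these steps with scikit-learn, applying TruncatedSVD to a TF-IDF term-document matrix (documents and the choice of 2 components are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cars and vehicles on the road",
        "the automobile industry builds new vehicles",
        "football players scored a goal in the match"]

tfidf = TfidfVectorizer(stop_words="english")   # Step 1: term-document matrix
X = tfidf.fit_transform(docs)

lsi = TruncatedSVD(n_components=2)              # Steps 2-3: SVD + dimensionality reduction
doc_concepts = lsi.fit_transform(X)             # Step 4: documents in latent concept space

print(doc_concepts.round(2))   # rows: documents, columns: latent concepts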
4. Applications of LSI
✔ Search Engines – Improves search results by understanding word relationships.
✔ Document Clustering – Groups similar articles, emails, and reports together.
✔ Spam Filtering – Detects spam by analyzing latent topics in emails.
✔ Plagiarism Detection – Compares documents based on semantic similarity.
5. LSI vs Other Topic Modeling Techniques
Method   Type                   Pros                                Cons
LSI      SVD-Based              Handles synonyms, reduces noise     Struggles with polysemy
LDA      Probabilistic          Assigns probability distributions   Requires tuning
NMF      Matrix Factorization   Works well with clustering          Harder to interpret
6. Key Takeaways
✔ LSI reduces text dimensions while preserving meaning.
✔ It improves search accuracy, document clustering, and spam filtering.
✔ Based on SVD, it captures hidden relationships between words.
Introduction to Lexicons and Sentiment Analysis
✅ 1. Introduction
📌 Lexicons are predefined lists of words with associated meanings, often used in NLP tasks like
sentiment analysis, named entity recognition, and text classification.
📌 Sentiment Analysis is the process of determining the emotional tone behind a piece of text
(e.g., positive, negative, or neutral).
👉 Example:
• "The product is amazing!" → Positive 😊
• "I had a terrible experience." → Negative 😡
• "The service was okay." → Neutral 😐
2. What is a Lexicon?
✔ A Lexicon is a collection of words with metadata (e.g., meaning, polarity, frequency).
✔ In Sentiment Analysis, lexicons contain words labeled as positive, negative, or neutral.
👉 Example of a Sentiment Lexicon:
Word        Sentiment Score
Excellent   +3 (Positive)
Bad         -2 (Negative)
Happy       +2 (Positive)
Angry       -3 (Negative)
3. Lexicon-Based Sentiment Analysis
📌 Uses predefined word lists to determine sentiment.
👉 Steps:
1⃣ Tokenize the text.
2⃣ Match words with sentiment lexicons.
3⃣ Assign scores and compute overall sentiment.
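A small sketch of these steps using a toy lexicon like the table above (the scores and word list are illustrative):

# Toy sentiment lexicon; real lexicons contain thousands of scored words
lexicon = {"excellent": 3, "happy": 2, "amazing": 3, "bad": -2, "angry": -3, "terrible": -3}

def lexicon_sentiment(text):
    tokens = text.lower().replace("!", "").replace(".", "").split()   # step 1: tokenize
    score = sum(lexicon.get(token, 0) for token in tokens)            # steps 2-3: match and add scores
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(lexicon_sentiment("The product is amazing!"))        # Positive
print(lexicon_sentiment("I had a terrible experience."))   # Negative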
✔ Advantages:
• Simple and interpretable
• No need for training data
• Works well for domain-specific tasks
❌ Limitations:
• Ignores context and sarcasm (e.g., "This is just great!" could be sarcastic).
• Fails with unseen words or phrases.
4. Machine Learning-Based Sentiment Analysis
📌 Uses supervised learning to classify text into sentiment categories.
👉 Common ML Algorithms Used:
✔ Naïve Bayes – Probability-based approach.
✔ SVM (Support Vector Machine) – Finds the best boundary between classes.
✔ Deep Learning (LSTMs, Transformers) – Captures context better.
✔ Advantages:
• Understands complex patterns (e.g., sarcasm, context).
• Learns from data instead of relying on predefined words.
❌ Limitations:
• Requires large labeled datasets.
• Computationally expensive.
5. Applications of Sentiment Analysis
✔ Social Media Monitoring – Tracks customer opinions on Twitter, Facebook, etc.
✔ Product Reviews Analysis – Determines customer satisfaction.
✔ Stock Market Prediction – Analyzes news sentiment for nancial insights.
✔ Customer Support Automation – Detects unhappy customers for better service.
6. Summary Table
Approach        Pros                           Cons
Lexicon-Based   Simple, no training needed     Ignores context, sarcasm
ML-Based        More accurate, context-aware   Needs large datasets
7. Key Takeaways
✔ Lexicons store word meanings, often used in sentiment analysis.
✔ Sentiment Analysis determines if a text is positive, negative, or neutral.
✔ Lexicon-based methods are simple but lack context, while ML models offer better accuracy.
Word Embeddings: An Overview
✅ 1. Introduction
📌 Word Embeddings are numerical representations of words in a vector space that capture
semantic meaning and relationships between words.
📌 Unlike traditional one-hot encoding, word embeddings provide a dense representation where
similar words are positioned closer together in the vector space.
👉 Example:
• "King" - "Man" + "Woman" ≈ "Queen"
• "Paris" is closer to "France" than to "India" in vector space.
2. Why Are Word Embeddings Important?
✔ Preserve Semantic Relationships – Words with similar meanings are closer in the embedding
space.
✔ Handle Synonyms and Context – "Happy" and "Joyful" have nearly identical vector
representations.
✔ Enhance NLP Tasks – Used in search engines, chatbots, and sentiment analysis.
❌ Challenges:
• Requires large datasets for effective training.
• Can inherit biases present in training data.
3. Common Word Embedding Techniques
1⃣ Word2Vec
📌 Developed by Google, converts words into high-dimensional vectors.
✔ Uses CBOW (Continuous Bag of Words) and Skip-Gram models.
✔ Can capture relationships like "France" → "Paris" (capital-city relation).
2⃣ GloVe (Global Vectors for Word Representation)
📌 Developed by Stanford, based on word co-occurrence statistics.
✔ Ideal for capturing semantic word relationships.
✔ Example: Words frequently appearing together have closer vector representations.
3⃣ FastText
📌 Developed by Facebook, enhances Word2Vec by incorporating subword (morpheme)
information.
✔ Recognizes variations of words (e.g., run, running, runner have similar representations).
✔ Suitable for handling low-resource languages.
4. How Word Embeddings Work?
👉 Steps to Generate Word Embeddings:
1⃣ Tokenization – Splitting text into individual words.
2⃣ Building Vocabulary – Creating a list of unique words.
3⃣ Training the Model – Assigning numerical vectors based on word relationships.
4⃣ Applying Embeddings – Using them in tasks like text classification, translation, and
summarization.
5. Applications of Word Embeddings
✔ Machine Translation – Google Translate maps words across different languages.
✔ Chatbots & Virtual Assistants – Helps generate human-like responses.
✔ Search Engines – Improves query understanding and retrieval accuracy.
✔ Text Summarization – Extracts essential information from documents.
6. Comparison of Word Embedding Models
Model      Key Feature                Advantages                     Limitations
Word2Vec   Context-based learning     Captures semantic meaning      Requires large data
GloVe      Co-occurrence matrix       Better at word relationships   Computationally expensive
FastText   Uses subword information   Handles rare words well        Slower training
7. Key Takeaways
✔ Word embeddings provide a meaningful way to represent words in numerical form.
✔ Word2Vec, GloVe, and FastText are commonly used models for word representation.
✔ These models are crucial in applications like search engines, chatbots, and NLP tasks.
UNIT-5
Speech Recognition
✅ 1. Introduction
📌 Speech Recognition (also known as Automatic Speech Recognition - ASR) is the process of
converting spoken language into text.
📌 It enables machines to understand human speech and is widely used in voice assistants,
transcription services, and accessibility tools.
👉 Example Applications:
• Voice Assistants – Siri, Google Assistant, Alexa
• Speech-to-Text Services – Google Docs Voice Typing, Otter.ai
• Call Centers – Automated customer support
2. How Speech Recognition Works?
👉 Step-by-Step Process:
1⃣ Audio Input – The system receives a speech signal via a microphone.
2⃣ Feature Extraction – Converts audio waves into digital features (e.g., Mel-Frequency Cepstral
Coefficients - MFCC).
3⃣ Acoustic Model – Maps speech features to phonemes (smallest units of sound).
4⃣ Language Model – Predicts the most probable sequence of words.
5⃣ Decoder – Converts phonemes into words, forming a meaningful sentence.
3. Key Technologies Used
✔ Hidden Markov Models (HMMs) – Traditional method for recognizing speech patterns.
✔ Deep Neural Networks (DNNs) – Modern AI-based approach for higher accuracy.
✔ Transformer Models (e.g., Wav2Vec, Whisper) – Advanced speech recognition using self-
supervised learning.
4. Challenges in Speech Recognition
❌ Accents & Dialects – Variations in pronunciation make recognition difficult.
❌ Background Noise – Noisy environments reduce accuracy.
❌ Homophones – Words that sound similar but have different meanings (e.g., "two" vs. "to" vs.
"too").
❌ Code-Switching – Mixing of languages within speech (common in India, e.g., Hinglish).
5. Applications of Speech Recognition
✔ Voice Assistants – Apple Siri, Amazon Alexa, Google Assistant.
✔ Dictation & Transcription – Automatic conversion of speech to text.
✔ Healthcare – Voice-based medical documentation.
✔ Call Center Automation – AI-powered voice response systems.
✔ Accessibility – Helps individuals with disabilities (e.g., voice-controlled devices).
6. Popular Speech Recognition Systems
System                      Company   Key Feature
Google Speech-to-Text       Google    AI-powered transcription with real-time processing.
IBM Watson Speech-to-Text   IBM       Industry-level accuracy with domain adaptation.
Amazon Transcribe           Amazon    Works with AWS cloud services.
DeepSpeech                  Mozilla   Open-source and AI-powered.
7. Summary & Future Trends
✔ Speech recognition has revolutionized human-computer interaction.
✔ AI-based approaches like transformers are improving accuracy.
✔ Future advancements include real-time multilingual recognition and emotion-aware speech
processing.
Machine Translation
✅ 1. Introduction
📌 Machine Translation (MT) is the process of automatically translating text from one language
to another using computers.
📌 It is widely used in global communication, localization, and multilingual NLP applications.
👉 Example Applications:
• Google Translate – Real-time language translation.
• Facebook AI Translation – Translates user posts and comments.
• Microsoft Translator – Used in business and education.
2. Types of Machine Translation
1⃣ Rule-Based Machine Translation (RBMT)
✔ Uses linguistic rules, grammar, and dictionaries to translate.
✔ Works well for structured texts (legal, medical).
❌ Limitation – Requires extensive rule databases and lacks flexibility.
2⃣ Statistical Machine Translation (SMT)
✔ Uses probability models trained on bilingual texts.
✔ Example: Google Translate (before AI models).
❌ Limitation – Fails for complex grammar structures.
3⃣ Neural Machine Translation (NMT)
✔ Uses deep learning and neural networks for translation.
✔ Captures context, grammar, and meaning effectively.
✔ Example: Google Translate (modern version), OpenAI’s GPT.
3. How Machine Translation Works?
👉 Step-by-Step Process (NMT)
1⃣ Text Preprocessing – Tokenization, normalization, and sentence segmentation.
2⃣ Encoding – Converts words into vector representations (word embeddings).
3⃣ Translation Model – Uses Transformer-based models like BERT, GPT.
4⃣ Decoding – Generates translated text in the target language.
5⃣ Post-processing – Adjusts grammar and structure for natural output.
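A minimal sketch of NMT with a small open model via the Hugging Face pipeline (t5-small is used purely for illustration; production systems rely on much larger translation models):

from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("Machine translation converts text from one language to another.")
print(result[0]["translation_text"])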
4. Challenges in Machine Translation
❌ Idioms & Phrases – "Break a leg" may translate literally instead of meaning "Good luck".
❌ Context Understanding – "Bank" (financial institution) vs. "Bank" (riverbank).
❌ Low-Resource Languages – Less training data for regional languages like Bhojpuri, Konkani.
❌ Grammar & Syntax Errors – Sentence structure variations between languages.
5. Applications of Machine Translation
✔ Global Communication – Translates emails, messages, and social media posts.
✔ E-commerce & Business – Localizes websites and customer support.
✔ Healthcare & Law – Helps in multilingual documentation.
✔ Education & Research – Translates academic papers and books.
6. Popular Machine Translation Systems
System                 Company     Key Feature
Google Translate       Google      AI-powered translation across 100+ languages.
DeepL Translator       DeepL       High-quality translations with deep learning.
Microsoft Translator   Microsoft   Cloud-based translation for businesses.
Amazon Translate       Amazon      Neural translation for applications.
7. Future of Machine Translation
✔ AI-powered models will enhance accuracy and fluency.
✔ Zero-shot translation – AI can translate between languages it has never seen before.
✔ Multilingual models – One model can handle multiple languages at once.
✔ Speech-to-Speech Translation – Real-time translation of spoken language.
Question Answering (Q&A)
✅ 1. Introduction
📌 Question Answering (Q&A) is an NLP task where a system automatically provides answers to
user queries.
📌 It is used in chatbots, virtual assistants, search engines, and customer support systems.
👉 Example Applications:
• Google Search – Provides direct answers to questions.
• Chatbots – AI-powered assistants like ChatGPT, Alexa, and Siri.
• Customer Support – Automated responses to FAQs.
2. Types of Question Answering Systems
1⃣ Open-Domain Q&A
✔ Answers questions using a large dataset or the internet.
✔ Example: Google Search answering “Who is the President of India?”.
✔ Uses retrieval-based techniques (search engine-based).
2⃣ Closed-Domain Q&A
✔ Answers questions from a specific dataset or domain (e.g., medical, legal).
✔ Example: A Q&A system for medical diagnosis.
3⃣ Extractive Q&A
✔ Extracts exact phrases from a document to answer a question.
✔ Example: “What is the capital of France?” → Extracts "Paris" from a paragraph.
4⃣ Generative Q&A
✔ Generates answers using AI models like GPT and BERT.
✔ Example: “Explain Quantum Physics” → AI generates a detailed answer.
3. How Question Answering Works?
👉 Step-by-Step Process:
1⃣ Question Processing – Identifies the type of question (Who, What, Where, When, Why).
2⃣ Document Retrieval – Finds relevant documents using a search engine or database.
3⃣ Answer Extraction – AI extracts or generates the best answer.
4⃣ Answer Ranking – Ranks multiple possible answers based on relevance.
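A minimal sketch of extractive Q&A with a pre-trained Hugging Face model (the model name is one commonly used SQuAD-trained checkpoint, chosen here for illustration):

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="What is the capital of France?",
            context="Paris is the capital and largest city of France.")

print(result["answer"], "| confidence:", round(result["score"], 2))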
4. Technologies Used in Q&A
✔ BERT (Bidirectional Encoder Representations from Transformers) – Extracts precise
answers.
✔ GPT (Generative Pre-trained Transformer) – Generates human-like responses.
✔ TF-IDF & BM25 – Used in retrieval-based Q&A systems.
✔ Knowledge Graphs – Stores facts and relationships for structured Q&A.
5. Challenges in Question Answering
❌ Understanding Context – AI struggles with ambiguous questions.
❌ Multi-turn Conversations – Maintaining context in long conversations is difficult.
❌ Misinformation – AI might generate or extract incorrect answers.
❌ Low-Resource Languages – Limited training data for regional languages.
6. Applications of Question Answering
✔ Search Engines – Google, Bing, and DuckDuckGo provide instant answers.
✔ Virtual Assistants – Siri, Google Assistant, and Alexa answer user queries.
✔ Healthcare & Legal – AI-powered Q&A for doctors and lawyers.
✔ E-commerce – Automated Q&A for customer queries.
7. Popular Q&A Systems
System              Company       Key Feature
Google Search Q&A   Google        Extracts answers from indexed websites.
IBM Watson          IBM           AI-powered Q&A for businesses.
ChatGPT             OpenAI        Generates human-like responses.
DrQA                Facebook AI   Extractive Q&A from Wikipedia.
8. Future of Question Answering
✔ Multimodal Q&A – AI can answer text, image, and video-based questions.
✔ Personalized Q&A – AI adapts answers based on user preferences.
✔ Voice-Based Q&A – Advanced speech recognition for voice queries.
✔ Conversational Q&A – AI maintains context across multiple questions.
Summarization
✅ 1. Introduction
📌 Summarization is an NLP technique that condenses large texts into shorter, meaningful
summaries while preserving key information.
📌 Used in news articles, legal documents, research papers, and AI-powered assistants.
👉 Example Applications:
• Google News AI – Generates short news summaries.
• ChatGPT & Bard – Summarizes long texts into concise explanations.
• Legal & Medical AI – Extracts key points from case laws and patient records.
2. Types of Summarization
1⃣ Extractive Summarization
✔ Selects important sentences from the original text.
✔ Uses ranking algorithms like TF-IDF, BM25, TextRank.
✔ Example: Highlighting key sentences from an article.
2⃣ Abstractive Summarization
✔ Generates a new summary in natural language, rather than copying sentences.
✔ Uses deep learning models like BERT, GPT, T5.
✔ Example: "The economy is slowing down" instead of “The GDP growth rate is decreasing.”
3. How Summarization Works?
👉 Step-by-Step Process:
1⃣ Text Preprocessing – Tokenization, stopword removal, and stemming.
2⃣ Feature Extraction – Identifies key phrases, named entities, and important words.
3⃣ Ranking Sentences (Extractive Method) – Scores sentences based on importance.
4⃣ Generating Summary (Abstractive Method) – Uses AI models to rewrite content.
5⃣ Post-processing – Removes redundancy and refines sentence structure.
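A minimal sketch of abstractive summarization with the Hugging Face pipeline (t5-small is a small open model used for illustration; the input text is a made-up passage):

from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
text = ("Artificial Intelligence is transforming industries by automating routine tasks, "
        "improving decision making, and enabling entirely new products and services.")

summary = summarizer(text, max_length=25, min_length=5)
print(summary[0]["summary_text"])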
4. Technologies Used in Summarization
✔ TF-IDF & TextRank – Extracts important sentences.
✔ BERTSUM & GPT-4 – Abstractive summarization.
✔ T5 (Text-To-Text Transfer Transformer) – Google's state-of-the-art summarization model.
5. Challenges in Summarization
❌ Understanding Context – AI might miss important nuances.
❌ Redundancy – Some summaries repeat key points unnecessarily.
❌ Fact Preservation – Abstractive models may generate misleading summaries.
❌ Handling Long Texts – Large documents require advanced compression techniques.
6. Applications of Summarization
✔ News & Journalism – AI-generated headlines and article briefs.
✔ Legal & Financial Reports – Condenses case laws and earnings reports.
✔ Healthcare & Research – Summarizes medical ndings and research papers.
✔ Content Creation – AI-powered summarization for blogs and social media.
7. Popular Summarization Tools
Tool                         Company       Key Feature
SummarizeBot                 AI-powered    Summarizes documents, PDFs, and news.
Hugging Face Transformers    Open-source   Provides AI models for summarization.
Google T5                    Google        Generates high-quality abstractive summaries.
GPT-based Summarization      OpenAI        Generates human-like summaries.
8. Future of Summarization
✔ Real-time Summarization – AI can summarize live content (meetings, lectures).
✔ Multimodal Summarization – AI will summarize texts, videos, and audio.
✔ Personalized Summarization – AI will generate summaries tailored to user preferences.
✔ Fact-Checked Summarization – AI will verify facts while summarizing.
Text Categorization
✅ 1. Introduction
📌 Text Categorization (also known as Text Classification) is the process of assigning predefined
labels to text documents based on their content.
📌 It is widely used in spam detection, sentiment analysis, topic classification, and document
organization.
👉 Example Applications:
• Spam Filtering – Classifies emails as spam or non-spam.
• News Categorization – Tags articles as politics, sports, technology, etc.
• Sentiment Analysis – Categorizes reviews as positive, negative, or neutral.
2. Types of Text Categorization
1⃣ Rule-Based Classification
✔ Uses manually defined rules (e.g., keywords, regular expressions).
✔ Example: If an email contains "win a prize," classify it as spam.
✔ ❌ Limitations: Requires frequent updates and struggles with complex patterns.
2⃣ Machine Learning-Based Classification
✔ Uses statistical models to learn from labeled text data.
✔ Algorithms: Naïve Bayes, SVM, Decision Trees, Random Forest, Neural Networks.
✔ Example: Classifying tweets as hate speech, offensive, or normal.
3⃣ Deep Learning-Based Classification
✔ Uses Neural Networks (CNNs, RNNs, LSTMs, Transformers) for better accuracy.
✔ Example: BERT and GPT models categorize complex text more effectively.
3. How Text Categorization Works?
👉 Step-by-Step Process:
1⃣ Text Preprocessing – Tokenization, stopword removal, stemming, lemmatization.
2⃣ Feature Extraction – Converts text into numerical vectors using TF-IDF, Word2Vec, BERT
embeddings.
3⃣ Model Training – A classifier learns patterns from labeled data.
4⃣ Prediction & Classification – The model assigns labels to new text data.
5⃣ Evaluation – Uses accuracy, precision, recall, and F1-score to measure performance.
4. Algorithms Used in Text Categorization
✔ Naïve Bayes Classifier – Works well for spam detection.
✔ Support Vector Machines (SVM) – Effective for short text classification.
✔ Random Forest – Performs well with structured text data.
✔ LSTMs & Transformers (BERT, GPT-4) – Advanced classification models.
5. Challenges in Text Categorization
❌ Ambiguity in Language – Words can have multiple meanings.
❌ Handling Imbalanced Data – Some categories have more data than others.
❌ Multiclass & Multi-Label Classification – Some texts belong to multiple categories.
❌ Domain-Specific Language – Difficult to classify text with industry-specific jargon.
6. Applications of Text Categorization
✔ Email Spam Filtering – Gmail, Outlook classify emails as spam or important.
✔ Sentiment Analysis – Analyzes movie reviews, product feedback.
✔ Customer Support Automation – Routes queries to the right department.
✔ Fake News Detection – Identifies misleading news articles.
7. Popular Text Categorization Tools
Tool                   Company             Key Feature
NLTK & Scikit-Learn    Open-source         Provides Naïve Bayes & SVM models.
TensorFlow & PyTorch   Google & Facebook   Supports deep learning models for classification.
FastText               Facebook AI         Efficient text classification model.
Google AutoML NLP      Google              Auto-trains text categorization models.
8. Future of Text Categorization
✔ Self-Learning AI – AI will continuously improve by learning from new text.
✔ Multilingual Text Classification – AI will classify texts across multiple languages.
✔ Real-Time Categorization – Faster and more efficient classification for chatbots, news, and social media.
✔ Explainable AI in Classification – AI models will provide reasons for their classifications.
Context Identification
✅ 1. Introduction
📌 Context Identification is the process of understanding the meaning of text based on its
surrounding words, phrases, and overall structure.
📌 It helps in chatbots, sentiment analysis, language translation, and intent detection.
👉 Example Applications:
• Voice Assistants (Alexa, Siri, Google Assistant) – Detects user intent from voice
commands.
• Chatbots – Understands user queries to provide accurate responses.
• Sentiment Analysis – Determines whether a statement is positive, negative, or neutral
based on surrounding words.
2. Importance of Context in NLP
✔ Word Meaning Disambiguation – The same word can have different meanings.
✔ Sentiment Understanding – Words can change meaning based on context (e.g., "not bad" is
positive).
✔ Entity Recognition – Identifies if a word refers to a person, place, or organization.
Example:
• "Apple is a big company." 🍏 → Company
• "I ate an apple." 🍎 → Fruit
3. Techniques for Context Identification
1⃣ Lexical Analysis
✔ Identifies parts of speech (POS), synonyms, antonyms, and named entities.
✔ Example: "run" can be a verb (run fast) or a noun (a long run).
2⃣ Dependency Parsing
✔ Analyzes grammatical relationships between words.
✔ Example: "The man saw the dog with a telescope."
• Did the man have a telescope?
• Did the dog have a telescope?
✔ Dependency parsing resolves such ambiguities.
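A small sketch of dependency parsing with spaCy (assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The man saw the dog with a telescope.")

for token in doc:
    # dependency label and the head word each token attaches to
    print(f"{token.text:10s} {token.dep_:10s} head = {token.head.text}")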
3⃣ Word Embeddings (Word2Vec, GloVe, BERT)
✔ Converts words into vector representations to capture meaning and context.
✔ Example: BERT understands that "bank" in "I deposited money in the bank" means a nancial
institution, not a riverbank.
4⃣ Attention Mechanism in Transformers
✔ Used in BERT, GPT, T5 to focus on important words in a sentence.
✔ Example: In "I watched a movie yesterday. It was amazing!", the model understands that "It"
refers to "movie".
4. Challenges in Context Identification
❌ Word Ambiguity – The same word can have different meanings.
❌ Idioms & Sarcasm – Phrases like “Yeah, right!” can be hard for AI to understand.
❌ Domain-Specific Language – Context changes in technical, legal, or medical texts.
❌ Pronoun Resolution – AI must determine what "he," "she," or "it" refers to.
5. Applications of Context Identification
✔ Chatbots & Virtual Assistants – Understands user intent.
✔ Machine Translation – Improves accuracy in Google Translate, DeepL.
✔ Content Recommendation – Netflix, YouTube suggest relevant content based on context.
✔ Search Engines – Google uses context to refine search results.
✔ Fake News Detection – Identifies misleading or false information.
6. Tools for Context Identification
Tool              Company       Key Feature
spaCy             Open-source   Fast NLP pipeline for context analysis.
NLTK              Open-source   Provides dependency parsing and POS tagging.
BERT by Google    Google        Deep learning model for context understanding.
GPT-4 by OpenAI   OpenAI        Advanced language understanding.
7. Future of Context Identification
✔ Real-time Context Analysis – AI will understand conversations dynamically.
✔ Better Multimodal Context Understanding – AI will interpret text, images, and videos
together.
✔ Enhanced Emotion & Sentiment Recognition – AI will grasp deeper emotional context.
✔ Context-Aware AI for Customer Support – AI chatbots will provide human-like responses.
Dialog Systems
✅ 1. Introduction
📌 A Dialog System (or Conversational AI) is a system designed to interact with users in natural
language, either through text or speech.
📌 It is used in chatbots, virtual assistants, voice-based systems, and customer support bots.
👉 Example Applications:
• Alexa, Siri, Google Assistant – Understands and responds to voice commands.
• Customer Support Chatbots – Resolves queries automatically.
• Healthcare AI Assistants – Helps in patient diagnosis via conversation.
2. Types of Dialog Systems
1⃣ Rule-Based Dialog Systems
✔ Uses predefined IF-ELSE rules for responses.
✔ Example:
• User: "What is your name?"
• Bot: "I am a chatbot."
✔ ❌ Limitation: Cannot handle unexpected queries.
2⃣ Retrieval-Based Dialog Systems
✔ Selects predefined responses based on input matching.
✔ Uses TF-IDF, Word2Vec, or BERT for better understanding.
✔ Example: Customer service chatbots that answer FAQs.
3⃣ Generative Dialog Systems
✔ Uses Deep Learning (RNNs, LSTMs, Transformers) to generate responses.
✔ More natural and exible than retrieval-based systems.
✔ Example: ChatGPT, Google Bard, Microsoft Copilot.
3. Architecture of Dialog Systems
🛠 Components of a Dialog System:
1⃣ Speech/Text Input – Takes user input (voice or text).
2⃣ Natural Language Understanding (NLU) – Identi es intent, extracts entities.
3⃣ Dialog Manager – Maintains conversation history and context.
4⃣ Natural Language Generation (NLG) – Generates human-like responses.
5⃣ Speech/Text Output – Returns a response.
4. Implementing a Simple Dialog System in Python
🔹 Using a Rule-Based Approach
def chatbot_response(user_input):
    responses = {
        "hello": "Hi! How can I help you?",
        "how are you": "I'm just a bot, but I'm doing great!",
        "bye": "Goodbye! Have a nice day!",
    }
    return responses.get(user_input.lower(), "Sorry, I don't understand.")

# Example conversation
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    print("Bot:", chatbot_response(user_input))
🔹 Limitation: Cannot handle complex queries.
🔹 Using a Retrieval-Based Approach (NLTK & TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = ["Hi there!", "I'm a chatbot.", "I can help you.", "Goodbye!"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(responses)

def chatbot_response(user_input):
    user_vec = vectorizer.transform([user_input])
    similarity = cosine_similarity(user_vec, X)
    best_match_idx = similarity.argmax()
    return responses[best_match_idx]

# Example
print("Bot:", chatbot_response("Who are you?"))
🔹 Advantage: Finds the closest match based on TF-IDF similarity.
🔹 Using a Generative Model (Transformer-Based Chatbot - GPT-like Model)
from transformers import pipeline, Conversation

chatbot = pipeline("conversational", model="facebook/blenderbot-400M-distill")

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    # the conversational pipeline expects a Conversation object and
    # appends the model's reply to conversation.generated_responses
    conversation = chatbot(Conversation(user_input))
    print("Bot:", conversation.generated_responses[-1])
🔹 Advantage: Can generate human-like responses dynamically.
🔹 Requirement: Needs the Hugging Face Transformers library (pip install transformers).
5. Challenges in Dialog Systems
❌ Understanding Context – AI must maintain conversation history.
❌ Handling Ambiguity – Words/phrases may have multiple meanings.
❌ Emotional Understanding – Hard for AI to detect sarcasm/emotions.
❌ Multilingual Support – AI must understand different languages.
6. Applications of Dialog Systems
✔ Customer Service Chatbots – Automates query resolution.
✔ Healthcare Assistants – Provides health advice based on symptoms.
✔ AI Tutors – Assists in online learning.
✔ Smart Home Assistants – Controls home devices using voice commands.
✔ Voice-Activated Devices – Helps visually impaired users interact with technology.
7. Future of Dialog Systems
✔ More Human-Like Conversations – AI will improve in emotion detection.
✔ Better Context Retention – Advanced memory models for better context understanding.
✔ Multimodal Conversational AI – AI will combine voice, text, and gestures.
✔ Personalized AI Assistants – AI will learn user preferences over time.
Introduction to Famous Deep Learning-Based NLP
Models
✅ 1. Introduction
📌 Deep Learning has revolutionized Natural Language Processing (NLP) by enabling models to
understand and generate human-like text.
📌 Advanced NLP models like BERT, GPT-4, and T5 are widely used in chatbots, machine
translation, question answering, and summarization.
👉 Key Capabilities of Deep Learning NLP Models:
✔ Contextual understanding of language.
✔ Generating human-like text responses.
✔ Summarizing long documents.
✔ Answering complex questions.
2. Famous Deep Learning-Based NLP Models
🔹 BERT (Bidirectional Encoder Representations from Transformers)
📌 Developed by Google AI (2018).
📌 Uses bidirectional context learning (understands both left & right context).
📌 Great for Question Answering (QA), Text Classification, Sentiment Analysis, etc.
Example: Using BERT for Text Classification
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

text = "The product is amazing and I love it!"
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
output = model(**tokens)

print(output.logits)  # Raw prediction scores
🔹 Advantage: Understands context better than previous NLP models.
🔹 Limitation: Requires large computational power for training.
🔹 GPT-4 (Generative Pre-trained Transformer 4)
📌 Developed by OpenAI (2023).
📌 A generative model that can write essays, code, answer questions, and generate creative text.
📌 Uses self-attention to predict the next word in a sentence.
Example: Text Generation with a GPT-Style Model
from transformers import pipeline

# GPT-4 itself is available only through OpenAI's API (see the API example in
# a later topic); the open-weight "gpt2" checkpoint stands in here to
# demonstrate the same autoregressive text-generation pipeline.
gpt_pipeline = pipeline("text-generation", model="gpt2")
response = gpt_pipeline("Once upon a time in AI history,", max_length=50)
print(response[0]['generated_text'])
🔹 Advantage: Generates high-quality, human-like responses.
🔹 Limitation: Expensive and requires large datasets.
🔹 T5 (Text-to-Text Transfer Transformer)
📌 Developed by Google AI (2019).
📌 Converts every NLP task into a text generation problem (e.g., translation, summarization).
Example: Using T5 for Summarization
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "Artificial Intelligence is transforming the world with new innovations."
input_ids = tokenizer("summarize: " + text, return_tensors="pt").input_ids
output = model.generate(input_ids)

print(tokenizer.decode(output[0], skip_special_tokens=True))
🔹 Advantage: Works well for summarization, translation, and question answering.
🔹 Limitation: Needs fine-tuning for specific tasks.
3. Comparison of Famous NLP Models
| Model | Developed By | Strengths | Best For |
| --- | --- | --- | --- |
| BERT | Google AI | Contextual Understanding | Sentiment Analysis, Question Answering |
| GPT-4 | OpenAI | Text Generation | Chatbots, Story Writing, Q&A |
| T5 | Google AI | Text-to-Text Transfer | Summarization, Translation |
4. Applications of Deep Learning NLP Models
✔ Virtual Assistants – Alexa, Google Assistant, Siri.
✔ Chatbots – Customer support AI chatbots.
✔ Machine Translation – Google Translate, DeepL.
✔ Summarization Tools – Automatic summarization of news/articles.
✔ Code Generation – AI-powered programming assistants like GitHub Copilot.
5. Future of Deep Learning-Based NLP Models
✔ More Efficient Models – AI with lower compute requirements.
✔ Better Context Retention – AI understanding long conversations better.
✔ Multilingual AI – Support for multiple languages & dialects.
✔ Ethical AI – Bias-free and safe AI systems.
Deep Learning-Based NLP Models: BERT, GPT-4,
etc.
1. Introduction
Natural Language Processing (NLP) has advanced significantly with the introduction of
transformer-based models. These models, such as BERT, GPT-4, and T5, use deep learning
techniques to understand and generate human-like text.
✔ Transformer Architecture: Replaces traditional RNNs and CNNs, enabling parallel processing
and better contextual understanding.
✔ Applications: Chatbots, Question Answering, Sentiment Analysis, Summarization, Machine
Translation, etc.
2. BERT (Bidirectional Encoder Representations from
Transformers)
📌 Developed by: Google AI (2018)
📌 Architecture: Transformer-based bidirectional model (reads text from both left and right).
📌 Key Feature: Uses self-attention mechanism to capture word dependencies in context.
🔹 How BERT Works
BERT is pre-trained on large text corpora and fine-tuned for specific NLP tasks. It follows:
1. Pre-training Phase (Self-Supervised Learning)
◦ Masked Language Model (MLM): Random words in a sentence are masked, and BERT predicts them (see the fill-mask sketch after this list).
◦ Next Sentence Prediction (NSP): BERT determines whether two sentences follow
each other in a text.
2. Fine-Tuning Phase
◦ Pre-trained BERT is fine-tuned on specific tasks like sentiment analysis, Q&A, or translation.
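To see the Masked Language Model objective in practice, Hugging Face's fill-mask pipeline can be run with a pre-trained BERT checkpoint; this is a minimal sketch assuming the transformers library is installed:
from transformers import pipeline

# BERT's MLM pre-training objective: predict the token behind [MASK]
# using context from both the left and the right.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))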
🔹 Advantages of BERT
✔ Understands context bidirectionally (better meaning extraction).
✔ Handles polysemy (same word, different meanings) effectively.
✔ Improves many NLP tasks like Named Entity Recognition (NER) and Q&A.
🔹 Limitations of BERT
❌ High computational cost (requires GPUs/TPUs for training).
❌ Slow inference due to its deep architecture.
❌ Not designed for text generation, mainly for understanding.
3. GPT-4 (Generative Pre-trained Transformer 4)
📌 Developed by: OpenAI (2023)
📌 Architecture: Transformer-based unidirectional model (processes text from left to right).
📌 Key Feature: Autoregressive text generation – predicts the next word based on previous
words.
🔹 How GPT-4 Works
1. Pre-training Phase
◦ Trained on massive text datasets using unsupervised learning.
◦ Learns grammar, facts, reasoning, and common sense.
2. Fine-tuning Phase
◦ Adjusted with human feedback (RLHF - Reinforcement Learning with Human
Feedback) to improve responses.
3. Text Generation
◦ Uses probability distribution to generate human-like text for various applications.
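GPT-4's weights are not publicly available, so the sketch below uses the open GPT-2 model as a stand-in to show the underlying idea: at each step the model produces a probability distribution over the next token, from which the most likely continuations can be read off.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Artificial intelligence will", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(tokenizer.decode([token_id.item()]), round(prob.item(), 3))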
🔹 Advantages of GPT-4
✔ Best in class text generation (coherent, detailed responses).
✔ Handles multi-turn conversations well.
✔ Can process images along with text (multimodal capabilities).
✔ Improved factual accuracy compared to GPT-3.5.
🔹 Limitations of GPT-4
❌ Still prone to hallucinations (generating incorrect facts).
❌ Expensive to train and deploy (requires high computational power).
❌ Limited reasoning in complex problems.
🔹 Example Code: Using GPT-4 with OpenAI API
import openai

openai.api_key = "YOUR_API_KEY"

# Uses the classic (pre-1.0) openai SDK interface
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response['choices'][0]['message']['content'])
4. Comparison: BERT vs. GPT-4
| Feature | BERT (Google) | GPT-4 (OpenAI) |
| --- | --- | --- |
| Directionality | Bidirectional | Unidirectional |
| Main Purpose | Text Understanding | Text Generation |
| Pre-training Tasks | MLM, NSP | Autoregressive Learning |
| Fine-tuning | Task-Specific | RLHF-based tuning |
| Use Cases | Q&A, Sentiment Analysis | Chatbots, Content Generation |
5. Other Deep Learning NLP Models
🔹 T5 (Text-to-Text Transfer Transformer)
📌 Developed by: Google
📌 Key Feature: Treats all NLP tasks as text-to-text problems.
✔ Example: Summarization – "Summarize: The quick brown fox jumps over the lazy dog."
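The same text-to-text pattern covers other tasks simply by changing the prefix. A minimal sketch using t5-small, which was pre-trained with an English-to-German translation prefix:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Translation is expressed exactly like summarization: task prefix + input text
input_ids = tokenizer("translate English to German: The house is wonderful.",
                      return_tensors="pt").input_ids
output = model.generate(input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))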
6. Conclusion & Future Trends
✔ Hybrid Models – Combining BERT for understanding and GPT for generation.
✔ Multimodal AI – Models processing text, images, and videos (e.g., GPT-4 Vision).
✔ Efficient AI – Reducing size and improving inference speed for deployment.
🚀 BERT and GPT-4 are transforming NLP!
Indian Language Case Studies
✅ 1. Introduction
📌 India has 22 official languages and over 1,600 dialects, making NLP for Indian languages
highly complex.
📌 Many deep learning-based NLP models are developed specifically to understand, translate,
and process Indian languages.
📌 Challenges include low resource availability, complex grammar, and script diversity.
2. Challenges in Indian Language NLP
✔ Script Diversity – Hindi (Devanagari), Tamil (Brahmic), Urdu (Perso-Arabic), etc.
✔ Low-Resource Languages – Limited datasets for regional languages like Manipuri, Konkani.
✔ Code-Mixing – Hindi-English or Tamil-English mixed language ("Hinglish", "Tanglish").
✔ Phonetic Spelling Variations – Different spellings for the same word.
✔ Morphological Complexity – Words change form based on tense, gender, and plurality.
3. Indian NLP Initiatives and Models
🔹 AI4Bharat
📌 Developed Indian NLP tools like IndicBERT and Samanantar (largest parallel dataset for
translation).
📌 Focuses on machine translation, speech recognition, and sentiment analysis.
Example: Using IndicBERT for Text Processing
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

text = "भारत एक सुंदर देश है।"  # Hindi: "India is a beautiful country."
tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)
print(output.last_hidden_state.shape)
✔ Use Case: Sentiment analysis and NER for Indian languages.
🔹 Google’s MuRIL (Multilingual Representations for Indian Languages)
📌 A BERT-based model that understands Indian languages and code-mixed text.
📌 Pre-trained on 17 Indian languages and transliterated data.
✔ Use Case: Detecting offensive language in Hinglish tweets, improving search engines, etc.
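A minimal sketch of loading MuRIL for code-mixed text, assuming the publicly released checkpoint google/muril-base-cased on the Hugging Face Hub:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

text = "yeh movie bahut achhi thi"   # Hinglish: "this movie was very good"
tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)
print(output.last_hidden_state.shape)   # contextual embeddings per token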
🔹 Microsoft’s IndicTrans
📌 A Neural Machine Translation (NMT) system for translating Indian languages.
📌 Supports multiple scripts and maintains context better than Google Translate.
✔ Use Case: Translation services for government documents, news, and legal texts.
4. Applications of Indian NLP Models
✔ Automatic Speech Recognition (ASR) – Google Assistant, Alexa in Indian languages.
✔ Machine Translation – AI-powered tools translating Hindi-English, Tamil-Telugu.
✔ Chatbots & Virtual Assistants – Indian banking & e-commerce platforms using Hindi
chatbots.
✔ Sentiment Analysis – Understanding social media opinions in regional languages.
5. Future of Indian NLP
✔ Better Dataset Availability – More labeled Indian language datasets.
✔ Efficient Multilingual AI – NLP models supporting real-time translation & conversation.
✔ Improved Speech Recognition – AI understanding Indian accents & dialects.
✔ Expansion in Education & Healthcare – AI-powered language learning & medical
consultation.