UNIT-1
Review of Chomsky’s Hierarchy of Languages
According to Chomsky’s hierarchy, grammars are classified into four types based on their
generative power:
🔹 Type 0: Unrestricted Grammar
• Also known as Recursively Enumerable grammar.
• Recognized by a Turing Machine.
• It includes all formal grammars.
➤ Grammar Rule Format:
α → β
Where:
• α ∈ (V ∪ T)* V (V ∪ T)* (i.e., at least one variable must appear on the left side)
• β ∈ (V ∪ T)*
V: Variables
T: Terminals
✔ Example:
Sab → ba
A → S
Here, Variables = {S, A}, Terminals = {a, b}
🔹 Type 1: Context-Sensitive Grammar
• Generates Context-Sensitive Languages.
• Recognized by a Linear Bounded Automaton (LBA).
• Must satisfy Type 0 conditions.
➤ Grammar Rule Format:
α → β where |α| ≤ |β| and α ≠ ε
• The length of the RHS must be greater than or equal to the LHS.
• The left side cannot be empty.
✔ Example:
S → AB
AB → abc
B → b
🔹 Type 2: Context-Free Grammar
• Generates Context-Free Languages.
• Recognized by a Pushdown Automaton.
• Must satisfy Type 1 conditions.
➤ Grammar Rule Format:
A → γ
Where:
• Left-hand side is a single variable
• No restriction on the right-hand side (γ ∈ (V ∪ T)*)
✔ Example:
S → AB
A → a
B → b
🔹 Type 3: Regular Grammar
• Generates Regular Languages.
• Recognized by a Finite Automaton (DFA/NFA).
• Most restricted form of grammar.
➤ Grammar Rule Formats:
Right-regular:
V → T / TV
Left-regular:
V → T / VT
Where V is a variable, T is a terminal.
✔ Example (Strict Regular Grammar):
S → a
✔ Example (Extended Regular Grammar):
S → ab
An extended regular grammar allows a string of terminals (T*) together with at most one variable, in either right- or left-regular position.
🔁 Summary: Inclusion of Language Classes
Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursively Enumerable
Regular Expressions (RegEx)
A Regular Expression is a pattern used to match character combinations in strings. They're
powerful tools used in searching, matching, and replacing text.
Regular expressions are used in programming languages like JavaScript, Python, Java, C++, and
many more, as well as in tools like text editors, command-line utilities (grep, sed), and regex-
based form validations.
Regular Expressions (RegEx) are sequences of characters that define a search pattern. They are
used to match, locate, and manage text. They form the basis of pattern matching in many
programming languages and are fundamental in lexical analysis.
🔹 Applications of Regular Expressions:
• Lexical Analysis in compilers
• Pattern matching in text processing
• Input validation (emails, passwords, phone numbers)
• Search and Replace operations
• Text mining and NLP tasks
🔹 Regular Expressions in Formal Language Theory:
In theoretical computer science:
• Regular expressions describe Regular Languages
• Recognized by Finite Automata (FA)
• Part of Chomsky Hierarchy (Type 3)
🔹 Basic Symbols in RegEx:
Symbol   Meaning
a        Matches the character a
.        Matches any single character
^        Matches the beginning of a string
$        Matches the end of a string
*        0 or more occurrences
+        1 or more occurrences
?        0 or 1 occurrence (optional)
|        Alternation (matches either pattern)
()       Grouping
[]       Character class
{n}      Exactly n repetitions
{n,}     n or more repetitions
{n,m}    Between n and m repetitions
🔹 Shorthand Character Classes:
Symbol   Matches
\d       Any digit [0–9]
\D       Any non-digit
\w       Word character [a-zA-Z0-9_]
\W       Non-word character
\s       Whitespace
\S       Non-whitespace
🔹 Examples:
Pattern      Matches
a*           "" (empty string), a, aa, aaa
[a-z]        Any lowercase letter
[A-Za-z]+    One or more letters
^\d{3}$      Exactly three digits
(ab)+        ab, abab, ababab
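A few of these patterns tried out with Python's built-in re module (an illustrative sketch only):

import re

# ^\d{3}$ : exactly three digits (fullmatch anchors the whole string)
print(bool(re.fullmatch(r"\d{3}", "123")))      # True
print(bool(re.fullmatch(r"\d{3}", "12a")))      # False
# [A-Za-z]+ : one or more letters
print(re.findall(r"[A-Za-z]+", "abc 123 def"))  # ['abc', 'def']
# (ab)+ : one or more repetitions of "ab"
print(bool(re.fullmatch(r"(ab)+", "ababab")))   # True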
🔹 Limitations:
• Cannot handle nested or recursive patterns (e.g., balanced parentheses)
• Limited to regular languages only
🧠 Summary:
• Regular expressions define regular languages
• Recognized by Finite Automata
• Useful in both theoretical CS and practical programming
🔹 Finite Automata (FA)
Finite Automata are abstract machines used to model computation. They process strings over an
input alphabet and accept or reject them based on whether they follow specific rules.
➤ Deterministic Finite Automaton (DFA) — In Detail
✅ Characteristics:
• At any point in time, the machine is in exactly one state.
• For each symbol in the alphabet, there is exactly one transition from a state.
• No room for guessing or ambiguity.
🔍 DFA Formal Definition Recap:
DFA = (Q, Σ, δ, q₀, F)
Where:
• Q = Set of states (finite)
• Σ = Input alphabet (set of symbols)
• δ = Transition function: Q × Σ → Q
• q₀ = Initial (start) state (q₀ ∈ Q)
• F = Set of accepting (final) states (F ⊆ Q)
🧠 Example: DFA for strings ending in 01 over Σ = {0,1}
States:
• q0: Start
• q1: Seen 0
• q2: Seen 0 followed by 1 (Accept)
Transition Table:
State 0 1
q0 q1 q0
q1 q1 q2
q2 q1 q0
Final state = q2 ✅
So: "1001" → q0 → q0 → q1 → q2 (ACCEPT)
➤ Non-Deterministic Finite Automaton (NFA) — In Detail
✅ Characteristics:
• Can move to multiple states for a single input symbol.
• Can have ε-transitions (i.e., move without consuming input).
• Easier to design (especially for complex patterns), but non-deterministic.
• Cannot directly be implemented as-is — needs to be converted to DFA.
🔍 NFA Formal Definition:
NFA = (Q, Σ, δ, q₀, F)
Where:
• δ = Q × (Σ ∪ {ε}) → 2^Q (transition function gives a set of possible states)
🧠 Example: Accept strings containing substring "ab"
States:
• q0: Start
• q1: Seen 'a'
• q2: Seen "ab" (Accept)
Transitions:
• q0 → q0 on a or b (loop)
• q0 → q1 on a
• q1 → q2 on b
Any string that reaches q2 is accepted ✅
🔁 DFA vs NFA: Comparison
Feature                 DFA                            NFA
Transitions per input   One per state/input            Zero, one, or multiple
Epsilon transitions     ❌ Not allowed                 ✅ Allowed
Implementation          Easy                           Harder (needs conversion)
Design simplicity       Harder for complex languages   Easier for complex patterns
Power (language)        Same (regular languages)       Same (equivalent power)
📌 Key Point: Any NFA can be converted to an equivalent DFA using the subset construction
algorithm.
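A minimal Python sketch of the subset construction, applied to a version of the "contains ab" NFA above (self-loops on q2 are assumed here so the machine keeps accepting once "ab" has been seen; ε-transitions are omitted for brevity):

# NFA transitions: (state, symbol) -> set of possible next states
nfa = {
    ('q0', 'a'): {'q0', 'q1'}, ('q0', 'b'): {'q0'},
    ('q1', 'b'): {'q2'},
    ('q2', 'a'): {'q2'},       ('q2', 'b'): {'q2'},
}
alphabet = {'a', 'b'}
start = frozenset({'q0'})

dfa = {}                # (subset of NFA states, symbol) -> subset of NFA states
seen = {start}
worklist = [start]
while worklist:
    current = worklist.pop()
    for sym in alphabet:
        # the DFA target state is the union of all NFA moves from the current subset
        target = frozenset(s for q in current for s in nfa.get((q, sym), set()))
        dfa[(current, sym)] = target
        if target not in seen:
            seen.add(target)
            worklist.append(target)

accepting = [subset for subset in seen if 'q2' in subset]
print(len(seen), "DFA states,", len(accepting), "accepting")  # 4 DFA states, 2 accepting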
🔹 Beyond Finite Automata (Non-Finite Automata)
Some languages (especially non-regular) can't be recognized by FA. We need more powerful
machines.
➤ Pushdown Automaton (PDA)
🧠 Why PDAs?
Finite Automata can’t handle nested structures like a^n b^n (equal a's followed by equal b's).
But PDAs can — by using a stack!
✅ Characteristics:
• Like an NFA but with a stack (LIFO memory).
• Can push, pop, or peek symbols on the stack.
• Recognizes context-free languages (like palindromes, balanced parentheses).
📌 Real-world Example:
A compiler checking for matching brackets in code {[()()]} — PDA handles it.
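The stack idea can be sketched in a few lines of Python; this bracket checker illustrates the push/pop mechanism rather than a formal PDA definition:

def balanced(s):
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)                      # push an opening bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                      # the popped symbol must match
    return not stack                              # accept only if the stack is empty

print(balanced("{[()()]}"))  # True
print(balanced("{[(])}"))    # False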
➤ Turing Machine
The most powerful computational model in formal language theory.
✅ Characteristics:
• Has a tape (infinite memory) and head that can read/write/move.
• Can simulate any algorithm.
• Recognizes recursively enumerable languages (Type-0 in Chomsky hierarchy).
Turing Machine vs FA:
Feature          Finite Automata     Turing Machine
Memory           None                Infinite tape
Computation      Pattern matching    General computation
Language class   Regular             Recursively enumerable
🔍 Applications:
• Theoretical limits of computation
• Language design
• Problem solvability (halting problem, etc.)
🔹 Finite State Transducers (FST)
A Finite State Transducer (FST) is a computational model that extends the concept of a Finite
Automaton (FA) by producing output as it processes an input. FSTs are essential in areas such as
Natural Language Processing (NLP), speech processing, and morphological analysis.
✅ What is an FST?
An FST is essentially a Finite Automaton with the added functionality of generating output while
reading an input string.
Just like a Finite Automaton, an FST:
• Has states.
• Has transitions based on input symbols.
However, the key difference is that FSTs produce output on each transition, which means they can
map an input string to an output string.
🔍 Formal Definition of FST:
An FST is formally defined as a 7-tuple:
FST = (Q, Σ, Γ, δ, ω, q₀, F)
Where:
• Q: A finite set of states.
• Σ: The input alphabet (set of input symbols).
• Γ: The output alphabet (set of output symbols).
• δ: The transition function: δ: Q × Σ → Q. This defines the next state given the current state and an input symbol.
• ω: The output function: ω: Q × Σ → Γ*. This defines the output string produced during a transition.
• q₀: The start state (the state at the beginning of the computation).
• F: A set of final (accepting) states.
🔍 How Does an FST Work?
• The FST processes an input string from left to right.
• On each input symbol, the FST:
◦ Moves to a new state based on the current state and the input symbol (using the
transition function δ).
◦ Produces output based on the current state and input symbol (using the output
function ω).
• The process continues until all input symbols are consumed.
• If the FST reaches an accepting state, the output string is considered valid.
🧠 Example of FST:
Consider a simple FST for a morphological analysis task where it converts words to their root
form and adds grammatical features.
• Input: cats
• Output: cat + N + Plural
Here, the FST would:
• Read the string "cats".
• Identify the root word "cat" and the suffix "-s" as plural.
• Output cat + N + Plural to represent the root word along with its grammatical
features.
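As a minimal Python sketch of these mechanics (a deliberately simple Mealy-style transducer that outputs the one's complement of a binary string, rather than a full morphological analyzer), the transition function δ and output function ω can be plain dictionaries:

delta = {('q0', '0'): 'q0', ('q0', '1'): 'q0'}   # transition function δ
omega = {('q0', '0'): '1', ('q0', '1'): '0'}     # output function ω

def transduce(s, state='q0'):
    out = []
    for ch in s:
        out.append(omega[(state, ch)])   # emit output on this transition
        state = delta[(state, ch)]       # move to the next state
    return "".join(out)

print(transduce("1001"))  # "0110"

A real morphological FST such as the "cats" example works the same way, just with many more states and with outputs like "cat", "+N", "+Plural" attached to its transitions.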
🔁 Types of Transducers:
1. Moore Machine:
◦ In a Moore machine, the output depends only on the current state.
◦ The output is associated with states rather than transitions.
2. Mealy Machine:
◦ In a Mealy machine, the output depends on both the current state and the input
symbol.
◦ The output is associated with transitions rather than states.
✨ Applications of FST in NLP:
1. Morphological Analysis:
◦ FSTs are widely used in morphological analyzers to break down words into their
root forms and grammatical features (e.g., running → run + V +
Progressive).
2. Speech Recognition:
◦ FSTs can help map spoken language to text by producing a sequence of output
phonemes or words.
3. Transliteration:
◦ FSTs can be used to convert words from one script to another, such as converting
Romanized Hindi into Devanagari script.
4. Spelling Correction:
◦ In spelling correction, FSTs can map misspelled words to their corrected forms by
applying predefined rules.
🔍 Natural Language and Linguistics – In Detail
1⃣ Natural Language
Definition:
Natural language refers to the languages humans use for everyday communication. Unlike
programming languages, which are artificially created for machines, natural languages evolve
naturally within communities over time. These languages can be spoken, written, or signed.
Examples include English, Hindi, Tamil, Japanese, and many more.
Key Characteristics:
• Ambiguity: Words and phrases in natural languages often have multiple meanings
depending on the context.
◦ Example: The word "bank" could mean a financial institution or the side of a river.
• Context-dependence: The meaning of a sentence can change based on the situation or
background.
◦ Example: The sentence "He saw her duck" could mean he observed her pet duck or
that she lowered her head.
• Evolving Nature: Natural languages constantly evolve with new words, slang, and usages
being created. This dynamic nature allows them to adapt to new societal contexts.
2⃣ Linguistics
Definition:
Linguistics is the scientific study of language. It focuses on understanding how languages are
structured, how we comprehend and produce language, and how language evolves over time.
Linguistics serves as the foundation for Natural Language Processing (NLP), which is essential
for making computers understand and interact with human languages.
Branches of Linguistics:
1. Phonetics:
◦ Focuses on the physical sounds of speech, how they are produced, transmitted, and
perceived.
◦ Example: The difference between the "p" in "spin" (unaspirated) and "pin"
(aspirated).
2. Phonology:
◦ Studies how sounds function within particular languages and the rules for sound
combinations.
◦ Example: The combination "ng" is never at the start of a word in English (e.g., it
appears in "ring" but not "ngot").
3. Morphology:
◦ Deals with the structure of words and how words are formed from smaller units of
meaning, called morphemes.
◦ Example: The word "unhappiness" is made up of three morphemes: "un-" (pre x
meaning negation), "happy" (root), and "-ness" (suf x meaning noun).
4. Syntax:
◦ Concerned with the structure of sentences and how words combine to form
grammatically correct sentences.
◦ Example: "She is eating an apple" vs. "Is apple she an eating?" (incorrect word
order).
5. Semantics:
◦ Studies the meaning of words, phrases, and sentences.
◦ Example: "Cats chase mice" vs. "Mice chase cats" — same words but different
meanings.
6. Pragmatics:
◦ Examines how context influences the meaning of language in real-world scenarios.
◦ Example: "Can you pass the salt?" is a question, but pragmatically, it’s a request for
action.
7. Discourse Analysis:
◦ Focuses on how larger stretches of language (e.g., conversations, paragraphs) are
structured and how they maintain coherence.
◦ Example: Analyzing how a story is structured or how speakers change topics during
a conversation.
Importance in NLP:
Understanding the branches of linguistics is crucial for Natural Language Processing (NLP), a
field focused on enabling computers to process and understand human language. Each branch of
linguistics corresponds to different NLP tasks, such as:
• Phonetics and Phonology: Speech recognition and synthesis.
• Morphology: Word segmentation and stemming.
• Syntax and Semantics: Parsing, part-of-speech tagging, and semantic analysis.
• Pragmatics and Discourse: Understanding context, dialogue systems, and conversational
agents.
🔹 Syntax and Structure
🔸 What is Syntax?
Syntax is the set of rules that govern the structure of sentences in a language. It defines how
words combine to form meaningful phrases and sentences.
🔹 In simple terms:
Syntax = Grammar rules + Sentence patterns
🔸 Why is Syntax Important?
In Natural Language Processing (NLP), understanding syntax helps computers:
• Parse a sentence (break it into parts)
• Identify grammatical roles (subject, verb, object, etc.)
• Disambiguate meanings (e.g., “Flying planes can be dangerous”)
• Perform tasks like machine translation, chatbots, text summarization, etc.
🔸 Syntax Structures
1. Phrase Structure (Constituency Grammar)
This structure breaks a sentence into nested sub-phrases or constituents.
🔹 Example Sentence:
"The cat sat on the mat."
We can break this as:
[S
[NP The cat]
[VP sat
[PP on
[NP the mat]
]
]
]
Legend:
• S: Sentence
• NP: Noun Phrase
• VP: Verb Phrase
• PP: Prepositional Phrase
This hierarchy forms a syntax tree, or parse tree.
2. Dependency Grammar
Instead of using nested phrases, dependency grammar focuses on word-to-word relationships.
🔹 In the same sentence:
• "sat" is the root verb.
• "cat" is the subject of "sat".
• "on" is a preposition dependent on "sat".
• "mat" is the object of "on".
This creates a dependency tree where each word is connected directly to another word it depends
on.
🔸 Grammar Rules and Syntax in NLP
➤ CFG (Context-Free Grammar)
Used to define syntactic rules in formal language theory.
Rules have the form: A → α
• A is a non-terminal (e.g., S, NP, VP)
• α is a sequence of terminals and/or non-terminals
🔹 Example Rules:
S → NP VP
NP → Det N
VP → V NP
Det → the | a
N → cat | mat
V → sat | saw
This set of rules helps generate or parse valid sentences.
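A small sketch using NLTK to parse with exactly these rules (the sentence "the cat saw a mat" is used here because the rule set has no PP rule, so "sat on the mat" cannot be derived from it):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'mat'
V -> 'sat' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'cat', 'saw', 'a', 'mat']):
    print(tree)
# (S (NP (Det the) (N cat)) (VP (V saw) (NP (Det a) (N mat))))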
🔸 Syntax Trees (Parse Trees)
A syntax tree is a visual representation of the syntactic structure of a sentence according to a
grammar.
🔹 Example:
For sentence: “The dog barked.”
Tree:
        S
       / \
     NP   VP
    /  \    \
  Det   N    V
   |    |    |
  The  dog  barked
🔸 Applications of Syntax in NLP
Application           Role of Syntax
Machine Translation   Maintains sentence structure in the other language
Grammar Checking      Identifies syntactic errors
Question Answering    Helps identify subject, object, etc.
Text Summarization    Understands clause hierarchy and importance
3⃣ Syntax and Structure
Definition:
Syntax is the branch of linguistics that deals with the structure of sentences. It focuses on the rules
and principles that govern the way words combine to form grammatically correct sentences in a
given language. Syntax examines the relationships between different elements of a sentence, such
as subject, verb, object, and how these components follow specific word order patterns.
Key Aspects of Syntax:
1. Word Order: Syntax dictates the order in which words should appear in a sentence to
maintain grammaticality.
◦ Example (English): “She eats an apple” vs. “Eats she an apple” (incorrect word
order).
2. Syntactic Categories: Words in a language can be classified into categories based on their
function in the sentence. These categories include:
◦ Nouns (person, place, thing)
◦ Verbs (actions or states)
◦ Adjectives (describe nouns)
◦ Adverbs (modify verbs, adjectives, or other adverbs)
◦ Prepositions (show relationships between words, e.g., in, on, at)
3. Sentence Types:
◦ Declarative: Statements (e.g., “I am a student”)
◦ Interrogative: Questions (e.g., “Are you a student?”)
◦ Imperative: Commands (e.g., “Please sit down.”)
◦ Exclamatory: Expressing strong feelings (e.g., “What a beautiful day!”)
4. Phrases: A phrase is a group of words that work together to convey a single idea. Phrases
can be categorized as:
◦ Noun Phrase (NP): Consists of a noun and its modifiers. (e.g., “the big dog”)
◦ Verb Phrase (VP): Contains the main verb and its auxiliaries. (e.g., “has been
running”)
◦ Prepositional Phrase (PP): Begins with a preposition and includes its object. (e.g.,
“in the park”)
5. Syntactic Trees: Syntax often uses tree structures (also called parse trees) to represent
sentence structure, showing how words and phrases are hierarchically arranged.
Importance in NLP:
In Natural Language Processing (NLP), syntax is essential for tasks such as:
• Sentence Parsing: Identifying the syntactic structure of a sentence.
• Part-of-Speech Tagging: Assigning the correct grammatical category to each word (e.g.,
verb, noun).
• Machine Translation: Translating sentences from one language to another while
maintaining grammaticality.
4⃣ Representation of Meaning
Definition:
The representation of meaning in language refers to how the meaning of words, phrases, and
sentences is captured, understood, and processed. Understanding meaning is essential for any
language model or system that aims to interact with humans in a natural way.
Types of Meaning:
1. Lexical Meaning: The meaning of individual words. This can be determined from a
dictionary definition.
◦ Example: The word "dog" refers to a domesticated carnivorous mammal.
2. Compositional Meaning: The meaning of larger linguistic units (like phrases or sentences)
based on the meanings of their parts.
◦ Example: “Black cat” means a cat that is black in color, where “black” modifies the
noun “cat.”
3. Contextual Meaning: The meaning that arises from the context in which a word or sentence
is used.
◦ Example: “He’s a hotshot.” In context, it means someone who is very skilled or
successful, but “hotshot” in a different context could refer to a small, fast-moving
object.
4. Ambiguity in Meaning:
◦ Lexical Ambiguity: When a word has multiple meanings.
▪ Example: "Lead" can refer to a metal or to guide.
◦ Syntactic Ambiguity: When a sentence has more than one possible syntactic
interpretation.
▪ Example: “I saw the man with the telescope.” This can mean either:
1. The man had a telescope.
2. I used a telescope to see the man.
5. Semantic Roles: Each part of a sentence typically plays a specific role in conveying
meaning, such as:
◦ Agent: The doer of an action (e.g., in “John kicked the ball,” John is the agent).
◦ Theme: The entity that is affected by the action (e.g., in “John kicked the ball,” the
ball is the theme).
◦ Goal: The recipient of an action or the destination (e.g., in “She gave him the book,”
him is the goal).
Importance in NLP:
In Natural Language Processing (NLP), representing meaning is crucial for tasks like:
• Word Sense Disambiguation: Determining the correct meaning of a word based on
context.
• Machine Translation: Ensuring accurate translation of meaning from one language to
another.
• Information Retrieval: Matching queries with relevant documents by understanding the
meaning behind words and sentences.
Lexical and Semantic Models
🔸 A. Lexical Semantics
Lexical Semantics is the study of how words convey meaning, and how they relate to one another
in a language.
✅ 1. Lexeme
• A lexeme is the abstract unit of meaning underlying different word forms.
• Examples:
◦ Lexeme: run
◦ Word forms: run, runs, ran, running
• Lexeme ≠ word — a lexeme groups all inflected forms.
✅ 2. Word Sense
• Many words are polysemous — they have multiple meanings or senses.
• Word Sense Disambiguation (WSD) is the task of determining which sense of a word is
used in a given context.
• Example:
◦ “He sat by the bank” → River side
◦ “He went to the bank to get cash” → Financial institution
✅ 3. Word Relationships (Lexical Relations):
Type        Description                                          Example
Synonymy    Words with similar meanings                          happy ↔ joyful
Antonymy    Words with opposite meanings                         hot ↔ cold
Hyponymy    A word whose meaning is included in another (IS-A)   Car is a hyponym of Vehicle
Hypernymy   The more general category                            Animal is a hypernym of Dog
Meronymy    Part-whole relationship                              Wheel is a meronym of Car
Homonymy    Same spelling/sound but unrelated meaning            Bat (animal) and Bat (cricket)
Polysemy    One word, multiple related meanings                  Paper (material, essay, newspaper)
✅ 4. Thesauri and Lexical Databases
• WordNet: A lexical database grouping words into sets of synonyms (synsets) with semantic
relationships.
◦ You can find synonyms, antonyms, hyponyms, and hypernyms using WordNet.
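A short sketch of querying WordNet through NLTK (assuming the wordnet data has been downloaded):

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Synsets (groups of synonyms) for "car", with their dictionary-style glosses
for syn in wn.synsets('car')[:2]:
    print(syn.name(), "-", syn.definition())

# Hypernyms (more general terms) of the first sense of "car"
print(wn.synsets('car')[0].hypernyms())   # e.g. [Synset('motor_vehicle.n.01')]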
🔸 B. Semantic Models
Semantic models are computational methods used to represent meaning in text and words.
✅ 1. Bag of Words (BoW)
• Idea: Represent text as an unordered collection of words.
• Each word gets a frequency count; grammar and order are ignored.
• Example:
◦ Sentence 1: “The cat sat on the mat.”
◦ Sentence 2: “Mat the on sat cat the.”
◦ BoW sees both as identical.
Pros:
• Simple and easy to implement.
Cons:
• Ignores grammar, context, and word order.
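A minimal sketch with scikit-learn's CountVectorizer (assuming scikit-learn is installed) showing that the two sentences above receive identical vectors:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "Mat the on sat cat the."]
bow = CountVectorizer()
vectors = bow.fit_transform(docs).toarray()

print(bow.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(vectors)                      # both rows are identical: word order is ignored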
✅ 2. TF-IDF (Term Frequency - Inverse Document Frequency)
• Goal: Highlight important words in a document, downweight common words.
• TF: How often a word appears in a document.
• IDF: How rare the word is across all documents.
Use Case: Improves relevance in search engines.
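A minimal TfidfVectorizer sketch (scikit-learn assumed; it applies smoothing and length normalization, so the values differ slightly from the textbook formula):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(weights.toarray().round(2))
# Words unique to one document ("sat", "dog") get a higher IDF than
# words shared by both documents ("cat").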
✅ 3. Word Embeddings
• Words are represented as vectors in a high-dimensional space.
• Semantically similar words have closer vector positions.
• Models:
◦ Word2Vec (Google)
◦ GloVe (Stanford)
◦ FastText (Facebook)
• Example:
◦ vec("king") - vec("man") + vec("woman") ≈ vec("queen")
Advantage: Captures relationships like analogies, semantic proximity, contextual use.
✅ 4. Contextual Embeddings
• Advanced models that generate different word vectors depending on the context.
• Useful for solving word sense disambiguation.
• Models:
◦ BERT (Bidirectional Encoder Representations from Transformers)
◦ GPT (Generative Pretrained Transformer)
◦ ELMo (Embeddings from Language Models)
Example:
• “She went to the bank to deposit money.”
• “The bank was flooded after the storm.”
➡ BERT assigns different vectors for “bank” in each sentence.
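A hedged sketch of inspecting this with the Hugging Face transformers library (it downloads the pretrained bert-base-uncased model; the helper function below is illustrative, not a fixed recipe):

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("She went to the bank to deposit money.")
v2 = bank_vector("The bank was flooded after the storm.")
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1.0: different vectors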
✅ 5. Semantic Parsing
• Converts natural language into a formal representation of meaning.
• Often results in logical expressions, graphs, or other structures.
• Used in chatbots, question answering, machine translation.
📌 Real-life Applications
Area                  Use Case
Search Engines        Ranking pages by TF-IDF or embeddings
Chatbots              Understanding user queries with embeddings
Machine Translation   Contextual embedding for accurate translation
Sentiment Analysis    Capturing emotion behind text
Question Answering    Mapping question to semantic form
📚 Text Corpora
🔸 A. What is a Corpus?
• A corpus (plural: corpora) is a large, structured collection of texts used for linguistic
analysis and training NLP models.
• It may include written texts, spoken language transcriptions, dialogues, or social media
posts.
📌 Think of it as the “data backbone” for most NLP applications.
🔸 B. Types of Corpora
• Monolingual Corpus: Texts in a single language. Example: Brown Corpus (English)
• Multilingual Corpus: Texts in multiple languages without translation alignment. Example: Leipzig Corpus
• Parallel Corpus: Texts in two or more languages with sentence-by-sentence translation. Example: Europarl Corpus
• Comparable Corpus: Same topic in different languages, but not sentence-aligned. Example: News articles from various countries
• Annotated Corpus: Corpus enriched with metadata or linguistic tags. Example: Penn Treebank (POS tags, syntactic structure)
• Spoken Corpus: Transcriptions of spoken language. Example: Switchboard, Spoken BNC
• Social Media Corpus: Tweets, forums, comments, etc. Example: Twitter Sentiment Corpus
🔸 C. Annotations in Corpora
Annotations enhance raw text by adding linguistic information, such as:
• POS Tags (Part-of-Speech): Identify word classes. Example: “The/DT cat/NN sat/VBD”
• Syntactic Trees: Show sentence structure. Example: (NP (DT The) (NN cat))
• Named Entity Recognition (NER): Identify entities like names, places, dates. Example: “Google/ORG launched in 1998/DATE”
• Semantic Roles: Label who did what to whom. Example: Agent: John, Action: bought, Theme: book
🔸 D. Uses of Text Corpora in NLP
1. Training Language Models (e.g., GPT, BERT)
2. Grammar and Syntax Learning
3. Statistical Analysis of Word Usage
4. Machine Translation Systems
5. Sentiment and Emotion Analysis
6. Speech Recognition and Generation
7. Chatbots and Dialogue Systems
8. Lexicon and Thesaurus Construction
🔸 E. Popular Corpora in NLP
• Brown Corpus: First million-word electronic corpus of American English
• Penn Treebank: Annotated with POS tags and syntactic structure
• WordNet: Lexical database that can also be used as a corpus
• COCA (Corpus of Contemporary American English): Modern American English usage
• Europarl Corpus: European Parliament proceedings (parallel corpus)
• Wikipedia Dumps: Used in many modern NLP tasks
• Twitter Sentiment Corpus: Useful for sentiment analysis
🔸 F. Building Your Own Corpus
Steps:
1. Collect raw text (web scraping, APIs, documents, etc.)
2. Preprocess: Clean (remove HTML, symbols), tokenize, normalize.
3. Annotate: Add linguistic features manually or with NLP tools.
4. Store and Index: Use formats like JSON, XML, or plain text.
5. Analyze/Use in ML models, rule-based systems, etc.
🔸 G. Challenges with Text Corpora
• Bias: Reflects cultural, gender, or societal biases.
• Domain-specificity: A corpus may not generalize well.
• Licensing: Many corpora are not free for commercial use.
• Annotation errors: Human annotation can introduce inconsistencies.
🧠 Summary
• A corpus is essential for training, evaluating, and improving NLP systems.
• It can be general-purpose or domain-specific, raw or annotated.
• High-quality corpora lead to better-performing language models.
UNIT -2
Natural Language Processing (NLP): Text
Wrangling and Pre-processing
Introduction
In Natural Language Processing (NLP), text wrangling and pre-processing are essential steps to
prepare raw text data for analysis, model training, and machine learning applications. These steps
help convert unstructured text into a structured format, making it easier for algorithms to extract
meaningful patterns.
1. Text Wrangling
Text wrangling (also known as text cleaning) involves handling raw text data to make it suitable for
further processing. It includes:
Key Steps in Text Wrangling
1. Removing Unwanted Characters
◦ Remove special characters (@, #, $, %, &, *, etc.)
◦ Remove punctuations (.,?!:;() etc.)
◦ Remove numerical values if they are not useful
2. Handling Case Sensitivity
◦ Convert all text to lowercase to avoid duplication issues.
Example: "NLP is Amazing" → "nlp is amazing"
3. Removing Extra Spaces & Whitespace Characters
◦ Extra spaces, tabs, and newlines (\t, \n) can be removed for consistency.
4. Handling Encoding Issues
◦ Convert text to UTF-8 format to avoid encoding mismatches.
2. Text Pre-processing
Pre-processing is the next step after text wrangling, which prepares text for analysis or machine
learning models.
Key Steps in Text Pre-processing
1. Tokenization
◦ Splitting text into smaller units called tokens (words, sentences, or subwords).
◦ Example:
Text: "I love NLP!"
◦ Tokenized: ['I', 'love', 'NLP', '!']
2. Stopword Removal
◦ Stopwords are common words (e.g., "the", "is", "in", "and") that do not add much
meaning to the text.
◦ Example:
Before: "This is a great NLP tutorial."
◦ After: "great NLP tutorial"
3. Stemming & Lemmatization
◦ Stemming: Reduces words to their base/root form using simple heuristics.
Example: "running" → "run", "easily" → "easili"
◦ Lemmatization: Converts words to their base dictionary form using linguistic
knowledge.
Example: "running" → "run", "better" → "good"
4. Part-of-Speech (POS) Tagging
◦ Assigns grammatical labels (noun, verb, adjective, etc.) to each word.
◦ Example: "The dog barks" → [('The', 'DT'), ('dog',
'NN'), ('barks', 'VBZ')]
5. Named Entity Recognition (NER)
◦ Identifies proper nouns such as names, organizations, locations, etc.
◦ Example: "Google was founded by Larry Page" → ['Google'
(ORG), 'Larry Page' (PERSON)]
6. Text Normalization
◦ Converts text into a standard format:
▪ Expand contractions: "I'm" → "I am"
▪ Correct spelling: "recieve" → "receive"
▪ Normalize slang: "u" → "you"
7. Vectorization (Feature Extraction)
◦ Converts text into numerical form for machine learning models.
◦ Techniques:
▪ Bag of Words (BoW)
▪ TF-IDF (Term Frequency-Inverse Document Frequency)
▪ Word Embeddings (Word2Vec, GloVe, BERT, etc.)
Key Points for Exams
• Text wrangling cleans raw text by removing noise, unwanted characters, and formatting
issues.
• Pre-processing prepares text for analysis by tokenization, stopword removal, stemming,
lemmatization, and feature extraction.
• Tokenization splits text into meaningful units (words, sentences).
• Stopwords are removed to reduce computational complexity.
• Stemming vs. Lemmatization: Stemming is faster but less accurate; lemmatization is more
precise.
• NER and POS tagging help in understanding the grammatical structure and named entities
in text.
• Vectorization techniques (BoW, TF-IDF, Word2Vec, etc.) convert text into numerical data
for models.
Tokenization in NLP
What is Tokenization?
Tokenization is the process of breaking down a text into smaller units (tokens), such as words,
sentences, or subwords. It is the first step in text pre-processing for NLP tasks like text analysis,
machine learning, and deep learning.
Types of Tokenization
1. Word Tokenization
◦ Splits text into individual words.
◦ Example:
Input: "Natural Language Processing is amazing!"
◦ Output: ['Natural', 'Language', 'Processing', 'is',
'amazing', '!']
2. Sentence Tokenization
◦ Splits text into sentences based on punctuation (e.g., ., ?, !).
◦ Example:
Input: "NLP is fun. I love learning it!"
◦ Output: ["NLP is fun.", "I love learning it!"]
3. Subword Tokenization (Used in deep learning models like BERT, GPT)
◦ Splits words into meaningful subwords to handle out-of-vocabulary (OOV) words.
◦ Example:
"unhappiness" → ["un", "happiness"]
Why is Tokenization Important?
• Converts unstructured text into structured format.
• Helps in removing stopwords, stemming, and lemmatization.
• Used in feature extraction methods like TF-IDF and word embeddings.
• Essential for NLP models like chatbots, search engines, and sentiment analysis.
Tokenization in Python (Code Examples)
1. Using NLTK (Natural Language Toolkit)
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "NLP is powerful! It helps machines understand human
language."
# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)
# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)
Output:
Word Tokenization: ['NLP', 'is', 'powerful', '!', 'It',
'helps', 'machines', 'understand', 'human', 'language', '.']
Sentence Tokenization: ['NLP is powerful!', 'It helps
machines understand human language.']
2. Using spaCy (More Efficient for Large Text)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Tokenization is the first step in NLP. It splits text into words and sentences."
doc = nlp(text)
# Word Tokenization
print("Word Tokenization:", [token.text for token in doc])
# Sentence Tokenization
print("Sentence Tokenization:", [sent.text for sent in doc.sents])
3. Using Hugging Face’s Tokenizer (For Deep Learning)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization is crucial for NLP!"
tokens = tokenizer.tokenize(text)
print("Subword Tokenization:", tokens)
Output:
Subword Tokenization: ['token', '##ization', 'is', 'crucial',
'for', 'nl', '##p', '!']
Explanation:
• "tokenization" is split into "token" and "##ization" because the model
recognizes "token" as a common word.
• "nlp" is split into "nl" and "##p" as a subword.
Key Points for Exams
• Tokenization is the first step in NLP text processing.
• Types of Tokenization: Word Tokenization, Sentence Tokenization, and Subword
Tokenization.
• NLTK and spaCy are common Python libraries for tokenization.
• Deep learning models (like BERT, GPT) use subword tokenization to handle unknown
words.
• Tokenization helps in text analysis, sentiment analysis, and chatbot development.
Removing Unwanted Tokens in NLP
What are Unwanted Tokens?
Unwanted tokens are elements in text data that do not contribute meaningful information to NLP
tasks. These include:
• Special characters (@, #, $, %, &)
• Punctuation (.,?!:;()[])
• Numbers (123, 45.67)
• Extra whitespace (" NLP is great ")
• Stopwords (the, is, and, in, to, etc.)
• HTML tags (<p>, <div>)
• Emojis and symbols (😊 , ✔, 🚀 )
1. Removing Special Characters & Punctuation
Python Example using Regex (re module)
import re
text = "Hello!! Welcome to NLP. Let's learn & explore?"
clean_text = re.sub(r'[^\w\s]', '', text)  # Remove special characters and punctuation
print(clean_text)
Output:
Hello Welcome to NLP Lets learn explore
2. Removing Numbers
text = "NLP has 100 techniques and 50+ models."
clean_text = re.sub(r'\d+', '', text) # Remove digits
print(clean_text)
Output:
NLP has techniques and models.
3. Removing Extra Whitespaces
text = " NLP is awesome! "
clean_text = " ".join(text.split()) # Remove extra spaces
print(clean_text)
Output:
NLP is awesome!
4. Removing Stopwords (Using NLTK)
Stopwords are common words that do not add much meaning to a sentence.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "This is an amazing NLP tutorial for beginners!"
words = text.split()
filtered_text = " ".join([word for word in words if
word.lower() not in stop_words])
print(filtered_text)
Output:
amazing NLP tutorial beginners!
5. Removing HTML Tags
from bs4 import BeautifulSoup
text = "<p>This is an <b>NLP</b> tutorial.</p>"
clean_text = BeautifulSoup(text, "html.parser").get_text()
print(clean_text)
Output:
This is an NLP tutorial.
6. Removing Emojis and Symbols
import emoji
text = "NLP is awesome! 😊 🚀 "
clean_text = emoji.replace_emoji(text, replace="")
print(clean_text)
Output:
NLP is awesome!
Key Points for Exams
• Unwanted tokens include punctuation, numbers, stopwords, special characters, HTML
tags, and emojis.
• Regex (re.sub) is useful for removing punctuation and numbers.
• NLTK stopwords help filter out common words that do not add meaning.
• BeautifulSoup is used for removing HTML tags.
• Emoji library helps remove or replace emojis in text.
Corrections, Stemming, and Normalization in NLP
1. Text Corrections
Text correction is the process of fixing spelling errors, typos, and grammatical mistakes in text. It is
crucial for improving text quality before further NLP processing.
Types of Text Corrections:
1. Spell Checking: Identifies and corrects misspelled words.
2. Grammatical Corrections: Fixes grammar mistakes.
3. Word Substitution: Suggests correct words for typos.
Python Example: Spell Checking with TextBlob
from textblob import TextBlob
text = "NLP is amzng and very usful for data anlysis."
corrected_text = TextBlob(text).correct()
print(corrected_text)
Output:
NLP is amazing and very useful for data analysis.
Using pyspellchecker for Faster Spell Checking
from spellchecker import SpellChecker
spell = SpellChecker()
text = "Ths is a smple NLP tst."
words = text.split()
corrected_words = [spell.correction(word) for word in words]
corrected_text = " ".join(corrected_words)
print(corrected_text)
Output:
This is a sample NLP test.
2. Stemming
Stemming is the process of reducing words to their root form by removing suffixes. It is a quick
but sometimes inaccurate approach.
Example of Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "flies", "happily", "studies"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Output:
['run', 'fli', 'happili', 'studi']
• "running" → "run" ✅
• " ies" → " i" ❌ (Incorrect due to over-stemming)
• "studies" → "studi" ❌ (Incorrect)
Other Stemming Algorithms
1. Porter Stemmer: Most common and fast.
2. Snowball Stemmer: Advanced version of Porter Stemmer.
3. Lancaster Stemmer: More aggressive and removes more characters.
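A short sketch comparing these stemmers side by side (all three classes live in nltk.stem; Lancaster generally strips the most aggressively):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["running", "flies", "happily", "studies"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word), "|", lancaster.stem(word))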
3. Normalization
Text normalization is the process of converting words into a standard format to ensure
consistency in NLP tasks.
Key Normalization Techniques:
1. Lowercasing: Converts all text to lowercase.
text = "Natural Language Processing"
print(text.lower())
Output: "natural language processing"
2. Removing Special Characters & Punctuation:
import re
text = "Hello!!! NLP is great."
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)
Output: "Hello NLP is great"
3. Expanding Contractions: Converts short forms to full forms.
from contractions import fix
text = "I'll go to the park. It's amazing!"
print(fix(text))
Output: "I will go to the park. It is amazing!"
4. Lemmatization (Better than Stemming): Converts words to their dictionary form.
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "happily", "studies"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
Output: ['running', 'fly', 'happily', 'study']
◦ "flies" → "fly" ✅ (Correct compared to stemming)
◦ "studies" → "study" ✅ (Correct)
Key Points for Exams
• Corrections: Fix spelling and grammatical mistakes using tools like TextBlob and
pyspellchecker.
• Stemming: Reduces words to root form but may lead to incorrect results (flies →
fli).
• Normalization: Standardizes text (lowercasing, punctuation removal, contractions,
lemmatization).
• Lemmatization is better than stemming as it gives meaningful words.
Parsing the Text in NLP
Parsing is the process of analyzing the structure of a sentence to understand its meaning and
grammatical structure. It includes Part of Speech (POS) tagging and Probabilistic Parsing.
1. Part of Speech (POS) Tagging
What is POS Tagging?
POS tagging assigns grammatical categories (such as noun, verb, adjective, etc.) to each word in a
sentence.
Example POS Tags:
POS Tag   Meaning
NN        Noun (e.g., cat, car)
VB        Verb (e.g., run, eat)
JJ        Adjective (e.g., beautiful, large)
RB        Adverb (e.g., quickly, silently)
PRP       Pronoun (e.g., he, she, they)
POS Tagging in Python
Using NLTK (Natural Language Toolkit):
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
text = "John is playing football in the park."
words = nltk.word_tokenize(text) # Tokenization
pos_tags = nltk.pos_tag(words) # POS tagging
print(pos_tags)
Output:
[('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'),
('football', 'NN'), ('in', 'IN'), ('the', 'DT'), ('park',
'NN'), ('.', '.')]
POS Tagging Using Spacy (More Efficient)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "John is playing football in the park."
doc = nlp(text)
for token in doc:
    print(token.text, ":", token.pos_)
Output:
John : PROPN
is : AUX
playing : VERB
football : NOUN
in : ADP
the : DET
park : NOUN
. : PUNCT
2. Probabilistic Parsing
What is Probabilistic Parsing?
Probabilistic parsing assigns the most likely grammatical structure to a sentence based on
probability. It helps in handling ambiguous sentences.
Types of Parsing:
1. Constituency Parsing (Phrase Structure Parsing)
◦ Breaks the sentence into noun phrases (NP), verb phrases (VP), etc.
2. Dependency Parsing
◦ Identifies dependencies between words (e.g., subject-verb relationship).
Probabilistic Context-Free Grammar (PCFG)
A Probabilistic Context-Free Grammar (PCFG) assigns probabilities to different grammatical
rules.
Example grammar:
S → NP VP [0.9]
NP → Det N [0.5] | N [0.5]
VP → V NP [0.7] | V [0.3]
Det → 'the' [1.0]
N → 'dog' [0.5] | 'cat' [0.5]
V → 'chased' [1.0]
This means:
• S → NP VP happens 90% of the time.
• NP → Det N happens 50% of the time.
• N → 'dog' and N → 'cat' both happen 50% of the time.
Probabilistic Parsing Using NLTK
import nltk
grammar = nltk.PCFG.fromstring("""
S -> NP VP [0.9]
NP -> Det N [0.5] | N [0.5]
VP -> V NP [0.7] | V [0.3]
Det -> 'the' [1.0]
N -> 'dog' [0.5] | 'cat' [0.5]
V -> 'chased' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
sentence = ['the', 'dog', 'chased', 'the', 'cat']
for tree in parser.parse(sentence):
    print(tree)
Output (Parse Tree):
(S
(NP (Det the) (N dog))
(VP (V chased) (NP (Det the) (N cat))))
Dependency Parsing Using Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The dog chased the cat."
doc = nlp(text)
for token in doc:
print(f"{token.text} --> {token.dep_} -->
{token.head.text}")
Output:
The --> det --> dog
dog --> nsubj --> chased
chased --> ROOT --> chased
the --> det --> cat
cat --> dobj --> chased
. --> punct --> chased
• nsubj (Nominal Subject): "dog" is the subject of "chased."
• dobj (Direct Object): "cat" is the object of "chased."
• det (Determiner): "the" is linked to nouns.
Key Points for Exams
• POS Tagging assigns parts of speech to words.
• NLTK and Spacy are commonly used for POS tagging.
• Probabilistic Parsing helps resolve ambiguities in sentence structure.
• PCFG (Probabilistic Context-Free Grammar) assigns probabilities to grammar rules.
• Constituency Parsing breaks sentences into phrases (NP, VP, etc.).
• Dependency Parsing identifies grammatical relationships between words.
Shallow, Dependency, and Constituency Parsing in
NLP
Parsing is the process of analyzing the structure of a sentence to understand its syntactic and
semantic meaning. It is broadly classi ed into three types:
1. Shallow Parsing (Chunking)
2. Dependency Parsing
3. Constituency Parsing
1. Shallow Parsing (Chunking)
What is Shallow Parsing?
Shallow Parsing, also known as Chunking, groups words into phrases (like noun phrases, verb
phrases) without fully analyzing the entire sentence structure. It does not form a complete parse tree
but identifies key phrases.
Example:
Sentence: "The quick brown fox jumps over the lazy dog."
Shallow Parsing Output:
• [The quick brown fox] (Noun Phrase - NP)
• [jumps] (Verb Phrase - VP)
• [over the lazy dog] (Prepositional Phrase - PP)
Shallow Parsing Using NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "The quick brown fox jumps over the lazy dog"
words = nltk.word_tokenize(sentence) # Tokenization
pos_tags = nltk.pos_tag(words) # POS Tagging
# Define chunk grammar
grammar = "NP: {<DT>?<JJ>*<NN>}" # NP = Determiner (DT) +
Adjective (JJ) + Noun (NN)
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)
tree.pretty_print()
Output:
(S
(NP The/DT quick/JJ brown/JJ fox/NN)
jumps/VBZ
over/IN
(NP the/DT lazy/JJ dog/NN))
✅ Key Points:
• Faster than full parsing
• Extracts key phrases (noun, verb, etc.)
• Useful for Named Entity Recognition (NER) and Information Extraction
2. Dependency Parsing
What is Dependency Parsing?
Dependency Parsing identifies relationships between words in a sentence. Each word (except the
root) depends on another word, forming a dependency tree.
Example:
Sentence: "The cat chases the mouse."
Dependency Relations:
• cat → subject of chases
• mouse → object of chases
• the → determiner (modifies "cat" and "mouse")
Dependency Parsing Using Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = "The cat chases the mouse."
doc = nlp(sentence)
for token in doc:
print(f"{token.text} --> {token.dep_} -->
{token.head.text}")
Output:
The --> det --> cat
cat --> nsubj --> chases
chases --> ROOT --> chases
the --> det --> mouse
mouse --> dobj --> chases
. --> punct --> chases
✅ Key Points:
• Each word depends on a head word
• Represents grammatical structure
• Used in relation extraction, question answering, machine translation
3. Constituency Parsing
What is Constituency Parsing?
Constituency Parsing breaks sentences into constituents (phrases) based on a hierarchical
structure, forming a tree structure (Parse Tree).
Example:
Sentence: "The cat chases the mouse."
Parse Tree:
S
/ \
NP VP
/ / \
Det V NP
| | / \
The chases Det N
| |
the mouse
• S → Sentence
• NP (Noun Phrase) → The cat
• VP (Verb Phrase) → chases the mouse
• Det (Determiner) → The
Constituency Parsing Using NLTK
import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'The' | 'the'
Adj -> 'lazy'
N -> 'cat' | 'mouse'
V -> 'chases'
""")
parser = nltk.ChartParser(grammar)
sentence = ['The', 'cat', 'chases', 'the', 'mouse']
for tree in parser.parse(sentence):
    tree.pretty_print()
Output:
S
/ \
NP VP
/ / \
Det V NP
| | / \
The chases Det N
| |
the mouse
✅ Key Points:
• Breaks sentences into phrases (NP, VP, etc.)
• Used in syntax analysis, grammar checking, sentence understanding
Comparison of Parsing Techniques
Feature          Shallow Parsing                 Dependency Parsing                       Constituency Parsing
Output           Phrases (NP, VP)                Dependency Tree                          Parse Tree
Complexity       Fast & Simple                   Medium                                   High
Use Cases        Chunking, POS tagging           Relation Extraction, Semantic Analysis   Syntax Checking, AI Chatbots
Example Output   "The cat" (NP), "chases" (VP)   cat --> nsubj --> chases                 Full Sentence Parse Tree
Exam-Oriented Key Points
1. Shallow Parsing
◦ Also called Chunking
◦ Groups words into phrases (NP, VP, etc.)
◦ Used in NER, POS tagging, Information Extraction
2. Dependency Parsing
◦ Identifies word dependencies
◦ Outputs ROOT word & dependencies
◦ Used in Relation Extraction, Machine Translation
3. Constituency Parsing
◦ Breaks sentences into phrases (NP, VP, etc.)
◦ Forms Parse Tree (Hierarchical Structure)
◦ Used in Grammar Checking, AI Chatbots
UNIT-3
Text Corpus
✅ 1. Definition
A text corpus is a large and structured collection of texts used for natural language processing
(NLP) tasks such as text classification, machine translation, and sentiment analysis. It serves as the
foundational dataset for training machine learning models and analyzing language trends.
2. Types of Text Corpus
Different types of corpora exist based on the language, content, and purpose.
1. Monolingual Corpus
• Contains text in only one language.
• Used for training language models in that language.
• Example: Wikipedia Corpus (English), Hindi News Articles.
• Use Case: Chatbots, Sentiment Analysis, Spell Checking.
2. Multilingual Corpus
• Contains text in multiple languages.
• Used for cross-lingual research and translation models.
• Example: EuroParl Corpus, UN Documents in different languages.
• Use Case: Machine Translation (e.g., Google Translate).
3. Parallel Corpus
• A special type of multilingual corpus where the same text is available in multiple
languages, side by side.
• Used for training translation models.
• Example: TED Talks Transcripts, UN Parallel Corpus.
• Use Case: Automatic Translation Tools like DeepL, Google Translate.
4. Specialized Corpus (Domain-Specific Corpus)
• Focuses on a specific industry or topic (Medical, Legal, Finance, etc.).
• Used for industry-specific NLP models.
• Example: PubMed (Medical Corpus), Legal Corpus.
• Use Case: Medical chatbots, Legal Document Analysis, Financial Trend Prediction.
5. General Corpus
• Contains a wide variety of topics (news, books, blogs, etc.).
• Used for training general-purpose NLP models.
• Example: Google Books Corpus, Common Crawl (web pages).
• Use Case: AI Assistants like ChatGPT, Speech-to-Text Models.
3. Importance of a Text Corpus
✔ Trains AI models for text understanding and generation.
✔ Improves machine translation and speech recognition.
✔ Used in linguistic research to study language patterns.
✔ Helps in text classification, chatbots, and voice assistants.
4. Preprocessing a Text Corpus
Before using a text corpus, preprocessing is essential to clean and standardize text.
✔ Tokenization – Splitting text into words or sentences.
✔ Lowercasing – Convert all text to lowercase for consistency.
✔ Stopword Removal – Removing frequently occurring but unimportant words (e.g., "is", "the",
"and").
✔ Stemming & Lemmatization – Reducing words to their root or base form (e.g., "running" →
"run").
✔ Removing Punctuation & Special Characters – Eliminating symbols like @, #, $, !.
✔ Vectorization – Converting words into numerical values using techniques like Bag of Words
(BoW), TF-IDF, Word2Vec, GloVe.
5. Summary Table
Type of Corpus   Description                         Example                      Use Case
Monolingual      Text in one language                Wikipedia (English)          Chatbots, Sentiment Analysis
Multilingual     Text in multiple languages          UN Documents                 Cross-language NLP, Translation
Parallel         Same text translated side-by-side   TED Talks Corpus             Machine Translation
Specialized      Domain-specific text                PubMed (Medical)             Medical & Legal NLP
General          Covers multiple topics              Google Books, Common Crawl   AI Assistants, Speech Recognition
6. Key Takeaways
✔ A text corpus is a collection of texts used for NLP and machine learning.
✔ Types include Monolingual, Multilingual, Parallel, Specialized, and General corpora.
✔ Famous corpora include Brown Corpus, Google Books Ngram, and Common Crawl.
✔ Preprocessing is crucial for preparing text before using it in NLP models.
Bag of words
✅ 1. Definition
The Bag of Words (BoW) Model is a simple and widely used technique for representing text in
numerical form. It ignores grammar and word order but keeps track of word frequency in a
document.
It is called "Bag of Words" because it treats text as an unordered "bag" of words, only focusing on
how many times each word appears.
2. Working of Bag of Words Model
Step 1: Collect Text Data
• Example:
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Step 2: Tokenization (Convert Sentences into Words)
• Remove punctuation and split text into words:
["I", "love", "playing", "football"]
["Football", "is", "a", "great", "sport"]
Step 3: Create a Vocabulary
A vocabulary is a list of all unique words in the dataset:
["I", "love", "playing", "football", "is", "a", "great",
"sport"]
Step 4: Create Word Frequency Vectors
Each document is converted into a vector where each column represents a word from the
vocabulary, and the values indicate how often the word appears.
Word I love playing football is a great sport
Doc 1 1 1 1 1 0 0 0 0
Doc 2 0 0 0 1 1 1 1 1
• 1 means the word is present in the document.
• 0 means the word is absent.
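A minimal sketch reproducing this idea with scikit-learn's CountVectorizer (assumed installed); note that its default tokenizer lowercases text and drops one-letter words such as "I" and "a", so the vocabulary is slightly smaller than the hand-built one above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love playing football.", "Football is a great sport."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['football' 'great' 'is' 'love' 'playing' 'sport']
print(bow.toarray())   # one row per document, one count per vocabulary word
# [[1 0 0 1 1 0]
#  [1 1 1 0 0 1]]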
3. Advantages of Bag of Words
✔ Simple and easy to implement.
✔ Works well for small datasets.
✔ Good for text classification, spam filtering, and sentiment analysis.
4. Disadvantages of Bag of Words
❌ Ignores word meaning – "good" and "not good" are treated as separate words.
❌ Ignores word order – "Apple is red" and "Red is apple" have the same vector.
❌ Leads to large feature space – If the vocabulary is large, it creates very large vectors.
5. Applications of Bag of Words
✔ Text Classification – Spam detection, sentiment analysis.
✔ Topic Modeling – Finding the main topics in a document.
✔ Information Retrieval – Search engines rank documents based on keyword frequency.
6. Summary Table
Feature        Description
Definition     A simple way to represent text as word frequency vectors.
How It Works   Creates a vocabulary, then counts word occurrences.
Pros           Simple, effective for small datasets, good for text classification.
Cons           Ignores meaning, ignores word order, large feature space.
Applications   Spam filtering, sentiment analysis, search engines.
7. Key Takeaways
✔ BoW represents text as word frequency vectors, ignoring word order and meaning.
✔ It is widely used for text classification, spam filtering, and search engines.
✔ It struggles with large vocabulary sizes and doesn't capture word relationships.
Bag of N-Grams Model
✅ 1. Definition
The Bag of N-Grams Model is an extension of the Bag of Words (BoW) model that considers
word sequences (N-Grams) instead of individual words. This helps capture the context and order
of words to some extent, making it better than BoW.
2. What is an N-Gram?
An N-Gram is a sequence of N consecutive words from a given text.
Example Sentence: "I love playing
N-Gram Type
football"
Unigram
"I", "love", "playing", "football"
(N=1)
Bigram (N=2) "I love", "love playing", "playing football"
Trigram (N=3) "I love playing", "love playing football"
• Unigrams (N=1) work like Bag of Words (BoW).
• Bigrams (N=2) help capture word order and context.
• Trigrams (N=3) or higher further improve meaning representation.
3. Working of Bag of N-Grams Model
Step 1: Collect Text Data
• Example:
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Step 2: Tokenization (Break text into N-Grams)
For bigrams (N=2), we get:
• Document 1 → ["I love", "love playing", "playing football"]
• Document 2 → ["Football is", "is a", "a great", "great sport"]
Step 3: Create a Vocabulary
["I love", "love playing", "playing football", "Football is",
"is a", "a great", "great sport"]
Step 4: Create N-Gram Frequency Vectors
N-Gram   I love   love playing   playing football   Football is   is a   a great   great sport
Doc 1    1        1              1                  0             0      0         0
Doc 2    0        0              0                  1             1      1         1
• 1 means the bigram is present in the document.
• 0 means the bigram is absent.
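A minimal scikit-learn sketch restricted to bigrams via ngram_range=(2, 2); as above, one-letter words are dropped by the default tokenizer, so the extracted bigrams differ slightly from the hand-built table:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love playing football.", "Football is a great sport."]
bigram_vec = CountVectorizer(ngram_range=(2, 2))   # bigrams only
bigrams = bigram_vec.fit_transform(docs)

print(bigram_vec.get_feature_names_out())
# ['football is' 'great sport' 'is great' 'love playing' 'playing football']
print(bigrams.toarray())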
4. Advantages of Bag of N-Grams Model
✔ Captures context – Unlike BoW, it preserves some word order.
✔ Improves text classification – Helps in sentiment analysis, spam detection.
✔ Better than BoW for understanding meaning.
5. Disadvantages of Bag of N-Grams Model
❌ Higher dimensionality – More features than BoW, making it computationally expensive.
❌ Still limited context – Cannot understand sentence meaning fully.
❌ Sparse Data Issue – Many N-Grams may appear only once, leading to many zero values in
vectors.
6. Applications of Bag of N-Grams Model
✔ Sentiment Analysis – "not good" vs. "good" (BoW fails here).
✔ Spam Detection – Common spam word sequences are captured.
✔ Speech Recognition – Predicting the next word in a sentence.
✔ Plagiarism Detection – Identifies similar phrases in documents.
7. Summary Table
Feature        Description
Definition     A model that represents text using word sequences (N-Grams).
How It Works   Splits text into N-word sequences and creates frequency vectors.
Pros           Preserves some context, improves classification accuracy.
Cons           High dimensionality, still lacks deep meaning understanding.
Applications   Sentiment analysis, spam detection, speech recognition.
8. Key Takeaways
✔ N-Grams capture some word order, making it better than BoW.
✔ Bigram and Trigram models improve accuracy in text classification.
✔ More N-Grams mean higher complexity and sparse data issues.
TF-IDF (Term Frequency - Inverse Document
Frequency)
✅ 1. Definition
The TF-IDF Model is a statistical measure used to evaluate how important a word is in a
document relative to a collection of documents (corpus). Unlike Bag of Words (BoW) and N-
Grams, TF-IDF considers both frequency and significance, reducing the impact of commonly
used words.
2. Why TF-IDF?
In BoW and N-Grams, common words (e.g., "the", "is", "and") appear frequently, making them
seem important when they are not. TF-IDF solves this by:
✔ Giving high weight to important words (e.g., "football", "Python")
✔ Reducing weight of frequently occurring words (e.g., "the", "is", "and")
3. Components of TF-IDF
In the standard formulation, TF-IDF combines two quantities:
• Term Frequency (TF): TF(t, d) = (number of times term t appears in document d) / (total number of terms in d).
• Inverse Document Frequency (IDF): IDF(t) = log(N / number of documents containing t), where N is the total number of documents.
The TF-IDF score of a term in a document is the product: TF-IDF(t, d) = TF(t, d) × IDF(t).
4. Working of TF-IDF Model
Example Dataset
Document 1: "I love playing football."
Document 2: "Football is a great sport."
Document 3: "I love watching football matches."
5. Advantages of TF-IDF Model
✔ Filters out common words – Reduces the impact of stop words (e.g., "is", "the").
✔ Improves document relevance – Highlights important words.
✔ Better than BoW & N-Grams – Considers both word frequency and importance.
6. Disadvantages of TF-IDF Model
❌ Does not capture meaning – Cannot understand synonyms (e.g., "happy" vs. "joyful").
❌ Ignores word order – "playing football" and "football playing" are treated the same.
❌ High dimensionality – Large corpora create very big matrices.
7. Applications of TF-IDF Model
✔ Search Engines (Google, Bing) – Ranks relevant documents.
✔ Spam Detection – Identifies spam messages.
✔ Keyword Extraction – Finds important words in text.
✔ Recommender Systems – Suggests articles based on keywords.
8. Summary Table
Feature        Description
Definition     A statistical measure that finds important words in a document.
How It Works   Calculates TF and IDF, then multiplies them.
Pros           Reduces stop words, improves relevance.
Cons           Ignores meaning, large feature space.
Applications   Search engines, spam detection, keyword extraction.
9. Key Takeaways
✔ TF-IDF reduces the weight of common words and highlights unique words.
✔ It is widely used in search engines, spam detection, and keyword extraction.
✔ Still has limitations, such as ignoring synonyms and word order.
Word2Vec Model
✅ 1. Introduction
The Word2Vec Model is a deep learning-based approach used for generating word embeddings,
representing words as dense numerical vectors in a continuous space. Unlike traditional models
like Bag of Words (BoW) and TF-IDF, which treat words independently, Word2Vec captures
semantic relationships between words.
2. Why Use Word2Vec?
Older models like BoW and TF-IDF have significant limitations:
❌ Do not understand word meanings (e.g., "king" and "queen" are unrelated).
❌ Ignore relationships between words (e.g., "Paris" is related to "France").
❌ Create high-dimensional, sparse vectors that are computationally inefficient.
Word2Vec overcomes these issues by learning compact, meaningful word representations
where:
✔ Similar words have closer vector representations in the space.
✔ It preserves semantic relationships (e.g., "King - Man + Woman ≈ Queen").
3. How Does Word2Vec Work?
Word2Vec is trained using a neural network and operates using two primary architectures:
1⃣ Continuous Bag of Words (CBOW)
• Predicts the target word based on surrounding words.
• Example: Given "I __ playing football", the model predicts "love".
• Works well for frequent words and is computationally faster.
2⃣ Skip-Gram Model
• Predicts context words given a target word.
• Example: Given "love", the model predicts ["I", "playing", "football"].
• Works better for rare words and captures complex relationships.
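A small sketch of training both architectures with the gensim library (the toy corpus is illustrative; real embeddings need millions of sentences):

from gensim.models import Word2Vec

sentences = [["i", "love", "playing", "football"],
             ["football", "is", "a", "great", "sport"],
             ["i", "love", "watching", "football", "matches"]]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv["football"][:5])                  # first 5 vector dimensions
print(skipgram_model.wv.most_similar("football", topn=2)) # nearest words in the learned space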
4. Word Representation in Word2Vec
Word Vector Representation (Simplified)
King [0.2, 0.8, 0.5, 0.9, 0.1]
Queen [0.3, 0.7, 0.6, 0.8, 0.2]
Man [0.1, 0.9, 0.4, 0.7, 0.3]
Woman [0.2, 0.8, 0.5, 0.6, 0.4]
💡 Key Property:
🔹 King - Man + Woman ≈ Queen (Captures gender relationship).
🔹 Paris - France + Italy ≈ Rome (Shows country-capital relationship).
5. Advantages of Word2Vec
✔ Captures word meanings and relationships effectively.
✔ Reduces high-dimensional vectors to lower, dense representations.
✔ Enhances performance in NLP tasks such as sentiment analysis and translation.
6. Disadvantages of Word2Vec
❌ Requires large datasets for effective learning.
❌ Ignores word order and syntax, focusing only on word relationships.
❌ Training can be computationally expensive for extensive corpora.
7. Applications of Word2Vec
✔ Search Engines – Enhances keyword matching and ranking.
✔ Chatbots & Virtual Assistants – Improves understanding of user queries.
✔ Machine Translation – Provides better word mappings for translations.
✔ Recommendation Systems – Suggests relevant content based on word similarity.
8. Quick Summary
Feature        Description
Definition     A deep learning model that learns word meanings using dense vectors.
How It Works   Uses CBOW and Skip-Gram to predict word relationships.
Pros           Captures word semantics, low-dimensional, and efficient.
Cons           Requires large data, ignores syntax, and is computationally expensive.
Applications   Search engines, chatbots, machine translation.
9. Key Takeaways
✔ Word2Vec generates meaningful word embeddings by learning word relationships.
✔ CBOW predicts words from context, while Skip-Gram predicts context from words.
✔ It is widely used in NLP applications like chatbots, search engines, and translation.
FastText Model
✅ 1. Introduction
FastText is a word embedding model developed by Facebook's AI Research (FAIR). It improves
upon Word2Vec and GloVe by representing words as subword n-grams, allowing it to handle out-
of-vocabulary (OOV) words and capture morphological structures.
Unlike Word2Vec and GloVe, which treat words as atomic units, FastText breaks words into
smaller character-based n-grams, making it highly effective for inflected languages and
misspelled words.
2. Why Use FastText?
🚀 Limitations of Word2Vec & GloVe:
• ❌ Cannot handle out-of-vocabulary (OOV) words (new or rare words).
• ❌ Fails to capture internal word structure (e.g., prefixes, suffixes).
✅ FastText Advantages:
• Breaks words into character n-grams, making it better at handling unseen words.
• Works well for morphologically rich languages (e.g., German, Finnish, Hindi).
3. How FastText Works?
Step 1: Break Words into Subword Units (N-Grams)
Each word is decomposed into character-level n-grams (subwords).
🔹 Example: "Apple" (n=3, Trigrams)
• <Ap, App, ppl, ple, le>
🔹 Example: "Running"
• Run, unn, nni, nin, ing
These subword representations allow FastText to learn similarities between words with similar
structures (e.g., "running" and "runner").
Step 2: Compute Word Embeddings
1⃣ Each word's embedding is computed by adding the embeddings of its subwords.
2⃣ This allows the model to generate vectors for unseen words by summing the n-grams of
similar known words.
💡 Example:
Even if the model has never seen "Playful," it can still derive its meaning from "Play" + "ful"
(which exist in its training data).
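A small sketch with gensim's FastText implementation showing how an unseen word still gets a vector from its subword n-grams (toy corpus, illustrative only):

from gensim.models import FastText

sentences = [["running", "runner", "runs", "play"],
             ["playing", "played", "player", "run"]]

# min_n / max_n set the character n-gram sizes used as subwords
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "playful" never appeared in training, but its character n-grams overlap
# with "playing" and "played", so FastText can still produce a vector for it
print(model.wv["playful"][:5])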
4. Comparison of Word Embedding Models
Model      Handles OOV Words?   Captures Morphology?   Training Speed   Memory Usage
Word2Vec   ❌ No                ❌ No                   🚀 Fast          ✅ Low
GloVe      ❌ No                ❌ No                   ⚡ Medium        ⚠ High
FastText   ✅ Yes               ✅ Yes                  ⏳ Slower        ⚠ High
✔ FastText is superior when dealing with new words, spelling errors, or rich languages.
5. Advantages of FastText
✔ Handles Out-of-Vocabulary (OOV) words dynamically.
✔ Understands word morphology (prefixes, suffixes, roots).
✔ Effective for spelling mistakes (e.g., "colour" vs. "color").
✔ Useful for non-English languages with complex word structures.
✔ Can classify words, sentences, and documents (used in NLP classification tasks).
6. Disadvantages of FastText
❌ Computationally expensive (higher memory usage).
❌ Training is slower than Word2Vec due to subword decomposition.
❌ Not always necessary for simple English texts where Word2Vec/GloVe suffice.
7. Applications of FastText
✔ Spell Correction & Autocomplete – Recognizes spelling mistakes.
✔ Chatbots & NLP Assistants – Understands unseen words better.
✔ Multilingual Text Processing – Works well with complex languages.
✔ Sentiment Analysis – Improves accuracy in understanding word variations.
✔ Document Classification – Used in detecting spam, fake news, and topic categorization.
8. Quick Summary
Feature        Description
Definition     A word embedding model that represents words using subword n-grams.
How It Works   Breaks words into subword units and computes embeddings.
Pros           Handles OOV words, captures word structure, good for multilingual NLP.
Cons           High computational cost, slower than Word2Vec.
Applications   Text classification, NLP assistants, spell correction, sentiment analysis.
9. Key Takeaways
✔ FastText solves the problem of OOV words by using subword n-grams.
✔ It is useful for morphologically complex languages.
✔ Slower and memory-intensive, but highly effective for NLP applications.
Building a Text Classifier
✅ 1. Introduction
A Text Classifier is a machine learning model that categorizes text into predefined classes (e.g.,
spam detection, sentiment analysis, topic classification). It converts raw text into numerical features
and applies classification algorithms to predict the category.
🔹 Example Applications:
✔ Spam vs. Non-Spam Email Classification
✔ Sentiment Analysis (Positive/Negative/Neutral)
✔ News Categorization (Sports, Politics, Technology)
2. Steps to Build a Text Classifier
Step 1: Collect and Preprocess the Data
📌 Data Sources: Web Scraping, APIs, Databases, CSV Files
📌 Preprocessing Steps:
✔ Remove punctuation & special characters
✔ Convert to lowercase
✔ Tokenization (splitting text into words)
✔ Stopword Removal (removing words like “the”, “is”)
✔ Stemming/Lemmatization (reducing words to root form)
Step 2: Convert Text into Numerical Features
Since ML models work with numbers, we must convert text into vectors using:
Technique             Description
Bag of Words (BoW)    Counts word frequency in text.
TF-IDF                Weights words based on importance.
Word2Vec / FastText   Learns word relationships using embeddings.
Example using TF-IDF:
Text 1: "Machine learning is powerful."
Text 2: "Deep learning improves AI."
TF-IDF assigns weights to "Machine", "Deep", "learning", etc.
Step 3: Choose a Classification Algorithm
📌 Common Machine Learning Models for Text Classification:
✔ Naïve Bayes – Works well for spam filtering, sentiment analysis.
✔ Logistic Regression – Good for binary classification.
✔ SVM (Support Vector Machine) – Handles large feature spaces.
✔ Random Forest – Uses decision trees for better accuracy.
✔ Deep Learning (LSTMs, CNNs, Transformers) – Best for complex NLP tasks.
🔹 Example: Using Naïve Bayes for spam classification
Input: "Win a free iPhone now!"
Model predicts: SPAM
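A minimal sketch of such a spam classifier with scikit-learn (the tiny hand-labelled dataset is purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["Win a free iPhone now!", "Meeting at 5 PM tomorrow",
         "Claim your free prize money", "Please review the attached report",
         "Free entry to the weekly lottery", "Lunch with the team on Friday"]
labels = ["spam", "not spam", "spam", "not spam", "spam", "not spam"]

# TF-IDF feature extraction followed by a Naïve Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Win a free holiday now!"]))  # expected: ['spam']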
Step 4: Train & Evaluate the Model
🔹 Split Data into Train & Test Sets (e.g., 80% Train, 20% Test)
🔹 Train the Model using labeled data.
🔹 Evaluate Performance using metrics:
Metric Description
Accuracy % of correct predictions.
Precision How many predicted positives are actually positive?
Recall How many actual positives were correctly classified?
F1-Score Balance between Precision & Recall.
Step 5: Deploy & Optimize
✔ Deploy as a REST API or integrate with apps.
✔ Optimize using hyperparameter tuning (adjusting model parameters).
✔ Use real-time data to improve classification accuracy.
3. Advantages of Text Classification
✔ Automates text categorization tasks.
✔ Saves time in customer support & moderation.
✔ Scalable for handling large datasets.
✔ Works well with AI assistants & chatbots.
4. Disadvantages
❌ Sensitive to noise (spelling errors, slang).
❌ Data imbalance issues (some categories may have fewer samples).
❌ Computationally expensive for deep learning models.
5. Applications of Text Classification
✔ Spam Detection – Gmail filters spam emails.
✔ Sentiment Analysis – Analyzing customer reviews.
✔ Topic Categorization – Sorting news articles.
✔ Toxic Comment Detection – Social media moderation.
6. Summary
Step Description
1. Preprocessing Clean text, remove stopwords, tokenize.
2. Feature Extraction Convert text into numerical vectors (BoW, TF-IDF, Word2Vec).
3. Choose Model Use Naïve Bayes, SVM, or deep learning.
4. Train & Evaluate Use accuracy, precision, recall.
5. Deploy Deploy as API, optimize performance.
7. Key Takeaways
✔ Text classification automates categorizing textual data.
✔ TF-IDF, BoW, and embeddings help in feature extraction.
✔ Naïve Bayes, SVM, and deep learning models are used for classification.
✔ Evaluation metrics ensure accuracy & reliability.
Text Similarity and Document Similarity Measures
✅ 1. Introduction
Text similarity measures how closely related two pieces of text are. It is essential in search
engines, plagiarism detection, chatbots, and recommendation systems.
👉 Example: "Machine learning is great" vs. "Deep learning is amazing" → These sentences have
some similarity but are not identical.
2. Types of Text Similarity
Type                  Description                                                Example
Lexical Similarity    Compares word occurrences in both texts.                   "Hello world" & "Hello there" → similar due to "Hello".
Semantic Similarity   Measures meaning-based similarity (word relationships).    "I love dogs" & "I adore puppies" → different words but same meaning.
3. Text Similarity Methods
1⃣ Jaccard Similarity (Lexical-Based)
📌 Measures overlap between two sets of words.
2⃣ Cosine Similarity (Vector-Based)
📌 Measures angle between text vectors (range: 0 to 1).
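A small sketch of both measures in Python (Jaccard on word sets, cosine on TF-IDF vectors; the two sentences are the examples from the introduction):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "Machine learning is great"
b = "Deep learning is amazing"

# Jaccard similarity: size of the word-set overlap divided by the union
set_a, set_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(set_a & set_b) / len(set_a | set_b)
print("Jaccard:", round(jaccard, 2))

# Cosine similarity: angle between the TF-IDF vectors of the two sentences
vectors = TfidfVectorizer().fit_transform([a, b])
print("Cosine :", round(cosine_similarity(vectors[0], vectors[1])[0][0], 2))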
3⃣ TF-IDF Similarity
📌 Weighs words based on importance in a document.
📌 Common in search engines & keyword extraction.
✅ Example:
✔ "Arti cial Intelligence is the future"
✔ "The future of AI is bright"
✔ TF-IDF assigns higher weights to important words like "Arti cial Intelligence".
🔹 Pros: Helps in search ranking.
🔹 Cons: Doesn't consider word meaning.
4⃣ Word Embeddings (Word2Vec, GloVe, FastText)
📌 Captures semantic relationships between words.
📌 Similar words have closer vector representations.
✅ Example:
✔ "King" - "Man" + "Woman" = "Queen"
✔ "Car" is closer to "Vehicle" than "Apple".
🔹 Pros: Handles synonyms & context.
🔹 Cons: Needs a large dataset to train.
4. Document Similarity Methods
1⃣ Cosine Similarity for Documents
✔ Compares document vectors using TF-IDF or Word2Vec.
✔ Used in news categorization & plagiarism detection.
✅ Example:
✔ Article 1: "Space exploration is exciting."
✔ Article 2: "NASA launches new space mission."
✔ Cosine Similarity = 0.75 (High Similarity)
2⃣ Latent Semantic Analysis (LSA)
📌 Identifies hidden topics in text.
📌 Uses Singular Value Decomposition (SVD) for dimensionality reduction.
✅ Example: Group similar documents based on themes like "Technology" or "Sports".
🔹 Pros: Reduces noise.
🔹 Cons: Computationally expensive.
5. Applications of Text Similarity
✔ Plagiarism Detection – Checks content similarity.
✔ Search Engines – Google ranks pages based on similarity.
✔ Chatbots – Detects similar queries.
✔ Document Clustering – Groups similar research papers.
✔ Recommender Systems – Suggests similar articles or movies.
6. Summary
Method              Type                 Best For
Jaccard Similarity  Lexical              Short text, simple matching.
Cosine Similarity   Vector-Based         Long text, document similarity.
TF-IDF Similarity   Weighted Frequency   Search engines, keyword ranking.
Word Embeddings     Semantic             Capturing meaning, chatbots.
LSA                 Topic Modeling       Document clustering, NLP tasks.
7. Key Takeaways
✔ Text similarity helps in plagiarism detection, chatbots, & search engines.
✔ Jaccard & Cosine Similarity are fast & effective for lexical comparisons.
✔ TF-IDF is great for ranking important words.
✔ Word Embeddings capture semantic relationships.
Building a Text Classifier
✅ 1. Introduction
A Text Classifier automatically assigns labels/categories to text based on its content.
✔ Used in spam detection, sentiment analysis, topic classification, and chatbot intent
detection.
👉 Example:
• Spam Email Detection → "Congratulations! You've won a prize!" → Spam
• Sentiment Analysis → "This product is amazing!" → Positive Sentiment
2. Steps to Build a Text Classifier
Step 1⃣ : Data Collection
📌 Gather labeled text data (training dataset) with categories.
✅ Example:
Text Category
"I love this movie!" Positive
"This product is terrible." Negative
"Win a free iPhone now!" Spam
Step 2⃣ : Text Preprocessing
📌 Clean the text to remove noise and make it machine-readable.
✔ Lowercasing – Convert text to lowercase.
✔ Removing Punctuation – "Hello, World!" → "Hello World"
✔ Stopword Removal – Remove common words like "the", "is", "in".
✔ Tokenization – Split text into words ("I love NLP" → [I, love, NLP]).
✔ Lemmatization/Stemming – Convert words to their base form (running → run).
✅ Example (Before & After Preprocessing)
💬 Before: "I really love playing football!!!"
💬 After: "love play football"
Step 3⃣ : Feature Extraction (Convert Text into Numbers)
📌 Convert text into numerical representations for the model to process.
✔ Bag of Words (BoW) – Counts word occurrences.
✔ TF-IDF (Term Frequency-Inverse Document Frequency) – Weighs important words.
✔ Word Embeddings (Word2Vec, GloVe, FastText) – Captures word meaning.
✅ Example:
• BoW: "I love AI" → [1, 1, 1, 0, 0]
• TF-IDF: "I love AI" → [0.3, 0.5, 0.7, 0, 0]
• Word Embeddings: "I love AI" → [0.23, 0.45, -0.67, ...]
Step 4⃣ : Choose a Classification Model
📌 Use a Machine Learning algorithm to train the classifier.
Algorithm                                 Best For                                 Pros
Naïve Bayes                               Sentiment analysis, spam detection       Fast & simple
Logistic Regression                       Binary classification (Spam/Not Spam)    Good for small datasets
Support Vector Machine (SVM)              Text classification                      Works well with high-dimensional data
Random Forest                             Multi-category classification            Reduces overfitting
Deep Learning (LSTM, CNN, Transformers)   Advanced NLP tasks                       Captures context & sequence
✅ Example:
✔ Naïve Bayes Classifier (for Spam Detection):
• "Win a free iPhone" → Spam
• "Meeting at 5 PM" → Not Spam
Step 5⃣ : Model Training & Evaluation
📌 Train the model using a dataset and check accuracy.
✔ Split Data – 80% for training, 20% for testing.
✔ Performance Metrics:
• Accuracy – Correct predictions out of total predictions.
• Precision – True Positives / (True Positives + False Positives).
• Recall – How many actual positives were correctly predicted.
✅ Example Evaluation Results:
✔ Accuracy: 85%
✔ Precision: 90% (Fewer false positives)
✔ Recall: 80% (Some false negatives)
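A small sketch of computing these metrics with scikit-learn, using illustrative labels and predictions (1 = spam, 0 = not spam):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))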
3. Applications of Text Classification
✔ Spam Filtering – Gmail detects spam emails.
✔ Sentiment Analysis – Reviews classified as Positive/Negative.
✔ News Categorization – Sports, Politics, Tech, etc.
✔ Chatbot Intent Detection – "Order a pizza" → Food Order Category.
4. Summary
Step                    Description
Data Collection         Gather labeled text data.
Preprocessing           Clean and prepare text for analysis.
Feature Extraction      Convert text into numerical format.
Model Selection         Choose Naïve Bayes, SVM, Deep Learning, etc.
Training & Evaluation   Train model and check accuracy.
5. Key Takeaways
✔ Text classification helps in spam detection, sentiment analysis, and chatbots.
✔ Preprocessing improves accuracy by removing noise.
✔ TF-IDF & Word Embeddings help in better representation.
✔ Naïve Bayes, SVM, and Deep Learning are commonly used classifiers.
UNIT-4
Semantic Analysis: Word Sense Disambiguation
✅ 1. Introduction
📌 Semantic Analysis focuses on understanding the meaning of words and sentences in context.
📌 Word Sense Disambiguation (WSD) is the process of determining the correct meaning of a
word in a given context.
👉 Example:
• "I went to the bank to withdraw money." → Bank = Financial Institution
• "I sat on the river bank to relax." → Bank = River Edge
Here, the word "bank" has multiple meanings, and WSD helps determine the correct one.
2. Why is WSD Important?
✔ Improves NLP Applications – Used in chatbots, search engines, and translation systems.
✔ Reduces Ambiguity – Helps in accurate meaning extraction from text.
✔ Enhances Machine Learning Models – Provides contextual understanding for AI.
3. Approaches to Word Sense Disambiguation
1⃣ Knowledge-Based Approaches (Uses Dictionaries & Ontologies)
📌 Uses predefined lexical databases like WordNet.
✔ Lesk Algorithm
• Compares dictionary definitions (glosses) of words.
• Example:
◦ Bank (Financial): "A place where money is kept."
◦ Bank (River): "The land beside a river."
◦ If the sentence contains "money", it selects the first meaning.
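A small sketch of the Lesk algorithm using NLTK's built-in implementation (assumes the punkt and wordnet resources have been downloaded; the simplified algorithm does not always pick the intuitively correct sense):

import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# One-time setup: nltk.download("punkt"); nltk.download("wordnet")
sentence = "I went to the bank to withdraw money"
sense = lesk(word_tokenize(sentence), "bank")

print(sense)               # the WordNet synset chosen for "bank"
print(sense.definition())  # its dictionary gloss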
✔ Path-Based Methods
• Uses semantic relationships in WordNet.
• Measures distance between words in a semantic graph.
2⃣ Supervised Learning Approaches (Uses Labeled Data)
📌 Requires training data with correct word meanings.
✔ Decision Trees, Naïve Bayes, SVM
• Example: Train a classifier with sentences containing "bank" and their correct meanings.
✔ Limitation: Needs large labeled datasets, which can be costly to create.
3⃣ Unsupervised Learning Approaches (Uses Context Clustering)
📌 Clusters words based on their usage in different sentences.
✔ Example:
• The word "bank" appears in finance-related texts and geography-related texts.
• The model learns that "bank" in financial texts means financial institution, while in
geography texts it means river edge.
✔ Latent Semantic Analysis (LSA), Neural Networks, Word Embeddings (Word2Vec, BERT)
are commonly used.
4. Applications of WSD
✔ Machine Translation – Improves accuracy by translating words correctly.
✔ Search Engines – Helps return relevant search results.
✔ Speech Recognition – Corrects homophones like "write" vs. "right".
✔ Chatbots & Virtual Assistants – Enhances conversational AI understanding.
5. Summary Table
Method                  Type               Pros                        Cons
Lesk Algorithm          Knowledge-Based    Simple, uses dictionaries   Limited accuracy
Path-Based Methods      Knowledge-Based    Uses semantic structure     Requires a lexical database
Supervised Learning     Machine Learning   High accuracy               Needs labeled data
Unsupervised Learning   Machine Learning   Learns from raw text        Computationally expensive
6. Key Takeaways
✔ WSD helps AI understand words in the right context to reduce ambiguity.
✔ Lexical (dictionary-based), supervised, and unsupervised methods are used for WSD.
✔ Applications include Google Search, ChatGPT, Speech Assistants, and Translation Systems.
Named Entity Recognition (NER)
✅ 1. Introduction
📌 Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that
identifies and categorizes proper names in text.
📌 It extracts entities like Person, Organization, Location, Date, Time, etc.
👉 Example:
• "Elon Musk founded SpaceX in 2002, headquartered in California."
◦ Elon Musk → Person
◦ SpaceX → Organization
◦ 2002 → Date
◦ California → Location
2. Why is NER Important?
✔ Improves Information Extraction – Helps in news classification, chatbots, and search
engines.
✔ Enhances Data Organization – Useful for sorting and indexing large text corpora.
✔ Strengthens AI Applications – Helps machine learning models understand structured data.
3. Types of Named Entities
✔ Person (PER): Identifies individuals.
✔ Organization (ORG): Recognizes companies, institutions.
✔ Location (LOC): Detects countries, cities, landmarks.
✔ Date & Time (DATE, TIME): Extracts temporal information.
✔ Monetary Values (MONEY): Identifies prices, salaries.
✔ Percentages (PERCENT): Finds percentage-based data.
✔ Miscellaneous (MISC): Captures other entity types like product names, events.
4. Approaches to NER
1⃣ Rule-Based Methods
📌 Uses regular expressions and pattern-matching rules.
✔ Example:
• If a word starts with a capital letter and follows "Mr." or "Dr." → Person Name.
✔ Limitation: Struggles with new or unseen entities.
2⃣ Machine Learning-Based Methods
📌 Uses labeled training data to identify entity patterns.
✔ Common algorithms:
• Conditional Random Fields (CRF)
• Hidden Markov Models (HMM)
• Support Vector Machines (SVM)
✔ Limitation: Needs large, well-annotated datasets.
3⃣ Deep Learning-Based Methods
📌 Uses Neural Networks & Word Embeddings.
✔ Common models:
• BiLSTM + CRF
• Transformers (BERT, RoBERTa, GPT-based models)
✔ Advantage: Learns context and works well with unstructured data.
✔ Limitation: Computationally expensive.
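A minimal sketch using spaCy's pre-trained statistical NER model (assumes en_core_web_sm has been installed with "python -m spacy download en_core_web_sm"):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002, headquartered in California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. Elon Musk -> PERSON, 2002 -> DATE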
5. Applications of NER
✔ Search Engines – Google uses NER to highlight key results.
✔ Healthcare NLP – Identifies diseases, drugs from medical texts.
✔ Financial Analysis – Extracts companies and financial terms from reports.
✔ Chatbots & Virtual Assistants – Recognizes names, dates, and locations for personalized
responses.
6. Summary Table
Method             Type                 Pros             Cons
Rule-Based         Pattern Matching     Simple, quick    Struggles with unseen words
Machine Learning   Statistical Models   More adaptable   Needs labeled data
Deep Learning      Neural Networks      Context-aware    Requires high computing power
7. Key Takeaways
✔NER extracts and categorizes names, places, dates, and organizations.
✔ Methods include Rule-Based, Machine Learning, and Deep Learning approaches.
✔ It enhances search engines, finance, healthcare, and AI assistants.
Topic Modeling
✅ 1. Introduction
📌 Topic Modeling is an unsupervised learning technique used to discover hidden themes in
large collections of text.
📌 It groups words that frequently appear together into topics without needing labeled data.
👉 Example:
• News Articles Dataset:
◦ Topic 1: ["election," "vote," "candidate," "government"] → Politics
◦ Topic 2: ["goal," "match," "team," "player"] → Sports
2. Why is Topic Modeling Important?
✔ Summarizes Large Text Data – Helps in document classification and search engines.
✔ Extracts Meaningful Insights – Identifies main themes in social media, news, and research
papers.
✔ Improves Text Analysis – Helps in sentiment analysis and customer feedback categorization.
3. Common Topic Modeling Techniques
1⃣ Latent Dirichlet Allocation (LDA)
📌 Most popular method that assumes each document contains multiple topics.
✔ Works on probability distribution of words across topics.
✔ Example:
• A movie review may contain "acting," "script," "director" (Topic 1: Film Industry) and
"cinematography," "camera," "effects" (Topic 2: Technical Aspects).
✔ Limitation: Needs tuning for the correct number of topics.
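A minimal sketch of LDA with scikit-learn (the three toy documents and the choice of two topics are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the election vote and the candidate for government",
        "the team scored a goal in the football match",
        "voters chose a new government in the election"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the highest-weighted words for each discovered topic
words = counts.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")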
2⃣ Latent Semantic Indexing (LSI)
📌 Uses Singular Value Decomposition (SVD) to find patterns in word usage.
✔ Helps reduce noise in text data.
✔ Example:
• Words like "car," "vehicle," and "automobile" belong to the same topic even if they
aren’t used together frequently.
✔ Limitation: Doesn't handle polysemy (words with multiple meanings) well.
3⃣ Non-Negative Matrix Factorization (NMF)
📌 Decomposes a document-word matrix into topic-based matrices.
✔ Used in recommendation systems and document clustering.
✔ Example:
• Amazon reviews may contain topics related to "quality," "delivery," "price," and
"customer service."
✔ Limitation: Less interpretable than LDA.
4. Applications of Topic Modeling
✔ News Categorization – Groups articles into business, sports, entertainment, etc.
✔ Customer Feedback Analysis – Identifies major issues from reviews.
✔ Healthcare NLP – Extracts medical conditions and symptoms from patient records.
✔ Legal & Academic Research – Helps in summarizing long reports.
5. Summary Table
Method   Type                   Pros                    Cons
LDA      Probabilistic          Finds coherent topics   Requires tuning
LSI      SVD-Based              Handles synonyms        Struggles with polysemy
NMF      Matrix Factorization   Good for clustering     Less interpretable
6. Key Takeaways
✔ Topic Modeling finds hidden themes in large text datasets.
✔ LDA, LSI, and NMF are widely used techniques.
✔ Applications include search engines, news aggregation, and customer feedback analysis.
Latent Semantic Indexing (LSI)
✅ 1. Introduction
📌 Latent Semantic Indexing (LSI) is a mathematical technique used for dimensionality
reduction in text analysis.
📌 It captures relationships between words and documents by mapping them to a lower-
dimensional space.
👉 Example:
• Words "car," "vehicle," and "automobile" may not appear together but belong to the
same semantic topic.
• LSI groups them under a single concept (e.g., "transportation").
2. Why is LSI Important?
✔ Improves Search Accuracy – Finds relevant documents even if exact keywords aren’t used.
✔ Handles Synonyms Well – Recognizes related words even if they differ in spelling.
✔ Reduces Noise in Text Data – Helps filter out irrelevant terms and improve topic coherence.
3. How LSI Works?
📌 LSI is based on Singular Value Decomposition (SVD), which decomposes a document-term
matrix into three smaller matrices:
👉 Steps:
1⃣ Create a Term-Document Matrix (TDM) – Represents word occurrences in documents.
2⃣ Apply SVD – Breaks TDM into smaller matrices to capture important features.
3⃣ Reduce Dimensionality – Eliminates noise while preserving key topics.
4⃣ Retrieve Topics – Identifies relationships between words and documents.
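A minimal sketch of these steps with scikit-learn, applying TruncatedSVD to a TF-IDF term-document matrix (documents and the choice of 2 components are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cars and vehicles on the road",
        "the automobile industry builds new vehicles",
        "football players scored a goal in the match"]

tfidf = TfidfVectorizer(stop_words="english")   # Step 1: term-document matrix
X = tfidf.fit_transform(docs)

lsi = TruncatedSVD(n_components=2)              # Steps 2-3: SVD + dimensionality reduction
doc_concepts = lsi.fit_transform(X)             # Step 4: documents in latent concept space

print(doc_concepts.round(2))   # rows: documents, columns: latent concepts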
4. Applications of LSI
✔ Search Engines – Improves search results by understanding word relationships.
✔ Document Clustering – Groups similar articles, emails, and reports together.
✔ Spam Filtering – Detects spam by analyzing latent topics in emails.
✔ Plagiarism Detection – Compares documents based on semantic similarity.
5. LSI vs Other Topic Modeling Techniques
Method   Type                   Pros                                Cons
LSI      SVD-Based              Handles synonyms, reduces noise     Struggles with polysemy
LDA      Probabilistic          Assigns probability distributions   Requires tuning
NMF      Matrix Factorization   Works well with clustering          Harder to interpret
6. Key Takeaways
✔ LSI reduces text dimensions while preserving meaning.
✔ It improves search accuracy, document clustering, and spam filtering.
✔ Based on SVD, it captures hidden relationships between words.
Introduction to Lexicons and Sentiment Analysis
✅ 1. Introduction
📌 Lexicons are predefined lists of words with associated meanings, often used in NLP tasks like
sentiment analysis, named entity recognition, and text classification.
📌 Sentiment Analysis is the process of determining the emotional tone behind a piece of text
(e.g., positive, negative, or neutral).
👉 Example:
• "The product is amazing!" → Positive 😊
• "I had a terrible experience." → Negative 😡
• "The service was okay." → Neutral 😐
2. What is a Lexicon?
✔ A Lexicon is a collection of words with metadata (e.g., meaning, polarity, frequency).
✔ In Sentiment Analysis, lexicons contain words labeled as positive, negative, or neutral.
👉 Example of a Sentiment Lexicon:
Word        Sentiment Score
Excellent   +3 (Positive)
Bad         -2 (Negative)
Happy       +2 (Positive)
Angry       -3 (Negative)
3. Lexicon-Based Sentiment Analysis
📌 Uses predefined word lists to determine sentiment.
👉 Steps:
1⃣ Tokenize the text.
2⃣ Match words with sentiment lexicons.
3⃣ Assign scores and compute overall sentiment.
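A small sketch of these steps using a toy lexicon like the table above (the scores and word list are illustrative):

# Toy sentiment lexicon; real lexicons contain thousands of scored words
lexicon = {"excellent": 3, "happy": 2, "amazing": 3, "bad": -2, "angry": -3, "terrible": -3}

def lexicon_sentiment(text):
    tokens = text.lower().replace("!", "").replace(".", "").split()   # step 1: tokenize
    score = sum(lexicon.get(token, 0) for token in tokens)            # steps 2-3: match and add scores
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(lexicon_sentiment("The product is amazing!"))        # Positive
print(lexicon_sentiment("I had a terrible experience."))   # Negative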
✔ Advantages:
• Simple and interpretable
• No need for training data
• Works well for domain-specific tasks
❌ Limitations:
• Ignores context and sarcasm (e.g., "This is just great!" could be sarcastic).
• Fails with unseen words or phrases.
4. Machine Learning-Based Sentiment Analysis
📌 Uses supervised learning to classify text into sentiment categories.
👉 Common ML Algorithms Used:
✔ Naïve Bayes – Probability-based approach.
✔ SVM (Support Vector Machine) – Finds the best boundary between classes.
✔ Deep Learning (LSTMs, Transformers) – Captures context better.
✔ Advantages:
• Understands complex patterns (e.g., sarcasm, context).
• Learns from data instead of relying on predefined words.
❌ Limitations:
• Requires large labeled datasets.
• Computationally expensive.
5. Applications of Sentiment Analysis
✔ Social Media Monitoring – Tracks customer opinions on Twitter, Facebook, etc.
✔ Product Reviews Analysis – Determines customer satisfaction.
✔ Stock Market Prediction – Analyzes news sentiment for nancial insights.
✔ Customer Support Automation – Detects unhappy customers for better service.
6. Summary Table
Approach        Pros                           Cons
Lexicon-Based   Simple, no training needed     Ignores context, sarcasm
ML-Based        More accurate, context-aware   Needs large datasets
7. Key Takeaways
✔ Lexicons store word meanings, often used in sentiment analysis.
✔ Sentiment Analysis determines if a text is positive, negative, or neutral.
✔ Lexicon-based methods are simple but lack context, while ML models offer better accuracy.
Word Embeddings: An Overview
✅ 1. Introduction
📌 Word Embeddings are numerical representations of words in a vector space that capture
semantic meaning and relationships between words.
📌 Unlike traditional one-hot encoding, word embeddings provide a dense representation where
similar words are positioned closer together in the vector space.
👉 Example:
• "King" - "Man" + "Woman" ≈ "Queen"
• "Paris" is closer to "France" than to "India" in vector space.
2. Why Are Word Embeddings Important?
✔ Preserve Semantic Relationships – Words with similar meanings are closer in the embedding
space.
✔ Handle Synonyms and Context – "Happy" and "Joyful" have nearly identical vector
representations.
✔ Enhance NLP Tasks – Used in search engines, chatbots, and sentiment analysis.
❌ Challenges:
• Requires large datasets for effective training.
• Can inherit biases present in training data.
3. Common Word Embedding Techniques
1⃣ Word2Vec
📌 Developed by Google, converts words into high-dimensional vectors.
✔ Uses CBOW (Continuous Bag of Words) and Skip-Gram models.
✔ Can capture relationships like "France" → "Paris" (capital-city relation).
2⃣ GloVe (Global Vectors for Word Representation)
📌 Developed by Stanford, based on word co-occurrence statistics.
✔ Ideal for capturing semantic word relationships.
✔ Example: Words frequently appearing together have closer vector representations.
3⃣ FastText
📌 Developed by Facebook, enhances Word2Vec by incorporating subword (morpheme)
information.
✔ Recognizes variations of words (e.g., run, running, runner have similar representations).
✔ Suitable for handling low-resource languages.
4. How Word Embeddings Work?
👉 Steps to Generate Word Embeddings:
1⃣ Tokenization – Splitting text into individual words.
2⃣ Building Vocabulary – Creating a list of unique words.
3⃣ Training the Model – Assigning numerical vectors based on word relationships.
4⃣ Applying Embeddings – Using them in tasks like text classification, translation, and
summarization.
5. Applications of Word Embeddings
✔ Machine Translation – Google Translate maps words across different languages.
✔ Chatbots & Virtual Assistants – Helps generate human-like responses.
✔ Search Engines – Improves query understanding and retrieval accuracy.
✔ Text Summarization – Extracts essential information from documents.
6. Comparison of Word Embedding Models
Model      Key Feature                Advantages                     Limitations
Word2Vec   Context-based learning     Captures semantic meaning      Requires large data
GloVe      Co-occurrence matrix       Better at word relationships   Computationally expensive
FastText   Uses subword information   Handles rare words well        Slower training
7. Key Takeaways
✔ Word embeddings provide a meaningful way to represent words in numerical form.
✔ Word2Vec, GloVe, and FastText are commonly used models for word representation.
✔ These models are crucial in applications like search engines, chatbots, and NLP tasks.
UNIT-5
Speech Recognition
✅ 1. Introduction
📌 Speech Recognition (also known as Automatic Speech Recognition - ASR) is the process of
converting spoken language into text.
📌 It enables machines to understand human speech and is widely used in voice assistants,
transcription services, and accessibility tools.
👉 Example Applications:
• Voice Assistants – Siri, Google Assistant, Alexa
• Speech-to-Text Services – Google Docs Voice Typing, Otter.ai
• Call Centers – Automated customer support
2. How Speech Recognition Works?
👉 Step-by-Step Process:
1⃣ Audio Input – The system receives a speech signal via a microphone.
2⃣ Feature Extraction – Converts audio waves into digital features (e.g., Mel-Frequency Cepstral
Coefficients - MFCC).
3⃣ Acoustic Model – Maps speech features to phonemes (smallest units of sound).
4⃣ Language Model – Predicts the most probable sequence of words.
5⃣ Decoder – Converts phonemes into words, forming a meaningful sentence.
3. Key Technologies Used
✔ Hidden Markov Models (HMMs) – Traditional method for recognizing speech patterns.
✔ Deep Neural Networks (DNNs) – Modern AI-based approach for higher accuracy.
✔ Transformer Models (e.g., Wav2Vec, Whisper) – Advanced speech recognition using self-
supervised learning.
4. Challenges in Speech Recognition
❌ Accents & Dialects – Variations in pronunciation make recognition difficult.
❌ Background Noise – Noisy environments reduce accuracy.
❌ Homophones – Words that sound similar but have different meanings (e.g., "two" vs. "to" vs.
"too").
❌ Code-Switching – Mixing of languages within speech (common in India, e.g., Hinglish).
5. Applications of Speech Recognition
✔ Voice Assistants – Apple Siri, Amazon Alexa, Google Assistant.
✔ Dictation & Transcription – Automatic conversion of speech to text.
✔ Healthcare – Voice-based medical documentation.
✔ Call Center Automation – AI-powered voice response systems.
✔ Accessibility – Helps individuals with disabilities (e.g., voice-controlled devices).
6. Popular Speech Recognition Systems
System                      Company   Key Feature
Google Speech-to-Text       Google    AI-powered transcription with real-time processing.
IBM Watson Speech-to-Text   IBM       Industry-level accuracy with domain adaptation.
Amazon Transcribe           Amazon    Works with AWS cloud services.
DeepSpeech                  Mozilla   Open-source and AI-powered.
7. Summary & Future Trends
✔ Speech recognition has revolutionized human-computer interaction.
✔ AI-based approaches like transformers are improving accuracy.
✔ Future advancements include real-time multilingual recognition and emotion-aware speech
processing.
Machine Translation
✅ 1. Introduction
📌 Machine Translation (MT) is the process of automatically translating text from one language
to another using computers.
📌 It is widely used in global communication, localization, and multilingual NLP applications.
👉 Example Applications:
• Google Translate – Real-time language translation.
• Facebook AI Translation – Translates user posts and comments.
• Microsoft Translator – Used in business and education.
2. Types of Machine Translation
1⃣ Rule-Based Machine Translation (RBMT)
✔ Uses linguistic rules, grammar, and dictionaries to translate.
✔ Works well for structured texts (legal, medical).
❌ Limitation – Requires extensive rule databases and lacks flexibility.
2⃣ Statistical Machine Translation (SMT)
✔ Uses probability models trained on bilingual texts.
✔ Example: Google Translate (before AI models).
❌ Limitation – Fails for complex grammar structures.
3⃣ Neural Machine Translation (NMT)
✔ Uses deep learning and neural networks for translation.
✔ Captures context, grammar, and meaning effectively.
✔ Example: Google Translate (modern version), OpenAI’s GPT.
3. How Machine Translation Works?
👉 Step-by-Step Process (NMT)
1⃣ Text Preprocessing – Tokenization, normalization, and sentence segmentation.
2⃣ Encoding – Converts words into vector representations (word embeddings).
3⃣ Translation Model – Uses Transformer-based models like BERT, GPT.
4⃣ Decoding – Generates translated text in the target language.
5⃣ Post-processing – Adjusts grammar and structure for natural output.
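A minimal sketch of NMT with a small open model via the Hugging Face pipeline (t5-small is used purely for illustration; production systems rely on much larger translation models):

from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("Machine translation converts text from one language to another.")
print(result[0]["translation_text"])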
4. Challenges in Machine Translation
❌ Idioms & Phrases – "Break a leg" may translate literally instead of meaning "Good luck".
❌ Context Understanding – "Bank" (financial institution) vs. "Bank" (riverbank).
❌ Low-Resource Languages – Less training data for regional languages like Bhojpuri, Konkani.
❌ Grammar & Syntax Errors – Sentence structure variations between languages.
5. Applications of Machine Translation
✔ Global Communication – Translates emails, messages, and social media posts.
✔ E-commerce & Business – Localizes websites and customer support.
✔ Healthcare & Law – Helps in multilingual documentation.
✔ Education & Research – Translates academic papers and books.
6. Popular Machine Translation Systems
System                 Company     Key Feature
Google Translate       Google      AI-powered translation across 100+ languages.
DeepL Translator       DeepL       High-quality translations with deep learning.
Microsoft Translator   Microsoft   Cloud-based translation for businesses.
Amazon Translate       Amazon      Neural translation for applications.
7. Future of Machine Translation
✔ AI-powered models will enhance accuracy and fluency.
✔ Zero-shot translation – AI can translate between languages it has never seen before.
✔ Multilingual models – One model can handle multiple languages at once.
✔ Speech-to-Speech Translation – Real-time translation of spoken language.
Question Answering (Q&A)
✅ 1. Introduction
📌 Question Answering (Q&A) is an NLP task where a system automatically provides answers to
user queries.
📌 It is used in chatbots, virtual assistants, search engines, and customer support systems.
👉 Example Applications:
• Google Search – Provides direct answers to questions.
• Chatbots – AI-powered assistants like ChatGPT, Alexa, and Siri.
• Customer Support – Automated responses to FAQs.
2. Types of Question Answering Systems
1⃣ Open-Domain Q&A
✔ Answers questions using a large dataset or the internet.
✔ Example: Google Search answering “Who is the President of India?”.
✔ Uses retrieval-based techniques (search engine-based).
2⃣ Closed-Domain Q&A
✔ Answers questions from a specific dataset or domain (e.g., medical, legal).
✔ Example: A Q&A system for medical diagnosis.
3⃣ Extractive Q&A
✔ Extracts exact phrases from a document to answer a question.
✔ Example: “What is the capital of France?” → Extracts "Paris" from a paragraph.
4⃣ Generative Q&A
✔ Generates answers using AI models like GPT and BERT.
✔ Example: “Explain Quantum Physics” → AI generates a detailed answer.
3. How Question Answering Works?
👉 Step-by-Step Process:
1⃣ Question Processing – Identifies the type of question (Who, What, Where, When, Why).
2⃣ Document Retrieval – Finds relevant documents using a search engine or database.
3⃣ Answer Extraction – AI extracts or generates the best answer.
4⃣ Answer Ranking – Ranks multiple possible answers based on relevance.
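A minimal sketch of extractive Q&A with a pre-trained Hugging Face model (the model name is one commonly used SQuAD-trained checkpoint, chosen here for illustration):

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="What is the capital of France?",
            context="Paris is the capital and largest city of France.")

print(result["answer"], "| confidence:", round(result["score"], 2))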
4. Technologies Used in Q&A
✔ BERT (Bidirectional Encoder Representations from Transformers) – Extracts precise
answers.
✔ GPT (Generative Pre-trained Transformer) – Generates human-like responses.
✔ TF-IDF & BM25 – Used in retrieval-based Q&A systems.
✔ Knowledge Graphs – Stores facts and relationships for structured Q&A.
5. Challenges in Question Answering
❌ Understanding Context – AI struggles with ambiguous questions.
❌ Multi-turn Conversations – Maintaining context in long conversations is difficult.
❌ Misinformation – AI might generate or extract incorrect answers.
❌ Low-Resource Languages – Limited training data for regional languages.
6. Applications of Question Answering
✔ Search Engines – Google, Bing, and DuckDuckGo provide instant answers.
✔ Virtual Assistants – Siri, Google Assistant, and Alexa answer user queries.
✔ Healthcare & Legal – AI-powered Q&A for doctors and lawyers.
✔ E-commerce – Automated Q&A for customer queries.
7. Popular Q&A Systems
System              Company       Key Feature
Google Search Q&A   Google        Extracts answers from indexed websites.
IBM Watson          IBM           AI-powered Q&A for businesses.
ChatGPT             OpenAI        Generates human-like responses.
DrQA                Facebook AI   Extractive Q&A from Wikipedia.
8. Future of Question Answering
✔ Multimodal Q&A – AI can answer text, image, and video-based questions.
✔ Personalized Q&A – AI adapts answers based on user preferences.
✔ Voice-Based Q&A – Advanced speech recognition for voice queries.
✔ Conversational Q&A – AI maintains context across multiple questions.
Summarization
✅ 1. Introduction
📌 Summarization is an NLP technique that condenses large texts into shorter, meaningful
summaries while preserving key information.
📌 Used in news articles, legal documents, research papers, and AI-powered assistants.
👉 Example Applications:
• Google News AI – Generates short news summaries.
• ChatGPT & Bard – Summarizes long texts into concise explanations.
• Legal & Medical AI – Extracts key points from case laws and patient records.
2. Types of Summarization
1⃣ Extractive Summarization
✔ Selects important sentences from the original text.
✔ Uses ranking algorithms like TF-IDF, BM25, TextRank.
✔ Example: Highlighting key sentences from an article.
2⃣ Abstractive Summarization
✔ Generates a new summary in natural language, rather than copying sentences.
✔ Uses deep learning models like BERT, GPT, T5.
✔ Example: "The economy is slowing down" instead of “The GDP growth rate is decreasing.”
3. How Summarization Works?
👉 Step-by-Step Process:
1⃣ Text Preprocessing – Tokenization, stopword removal, and stemming.
2⃣ Feature Extraction – Identifies key phrases, named entities, and important words.
3⃣ Ranking Sentences (Extractive Method) – Scores sentences based on importance.
4⃣ Generating Summary (Abstractive Method) – Uses AI models to rewrite content.
5⃣ Post-processing – Removes redundancy and refines sentence structure.
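A minimal sketch of abstractive summarization with the Hugging Face pipeline (t5-small is a small open model used for illustration; the input text is a made-up passage):

from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
text = ("Artificial Intelligence is transforming industries by automating routine tasks, "
        "improving decision making, and enabling entirely new products and services.")

summary = summarizer(text, max_length=25, min_length=5)
print(summary[0]["summary_text"])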
4. Technologies Used in Summarization
✔ TF-IDF & TextRank – Extracts important sentences.
✔ BERTSUM & GPT-4 – Abstractive summarization.
✔ T5 (Text-To-Text Transfer Transformer) – Google's state-of-the-art summarization model.
5. Challenges in Summarization
❌ Understanding Context – AI might miss important nuances.
❌ Redundancy – Some summaries repeat key points unnecessarily.
❌ Fact Preservation – Abstractive models may generate misleading summaries.
❌ Handling Long Texts – Large documents require advanced compression techniques.
6. Applications of Summarization
✔ News & Journalism – AI-generated headlines and article briefs.
✔ Legal & Financial Reports – Condenses case laws and earnings reports.
✔ Healthcare & Research – Summarizes medical ndings and research papers.
✔ Content Creation – AI-powered summarization for blogs and social media.
7. Popular Summarization Tools
Tool                         Company       Key Feature
SummarizeBot                 AI-powered    Summarizes documents, PDFs, and news.
Hugging Face Transformers    Open-source   Provides AI models for summarization.
Google T5                    Google        Generates high-quality abstractive summaries.
GPT-based Summarization      OpenAI        Generates human-like summaries.
8. Future of Summarization
✔ Real-time Summarization – AI can summarize live content (meetings, lectures).
✔ Multimodal Summarization – AI will summarize texts, videos, and audio.
✔ Personalized Summarization – AI will generate summaries tailored to user preferences.
✔ Fact-Checked Summarization – AI will verify facts while summarizing.
Text Categorization
✅ 1. Introduction
📌 Text Categorization (also known as Text Classification) is the process of assigning predefined
labels to text documents based on their content.
📌 It is widely used in spam detection, sentiment analysis, topic classification, and document
organization.
👉 Example Applications:
• Spam Filtering – Classifies emails as spam or non-spam.
• News Categorization – Tags articles as politics, sports, technology, etc.
• Sentiment Analysis – Categorizes reviews as positive, negative, or neutral.
2. Types of Text Categorization
1⃣ Rule-Based Classification
✔ Uses manually defined rules (e.g., keywords, regular expressions).
✔ Example: If an email contains "win a prize," classify it as spam.
✔ ❌ Limitations: Requires frequent updates and struggles with complex patterns.
2⃣ Machine Learning-Based Classification
✔ Uses statistical models to learn from labeled text data.
✔ Algorithms: Naïve Bayes, SVM, Decision Trees, Random Forest, Neural Networks.
✔ Example: Classifying tweets as hate speech, offensive, or normal.
3⃣ Deep Learning-Based Classification
✔ Uses Neural Networks (CNNs, RNNs, LSTMs, Transformers) for better accuracy.
✔ Example: BERT and GPT models categorize complex text more effectively.
3. How Text Categorization Works?
👉 Step-by-Step Process:
1⃣ Text Preprocessing – Tokenization, stopword removal, stemming, lemmatization.
2⃣ Feature Extraction – Converts text into numerical vectors using TF-IDF, Word2Vec, BERT
embeddings.
3⃣ Model Training – A classifier learns patterns from labeled data.
4⃣ Prediction & Classification – The model assigns labels to new text data.
5⃣ Evaluation – Uses accuracy, precision, recall, and F1-score to measure performance.
4. Algorithms Used in Text Categorization
✔ Naïve Bayes Classifier – Works well for spam detection.
✔ Support Vector Machines (SVM) – Effective for short text classification.
✔ Random Forest – Performs well with structured text data.
✔ LSTMs & Transformers (BERT, GPT-4) – Advanced classification models.
5. Challenges in Text Categorization
❌ Ambiguity in Language – Words can have multiple meanings.
❌ Handling Imbalanced Data – Some categories have more data than others.
❌ Multiclass & Multi-Label Classification – Some texts belong to multiple categories.
❌ Domain-Specific Language – Difficult to classify text with industry-specific jargon.
6. Applications of Text Categorization
✔ Email Spam Filtering – Gmail, Outlook classify emails as spam or important.
✔ Sentiment Analysis – Analyzes movie reviews, product feedback.
✔ Customer Support Automation – Routes queries to the right department.
✔ Fake News Detection – Identifies misleading news articles.
7. Popular Text Categorization Tools
Tool                   Company             Key Feature
NLTK & Scikit-Learn    Open-source         Provides Naïve Bayes & SVM models.
TensorFlow & PyTorch   Google & Facebook   Supports deep learning models for classification.
FastText               Facebook AI         Efficient text classification model.
Google AutoML NLP      Google              Auto-trains text categorization models.
8. Future of Text Categorization
✔ Self-Learning AI – AI will continuously improve by learning from new text.
✔ Multilingual Text Classification – AI will classify texts across multiple languages.
✔ Real-Time Categorization – Faster and more efficient classification for chatbots, news, and social media.
✔ Explainable AI in Classification – AI models will provide reasons for their classifications.
Context Identification
✅ 1. Introduction
📌 Context Identification is the process of understanding the meaning of text based on its
surrounding words, phrases, and overall structure.
📌 It helps in chatbots, sentiment analysis, language translation, and intent detection.
👉 Example Applications:
• Voice Assistants (Alexa, Siri, Google Assistant) – Detects user intent from voice
commands.
• Chatbots – Understands user queries to provide accurate responses.
• Sentiment Analysis – Determines whether a statement is positive, negative, or neutral
based on surrounding words.
2. Importance of Context in NLP
✔ Word Meaning Disambiguation – The same word can have different meanings.
✔ Sentiment Understanding – Words can change meaning based on context (e.g., "not bad" is
positive).
✔ Entity Recognition – Identifies if a word refers to a person, place, or organization.
Example:
• "Apple is a big company." 🍏 → Company
• "I ate an apple." 🍎 → Fruit
3. Techniques for Context Identification
1⃣ Lexical Analysis
✔ Identifies parts of speech (POS), synonyms, antonyms, and named entities.
✔ Example: "run" can be a verb (run fast) or a noun (a long run).
2⃣ Dependency Parsing
✔ Analyzes grammatical relationships between words.
✔ Example: "The man saw the dog with a telescope."
• Did the man have a telescope?
• Did the dog have a telescope?
✔ Dependency parsing resolves such ambiguities.
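A small sketch of dependency parsing with spaCy (assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The man saw the dog with a telescope.")

for token in doc:
    # dependency label and the head word each token attaches to
    print(f"{token.text:10s} {token.dep_:10s} head = {token.head.text}")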
3⃣ Word Embeddings (Word2Vec, GloVe, BERT)
✔ Converts words into vector representations to capture meaning and context.
✔ Example: BERT understands that "bank" in "I deposited money in the bank" means a nancial
institution, not a riverbank.
4⃣ Attention Mechanism in Transformers
✔ Used in BERT, GPT, T5 to focus on important words in a sentence.
✔ Example: In "I watched a movie yesterday. It was amazing!", the model understands that "It"
refers to "movie".
4. Challenges in Context Identification
❌ Word Ambiguity – The same word can have different meanings.
❌ Idioms & Sarcasm – Phrases like “Yeah, right!” can be hard for AI to understand.
❌ Domain-Specific Language – Context changes in technical, legal, or medical texts.
❌ Pronoun Resolution – AI must determine what "he," "she," or "it" refers to.
5. Applications of Context Identification
✔ Chatbots & Virtual Assistants – Understands user intent.
✔ Machine Translation – Improves accuracy in Google Translate, DeepL.
✔ Content Recommendation – Netflix, YouTube suggest relevant content based on context.
✔ Search Engines – Google uses context to refine search results.
✔ Fake News Detection – Identifies misleading or false information.
6. Tools for Context Identification
Tool              Company       Key Feature
spaCy             Open-source   Fast NLP pipeline for context analysis.
NLTK              Open-source   Provides dependency parsing and POS tagging.
BERT by Google    Google        Deep learning model for context understanding.
GPT-4 by OpenAI   OpenAI        Advanced language understanding.
7. Future of Context Identification
✔ Real-time Context Analysis – AI will understand conversations dynamically.
✔ Better Multimodal Context Understanding – AI will interpret text, images, and videos
together.
✔ Enhanced Emotion & Sentiment Recognition – AI will grasp deeper emotional context.
✔ Context-Aware AI for Customer Support – AI chatbots will provide human-like responses.
Dialog Systems
✅ 1. Introduction
📌 A Dialog System (or Conversational AI) is a system designed to interact with users in natural
language, either through text or speech.
📌 It is used in chatbots, virtual assistants, voice-based systems, and customer support bots.
👉 Example Applications:
• Alexa, Siri, Google Assistant – Understands and responds to voice commands.
• Customer Support Chatbots – Resolves queries automatically.
• Healthcare AI Assistants – Helps in patient diagnosis via conversation.
2. Types of Dialog Systems
1⃣ Rule-Based Dialog Systems
✔ Uses predefined IF-ELSE rules for responses.
✔ Example:
• User: "What is your name?"
• Bot: "I am a chatbot."
✔ ❌ Limitation: Cannot handle unexpected queries.
2⃣ Retrieval-Based Dialog Systems
✔ Selects predefined responses based on input matching.
✔ Uses TF-IDF, Word2Vec, or BERT for better understanding.
✔ Example: Customer service chatbots that answer FAQs.
3⃣ Generative Dialog Systems
✔ Uses Deep Learning (RNNs, LSTMs, Transformers) to generate responses.
✔ More natural and exible than retrieval-based systems.
✔ Example: ChatGPT, Google Bard, Microsoft Copilot.
3. Architecture of Dialog Systems
🛠 Components of a Dialog System:
1⃣ Speech/Text Input – Takes user input (voice or text).
2⃣ Natural Language Understanding (NLU) – Identi es intent, extracts entities.
3⃣ Dialog Manager – Maintains conversation history and context.
4⃣ Natural Language Generation (NLG) – Generates human-like responses.
5⃣ Speech/Text Output – Returns a response.
4. Implementing a Simple Dialog System in Python
🔹 Using a Rule-Based Approach
def chatbot_response(user_input):
    responses = {
        "hello": "Hi! How can I help you?",
        "how are you": "I'm just a bot, but I'm doing great!",
        "bye": "Goodbye! Have a nice day!",
    }
    return responses.get(user_input.lower(), "Sorry, I don't understand.")

# Example conversation
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    print("Bot:", chatbot_response(user_input))
🔹 Limitation: Cannot handle complex queries.
🔹 Using a Retrieval-Based Approach (NLTK & TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = ["Hi there!", "I'm a chatbot.", "I can help you.", "Goodbye!"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(responses)

def chatbot_response(user_input):
    user_vec = vectorizer.transform([user_input])
    similarity = cosine_similarity(user_vec, X)
    best_match_idx = similarity.argmax()
    return responses[best_match_idx]

# Example
print("Bot:", chatbot_response("Who are you?"))
🔹 Advantage: Finds the closest match based on TF-IDF similarity.
🔹 Using a Generative Model (Transformer-Based Chatbot - GPT-like Model)
from transformers import pipeline, Conversation

chatbot = pipeline("conversational", model="facebook/blenderbot-400M-distill")

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    # the conversational pipeline expects a Conversation object and
    # appends the model's reply to conversation.generated_responses
    conversation = chatbot(Conversation(user_input))
    print("Bot:", conversation.generated_responses[-1])
🔹 Advantage: Can generate human-like responses dynamically.
🔹 Requirement: Needs the Hugging Face Transformers library (pip install transformers).
5. Challenges in Dialog Systems
❌ Understanding Context – AI must maintain conversation history.
❌ Handling Ambiguity – Words/phrases may have multiple meanings.
❌ Emotional Understanding – Hard for AI to detect sarcasm/emotions.
❌ Multilingual Support – AI must understand different languages.
6. Applications of Dialog Systems
✔ Customer Service Chatbots – Automates query resolution.
✔ Healthcare Assistants – Provides health advice based on symptoms.
✔ AI Tutors – Assists in online learning.
✔ Smart Home Assistants – Controls home devices using voice commands.
✔ Voice-Activated Devices – Helps visually impaired users interact with technology.
7. Future of Dialog Systems
✔ More Human-Like Conversations – AI will improve in emotion detection.
✔ Better Context Retention – Advanced memory models for better context understanding.
✔ Multimodal Conversational AI – AI will combine voice, text, and gestures.
✔ Personalized AI Assistants – AI will learn user preferences over time.
Introduction to Famous Deep Learning-Based NLP
Models
✅ 1. Introduction
📌 Deep Learning has revolutionized Natural Language Processing (NLP) by enabling models to
understand and generate human-like text.
📌 Advanced NLP models like BERT, GPT-4, and T5 are widely used in chatbots, machine
translation, question answering, and summarization.
👉 Key Capabilities of Deep Learning NLP Models:
✔ Contextual understanding of language.
✔ Generating human-like text responses.
✔ Summarizing long documents.
✔ Answering complex questions.
2. Famous Deep Learning-Based NLP Models
🔹 BERT (Bidirectional Encoder Representations from Transformers)
📌 Developed by Google AI (2018).
📌 Uses bidirectional context learning (understands both left & right context).
📌 Great for Question Answering (QA), Text Classification, Sentiment Analysis, etc.
Example: Using BERT for Text Classification
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

text = "The product is amazing and I love it!"
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
output = model(**tokens)

print(output.logits)  # Raw prediction scores
🔹 Advantage: Understands context better than previous NLP models.
🔹 Limitation: Requires large computational power for training.
🔹 GPT-4 (Generative Pre-trained Transformer 4)
📌 Developed by OpenAI (2023).
📌 A generative model that can write essays, code, answer questions, and generate creative text.
📌 Uses self-attention to predict the next word in a sentence.
Example: Text Generation with a GPT-Style Model
from transformers import pipeline

# GPT-4 itself is available only through OpenAI's API (see the API example in
# a later topic); the open-weight "gpt2" checkpoint stands in here to
# demonstrate the same autoregressive text-generation pipeline.
gpt_pipeline = pipeline("text-generation", model="gpt2")
response = gpt_pipeline("Once upon a time in AI history,", max_length=50)
print(response[0]['generated_text'])
🔹 Advantage: Generates high-quality, human-like responses.
🔹 Limitation: Expensive and requires large datasets.
🔹 T5 (Text-to-Text Transfer Transformer)
📌 Developed by Google AI (2019).
📌 Converts every NLP task into a text generation problem (e.g., translation, summarization).
Example: Using T5 for Summarization
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "Artificial Intelligence is transforming the world with new innovations."
input_ids = tokenizer("summarize: " + text, return_tensors="pt").input_ids
output = model.generate(input_ids)

print(tokenizer.decode(output[0], skip_special_tokens=True))
🔹 Advantage: Works well for summarization, translation, and question answering.
🔹 Limitation: Needs fine-tuning for specific tasks.
3. Comparison of Famous NLP Models
| Model | Developed By | Strengths | Best For |
| --- | --- | --- | --- |
| BERT | Google AI | Contextual Understanding | Sentiment Analysis, Question Answering |
| GPT-4 | OpenAI | Text Generation | Chatbots, Story Writing, Q&A |
| T5 | Google AI | Text-to-Text Transfer | Summarization, Translation |
4. Applications of Deep Learning NLP Models
✔ Virtual Assistants – Alexa, Google Assistant, Siri.
✔ Chatbots – Customer support AI chatbots.
✔ Machine Translation – Google Translate, DeepL.
✔ Summarization Tools – Automatic summarization of news/articles.
✔ Code Generation – AI-powered programming assistants like GitHub Copilot.
5. Future of Deep Learning-Based NLP Models
✔ More Efficient Models – AI with lower compute requirements.
✔ Better Context Retention – AI understanding long conversations better.
✔ Multilingual AI – Support for multiple languages & dialects.
✔ Ethical AI – Bias-free and safe AI systems.
Deep Learning-Based NLP Models: BERT, GPT-4,
etc.
1. Introduction
Natural Language Processing (NLP) has advanced significantly with the introduction of
transformer-based models. These models, such as BERT, GPT-4, and T5, use deep learning
techniques to understand and generate human-like text.
✔ Transformer Architecture: Replaces traditional RNNs and CNNs, enabling parallel processing
and better contextual understanding.
✔ Applications: Chatbots, Question Answering, Sentiment Analysis, Summarization, Machine
Translation, etc.
2. BERT (Bidirectional Encoder Representations from
Transformers)
📌 Developed by: Google AI (2018)
📌 Architecture: Transformer-based bidirectional model (reads text from both left and right).
📌 Key Feature: Uses self-attention mechanism to capture word dependencies in context.
🔹 How BERT Works
BERT is pre-trained on large text corpora and fine-tuned for specific NLP tasks. It follows:
1. Pre-training Phase (Self-Supervised Learning)
◦ Masked Language Model (MLM): Random words in a sentence are masked, and BERT predicts them (see the fill-mask sketch after this list).
◦ Next Sentence Prediction (NSP): BERT determines whether two sentences follow
each other in a text.
2. Fine-Tuning Phase
◦ Pre-trained BERT is fine-tuned on specific tasks like sentiment analysis, Q&A, or translation.
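To see the Masked Language Model objective in practice, Hugging Face's fill-mask pipeline can be run with a pre-trained BERT checkpoint; this is a minimal sketch assuming the transformers library is installed:
from transformers import pipeline

# BERT's MLM pre-training objective: predict the token behind [MASK]
# using context from both the left and the right.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))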
🔹 Advantages of BERT
✔ Understands context bidirectionally (better meaning extraction).
✔ Handles polysemy (same word, different meanings) effectively.
✔ Improves many NLP tasks like Named Entity Recognition (NER) and Q&A.
🔹 Limitations of BERT
❌ High computational cost (requires GPUs/TPUs for training).
❌ Slow inference due to its deep architecture.
❌ Not designed for text generation, mainly for understanding.
3. GPT-4 (Generative Pre-trained Transformer 4)
📌 Developed by: OpenAI (2023)
📌 Architecture: Transformer-based unidirectional model (processes text from left to right).
📌 Key Feature: Autoregressive text generation – predicts the next word based on previous
words.
🔹 How GPT-4 Works
1. Pre-training Phase
◦ Trained on massive text datasets using unsupervised learning.
◦ Learns grammar, facts, reasoning, and common sense.
2. Fine-tuning Phase
◦ Adjusted with human feedback (RLHF - Reinforcement Learning with Human
Feedback) to improve responses.
3. Text Generation
◦ Uses probability distribution to generate human-like text for various applications.
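GPT-4's weights are not publicly available, so the sketch below uses the open GPT-2 model as a stand-in to show the underlying idea: at each step the model produces a probability distribution over the next token, from which the most likely continuations can be read off.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Artificial intelligence will", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(tokenizer.decode([token_id.item()]), round(prob.item(), 3))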
🔹 Advantages of GPT-4
✔ Best in class text generation (coherent, detailed responses).
✔ Handles multi-turn conversations well.
✔ Can process images along with text (multimodal capabilities).
✔ Improved factual accuracy compared to GPT-3.5.
🔹 Limitations of GPT-4
❌ Still prone to hallucinations (generating incorrect facts).
❌ Expensive to train and deploy (requires high computational power).
❌ Limited reasoning in complex problems.
🔹 Example Code: Using GPT-4 with OpenAI API
import openai

openai.api_key = "YOUR_API_KEY"

# Uses the classic (pre-1.0) openai SDK interface
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response['choices'][0]['message']['content'])
4. Comparison: BERT vs. GPT-4
| Feature | BERT (Google) | GPT-4 (OpenAI) |
| --- | --- | --- |
| Directionality | Bidirectional | Unidirectional |
| Main Purpose | Text Understanding | Text Generation |
| Pre-training Tasks | MLM, NSP | Autoregressive Learning |
| Fine-tuning | Task-Specific | RLHF-based tuning |
| Use Cases | Q&A, Sentiment Analysis | Chatbots, Content Generation |
5. Other Deep Learning NLP Models
🔹 T5 (Text-to-Text Transfer Transformer)
📌 Developed by: Google
📌 Key Feature: Treats all NLP tasks as text-to-text problems.
✔ Example: Summarization – "Summarize: The quick brown fox jumps over the lazy dog."
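The same text-to-text pattern covers other tasks simply by changing the prefix. A minimal sketch using t5-small, which was pre-trained with an English-to-German translation prefix:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Translation is expressed exactly like summarization: task prefix + input text
input_ids = tokenizer("translate English to German: The house is wonderful.",
                      return_tensors="pt").input_ids
output = model.generate(input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))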
6. Conclusion & Future Trends
✔ Hybrid Models – Combining BERT for understanding and GPT for generation.
✔ Multimodal AI – Models processing text, images, and videos (e.g., GPT-4 Vision).
✔ Efficient AI – Reducing size and improving inference speed for deployment.
🚀 BERT and GPT-4 are transforming NLP!
Indian Language Case Studies
✅ 1. Introduction
📌 India has 22 official languages and over 1,600 dialects, making NLP for Indian languages
highly complex.
📌 Many deep learning-based NLP models are developed specifically to understand, translate,
and process Indian languages.
📌 Challenges include low resource availability, complex grammar, and script diversity.
2. Challenges in Indian Language NLP
✔ Script Diversity – Hindi (Devanagari), Tamil (Brahmic), Urdu (Perso-Arabic), etc.
✔ Low-Resource Languages – Limited datasets for regional languages like Manipuri, Konkani.
✔ Code-Mixing – Hindi-English or Tamil-English mixed language ("Hinglish", "Tanglish").
✔ Phonetic Spelling Variations – Different spellings for the same word.
✔ Morphological Complexity – Words change form based on tense, gender, and plurality.
3. Indian NLP Initiatives and Models
🔹 AI4Bharat
📌 Developed Indian NLP tools like IndicBERT and Samanantar (largest parallel dataset for
translation).
📌 Focuses on machine translation, speech recognition, and sentiment analysis.
Example: Using IndicBERT for Text Processing
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

text = "भारत एक सुंदर देश है।"  # Hindi: "India is a beautiful country."
tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)
print(output.last_hidden_state.shape)
✔ Use Case: Sentiment analysis and NER for Indian languages.
🔹 Google’s MuRIL (Multilingual Representations for Indian Languages)
📌 A BERT-based model that understands Indian languages and code-mixed text.
📌 Pre-trained on 17 Indian languages and transliterated data.
✔ Use Case: Detecting offensive language in Hinglish tweets, improving search engines, etc.
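A minimal sketch of loading MuRIL for code-mixed text, assuming the publicly released checkpoint google/muril-base-cased on the Hugging Face Hub:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

text = "yeh movie bahut achhi thi"   # Hinglish: "this movie was very good"
tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)
print(output.last_hidden_state.shape)   # contextual embeddings per token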
🔹 Microsoft’s IndicTrans
📌 A Neural Machine Translation (NMT) system for translating Indian languages.
📌 Supports multiple scripts and maintains context better than Google Translate.
✔ Use Case: Translation services for government documents, news, and legal texts.
4. Applications of Indian NLP Models
✔ Automatic Speech Recognition (ASR) – Google Assistant, Alexa in Indian languages.
✔ Machine Translation – AI-powered tools translating Hindi-English, Tamil-Telugu.
✔ Chatbots & Virtual Assistants – Indian banking & e-commerce platforms using Hindi
chatbots.
✔ Sentiment Analysis – Understanding social media opinions in regional languages.
5. Future of Indian NLP
✔ Better Dataset Availability – More labeled Indian language datasets.
✔ Efficient Multilingual AI – NLP models supporting real-time translation & conversation.
✔ Improved Speech Recognition – AI understanding Indian accents & dialects.
✔ Expansion in Education & Healthcare – AI-powered language learning & medical
consultation.