Basics of Text Processing - Tokenization and Stemming
Introduction: Natural Language Processing, Why is NLP hard?, Programming Languages vs. Natural Languages, Are natural languages regular?, Finite automata for NLP, Stages of NLP, Challenges and Issues (Open Problems) in NLP
Basics of text processing: Tokenization, Stemming, Lemmatization, Part of Speech Tagging
Morphological Analysis: What is Morphology?, Types of Morphemes, Inflectional morphology & Derivational morphology, Morphological parsing with Finite State Transducers (FST)
Syntactic Analysis: Syntactic Representations of Natural Language, Parsing Algorithms, Probabilistic Context-Free Grammars, and Statistical Parsing
Semantic Analysis: Lexical Semantics, Relations among lexemes & their senses – Homonymy, Polysemy, Synonymy, Hyponymy, WordNet, Word Sense Disambiguation (WSD), Dictionary-Based Approach, Latent Semantic Analysis
Explain in detail: Introduction to Natural Language Processing, why NLP is hard, and programming languages vs. natural languages.
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and linguistics that focuses
on enabling computers to understand, interpret, and generate human language. NLP is used in
applications such as speech recognition, machine translation, sentiment analysis, chatbots, and text
summarization. It bridges the gap between human communication and computer understanding.
NLP involves multiple disciplines, including computer science, linguistics, and machine learning, to develop algorithms that process and analyze large amounts of natural language data. Its key goals include enabling machines to understand human language, interpret its meaning in context, and generate natural-sounding text.
Why is NLP Hard?
Unlike structured programming languages, natural languages (such as English, Hindi, or Marathi) are highly ambiguous, context-dependent, and constantly evolving. Several factors make NLP a challenging problem:
1. Ambiguity
Lexical Ambiguity: A word can have multiple meanings (e.g., "bank" can refer to a riverbank
or a financial institution).
Syntactic Ambiguity: A sentence can be interpreted in multiple ways (e.g., "I saw the man
with a telescope" – did I use the telescope, or did the man have a telescope?).
Semantic Ambiguity: Words with multiple senses create confusion in meaning (e.g., "bark"
can mean the sound of a dog or the outer layer of a tree).
2. Context Dependence
Words and sentences derive meaning from context. For example, "He ran to the bank" can
only be understood correctly if the context (financial institution vs. riverbank) is known.
3. Morphological Complexity
Many languages have complex word formations with prefixes, suffixes, and inflections (e.g.,
"run," "running," "ran").
Some languages, like agglutinative languages (e.g., Turkish), form long words by adding
multiple morphemes, making processing difficult.
4. Variability in Sentence Structure
Different languages have different grammar rules (e.g., Subject-Verb-Object in English vs.
Subject-Object-Verb in Japanese).
Even within the same language, a sentence can be structured in multiple ways without
changing its meaning.
5. Idioms, Metaphors, and Sarcasm
Humans often use figurative language, which is hard for machines to interpret literally. For
example, "kick the bucket" does not mean physically kicking a bucket but refers to someone
dying.
6. Lack of Sufficient Training Data
Some languages and dialects have limited digital resources for training NLP models.
Low-resource languages face difficulties in building accurate models compared to English.
7. Real-Time Processing Challenges
NLP tasks like speech-to-text, chatbot responses, and real-time translation require quick
processing and understanding of the input text.
Programming Languages vs. Natural Languages

| Feature | Programming Languages | Natural Languages |
| --- | --- | --- |
| Ambiguity | No ambiguity; every statement has a fixed meaning. | High ambiguity; words and sentences can have multiple meanings. |
| Evolution | Changes slowly and in a controlled manner. | Constantly evolving with new words and meanings. |
| Error Tolerance | Even a small syntax error can cause failure. | Humans can understand sentences even with grammar mistakes. |
Conclusion
NLP is a challenging yet essential field in AI that deals with understanding human language. The
complexity arises due to ambiguity, context dependence, and the evolving nature of languages. Unlike
programming languages, which are structured and follow strict rules, natural languages are flexible,
ambiguous, and context-sensitive, making NLP a difficult problem to solve.
Explain in detail: Are natural languages regular? Finite automata for NLP.
To answer this, we need to understand regular languages and how they relate to natural languages.
Regular languages are a class of languages that can be represented using regular expressions and can
be recognized by finite automata. These languages follow strict patterns and have limited complexity.
Regular languages are not powerful enough to capture complex structures like nested dependencies
or long-distance relationships between words, which are common in natural languages.
No, natural languages are not regular, because they are far more complex than regular languages. There are multiple reasons for this:
1. Recursion and nesting: Clauses can be embedded inside clauses to arbitrary depth (e.g., "The cat that the dog that the boy owns chased ran away"), and finite automata cannot count matching levels of embedding.
2. Long-distance dependencies: Grammatical agreement must hold across arbitrarily many intervening words.
Example:
"The dog barks" (correct) vs. "The dog bark" (incorrect).
Regular languages cannot enforce such rules.
Thus, natural languages are not regular because they exhibit recursion, long-distance dependencies,
and complex syntactic rules that cannot be represented by finite-state automata.
Although natural languages are not regular, finite automata (FA) are still useful for some basic NLP
tasks like tokenization, stemming, and simple pattern matching.
There are two types of finite automata:
1. Deterministic Finite Automata (DFA) – Only one possible transition per input symbol.
2. Non-Deterministic Finite Automata (NFA) – Multiple possible transitions per input symbol.
Applications of Finite Automata in NLP
1. Tokenization
Breaking text into words or meaningful units.
Example: Splitting `"Hello, world!"` into `["Hello", ",", "world", "!"]`.
A DFA can be designed to recognize word boundaries (spaces, punctuation, etc.); a minimal sketch appears after this list.
2. Stemming
Reducing words to their root form.
Example: "running" → "run", "played" → "play".
A finite automaton can recognize suffixes and remove them.
3. Lexical Analysis in Parsing
Identifying keywords in programming languages or structured text.
Example: Checking if a string is a valid variable name.
4. Spelling Correction and Simple Grammar Checking
Detecting common misspellings based on predefined rules.
Example: Identifying "hte" as a misspelling of "the".
5. Named Entity Recognition (NER) – Limited Use
Identifying names, dates, and places using predefined patterns.
Example: Recognizing "2024-02-25" as a date.
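As a rough illustration of items 1 and 2 above, the sketch below simulates finite-automaton tokenization with a regular expression (regular expressions without backreferences are equivalent in power to finite automata). The pattern and function name are illustrative choices, not a standard API.

```python
import re

# Match either a run of word characters, or a single punctuation character.
# A regex engine recognizes this pattern exactly as a finite automaton would.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def fa_tokenize(text):
    """Return word and punctuation tokens found at FA-recognizable boundaries."""
    return TOKEN_PATTERN.findall(text)

print(fa_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```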
Limitations of Finite Automata in NLP
Cannot handle recursion or long-distance dependencies.
Limited memory – cannot remember information from distant parts of a sentence.
Not suitable for deep syntactic or semantic analysis.
Conclusion
Natural languages are not regular because they contain recursion, long-distance dependencies,
and contextual variations.
Finite automata are useful for basic NLP tasks like tokenization, stemming, and lexical analysis,
but they are not powerful enough for complex linguistic structures like parsing entire sentences
or understanding meaning in depth.
More advanced models like Context-Free Grammars (CFGs) and Neural Networks are needed for
deeper language processing.
Stages of NLP
1. Lexical and Morphological Analysis
Objective: Break text into meaningful words or tokens and analyze their structure.
Processes Involved:
Tokenization: Splitting a sentence into words, phrases, or subwords.
Example: "I love NLP!" → `["I", "love", "NLP", "!"]`
Stemming: Reducing words to their base form by cutting off prefixes/suffixes.
Example: "running" → "run", "played" → "play"
Lemmatization: Converting words to their dictionary form using linguistic rules.
Example: "am", "is", "are" → "be"
Part of Speech (POS) Tagging: Assigning grammatical labels (noun, verb, adjective) to words.
Example: "The cat sleeps." → `["The/DET", "cat/NOUN", "sleeps/VERB"]`
2. Syntactic Analysis (Parsing)
Objective: Determine the grammatical structure of a sentence, for example the parse tree of "The dog chased the cat":

```
(S (NP (DET The) (N dog))
   (VP (V chased) (NP (DET the) (N cat))))
```
3. Phonological Analysis (Speech Processing)
Objective: Convert speech into text (Speech-to-Text) or text into speech (Text-to-Speech).
Processes Involved:
Speech Recognition: Identifying spoken words.
Prosody Analysis: Understanding tone, stress, and rhythm.
Challenges and Issues (Open Problems) in NLP
Despite advancements, NLP still faces several challenges due to the complexity of human languages.
1. Ambiguity in Language
Issue: Words and sentences can carry multiple lexical, syntactic, and semantic meanings (see the ambiguity examples above), and models must select the intended one from context.
2. Multilingual and Low-Resource Language Support
Issue: NLP models are often trained on English and struggle with low-resource languages.
Example:
Translating idioms: "Break a leg" (English) → "Buena suerte" (Spanish, meaning "Good luck").
Solution Attempts:
Multilingual models like mBERT and Google's M4 models.
3. Data Scarcity and Bias
Issue: Many NLP models require vast amounts of labeled data, which is expensive and time-consuming to create.
Bias Problem: Models trained on biased data can reflect societal biases.
Example:
AI-generated job applications may show gender bias.
Solution Attempts:
Zero-shot learning, synthetic data generation, fairness-focused AI models.
4. Code-Mixing and Informal Language
Issue: People often mix languages (code-mixing) and use informal/slang text.
Example:
"Mujhe yeh movie bahut acchi lagi!" (Hindi-English mix: "I liked this movie a lot!")
Solution Attempts:
Developing code-mixed NLP models, using transfer learning for informal language.
5. Long-Distance Dependencies
Issue: Traditional NLP models struggle with long texts where meaning depends on distant words.
Example:
"The book that I bought last year and kept in my drawer is missing." (What is missing? The
book.)
Solution Attempts:
Transformer models (like BERT, GPT) with attention mechanisms.
6. Real-Time Processing
Issue: Speech recognition, chatbots, and real-time translation require low-latency responses.
Example:
AI assistants like Alexa and Siri must respond instantly.
Solution Attempts:
Edge computing, model optimization, and federated learning.
Conclusion
NLP has evolved significantly, but challenges like ambiguity, context understanding, multilingual
support, and bias remain. Advanced techniques like deep learning, transformers (BERT, GPT), and real-
world knowledge integration are helping improve NLP, but many open problems still require research
and innovation.
1. Tokenization
What is Tokenization?
Tokenization is the process of breaking down text into smaller units, called tokens. These tokens can be
words, sentences, or subwords, depending on the level of tokenization.
Types of Tokenization
1. Word Tokenization
Splitting text into individual words.
Example:

```
Sentence: "I love NLP!"
Word Tokens: ["I", "love", "NLP", "!"]
```
2. Sentence Tokenization
Splitting text into sentences.
Example:

```
Text: "I love NLP. It is very interesting!"
Sentence Tokens: ["I love NLP.", "It is very interesting!"]
```
3. Subword Tokenization
Splitting words into meaningful sub-units.
Useful for handling rare words or unknown words in NLP.
Example:

```
Word: "unhappiness"
Subword Tokens: ["un", "happiness"]
```
4. Character Tokenization
Splitting words into individual characters.
Used in neural network models like LSTMs and Transformers.
Example:

```
Word: "NLP"
Character Tokens: ["N", "L", "P"]
```
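A minimal sketch of word and sentence tokenization with NLTK (this assumes the `punkt` tokenizer models are available; newer NLTK versions may additionally require `punkt_tab`):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize, sent_tokenize

text = "I love NLP. It is very interesting!"
print(sent_tokenize(text))  # ['I love NLP.', 'It is very interesting!']
print(word_tokenize(text))  # ['I', 'love', 'NLP', '.', 'It', 'is', 'very', 'interesting', '!']
```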
Challenges in Tokenization
Compound words (e.g., "New York") should ideally stay together as one token.
Contractions (e.g., "don't") can be split in different ways (["do", "not"] or ["do", "n't"]).
Languages without spaces between words (e.g., Chinese, Japanese) require special segmentation.
2. Stemming
What is Stemming?
Stemming is the process of reducing words to their root form by removing prefixes and suffixes.
Example of Stemming

```
"running" → "run"
"played" → "play"
```

Common Stemming Algorithms
1. Porter Stemmer
The most widely used English stemmer; applies a sequence of suffix-stripping rules.
Example:

```
"connection" → "connect"
"arguing" → "argu"
```
2. Lancaster Stemmer (More Aggressive than Porter Stemmer)
Removes more characters, leading to shorter stems.
Example:

```
"running" → "run"
"argument" → "argu"
```

3. Snowball Stemmer (Porter2)
An improved version of the Porter Stemmer; supports multiple languages.
Example:

```
"happily" → "happi"
"studies" → "studi"
```
4. Regex-Based Stemmer
Uses regular expressions to remove suffixes.
Example:

```
"worked" → "work"
"playing" → "play"
```
Challenges in Stemming
Over-stemming: cutting too much, so unrelated words share a stem.
Under-stemming: cutting too little, so related forms get different stems.
Stems are often not real dictionary words (e.g., "argu").
Tokenization vs. Stemming

| Feature | Tokenization | Stemming |
| --- | --- | --- |
| Purpose | Break text into smaller units (tokens). | Reduce words to their root form. |
| Output | Words, sentences, or subwords. | Root words, often not real words. |
| Example | `"I am running fast."` → `["I", "am", "running", "fast", "."]` | `"running"` → `"run"`, `"arguing"` → `"argu"` |
| Complexity | Simple (splitting words/sentences). | More complex (applying rules to remove suffixes). |
| Accuracy | High (less chance of errors). | Lower (can over-stem or under-stem). |
Conclusion
Tokenization and stemming are crucial text preprocessing steps in NLP. Tokenization splits text into
meaningful units, while stemming simplifies words to their root forms. Both processes help in reducing
text complexity and improving the performance of NLP applications like search engines, chatbots, and
machine translation.
1. Lemmatization
What is Lemmatization?
Lemmatization is the process of reducing a word to its base or dictionary form (lemma) while
considering the word’s context and meaning. Unlike stemming, which removes suffixes blindly,
lemmatization ensures that the transformed word is an actual meaningful word.
Example of Lemmatization
| Word | Lemma |
| --- | --- |
| running | run |
| better | good |
| studies | study |
| wolves | wolf |
Unlike stemming, lemmatization ensures that `"better"` is lemmatized to `"good"`, whereas stemming would leave `"better"` unchanged, failing to capture the actual base form.
Challenges in Lemmatization
Requires the correct part-of-speech tag to pick the right lemma (e.g., "running" as verb vs. noun).
Slower than stemming because it looks words up in a dictionary such as WordNet.
Lemmatization in Python
```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# assumes nltk.download("wordnet") has been run

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))         # Output: run
print(lemmatizer.lemmatize("better", pos=wordnet.ADJ))  # Output: good
```
2. Part of Speech (POS) Tagging
POS tagging is the process of assigning parts of speech (noun, verb, adjective, etc.) to words in a sentence. It helps in understanding the syntactic structure of a text.
Why is POS Tagging Important?
It identifies each word's grammatical role, which is critical for lemmatization, parsing, machine translation, and text analysis.
Example
Sentence: `"The quick brown fox jumps over the lazy dog."`

| Word | POS Tag |
| --- | --- |
| The | DT (Determiner) |
| quick | JJ (Adjective) |
| brown | JJ (Adjective) |
| fox | NN (Noun) |
| jumps | VBZ (Verb - third person singular) |
| over | IN (Preposition) |
| the | DT (Determiner) |
| lazy | JJ (Adjective) |
| dog | NN (Noun) |
Using NLTK:

```python
import nltk
from nltk import word_tokenize, pos_tag
# assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger") have been run

sentence = "The quick brown fox jumps over the lazy dog."
words = word_tokenize(sentence)
pos_tags = pos_tag(words)
print(pos_tags)
# Output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'),
#          ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```
Using spaCy:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for token in doc:
    print(token.text, token.pos_)  # e.g., The DET, quick ADJ, ...
```
Conclusion
Lemmatization helps convert words to their base forms, improving text normalization.
POS tagging helps in understanding word roles in a sentence, which is critical for lemmatization,
machine translation, and text analysis.
Both techniques play an essential role in text preprocessing for NLP applications such as
chatbots, search engines, and voice assistants.
Morphology is a crucial aspect of Natural Language Processing (NLP) that focuses on understanding
the structure and formation of words. It helps in tasks such as lemmatization, spell-checking, text-
to-speech conversion, and machine translation.
1. What is Morphology?
Definition
Morphology is the branch of linguistics that studies the internal structure of words and how they are
formed from smaller units called morphemes.
Why is Morphology Important in NLP?
Helps in word segmentation for languages where spaces do not separate words (e.g., Chinese, Japanese).
Improves lemmatization by identifying the base form of words.
Aids in spell-checking, grammar correction, and language translation.
Assists in text synthesis and voice recognition.
2. What is a Morpheme?
A morpheme is the smallest unit of meaning in a language. It cannot be broken down further without losing its meaning.
3. Types of Morphemes
Morphemes can be classified into two major types:
A. Free Morphemes
Morphemes that can stand alone as complete words.
Examples:

| Free Morpheme | Meaning |
| --- | --- |
| book | A readable object |
| play | Engage in an activity |
B. Bound Morphemes
Morphemes that cannot stand alone and must attach to another morpheme.
Examples:

| Bound Morpheme | Function | Example |
| --- | --- | --- |
| -s | Plural | cats |
| -ing | Continuous tense | playing |
| -ly | Adverb form | quickly |
4. Inflectional vs. Derivational Morphemes
A. Inflectional Morphemes
Modify a word's grammatical role (tense, number, comparison, etc.) without changing its core meaning or category.
Example: "cat" → "cats" (plural); "play" → "played" (past tense).
📌 Inflectional morphemes only add grammatical meaning but do not create a new word.
B. Derivational Morphemes
Change the meaning of a word or convert it into a different grammatical category.
Can turn a verb into a noun, an adjective into an adverb, etc.
Example: "happy" → "happiness" (adjective → noun); "teach" → "teacher" (verb → noun).
Applications of Morphology in NLP
Spell Checking: Identifies incorrect inflections or derivations.
Speech Recognition: Recognizes different word forms.
Machine Translation: Translates words while maintaining correct morphology.
Search Engines: Retrieves relevant documents based on morphological variations of a word.
Text Summarization: Reduces content by understanding word structures.
Conclusion
Morphology is essential in understanding word formation and is widely used in NLP applications like
search engines, chatbots, and speech recognition systems.
Free morphemes can exist alone, while bound morphemes need attachment.
Inflectional morphemes modify words grammatically but don’t create new words.
Derivational morphemes create new words by changing the word's category.
Understanding morphology helps improve text processing, language modeling, and AI-driven NLP
applications! 🚀
1. Inflectional Morphology
What is Inflectional Morphology?
Inflectional morphology deals with modifying a word's form to express grammatical features such as
tense, number, gender, case, mood, and comparison, without changing the word's meaning or
category.
📌 Inflection does not create a new word; it just changes the word's grammatical role.
Examples of Inflectional Morphology
| Inflectional Morpheme | Function | Example |
| --- | --- | --- |
| -s | Plural | cats, dogs |
| -ed | Past tense | played, walked |
| -ing | Progressive | running, playing |
| -er / -est | Comparison | taller, tallest |

Examples in Sentences
"She played the piano yesterday." (-ed marks past tense)
"The cats are running." (-s marks plural; -ing marks the progressive)
Applications in NLP
Part-of-Speech Tagging (POS Tagging): Identifies the correct grammatical category (e.g.,
"running" as a verb vs. noun).
Machine Translation: Adjusts word forms according to language grammar rules.
Speech Recognition: Recognizes different word inflections to understand context.
Text-to-Speech Systems: Reads out words with correct grammatical inflections.
2. Derivational Morphology
What is Derivational Morphology?
Derivational morphology deals with the creation of new words by adding prefixes and suffixes. These
affixes often change the meaning or grammatical category of a word.
📌 Derivation creates a new word with a different meaning or category.
Examples of Derivational Morphology

| Derivational Morpheme | Change | Example |
| --- | --- | --- |
| -ness | Adjective → Noun | happy → happiness |
| -er | Verb → Noun | teach → teacher |
| un- | Negation | happy → unhappy |

Examples in Sentences
"Her kindness impressed everyone." (kind + -ness)
"He is a teacher at our school." (teach + -er)
Applications in NLP
Text Normalization: Converts words to their base forms (e.g., "happiness" → "happy").
Semantic Analysis: Understands word meanings in different contexts.
Information Retrieval: Improves search engines by recognizing derived words.
Machine Translation: Helps in correct word formation in different languages.
Inflectional vs. Derivational Morphology

| Feature | Inflectional | Derivational |
| --- | --- | --- |
| Number of morphemes in English | 8 | Large |
| Example | dog → dogs (plural) | happy → happiness (adjective → noun) |
Conclusion
Inflectional morphology modifies words for grammatical correctness without changing their
meaning.
Derivational morphology creates new words and can change grammatical categories.
Both are crucial in NLP applications like machine translation, text analysis, and speech
recognition.
Understanding inflectional and derivational morphology helps improve language models, search
engines, and AI-driven NLP applications! 🚀
Morphological parsing with Finite State Transducers (FST)
Morphological parsing breaks a word into its morphemes, recovering the root and its grammatical features. For example, the surface form "cats" is analyzed as cat +PL. Morphological parsing is important in Natural Language Processing (NLP) for text analysis, machine translation, and speech recognition.
📌 In morphological parsing, an FST maps surface forms (words) to their root form and morphological structure.
Root: "cat"
Suffix: "-s" (plural marker)
Input Output
c c
a a
t t
Printed using ChatGPT to PDF, powered by PDFCrowd HTML to PDF API. 22/47
Input Output
s +PL (plural marker)
Root: "run"
Suffix: "-ing" (progressive marker)
Input Output
r r
u u
n n
n ε (null, doubling rule)
i i
n n
g +ING (progressive marker)
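A minimal Python sketch of the same idea, with hypothetical suffix rules standing in for real FST transitions (a production system would use an FST toolkit such as HFST or foma):

```python
# Each rule: (surface suffix, characters to restore, lexical tag).
# The "nning" rule undoes the consonant-doubling spelling change.
RULES = [
    ("nning", "n", "+ING"),  # running -> run +ING
    ("ing",   "",  "+ING"),  # walking -> walk +ING
    ("s",     "",  "+PL"),   # cats    -> cat +PL
]

def morph_parse(surface):
    """Map a surface form to 'root +TAG', or return it unchanged."""
    for suffix, restore, tag in RULES:
        if surface.endswith(suffix):
            root = surface[: len(surface) - len(suffix)] + restore
            return f"{root} {tag}"
    return surface

print(morph_parse("cats"))     # cat +PL
print(morph_parse("running"))  # run +ING
```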
8. Conclusion
Morphological parsing is essential for understanding word structures in NLP.
Finite State Transducers (FSTs) are powerful tools for mapping words to their root forms and
affixes.
FST-based morphological analysis is used in lemmatization, POS tagging, spell checking, and
machine translation.
While FST is efficient, it struggles with irregular words and ambiguity, requiring additional
processing techniques.
🔹 Overall, FST is a fundamental technique in NLP for breaking down words into meaningful
components! 🚀
A parse tree represents the syntactic structure of a sentence based on phrase structure grammar. It
breaks a sentence into constituents (phrases and sub-phrases).
📌 Example Parse Tree for "The cat sleeps."
```
        S
      /   \
    NP     VP
   /  \     |
  DT   N    V
  |    |    |
 The  cat sleeps
```
S = Sentence
NP = Noun Phrase ("The cat")
VP = Verb Phrase ("sleeps")
DT = Determiner ("The")
N = Noun ("cat")
V = Verb ("sleeps")
Use Case: Constituency parse trees support grammar checking and rule-based machine translation, where phrase structure matters.
A dependency tree represents grammatical relations between words. Here, each word depends on
another word, forming a directed graph.
📌 Example Dependency Tree for "The cat sleeps."
```
   sleeps
   /    \
 cat    The
```
Use Case: Dependency trees support information extraction and question answering, where direct word-to-word relations matter.
The phrase structure rules underlying the parse tree above:

```
S  → NP VP
NP → DT N
VP → V
```
1. Introduction
Natural Language Processing (NLP) often requires syntactic analysis, which involves understanding the
structure of a sentence. Traditional Context-Free Grammars (CFGs) define sentence structures using
rules, but they do not handle ambiguity effectively. To improve parsing, we use Probabilistic Context-
Free Grammars (PCFGs) and Statistical Parsing, which incorporate probabilities to determine the
most likely parse tree.
2. Context-Free Grammars (CFG) Recap
A Context-Free Grammar (CFG) is a set of production rules used to generate valid sentences. It consists of:
A set of non-terminal symbols (e.g., S, NP, VP)
A set of terminal symbols (the words themselves)
Production rules of the form A → β
A start symbol (usually S)
Example CFG
Example CFG
```
S  → NP VP
NP → DT N
VP → V
DT → "The"
N  → "cat"
V  → "sleeps"
```
This grammar generates the parse tree for "The cat sleeps":

```
        S
      /   \
    NP     VP
   /  \     |
  DT   N    V
  |    |    |
 The  cat sleeps
```
Limitations of CFG
Cannot rank the alternative parses of an ambiguous sentence; every derivation counts as equally valid.
Provides no way to prefer the most likely structure, which PCFGs address by attaching probabilities to rules.
3. Probabilistic Context-Free Grammars (PCFG)
A PCFG is an extension of CFG where each production rule is assigned a probability. These probabilities help determine the most likely parse tree for a given sentence.
PCFG Components
Each rule A → β carries a probability P(A → β)
Probabilities sum to 1 for rules with the same non-terminal on the left-hand side
Example PCFG
```
S  → NP VP    [1.0]
NP → DT N     [0.7]
NP → N        [0.3]
VP → V        [0.6]
VP → V NP     [0.4]
DT → "The"    [1.0]
N  → "cat"    [0.5]
N  → "dog"    [0.5]
V  → "sleeps" [0.6]
V  → "eats"   [0.4]
```
For the sentence "The cat sleeps", we can calculate probabilities for different parse trees:
1️⃣ Parse Tree 1:
```
           S (1.0)
         /        \
   NP (0.7)     VP (0.6)
   /     \          |
DT (1.0) N (0.5) V (0.6)
   |       |        |
 "The"   "cat"  "sleeps"
```
Advantages of PCFG
Resolves ambiguity by ranking competing parse trees by probability.
Rule probabilities can be learned automatically from annotated corpora.
4. Statistical Parsing
Statistical Parsing uses machine learning and probability to select the best parse tree for a sentence based on training data. It learns from large datasets instead of manually defining grammar rules.
Types of Statistical Parsing
1️⃣ Supervised Parsing – Uses labeled parse trees from treebanks (e.g., Penn Treebank)
2️⃣ Unsupervised Parsing – Learns syntax from raw text (no labeled trees)
Methods of Statistical Parsing
A. Probabilistic CYK Parsing
Uses PCFG and the CYK algorithm to find the most likely parse tree
Bottom-up approach (builds trees from words → sentences)
Works best for Chomsky Normal Form (CNF) grammars
Rule probabilities are estimated from a treebank by maximum likelihood:

P(A → BC) = Count(A → BC) / Count(A)
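A compact sketch of probabilistic CYK over the toy PCFG above. Strict CNF allows only binary and lexical rules, so the unary rule VP → V is handled as a special case; all names here are illustrative.

```python
from collections import defaultdict

lexical = {("DT", "The"): 1.0, ("N", "cat"): 0.5, ("V", "sleeps"): 0.6}
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "N"): 0.7}
unary = {("VP", "V"): 0.6}  # non-CNF unary rule, applied after the lexical step

def pcyk(words):
    n = len(words)
    # table[i][j][A] = best probability that A derives words[i:j]
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                table[i][i + 1][A] = p
        for (A, B), p in unary.items():
            if table[i][i + 1][B] > 0:
                table[i][i + 1][A] = max(table[i][i + 1][A], p * table[i][i + 1][B])
    for span in range(2, n + 1):          # build larger spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # try every split point
                for (A, B, C), p in binary.items():
                    prob = p * table[i][k][B] * table[k][j][C]
                    table[i][j][A] = max(table[i][j][A], prob)
    return table[0][n]["S"]

print(pcyk(["The", "cat", "sleeps"]))  # 0.126 (up to floating-point rounding)
```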
7. Applications of PCFG and Statistical Parsing
✅ Speech Recognition – Improves understanding of spoken sentences
✅ Machine Translation – Enhances translation quality
✅ Question Answering Systems – Helps chatbots understand queries
✅ Grammar Checking Tools – Detects grammatical errors
✅ Information Extraction – Identifies relationships between words
8. Conclusion
PCFG extends CFG by assigning probabilities to production rules
Statistical Parsing uses machine learning to select the best parse tree
These methods are crucial in speech recognition, chatbots, and machine translation
Despite challenges, they help improve NLP accuracy and efficiency 🚀
Lexical Semantics
Lexical Semantics is a branch of linguistic semantics that studies the meaning of words, their relationships, and how they contribute to the meaning of sentences.
1️⃣ Words and Lexemes
A lexeme is an abstract unit of meaning underlying a set of inflected word forms:
Lexeme: "run"
Inflected forms: "runs", "running", "ran"
2️⃣ Sense and Reference
Sense: The internal meaning of a word in a specific context
Example: "bank" (financial institution) vs. "bank" (riverbank)
Reference: The actual object or concept the word refers to
Example: "The Eiffel Tower" refers to a specific monument in Paris.
3️⃣ Lexical Relations
A. Homonymy
Words that have the same spelling or pronunciation but different meanings.
Example: "bank" (financial institution) vs. "bank" (riverbank).
B. Polysemy
"Head"
Body part ("He hit his head.")
Leader ("She is the head of the company.")
Front part of an object ("Head of the queue.")
C. Synonymy
Words with the same or similar meanings (e.g., "big" and "large").
D. Antonymy
Words with opposite meanings (e.g., "big" and "small").
Explain in detail: relations among lexemes and their senses – Homonymy, Polysemy, Synonymy, Hyponymy.
Lexemes and their senses are related in four main ways:
1. Homonymy
2. Polysemy
3. Synonymy
4. Hyponymy
1️⃣ Homonymy
Definition
Homonymy occurs when two words have the same spelling or pronunciation but different meanings
and origins.
Types of Homonyms
1. Homophones: Words that sound the same but have different meanings and spellings.
Example:
"Flour" (used in baking) vs. "Flower" (a plant part)
"Write" (to compose text) vs. "Right" (correct)
2. Homographs: Words that have the same spelling but different pronunciations and meanings.
Example:
"Lead" (/lɛd/, a metal) vs. "Lead" (/liːd/, to guide)
"Tear" (/tɪər/, to rip) vs. "Tear" (/tɛər/, a drop from the eye)
3. Perfect Homonyms: Words that have both identical spelling and pronunciation but different
meanings.
Example:
"Bat" (a flying mammal) vs. "Bat" (used in cricket)
NLP Challenge
Homonyms require Word Sense Disambiguation: only the surrounding context reveals which unrelated meaning is intended.
2️⃣ Polysemy
Definition
Polysemy occurs when a single word has multiple related meanings. Unlike homonyms, polysemous
words share a common origin.
Examples of Polysemy
1. Head
Body part: "She hurt her head."
Leader: "He is the head of the department."
Front/top part: "Meet me at the head of the queue."
2. Bank
Financial institution: "I deposited money in the bank."
Riverbank: "They sat by the river bank."
3. Mouth
Body part: "She opened her mouth."
Entrance: "The mouth of the cave."
NLP Challenge
Because polysemous senses are related rather than distinct, they are even harder to separate than homonym senses; models need fine-grained context to select the intended one.
3️⃣ Synonymy
Definition
Synonymy refers to words that have similar or nearly identical meanings in a specific context.
Examples of Synonyms
"big" / "large", "happy" / "joyful", "buy" / "purchase"
Types of Synonyms
1. Absolute Synonyms: Words that mean exactly the same in all contexts (rare in natural languages).
Example: "Furnace" vs. "Boiler" (in some contexts)
2. Near Synonyms: Words that are similar but have slight differences.
Example: "Slim" vs. "Skinny" (Both mean thin, but "skinny" can have a negative connotation.)
3. Contextual Synonyms: Words that are synonyms only in certain contexts.
Example: "Buy" and "Purchase" are synonyms in formal writing, but "purchase" is less
common in casual speech.
NLP Challenge
Synonyms affect text retrieval: A search for "big house" should also find "large house".
Thesaurus-based models (like WordNet) help group synonyms for NLP applications.
Word embeddings (like Word2Vec, BERT) understand word similarity based on context.
4️⃣ Hyponymy
Definition
Hyponymy is a hierarchical relationship where one word (hyponym) has a more specific meaning
than another (hypernym).
Examples of Hyponymy
"dog" and "cat" are hyponyms of the hypernym "animal".
"rose" is a hyponym of "flower".
5. Applications in NLP
1. Word Sense Disambiguation (WSD)
Helps determine correct word meaning in context
Example: "Apple" (fruit) vs. "Apple" (company)
2. Search Engines & Information Retrieval
Uses synonymy & hyponymy for better query expansion
Example: Searching "buy car" should also retrieve "purchase vehicle"
3. Machine Translation
Polysemy and homonymy must be resolved to translate correctly
Example: "Bank" (financial) → "Banco" in Spanish, "Bank" (river) → "Orilla"
4. Text Summarization & Sentiment Analysis
Understanding word relations improves text meaning extraction
6. Conclusion
Homonymy, Polysemy, Synonymy, and Hyponymy are fundamental word relationships in Lexical
Semantics.
NLP challenges include ambiguity, contextual meaning, and synonym expansion.
Machine Learning & WordNet help address these challenges in real-world applications.
These concepts enhance search engines, machine translation, sentiment analysis, and
chatbots.
By incorporating Lexical Semantics, NLP systems can better understand natural language, improve accuracy, and enhance human-like interactions. 🚀
Explain in detail: WordNet, Word Sense Disambiguation (WSD), the dictionary-based approach, and Latent Semantic Analysis.
1️⃣ WordNet
What is WordNet?
WordNet is a large lexical database of English that organizes words into groups of synonyms called
synsets. It was developed at Princeton University by George A. Miller.
Structure of WordNet
Nouns
Verbs
Adjectives
Adverbs
Each word is linked to related words through semantic relationships such as:
1. Synonymy – Words with the same or similar meanings, grouped into synsets (e.g., happy and joyful).
2. Antonymy – Words with opposite meanings (e.g., big and small).
3. Hyponymy-Hypernymy – A hierarchical relation where one word is a more specific type of another
(e.g., dog is a hyponym of animal).
4. Meronymy-Holonymy – Part-whole relationship (e.g., wheel is a meronym of car).
5. Troponymy – A specific manner of performing an action (e.g., whisper is a troponym of speak).
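A quick sketch of querying these relations with NLTK's WordNet interface (assumes the WordNet data has been downloaded):

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]  # first (most common) synset for "dog"
print(dog.definition())      # gloss describing this sense
print(dog.lemma_names())     # synonyms grouped in the synset
print(dog.hypernyms())       # more general concepts, e.g. canine
```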
Applications of WordNet
✔ Word Sense Disambiguation (WSD) – Helps determine the correct meaning of a word based on context.
✔ Information Retrieval – Enhances search engines by finding synonyms and related words.
✔ Machine Translation – Improves accuracy by using structured word relationships.
✔ Chatbots & Conversational AI – Helps understand user queries and generate meaningful
responses.
2️⃣ Word Sense Disambiguation (WSD)
Word Sense Disambiguation (WSD) is the process of determining which sense (meaning) of a word is being used in a given context.
Example of WSD
"He plays the bass guitar." → bass = a musical instrument
"He caught a bass in the lake." → bass = a type of fish
Approaches to WSD
1. Lesk Algorithm (Dictionary-Based Approach)
Compares word definitions with the surrounding words in a sentence.
Example: "Bass" in "He plays the bass guitar" → WordNet definition helps choose "musical
instrument" over "fish."
2. Supervised Machine Learning Approach
Uses labeled datasets where each word is tagged with its correct sense.
Algorithms like Naïve Bayes, Decision Trees, and Neural Networks learn from examples.
3. Unsupervised Approach (Clustering & Embeddings)
Uses large text corpora (unlabeled data) and clusters words based on context similarity.
4. Knowledge-Based Approach (Using WordNet)
Finds semantic relationships between words using WordNet.
Applications of WSD
Machine translation, search engines, question answering, and chatbots all depend on resolving word senses correctly.
3️⃣ Dictionary-Based Approach
The Dictionary-Based Approach in NLP uses dictionaries, lexical databases (like WordNet), and predefined resources to understand word meanings.
How It Works
The Lesk Algorithm compares dictionary definitions of words with the words in the surrounding
context.
Example:
Consider the sentence "He sat by the river bank." The Lesk algorithm prefers the riverbank sense of "bank" because its dictionary definition (sloping land beside a body of water) shares context words such as "river", while the financial-institution definition does not.
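A minimal sketch using NLTK's built-in simplified Lesk (simplified Lesk is a weak baseline, so it may pick an unexpected sense):

```python
import nltk
nltk.download("wordnet")
nltk.download("punkt")
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("He sat by the river bank.")
sense = lesk(context, "bank", pos="n")  # synset whose gloss overlaps the context most
if sense:
    print(sense, "-", sense.definition())
```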
4️⃣ Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) is a mathematical technique used in NLP to find relationships between words and documents by analyzing large text corpora. It builds a term-document matrix and applies singular value decomposition (SVD) to uncover latent (hidden) topics.
Example of LSA
Given a corpus containing documents about football and documents about programming, LSA will group words related to football together and separate programming-related words.
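A small sketch with scikit-learn, using TF-IDF plus truncated SVD (the standard way to implement LSA); the toy documents are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the team won the football match",
    "the striker scored a goal in the football game",
    "python code uses functions and loops",
    "debugging python programs requires reading code",
]
X = TfidfVectorizer().fit_transform(docs)               # term-document matrix
topics = TruncatedSVD(n_components=2).fit_transform(X)  # project into 2 latent topics
print(topics.round(2))  # football docs land near each other, programming docs likewise
```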
Applications of LSA
✔ Information Retrieval & Search Engines – Helps find relevant documents based on meaning.
✔ Text Classification – Groups similar documents.
✔ Recommender Systems – Suggests articles based on topic similarity.
Summary of Concepts

| Concept | Definition | Example Use Case |
| --- | --- | --- |
| WordNet | Lexical database organizing words into synsets (similar meanings) | Helps with synonym detection, query expansion in search engines |
| WSD | Disambiguating multiple meanings of a word based on context | Understanding "bank" as a financial institution vs. a riverbank |
| Dictionary-Based Approach | Uses predefined dictionaries (WordNet, Oxford Dictionary) to understand words | Lesk Algorithm selects the best word meaning using dictionary definitions |
| Latent Semantic Analysis (LSA) | Finds hidden relationships between words using statistical techniques | Improves search engines by understanding semantic similarity |
Conclusion
WordNet provides structured relationships between words.
WSD helps determine the correct word meaning in context.
Dictionary-Based Approaches rely on external knowledge sources like WordNet.
LSA finds deeper semantic relationships using mathematical techniques.
🚀 Together, these techniques enhance NLP applications like search engines, machine translation,
and chatbots by improving semantic understanding.
Tokenization in NLP
What is Tokenization?
Tokenization is the process of splitting text into smaller units, called tokens. These tokens can be
words, subwords, or characters, depending on the type of tokenization used. It is the first step in text
processing for NLP tasks such as sentiment analysis, machine translation, and text summarization.
Example of Tokenization:
Input sentence:
✅ "Natural Language Processing is amazing!"
After tokenization:
✅ [‘Natural’, ‘Language’, ‘Processing’, ‘is’, ‘amazing’, ‘!’]
Types of Tokenization
1️⃣ Word Tokenization
Also called word segmentation.
Splits text into words based on spaces or punctuation.
Common in languages like English, but difficult in languages without spaces (e.g., Chinese,
Japanese).
✅ Example:
Input:
"I love machine learning."
Output (tokens):
[‘I’, ‘love’, ‘machine’, ‘learning’, ‘.’]
💡 Challenges:
Handling compound words (e.g., "New York" should be one token).
Dealing with contractions (e.g., "don't" → ["do", "not"]).
2️⃣ Subword Tokenization
Splits words into smaller meaningful units (e.g., "unhappiness" → ["un", "happiness"]); schemes like BPE and WordPiece let neural models handle rare words.
3️⃣ Character Tokenization
Splits words into individual characters; used in neural network models like LSTMs and Transformers.
✅ Example:
Input: "NLP"
Character tokens: [‘N’, ‘L’, ‘P’]
💡 Challenges:
Results in longer sequences → slower training.
Loses word-level meaning.
4️⃣ Sentence Tokenization (Sentence Segmentation)
Splits paragraphs into individual sentences.
Uses punctuation marks (like `.`, `!`, `?`) to detect sentence boundaries.
Helpful in document processing and summarization.
✅ Example:
Input:
"I love NLP. It is very interesting!"
Output:
[‘I love NLP.’, ‘It is very interesting!’]
💡 Challenges:
Abbreviations (e.g., "Dr.", "U.S.") can be mistaken for sentence boundaries.
Complex sentence structures may confuse basic rules.
Conclusion
Tokenization is a fundamental step in NLP, helping break text into meaningful parts. The choice of tokenization method depends on the application: word and sentence tokenization suit classic pipelines and document processing, subword tokenization suits neural models that must cope with rare words, and character tokenization suits models that need maximum granularity.
🚀 Modern NLP frameworks like NLTK, spaCy, and Hugging Face Transformers provide efficient
tokenization tools!
Stem and Affix Classes of Morphemes
In morphology, words are made up of morphemes, which are the smallest units of meaning.
Morphemes can be classified into two broad categories:
1. Stem
2. Affix
1️⃣ Stem
A stem is the core part of a word that holds its basic meaning. It is the main morpheme to which affixes
(prefixes or suffixes) can be added.
✅ Example:
Word: unhappiness
Stem: happy
Affixes: un- (prefix) and -ness (suffix)
💡 Types of Stems:
Free stem: Can stand alone as a word (e.g., "play" in "playing").
Bound stem: Cannot stand alone and needs an affix (e.g., "ceive" in "receive").
2️⃣ Affix
An affix is a morpheme that attaches to a stem to modify its meaning. Affixes cannot stand alone as
words.
💡 Types of Affixes:
a) Prefix
An affix attached before the stem.
✅ Examples:
re-play → "re" (again) + "play" (stem) = replay (play again)
un-happy → "un" (not) + "happy" = unhappy (not happy)
b) Suffix
An affix attached after the stem.
✅ Examples:
kindness → "kind" (stem) + "-ness" (suffix) = kindness (state of being kind)
teacher → "teach" (stem) + "-er" (suffix) = teacher (one who teaches)
c) Infix (Rare in English)
An affix inserted inside the stem.
✅ Example (Tagalog):
sulat (write) → sumulat (writing); the infix -um- is inserted into the stem.
d) Circumfix
An affix attached around the stem, one part before and one part after.
✅ Example (German):
ge-liebt (loved)
"lieb" (love) is the stem, and ge- and -t form the circumfix.
Conclusion
The stem is the core meaning of the word.
Affixes (prefixes, suffixes, infixes, circumfixes) modify the meaning or function of the word.
Understanding morphology helps in NLP, language learning, and linguistics! 🚀
What are the different techniques for semantic analysis of a statement?
1️⃣ Lexical Semantics (Word-Level Meaning)
Analyzes the meanings of individual words and the relations between them.
✅ Example:
Synonymy: "big" and "large" mean the same.
Hyponymy: "dog" is a hyponym of "animal".
2️⃣ Word Sense Disambiguation (WSD)
Determines which sense of an ambiguous word is intended in context.
✅ Example:
"I went to the bank to withdraw money."
Here, bank means a financial institution.
"The boat was near the bank of the river."
Here, bank refers to the side of a river.
🔹 Techniques Used:
Lesk Algorithm (Dictionary-based)
Machine Learning (Supervised Models)
3️⃣ Named Entity Recognition (NER)
Identifies and classifies names of people, organizations, dates, and places; a code sketch follows the example below.
✅ Example:
"Elon Musk founded Tesla in 2003."
Elon Musk → Person
Tesla → Organization
2003 → Date
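A minimal spaCy sketch for this example (entity labels depend on the model version, so the exact output may vary):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
doc = nlp("Elon Musk founded Tesla in 2003.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
# Elon Musk PERSON
# Tesla ORG
# 2003 DATE
```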
4️⃣ Semantic Role Labeling (SRL)
Identifies who did what to whom by assigning roles such as Agent, Action, and Object.
✅ Example:
"John (Agent) ate (Action) an apple (Object)."
5️⃣ Topic Modeling / Latent Semantic Analysis
Groups words and documents by the hidden topics they share.
✅ Example:
The words "doctor", "nurse", "hospital" have similar meanings and belong to the same topic of
"medical field."
🔹 Applications: Document Classification, Search Engines, Plagiarism Detection.
6️⃣ Distributional Semantics (Word Embeddings: Word2Vec, GloVe, BERT)
Words are represented as vectors based on their context in large text corpora.
Helps measure semantic similarity between words.
✅ Example:
King - Man + Woman = Queen (Word2Vec analogy)
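A short sketch of this analogy using gensim's downloadable pretrained GloVe vectors (the first call downloads the model, roughly 66 MB):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # pretrained 50-dimensional GloVe vectors
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] - king - man + woman ≈ queen
```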
7️⃣ Semantic Parsing (Intent Recognition)
Converts a natural-language utterance into a structured, machine-readable representation.
✅ Example:
"Show me flights from Mumbai to Delhi on Monday."
Converted into structured form:

```json
{
  "intent": "search_flight",
  "origin": "Mumbai",
  "destination": "Delhi",
  "date": "Monday"
}
```
🚀 Modern NLP models like BERT, GPT, and T5 use a combination of these techniques for better
semantic understanding!
What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that enables computers to
understand, interpret, and generate human language. It bridges the gap between human
communication and machine understanding, allowing computers to process and analyze large amounts
of natural language data.
🔹 Example Applications of NLP:
Chatbots (e.g., Siri, Alexa, Google Assistant)
Machine Translation (Google Translate)
Sentiment Analysis (analyzing customer reviews)
Speech Recognition (converting speech to text)
Spam Filtering (detecting spam emails)