Basics of Text Processing - Tokenization and Stemming (NLP Notes)

Unit I Introduction to Natural Language Processing 07 Hours

Introduction: Natural Language Processing, Why NLP is hard? Programming languages Vs Natural
Languages, Are natural languages regular? Finite automata for NLP, Stages of NLP, Challenges and
Issues (Open Problems) in NLP
Basics of text processing: Tokenization, Stemming, Lemmatization, Part of Speech Tagging
Morphological Analysis: What is Morphology? Types of Morphemes, Inflectional morphology &
Derivational morphology, Morphological parsing with Finite State Transducers (FST)
Syntactic Analysis: Syntactic Representations of Natural Language, Parsing Algorithms, Probabilistic
context-free grammars, and Statistical parsing
Semantic Analysis: Lexical Semantics, Relations among lexemes & their senses – Homonymy,
Polysemy, Synonymy, Hyponymy, WordNet, Word Sense Disambiguation (WSD), Dictionary-based
approach, Latent Semantic Analysis

explain in detail Introduction: Natural Language Processing, Why NLP is hard? Programming
languages Vs Natural Languages,

Introduction to Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and linguistics that focuses
on enabling computers to understand, interpret, and generate human language. NLP is used in
applications such as speech recognition, machine translation, sentiment analysis, chatbots, and text
summarization. It bridges the gap between human communication and computer understanding.

NLP involves multiple disciplines, including computer science, linguistics, and machine learning, to
develop algorithms that process and analyze large amounts of natural language data. The key goals of
NLP include:

- Understanding text and speech data.
- Extracting useful information from unstructured language.
- Automating communication between humans and machines.

Why is NLP Hard?

Unlike structured programming languages, natural languages (such as English, Hindi, or Marathi) are
highly ambiguous, context-dependent, and constantly evolving. Several factors make NLP a challenging
problem:

1. Ambiguity
Lexical Ambiguity: A word can have multiple meanings (e.g., "bank" can refer to a riverbank
or a financial institution).
Syntactic Ambiguity: A sentence can be interpreted in multiple ways (e.g., "I saw the man
with a telescope" – did I use the telescope, or did the man have a telescope?).
Semantic Ambiguity: Words with multiple senses create confusion in meaning (e.g., "bark"
can mean the sound of a dog or the outer layer of a tree).
2. Context Dependence
Words and sentences derive meaning from context. For example, "He ran to the bank" can
only be understood correctly if the context (financial institution vs. riverbank) is known.
3. Morphological Complexity
Many languages have complex word formations with prefixes, suffixes, and inflections (e.g.,
"run," "running," "ran").
Some languages, like agglutinative languages (e.g., Turkish), form long words by adding
multiple morphemes, making processing difficult.
4. Variability in Sentence Structure
Different languages have different grammar rules (e.g., Subject-Verb-Object in English vs.
Subject-Object-Verb in Japanese).
Even within the same language, a sentence can be structured in multiple ways without
changing its meaning.
5. Idioms, Metaphors, and Sarcasm
Humans often use figurative language, which is hard for machines to interpret literally. For
example, "kick the bucket" does not mean physically kicking a bucket but refers to someone
dying.
6. Lack of Sufficient Training Data
Some languages and dialects have limited digital resources for training NLP models.
Low-resource languages face difficulties in building accurate models compared to English.
7. Real-Time Processing Challenges
NLP tasks like speech-to-text, chatbot responses, and real-time translation require quick
processing and understanding of the input text.

Programming Languages vs. Natural Languages

| Feature | Programming Languages | Natural Languages |
|---|---|---|
| Structure | Highly structured, follows strict grammar (syntax). | Unstructured, follows flexible grammar rules. |
| Ambiguity | No ambiguity; every statement has a fixed meaning. | High ambiguity; words and sentences can have multiple meanings. |
| Context Dependence | Minimal; each statement is independent and executes in order. | Highly context-dependent; meaning depends on surrounding words and context. |
| Evolution | Changes slowly and in a controlled manner. | Constantly evolving with new words and meanings. |
| Interpretation | Understood the same way by all compilers/interpreters. | Understood differently by people based on culture, dialect, and context. |
| Error Tolerance | Even a small syntax error can cause failure. | Humans can understand sentences even with grammar mistakes. |

Conclusion

NLP is a challenging yet essential field in AI that deals with understanding human language. The
complexity arises due to ambiguity, context dependence, and the evolving nature of languages. Unlike
programming languages, which are structured and follow strict rules, natural languages are flexible,
ambiguous, and context-sensitive, making NLP a difficult problem to solve.

explain in detail Are natural languages regular? Finite automata for NLP

Are Natural Languages Regular?

To answer this, we need to understand regular languages and how they relate to natural languages.

1. What are Regular Languages?

Regular languages are a class of languages that can be represented using regular expressions and can
be recognized by finite automata. These languages follow strict patterns and have limited complexity.

Examples of Regular Languages:

- Strings containing only "a" and "b" (e.g., "abba", "bab").
- Strings that follow a simple pattern, such as `a*b+` (i.e., one or more "b" after zero or more "a").

Regular languages are not powerful enough to capture complex structures like nested dependencies
or long-distance relationships between words, which are common in natural languages.

2. Are Natural Languages Regular?

No, natural languages are not regular because they are far more complex than regular languages.
There are multiple reasons for this:

1. Nested Structures (Recursion)


Many natural language sentences have recursive structures, meaning that phrases can be
embedded inside each other.
Example:
"The cat [that the dog [that the man owns] chased] is sleeping."
This sentence has nested dependencies (one phrase inside another), which cannot be
captured by regular expressions.
2. Long-Distance Dependencies
In natural languages, words in a sentence may have dependencies on words that are far
apart.
Example:
"If the boy who is playing in the park wins, he will get a prize."
The subject "boy" is related to the verb "wins," even though there are many words
between them. Regular languages and finite automata cannot handle such
relationships.
3. Context Sensitivity
The meaning of words and sentences depends on context.
Example:
"Time flies like an arrow."
This could mean "Time moves quickly like an arrow" or "Measure the speed of flies like
you would an arrow."
Regular languages cannot handle such contextual variations.
4. Agreement Constraints
In natural languages, subjects and verbs must agree in number and tense.

Example:
"The dog barks" (correct) vs. "The dog bark" (incorrect).
Regular languages cannot enforce such rules.

Thus, natural languages are not regular because they exhibit recursion, long-distance dependencies,
and complex syntactic rules that cannot be represented by finite-state automata.

Finite Automata for NLP

Although natural languages are not regular, finite automata (FA) are still useful for some basic NLP
tasks like tokenization, stemming, and simple pattern matching.

1. What is a Finite Automaton?

A finite automaton (FA) is a computational model that consists of:

- A finite set of states
- A starting state
- A set of accepting states
- A transition function that determines state changes based on input symbols

There are two types of FA:

1. Deterministic Finite Automata (DFA) – Only one possible transition per input symbol.
2. Non-Deterministic Finite Automata (NFA) – Multiple possible transitions per input symbol.
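
To make this concrete, here is a minimal DFA sketch in Python that recognizes the regular language `a*b+` from the earlier example (the state names q0/q1 are illustrative, not from any standard library):

```python
# Minimal DFA sketch for the regular language a*b+ (state names are illustrative).
def accepts(s: str) -> bool:
    state = "q0"                       # q0: reading optional leading a's
    for ch in s:
        if state == "q0" and ch == "a":
            state = "q0"               # stay: more a's allowed
        elif state in ("q0", "q1") and ch == "b":
            state = "q1"               # q1: at least one b has been seen
        else:
            return False               # any other symbol/order -> reject
    return state == "q1"               # accept only if we ended after >= 1 b

print(accepts("aabbb"))  # True
print(accepts("aa"))     # False: no b
print(accepts("ba"))     # False: a after b breaks the pattern
```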

2. How Are Finite Automata Used in NLP?

Finite automata are useful for:

1. Tokenization
Breaking text into words or meaningful units.
Example: Splitting `"Hello, world!"` into `["Hello", ",", "world", "!"]`.
A DFA can be designed to recognize word boundaries (spaces, punctuation, etc.).
2. Stemming
Reducing words to their root form.
Example: "running" → "run", "played" → "play".
A finite automaton can recognize suffixes and remove them.
3. Lexical Analysis in Parsing
Identifying keywords in programming languages or structured text.
Example: Checking if a string is a valid variable name.
4. Spelling Correction and Simple Grammar Checking
Detecting common misspellings based on predefined rules.
Example: Identifying "hte" as a misspelling of "the".
5. Named Entity Recognition (NER) – Limited Use
Identifying names, dates, and places using predefined patterns.
Example: Recognizing "2024-02-25" as a date.

3. Limitations of Finite Automata in NLP

Cannot handle recursion or long-distance dependencies.
Limited memory – Cannot remember information from distant parts of a sentence.
Not suitable for deep syntactic or semantic analysis.

Conclusion

Natural languages are not regular because they contain recursion, long-distance dependencies,
and contextual variations.
Finite automata are useful for basic NLP tasks like tokenization, stemming, and lexical analysis,
but they are not powerful enough for complex linguistic structures like parsing entire sentences
or understanding meaning in depth.
More advanced models like Context-Free Grammars (CFGs) and Neural Networks are needed for
deeper language processing.

Stages of NLP, Challenges and Issues (Open Problems) in NLP explain in detail

Stages of NLP (Natural Language Processing)


NLP involves multiple steps to process and understand human language. These steps transform raw
text into meaningful insights that machines can work with. The major stages of NLP are:

1. Lexical Analysis (Tokenization and Morphology Processing)

Objective: Break text into meaningful words or tokens and analyze their structure.
Processes Involved:
Tokenization: Splitting a sentence into words, phrases, or subwords.
Example: "I love NLP!" → `["I", "love", "NLP", "!"]`
Stemming: Reducing words to their base form by cutting off prefixes/suffixes.
Example: "running" → "run", "played" → "play"
Lemmatization: Converting words to their dictionary form using linguistic rules.
Example: "am", "is", "are" → "be"
Part of Speech (POS) Tagging: Assigning grammatical labels (noun, verb, adjective) to words.
Example: "The cat sleeps." → `["The/DET", "cat/NOUN", "sleeps/VERB"]`

2. Syntactic Analysis (Parsing & Grammar Checking)

Objective: Analyze the grammatical structure of a sentence.


Processes Involved:
Parsing: Identifying sentence structure based on grammar rules.
Example:

```text
Sentence: "The dog chased the cat."

Parse Tree:
(S
  (NP (DET The) (N dog))
  (VP (V chased) (NP (DET the) (N cat))))
```

Dependency Parsing: Identifying relationships between words.


Example: "The big cat sat on the mat."
"cat" (subject) → "sat" (verb)
"on" (preposition) → "mat" (object)

3. Semantic Analysis (Understanding Meaning)

Objective: Extract the meaning of words, phrases, and sentences.


Processes Involved:
Named Entity Recognition (NER): Identifying names, locations, dates, etc.
Example: "Elon Musk founded Tesla in 2003." → `[Person: Elon Musk]`, `[Company:
Tesla]`, `[Year: 2003]`
Word Sense Disambiguation (WSD): Determining the correct meaning of words in context.
Example: "He went to the bank." (Financial institution or riverbank?)
Lexical Semantics: Analyzing word relationships like synonyms, antonyms, hypernyms
(categories).
Example: "car" → "vehicle" (hypernym), "car" → "automobile" (synonym)

4. Pragmatic Analysis (Context Understanding & Discourse Analysis)

Objective: Understand the meaning beyond words (context, sarcasm, intent).


Processes Involved:
Coreference Resolution: Finding which words refer to the same entity.
Example: "John went to the store. He bought milk."
"He" → "John"
Sentiment Analysis: Determining the emotion behind text.
Example: "This movie is amazing!" → Positive Sentiment
Discourse Analysis: Understanding how sentences relate to each other.
Example: "She failed the exam. However, she did not give up."

5. Speech Processing (Optional, for Speech-Based NLP Systems)

Objective: Convert speech into text (Speech-to-Text) or text into speech (Text-to-Speech).
Processes Involved:
Speech Recognition: Identifying spoken words.
Prosody Analysis: Understanding tone, stress, and rhythm.

6. Machine Learning and Deep Learning for NLP

Objective: Train AI models to learn and improve language understanding.


Approaches Used:
Traditional NLP: Rule-based approaches, statistical methods.
Machine Learning: Logistic regression, decision trees, SVMs.
Deep Learning: Transformer models like BERT, GPT, LSTMs for text processing.

Challenges and Issues (Open Problems) in NLP
Despite advancements, NLP still faces several challenges due to the complexity of human languages.

1. Ambiguity in Language

Lexical Ambiguity: A word can have multiple meanings.


Example: "He saw the bat." (Animal or sports equipment?)
Syntactic Ambiguity: A sentence can be parsed in different ways.
Example: "I saw the man with a telescope." (Who has the telescope?)
Semantic Ambiguity: Different interpretations of the same phrase.
Example: "Visiting relatives can be annoying." (Are the relatives visiting, or is visiting them
annoying?)

2. Understanding Context and Common Sense

Issue: Machines struggle to understand context and real-world knowledge.


Example:
"He put the laptop in the bag and then dropped it."
"It" refers to the bag or the laptop?
Solution Attempts:
Knowledge graphs, pretrained models like GPT, BERT, and LLMs with real-world data.

3. Handling Multiple Languages and Dialects

Issue: NLP models are often trained on English and struggle with low-resource languages.
Example:
Translating idioms: "Break a leg" (English) → "Buena suerte" (Spanish, meaning "Good luck").
Solution Attempts:
Multilingual models like mBERT and Google's M4 models.

4. Data Scarcity and Bias in NLP Models

Issue: Many NLP models require vast amounts of labeled data, which is expensive and time-
consuming to create.
Bias Problem: Models trained on biased data can reflect societal biases.
Example:
AI-generated job applications may show gender bias.
Solution Attempts:
Zero-shot learning, synthetic data generation, fairness-focused AI models.

5. Code-Mixing and Informal Language Usage

Issue: People often mix languages (code-mixing) and use informal/slang text.
Example:
"Mujhe yeh movie bahut acchi lagi!" (Hindi-English mix)
Solution Attempts:

Developing code-mixed NLP models, using transfer learning for informal language.

6. Long-Range Dependencies in Text

Issue: Traditional NLP models struggle with long texts where meaning depends on distant words.
Example:
"The book that I bought last year and kept in my drawer is missing." (What is missing? The
book.)
Solution Attempts:
Transformer models (like BERT, GPT) with attention mechanisms.

7. Sarcasm and Humor Detection

Issue: NLP models fail to detect sarcasm and humor correctly.


Example:
"Oh great, another Monday!" (Is it positive or negative?)
Solution Attempts:
Sentiment analysis with contextual learning and social media datasets.

8. Real-Time NLP Processing Challenges

Issue: Speech recognition, chatbots, and real-time translation require low-latency responses.
Example:
AI assistants like Alexa and Siri must respond instantly.
Solution Attempts:
Edge computing, model optimization, and federated learning.

Conclusion
NLP has evolved significantly, but challenges like ambiguity, context understanding, multilingual
support, and bias remain. Advanced techniques like deep learning, transformers (BERT, GPT), and real-
world knowledge integration are helping improve NLP, but many open problems still require research
and innovation.

explain in detail Basics of text processing: Tokenization, Stemming

Basics of Text Processing: Tokenization and Stemming


Text processing is a fundamental step in Natural Language Processing (NLP). It involves transforming
raw text into a structured format that computers can understand and process efficiently. Two key
processes in text processing are Tokenization and Stemming.

1. Tokenization
What is Tokenization?

Tokenization is the process of breaking down text into smaller units, called tokens. These tokens can be
words, sentences, or subwords, depending on the level of tokenization.

Why is Tokenization Important?

It helps in understanding the structure of text.


It is the first step in NLP tasks such as text analysis, sentiment analysis, and machine translation.
Reduces complexity by converting unstructured text into a structured form.

Types of Tokenization

1. Word Tokenization
Splitting text into individual words.
Example:

```text
Sentence: "I love NLP!"
Word Tokens: ["I", "love", "NLP", "!"]
```

2. Sentence Tokenization
Splitting text into sentences.
Example:

```text
Paragraph: "I love NLP. It's very interesting."
Sentence Tokens: ["I love NLP.", "It's very interesting."]
```

3. Subword Tokenization
Splitting words into meaningful sub-units.
Useful for handling rare words or unknown words in NLP.
Example:

```text
Word: "unhappiness"
Subword Tokens: ["un", "happiness"]
```

4. Character Tokenization
Splitting words into individual characters.
Used in neural network models like LSTMs and Transformers.
Example:

```text
Word: "NLP"
Character Tokens: ["N", "L", "P"]
```

Challenges in Tokenization

Handling Punctuation: Should punctuation marks be considered as separate tokens?


Example: `"Hello, world!"` → Should `"Hello,"` be one token or split into `"Hello"` and
`","`?
Multi-word Expressions: Some words should not be split.
Example: `"New York"` should remain a single entity.
Ambiguity in Language:
`"Don't"` → Should it be split into `["Do", "n't"]` or `["Don't"]`?
Language-Specific Rules: Different languages have different rules for tokenization.
Example: In Chinese, there are no spaces between words, making tokenization complex.

Libraries for Tokenization

- NLTK (Natural Language Toolkit)
- spaCy
- Transformers (Hugging Face)
- WordPiece (used in BERT)
- Byte Pair Encoding (BPE)
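
As a brief, hedged sketch with NLTK (the tokenizer models are assumed to be downloaded once via `nltk.download('punkt')`):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # one-time download of the tokenizer models

text = "I love NLP. It's very interesting."
print(sent_tokenize(text))  # sentence tokens: ['I love NLP.', "It's very interesting."]
print(word_tokenize(text))  # word tokens, with punctuation and clitics split off
```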

2. Stemming
What is Stemming?

Stemming is the process of reducing words to their root form by removing prefixes and suffixes.

Why is Stemming Important?

It reduces word variations and simplifies text processing.


Helps in reducing vocabulary size, making NLP models more efficient.
Improves search results by grouping similar words.

Example of Stemming

```text
Original Words: "running", "runs", "ran"
Stemmed Word: "run"
```

Types of Stemming Algorithms

1. Porter Stemmer (Most Common)


Uses a set of predefined rules to remove suffixes.
Example:

```text
"connection" → "connect"
"arguing" → "argu"
```

2. Lancaster Stemmer (More Aggressive than Porter Stemmer)
Removes more characters, leading to shorter stems.
Example:

```text
"running" → "run"
"argument" → "argu"
```

3. Snowball Stemmer (Improved Porter Stemmer)


Supports multiple languages.
More efficient and accurate than Porter Stemmer.
Example:

```text
"happily" → "happi"
"studies" → "studi"
```

4. Regex-Based Stemmer
Uses regular expressions to remove suffixes.
Example:

```text
"worked" → "work"
"playing" → "play"
```

Challenges in Stemming

Over-Stemming: Cutting too much of a word, losing meaning.


Example: `"university"` → `"univers"` (Incorrect)
Under-Stemming: Not reducing words to a common root.
Example: `"organization"` and `"organizing"` should have the same stem but don't.
Loss of Meaning: Stems may not always be real words.
Example: `"arguing"` → `"argu"` (Not a valid word)

Libraries for Stemming

- NLTK (Natural Language Toolkit)
- spaCy
- SnowballStemmer in Python
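
A short sketch comparing the three NLTK stemmers discussed above; each algorithm applies its own suffix-stripping rules, so the stems differ:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["running", "connection", "arguing", "happily", "studies"]
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
    # Print each stemmer's name alongside its output for the same word list.
    print(stemmer.__class__.__name__, [stemmer.stem(w) for w in words])
```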

Comparison of Tokenization and Stemming

| Feature | Tokenization | Stemming |
|---|---|---|
| Purpose | Break text into smaller units (tokens). | Reduce words to their root form. |
| Example | `"I am running fast."` → `["I", "am", "running", "fast", "."]` | `"running"` → `"run"`, `"arguing"` → `"argu"` |
| Complexity | Simple (splitting words/sentences). | More complex (applying rules to remove suffixes). |
| Accuracy | High (less chance of errors). | Lower (can over-stem or under-stem). |

Conclusion
Tokenization and stemming are crucial text preprocessing steps in NLP. Tokenization splits text into
meaningful units, while stemming simplifies words to their root forms. Both processes help in reducing
text complexity and improving the performance of NLP applications like search engines, chatbots, and
machine translation.

explain in detail Lemmatization, Part of Speech Tagging

Lemmatization and Part of Speech (POS) Tagging in NLP


In Natural Language Processing (NLP), lemmatization and POS tagging play crucial roles in
understanding the structure and meaning of text.

1. Lemmatization
What is Lemmatization?

Lemmatization is the process of reducing a word to its base or dictionary form (lemma) while
considering the word’s context and meaning. Unlike stemming, which removes suffixes blindly,
lemmatization ensures that the transformed word is an actual meaningful word.

Why is Lemmatization Important?

It improves text normalization by converting words to their base forms.


It ensures that words with different forms are treated as the same word.
It is widely used in search engines, chatbots, and text summarization.

Example of Lemmatization

| Word | Lemma |
|---|---|
| running | run |
| better | good |
| studies | study |
| wolves | wolf |
Unlike stemming, lemmatization ensures that `"better"` is lemmatized to `"good"`, while stemming
would return `"better"` as `"better"`, failing to capture the actual base form.

Lemmatization vs. Stemming

| Feature | Lemmatization | Stemming |
|---|---|---|
| Meaning-Based | Yes | No |
| Uses Dictionary | Yes | No |
| Produces Real Words | Yes | Not Always |
| Example ("Studies") | study | studi |
| Example ("Better") | good | better |

Types of Lemmatization Algorithms

1. WordNet Lemmatizer (NLTK-based)


Uses the WordNet lexical database to find the correct base form of a word.
Example: `"running"` → `"run"`, `"better"` → `"good"`.
2. spaCy Lemmatizer
More advanced and faster than WordNet.
Supports multiple languages.
3. TextBlob Lemmatizer
Simplifies the process by automatically identifying lemmas.

Challenges in Lemmatization

Requires POS tagging to determine the correct lemma.


Slower than stemming since it consults dictionaries.
Ambiguity in word meaning can lead to incorrect lemmas.

Lemmatization in Python

Using NLTK’s WordNet Lemmatizer:

```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# nltk.download('wordnet')  # one-time corpus download

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))         # Output: run
print(lemmatizer.lemmatize("better", pos=wordnet.ADJ))  # Output: good
```

2. Part of Speech (POS) Tagging


What is POS Tagging?

POS tagging is the process of assigning parts of speech (noun, verb, adjective, etc.) to words in a
sentence. It helps in understanding the syntactic structure of a text.

Why is POS Tagging Important?

Helps in lemmatization by determining the correct base form of a word.


Useful in text analysis, speech recognition, and machine translation.
Plays a key role in syntactic parsing and named entity recognition (NER).

Example of POS Tagging

Sentence: `"The quick brown fox jumps over the lazy dog."`

| Word | POS Tag |
|---|---|
| The | DT (Determiner) |
| quick | JJ (Adjective) |
| brown | JJ (Adjective) |
| fox | NN (Noun) |
| jumps | VBZ (Verb, third person singular) |
| over | IN (Preposition) |
| the | DT (Determiner) |
| lazy | JJ (Adjective) |
| dog | NN (Noun) |

POS Tags in NLP

| POS Tag | Meaning | Example |
|---|---|---|
| NN | Noun | cat, book, house |
| VB | Verb | run, eat, jump |
| JJ | Adjective | beautiful, quick, lazy |
| RB | Adverb | quickly, silently |
| DT | Determiner | the, a, an |
| PRP | Pronoun | he, she, they |
| IN | Preposition | in, on, over |

Types of POS Tagging Approaches

1. Rule-Based POS Tagging


Uses manually defined grammar rules.
Example: `"The boy runs"` → `"The/DT boy/NN runs/VB"`
2. Statistical POS Tagging (HMM-Based)
Uses probabilistic models like Hidden Markov Models (HMM) to predict POS tags.
More accurate than rule-based methods.
3. Neural Network-Based POS Tagging
Uses deep learning models such as Recurrent Neural Networks (RNNs) and Transformers.
Achieves state-of-the-art performance.

POS Tagging in Python


Using NLTK:

```python
import nltk
from nltk import word_tokenize, pos_tag

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time downloads

sentence = "The quick brown fox jumps over the lazy dog."
words = word_tokenize(sentence)
pos_tags = pos_tag(words)

print(pos_tags)
# Output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'),
#          ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```

Using spaCy:

```python
import spacy
nlp = spacy.load("en_core_web_sm")

sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)

for token in doc:
    print(token.text, token.pos_)
```

Comparison of Lemmatization and POS Tagging

| Feature | Lemmatization | POS Tagging |
|---|---|---|
| Purpose | Reduce words to their base form | Assigns part of speech to words |
| Example | "running" → "run" | "running" → "VB (Verb)" |
| Requires Context? | Yes | Yes |
| Used in? | Text normalization, search engines | Syntactic parsing, named entity recognition |
| Complexity | High (requires a dictionary) | Medium (can use rules or ML) |

Conclusion
Lemmatization helps convert words to their base forms, improving text normalization.
POS tagging helps in understanding word roles in a sentence, which is critical for lemmatization,
machine translation, and text analysis.
Both techniques play an essential role in text preprocessing for NLP applications such as
chatbots, search engines, and voice assistants.

explain in detail Morphological Analysis: What is Morphology? Types of Morphemes

Morphological Analysis in NLP

Morphology is a crucial aspect of Natural Language Processing (NLP) that focuses on understanding
the structure and formation of words. It helps in tasks such as lemmatization, spell-checking, text-
to-speech conversion, and machine translation.

1. What is Morphology?
Definition

Morphology is the branch of linguistics that studies the internal structure of words and how they are
formed from smaller units called morphemes.

Why is Morphology Important in NLP?

Helps in word segmentation for languages where spaces do not separate words (e.g., Chinese,
Japanese).
Improves lemmatization by identifying the base form of words.
Aids in spell-checking, grammar correction, and language translation.
Assists in text synthesis and voice recognition.

2. Morphemes: The Building Blocks of Words


What is a Morpheme?

A morpheme is the smallest unit of meaning in a language. It cannot be broken down further without
losing its meaning.

Example of Morphemes in a Word

Consider the word "unhappiness":

`"un-"` → Prefix (meaning "not")


`"happy"` → Root morpheme (main meaning of the word)
`"-ness"` → Suffix (converts an adjective into a noun)

Thus, `"unhappiness"` consists of three morphemes.

3. Types of Morphemes
Morphemes can be classified into two major types:

A. Free Morphemes

Can stand alone as a complete word.


Carry meaning without needing to attach to other morphemes.

Examples:

| Free Morpheme | Meaning |
|---|---|
| book | A readable object |
| play | Engage in an activity |
| happy | Feeling joy |

B. Bound Morphemes

Cannot stand alone and must be attached to a free morpheme.


Modify the meaning of the free morpheme.
Used in prefixes, suffixes, inflections, and derivations.

Examples:

| Bound Morpheme | Meaning | Example |
|---|---|---|
| un- | Not | unhappy |
| -s | Plural | cats |
| -ing | Continuous tense | playing |
| -ly | Adverb form | quickly |

4. Subcategories of Bound Morphemes


Bound morphemes are divided into inflectional morphemes and derivational morphemes.

A. Inflectional Morphemes

Modify a word’s tense, number, aspect, mood, or possession.


Do not change the word’s grammatical category (noun stays a noun, verb stays a verb).

Examples of Inflectional Morphemes

| Inflectional Morpheme | Function | Example |
|---|---|---|
| -s / -es | Plural | books, watches |
| -'s | Possessive | John's |
| -ed | Past tense | played |
| -ing | Present participle | running |
| -er | Comparative | bigger |
| -est | Superlative | strongest |

Example:

`"cat"` → `"cats"` (plural)


`"walk"` → `"walked"` (past tense)

📌 Inflectional morphemes only add grammatical meaning but do not create a new word.
B. Derivational Morphemes

Change the meaning of a word or convert it into a different grammatical category.
Can turn a verb into a noun, adjective into an adverb, etc.

Examples of Derivational Morphemes

| Derivational Morpheme | Function | Example |
|---|---|---|
| un- | Negation | unhappy |
| re- | Repetition | redo |
| -ly | Adverb formation | quickly |
| -ness | Noun formation | happiness |
| -ment | Noun formation | development |
| -able | Adjective formation | readable |

Example:

`"happy"` → `"happiness"` (adjective → noun)


`"develop"` → `"development"` (verb → noun)
`"act"` → `"action"` (verb → noun)

📌 Derivational morphemes create new words with different meanings.


5. Comparison: Inflectional vs. Derivational Morphemes

| Feature | Inflectional Morpheme | Derivational Morpheme |
|---|---|---|
| Function | Adds grammatical meaning | Changes word meaning or category |
| Creates a new word? | No | Yes |
| Changes grammatical category? | No | Yes (e.g., verb → noun) |
| Example | cat → cats (plural) | happy → happiness (adjective → noun) |

6. Morphological Analysis in NLP


Morphological analysis is the process of analyzing the structure of words and breaking them down
into morphemes.

Techniques Used for Morphological Analysis

1. Morphological Segmentation – Splitting a word into morphemes (see the sketch below).
   - `"unhappily"` → `"un-" + "happy" + "-ly"`
2. Stemming and Lemmatization – Reducing words to their root or base form.
   - `"running"` → `"run"`
   - `"studies"` → `"study"`
3. Morphological Parsing – Analyzing the components of a word and their grammatical role.
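
A naive affix-stripping sketch of morphological segmentation; the affix lists and length checks below are illustrative assumptions, not a real analyzer:

```python
# Naive morphological segmentation sketch; affix lists are illustrative only.
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "ment", "ing", "ly", "ed", "s")

def segment(word: str) -> list:
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) > 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    stem, suffix = word, None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) > 2:
            stem, suffix = word[:-len(s)], "-" + s
            break
    parts.append(stem)
    if suffix:
        parts.append(suffix)
    return parts

print(segment("unhappiness"))  # ['un-', 'happi', '-ness'] (no spelling repair: 'happi', not 'happy')
print(segment("running"))      # ['runn', '-ing'] (no consonant-doubling rule applied)
```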

7. Application of Morphology in NLP

Spell Checking: Identifies incorrect inflections or derivations.
Speech Recognition: Recognizes different word forms.
Machine Translation: Translates words while maintaining correct morphology.
Search Engines: Retrieves relevant documents based on morphological variations of a word.
Text Summarization: Reduces content by understanding word structures.

Conclusion
Morphology is essential in understanding word formation and is widely used in NLP applications like
search engines, chatbots, and speech recognition systems.

Free morphemes can exist alone, while bound morphemes need attachment.
Inflectional morphemes modify words grammatically but don’t create new words.
Derivational morphemes create new words by changing the word's category.

Understanding morphology helps improve text processing, language modeling, and AI-driven NLP
applications! 🚀

Inflectional morphology & Derivational morphology explain in detail

Inflectional Morphology & Derivational Morphology in NLP


Morphology is the study of word formation and structure. It is a critical part of Natural Language
Processing (NLP) because it helps computers understand and process human languages.

Morphology is broadly divided into:

1. Inflectional Morphology – Focuses on grammatical changes in words.


2. Derivational Morphology – Focuses on forming new words by adding affixes.

1. Inflectional Morphology
What is Inflectional Morphology?

Inflectional morphology deals with modifying a word's form to express grammatical features such as
tense, number, gender, case, mood, and comparison, without changing the word's meaning or
category.

📌 Inflection does not create a new word; it just changes the word's grammatical role.
Examples of Inflectional Morphology

| Inflectional Morpheme | Function | Example |
|---|---|---|
| -s / -es | Plural | cats, buses |
| -'s | Possessive | John's book |
| -ed | Past tense | played, walked |
| -ing | Present participle | running, swimming |
| -er | Comparative | bigger, faster |
| -est | Superlative | strongest, tallest |

Examples in Sentences

1. She plays the piano. (Singular: "play" → "plays")


2. They walked to school. (Past tense: "walk" → "walked")
3. This house is bigger than mine. (Comparative: "big" → "bigger")

📌 Important Rules of Inflectional Morphology:


Inflectional morphemes do not change the word’s core meaning.
Inflectional morphemes do not change the word’s grammatical category.
English has only eight inflectional morphemes.

Applications in NLP

Part-of-Speech Tagging (POS Tagging): Identifies the correct grammatical category (e.g.,
"running" as a verb vs. noun).
Machine Translation: Adjusts word forms according to language grammar rules.
Speech Recognition: Recognizes different word inflections to understand context.
Text-to-Speech Systems: Reads out words with correct grammatical inflections.
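
As a sketch of how an NLP library surfaces inflectional information, spaCy attaches morphological features to each token (assuming the `en_core_web_sm` model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("She walked to the biggest house.")

for token in doc:
    # token.morph lists inflectional features, e.g. Tense=Past for "walked"
    print(token.text, token.lemma_, token.morph)
```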

2. Derivational Morphology
What is Derivational Morphology?

Derivational morphology deals with the creation of new words by adding prefixes and suffixes. These
affixes often change the meaning or grammatical category of a word.
📌 Derivation creates a new word with a different meaning or category.
Examples of Derivational Morphology

| Derivational Morpheme | Function | Example |
|---|---|---|
| un- | Negation | unhappy, unclear |
| re- | Repetition | redo, rewrite |
| -ly | Adverb formation | quickly, beautifully |
| -ness | Noun formation | happiness, darkness |
| -ment | Noun formation | development, agreement |
| -able | Adjective formation | readable, drinkable |

Examples in Sentences

1. She is unhappy with her job. ("happy" → "unhappy")
2. His kindness is appreciated. ("kind" → "kindness")
3. He developed a new theory. ("develop" → "development")

📌 Important Rules of Derivational Morphology:


Derivational morphemes often change the word’s grammatical category.
They create new words with different meanings.
The number of derivational morphemes is larger than inflectional morphemes in English.

Applications in NLP

Text Normalization: Converts words to their base forms (e.g., "happiness" → "happy").
Semantic Analysis: Understands word meanings in different contexts.
Information Retrieval: Improves search engines by recognizing derived words.
Machine Translation: Helps in correct word formation in different languages.

3. Differences Between Inflectional and Derivational Morphology

| Feature | Inflectional Morphology | Derivational Morphology |
|---|---|---|
| Function | Modifies a word’s grammatical role | Creates new words with different meanings |
| Changes meaning? | No | Yes |
| Changes grammatical category? | No | Yes (e.g., verb → noun) |
| Number of morphemes in English | 8 | Large |
| Example | dog → dogs (plural) | happy → happiness (adjective → noun) |

4. How Inflectional and Derivational Morphology Help NLP?


1. Word Segmentation: Breaks down words into morphemes for better language understanding.
2. Named Entity Recognition (NER): Identifies entities like names, locations, and dates.
3. Text Summarization: Recognizes the root forms of words to extract important content.
4. Sentiment Analysis: Understands positive/negative expressions by analyzing word variations.

Conclusion
Inflectional morphology modifies words for grammatical correctness without changing their
meaning.
Derivational morphology creates new words and can change grammatical categories.
Both are crucial in NLP applications like machine translation, text analysis, and speech
recognition.

Understanding inflectional and derivational morphology helps improve language models, search
engines, and AI-driven NLP applications! 🚀

Morphological parsing with Finite State Transducers (FST)

Morphological Parsing with Finite State Transducers (FST)

1. What is Morphological Parsing?


Morphological parsing is the process of analyzing the internal structure of words by breaking them
down into morphemes (smallest meaningful units). It helps in understanding how words are formed
and how they can be modified.

For example:

"unhappiness" → "un-" (prefix) + "happy" (root) + "-ness" (suffix)


"running" → "run" (root) + "-ing" (suffix)

Morphological parsing is important in Natural Language Processing (NLP) for text analysis, machine
translation, and speech recognition.

2. What is a Finite State Transducer (FST)?


A Finite State Transducer (FST) is an extension of a Finite State Automaton (FSA). While an FSA only
accepts or rejects strings, an FST maps input strings to output strings.

📌 In Morphological Parsing, an FST maps surface forms (words) to their root form and
morphological structure.

How FST Works in NLP?

An FST consists of states and transitions.


Each transition is labeled with two symbols: an input and an output.
It processes words step-by-step, identifying prefixes, roots, and suffixes.
FST helps in both analysis (breaking down words) and generation (forming words).

3. How FST is Used in Morphological Parsing?


Example 1: Parsing the word "cats"

We want to analyze "cats" into:

Root: "cat"
Suffix: "-s" (plural marker)

An FST performs the following transformation:

| Input | Output |
|---|---|
| c | c |
| a | a |
| t | t |
| s | +PL (plural marker) |

So, "cats" → "cat +PL" (parsed output)

Example 2: Parsing the word "running"

We want to analyze "running" into:

Root: "run"
Suffix: "-ing" (progressive marker)

An FST processes the word:

| Input | Output |
|---|---|
| r | r |
| u | u |
| n | n |
| n | ε (null, doubling rule) |
| i | i |
| n | n |
| g | +ING (progressive marker) |

So, "running" → "run +ING" (Parsed output)

4. Structure of an FST for Morphological Parsing


A typical Finite State Transducer for morphological parsing consists of:

1. Initial state – Where processing starts.


2. Transitions – Moves between states based on input characters.
3. Output labels – Converts surface words to their root form with grammatical markers.
4. Final state – Marks the end of word processing.

📌 Example FST Diagram for "cats"


```text
(q0) --c/c--> (q1) --a/a--> (q2) --t/t--> (q3) --s/+PL--> (q4)
```

This FST takes "cats" as input and returns "cat+PL".
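
A minimal Python sketch of this transducer; the states and transition table mirror the toy diagram above and are illustrative only:

```python
# Toy FST sketch for "cats" -> "cat+PL"; states/transitions mirror the diagram above.
# Each entry: (state, input_symbol) -> (next_state, output_symbol)
transitions = {
    ("q0", "c"): ("q1", "c"),
    ("q1", "a"): ("q2", "a"),
    ("q2", "t"): ("q3", "t"),
    ("q3", "s"): ("q4", "+PL"),
}
final_states = {"q3", "q4"}  # accept bare "cat" or "cat" plus the plural marker

def transduce(word: str):
    state, output = "q0", ""
    for ch in word:
        if (state, ch) not in transitions:
            return None  # reject: no transition defined for this symbol
        state, out = transitions[(state, ch)]
        output += out
    return output if state in final_states else None

print(transduce("cats"))  # cat+PL
print(transduce("cat"))   # cat
print(transduce("dog"))   # None (not in this toy lexicon)
```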

5. Applications of FST in NLP


🔹 Lemmatization – Converts words to their base form (e.g., "running" → "run").
🔹 Part-of-Speech Tagging (POS Tagging) – Identifies word categories using morphological
information.
🔹 Spell Checking – Recognizes word variations and corrects misspellings.
🔹 Machine Translation – Helps in translating words accurately by identifying their root and suffixes.
🔹 Text-to-Speech (TTS) Systems – Converts text into speech by recognizing morphological structures.
6. Advantages of FST in NLP
✅ Efficient – Works quickly for large text corpora.
✅ Compact Representation – Stores multiple word forms in a single automaton.
✅ Reversible – Can be used for both analysis (parsing) and generation (word formation).
✅ Handles Complex Word Structures – Useful for languages with rich morphology like Hindi, German,
and Arabic.

7. Limitations of FST in NLP


❌ Cannot Handle Irregular Words Well – FST struggles with words like "went" (past of "go").
❌ Requires Large Lexicons – Needs a comprehensive dictionary of morphemes.
❌ Ambiguity in Parsing – Some words have multiple possible analyses (e.g., "running" can be a verb or
a noun).

8. Conclusion
Morphological parsing is essential for understanding word structures in NLP.
Finite State Transducers (FSTs) are powerful tools for mapping words to their root forms and
affixes.
FST-based morphological analysis is used in lemmatization, POS tagging, spell checking, and
machine translation.
While FST is efficient, it struggles with irregular words and ambiguity, requiring additional
processing techniques.

🔹 Overall, FST is a fundamental technique in NLP for breaking down words into meaningful
components! 🚀

Syntactic Analysis: Syntactic Representations of Natural Language, Parsing Algorithms explain in detail

Syntactic Analysis in NLP

1. What is Syntactic Analysis?


Syntactic analysis, also known as parsing, is the process of analyzing the grammatical structure of a
sentence. It determines how words in a sentence relate to each other, following rules of syntax
(sentence structure).
📌 Example:
Sentence: "The cat sleeps on the mat."
Syntactic structure:
Subject: "The cat"
Verb: "sleeps"
Prepositional phrase: "on the mat"

Why is Syntactic Analysis Important?

✅ Helps in machine translation


✅ Used in chatbots and voice assistants
✅ Improves information retrieval
✅ Aids in grammar checking and sentence correction
2. Syntactic Representations of Natural Language
To understand syntax, NLP represents sentences using formal structures. Some key representations
are:

A. Parse Tree (Constituency Parsing)

A parse tree represents the syntactic structure of a sentence based on phrase structure grammar. It
breaks a sentence into constituents (phrases and sub-phrases).
📌 Example Parse Tree for "The cat sleeps."
```text
      S
     / \
   NP   VP
  /  \   |
 DT   N  V
 |    |  |
The  cat sleeps
```

S = Sentence
NP = Noun Phrase ("The cat")
VP = Verb Phrase ("sleeps")
DT = Determiner ("The")
N = Noun ("cat")
V = Verb ("sleeps")

Use Case:

Grammar checking in applications like Grammarly

B. Dependency Tree (Dependency Parsing)

A dependency tree represents grammatical relations between words. Here, each word depends on
another word, forming a directed graph.
📌 Example Dependency Tree for "The cat sleeps."
```text
  sleeps
  /    \
cat    The
```

"sleeps" (root verb) depends on nothing


"cat" depends on "sleeps"
"The" depends on "cat"

Use Case:

Question answering systems like Google Assistant, Siri

3. Parsing Algorithms in NLP


Parsing algorithms analyze sentences and generate syntactic structures. The two main types of
parsing are:

A. Constituency Parsing (Phrase-Structure Parsing)

Uses context-free grammar (CFG) rules


Builds a parse tree
Example Rule:

```text
S → NP VP
NP → DT N
VP → V
```

Parses sentences top-down (breaking a sentence into smaller parts)

1️⃣ Top-Down Parsing


✅ Starts with sentence (S) → breaks into smaller parts using grammar rules
✅ Example Algorithm: Recursive Descent Parsing
✅ Drawback: Can fail for left-recursive grammar rules
2️⃣ Bottom-Up Parsing
✅ Starts with words → combines them into phrases, forming a sentence (S)
✅ Example Algorithm: Shift-Reduce Parsing
✅ More efficient than top-down parsing
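
As a hedged sketch, NLTK provides both styles; the toy grammar below extends the rules above with a small lexicon so the example runs:

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> DT N
    VP -> V
    DT -> 'The'
    N -> 'cat'
    V -> 'sleeps'
""")

# Top-down: recursive descent. (nltk.ShiftReduceParser is the bottom-up counterpart.)
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse(["The", "cat", "sleeps"]):
    tree.pretty_print()
```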
B. Dependency Parsing

Determines dependencies between words


Uses dependency grammar rules
Example Algorithm:
Eisner’s Algorithm (Efficient for dependency trees)
MaltParser (Machine learning-based parser)
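
For a quick look at dependency output in practice, a sketch with spaCy (assuming the `en_core_web_sm` model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("The cat sleeps on the mat.")

for token in doc:
    # Each word is linked to its syntactic head by a labeled dependency.
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
```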

4. Challenges in Parsing NLP


❌ Ambiguity:
"I saw a man with a telescope."
Did I see a man using a telescope?
Or did I see a man who had a telescope?

❌ Complex Sentences: Long sentences increase computational complexity.


❌ Different Languages: Syntax varies across languages.
5. Applications of Syntactic Analysis
🔹 Chatbots & Voice Assistants – Understanding sentence structure in conversations
🔹 Grammar Checking – Identifying incorrect syntax
🔹 Search Engines – Improving query understanding
🔹 Machine Translation – Translating sentences accurately
6. Conclusion
Syntactic analysis is crucial in NLP to understand sentence structure.
It can be represented using parse trees (Constituency Parsing) or dependency trees (Dependency
Parsing).
Parsing algorithms like Recursive Descent Parsing, Shift-Reduce Parsing, and Dependency
Parsing help analyze sentences.
Despite challenges like ambiguity and complexity, parsing is essential for AI applications,
chatbots, and translation systems. 🚀

explain in detail Probabilistic context-free grammars, and Statistical parsing

Probabilistic Context-Free Grammars (PCFG) and Statistical


Parsing in NLP

1. Introduction
Natural Language Processing (NLP) often requires syntactic analysis, which involves understanding the
structure of a sentence. Traditional Context-Free Grammars (CFGs) define sentence structures using
rules, but they do not handle ambiguity effectively. To improve parsing, we use Probabilistic Context-
Free Grammars (PCFGs) and Statistical Parsing, which incorporate probabilities to determine the
most likely parse tree.

2. Context-Free Grammars (CFG) Recap
A Context-Free Grammar (CFG) is a set of production rules used to generate valid sentences. It consists
of:

Terminals (T): Words in the language (e.g., "cat", "sleeps")


Non-terminals (N): Syntactic categories (e.g., S → Sentence, NP → Noun Phrase)
Production rules (P): Define how symbols are replaced
Start symbol (S): The root of the grammar

Example CFG

```text
S → NP VP
NP → DT N
VP → V
DT → "The"
N → "cat"
V → "sleeps"
```

Sentence: "The cat sleeps."


Parse tree:

```text
      S
     / \
   NP   VP
  /  \   |
 DT   N  V
 |    |  |
The  cat sleeps
```

Limitations of CFG

Cannot handle ambiguity (multiple valid parse trees)


Lacks ability to rank parse trees by likelihood
Cannot capture real-world sentence probabilities

To address these, we use Probabilistic Context-Free Grammars (PCFGs).

3. Probabilistic Context-Free Grammars (PCFG)


What is PCFG?

A PCFG is an extension of CFG where each production rule is assigned a probability. These probabilities
help determine the most likely parse tree for a given sentence.

PCFG Components

Same CFG rules (terminals, non-terminals, start symbol)


Probability distribution for each rule

Probabilities sum to 1 for rules with the same non-terminal on the left-hand side

Example PCFG

```text
S → NP VP      [1.0]
NP → DT N      [0.7]
NP → N         [0.3]
VP → V         [0.6]
VP → V NP      [0.4]
DT → "The"     [1.0]
N → "cat"      [0.5]
N → "dog"      [0.5]
V → "sleeps"   [0.6]
V → "eats"     [0.4]
```

Sentence Parsing with PCFG

For the sentence "The cat sleeps", we can calculate probabilities for different parse trees:
1️⃣ Parse Tree 1:
```text
            S (1.0)
           /       \
     NP (0.7)     VP (0.6)
      /    \          |
 DT (1.0)  N (0.5)  V (0.6)
    |        |        |
  "The"    "cat"  "sleeps"
```

Probability = 1.0 × 0.7 × 1.0 × 0.5 × 0.6 = 0.21


If there were multiple valid parse trees, PCFG would select the one with the highest probability.
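
A sketch of the same toy PCFG in NLTK; `ViterbiParser` returns the single most probable parse, matching the hand calculation above:

```python
import nltk

# The toy PCFG above, written in NLTK's grammar notation.
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> DT N [0.7] | N [0.3]
    VP -> V [0.6] | V NP [0.4]
    DT -> 'The' [1.0]
    N -> 'cat' [0.5] | 'dog' [0.5]
    V -> 'sleeps' [0.6] | 'eats' [0.4]
""")

parser = nltk.ViterbiParser(grammar)  # returns the most probable parse
for tree in parser.parse(["The", "cat", "sleeps"]):
    print(tree)
    print(tree.prob())  # 0.21, matching the hand calculation above
```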

Advantages of PCFG

✅ Resolves ambiguity by ranking parse trees


✅ Helps in speech recognition and machine translation
✅ Provides a statistical way to model language
4. Statistical Parsing
What is Statistical Parsing?

Statistical Parsing uses machine learning and probability to select the best parse tree for a sentence
based on training data. It learns from large datasets instead of manually defining grammar rules.

Types of Statistical Parsing

1️⃣ Supervised Parsing – Uses labeled parse trees from treebanks (e.g., Penn Treebank)
2️⃣ Unsupervised Parsing – Learns syntax from raw text (no labeled trees)
Methods of Statistical Parsing
A. Probabilistic CYK Parsing

Uses PCFG and the CYK algorithm to find the most likely parse tree
Bottom-up approach (builds trees from words → sentences)
Works best for Chomsky Normal Form (CNF) grammars

B. Maximum Likelihood Estimation (MLE)

Counts how often each rule appears in training data


Calculates the probability as:

P(A → BC) = Count(A → BC) / Count(A)
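
A tiny sketch of this MLE computation with hypothetical rule counts (the numbers are illustrative, not from a real treebank):

```python
from collections import Counter

# Hypothetical rule counts from a toy treebank (illustrative numbers).
rule_counts = Counter({("NP", ("DT", "N")): 70, ("NP", ("N",)): 30})

total_np = sum(c for (lhs, _), c in rule_counts.items() if lhs == "NP")
for (lhs, rhs), c in rule_counts.items():
    print(f"P({lhs} -> {' '.join(rhs)}) = {c / total_np:.2f}")
# P(NP -> DT N) = 0.70
# P(NP -> N) = 0.30
```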

C. Machine Learning-Based Parsers

1. Collins Parser (based on lexicalized PCFG)


2. Stanford Parser (uses deep learning and probabilistic models)
3. Berkeley Parser (unsupervised statistical parser)

5. Example: PCFG vs. Statistical Parsing


Sentence: "The old man the boat."

(1) Rule-Based CFG Parsing

CFG produces multiple parse trees (ambiguity)


Cannot determine which tree is correct

(2) PCFG Parsing

Assigns probabilities to parse trees


Selects most probable tree

(3) Statistical Parsing

Trained on large datasets (real-world usage)


Predicts correct parse tree based on context

6. Challenges in PCFG and Statistical Parsing


❌ Sparse Data Problem – Some sentence structures may not appear in training data
❌ Computational Complexity – PCFG parsing requires high computation power
❌ Long-Range Dependencies – Hard to capture relationships in long sentences
❌ Structural Ambiguity – Sentences like "Time flies like an arrow" can have multiple interpretations
7. Applications of PCFG and Statistical Parsing

✅ Speech Recognition – Improves understanding of spoken sentences
✅ Machine Translation – Enhances translation quality
✅ Question Answering Systems – Helps chatbots understand queries
✅ Grammar Checking Tools – Detects grammatical errors
✅ Information Extraction – Identifies relationships between words
8. Conclusion
PCFG extends CFG by assigning probabilities to production rules
Statistical Parsing uses machine learning to select the best parse tree
These methods are crucial in speech recognition, chatbots, and machine translation
Despite challenges, they help improve NLP accuracy and efficiency 🚀

Semantic Analysis: Lexical Semantic explain in detail

Semantic Analysis: Lexical Semantics in NLP

1. Introduction to Semantic Analysis


Semantic analysis in Natural Language Processing (NLP) deals with understanding the meaning of
words, phrases, and sentences. It focuses on extracting contextual meaning rather than just
identifying words and their structure.

What is Lexical Semantics?

Lexical Semantics is a branch of linguistic semantics that studies the meaning of words, their
relationships, and how they contribute to the meaning of sentences. It involves:

Understanding word meanings


Identifying relationships between words
Handling ambiguity in language
Word sense disambiguation (WSD) – determining the correct meaning of a word in context

2. Key Concepts in Lexical Semantics


1️⃣ Lexeme
A lexeme is the basic unit of meaning in a language. It includes a base form and its variations
(inflections).
Example:

Lexeme: "run"
Inflected forms: "runs", "running", "ran"

2️⃣ Sense and Reference
Sense: The internal meaning of a word in a specific context
Example: "bank" (financial institution) vs. "bank" (riverbank)
Reference: The actual object or concept the word refers to
Example: "The Eiffel Tower" refers to a specific monument in Paris.

3️⃣ Word Relations in Lexical Semantics


A. Homonymy

Words that have the same spelling or pronunciation but different meanings.
Example:

"Bat" (flying mammal) vs. "Bat" (used in cricket)

B. Polysemy

A single word with multiple related meanings.


Example:

"Head"
Body part ("He hit his head.")
Leader ("She is the head of the company.")
Front part of an object ("Head of the queue.")

C. Synonymy

Words that have similar meanings.


Example:

"Big" and "Large"


"Fast" and "Quick"

D. Antonymy

Words that have opposite meanings.


Types of Antonyms:

Gradable antonyms: Have different degrees (e.g., "hot" vs. "cold")


Complementary antonyms: Either one or the other (e.g., "dead" vs. "alive")
Relational antonyms: Express the same relationship from opposite perspectives (e.g., "buy" vs. "sell")

explain in detail Relations among lexemes & their senses –Homonymy, Polysemy, Synonymy,
Hyponymy

Relations Among Lexemes & Their Senses in Lexical Semantics


In Lexical Semantics, words (lexemes) are not isolated; they have relationships that define their
meanings and usage in language. Understanding these relationships helps in word sense
disambiguation (WSD), machine translation, and information retrieval in NLP.
The key semantic relationships among lexemes are:

1. Homonymy
2. Polysemy
3. Synonymy
4. Hyponymy

1️⃣ Homonymy
Definition

Homonymy occurs when two words have the same spelling or pronunciation but different meanings
and origins.

Types of Homonyms

1. Homophones: Words that sound the same but have different meanings and spellings.
Example:
"Flour" (used in baking) vs. "Flower" (a plant part)
"Write" (to compose text) vs. "Right" (correct)
2. Homographs: Words that have the same spelling but different pronunciations and meanings.
Example:
"Lead" (/lɛd/, a metal) vs. "Lead" (/liːd/, to guide)
"Tear" (/tɪər/, to rip) vs. "Tear" (/tɛər/, a drop from the eye)
3. Perfect Homonyms: Words that have both identical spelling and pronunciation but different
meanings.
Example:
"Bat" (a flying mammal) vs. "Bat" (used in cricket)

NLP Challenge

Homonyms create ambiguity in language processing. Example:

"He went to the bank." (Does it mean a financial institution or a riverbank?)


NLP disambiguation techniques like Word Sense Disambiguation (WSD) help determine the
correct meaning.

2️⃣ Polysemy
Definition

Polysemy occurs when a single word has multiple related meanings. Unlike homonyms, polysemous
words share a common origin.

Examples of Polysemy

1. Head
Body part: "She hurt her head."
Leader: "He is the head of the department."
Front/top part: "Meet me at the head of the queue."
2. Bank
Financial institution: "I deposited money in the bank."
Riverbank: "They sat by the river bank."
3. Mouth
Body part: "She opened her mouth."
Entrance: "The mouth of the cave."

NLP Challenge

Lexical ambiguity: "head" can refer to body parts, leadership, or position.


Context is crucial in resolving ambiguity. Example:
"The head of the company spoke." (Head = leader)
"She washed her head." (Head = body part)
Word Sense Disambiguation (WSD) and semantic role labeling help resolve such ambiguity.

3️⃣ Synonymy
Definition

Synonymy refers to words that have similar or nearly identical meanings in a specific context.

Examples of Synonyms

1. Big vs. Large


"That is a big house."
"That is a large house."
2. Buy vs. Purchase
"I will buy a new car."
"I will purchase a new car."
3. Begin vs. Start
"Let’s begin the meeting."
"Let’s start the meeting."

Types of Synonyms

1. Absolute Synonyms: Words that mean exactly the same in all contexts (rare in natural languages).
Example: "Furnace" vs. "Boiler" (in some contexts)
2. Near Synonyms: Words that are similar but have slight differences.
Example: "Slim" vs. "Skinny" (Both mean thin, but "skinny" can have a negative connotation.)
3. Contextual Synonyms: Words that are synonyms only in certain contexts.

Example: "Buy" and "Purchase" are synonyms in formal writing, but "purchase" is less
common in casual speech.

NLP Challenge

Synonyms affect text retrieval: A search for "big house" should also find "large house".
Thesaurus-based models (like WordNet) help group synonyms for NLP applications.
Word embeddings (like Word2Vec, BERT) understand word similarity based on context.

4️⃣ Hyponymy
Definition

Hyponymy is a hierarchical relationship where one word (hyponym) has a more specific meaning
than another (hypernym).

Examples of Hyponymy

1. Animal (Hypernym) → Dog (Hyponym) → Golden Retriever (More Specific Hyponym)
2. Fruit (Hypernym) → Apple (Hyponym) → Granny Smith (More Specific Hyponym)
3. Vehicle (Hypernym) → Car (Hyponym) → Sedan (More Specific Hyponym)

NLP Use Case

WordNet is a lexical database that organizes words into hypernym-hyponym relationships.


Query Expansion: A search for "fruit" can also retrieve "apple" and "banana" (see the sketch below).
Text Classification: Recognizing "dog" lets a document be grouped under the broader "animal" class.
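For instance, a minimal sketch of hyponym-based query expansion with NLTK's WordNet interface (this assumes nltk is installed and the 'wordnet' corpus downloaded; the exact lemma list depends on the WordNet version):

```python
from nltk.corpus import wordnet as wn

# First noun sense of "fruit"; collect lemma names of its direct hyponyms.
fruit = wn.synset('fruit.n.01')
expansion = [lemma.name() for s in fruit.hyponyms() for lemma in s.lemmas()]

# A query for "fruit" could be expanded with these more specific terms.
print(expansion[:10])
```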

Comparison of Word Relations

| Relation | Definition | Example |
|----------|------------|---------|
| Homonymy | Same spelling/pronunciation, different meanings (unrelated) | "Bat" (animal) vs. "Bat" (cricket) |
| Polysemy | Same spelling/pronunciation, multiple related meanings | "Head" (body part, leader) |
| Synonymy | Different words, similar meanings | "Big" vs. "Large" |
| Hyponymy | One word is a more specific type of another | "Dog" is a hyponym of "Animal" |

5. Applications in NLP
1. Word Sense Disambiguation (WSD)
Helps determine correct word meaning in context
Example: "Apple" (fruit) vs. "Apple" (company)
2. Search Engines & Information Retrieval
Uses synonymy & hyponymy for better query expansion
Example: Searching "buy car" should also retrieve "purchase vehicle"
3. Machine Translation

Polysemy and homonymy must be resolved to translate correctly
Example: "Bank" (financial) → "Banco" in Spanish, "Bank" (river) → "Orilla"
4. Text Summarization & Sentiment Analysis
Understanding word relations improves text meaning extraction

6. Conclusion
Homonymy, Polysemy, Synonymy, and Hyponymy are fundamental word relationships in Lexical
Semantics.
NLP challenges include ambiguity, contextual meaning, and synonym expansion.
Machine Learning & WordNet help address these challenges in real-world applications.
These concepts enhance search engines, machine translation, sentiment analysis, and
chatbots.

By incorporating Lexical Semantics, NLP systems can better understand natural language, improve accuracy, and enhance human-like interactions. 🚀

WordNet, Word Sense Disambiguation (WSD), Dictionary based approach, Latent Semantic Analysis
explain in detail

WordNet, Word Sense Disambiguation (WSD), Dictionary-Based Approach, and Latent Semantic Analysis (LSA) in NLP
These four concepts play a crucial role in Natural Language Processing (NLP) by helping machines
understand word meanings, relationships, and context.

1️⃣ WordNet
What is WordNet?

WordNet is a large lexical database of English that organizes words into groups of synonyms called
synsets. It was developed at Princeton University by George A. Miller.

Structure of WordNet

WordNet categorizes words into four types:

Nouns
Verbs
Adjectives
Adverbs

Each word is linked to related words through semantic relationships such as:

1. Synonymy – Words with similar meanings (e.g., happy and joyful).

2. Antonymy – Words with opposite meanings (e.g., big and small).
3. Hyponymy-Hypernymy – A hierarchical relation where one word is a more specific type of another
(e.g., dog is a hyponym of animal).
4. Meronymy-Holonymy – Part-whole relationship (e.g., wheel is a meronym of car).
5. Troponymy – A specific manner of performing an action (e.g., whisper is a troponym of speak).

Example from WordNet

For the word "car", WordNet provides:

Synonyms: Automobile, Motorcar
Hypernym (Generalization): Vehicle
Hyponyms (Types of Car): Sedan, SUV, Convertible
Meronyms (Parts of Car): Engine, Wheel, Seat
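These relations can be browsed programmatically with NLTK's WordNet API (a sketch assuming nltk is installed and the 'wordnet' corpus downloaded):

```python
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')        # first noun sense: the automobile
print(car.lemma_names())           # synonyms: ['car', 'auto', 'automobile', ...]
print(car.hypernyms())             # generalization: [Synset('motor_vehicle.n.01')]
print(car.hyponyms()[:3])          # a few specific types of car
print(car.part_meronyms()[:3])     # a few parts of a car
```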

Applications of WordNet in NLP

✔ Word Sense Disambiguation (WSD) – Helps determine the correct meaning of a word based on
context.
✔ Information Retrieval – Enhances search engines by finding synonyms and related words.
✔ Machine Translation – Improves accuracy by using structured word relationships.
✔ Chatbots & Conversational AI – Helps understand user queries and generate meaningful
responses.

2️⃣ Word Sense Disambiguation (WSD)


What is WSD?

Word Sense Disambiguation (WSD) is the process of determining which sense (meaning) of a word is
being used in a given context.

Why is WSD Challenging?

Many words have multiple meanings (polysemy).


Words can be homonyms (same spelling, different meanings).
The correct meaning depends on sentence structure and surrounding words.

Example of WSD

Consider the word "bank":

1. "He deposited money in the bank." (Bank = financial institution)


2. "He sat by the river bank." (Bank = land beside a river)

An NLP model needs contextual information to select the correct meaning.

Approaches to WSD

1. Dictionary-Based Approach (Lesk Algorithm)
Uses dictionaries like WordNet.

Compares word definitions with the surrounding words in a sentence.
Example: "Bass" in "He plays the bass guitar" → WordNet definition helps choose "musical
instrument" over "fish."
2. Supervised Machine Learning Approach
Uses labeled datasets where each word is tagged with its correct sense.
Algorithms like Naïve Bayes, Decision Trees, and Neural Networks learn from examples.
3. Unsupervised Approach (Clustering & Embeddings)
Uses large text corpora (unlabeled data) and clusters words based on context similarity.
4. Knowledge-Based Approach (Using WordNet)
Finds semantic relationships between words using WordNet.

Applications of WSD

✔ Machine Translation – Ensures correct word translation based on context.


✔ Search Engines – Improves search accuracy by resolving ambiguous queries.
✔ Text Summarization – Selects the most relevant meaning to generate summaries.
3️⃣ Dictionary-Based Approach
What is the Dictionary-Based Approach?

The Dictionary-Based Approach in NLP uses dictionaries, lexical databases (like WordNet), and
predefined resources to understand word meanings.

How It Works

1. The system looks up a word in a dictionary.


2. It retrieves all possible meanings and definitions.
3. It compares the meanings with the surrounding words to find the best match.
4. The best-matching sense is selected.

Lesk Algorithm (A Popular Dictionary-Based WSD Algorithm)

The Lesk Algorithm compares dictionary definitions of words with the words in the surrounding
context.
Example:
Consider the sentence "He sat by the river bank."

WordNet Definitions for "Bank":


1. "A financial institution where money is deposited."
2. "Land beside a river."
The algorithm finds "river" in the second definition → correct sense is "land beside a river".
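NLTK ships a simplified Lesk implementation that can reproduce this idea (a sketch assuming nltk plus the 'wordnet' and 'punkt' data are available; gloss-overlap Lesk is heuristic, so the returned sense may not always match intuition):

```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "He sat by the river bank."
sense = lesk(word_tokenize(sentence), 'bank', 'n')  # restrict to noun senses
print(sense, '-', sense.definition() if sense else 'no sense found')
```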

Advantages & Disadvantages

✔ Simple and easy to implement.


✔ Requires no labeled training data.
✖ Depends on dictionary quality.
✖ Cannot handle new words or slang efficiently.
Applications

✔ WSD – Helps select the correct word meaning.


✔ Question Answering – Finds accurate answers by understanding word sense.
✔ Text Mining & Summarization – Extracts meaningful information.
4️⃣ Latent Semantic Analysis (LSA)
What is LSA?

Latent Semantic Analysis (LSA) is a mathematical technique used in NLP to find relationships between
words and documents by analyzing large text corpora.

How LSA Works

1. Builds a Term-Document Matrix (TDM)


Rows = Words
Columns = Documents
Values = Word frequencies in documents
2. Applies Singular Value Decomposition (SVD)
Reduces dimensionality by identifying latent (hidden) relationships between words and
topics.
3. Finds Semantic Similarity
Words that appear in similar contexts get similar numerical representations.

Example of LSA

Consider three documents:

1. "I love playing football."


2. "Football is a great sport."
3. "Python is a popular programming language."

LSA will group words related to football together and separate programming-related words.
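A minimal LSA sketch for these three documents using scikit-learn (assumed installed): build a TF-IDF term-document matrix, then apply truncated SVD to obtain low-dimensional latent vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "I love playing football.",
    "Football is a great sport.",
    "Python is a popular programming language.",
]

tdm = TfidfVectorizer().fit_transform(docs)        # documents x terms matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tdm)               # each row: a document in latent space
print(doc_vectors)
```

Measuring cosine similarity on `doc_vectors` shows the two football documents ending up closer to each other than to the programming document.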

Advantages of LSA

✔ Finds synonyms & conceptually related words


✔ Reduces noise in text data
✔ Improves search engine accuracy
Disadvantages

✖ Ignores word order and syntax


✖ Computationally expensive for large datasets
Applications of LSA

✔ Information Retrieval & Search Engines – Helps find relevant documents based on meaning.
✔ Text Classification – Groups similar documents.
✔ Recommender Systems – Suggests articles based on topic similarity.
Summary of Concepts

| Concept | Definition | Example Use Case |
|---------|------------|------------------|
| WordNet | Lexical database organizing words into synsets (similar meanings) | Helps with synonym detection, query expansion in search engines |
| WSD | Disambiguating multiple meanings of a word based on context | Understanding "bank" as a financial institution vs. a riverbank |
| Dictionary-Based Approach | Uses predefined dictionaries (WordNet, Oxford Dictionary) to understand words | Lesk Algorithm selects the best word meaning using dictionary definitions |
| Latent Semantic Analysis (LSA) | Finds hidden relationships between words using statistical techniques | Improves search engines by understanding semantic similarity |

Conclusion
WordNet provides structured relationships between words.
WSD helps determine the correct word meaning in context.
Dictionary-Based Approaches rely on external knowledge sources like WordNet.
LSA finds deeper semantic relationships using mathematical techniques.

🚀 Together, these techniques enhance NLP applications like search engines, machine translation,
and chatbots by improving semantic understanding.

Explain Tokenization with its different types

Tokenization in NLP

What is Tokenization?
Tokenization is the process of splitting text into smaller units, called tokens. These tokens can be
words, subwords, or characters, depending on the type of tokenization used. It is the first step in text
processing for NLP tasks such as sentiment analysis, machine translation, and text summarization.

Example of Tokenization:

Input sentence:
✅ "Natural Language Processing is amazing!"

After tokenization:
✅ [‘Natural’, ‘Language’, ‘Processing’, ‘is’, ‘amazing’, ‘!’]
Types of Tokenization
1️⃣ Word Tokenization
Also called word segmentation.
Splits text into words based on spaces or punctuation.
Common in languages like English, but difficult in languages without spaces (e.g., Chinese,
Japanese).

✅ Example:
Input:
"I love machine learning."
Output (tokens):
[‘I’, ‘love’, ‘machine’, ‘learning’, ‘.’]
💡 Challenges:
Handling compound words (e.g., "New York" should be one token).
Dealing with contractions (e.g., Penn Treebank-style tokenizers split "don't" into ["do", "n't"]), as the sketch below shows.
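A quick sketch with NLTK (assuming nltk is installed and the 'punkt' tokenizer models are downloaded); note how the contraction is split:

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("I love machine learning."))  # ['I', 'love', 'machine', 'learning', '.']
print(word_tokenize("Don't stop!"))               # ['Do', "n't", 'stop', '!']
```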

2️⃣ Subword Tokenization


Breaks words into smaller meaningful units.
Helps handle rare words by splitting them into frequent subwords.
Used in transformers like BERT, GPT.

✅ Example (Byte Pair Encoding - BPE):


Word: "unhappiness"
Subword tokens: [‘un’, ‘happiness’]

💡 Common Subword Tokenization Methods:


Byte Pair Encoding (BPE) – Used in GPT models.
WordPiece Tokenization – Used in BERT.
Unigram Language Model – Used in SentencePiece.
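A hedged sketch using Hugging Face Transformers (assumed installed; the first call downloads the tokenizer). BERT's WordPiece marks word-internal pieces with "##", and the exact split depends on the pretrained vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("unhappiness"))  # e.g. ['un', '##happiness'] or finer pieces
```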

3️⃣ Character Tokenization


Splits text character by character.
Useful for languages with complex words (e.g., Chinese).
Helps handle misspellings and typos.

✅ Example:
Input: "NLP"
Character tokens: [‘N’, ‘L’, ‘P’]
💡 Challenges:
Results in longer sequences → slower training.
Loses word-level meaning.
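Character tokenization needs no special library; in Python, a string splits naturally into characters:

```python
tokens = list("NLP")
print(tokens)  # ['N', 'L', 'P']
```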

4️⃣ Sentence Tokenization (Sentence Segmentation)
Splits paragraphs into individual sentences.
Uses punctuation marks (like `.`, `!`, `?`) to detect sentence boundaries.
Helpful in document processing and summarization.

✅ Example:
Input:
"I love NLP. It is very interesting!"
Output:
[‘I love NLP.’, ‘It is very interesting!’]
💡 Challenges:
Abbreviations (e.g., "Dr.", "U.S.") can be mistaken for sentence boundaries.
Complex sentence structures may confuse basic rules.
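NLTK's Punkt-based sentence tokenizer handles many abbreviations better than naively splitting on periods (a sketch assuming the 'punkt' data is downloaded):

```python
from nltk.tokenize import sent_tokenize

text = "I love NLP. It is very interesting! Dr. Smith agrees."
print(sent_tokenize(text))
# ['I love NLP.', 'It is very interesting!', 'Dr. Smith agrees.']
```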

Which Tokenization Method to Use?

| Tokenization Type | When to Use? | Challenges |
|-------------------|--------------|------------|
| Word Tokenization | Simple applications like text search | Struggles with compound words and contractions |
| Subword Tokenization | Deep learning models (BERT, GPT) | Requires training a vocabulary |
| Character Tokenization | Handling typos, new words, and rare languages | Produces long sequences |
| Sentence Tokenization | Summarization, translation | Hard to handle abbreviations and punctuation |

Conclusion
Tokenization is a fundamental step in NLP, helping break text into meaningful parts. The choice of
tokenization method depends on the application:

For basic NLP tasks → Word tokenization.


For deep learning models → Subword tokenization.
For document processing → Sentence tokenization.
For error handling and rare words → Character tokenization.

🚀 Modern NLP frameworks like NLTK, spaCy, and Hugging Face Transformers provide efficient
tokenization tools!

Explain stem and affix classes of morphemes with example.

Stem and Affix Classes of Morphemes

In morphology, words are made up of morphemes, which are the smallest units of meaning.
Morphemes can be classified into two broad categories:

1. Stem
2. Affix

1️⃣ Stem
A stem is the core part of a word that holds its basic meaning. It is the main morpheme to which affixes
(prefixes or suffixes) can be added.
✅ Example:
Word: unhappiness
Stem: happy
Affixes: un- (prefix) and -ness (suffix)

💡 Types of Stems:
Free stem: Can stand alone as a word (e.g., "play" in "playing").
Bound stem: Cannot stand alone and needs an affix (e.g., "ceive" in "receive").

2️⃣ Affix
An affix is a morpheme that attaches to a stem to modify its meaning. Affixes cannot stand alone as
words.
💡 Types of Affixes:
a) Prefix

Added before the stem.


Changes the meaning of the word.

✅ Examples:
re-play → "re" (again) + "play" (stem) = replay (play again)
un-happy → "un" (not) + "happy" = unhappy (not happy)

b) Suffix

Added after the stem.


Can change the tense, form, or meaning of a word.

✅ Examples:
kindness → "kind" (stem) + "-ness" (suffix) = kindness (state of being kind)
teacher → "teach" (stem) + "-er" (suffix) = teacher (one who teaches)

c) Infix (Rare in English)

Inserted inside a word.


Common in some other languages (e.g., Tagalog).

✅ Example (Tagalog):
sulat (write) → s-um-ulat = sumulat (wrote / to write), formed with the infix -um-

d) Circumfix (Not common in English)

Added both before and after a stem.


Common in German and some other languages.

✅ Example (German):
ge-lieb-t (loved, past participle)
"lieb" (love) is the stem, and ge- ... -t is the circumfix.

Conclusion
The stem is the core meaning of the word.
Affixes (prefixes, suffixes, infixes, circumfixes) modify the meaning or function of the word.
Understanding morphology helps in NLP, language learning, and linguistics! 🚀

What are the different techniques for semantic analysis of a statement?

Techniques for Semantic Analysis of a Statement


Semantic analysis in NLP involves extracting meaning from text by understanding the relationships
between words, phrases, and sentences. Various techniques are used to analyze the meaning of
statements, including:

1️⃣ Lexical Semantics Analysis


Focuses on word meanings and their relationships.
Uses dictionaries, thesauruses (like WordNet) to find synonyms, antonyms, hypernyms,
hyponyms, etc.

✅ Example:
Synonymy: "big" and "large" mean the same.
Hyponymy: "dog" is a hyponym of "animal".

🔹 Applications: Word Sense Disambiguation (WSD), Information Retrieval.


2️⃣ Word Sense Disambiguation (WSD)
Determines which sense of a word is used in a given context.
Uses supervised ML, dictionary-based, or knowledge-based approaches.

✅ Example:
"I went to the bank to withdraw money."
Here, bank means a financial institution.
"The boat was near the bank of the river."
Here, bank refers to the side of a river.

🔹 Techniques Used:
Lesk Algorithm (Dictionary-based)
Machine Learning (Supervised Models)

3️⃣ Named Entity Recognition (NER)


Identifies names of people, organizations, locations, dates, etc. in text.
Uses rule-based approaches, ML, or deep learning (BERT, spaCy).

✅ Example:
"Elon Musk founded Tesla in 2003."
Elon Musk → Person
Tesla → Organization
2003 → Date

🔹 Applications: Chatbots, Search Engines, Information Extraction.
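A short NER sketch with spaCy (assuming spaCy and its small English model en_core_web_sm are installed; exact labels depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded Tesla in 2003.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Elon Musk', 'PERSON'), ('Tesla', 'ORG'), ('2003', 'DATE')]
```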


4️⃣ Semantic Role Labeling (SRL) / Thematic Role Extraction
Identifies the roles of words in a sentence (who did what to whom?).
Breaks a sentence into Agent (doer), Action (verb), Object (receiver of action), etc.

✅ Example:
"John (Agent) ate (Action) an apple (Object)."

🔹 Applications: Question Answering, Text Summarization.


5️⃣ Latent Semantic Analysis (LSA)
Uses mathematical techniques (Singular Value Decomposition - SVD) to find hidden
relationships between words.
Helps in topic modeling and finding similar words in a corpus.

✅ Example:
The words "doctor", "nurse", "hospital" have similar meanings and belong to the same topic of
"medical field."

🔹 Applications: Document Classification, Search Engines, Plagiarism Detection.
6️⃣ Distributional Semantics (Word Embeddings: Word2Vec, GloVe, BERT)
Words are represented as vectors based on their context in large text corpora.
Helps measure semantic similarity between words.

✅ Example:
King - Man + Woman = Queen (Word2Vec analogy)

🔹 Applications: Machine Translation, Chatbots, Sentiment Analysis.
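The analogy can be tested with pretrained vectors via Gensim's downloader (assumed installed; the top result is typically "queen", though it depends on the vectors used):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```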


7️⃣ Semantic Parsing (Deep Learning-based NLP)
Converts natural language into a structured format (like SQL queries, logic forms).
Uses Neural Networks (BERT, GPT, T5, etc.).

✅ Example:
"Show me flights from Mumbai to Delhi on Monday."
Converted into structured form:

```json
{
  "intent": "search_flight",
  "origin": "Mumbai",
  "destination": "Delhi",
  "date": "Monday"
}
```

🔹 Applications: Voice Assistants (Alexa, Siri), Chatbots.


Conclusion
Different semantic analysis techniques help understand the meaning of statements in various
ways. The choice of method depends on the application:

Simple tasks → Lexical Semantics, NER.


Word meaning identification → WSD, SRL.
Deep meaning extraction → LSA, Word Embeddings, Semantic Parsing.

🚀 Modern NLP models like BERT, GPT, and T5 use a combination of these techniques for better
semantic understanding!

What is Natural Language Processing (NLP)? Discuss various stages involved in the NLP process with suitable examples.

What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that enables computers to
understand, interpret, and generate human language. It bridges the gap between human
communication and machine understanding, allowing computers to process and analyze large amounts
of natural language data.
🔹 Example Applications of NLP:
Chatbots (e.g., Siri, Alexa, Google Assistant)
Machine Translation (Google Translate)
Sentiment Analysis (analyzing customer reviews)
Speech Recognition (converting speech to text)
Spam Filtering (detecting spam emails)
