
NLP Imp.

Questions for Sem End

Mod. 1
Q.1 Explain the various stages of Natural Language processing [VIMP]
There are 5 stages in Natural Language Processing :-
Lexical Analysis: First stage in NLP, also known as Morphological analysis. At this stage the
structure of the words is identified and analyzed. Involves breaking down the text into tokens,
which are the smallest units of meaning, such as words or phrases.
Example: “I love cats” is broken down into tokens like [“I”, “love”, “cats”]
Syntactic Analysis: Involves analyzing the syntax or structure of the sentence according to
grammatical rules. Checks the sentence for proper syntax and creates a parse tree that
represents the grammatical structure.
Example: Creates a parse tree for “I love cats” where “I” is the Subject/Pronoun, “love” is the
Verb and “cats” is the Object/Noun
Semantic Analysis: Focuses on understanding the meaning of the text by interpreting the
meanings of individual words and how they combine to form meaningful phrases or sentences.
Example: For “I love cats”, semantic analysis would determine that “love” expresses a positive
emotion and the speaker has a positive feeling towards the animal cat.
Discourse Integration: Involves understanding the context and the relationship between
sentences to derive the meaning from the entire text, rather than just individual sentences.
Example: For a text with two sentences “I love cats. They are so cute.”, discourse integration
would link the “They” to “cats” and understand that the second sentence is providing additional
information about the “cats” in the first sentence.
Pragmatic Analysis: Deals with understanding the text in its real-world context; the text is re-interpreted
to work out what it truly means, taking the speaker’s intentions into account, which requires real-world knowledge.
Example: In “John saw Mary in a garden with a cat”, we can’t say whether John is with the cat
or Mary is with the cat. Or for “Could you pass me the salt?”, pragmatic analysis would recognize that
this isn’t just a question but also a polite request for someone to pass the salt.
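The first two stages can also be illustrated with an off-the-shelf NLP library. A minimal sketch using spaCy, assuming the small English model en_core_web_sm has been installed separately:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love cats")

# Lexical analysis: break the text into tokens
print([token.text for token in doc])            # ['I', 'love', 'cats']

# Syntactic analysis: POS tags and dependency relations (a parse of the sentence)
for token in doc:
    print(token.text, token.pos_, token.dep_)   # e.g. I PRON nsubj, love VERB ROOT, cats NOUN dobj
```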

Q.2 Discuss the challenges in various stages of natural language processing [VIMP]
Contextual words and phrases and homonyms: The same words and phrases can have
diverse meanings according to the context of the sentence and many words have the exact
same pronunciation but completely different meanings.
Example: “I went to the bank” it’s unclear whether the person went to a financial institution or a
riverbank without further context.
Synonyms: Different words can have similar meanings but they are not always interchangeable
in every context. This can cause contextual misunderstanding as we use the different words to
express the same idea.
Example: The words “big”, “large” and “huge” are synonyms, but we use “big mistake” instead of
a “large mistake”.
Irony and Sarcasm: Irony and sarcasm often convey meanings that are the opposite of the
literal words used, making it difficult for machines to interpret correctly. Detecting irony and
sarcasm requires an understanding of tone, context, and sometimes cultural or situational
knowledge, which can be tough for NLP systems to identify.​
Example: "Oh great, another traffic jam!" is likely sarcastic, meaning the speaker is actually
unhappy, not pleased.
Ambiguity: Occurs when a sentence or phrase can be interpreted in more than one way. These
phrases potentially have two or more possible interpretations.
Example: “The chicken is ready to eat” can mean that the chicken itself is about to eat or that
the chicken is cooked and ready for someone to eat it.
Errors in text or speech: Human communication often contains errors like typos, grammatical
mistakes, or mispronunciations.
Example: "I can’t beleive it!" contains a typo ("beleive" instead of "believe")

Q.3 Explain the ambiguities associated at each level with examples for NLP
Natural language is very ambiguous, ambiguities occur when a language construct can have
more than one interpretation. Ambiguities mean not having well defined solutions. Any sentence
in a language having a large-enough grammar can have another interpretation.
Lexical Ambiguity: A single word can have multiple meanings depending on the context.
When words have multiple assertions then it is known as lexical ambiguity.
Example: “I saw a bat in the garden”
Ambiguity: Does the bat refer to a flying mammal or a cricket bat
Syntactic Ambiguity: A sentence can have multiple possible structures, leading to different
interpretations. Syntactic ambiguity means sentences are parsed in multiple syntactic forms or A
sentence can be parsed in different ways.
Example: “I saw the man with the telescope”
Ambiguity: Did the man have the telescope, or was the telescope used to see the man?
Semantic Ambiguity: Semantic ambiguity is related to the sentence interpretation. The
meaning of a sentence is unclear due to the relationships between words or phrases.
Example: “Visiting relatives can be annoying”
Ambiguity: Does it mean the act of visiting relatives is annoying, or that relatives who visit are
annoying?
Anaphoric Ambiguity: This kind of ambiguity arises due to the use of anaphora entities in
discourse.
Example: “The horse ran up the hill. It was very steep. It was very tired”
Ambiguity: Here the anaphoric reference of “it” can cause ambiguity.
Metonymy Ambiguity: It is the most difficult ambiguity, it deals with phrases in which the literal
meaning is different from the figurative assertion. A word or phrase is used to represent
something closely related, which can lead to confusion.
Example: “The company is screaming for new management”
Ambiguity: Here it really doesn’t mean that the company is literally screaming

Q.4 Explain the applications of NLP


Machine translation: Machine translation involves automatically converting text from one
language to another. NLP techniques are used to understand the syntax, semantics, and
context of the source language to accurately translate it into the target language.
Example: Google Translate is a popular example where you can input text in English and get it
translated into Spanish, French, or any other language.
Speech recognition: Speech recognition converts spoken language into text. NLP is used to
process the spoken words, recognize the structure of the language, and convert it into text while
handling various accents and speech patterns.
Example: Virtual assistants like Siri, Alexa, and Google Assistant
Information retrieval: Information retrieval is the process of finding relevant information from a
large dataset or database based on a user's query. NLP helps in understanding the user’s
query, matching it with relevant documents, and ranking the results based on their relevance.
Example: Search Engines like Google
Question answering: Question answering systems provide precise answers to user queries
based on a knowledge base or dataset. NLP is used to understand the question, analyze the
available data, and generate the correct answer.
Example: IBM’s Watson
Text Summarization: Text summarization automatically generates a concise and coherent
summary of a longer text document. NLP algorithms analyze the text to identify key points,
sentences, or phrases that encapsulate the main ideas.
Example: Microsoft Word’s AutoSummarize feature
Sentiment Analysis: Sentiment analysis determines the emotional tone behind a piece of text,
categorizing it as positive, negative, or neutral. NLP techniques are used to analyze the text and
detect sentiment by identifying opinion words, context, and overall sentiment-bearing phrases.
Example: Companies use sentiment analysis to gauge public opinion on social media posts,
customer reviews, or feedback.

Mod. 2
Q.5 Explain Porter Stemming algorithm in Detail [VVIMP]
Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form, to streamline and standardize words to
enhance the effectiveness of NLP tasks.
The Porter Stemming Algorithm is one of the most widely used stemming algorithms in
Natural Language Processing proposed in 1980. It reduces words to their base or root form
(known as the stem) by systematically removing suffixes. This helps in text normalization,
especially for search and information retrieval tasks.
Steps in the Porter Stemming algorithm:
Step 1: Classify Characters: The algorithm begins by identifying and classifying characters in
the word as vowels or consonants. To understand the structure of the word and decide where to
apply the stemming rules.
Vowels are: a,e,i,o,u
Consonants: Any letter that’s not a vowel, except “y” in some cases
Step 2: Compare to rules: Words are then matched against a set of predefined rules to decide
which part of the word (usually suffixes) should be removed or transformed.
Region definitions:
R1: The part of the word that begins after the first non-vowel, following a vowel.
R2: The part of R1 that begins after the first non-vowel following a vowel.
These regions decide which suffixes to remove based on their location in the word.
Step 3: Remove common suffixes and pluralization: The algorithm removes or modifies
suffixes and handles plural forms systematically.
-​ Past tenses: “ed”
Examples: “jumped” → “jump”
-​ Adjective and Adverb suffixes: “ness”,”ful”,”ous”, etc.
Examples: “happiness” → “happi”, “successful” → “success”, “dangerous” → “danger”
-​ Plural forms: “s”,”es”
Examples: “cats” → “cat” and “foxes” → “fox”
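A short illustration of these steps using NLTK’s implementation of the Porter stemmer (the printed stems are indicative of its output):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["jumped", "happiness", "successful", "dangerous", "cats", "foxes"]:
    # The stemmer applies the suffix rules described above
    print(word, "->", ps.stem(word))   # e.g. jumped -> jump, happiness -> happi, cats -> cat
```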

Advantages of Porter Stemming algo:


-​ Widely used due to its simplicity and efficiency, it is widely adopted in text mining and
search engines.
-​ Specifically designed for English, ensuring better accuracy for English text.

Application: The main applications of porter stemmer include,


-​ Data mining and information retrieval
-​ Search Tasks
-​ Text Normalization

Q.6 Explain the role of FSA in morphological analysis ?


OR Explain FSA for nouns and verbs. Also, design a Finite State Automata (FSA) for the
words of English numbers 1-99 [VIMP]
Finite State Automata (FSA) play a crucial role in morphological analysis, which is the study of
how words are formed from smaller units called morphemes (e.g., roots, prefixes, suffixes).
Morphological analysis is essential in Natural Language Processing (NLP) to understand and
process words in their various forms.

In morphological analysis, FSA is used to:


1.​ Recognize Valid Word forms: FSA checks if a word follows the morphological rules of
a language. For example, “walked” (root + suffix) is valid, but “walkeded” is not.
2.​ Decompose words into Morphemes: FSA can break down a word into its root and
affixes (prefixes, infixes and suffixes). For example, “cats” → “cat”, the suffix “s” is
removed.
3. Handle Inflections and Derivations:
Inflection, modifications to indicate tense, number, etc. For example, “run” → “running”
Derivation, modifications to create a new word. For example, “law” → “lawyer”.
FSA encodes these rules to analyze word forms.
4.​ Support Morphological Parsing: Parsing involves mapping the structures of a word.
For example, FSA can parse “unsuccessful” into prefix “un”, root “success” and suffix
“ful”.
Example: For a word like “unsuccessfully”,
Prefix “un”, Root “success” and Suffixes “ful” + “ly”


FSA for Nouns:
Nouns in English may take singular or plural forms
Pluralization rules:
-​ Regular: Add -s or -es
(e.g, cat→cats and fox→foxes)
-​ Irregular: Change internal structure
(e.g., mouse→mice and wolf→wolves)

FSA for Verbs:


Verbs in English can be conjugated based on tense, number and person
Rules:
-​ Past: Base form + -ed (e.g, jump→jumped)
-​ Present: Base form +s (e.g, jump→jumps)
-​ Present Participle: Base form + ing (e.g., jump→jumping)

FSA for numbers from 1-99 in English
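The machine can be summarized by three groups of transitions: units (one-nine), teens (ten-nineteen) and tens (twenty, thirty, ..., ninety), where a tens word may optionally be followed by a unit. A small, hypothetical simulation of that FSA:

```python
UNITS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine"}
TEENS = {"ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"}
TENS = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}

def accepts(tokens):
    """Simulate the FSA: start --unit/teen--> final, start --tens--> tens --unit--> final."""
    state = "start"
    for tok in tokens:
        if state == "start" and (tok in UNITS or tok in TEENS):
            state = "final"            # 1-19: nothing may follow
        elif state == "start" and tok in TENS:
            state = "tens"             # 20, 30, ..., 90: a unit may follow
        elif state == "tens" and tok in UNITS:
            state = "final"            # e.g. "twenty" + "three"
        else:
            return False               # no valid transition
    return state in {"final", "tens"}  # "tens" alone (e.g. "forty") is also accepting

print(accepts(["twenty", "three"]))    # True  (23)
print(accepts(["fifteen", "two"]))     # False (not a valid number word sequence for 1-99)
```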

Q.7 Illustrate the concept of tokenization and stemming in NLP


●​ Tokenization
●​ Stemming

●​ Lemmatization
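The definitions from Q.1 and Q.5 apply here; a brief NLTK sketch showing all three operations (assuming the punkt and wordnet resources have been downloaded):

```python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The leaves are falling and the children are running"

tokens = word_tokenize(text)                       # Tokenization: split text into tokens
print(tokens)

ps = PorterStemmer()
print([ps.stem(t) for t in tokens])                # Stemming: running -> run, leaves -> leav

lem = WordNetLemmatizer()
print([lem.lemmatize(t) for t in tokens])          # Lemmatization: leaves -> leaf (dictionary form)
```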

Q.8 Explain inflectional and derivational morphology with an example.


Inflectional Morphology:

●​ Purpose: Inflectional morphology modifies a word to express different grammatical


categories such as tense, number, gender, case, etc.
●​ Meaning: The base meaning of the word does not change; only its grammatical function
does.
●​ Examples:
○​ Tense: play → played (past tense)
○​ Number: cat → cats (plural)
○​ Case: she → her (subjective to objective case)
●​ Features:
○​ Inflectional morphemes do not change the word class (e.g., noun remains a
noun).
○​ The number of inflectional morphemes in a language is usually limited.
○​ Inflections are typically added as suffixes in English (e.g., -s, -ed, -ing).

Derivational Morphology:

●​ Purpose: Derivational morphology creates a new word by adding a morpheme that


changes the meaning or the word class of the original word.
●​ Meaning: The meaning of the base word often changes, sometimes significantly.
●​ Examples:
○​ Word Class Change: happy (adjective) → happiness (noun)
○​ Meaning Change: friend → unfriendly (adding "un-" changes the meaning to its
opposite)
●​ Features:
○​ Derivational morphemes often change the word class (e.g., verb to noun,
adjective to adverb).
○​ The number of derivational morphemes is more varied than inflectional
morphemes.
○​ They can be either prefixes (e.g., un- in unhappy) or suffixes (e.g., -ness in
happiness).

Inflectional Morphology vs Derivational Morphology:
- Definition: Inflectional morphology is a morphological process that adapts existing words so that they function effectively in sentences without changing the POS of the base morpheme; derivational morphology is concerned with the way morphemes are connected to existing lexical forms as affixes.
- Regularity: The usage of inflectional morphology is really regular in our language; derivational morphology is not as regular in usage.
- Use: Inflectional morphemes can only be suffixes or infixes, never prefixes; derivational morphemes can be both prefixes and suffixes.
- Change in POS: Inflectional morphology never changes the grammatical category or POS; derivational morphology can change the grammatical category or POS.
- Example: Cat + s = Cats (NOUN → NOUN) vs. Danger + ous = Dangerous (NOUN → ADJECTIVE)
Q.9 Represent output of morphological analysis for Regular verb, irregular verb, singular
noun, plural noun. Also explain the role of FST in Morphological Parsing with an
example.
Morphological analysis involves breaking a word into its constituent morphemes (root, affixes)
and identifying their grammatical roles. Output of Morphological analysis for,
Regular Verb: Example: jumped
-​ Root: jump
-​ Tense: past
-​ Output: {Root:jump, Tense:past}
Irregular Verb: Example: ran
-​ Root: run
-​ Tense: past
-​ Output: {Root:run, Tense:past}
Singular Noun: Example: cat
-​ Root: cat
-​ Number: singular
-​ Output: {Root:cat, Number:singular}
Plural Noun: Example: cats
-​ Root: cat
-​ Number: plural
-​ Output: {Root:cat, Number:plural}

Role of FST in Morphological Parsing: An FST is an extension of a finite-state automaton


(FSA) that maps input strings to output strings. It is particularly useful for morphological parsing,
where the FST takes a surface form (e.g., walked) as input and outputs its morphological
structure.
Functions of FST:
1.​ Mapping Input to Output: Converts surface forms into linguistic expressions
E.g., Mapping of a plural noun,​
cats → {Root:cat, Number:plural}
2.​ Handling Morphological rules: Encodes rules like suffix addition, stem changes and
irregular forms
E.g., handles irregular mappings like,
run → ran
see → saw
3.​ Bidirectional Processing: Works for both analysis (breaking down words) and
generation (constructing words)
Example for FST in Morphological analysis: Consider the words “jumped” (verb) and “cats”
(noun)
For, Input: “jumped”
-​ Step 1: FST identifies the suffix -ed
-​ Step 2: Removes the suffix -ed to extract the Root “jump”
-​ Step 3: Maps -ed to past tense
Output: {Root:jump, Tense:past}
For, Input: “cats”
-​ Step 1: FST identifies the suffix -s
-​ Step 2: Removes the suffix -s to extract the Root “cat”
-​ Step 3: Maps -s to number plural
Output: {Root:cat, Number:plural}
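This behaviour can be mimicked with a small rule table. The sketch below is not a real FST implementation, just an illustration of the surface-form-to-feature mapping an FST performs, using the rules and words from the example above:

```python
IRREGULAR = {"ran": {"Root": "run", "Tense": "past"},     # irregular forms encoded directly
             "saw": {"Root": "see", "Tense": "past"}}

def parse(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ed"):                                # regular past-tense verbs
        return {"Root": word[:-2], "Tense": "past"}
    if word.endswith("s"):                                 # regular plural nouns
        return {"Root": word[:-1], "Number": "plural"}
    return {"Root": word}

print(parse("jumped"))   # {'Root': 'jump', 'Tense': 'past'}
print(parse("cats"))     # {'Root': 'cat', 'Number': 'plural'}
print(parse("ran"))      # {'Root': 'run', 'Tense': 'past'}
```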

Q.10 Explain how the n-gram model is used in spelling correction.


The n-gram model is a probabilistic language model that predicts the likelihood of a sequence
of words or characters appearing together based on their historical frequency. In the context of
spelling correction, n-grams help identify and rank the most probable corrections for a
misspelled word by evaluating the likelihood of the corrected word fitting within its context.
Error types in Spelling Correction:
Substitution: A misspelled word arises when a single character is replaced with another, the
n-gram model considers the context of the preceding and following words to suggest
corrections.
For e.g., In the sentence, “The speljing of the word is wrong”, the bigram probabilities for "The
spelling" and "The speljing" are compared. If P(“The spelling”) > P(“The speljing”), the model
suggests correcting speljing to spelling.
Addition: A character is mistakenly added to a word. For e.g., “The spellingx of the word is
wrong”, the bigram probabilities are checked which leads to, P(“The spelling”) > P(“The
spellingx”). The model suggests removing the extra “x” at the end of the word to get the most
probable result.
Deletion: A character is omitted from a word. For e.g., “The splling of the word is wrong”, the
bigram probabilities are checked which leads to, P(“The spelling”) > P(“The splling”). The model
suggests adding an “e” to the word to get the most probable result.
Transposition: Two adjacent characters are swapped. For e.g., “The spleling of the word is
wrong”, the bigram probabilities are checked, which leads to P(“The spelling”) > P(“The
spleling”). The model suggests swapping “l” and “e” in the word to get the most probable result.
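A toy sketch of the idea: estimate bigram probabilities from a small (made-up) corpus and pick the candidate correction that makes the surrounding bigram most probable.

```python
from collections import Counter

corpus = "the spelling of the word is wrong . the spelling looks fine".split()
bigrams = Counter(zip(corpus, corpus[1:]))     # counts of adjacent word pairs
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

candidates = ["spelling", "speljing"]
best = max(candidates, key=lambda c: bigram_prob("the", c))
print(best)   # 'spelling', because P("the spelling") > P("the speljing")
```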
Q.11 Explain the perplexity of any language model.
Perplexity is a key metric used to evaluate the performance of a language model. It measures
how well a language model predicts a sample of text. Essentially, perplexity evaluates the
"uncertainty" of a model when it tries to predict the next word or sequence of words in a
language.

Perplexity(W) = ( 1 / P(w1, w2, ..., wN) )^(1/N)

where, w1, w2, ..., wN is the sequence of N words in the dataset

and P(wn) is the probability of the nth word in the sequence predicted by the model

Lower perplexity means the model is better at predicting the text


Higher perplexity means that the model struggles to predict the text, implying poor performance
Perfect Model: If the language model predicts every word in the sequence with certainty
(P(wn)=1), the perplexity is 1.
Random Guessing: If the model assigns equal probability to all the words in the vocabulary
(P(wn)=1/|V|), the perplexity is equal to the vocabulary size.
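A small sketch of the computation, using made-up per-word probabilities from some language model:

```python
import math

word_probs = [0.2, 0.1, 0.05, 0.3]                    # P(w1), P(w2), P(w3), P(w4) for a 4-word text
N = len(word_probs)

perplexity = math.exp(-sum(math.log(p) for p in word_probs) / N)
print(perplexity)                                     # lower is better; equals 1 only if every P(wn) = 1
```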

Importance of Perplexity:
Model Performance: Lower perplexity indicates a better language model, as it means the
model's predictions are closer to the actual sequence of words.
Comparative Metric: Perplexity is used to compare different language models on the same
dataset. A lower perplexity across multiple datasets shows that a model generalizes better.
Evaluation of Probabilities: Perplexity directly evaluates the probability distribution generated
by the model, reflecting its ability to model language.

Applications of Perplexity:
N-gram Models: Perplexity is often used to evaluate n-gram models, with lower perplexity
indicating better performance.
Neural Language Models: Modern models like RNNs, LSTMs, and Transformers are evaluated
using perplexity to measure their fluency in generating text.
Fine-tuning: When fine-tuning models on a specific domain, perplexity helps track improvement
over iterations.

Q.12 Explain Good Turing Discounting.


In tasks like language modeling or text analysis, many possible word combinations might not
appear in the training data. A naive model would assign such unseen events a probability of 0,
which is problematic for predicting new data.
Good Turing Discount solves this by,
1.​ Reducing probabilities of observed events slightly (discounting)
2.​ Allocating this “saved probability mass” to unseen events
Instead of using the observed frequency of an event directly, Good-Turing Discounting replaces
it with an adjusted count, called the Good-Turing estimate.

Steps for Good Turing Discounting:


Step 1: Count Frequencies: Count how many events have occurred exactly r times in the
dataset and compute Nr for each of them.
Step 2: Compute Adjusted Counts: For each frequency r, calculate the adjusted counts with
the formula r* = (r+1) x (Nr+1 / Nr)
Step 3: Reassign Probabilities: Calculate probabilities for observed and unseen events using
their adjusted counts.
Advantages:
-​ Handles unseen events effectively by reallocating probabilities
-​ Improves model generalization on sparse datasets
Example: Consider N1 = 3, N2 = 2 and N3 = 1, and calculate the adjusted counts for each r.
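Working through this example with the formula above (assuming N4 = 0, since no event occurs exactly 4 times):
- r = 1: r* = (1+1) × (N2 / N1) = 2 × (2/3) ≈ 1.33
- r = 2: r* = (2+1) × (N3 / N2) = 3 × (1/2) = 1.5
- r = 3: r* = (3+1) × (N4 / N3) = 4 × (0/1) = 0, which is why the Nr counts are usually smoothed in practice before applying the formula at the largest r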

Mod. 3
Q.13 What is POS tagging ? Explain various approaches to perform POS tagging and
discuss various challenges faced by POS tagging. [VVIMP]
OR What is the rule-based and stochastic part of speech taggers ?
Challenges in POS tagging:
-​ Ambiguity: The main problem with POS tagging is ambiguity. In English, many
common words have multiple meanings and hence multiple POS. The job of a POS
tagger is to resolve this ambiguity accurately based on the context of use.
For e.g., the word “shot” in the sentence “He shot a bird” is a Verb, whereas in the
sentence “A shot of vodka” it is a Noun (see the tagger sketch after this list).
-​ Downstream Propagation: If a POS tagger gives poor accuracy, then this has an
adverse effect on other tasks that follow. This is called downstream propagation. To
improve accuracy, POS tagging is combined with other processing techniques, like
dependency parsing.
-​ Unigram Tagging: In a statistical approach, one can count the tag frequencies of words
in a tagged corpus and then assign the most probable tag. This is Unigram Tagging, a
much better approach would be Bigram tagging where the tag frequency is counted
while keeping in mind the preceding tag.
-​ Out-Of Vocabulary words: Unknown words, such as new terms, abbreviations or slang
are difficult to tag.
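The ambiguity challenge above (the word “shot”) can be reproduced with NLTK’s off-the-shelf tagger; a small sketch assuming the averaged_perceptron_tagger resource has been downloaded (the tags shown are indicative):

```python
import nltk

print(nltk.pos_tag(nltk.word_tokenize("He shot a bird")))
# e.g. [('He', 'PRP'), ('shot', 'VBD'), ('a', 'DT'), ('bird', 'NN')]   -> "shot" tagged as a verb

print(nltk.pos_tag(nltk.word_tokenize("A shot of vodka")))
# e.g. [('A', 'DT'), ('shot', 'NN'), ('of', 'IN'), ('vodka', 'NN')]    -> "shot" tagged as a noun
```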

Q.14 Explain hidden Markov model for POS-based tagging


OR What are the limitations of the Hidden Markov Model ?
OR Apply Hidden Markov Model and do POS tagging for the given statements
[Numericals] [VIMP]
The Hidden Markov Model (HMM) is a statistical model that is widely used for Part-of-Speech
(POS) tagging in Natural Language Processing (NLP). It is based on the Markov process and
assumes a sequence of states (POS tags) and observations (words in the text).
Key components of Hidden Markov Model (HMM):
1.​ Initial Probabilities: The probability of the first tag in the sentence, being T1.
Example: P(NOUN)
2.​ One or more Hidden States: The possible POS tags (e.g., noun, verb, adjective)
Example: T = {NOUN, VERB, ADJ}
3.​ Transition Probabilities: The probability of a tag Ti occurring given the previous tag
Ti-1.
Example: P(VERB|NOUN) is the probability of a verb following a noun.
4.​ Observations: The words in a given sentence or text
Example: W = {“The”,”cat”,”runs”}
5.​ Emission Probabilities: The probability of a word W being generated given a tag T. It is
a sequence of observations of likelihoods.
Example: P(“dog”|NOUN)
Example for HMM: “The cat runs”
Step 1: Observations: “The”, “cat”, “runs”
Step 2: Possible Tags: {DET, NOUN, VERB}
Step 3: Score candidate tag sequences: For each possible tag sequence, combine the transition
probabilities P(Ti | Ti-1) and the emission probabilities P(Wi | Ti) to get the probability of that sequence for the observed words
Step 4: Output the most probable sequence: e.g.,{DET, NOUN, VERB}
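A toy decoder for this example; all probabilities below are made-up illustrative values, not estimates from a real corpus, and the Viterbi algorithm is used so that every tag sequence does not have to be enumerated:

```python
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}                        # initial probabilities
trans = {"DET": {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},               # P(tag_i | tag_i-1)
         "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
         "VERB": {"DET": 0.4, "NOUN": 0.4, "VERB": 0.2}}
emit = {"DET": {"the": 0.9}, "NOUN": {"cat": 0.5}, "VERB": {"runs": 0.6}}   # P(word | tag)

def viterbi(words):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max((best[p][0] * trans[p][t] * emit[t].get(w, 1e-6), best[p][1] + [t])
                       for p in tags)
                for t in tags}
    return max(best.values())[1]

print(viterbi(["the", "cat", "runs"]))   # ['DET', 'NOUN', 'VERB']
```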

Issues/Limitations in HMM:
-​ Independence Assumptions: Assumes that the probability of a word depends only on
its tag and the probability of a tag depends only on the previous tag. Ignores long-range
dependencies or global sentence context.
-​ Out-Of-Vocabulary OOV Words: HMM struggles with unknown words since it relies on
predefined probabilities. Cannot handle OOV words.
-​ Scalability Issues: As the length of the sentence increases, the computation grows
exponentially without optimizations like the Viterbi algorithm.

Q.15 Demonstrate the concept of Conditional Random Field in NLP


OR Explain how Conditional Random Field (CRF) is used for sequence labeling [VIMP]
Conditional Random Fields (CRFs) are a type of probabilistic model used for structured
prediction tasks, particularly in sequence labeling problems. In Natural Language Processing
(NLP), they are widely used for tasks such as Part-of-Speech (POS) tagging, Named Entity
Recognition (NER), and Pattern Recognition.
Unlike Classifiers that do not consider neighboring labels, a CRF takes context into account and
allows for dependencies between neighboring labels. CRF considers the entire sequence when
predicting labels, not just local dependencies. To achieve this, the predictions are modeled as a
graphical model. And they represent the presence of dependencies between the predictions.

CRFs are discriminative models that directly model the conditional probability P(Y|X), where:
-​ X = (x1,x2…xn) is the sequence of the observations (words in a sentence)
-​ Y = (y1,y2…yn) is the sequence of labels (POS tags, NER labels)
Working of a CRF:
Step 1: Input and Output Representation: Input a sequence of observations, X = (x1,x2…xn)
and output should be a sequence of labels Y = (y1,y2…yn) e.g., POS tags or Entity labels in
NER.
Step 2: Feature Functions: CRFs use feature functions f(X,Y) to capture the relationship
between input observations and output labels. These features can be, ​
State features: Dependent on current observation xi and current label yi
Transition features: Dependent on current label yi and the previous label yi-1
Step 3: Conditional Probability: Computing conditional probability of a label sequence Y given
the input sequence X is done using P(Y|X)
Step 4: Training: Model learns the weights (lambda k) for the features by maximizing the
conditional likelihood
Step 5: Prediction: The task is to find the sequence Y that maximizes P(Y|X), often solved
using the Viterbi algorithm.
Example:
Input Sequence: X = {“Sanil”, “Jadhav”, “was”, “born”, “in”, “Oregon”}
Output Sequence: Y = {B-PER, I-PER, O, O, O, B-LOC} where,
-​ B-PER: Beginning of a person’s name
-​ I-PER: Inside (continuation of) a person’s name
-​ O: Outside any named entity
-​ B-LOC: Beginning of a location name
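A hedged sketch of how such a sequence labeler might be trained with the third-party sklearn-crfsuite package; the features and the single training sentence below are toy examples, not a real NER setup:

```python
import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i]
    feats = {"word.lower": word.lower(),           # state features for the current word
             "word.istitle": word.istitle(),
             "word.isdigit": word.isdigit()}
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()  # context from the previous word
    else:
        feats["BOS"] = True                        # beginning-of-sentence marker
    return feats

sentence = ["Sanil", "Jadhav", "was", "born", "in", "Oregon"]
X_train = [[word2features(sentence, i) for i in range(len(sentence))]]
y_train = [["B-PER", "I-PER", "O", "O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))    # with one training sentence it simply reproduces the labels
```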

Q.16 .Explain Parser in NLP also explain or compare top-down and bottom-up approach
of parsing with examples
A parser in Natural Language Processing (NLP) analyzes a sentence's grammatical structure to
identify its syntactic relations. Parsing generates a parse tree or syntax tree, representing the
hierarchical structure of a sentence based on grammar rules.
Types of Parsing Approaches:
1.​ Top-Down Parsing: Begins with the start symbol and applies grammar rules to derive
the input sentence. Matches rules to input tokens recursively, ensuring each token
corresponds to a grammar rule.
2.​ Bottom-Up Parsing: Starts with the input tokens and applies grammar rules in reverse
to build up the parse tree towards the start symbol. Consider all possible combinations of
rules.

Top-Down Parsing vs Bottom-Up Parsing:
- Top-down parsing is a strategy that first looks at the highest level of the parse tree and works down the parse tree using the rules of grammar; bottom-up parsing first looks at the lowest level of the parse tree and works up the parse tree using the rules of grammar.
- Top-down parsing uses leftmost derivation; bottom-up parsing uses rightmost derivation (in reverse).
- Top-down parsing attempts to find the leftmost derivation for an input string; bottom-up parsing attempts to reduce the input string to the start symbol of the grammar.
- In top-down parsing the main decision is which production rule to use in order to construct the string; in bottom-up parsing the main decision is when to use a production rule to reduce the string to the start symbol.
- Top-down parsing is used for well-defined grammars and is less efficient at managing ambiguous grammars; bottom-up parsing is better at managing ambiguous grammars.
- Top-down parsing may explore invalid branches that do not match the input; bottom-up parsing only focuses on valid input combinations, since it starts from the input string itself and works toward the start symbol.
- Example: Recursive Descent parser (top-down); Shift-Reduce parser (bottom-up).

Q.17 Explain the use of Probabilistic Context-Free Grammar (PCFG) in NLP with an
example.
Probabilistic Context Free Grammar or PCFG is an extension of Context Free Grammar
where each production is assigned a probability. This probability reflects how likely a particular
rule is to be applied, enabling PCFGs to handle ambiguities in NLP.

A probabilistic Context-Free Grammar is defined by a quintuple:


G = (M,T,R,S,P) where,
-​ M is the set of non-terminals
-​ T is the set of terminals
-​ R is the set of Production rules
-​ S is the start symbol
-​ P is the set of probabilities on production rules

Each rule A → B has a probability P(A→B), such that,


∑P(A → B) = 1

Roles of PCFG in NLP,


1.​ Disambiguation: PCFG solves ambiguities in parsing by assigning higher probabilities
to more likely interpretations
2.​ Sentence Parsing: PCFG helps construct parse trees for a sentence by considering the
likelihood of different syntactic structures
3.​ Language Modeling: Used to predict the likelihood of order of sequence of words

Example of PCFG: “the cat sat”


Grammar rules,
S → NP VP (P=0.9)​ ​ NP → Det N (P=0.8)​ ​ VP → V NP (P=0.7)
S → VP (P=0.1)​ ​ NP → N (P=0.2)​ ​ VP → V (P=0.3)

Det → “the” (P=1.0)


N → “cat” (P=0.5)
N → “mat” (P=0.5)
V → “sat” (P=1.0)

Now,
Step 1: Start with S → NP VP (P=0.9)
Step 2: Expand NP → Det N (P=0.8)
Step 3: Assign terminals Det → “the” (P=1.0) and N → “cat” (P=0.5)
Step 4: Expand VP → V (P=0.3) and assign V → “sat” (P=1.0)

Probability of the parse = product of all rule probabilities used
= 0.9 × 0.8 × 1.0 × 0.5 × 0.3 × 1.0 = 0.108

Q.18 For a given grammar using CYK or CKY algorithm parse the statement “The man
read this book”.
The CYK algorithm (Cocke-Younger-Kasami) is a bottom-up parsing method used to check if
a given string can be generated by a Context-Free Grammar (CFG) in Chomsky Normal Form
(CNF). It also produces the possible parse trees.

Example: “The man read this book”


Step 1: Grammar in CNF
S → NP VP
NP → Det N
VP → V NP
Det → “the” | ”this”
N → “man” | “book”
V → “read”

Step 2: Input String


“The man read this book” is tokenized as,
w1=”the”, w2=”man”, w3=”read”, w4=”this”, w5=”book”

Step 3: CYK Table, substrings of length 1

“the” → Det
“man” → N
“read” → V
“this” → Det
“book” → N

Step 4: Substrings of length 2


Substring Non-Terminals Reason

“the man” NP Det + N = NP

“man read” — No rules

“read this” VP V + NP = VP

“this book” NP Det + N = NP

Step 5: Substrings of length 3


Substring Non-Terminals Reason

“the man read” — No rules

“read this book” VP V + NP = VP

Step 6: Full sentence


Substring Non-Terminals Reason

“the man read this book” S NP + VP = S
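A compact sketch of the CYK recognizer for this grammar; the table is indexed as table[start][length] and holds the non-terminals that derive each substring:

```python
from itertools import product

lexical = {"the": "Det", "this": "Det", "man": "N", "book": "N", "read": "V"}
binary = {("Det", "N"): "NP", ("V", "NP"): "VP", ("NP", "VP"): "S"}

def cyk(words):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][1].add(lexical[w])                               # substrings of length 1
    for length in range(2, n + 1):                                # longer substrings
        for start in range(n - length + 1):
            for split in range(1, length):                        # try every split point
                left, right = table[start][split], table[start + split][length - split]
                for a, b in product(left, right):
                    if (a, b) in binary:
                        table[start][length].add(binary[(a, b)])
    return "S" in table[0][n]                                     # sentence accepted?

print(cyk("the man read this book".split()))   # True
```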

Q.19 Explain maximum entropy model for POS tagging.


In many systems, there is a time or state dependency. These systems evolve in time through a
sequence of states and the current state is influenced by past states.
For example, there is a high chance of rain today if it had already rained yesterday. Other
examples include stock prices, human speech or words in a sentence.
We may have a sequence of words but not corresponding POS tags, in such a case, we model
the tags as states and use the observed words to predict the most probable sequence of tags.
This is the job of the Maximum Entropy Model.
Steps in MaxEnt Model,
-​ Training: Extract features from a labeled corpus. Train the model using an optimization
technique to find weights λi for each feature.
-​ Tagging (Inference): For each word in a sentence, calculate P(t | w,c) for all possible
tags t. Assign the tag with the highest probability.
t* = arg max P(t | w,c)

Q.20 Compute the emission and transition probabilities for a bigram HMM. Also decode
the following sentence using the Viterbi algorithm.
XXX

Mod. 4
Q.21 What do you mean by Word Sense Disambiguation (WSD) ? Discuss
dictionary-based approach for WSD. [VIMP]
OR Explain the Lesk algorithm for WSD
OR Explain dictionary-based approach for WSD with a suitable example.
The main problem of NLP systems is to identify words properly and to determine their specific
usage of a word in a particular sentence. WSD solves this ambiguity when it arises while
determining the meaning of the same word, when it is used in different contexts.
Word Sense Disambiguation (WSD) is the task in Natural Language Processing (NLP) of
identifying the correct meaning of a word in context when the word has multiple possible senses
(meanings). It is essential for understanding natural language, machine translation, information
retrieval, and question answering.

Example: The word “bank” can mean


-​ A financial institution (He deposited money in the bank).
-​ The edge of a river (They sat on the river bank).
Here, WSD’s task is to determine whether "bank" refers to a financial institution or a riverbank
based on the surrounding words.

Dictionary-Based Approach for WSD: The dictionary-based approach relies on external


lexical resources, such as dictionaries, thesauruses, or WordNet, to identify the sense of a word.
It uses definitions and semantic relations from these resources to disambiguate meanings.
Example: Determine the sense of “bank” in the sentence, “The boat was tied to the river bank”
Step 1: Senses from Wordnet:
-​ Sense 1: Bank (financial institution) = A financial establishment where money is kept
-​ Sense 2: Bank (riverbank) = the land alongside a river or a lake
Step 2: Context Words: “boat”, “tied”, “river”
Step 3: Overlap with Definitions:
-​ Sense 1 (financial institution): No overlap with any of the dictionary definitions
-​ Sense 2 (riverbank): Dictionary definition overlapping with “river”
Step 4: Conclusion: The correct sense is riverbank based on the context.
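NLTK ships a simplified Lesk implementation backed by WordNet that follows exactly this overlap idea; a sketch assuming the wordnet and punkt resources have been downloaded (the chosen synset depends on the overlaps found):

```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "The boat was tied to the river bank"
sense = lesk(word_tokenize(sentence), "bank", pos="n")   # pick the noun sense with the most overlap
print(sense, "-", sense.definition())
# e.g. Synset('bank.n.01') - sloping land (especially the slope beside a body of water)
```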

Applications of WSD:
-​ Machine Translation: Ensuring the correct sense is translated.
-​ Information Retrieval: Improving search engine relevance by distinguishing senses.
-​ Text Summarization: Extracting accurate summaries by resolving ambiguities.

Q.22 Explain with suitable examples following relationships between word meanings.
Synonymy, Antonymy, Homonymy, Hypernymy & Hyponymy and Polysemy [VIMP]
1.​ Synonymy: Synonymy refers to words that have similar or identical meanings. It is a
relation between two lexical items having different forms but expressing the same or a
close meaning.​
Example:
-​ Big and Large: He lives in a big house vs. He lives in a large house
-​ Happy and Joyful: She feels happy vs. She feels joyful
2.​ Antonymy: Antonymy refers to words with opposite meanings. There are three types of
antonyms:
-​ Gradable: Represent a scale or spectrum​
Example: This coffee is cold vs This coffee is hot
-​ Complementary: Mutually exclusive​
Example: The switch is on vs The switch is off
-​ Relational: Depend on the context
Example: She lent me her book vs I borrowed her book
3.​ Homonymy: Homonymy occurs when words share the same spelling or pronunciation
but have entirely different meanings.​
Example: I can see the sea.
4.​ Hypernymy and Hyponymy: Hypernymy represents a general category, while
hyponymy refers to specific instances within that category.​
Hypernym is the General term and Hyponym is a Specific term.​
Example:
-​ Hypernym: Animal
-​ Hyponyms: Dog, Cat, Elephant
5.​ Polysemy: Polysemy refers to a single word having multiple related meanings.​
Example: Run
-​ He runs fast. (physical activity)
-​ The program is running. (operation)
-​ She runs a company. (manages)
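Several of these relations can be looked up directly in WordNet through NLTK (assuming the wordnet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog", pos="n")[0]
print(dog.lemma_names())              # synonyms grouped in the same synset
print(dog.hypernyms())                # hypernyms: more general concepts (e.g. canine, domestic animal)
print(dog.hyponyms()[:3])             # hyponyms: more specific kinds of dog

happy = wn.synsets("happy", pos="a")[0]
print(happy.lemmas()[0].antonyms())   # antonym lemmas, e.g. 'unhappy'
```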

Q.23 Describe semantic analysis in NLP


Semantic analysis is the process of finding the meaning from text. It focuses on understanding
the meaning of words, phrases, sentences, and entire texts. The main aim of semantic analysis
is to draw exact meaning or dictionary meaning from the text, to check the text for
meaningfulness.
The key objective of Semantic Analysis in NLP is to recognize the meaning of a word or
sentence based on its use in the given context.
Example:
The word “bank” could mean:
-​ Financial institution
-​ The side of a river
Applications of Semantic Analysis:
1.​ Sentiment Analysis: Determines the sentiment (positive, negative, neutral) of a text.​
Example: “The movie was awesome!” (Positive Sentiment)​
“The movie was so trash” (Negative Sentiment)
2.​ Machine Translation: Translates text from one language to another while preserving
meaning.
3.​ Question Answering Systems: Understands and responds to user queries based on
context and intent.
4.​ Text Summarization: Extracts the most relevant information from a document and
shortens the text contents in the document.

Q.24 Demonstrate the lexical semantic analysis using an example.


Lexical Semantics is a part of Semantic analysis. It studies the meaning of individual words and
their relationships, such as synonyms, antonyms, and word senses. The goal is to process and
interpret the meaning of words within a given context.

Example of Lexical Semantic Analysis:


Input: “The bank is on the river bank.”

Step 1: Word Identification: Identifying all the tokens in the sentence, “The”, “bank”, “is”, “on”,
“the”, “river”, “bank.”
Step 2: Word Sense Disambiguation: The word “bank” appears twice. Its meaning depends
on the context of the sentence:
-​ “bank” (financial institution) for the first occurrence
-​ “Bank” (side of a river) for the second occurrence
Step 3: Synonymy and Antonymy: Synonym for the second "bank" could be "shore." Antonym
for the first "bank" could be "debt" (in a financial context).
Step 4: Hypernymy and Hyponymy:
-​ Hypernym of "bank" (financial): "organization."
-​ Hypernym of "bank" (river): "geographical feature."
-​ Hyponym of "bank" (financial): "savings bank," "credit union."
-​ Hyponym of "bank" (river): "sandy bank," "rocky bank."
Step 5: Semantic Similarity: Comparing the meanings of the word “bank” in both contexts, we
can say that “bank” (financial institution) is semantically distant or not similar to “bank” (river)

Q.25 Explain Yarowsky bootstrapping approach of semi-supervised learning.


The Yarowsky bootstrapping approach, proposed by David Yarowsky in 1995, is a
semi-supervised learning technique primarily used for tasks like Word Sense Disambiguation
(WSD) and other natural language processing problems where labeled data is scarce. It
leverages a small amount of labeled data (seed data) and a large amount of unlabeled data to
iteratively improve model accuracy.
By iteratively labeling unlabeled data using these principles and refining the model, Yarowsky’s
approach bootstraps itself to achieve high accuracy.

Steps in Yarowsky Bootstrapping:


Step 1: Initialization: Start with a small set of manually labeled examples for different classes,
this is also called Seed data. Along with this Seed data, also gather a large corpus of unlabeled
data.
Step 2: Training the model: Train a simple classifier on the labeled (Seed data) data. Use
features like collocations, surrounding/neighboring words and word frequency to make
predictions.
Step 3: Labeling Unlabeled Data: Use the classifier to predict labels for the unlabeled data.
Select the most confident predictions based on the classifier’s output, which are usually
high-probability predictions.
Step 4: Augmenting the Training Set: Add the confidently labeled examples to the training
set, treating them as ground truth.
Step 5: Iterative Refinement: Repeat the process of training, labeling and augmenting until
convergence or a performance threshold is reached.

Example: WSD for “Bass”


Task: Disambiguate the word “bass” into two senses using Yarowsky’s Bootstrapping
-​ Sense 1: Fish (e.g “The Largemouth Bass swims in the lake”)
-​ Sense 2: Instrument (e.g “He played the bass guitar”)

Seed Data or Labeled Data:


-​ Fish: [“swims”,”lake”,”catch”]
-​ Instrument : [“guitar”,”speakers”,”played”]

Unlabeled Data: Sentences containing the word “bass” without labels

Steps:
-​ Train a classifier on the seed data using collocations like “swims” or “guitar”
-​ Apply the classifier to unlabeled sentences like,​
1. “He caught a smallmouth bass in the river” → Label: Fish (confident)
2. “The bass player is amazing” → Label: Instrument (confident)
-​ Add these confidently labeled examples to the training set
-​ Retrain the classifier with this new augmented dataset
-​ Repeat until the performance threshold is met
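A highly simplified, hypothetical sketch of that loop for the “bass” example: sentences are labeled from the seed collocations, only confident (non-zero, non-tied) decisions are kept, and the seed sets grow on each iteration:

```python
seeds = {"fish": {"swims", "lake", "catch"},
         "instrument": {"guitar", "speakers", "played"}}

unlabeled = ["a largemouth bass swims in the lake",
             "he played the bass guitar on stage",
             "we tried to catch a bass near the shore"]

for _ in range(3):                                            # a few bootstrapping iterations
    newly_labeled = []
    for sent in unlabeled:
        words = set(sent.split())
        scores = {sense: len(words & cues) for sense, cues in seeds.items()}
        best = max(scores, key=scores.get)
        confident = scores[best] > 0 and list(scores.values()).count(scores[best]) == 1
        if confident:
            seeds[best] |= words - {"bass"}                   # augment the training evidence
            newly_labeled.append(sent)
    unlabeled = [s for s in unlabeled if s not in newly_labeled]

print(sorted(seeds["fish"] & {"largemouth", "shore"}))        # collocations learned from confident labels
```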

Mod. 5
Q.26 Explain anaphora Resolution using Hobbs and Centering algorithm [VIMP]
OR Explain Hobbs algorithm for pronoun resolution
OR Explain Anaphora Resolution using Hobbs and Centering algorithm.
Hobbs' Algorithm is a method used in computational linguistics to resolve pronoun references,
specifically anaphora (where a pronoun refers back to a previously mentioned entity). This
algorithm attempts to identify the noun phrase (NP) that a pronoun refers to by searching
through the syntactic structure of a sentence. It is a syntactic, rule-based approach.

Steps of Hobbs' Algorithm:

1.​ Parse the sentence: Begin with a syntactic parse tree of the sentence that contains the
pronoun.
2.​ Start at the pronoun: Identify the pronoun for which we need to find the antecedent.
3.​ Climb up the tree: Starting from the pronoun, move up the tree to the nearest NP or S
(sentence) node.
4.​ Search the left branches: After reaching the NP or S node, search its left siblings (if
any) for an NP that could be an antecedent.
5.​ Climb to the next level: If no antecedent is found at the current level, move up the tree
to the next NP or S node and repeat the process, checking the left siblings.
6.​ Search previous sentences (if needed): If no antecedent is found in the current
sentence, the algorithm can move to previous sentences and search there.

Example: Consider the following two sentences:

1.​ John saw a dog. He liked it.

We want to determine what "He" refers to.

Step-by-Step Application:

1.​ Parse the sentence: The sentence is parsed syntactically:​


"John saw a dog" is one sentence (S1), and "He liked it" is the second sentence (S2).
2.​ Start at the pronoun "He": The pronoun “He” is located in the second sentence (S2).
3.​ Climb up the tree to the nearest NP or S: From "He", we move up to the sentence
(S2) node.
4.​ Search the left siblings: There are no left siblings of "He" in this sentence.
5.​ Climb up the tree to the next level (previous sentence): Since there is no match
within S2, we move to S1.
6.​ Search the previous sentence (S1): In S1, the NP "John" is found, and "He" can be
resolved as referring to "John."

Result: Hobbs' Algorithm resolves the pronoun "He" to refer to "John."

The Centering Algorithm is a framework used in computational linguistics and discourse


analysis to model the flow of attention and coherence within a discourse. It helps determine
how pronouns and noun phrases are interpreted in a sequence of sentences and aims to
predict which entities (objects or people) will remain the focus of the conversation.

The algorithm was proposed as part of Centering Theory, which focuses on the role of
discourse entities (called "centers") in maintaining coherence between sentences.

Key Concepts in Centering Theory

1.​ Forward-looking centers (Cf): These are potential entities (noun phrases) in a
sentence that could be referred to in subsequent sentences. They are ranked by
salience (importance), usually based on syntactic roles like subject, object, etc.
2.​ Backward-looking center (Cb): This is the most salient entity in the current sentence
that refers back to an entity in the previous sentence. It is the entity that links the current
sentence to the previous one, creating coherence.
3.​ Preferred center (Cp): The highest-ranked forward-looking center (the most salient
entity in the current sentence) that is expected to become the backward-looking center in
the next sentence.
4.​ Transition Types: Transitions describe the relationship between sentences based on
whether the backward-looking center and preferred center stay the same across
sentences. There are four types:
○​ Continue: The current backward-looking center (Cb) matches the previous
backward-looking center, and the preferred center remains the same.
○​ Retain: The backward-looking center stays the same, but the preferred center
changes.
○​ Smooth Shift: The backward-looking center changes, and the new sentence's
preferred center becomes the new backward-looking center.
○​ Rough Shift: The backward-looking center and preferred center both change,
leading to a potential disruption in coherence.

Steps of the Centering Algorithm

1.​ Identify Cf: Extract all the possible entities (noun phrases) from the current sentence
that could be referred to in future sentences.
2.​ Determine Cb: Identify the backward-looking center, i.e., the most salient entity in the
current sentence that refers back to a previous entity.
3.​ Rank the centers: Rank the forward-looking centers (Cf) based on their syntactic roles
and salience (subject > object > others).
4.​ Compute the transition type: Based on how the current Cb and Cp relate to the
previous sentence's Cb and Cp, determine the transition type (Continue, Retain, Smooth
Shift, or Rough Shift).

Example: Consider the following discourse:

Sentence 1: John went to the park.​


Sentence 2: He saw a dog.​
Sentence 3: It was barking loudly.

Step-by-Step Application

1.​ Sentence 1:
○​ Cf: {John}
○​ Cb: None (there's no previous sentence, so no backward-looking center)
○​ Cp: John (he is the subject)
2.​ Sentence 2:
○​ Cf: {John, a dog} (both John and the dog are potential forward-looking centers)
○​ Cb: John (refers back to "John" in the previous sentence)
○​ Cp: John (since "He" refers to John, John remains the most salient entity)
3.​ Transition Type: Continue (John remains the Cb and Cp).
4.​ Sentence 3:
○​ Cf: {the dog}
○​ Cb: the dog (refers back to "a dog" in the previous sentence)
○​ Cp: the dog (the most salient entity in this sentence)
5.​ Transition Type: Smooth Shift (Cb changes from "John" to "the dog," and the Cp is
now the dog).

Explanation:

●​ From Sentence 1 to Sentence 2, the backward-looking center (Cb) is "John," and he


remains the focus in both sentences, leading to a Continue transition.
●​ In Sentence 3, the backward-looking center shifts from "John" to "the dog," creating a
Smooth Shift in attention.

Q.27 Explain the three types of referents that complicate the reference resolution
problem.
A natural language expression used to perform reference is called a referring expression and
the entity that is referred to is called the referent.
The three types of referents that complicate the reference solution problem are,
1.​ Inferrables: In certain instances, a referring expression refers to an entity that has been
implied rather than one that has been expressed in the text. Inferrables are referents not
explicitly mentioned in the text but can be inferred based on world knowledge or context.
Example: “John bought a car. The engine is very powerful”
“The engine” refers to the engine of the car, which is not explicitly mentioned but inferred
from the mention of the car in the previous statement.
2.​ Discontinuous Sets: Discontinuous sets occur when the reference refers to a group of
entities that are not explicitly grouped together in the text. The resolution system must
combine scattered mentions across the discourse to resolve the reference to the
appropriate set.
Example: “Sanil met Rugved in the park. Later, Rohit joined them, and all three went to a
cafe.”​
Here, “them” refers to Sanil and Rugved, whereas “All three” refers to Sanil, Rugved and
Rohit.
3.​ Generics: Generics are referents that refer to a general class or category rather than a
specific instance. The resolution system must determine whether the reference is
generic or specific, which can be ambiguous depending on the context.
Example: “Dogs are loyal animals” Here, “Dogs” refers to the general category of dogs
(Generic).
“The dog barked loudly.” Here, “The dog” refers to a specific animal (Not Generic)

Q.28 What is reference resolution? & Explain Discourse reference resolution in detail
Reference resolution in NLP is the process of identifying what a word, phrase, or pronoun in a
sentence refers to. It involves linking mentions (e.g., pronouns like he, she, it, or noun phrases)
to their corresponding entities or concepts within the text.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
Discourse Reference resolution focuses on resolving references that span across multiple
sentences or utterances in a discourse. Unlike resolving references within a single sentence,
discourse-level resolution requires understanding the context of multiple sentences and
maintaining a mental representation of the entities discussed.
Types of References in Discourse:
-​ Anaphora: When a pronoun or noun phrase refers back to an earlier entity.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
-​ Cataphora: When a pronoun or noun phrase refers to an entity mentioned later in the
text.
Example: “Before he spoke, John cleared his throat.” Here, “he” refers to John which is
mentioned in the next sentence.
-​ Exophora: When the reference is to something outside the text (requires external
knowledge).
Example: “Look at that!” Here, “that” refers to something visible in the environment, not
in the text.
Applications of Discourse reference resolution:
-​ Chatbots: Understanding pronouns and references in user queries.
-​ Information Extraction: Linking mentions to extract coherent facts.
-​ Summarization: Maintaining entity continuity in summaries
Example of Discourse Reference Resolution: “Sanil visited the museum. He found it
fascinating. The ancient weapons were extraordinary.”
Mentions: Sanil, the museum, He, it, The ancient weapons
Entity Links:
-​ He → Sanil
-​ It → The museum
-​ The ancient weapons → inferred to be part of the museum

Q.29 Illustrate the reference phenomena for solving the pronoun problem.
The pronoun problem refers to identifying the entity (referred to as the referent) that a pronoun
(such as he, she, it, they) refers to in a given text. This process is a key part of reference
resolution in NLP. To solve this problem, we rely on understanding reference phenomena,
which are linguistic and contextual cues that help us link pronouns to their antecedents.
1.​ Coreference: Coreference occurs when a pronoun refers to a noun phrase (antecedent)
within the same sentence or nearby sentences. Use syntactic proximity (the closest
noun) and semantic compatibility (gender, number agreement) to resolve the
pronouns.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
2. Anaphora: (example given in the previous answer) Solution: Look for antecedents in the preceding text
using sentence structures and dependencies.
3. Cataphora: (example given in the previous answer) Solution: Post-process the sentence to identify nouns
following the pronoun as potential antecedents.
4. Exophora: (example given in the previous answer) Solution: Requires contextual understanding and often
involves integrating real-world knowledge or extra inputs like images.
By understanding and applying reference phenomena, NLP systems can improve the resolution
of pronouns and enhance tasks such as chatbots, machine translation, and text summarization.
Example of Reference Phenomena solving a Pronoun Problem:
“Pihu saw a boy. He was crying”
-​ Identify Pronouns: “He”
-​ Search for Antecedents: Nouns in the context are, “Pihu” and “a boy”
-​ Filter by Compatibility: Match gender, number and role in the sentence. “He” matches
with “a boy” (singular, male)
-​ Resolve: Assign “He” = “a boy”

Q.30 What are the five types of referring expressions ? Explain with the help of an
example.
1.​ Indefinite Noun Phrases: These refer to entities that are being introduced for the first
time in a discourse and are not previously known to the listener/reader. They are often
accompanied by indefinite articles like a or an.
Example: “A man entered the room” Here, “A man” introduces an entity for the first time.
2.​ Definite Noun Phrases: These refer to entities that are already known or can be
uniquely identified within the discourse or context. They are accompanied by the definite
article the.
Example: “The man sat down.” Here, “The man” refers to a specific, identifiable man,
often one introduced earlier or assumed to be known.
3.​ Pronouns: Pronouns are referring expressions that replace nouns and usually refer to
entities mentioned earlier in the discourse. They include he, she, it, they, etc.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
4.​ Demonstratives: Demonstratives specify entities relative to the speaker’s perspective
and are often used with gestures in spoken language. They include this, that, these,
those.
Example: “I want that book, not this one.” Here, “that book” refers to a specific book,
likely pointed to or previously mentioned.
5.​ Names: Names are proper nouns used to refer to specific entities without ambiguity
within the given context.
Example: “Sanil called Freddy yesterday.” Here, “Sanil” and “Freddy” are specific
individuals, and their names uniquely identify them in the discourse.

Mod. 6
Q.32 Demonstrate the working of Machine translation systems
Machine translation (MT) refers to the automatic conversion of text from one language to
another using computational algorithms. It plays a crucial role in breaking language barriers,
making information accessible globally.
Write about the Approaches in Machine Translation (next answer) →
Q.33 Explain Machine translation approaches used in NLP [VIMP]
OR What is rule based machine translation ?
OR Explain the statistical approach for machine translation.
1. Rule-Based MT: The earliest commercial machine translation systems were rule-based
machine translation systems, also called RBMTs, which are based on linguistic
principles or rules that permit words to be placed in many contexts and to have various
meanings.
These rules are created by programmers and human language experts who have put a
lot of work into understanding and mapping the rules between two languages. This
allows users to update and improve the translation, although these handcrafted rules
can be challenging to maintain for large-scale translations.
Workflow Example: “I am eating an apple”
Lexical mapping: Match words to their equivalents in the target language
I → मैं , “am eating” →खा रहा हूं , “an apple” →सेब
Grammar Transformation: Adjust word order based on target language’s grammar.
Final Output = “मैं सेब खा रहा हूं”

2.​ Statistical MT: SMT or Statistical Machine Translation bases translations on statistical
models, the parameters of which are generated from analysis of large bilingual text
corpora. Utilizes probabilities to determine the most likely translation.
The collection of a sizable and organized group of writings written in two different
languages is referred to as “bilingual text corpus”. To create the statistical models,
supervised and unsupervised machine learning techniques are employed.
Workflow Example: “The cat sat on the mat”
Analyze Parallel Corpora:
English: “The cat sat on the mat”
Hindi: “बिल्ली कालीन पर बैठ गयी.”
Compute probabilities for word mapping:
Cat → बिल्ली (high probability), Mat → कालीन (high probability)
Reconstruct the sentence using high-probability phrases:
Final Output = “बिल्ली कालीन पर बैठ गयी.”

3. Neural MT: Based on deep learning and artificial neural networks.
Uses an encoder-decoder architecture in which the Encoder processes the source language sentence and encodes it into a fixed-size vector, and the Decoder then decodes that vector to generate the target language sentence.
Workflow Example: “How are you?”
The Encoder converts the sentence into a sequence of embeddings, and the Decoder generates the output word by word from these embeddings.
Final Output = “आप कैसे हैं?”

4. Hybrid MT: Hybrid MT combines two or more approaches, typically Rule-Based MT (RBMT) with Statistical MT (SMT) or Neural MT (NMT). It uses the linguistic knowledge of RBMT to handle grammar and the data-driven strengths of SMT or NMT to produce fluent translations.
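The rule-based idea from item 1 can be sketched in a few lines of Python; the lexicon and the single SVO-to-SOV reordering rule below are illustrative assumptions, not a real grammar:

```python
# Toy rule-based MT: lexical mapping followed by a reordering rule.
# The lexicon and the reordering rule are illustrative assumptions only.
LEXICON = {
    "i": "मैं",
    "am eating": "खा रहा हूं",
    "an apple": "सेब",
}

def translate_rule_based(sentence):
    tokens = sentence.lower().rstrip(".").split()
    # Lexical mapping: greedy longest-match lookup (multi-word entries first).
    mapped = []
    i = 0
    while i < len(tokens):
        for span in (3, 2, 1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in LEXICON:
                mapped.append(LEXICON[phrase])
                i += span
                break
        else:
            mapped.append(tokens[i])  # pass unknown words through unchanged
            i += 1
    # Grammar transformation: English SVO -> Hindi SOV
    # (assumes a simple three-chunk Subject-Verb-Object sentence).
    subject, verb, obj = mapped
    return f"{subject} {obj} {verb}"

print(translate_rule_based("I am eating an apple"))  # मैं सेब खा रहा हूं
```

By contrast, a statistical system would choose, among candidate outputs, the translation that maximizes P(target | source) estimated from a parallel corpus, and a neural system learns the whole mapping end to end with an encoder-decoder network.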
Q.34 Explain the different steps in text processing for Information Retrieval [VIMP]
Text processing is a crucial phase in information retrieval that transforms raw text into a
structured format for efficient indexing and retrieval. Steps in Text Pre-Processing:
1.​ Tokenization: Split the text into individual units (tokens) such as words, phrases, or
symbols. Break sentences into words by identifying delimiters (e.g., spaces,
punctuation).​
Input: “This is a sentence”
Output: [this, is, a, sentence]
2. Normalization: Standardize text to ensure consistency. Normalization typically converts all text to lowercase, removes special characters and punctuation, and removes stop words.
Input: “My name is Sanil”
Output: “Sanil” is lowercased to “sanil”, and the stop words “my” and “is” are removed,
Final Output = “name sanil”
3. Stemming and Lemmatization: Reduce words to their base or root form for consistency.
-	Stemming: A natural language processing technique that reduces inflected words to their root form (stem), usually by chopping off suffixes; it aids in the preprocessing of text, words and documents.
running, runs → run, leaves → leav (the stem MAY NOT be a valid dictionary word)
-	Lemmatization: Groups the different inflected forms of a word into its base dictionary form (the lemma), which carries the same meaning; it removes inflectional suffixes and prefixes using vocabulary and morphological analysis.
better → good, leaves → leaf (the lemma IS ALWAYS a valid dictionary word)
4.​ N-grams Creation: Generate sequences of n tokens to capture context and multi-word
phrases. N-gram could be of types,
-​ Unigram: Single Words, Example: “Leave me alone” → [leave, me, alone]
-​ Bigram: Pairs of consecutive words, ​
Example: “Leave me alone please” → [leave me, me alone, alone please]
-​ Trigram: Triplets of words,​
Example: ”Leave me alone please” → [leave me alone, me alone please]
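The four steps can be chained together in a short script; the sketch below assumes NLTK with the punkt, stopwords and wordnet resources downloaded, but any comparable toolkit would do:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

text = "The cats are running in the garden."

# 1. Tokenization
tokens = nltk.word_tokenize(text)

# 2. Normalization: lowercase, drop punctuation and stop words
stop_words = set(stopwords.words("english"))
normalized = [t.lower() for t in tokens
              if t.isalpha() and t.lower() not in stop_words]

# 3. Stemming and lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in normalized]
lemmas = [lemmatizer.lemmatize(t) for t in normalized]

# 4. N-gram creation (bigrams here)
bigrams = list(ngrams(normalized, 2))

print(normalized)  # ['cats', 'running', 'garden']
print(stems)       # ['cat', 'run', 'garden']
print(lemmas)      # ['cat', 'running', 'garden'] (default noun POS)
print(bigrams)     # [('cats', 'running'), ('running', 'garden')]
```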

Q.35 Explain the Information Retrieval system
OR Explain Information Retrieval versus Information extraction systems [VIMP]
An Information Retrieval System is a framework that locates, retrieves, and presents relevant
information from large collections of data in response to user queries. It plays a critical role in
applications like search engines, document retrieval, and recommendation systems.
Steps in IR:
Step 1: Document Processing: Text from documents is tokenized and indexed for efficient
retrieval. Stop words are removed, and stemming or lemmatization is applied to normalize
words
Step 2: Query Processing: The user's query undergoes preprocessing similar to the
documents. The processed query is compared to the indexed terms.
Step 3: Matching and Ranking: Documents are scored for relevance using techniques like
probabilistic models. Results are ranked and returned in descending order of the confidence
score or relevance.
Step 4: Presentation: The retrieved documents or snippets are displayed to the user for
selection.
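A compact way to illustrate Steps 1-3 is TF-IDF weighting with cosine-similarity ranking. The sketch below uses scikit-learn and is only one possible implementation; production systems typically use inverted indexes and ranking functions such as BM25:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine translation converts text between languages.",
    "Information retrieval finds relevant documents for a query.",
    "Cats are popular pets around the world.",
]
query = "retrieve relevant documents"

# Document and query processing: tokenize, lowercase, drop stop words,
# and weight terms by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Matching and ranking: score each document by cosine similarity to the query
# and present results in descending order of relevance.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```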

{G-FOTU-E}
| Aspect | Information Retrieval | Information Extraction |
| --- | --- | --- |
| Goal | To find relevant documents or data sources for a query. | To extract specific information or facts from text. |
| Focus | Focuses on matching queries to documents. | Focuses on understanding and extracting meaning. |
| Output | Entire documents, ranked by relevance. | Returns facts out of documents. |
| Techniques used | Search algorithms, relevance scoring, keyword matching. | Natural Language Processing (NLP), Named Entity Recognition. |
| Use Cases | Finding research papers or web pages on a topic. | Extracting dates, amounts, or names from financial reports. |
| Example | A Google search for "climate change impacts" returns a list of web pages that contain information about "climate change". | Extracting the names of all individuals mentioned in a news article, along with their roles and the events they are associated with. |
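To make the contrast concrete, here is a tiny information-extraction sketch that pulls dates and amounts out of a report sentence with regular expressions; real IE systems use trained NER models rather than hand-written patterns, and the sentence and company name are invented for illustration:

```python
import re

report = ("On 12 March 2024, Acme Corp reported revenue of $4.2 million "
          "and profit of $0.9 million.")

# Hand-written patterns standing in for a trained extractor (illustrative only)
dates = re.findall(r"\d{1,2} [A-Z][a-z]+ \d{4}", report)
amounts = re.findall(r"\$\d+(?:\.\d+)? (?:million|billion)", report)

print(dates)    # ['12 March 2024']
print(amounts)  # ['$4.2 million', '$0.9 million']
```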

Q.36 Explain Question Answering System (QAS) in detail.
A Question-Answering (QA) system in Natural Language Processing (NLP) is an AI system
designed to automatically answer questions posed by users in natural language. These systems
retrieve relevant information from a dataset or knowledge base and provide a concise and
accurate answer, often with the help of NLP techniques to understand the question's intent and
context.

Steps in QA System:

Step 1: Question Processing
The system analyzes the question to identify its type (e.g., factual, descriptive, yes/no) and the expected answer format (e.g., name, date, location). NLP techniques such as Named Entity Recognition (NER), part-of-speech tagging, and dependency parsing help in understanding the structure of the question.
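As an illustration of question processing, the sketch below runs spaCy over a question (assuming the en_core_web_sm model is installed); the expected-answer-type rule at the end is a simplified assumption, not part of spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
question = "Who won the Nobel Prize in Physics in 2023?"
doc = nlp(question)

# Part-of-speech tags and named entities found in the question
print([(token.text, token.pos_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])

# Crude expected-answer-type rule based on the question word (assumption)
first = doc[0].text.lower()
answer_type = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}.get(first, "OTHER")
print("Expected answer type:", answer_type)  # Expected answer type: PERSON
```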
Step 2: Information Retrieval​
The system searches a knowledge base or corpus to retrieve relevant documents or passages
that may contain the answer. This step often involves semantic search or keyword-based
search methods.

Step 3: Answer Extraction
Once relevant documents or passages are identified, the system extracts the most probable answer using techniques such as text span detection, machine reading comprehension, and deep learning models (e.g., BERT, GPT). The system may look for the exact phrase or a contextually accurate sentence that best answers the question.

Step 4: Answer Ranking
If multiple possible answers are found, the system ranks them based on their relevance and confidence score. Higher-ranked answers are more likely to be accurate.

Types of QA Systems:

1.​ Fact-Based QA Systems: Designed to answer specific factual questions with definite
answers, such as "What is the capital of France?" (Answer: Paris).
2.​ Open-Domain QA Systems: Capable of answering a broad range of questions across
various domains, often used in search engines or virtual assistants.
3.​ Closed-Domain QA Systems: Focused on answering questions from a specific field,
such as medicine or law, where in-depth knowledge of that field is required.
4.​ Generative QA Systems: These systems generate answers based on understanding
the context, like summarizing complex information or explaining a topic.
5.​ Extractive QA Systems: These systems extract exact answers from a given text or
dataset, such as finding specific words, phrases, or numbers.
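A minimal extractive QA sketch using the Hugging Face transformers pipeline is shown below; the model name is one commonly used public checkpoint and is an assumption here, and any extractive QA model would work:

```python
from transformers import pipeline

# Extractive QA: the answer is a span copied out of the supplied context.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The 2023 Nobel Prize in Physics was awarded to Pierre Agostini, "
    "Ferenc Krausz and Anne L'Huillier for their work on attosecond pulses of light."
)
result = qa(question="Who won the Nobel Prize in Physics in 2023?", context=context)

print(result["answer"], result["score"])
```

The returned answer is a text span taken verbatim from the context, together with a confidence score, which is exactly the extractive behaviour described above.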

Example Workflow:

●	Question: "Who won the Nobel Prize in Physics in 2023?"
●​ Question Processing: The system identifies this as a factual question about a person
(entity type: Nobel Prize winner).
●​ Information Retrieval: The system searches relevant sources, such as news articles or
encyclopedic databases, for mentions of Nobel Prize winners in 2023.
●​ Answer Extraction: It detects the correct answer from the retrieved sources (e.g., "John
Doe") and presents it to the user.
●​ Answer Presentation: The system responds: "John Doe won the Nobel Prize in
Physics in 2023."