Sem VII Natural Language Processing
Mod. 1
Q.1 Explain the various stages of Natural Language processing [VIMP]
There are 5 stages in Natural Language Processing :-
Lexical Analysis: First stage in NLP, also known as Morphological analysis. At this stage the
structure of the words is identified and analyzed. Involves breaking down the text into tokens,
which are the smallest units of meaning, such as words or phrases.
Example: “I love cats” is broken down into tokens like [“I”, “love”, “cats”]
Syntactic Analysis: Involves analyzing the syntax or structure of the sentence according to
grammatical rules. Checks the sentence for proper syntax and creates a parse tree that
represents the grammatical structure.
Example: Creates a parse tree for “I love cats” where “I” is the Subject/Pronoun, “love” as the
Verb and “cats” as the Object/Noun
Semantic Analysis: Focuses on understanding the meaning of the text by interpreting the
meanings of individual words and how they combine to form meaningful phrases or sentences.
Example: For “I love cats”, semantic analysis would determine that “love” expresses a positive
emotion and the speaker has a positive feeling towards the animal cat.
Discourse Integration: Involves understanding the context and the relationship between
sentences to derive the meaning from the entire text, rather than just individual sentences.
Example: For a text with two sentences “I love cats. They are so cute.”, discourse integration
would link the “They” to “cats” and understand that the second sentence is providing additional
information about the “cats” in the first sentence.
Pragmatic Analysis: Deals with understanding the text in its real-world context; the text is re-interpreted to determine what it truly means, taking the speaker's intentions into account, which requires real-world knowledge.
Example: In “John saw Mary in a garden with a cat”, we cannot tell whether John or Mary is with the cat. Similarly, “Could you pass me the salt?” isn’t just a question; pragmatic analysis recognizes it as a polite request for someone to pass the salt.
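A minimal sketch of the first two stages on the running example, using NLTK (assumes the 'punkt' and 'averaged_perceptron_tagger' data packages have been downloaded; the exact tags shown are indicative):

import nltk

tokens = nltk.word_tokenize("I love cats")   # lexical analysis: tokenization
print(tokens)                                # ['I', 'love', 'cats']
print(nltk.pos_tag(tokens))                  # syntactic category tags, e.g.
                                             # [('I', 'PRP'), ('love', 'VBP'), ('cats', 'NNS')]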
Q.2 Discuss the challenges in various stages of natural language processing [VIMP]
Contextual words and phrases and homonyms: The same words and phrases can have
diverse meanings according to the context of the sentence and many words have the exact
same pronunciation but completely different meanings.
Example: In “I went to the bank”, it’s unclear whether the person went to a financial institution or a riverbank without further context.
Synonyms: Different words can have similar meanings, but they are not always interchangeable in every context. This can cause contextual misunderstanding when different words are used to express the same idea.
Example: The words “big”, “large” and “huge” are synonyms, but we use “big mistake” instead of
a “large mistake”.
Irony and Sarcasm: Irony and sarcasm often convey meanings that are the opposite of the
literal words used, making it difficult for machines to interpret correctly. Detecting irony and
sarcasm requires an understanding of tone, context, and sometimes cultural or situational
knowledge, which can be tough for NLP systems to identify.
Example: "Oh great, another traffic jam!" is likely sarcastic, meaning the speaker is actually
unhappy, not pleased.
Ambiguity: Occurs when a sentence or phrase can be interpreted in more than one way. These
phrases potentially have two or more possible interpretations.
Example: “The chicken is ready to eat” can mean that the chicken itself is about to eat or that
the chicken is cooked and ready for someone to eat it.
Errors in text or speech: Human communication often contains errors like typos, grammatical
mistakes, or mispronunciations.
Example: "I can’t beleive it!" contains a typo ("beleive" instead of "believe")
Q.3 Explain the ambiguities associated at each level with examples for NLP
Natural language is highly ambiguous: an ambiguity occurs when a language construct can have more than one interpretation, so there is no single well-defined reading. In a language with a sufficiently large grammar, almost any sentence can have an alternative interpretation.
Lexical Ambiguity: A single word can have multiple meanings depending on the context. When a word has multiple possible senses, this is known as lexical ambiguity.
Example: “I saw a bat in the garden”
Ambiguity: Does the bat refer to a flying mammal or a cricket bat
Syntactic Ambiguity: A sentence can have multiple possible structures, leading to different interpretations. Syntactic ambiguity means a sentence can be parsed into more than one syntactic form.
Example: “I saw the man with the telescope”
Ambiguity: Did the man have the telescope, or was the telescope used to see the man?
Semantic Ambiguity: Semantic ambiguity is related to the sentence interpretation. The
meaning of a sentence is unclear due to the relationships between words or phrases.
Example: “Visiting relatives can be annoying”
Ambiguity: Does it mean the act of visiting relatives is annoying, or that relatives who visit are
annoying?
Anaphoric Ambiguity: This kind of ambiguity arises from the use of anaphoric expressions (such as pronouns) in discourse.
Example: “The horse ran up the hill. It was very steep. It was very tired”
Ambiguity: Here the anaphoric reference of “it” can cause ambiguity.
Metonymy Ambiguity: This is the most difficult kind of ambiguity; it deals with phrases in which the literal meaning differs from the figurative assertion. A word or phrase is used to represent something closely related, which can lead to confusion.
Example: “The company is screaming for new management”
Ambiguity: Here it really doesn’t mean that the company is literally screaming
Mod. 2
Q.5 Explain Porter Stemming algorithm in Detail [VVIMP]
Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form, to streamline and standardize words to
enhance the effectiveness of NLP tasks.
The Porter Stemming Algorithm is one of the most widely used stemming algorithms in
Natural Language Processing proposed in 1980. It reduces words to their base or root form
(known as the stem) by systematically removing suffixes. This helps in text normalization,
especially for search and information retrieval tasks.
Steps in the Porter Stemming algorithm:
Step 1: Classify Characters: The algorithm begins by identifying and classifying characters in
the word as vowels or consonants. To understand the structure of the word and decide where to
apply the stemming rules.
Vowels are: a,e,i,o,u
Consonants: Any letter that’s not a vowel, except “y” in some cases
Step 2: Compare to rules: Words are then matched against a set of predefined rules to decide
which part of the word (usually suffixes) should be removed or transformed.
Region definitions:
R1: The part of the word that begins after the first non-vowel, following a vowel.
R2: The part of R1 that begins after the first non-vowel following a vowel.
These regions decide which suffixes to remove based on their location in the word.
Step 3: Remove common suffixes and pluralization: The algorithm removes or modifies
suffixes and handles plural forms systematically.
- Past tenses: “ed”
Examples: “jumped” → “jump”
- Adjective and Adverb suffixes: “ness”,”ful”,”ous”, etc.
Examples: “happiness” → “happi”, “successful” → “success”, “dangerous” → “danger”
- Plural forms: “s”,”es”
Examples: “cats” → “cat” and “foxes” → “fox”
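These suffix rules can be tried quickly with NLTK's implementation of the Porter stemmer; a minimal sketch, assuming NLTK is installed:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumped", "happiness", "successful", "dangerous", "cats", "foxes"]:
    print(word, "->", stemmer.stem(word))
# jumped -> jump, happiness -> happi, successful -> success,
# dangerous -> danger, cats -> cat, foxes -> fox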
FSA for Nouns:
Nouns in English may take singular or plural forms
Pluralization rules:
- Regular: Add -s or -es
(e.g, cat→cats and fox→foxes)
- Irregular: Change internal structure
(e.g., mouse→mice and wolf→wolves)
● Lemmatization
Derivational Morphology:
Importance of Perplexity:
Model Performance: Lower perplexity indicates a better language model, as it means the
model's predictions are closer to the actual sequence of words.
Comparative Metric: Perplexity is used to compare different language models on the same
dataset. A lower perplexity across multiple datasets shows that a model generalizes better.
Evaluation of Probabilities: Perplexity directly evaluates the probability distribution generated
by the model, reflecting its ability to model language.
Applications of Perplexity:
N-gram Models: Perplexity is often used to evaluate n-gram models, with lower perplexity
indicating better performance.
Neural Language Models: Modern models like RNNs, LSTMs, and Transformers are evaluated
using perplexity to measure their fluency in generating text.
Fine-tuning: When fine-tuning models on a specific domain, perplexity helps track improvement
over iterations.
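For an N-word sequence, perplexity is the inverse probability normalized by length, PP(W) = P(w1 … wN)^(-1/N), i.e. the exponential of the negative average log-probability. A minimal sketch of the computation, assuming the per-word conditional probabilities assigned by some bigram model are already available (the numbers below are invented):

import math

# P(w_i | w_{i-1}) values assigned by a hypothetical bigram model
word_probs = [0.2, 0.1, 0.4, 0.25]

n = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / n)          # PP = exp(-(1/N) * sum log P)
print(perplexity)                             # lower is better (~4.73 here)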
Mod. 3
Q.13 What is POS tagging ? Explain various approaches to perform POS tagging and
discuss various challenges faced by POS tagging. [VVIMP]
OR What is the rule-based and stochastic part of speech taggers ?
Challenges in POS tagging:
- Ambiguity: The main problem with POS tagging is ambiguity. In English, many
common words have multiple meanings and hence multiple POS. The job of a POS
tagger is to resolve this ambiguity accurately based on the context of use.
For e.g., the word “shot” in the sentence “He shot a bird” is a Verb, whereas in the sentence “A shot of vodka” it is a Noun (see the tagger output sketched after this list).
- Downstream Propagation: If a POS tagger gives poor accuracy, then this has an
adverse effect on other tasks that follow. This is called downstream propagation. To
improve accuracy, POS tagging is combined with other processing techniques, like
dependency parsing.
- Unigram Tagging: In a statistical approach, one can count the tag frequencies of words
in a tagged corpus and then assign the most probable tag. This is Unigram Tagging, a
much better approach would be Bigram tagging where the tag frequency is counted
while keeping in mind the preceding tag.
- Out-Of Vocabulary words: Unknown words, such as new terms, abbreviations or slang
are difficult to tag.
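As an illustration of how a trained tagger resolves the “shot” ambiguity from context, a small NLTK sketch (assumes the 'punkt' and 'averaged_perceptron_tagger' data packages; the tags in the comments are the expected output):

import nltk

print(nltk.pos_tag(nltk.word_tokenize("He shot a bird")))
# e.g. [('He', 'PRP'), ('shot', 'VBD'), ('a', 'DT'), ('bird', 'NN')]   -> "shot" tagged as verb
print(nltk.pos_tag(nltk.word_tokenize("A shot of vodka")))
# e.g. [('A', 'DT'), ('shot', 'NN'), ('of', 'IN'), ('vodka', 'NN')]    -> "shot" tagged as noun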
Issues/Limitations in HMM:
- Independence Assumptions: Assumes that the probability of a word depends only on
its tag and the probability of a tag depends only on the previous tag. Ignores long-range
dependencies or global sentence context.
- Out-Of-Vocabulary OOV Words: HMM struggles with unknown words since it relies on
predefined probabilities. Cannot handle OOV words.
- Scalability Issues: As the length of the sentence increases, the computation grows
exponentially without optimizations like the Viterbi algorithm.
CRFs are discriminative models that directly model the conditional probability P(Y|X), where:
- X = (x1,x2…xn) is the sequence of the observations (words in a sentence)
- Y = (y1,y2…yn) is the sequence of labels (POS tags, NER labels)
Working of a CRF:
Step 1: Input and Output Representation: Input a sequence of observations, X = (x1,x2…xn)
and output should be a sequence of labels Y = (y1,y2…yn) e.g., POS tags or Entity labels in
NER.
Step 2: Feature Functions: CRFs use feature functions f(X,Y) to capture the relationship
between input observations and output labels. These features can be,
State features: Dependent on current observation xi and current label yi
Transition features: Dependent on current label yi and the previous label yi-1
Step 3: Conditional Probability: Computing conditional probability of a label sequence Y given
the input sequence X is done using P(Y|X)
Step 4: Training: Model learns the weights (lambda k) for the features by maximizing the
conditional likelihood
Step 5: Prediction: The task is to find the sequence Y that maximizes P(Y|X), often solved
using the Viterbi algorithm.
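For reference, the conditional probability used in Steps 3 and 4 has the standard linear-chain CRF form, where Z(X) is a normalization constant obtained by summing the exponential term over all possible label sequences:

P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right)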
Example:
Input Sequence: X = {“Sanil”, “Jadhav”, “was”, “born”, “in”, “Oregon”}
Output Sequence: Y = {B-PER, I-PER,O,O,O,B-LOC} where,
- B-PER: Beginning of a person’s name
- I-PER: Inside a person's name
- O: Outside any named entity or Others
- B-LOC: Beginning of a location name
Q.16 .Explain Parser in NLP also explain or compare top-down and bottom-up approach
of parsing with examples
A parser in Natural Language Processing (NLP) analyzes a sentence's grammatical structure to
identify its syntactic relations. Parsing generates a parse tree or syntax tree, representing the
hierarchical structure of a sentence based on grammar rules.
Types of Parsing Approaches:
1. Top-Down Parsing: Begins with the start symbol and applies grammar rules to derive
the input sentence. Matches rules to input tokens recursively, ensuring each token
corresponds to a grammar rule.
2. Bottom-Up Parsing: Starts with the input tokens and applies grammar rules in reverse to build up the parse tree towards the start symbol. It considers all possible combinations of rules.
Comparison (Top-Down vs Bottom-Up):
- Top-Down: a parsing strategy that first looks at the highest level of the parse tree and works down the parse tree using the rules of grammar. Bottom-Up: a parsing strategy that first looks at the lowest level of the parse tree and works up the parse tree using the rules of grammar.
- Top-Down: uses leftmost derivation. Bottom-Up: uses rightmost derivation (in reverse).
- Top-Down: attempts to find the leftmost derivation for an input string. Bottom-Up: attempts to reduce the input string to the start symbol of the grammar.
- Top-Down: the main decision is to select which production rule to use in order to construct the string. Bottom-Up: the main decision is to select when to use a production rule to reduce the string back to the start symbol.
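Both strategies can be tried on a toy grammar with NLTK, which ships a recursive-descent (top-down) and a shift-reduce (bottom-up) parser; a sketch, with the grammar and sentence invented purely for illustration:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mat'
    V -> 'saw'
""")
sentence = "the cat saw the mat".split()

for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):   # top-down
    print(tree)
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):        # bottom-up
    print(tree)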
Q.17 Explain the use of Probabilistic Context-Free Grammar (PCFG) in NLP with an
example.
Probabilistic Context Free Grammar or PCFG is an extension of Context Free Grammar
where each production is assigned a probability. This probability reflects how likely a particular
rule is to be applied, enabling PCFGs to handle ambiguities in NLP.
Now, as an example, derive “The cat sat”:
Step 1: Start with S → NP VP (P=0.9)
Step 2: Expand NP → Det N (P=0.8)
Step 3: Assign terminals Det → “the” (P=1.0) and N→”cat”( P=0.5)
Step 4: Expand VP → V (P=0.3) and assign V → “sat” (P=1.0)
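Multiplying the rule probabilities along this derivation gives the parse probability: 0.9 × 0.8 × 1.0 × 0.5 × 0.3 × 1.0 = 0.108. A small NLTK sketch of the same idea; the extra rules (S → VP, NP → N, VP → V NP, N → 'dog') are invented only so that each left-hand side's probabilities sum to 1:

import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP [0.9] | VP [0.1]
    NP -> Det N [0.8] | N [0.2]
    VP -> V [0.3] | V NP [0.7]
    Det -> 'the' [1.0]
    N  -> 'cat' [0.5] | 'dog' [0.5]
    V  -> 'sat' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse(['the', 'cat', 'sat']):
    print(tree)          # most probable parse tree
    print(tree.prob())   # 0.108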
Q.18 For a given grammar using CYK or CKY algorithm parse the statement “The man
read this book”.
The CYK algorithm (Cocke-Younger-Kasami) is a bottom-up parsing method used to check if
a given string can be generated by a Context-Free Grammar (CFG) in Chomsky Normal Form
(CNF). It also produces the possible parse trees.
Filling the CYK chart bottom-up:
- Word level: the → Det, man → N, read → V, this → Det, book → N
- Length-2 spans: “The man” → NP (Det + N), “this book” → NP (Det + N)
- Longer spans: “read this book” → VP (V + NP)
- Full sentence: “The man read this book” → S (NP + VP), so the sentence is accepted by the grammar.
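A compact Python sketch of the CYK recognizer on this sentence; the CNF grammar and lexicon below are an assumption, since the question's actual grammar is not reproduced in these notes:

from itertools import product

grammar = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
lexicon = {"the": {"Det"}, "man": {"N"}, "read": {"V"},
           "this": {"Det"}, "book": {"N"}}

def cyk(words):
    n = len(words)
    # table[i][j] = set of non-terminals that derive words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(lexicon.get(w.lower(), set()))
    for span in range(2, n + 1):                       # span length
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                      # split point
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= grammar.get((left, right), set())
    return "S" in table[0][n - 1]

print(cyk("The man read this book".split()))           # True -> sentence accepted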
Q.20 Compute the emission and transition probabilities for a bigram HMM. Also decode
the following sentence using the Viterbi algorithm.
XXX
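Since the question's data is not reproduced above, here is a generic sketch of Viterbi decoding for a bigram HMM; the tag set, the transition/emission probabilities and the sentence are invented placeholders, not the question's actual data:

import math

tags = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}                               # P(tag | <s>)
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p  = {"N": {"fish": 0.4, "sleep": 0.1},
           "V": {"fish": 0.2, "sleep": 0.5}}

def viterbi(words):
    # best[i][t] = (log-prob of best tag path ending in tag t at position i, backpointer)
    best = [{t: (math.log(start_p[t] * emit_p[t].get(words[0], 1e-12)), None)
             for t in tags}]
    for word in words[1:]:
        layer = {}
        for t in tags:
            layer[t] = max(
                (best[-1][pt][0] + math.log(trans_p[pt][t] * emit_p[t].get(word, 1e-12)), pt)
                for pt in tags)
        best.append(layer)
    tag = max(tags, key=lambda t: best[-1][t][0])            # best final tag
    path = [tag]
    for layer in reversed(best[1:]):                         # follow backpointers
        tag = layer[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["fish", "sleep"]))                            # -> ['N', 'V']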
Mod. 4
Q.21 What do you mean by Word Sense Disambiguation (WSD) ? Discuss
dictionary-based approach for WSD. [VIMP]
OR Explain the Lesk algorithm for WSD
OR Explain dictionary-based approach for WSD with a suitable example.
A core problem for NLP systems is to identify words properly and to determine the specific sense in which a word is used in a particular sentence. WSD resolves this ambiguity by determining the intended meaning of the same word when it is used in different contexts.
Word Sense Disambiguation (WSD) is the task in Natural Language Processing (NLP) of
identifying the correct meaning of a word in context when the word has multiple possible senses
(meanings). It is essential for understanding natural language, machine translation, information
retrieval, and question answering.
Applications of WSD:
- Machine Translation: Ensuring the correct sense is translated.
- Information Retrieval: Improving search engine relevance by distinguishing senses.
- Text Summarization: Extracting accurate summaries by resolving ambiguities.
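A dictionary-based (simplified Lesk) disambiguation can be sketched with NLTK's built-in lesk function, which picks the WordNet sense whose dictionary gloss overlaps most with the context words (assumes the 'punkt' and 'wordnet' data packages have been downloaded):

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank", pos="n")   # best-matching noun sense
print(sense, "-", sense.definition())
# e.g. Synset('savings_bank.n.02') and its WordNet gloss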
Q.22 Explain with suitable examples following relationships between word meanings.
Synonymy, Antonymy, Homonymy, Hypernymy & Hyponymy and Polysemy [VIMP]
1. Synonymy: Synonymy refers to words that have similar or identical meanings. It is a
relation between two lexical items having different forms but expressing the same or a
close meaning.
Example:
- Big and Large: He lives in a big house vs. He lives in a large house
- Happy and Joyful: She feels happy vs. She feels joyful
2. Antonymy: Antonymy refers to words with opposite meanings. There are three types of
antonyms:
- Gradable: Represent a scale or spectrum
Example: This coffee is cold vs This coffee is hot
- Complementary: Mutually exclusive
Example: The switch is on vs The switch is off
- Relational: Depend on the context
Example: She lent me her book vs I borrowed her book
3. Homonymy: Homonymy occurs when words share the same spelling or pronunciation
but have entirely different meanings.
Example: I can see the sea.
4. Hypernymy and Hyponymy: Hypernymy represents a general category, while
hyponymy refers to specific instances within that category.
Hypernym is the General term and Hyponym is a Specific term.
Example:
- Hypernym: Animal
- Hyponyms: Dog, Cat, Elephant
5. Polysemy: Polysemy refers to a single word having multiple related meanings.
Example: Run
- He runs fast. (physical activity)
- The program is running. (operation)
- She runs a company. (manages)
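Several of these relations can be looked up programmatically in WordNet; a minimal NLTK sketch (assumes the 'wordnet' corpus has been downloaded):

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]            # first noun sense of "dog"
print(dog.lemma_names())              # synonyms grouped in this sense
print(dog.hypernyms())                # hypernyms, e.g. the 'canine' category
print(dog.hyponyms()[:3])             # a few more specific kinds of dog

good = wn.synsets("good", pos="a")[0]
print(good.lemmas()[0].antonyms())    # antonym lemmas of this adjective sense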
Step 1: Word Identification: Identifying all the tokens in the sentence, “The”, “bank”, “is”, “on”,
“the”, “river”, “bank.”
Step 2: Word Sense Disambiguation: The word “bank” appears twice. Its meaning depends
on the context of the sentence:
- “bank” (financial institution) for the first occurrence
- “bank” (side of a river) for the second occurrence
Step 3: Synonymy and Antonymy: Synonym for the second "bank" could be "shore." Antonym
for the first "bank" could be "debt" (in a financial context).
Step 4: Hypernymy and Hyponymy:
- Hypernym of "bank" (financial): "organization."
- Hypernym of "bank" (river): "geographical feature."
- Hyponym of "bank" (financial): "savings bank," "credit union."
- Hyponym of "bank" (river): "sandy bank," "rocky bank."
Step 5: Semantic Similarity: Comparing the meanings of the word “bank” in both contexts, we
can say that “bank” (financial institution) is semantically distant or not similar to “bank” (river)
Steps:
- Train a classifier on the seed data using collocations like “swims” or “guitar”
- Apply the classifier to unlabeled sentences like,
1. “He caught a smallmouth bass in the river” → Label: Fish (confident)
2. “The bass player is amazing” → Label: Instrument (confident)
- Add these confidently labeled examples to the training set
- Retrain the classifier with this new augmented dataset
- Repeat until the performance threshold is met
Mod. 5
Q.26 Explain anaphora Resolution using Hobbs and Centering algorithm [VIMP]
OR Explain Hobbs algorithm for pronoun resolution
OR Explain Anaphora Resolution using Hobbs and Centering algorithm.
Hobbs' Algorithm is a method used in computational linguistics to resolve pronoun references,
specifically anaphora (where a pronoun refers back to a previously mentioned entity). This
algorithm attempts to identify the noun phrase (NP) that a pronoun refers to by searching
through the syntactic structure of a sentence. It is a syntactic, rule-based approach.
1. Parse the sentence: Begin with a syntactic parse tree of the sentence that contains the
pronoun.
2. Start at the pronoun: Identify the pronoun for which we need to find the antecedent.
3. Climb up the tree: Starting from the pronoun, move up the tree to the nearest NP or S
(sentence) node.
4. Search the left branches: After reaching the NP or S node, search its left siblings (if
any) for an NP that could be an antecedent.
5. Climb to the next level: If no antecedent is found at the current level, move up the tree
to the next NP or S node and repeat the process, checking the left siblings.
6. Search previous sentences (if needed): If no antecedent is found in the current
sentence, the algorithm can move to previous sentences and search there.
Step-by-Step Application:
The Centering algorithm was proposed as part of Centering Theory, which focuses on the role of discourse entities (called "centers") in maintaining coherence between sentences.
1. Forward-looking centers (Cf): These are potential entities (noun phrases) in a
sentence that could be referred to in subsequent sentences. They are ranked by
salience (importance), usually based on syntactic roles like subject, object, etc.
2. Backward-looking center (Cb): This is the most salient entity in the current sentence
that refers back to an entity in the previous sentence. It is the entity that links the current
sentence to the previous one, creating coherence.
3. Preferred center (Cp): The highest-ranked forward-looking center (the most salient
entity in the current sentence) that is expected to become the backward-looking center in
the next sentence.
4. Transition Types: Transitions describe the relationship between sentences based on
whether the backward-looking center and preferred center stay the same across
sentences. There are four types:
○ Continue: The current backward-looking center (Cb) matches the previous
backward-looking center, and the preferred center remains the same.
○ Retain: The backward-looking center stays the same, but the preferred center
changes.
○ Smooth Shift: The backward-looking center changes, and the new sentence's
preferred center becomes the new backward-looking center.
○ Rough Shift: The backward-looking center and preferred center both change,
leading to a potential disruption in coherence.
1. Identify Cf: Extract all the possible entities (noun phrases) from the current sentence
that could be referred to in future sentences.
2. Determine Cb: Identify the backward-looking center, i.e., the most salient entity in the
current sentence that refers back to a previous entity.
3. Rank the centers: Rank the forward-looking centers (Cf) based on their syntactic roles
and salience (subject > object > others).
4. Compute the transition type: Based on how the current Cb and Cp relate to the
previous sentence's Cb and Cp, determine the transition type (Continue, Retain, Smooth
Shift, or Rough Shift).
Step-by-Step Application
1. Sentence 1:
○ Cf: {John}
○ Cb: None (there's no previous sentence, so no backward-looking center)
○ Cp: John (he is the subject)
2. Sentence 2:
○ Cf: {John, a dog} (both John and the dog are potential forward-looking centers)
○ Cb: John (refers back to "John" in the previous sentence)
○ Cp: John (since "He" refers to John, John remains the most salient entity)
3. Transition Type: Continue (John remains the Cb and Cp).
4. Sentence 3:
○ Cf: {the dog}
○ Cb: the dog (refers back to "a dog" in the previous sentence)
○ Cp: the dog (the most salient entity in this sentence)
5. Transition Type: Smooth Shift (Cb changes from "John" to "the dog," and the Cp is
now the dog).
Explanation:
Q.27 Explain the three types of referents that complicate the reference resolution
problem.
A natural language expression used to perform reference is called a referring expression and
the entity that is referred to is called the referent.
The three types of referents that complicate the reference resolution problem are:
1. Inferrables: In certain instances, a referring expression refers to an entity that has been
implied rather than one that has been expressed in the text. Inferrables are referents not
explicitly mentioned in the text but can be inferred based on world knowledge or context.
Example: “John bought a car. The engine is very powerful”
“The engine” refers to the engine of the car, which is not explicitly mentioned but inferred
from the mention of the car in the previous statement.
2. Discontinuous Sets: Discontinuous sets occur when the reference refers to a group of
entities that are not explicitly grouped together in the text. The resolution system must
combine scattered mentions across the discourse to resolve the reference to the
appropriate set.
Example: “Sanil met Rugved in the park. Later, Rohit joined them, and all three went to a
cafe.”
Here, “them” refers to Sanil and Rugved, whereas “all three” refers to Sanil, Rugved and Rohit.
3. Generics: Generics are referents that refer to a general class or category rather than a
specific instance. The resolution system must determine whether the reference is
generic or specific, which can be ambiguous depending on the context.
Example: “Dogs are loyal animals” Here, “Dogs” refers to the general category of dogs
(Generic).
“The dog barked loudly.” Here, “The dog” refers to a specific animal (Not Generic)
Q.28 What is reference resolution? & Explain Discourse reference resolution in detail
Reference resolution in NLP is the process of identifying what a word, phrase, or pronoun in a
sentence refers to. It involves linking mentions (e.g., pronouns like he, she, it, or noun phrases)
to their corresponding entities or concepts within the text.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
Discourse Reference resolution focuses on resolving references that span across multiple
sentences or utterances in a discourse. Unlike resolving references within a single sentence,
discourse-level resolution requires understanding the context of multiple sentences and
maintaining a mental representation of the entities discussed.
Types of References in Discourse:
- Anaphora: When a pronoun or noun phrase refers back to an earlier entity.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
- Cataphora: When a pronoun or noun phrase refers to an entity mentioned later in the
text.
Example: “Before he spoke, John cleared his throat.” Here, “he” refers to John, who is mentioned later in the same sentence.
- Exophora: When the reference is to something outside the text (requires external
knowledge).
Example: “Look at that!” Here, “that” refers to something visible in the environment, not
in the text.
Applications of Discourse reference resolution:
- Chatbots: Understanding pronouns and references in user queries.
- Information Extraction: Linking mentions to extract coherent facts.
- Summarization: Maintaining entity continuity in summaries
Example of Discourse Reference Resolution: “Sanil visited the museum. He found it
fascinating. The ancient weapons were extraordinary.”
Mentions: Sanil, the museum, He, it, The ancient weapons
Entity Links:
- He → Sanil
- It → The museum
- The ancient weapons → inferred to be part of the museum
Q.29 Illustrate the reference phenomena for solving the pronoun problem.
The pronoun problem refers to identifying the entity (referred to as the referent) that a pronoun
(such as he, she, it, they) refers to in a given text. This process is a key part of reference
resolution in NLP. To solve this problem, we rely on understanding reference phenomena,
which are linguistic and contextual cues that help us link pronouns to their antecedents.
1. Coreference: Coreference occurs when a pronoun refers to a noun phrase (antecedent)
within the same sentence or nearby sentences. Use syntactic proximity (the closest
noun) and semantic compatibility (gender, number agreement) to resolve the
pronouns.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
2. Anaphora: E.g. already done, Solution: Look for antecedents in the preceding text
using sentence structures and dependencies.
3. Cataphora: E.g. already done, Solution: Post-process the sentence to identify nouns
following the pronoun for potential antecedents.
4. Exophora: E.g. already done, Solution: Requires contextual understanding and often
involves integrating real-world knowledge or extra inputs like images.
By understanding and applying reference phenomena, NLP systems can improve the resolution
of pronouns and enhance tasks such as chatbots, machine translation, and text summarization.
Example of Reference Phenomena solving a Pronoun Problem:
“Pihu saw a boy. He was crying”
- Identify Pronouns: “He”
- Search for Antecedents: Nouns in the context are, “Pihu” and “a boy”
- Filter by Compatibility: Match gender, number and role in the sentence. “He” matches
with “a boy” (singular, male)
- Resolve: Assign “He” = “a boy”
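A toy Python sketch of the filter-by-compatibility step above; the tiny feature lexicon and the candidate list are invented purely for illustration:

PRONOUN_FEATURES = {"he": ("masc", "sing"), "she": ("fem", "sing"),
                    "it": ("neut", "sing"), "they": (None, "plur")}

def resolve(pronoun, candidates):
    """candidates: (mention, gender, number) tuples, most recent first."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    for mention, g, n in candidates:
        # keep the most recent antecedent whose gender and number agree
        if (gender is None or g == gender) and n == number:
            return mention
    return None

candidates = [("a boy", "masc", "sing"), ("Pihu", "fem", "sing")]
print(resolve("He", candidates))      # -> 'a boy'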
Q.30 What are the five types of referring expressions ? Explain with the help of an
example.
1. Indefinite Noun Phrases: These refer to entities that are being introduced for the first
time in a discourse and are not previously known to the listener/reader. They are often
accompanied by indefinite articles like a or an.
Example: “A man entered the room” Here, “A man” introduces an entity for the first time.
2. Definite Noun Phrases: These refer to entities that are already known or can be
uniquely identified within the discourse or context. They are accompanied by the definite
article the.
Example: “The man sat down.” Here, “The man” refers to a specific, identifiable man,
often one introduced earlier or assumed to be known.
3. Pronouns: Pronouns are referring expressions that replace nouns and usually refer to
entities mentioned earlier in the discourse. They include he, she, it, they, etc.
Example: “John saw a dog. He liked it.”
Here, the “He” refers to John and “it” refers to a dog.
4. Demonstratives: Demonstratives specify entities relative to the speaker’s perspective
and are often used with gestures in spoken language. They include this, that, these,
those.
Example: “I want that book, not this one.” Here, “that book” refers to a specific book,
likely pointed to or previously mentioned.
5. Names: Names are proper nouns used to refer to specific entities without ambiguity
within the given context.
Example: “Sanil called Freddy yesterday.” Here, “Sanil” and “Freddy” are specific
individuals, and their names uniquely identify them in the discourse.
Mod. 6
Q.32 Demonstrate the working of Machine translation systems
Machine translation (MT) refers to the automatic conversion of text from one language to
another using computational algorithms. It plays a crucial role in breaking language barriers,
making information accessible globally.
Write about the Approaches in Machine Translation (next answer) →
Q.33 Explain Machine translation approaches used in NLP [VIMP]
OR What is rule based machine translation ?
OR Explain the statistical approach for machine translation.
1. Rule-Based MT: The earliest commercial machine translation systems were rule-based machine translation systems, also called RBMTs, which are based on linguistic principles or rules that permit words to be placed in many contexts and to have various meanings.
These rules are created by programmers and human language experts who have put a
lot of work into understanding and mapping the rules between two languages. This
allows users to update and improve the translation, although these handcrafted rules
can be challenging to maintain for large-scale translations.
Workflow Example: “I am eating an apple”
Lexical mapping: Match words to their equivalents in the target language
I → मैं , “am eating” →खा रहा हूं , “an apple” →सेब
Grammar Transformation: Adjust word order based on target language’s grammar.
Final Output = “मैं सेब खा रहा हूं”
2. Statistical MT: SMT or Statistical Machine Translation bases translations on statistical
models, the parameters of which are generated from analysis of large bilingual text
corpora. Utilizes probabilities to determine the most likely translation.
A large, organized collection of texts written in two different languages is referred to as a “bilingual text corpus”. To create the statistical models, supervised and unsupervised machine learning techniques are employed.
Workflow Example: “The cat sat on the mat”
Analyze Parallel Corpora:
English: “The cat sat on the mat”
Hindi: “बिल्ली कालीन पर बैठ गयी.”
Compute probabilities for word mapping:
Cat → बिल्ली (high probability), Mat → कालीन (high probability)
Reconstruct the sentence using high-probability phrases:
Final Output = “बिल्ली कालीन पर बैठ गयी.”
3. Neural MT: Based on deep learning and artificial neural networks.
Uses an encoder-decoder architecture, where the Encoder processes the source language sentence and encodes it into a fixed-size vector, and the Decoder then decodes the vector to generate the target language sentence (a minimal sketch with a pretrained model is shown after this list).
Workflow Example: “How are you ?”
Encoder converts the sentence into a sequence of embeddings. Decoder generates the
output word-by-word using these embeddings.
Final Output = “आप कैसे हैं?”
4. Hybrid MT: Hybrid MT combines two or more approaches, typically Rule-Based MT
(RBMT) and Statistical MT (SMT) or Neural MT (NMT). Utilizes the linguistic knowledge
of RBMT to handle grammar and data-driven insights of SMT for fluent translations.
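A minimal sketch of the neural (encoder-decoder) approach using a pretrained model; this assumes the Hugging Face transformers library is installed and that an English-Hindi checkpoint such as "Helsinki-NLP/opus-mt-en-hi" is available for download:

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
print(translator("How are you?")[0]["translation_text"])   # e.g. "आप कैसे हैं?"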
Q.34 Explain the different steps in text processing for Information Retrieval [VIMP]
Text processing is a crucial phase in information retrieval that transforms raw text into a
structured format for efficient indexing and retrieval. Steps in Text Pre-Processing:
1. Tokenization: Split the text into individual units (tokens) such as words, phrases, or
symbols. Break sentences into words by identifying delimiters (e.g., spaces,
punctuation).
Input: “This is a sentence”
Output: [this, is, a, sentence]
2. Normalization: Standardize text to ensure consistency. Normalization converts all text to lowercase, removes special characters and punctuation, and removes stop words.
Input: “My name is Sanil”
Output: Converting “Sanil” to “sanil”, removal of Stop Words “my” and “is”,
Final Output = “name sanil”
3. Stemming and Lemmatization: Reduce words to their base or root form for
consistency.
- Stemming: A natural language processing technique used to reduce inflected words to their root forms; it aids in the preprocessing of text, words and documents.
running, runs → run, leaves → leav (word MAY NOT be recognized by dictionary)
- Lemmatization: Responsible for grouping different inflected forms of words into
their root form having the same meaning. Seeks to get rid of inflectional suffixes
and prefixes for the purpose of bringing out the word’s dictionary form.
better → good, leaves → leaf (word IS ALWAYS recognized by dictionary)
4. N-grams Creation: Generate sequences of n tokens to capture context and multi-word
phrases. N-gram could be of types,
- Unigram: Single Words, Example: “Leave me alone” → [leave, me, alone]
- Bigram: Pairs of consecutive words,
Example: “Leave me alone please” → [leave me, me alone, alone please]
- Trigram: Triplets of words,
Example: ”Leave me alone please” → [leave me alone, me alone please]
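The whole pre-processing pipeline above can be sketched with NLTK (assumes the 'punkt', 'stopwords' and 'wordnet' data packages have been downloaded; the sample sentence is invented):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

text = "My name is Sanil and he likes running with the dogs"

tokens = nltk.word_tokenize(text.lower())                        # 1. tokenization
filtered = [t for t in tokens if t.isalpha()
            and t not in stopwords.words("english")]             # 2. normalization
stems = [PorterStemmer().stem(t) for t in filtered]              # 3a. stemming (running -> run)
lemmas = [WordNetLemmatizer().lemmatize(t) for t in filtered]    # 3b. lemmatization (dogs -> dog)
bigrams = list(ngrams(filtered, 2))                              # 4. n-gram creation

print(filtered, stems, lemmas, bigrams, sep="\n")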
{G-FOTU-E}
Aspect Information Retrieval Information Extraction
Steps in QA System:
Types of QA Systems:
1. Fact-Based QA Systems: Designed to answer specific factual questions with definite
answers, such as "What is the capital of France?" (Answer: Paris).
2. Open-Domain QA Systems: Capable of answering a broad range of questions across
various domains, often used in search engines or virtual assistants.
3. Closed-Domain QA Systems: Focused on answering questions from a specific field,
such as medicine or law, where in-depth knowledge of that field is required.
4. Generative QA Systems: These systems generate answers based on understanding
the context, like summarizing complex information or explaining a topic.
5. Extractive QA Systems: These systems extract exact answers from a given text or
dataset, such as finding specific words, phrases, or numbers.
Example Workflow: