NLP Important Question and Answers Module Wise
1. Define NLP. Discuss its real-world applications
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Linguistics
concerned with the interaction between computers and human (natural) languages. It involves
designing algorithms and systems that allow computers to process, understand, interpret, and
generate human language in a meaningful way.
Applications of NLP
Natural Language Processing (NLP) has a wide range of applications that aim to bridge the gap
between human language and computational systems. One of the major applications of NLP is
Machine Translation (MT), which involves automatically converting text or speech from one
language to another. MT systems analyze the source language for syntax and semantics and
generate equivalent content in the target language. Examples include Google Translate and
Microsoft Translator. The challenge in MT lies in handling grammar, idioms, context, and word
order, especially for Indian languages, which have a free word order.
Speech Recognition is another significant application where spoken language is converted into
text. This is used in systems like voice assistants (e.g., Google Assistant, Siri) and dictation tools.
It involves acoustic modeling, language modeling, and phonetic transcription. Speech recognition
must account for accents, background noise, and spontaneous speech.
Speech Synthesis, also known as Text-to-Speech (TTS), is the reverse process, where written
text is converted into spoken output. TTS systems are used in applications for visually impaired
users, public announcement systems, and interactive voice response (IVR) systems. These
systems require natural-sounding voice output, correct intonation, and pronunciation.
Natural Language Interfaces to Databases (NLIDB) allow users to interact with databases
using natural language queries instead of structured query languages like SQL. For example, a
user can ask “What is the balance in my savings account?” and the system translates it into a
database query. This application requires robust parsing, semantic interpretation, and domain
understanding.
Information Retrieval (IR) deals with finding relevant documents or data in response to a user
query. Search engines like Google, Bing, and academic databases are practical implementations
of IR. NLP techniques help in query expansion, stemming, and ranking results by relevance.
Information Extraction (IE) refers to the automatic identification of structured information such
as names, dates, locations, and relationships from unstructured text. IE is useful in fields like
journalism, business intelligence, and biomedical research. Named Entity Recognition (NER) and
Relation Extraction are key components of IE.
Question Answering (QA) systems provide direct answers to user questions instead of listing
documents. For example, a QA system can answer “Who is the President of India?” by retrieving
the exact answer from a knowledge base or corpus. These systems require deep linguistic
analysis, context understanding, and often integrate IR and IE.
Text Summarization involves automatically generating a condensed version of a given text while
preserving its key information. Summarization can be extractive (selecting key sentences) or
abstractive (generating new sentences). It is useful in generating news digests, executive
summaries, and academic reviews. Summarization systems must preserve coherence,
grammaticality, and meaning.
Another major challenge is identifying semantics, especially in the presence of idioms and
metaphors. Idioms such as "kick the bucket" or "spill the beans" have meanings that cannot be
derived from the literal meaning of the words. Similarly, metaphors like "time is a thief" require
deep contextual and cultural understanding, which machines struggle to grasp. These figurative
expressions pose a serious problem for semantic analysis since they don't follow regular linguistic
patterns.
Quantifier scoping is another subtle issue, dealing with how quantifiers (like “all,” “some,”
“none”) affect the meaning of sentences. For example, the sentence “Every student read a book”
can mean either that all students read the same book or that each student read a different one.
Disambiguating such sentences requires complex logical reasoning and context awareness.
Ambiguity is one of the most persistent challenges in NLP. At the word level, there are two main
types: part-of-speech ambiguity and semantic ambiguity. In part-of-speech ambiguity, a word
like “book” can be a noun (“a book”) or a verb (“to book a ticket”), and the correct tag must be
determined based on context. This ties into the task of Part-of-Speech (POS) tagging, where
the system must assign correct grammatical labels to each word in a sentence, often using
probabilistic models like Hidden Markov Models or neural networks.
In terms of semantic ambiguity, many words have multiple meanings—a problem known as
polysemy. For instance, the word “bat” can refer to a flying mammal or a piece of sports
equipment. Resolving this is the goal of Word Sense Disambiguation (WSD), which attempts to
determine the most appropriate meaning of a word in a given context. WSD is particularly difficult
in resource-poor languages or when the context is vague.
Another type of complexity arises from structural ambiguity, where a sentence can be parsed in
more than one grammatical way. For example, in “I saw the man with a telescope,” it is unclear
whether the telescope was used by the speaker or the man. Structural ambiguity can lead to
multiple interpretations and is a major hurdle in syntactic and semantic parsing.
3. What is the role of grammar in NLP? How is it different from language?
One of the main challenges in defining the structure of natural language is its dynamic nature and
the presence of numerous exceptions that are difficult to capture formally. Over time, several
grammatical frameworks have been proposed to address these challenges. Prominent among
them are transformational grammar (Chomsky, 1957), lexical functional grammar (Kaplan and
Bresnan, 1982), government and binding theory (Chomsky, 1981), generalized phrase structure
grammar, dependency grammar, Paninian grammar, and tree-adjoining grammar (Joshi, 1985).
While some of these grammars focus on the derivational aspects of sentence formation (e.g.,
phrase structure grammar), others emphasize relational properties (e.g., dependency grammar,
lexical functional grammar, Paninian grammar, and link grammar).
The most significant contribution in this area has been made by Noam Chomsky, who proposed a
formal hierarchy of grammars based on their expressive power. These grammars employ phrase
structure or rewrite rules to generate well-formed sentences in a language. The general
framework proposed by Chomsky is referred to as generative grammar, which consists of a finite
set of rules capable of generating all and only the grammatical sentences of a language.
Chomsky also introduced transformational grammar, asserting that natural languages cannot be
adequately represented using phrase structure rules alone. In his work Syntactic Structures
(1957), he proposed that each sentence has two levels of representation: the deep structure,
which captures the sentence's core meaning, and the surface structure, which represents the
actual form of the sentence. The transformation from deep to surface structure is accomplished
through transformational rules.
For example, for the sentence “The police will catch the snatcher,” the phrase structure rules
generate the following parse tree:
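The parse tree from the original figure is not reproduced here; a bracketed sketch of one plausible tree, mirroring the NP + Aux + V + NP pattern used in the transformational rule below, is:
[S [NP [Det The] [N police]]
   [Aux will]
   [V catch]
   [NP [Det the] [N snatcher]]]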
This tree represents the syntactic structure of the sentence as derived from phrase structure
rules.
Transformational rules are applied to the output of the phrase structure grammar and are used to
modify sentence structures. These rules may have multiple symbols on the left-hand side and
enable transformations such as changing an active sentence into a passive one. For example,
Chomsky provided a rule for converting active to passive constructions:
NP₁ + Aux + V + NP₂ → NP₂ + Aux + be + en + V + by + NP₁.
This rule inserts the strings “be” and “en” and rearranges sentence constituents to reflect a
passive construction. Transformational rules can be either obligatory, ensuring grammatical
agreement (such as subject-verb agreement), or optional, allowing for structural variations while
preserving meaning.
The third component, morphophonemic rules, connects the sentence representation to a string of
phonemes. For instance, in the transformation of the sentence “The police will catch the
snatcher,” the passive transformation results in “The snatcher will be caught by the police.” A
morphophonemic rule then modifies “catch + en” to its correct past participle form “caught.”
However, phrase structure rules often struggle to account for more complex linguistic phenomena
such as embedded noun phrases containing adjectives, modifiers, or relative clauses. These
phenomena give rise to what are known as long-distance dependencies, where related
elements like a verb and its object may be separated by arbitrary amounts of intervening text.
Such dependencies are not easily handled at the surface structure level. A specific case of
long-distance dependency is wh-movement, where interrogative words like “what” or “who” are
moved to the front of a sentence, creating non-local syntactic relationships. These limitations
highlight the need for more advanced grammatical frameworks like tree-adjoining grammars
(TAGs), which can effectively model such syntactic phenomena due to their capacity to represent
recursion and long-distance dependencies more naturally than standard phrase structure rules.
N-gram model
The goal of statistical language models is to estimate the probability (likelihood) of a sentence.
This is achieved by decomposing sentence probability into a product of conditional probabilities
using the chain rule as follows:
P(w₁ w₂ ... wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁ w₂) · ... · P(wₙ | w₁ ... wₙ₋₁)
In order to calculate the sentence probability, we need to calculate the probability of a word, given
the sequence of words preceding it. An n-gram model simplifies this task by approximating the
probability of a word given all the previous words by the conditional probability of the word given only the previous n−1 words:
P(wᵢ | w₁ w₂ ... wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁)
Thus, an n-gram model calculates P(wᵢ | hᵢ) by modeling language as a Markov model of order (n−1), i.e., it looks at only the previous (n−1) words.
● A model that considers only the previous one word is called a bigram model (n = 2).
● A model that considers the previous two words is called a trigram model (n = 3).
Using bigram and trigram models, the probability of a sentence w₁, w₂, ..., wₙ can be estimated as:
Bigram: P(w₁ w₂ ... wₙ) ≈ ∏ᵢ P(wᵢ | wᵢ₋₁)
Trigram: P(w₁ w₂ ... wₙ) ≈ ∏ᵢ P(wᵢ | wᵢ₋₂ wᵢ₋₁)
A special word(pseudo word) <s> is introduced to mark the beginning of the sentence in bi-gram
estimation. The probability of the first word in a sentence is conditioned on <s>. Similarly, in
tri-gram estimation, we introduce two pseudo-words <s1> and <s2>.
Estimation of probabilities is done by training the n-gram model on the training corpus. We
estimate n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e, using
relative frequencies. We count a particular n-gram in the training corpus and divide it by the sum
of all n-grams that share the same prefix.
The sum of the counts of all n-grams that share the first n−1 words is simply the count of that common prefix. So, we rewrite the previous expression as:
P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁) = C(wᵢ₋ₙ₊₁ ... wᵢ₋₁ wᵢ) / C(wᵢ₋ₙ₊₁ ... wᵢ₋₁)
The model parameters obtained using these estimates maximize the probability of the training set
T given the model M, i.e., P(T|M). However, the frequency with which a word occurs in a text may
differ from its frequency in the training set. Therefore, the model only provides the most likely
solution based on the training data.
Several improvements have been proposed for the standard n-gram model. Before discussing
these enhancements, let us illustrate the underlying ideas with the help of an example.
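The worked example from the source is not reproduced above; as an illustrative substitute, the following minimal Python sketch (the toy corpus and function names are my own) estimates bigram probabilities by relative frequency:

from collections import Counter

# Tiny illustrative corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the rug </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: count(prev word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 1/4 = 0.25 ("the" occurs 4 times, "the cat" once)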
The n-gram model suffers from data sparseness. n-grams not seen in the training data are
assigned zero probability, even in large corpora. This is due to the assumption that a word’s
probability depends only on the preceding word(s), which is often not true. Natural language
contains long-distance dependencies that this model cannot capture.
To handle data sparseness, various smoothing techniques have been developed, such as
add-one smoothing. As Jurafsky and Martin (2000) note, the term smoothing reflects the fact that these techniques adjust probability distributions toward more uniform values, raising zero or very low probabilities and lowering very high ones.
Add-one Smoothing
Add-One Smoothing is a simple technique used to handle the data sparseness problem in
n-gram language models by avoiding zero probabilities for unseen n-grams.
In an n-gram model, if a particular n-gram (like a word or word pair) does not occur in the
training data, it is assigned a probability of zero, which can negatively affect the overall probability
of a sentence. Add-One Smoothing helps by assigning a small non-zero probability to these unseen events. For a bigram model, the smoothed estimate is:
P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁ wᵢ) + 1) / (C(wᵢ₋₁) + V), where V is the vocabulary size.
Step 1: List all words in the training text and count how many times each word appears.
Step 2: Calculate unigram probabilities using MLE: P(w) = C(w) / N, where N is the total number of word tokens.
Step 3: Apply add-one smoothing: P(w) = (C(w) + 1) / (N + V), where V is the vocabulary size.
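A minimal Python sketch of add-one smoothing on the same kind of toy data (the corpus and names are my own assumptions):

from collections import Counter

tokens = "<s> the cat sat on the mat </s> <s> the dog sat on the rug </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size (number of distinct word types)

def smoothed_bigram_prob(prev, word):
    # Add-one (Laplace) smoothing: (count(prev word) + 1) / (count(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(smoothed_bigram_prob("the", "cat"))   # seen bigram
print(smoothed_bigram_prob("cat", "dog"))   # unseen bigram: small but non-zero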
Paninian Framework
Paninian Grammar is a highly influential grammatical framework, based on the ancient Sanskrit
grammarian Panini. It provides a rule-based structure for analyzing sentence formation using
deep linguistic features such as vibhakti (case suffixes) and karaka (semantic roles). Unlike
many Western grammars which focus on syntax, Paninian grammar emphasizes the relationship
between semantics and syntax, making it well-suited for Indian languages with free word order.
1. Surface Level
● It contains inflected words with suffixes (like tense, case markers, gender, number, etc.).
● The sentence is in linear word form but doesn’t reveal deeper structure.
Example (Hindi):
राम ने फल खाया (Rām ne phal khāyā)
At surface level:
राम + ने | फल | खाया
Here, "ने" is a vibhakti marker.
2. Vibhakti Level
● These vibhaktis provide syntactic cues about the role of each noun in the sentence.
● Different vibhaktis (e.g., ने, को, से, का) indicate nominative, accusative, instrumental,
genitive etc.
Example:
"को" → Dative/Accusative
"से" → Instrumental
3. Karaka Level
● Karaka relations assign semantic roles to nouns, like agent, object, instrument, source, etc.
Karaka roles are assigned based on the verb and semantic dependency, not fixed
word order.
Example:
राम ने फल खाया
4. Semantic Level
● Captures the final meaning of the sentence by combining all the karaka roles.
Example meaning:
"Rām ate the fruit."
→ {Agent: Ram, Action: eat, Object: fruit}
Karaka Theory is a fundamental concept in Paninian grammar that explains the semantic
roles of nouns in relation to the verb in a sentence. It helps identify who is doing what to whom,
with what, for whom, where, and from where, etc.
Unlike English grammar which focuses on Subject/Object, karaka theory goes deeper into
semantic functions.
Paninian grammar identifies six principal karaka relations, each typically signalled by a characteristic vibhakti:
1. Karta – the agent or doer of the action – Nominative (ने)
2. Karma – the object on which the action is done – Accusative (को)
3. Karana – the instrument of the action – Instrumental (से), e.g., चाकू से काटा ("cut with a knife")
4. Sampradana – the recipient or beneficiary of the action – Dative (को)
5. Apadana – the source or point of separation – Ablative (से)
6. Adhikarana – the locus (place or time) in or on which the action is done – Locative (में/पर)
Features:
● They are language-independent roles; similar roles exist in many world languages.
For example, in a sentence such as राम ने मोहन को फल दिया ("Rām gave the fruit to Mohan"), the karaka analysis is:
राम ने – Karta – Agent
फल – Karma – Object
मोहन को – Sampradana – Recipient
However, many issues remain unresolved, especially in cases of shared Karaka relations.
Another difficulty arises when the mapping between the Vibhakti (case markers and
postpositions) and the semantic relation (with respect to the verb) is not one-to-one. Two different
Vibhaktis can represent the same relation, or the same Vibhakti can represent different relations
in different contexts. The strategy to disambiguate the various senses of words or word groupings
remains a challenging issue.
As the system of rules differs across languages, the framework requires adaptation to handle
various applications in different languages. Only some general features of the PG framework
have been described here.
Features:
Advantages:
● Useful for tasks that demand transparency and rule-explainability (e.g., legal or
medical domains).
Disadvantages:
● Poor scalability to diverse or informal language (like social media text).
The statistical approach emerged with the rise of computational power and access to
large language datasets (corpora). These systems learn patterns and relationships from
data using machine learning and probabilistic models, such as Hidden Markov Models
(HMMs), Naive Bayes, or deep learning architectures.
Features:
Advantages:
Disadvantages:
Regular expressions, or regexes for short, are a pattern-matching standard for string parsing
and replacement. They are a powerful way to find and replace strings that follow a defined format.
For example, regular expressions can be used to parse dates, URLs, email addresses, log
files, configuration files, command line switches, or programming scripts. They are useful
tools for the design of language compilers and have been used in NLP for tokenization,
describing lexicons, morphological analysis, etc.
We have all used simplified forms of regular expressions, such as the file search patterns used by
MS-DOS, e.g., dir*.txt.
The use of regular expressions in computer science was made popular by a Unix-based editor,
'ed'. Perl was the first language that provided integrated support for regular expressions. It used a slash around each regular expression; the same notation is followed here. However, the slashes are not part of the regular expression itself.
Regular expressions were originally studied as part of the theory of computation. They were first
introduced by Kleene (1956). A regular expression is an algebraic formula whose value is a
pattern consisting of a set of strings, called the language of the expression. The simplest kind
of regular expression contains a single symbol.
For example, the expression /a/ denotes the set containing the string 'a'. A regular expression
may specify a sequence of characters also. For example, the expression /supernova/ denotes
the set that contains the string "supernova" and nothing else.
In a search application, the first instance of each match to a regular expression is typically returned or highlighted.
Character Classes
Characters are grouped by putting them between square brackets [ ]. Any character in the class
will match one character in the input. For example, the pattern /[abcd]/ will match a, b, c, or d.
This is called disjunction of characters.
Regular expressions can also specify what a character cannot be, using a caret (^) at the
beginning of the brackets.
● This interpretation is true only when the caret is the first character inside brackets.
Case Sensitivity:
● Regex is case-sensitive.
● /s/ matches lowercase s but not uppercase S.
Anchors:
○ strawberry
○ blackberry
○ sugarberry
RE Description
\n Newline character
\t Tab character
\d Digit (0-9)
\D Non-digit
\W Non-alphanumeric character
\s Whitespace
\S Non-whitespace
Real-world use:
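As an illustration of such real-world uses, a short sketch with Python's standard re module (the patterns and strings are my own examples, and the e-mail pattern is deliberately rough):

import re

text = "Contact us at info@example.com or visit https://example.com on 2024-01-15."

# Character class and quantifiers: a simple ISO date pattern (\d = digit).
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))      # ['2024-01-15']

# Disjunction inside [...] and negation with ^.
print(re.findall(r"[A-Za-z]+", "cat, bat!"))        # ['cat', 'bat']
print(re.sub(r"[^A-Za-z ]", "", "cat, bat!"))       # 'cat bat'

# A rough (illustrative, not fully general) e-mail pattern using \w and escapes.
print(re.findall(r"[\w.]+@[\w.]+", text))           # ['info@example.com']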
A Finite-State Automaton (FSA) is formally defined by five components:
● A finite set of states (Q)
● A finite input alphabet (Σ)
● An initial (start) state (q₀)
● A set of final (accepting) states (F)
● A transition function (δ) that maps a state and input symbol to the next state
FSAs are of two types:
● Deterministic Finite Automaton (DFA) – Only one transition is allowed for a given input
from a state.
● Non-Deterministic Finite Automaton (NFA) – Multiple transitions can occur from the
same state on the same input.
FSAs are used to accept regular languages. They process strings symbol by symbol and
determine whether the input belongs to the defined language by checking if the final state reached
is an accepting state.
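A minimal sketch of a DFA in Python (the states, alphabet, and accepted language are my own illustrative choices): it accepts strings over {a, b} that end in "ab".

# DFA accepting strings over {'a', 'b'} that end with "ab".
# States: q0 (start), q1 (just saw 'a'), q2 (just saw "ab", accepting).
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
ACCEPTING = {"q2"}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]   # deterministic: exactly one next state
    return state in ACCEPTING

print(accepts("aab"))   # True  (ends in "ab")
print(accepts("abb"))   # False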
Morphology is the study of the internal structure of words. Morphological parsing refers to
breaking a word into its meaningful units, called morphemes (e.g., root + suffix).
Example: the word "unhappiness" can be segmented into the morphemes "un-" (prefix), "happy" (root), and "-ness" (suffix).
Morphological analysis can be effectively modeled using a special form of finite automaton called
a Finite-State Transducer (FST).
An FST is defined by:
● Q = a finite set of states
● Σ = the input alphabet
● Δ = the output alphabet
● q₀ = the initial state
● F = the set of final states
● δ = a transition function that maps a state and an input symbol to a next state and an output symbol
An FST reads an input string and produces an output string, mapping surface forms to lexical
representations or vice versa.
FSTs are widely used for two-level morphology, where they help in:
1. Analyzing surface forms into lexical forms (e.g., "dogs" → "dog+N+PL")
2. Generating surface forms from lexical forms (e.g., "dog+N+PL" → "dogs")
For example:
● Input: lesser
● Output: less+ADJ+COMP
The FST maps the surface string to the morphological structure while also handling spelling
variations (e.g., dropping of ‘e’ in hope + ing → hoping).
This FST has transitions labeled with symbol pairs (input:output), such as:
● h:c
● o:o
● t:t
This shows that an FST simultaneously reads input and writes output as it moves through
states.
● Efficiency: FSAs and FSTs are computationally efficient and well-suited for real-time
applications.
● Bidirectionality: A single FST can perform both analysis (surface → lexical) and
generation (lexical → surface).
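A minimal sketch of such a transducer in Python (the toy lexicon, states, and tags are my own assumptions): it maps surface forms such as "dogs" to lexical forms such as "dog+N+PL".

# Toy FST for the nouns "dog" and "cat": maps surface forms to lexical forms.
# Transitions: (state, input symbol) -> (next state, output string).
DELTA = {
    (0, "d"): (1, "d"), (1, "o"): (2, "o"), (2, "g"): (3, "g"),
    (0, "c"): (4, "c"), (4, "a"): (5, "a"), (5, "t"): (3, "t"),
    (3, "s"): (6, "+N+PL"),          # plural suffix -s
}
# Output emitted when the input ends in a final state.
FINAL_OUTPUT = {3: "+N+SG", 6: ""}

def transduce(surface):
    state, output = 0, ""
    for ch in surface:
        if (state, ch) not in DELTA:
            return None                      # no analysis possible
        state, out = DELTA[(state, ch)]
        output += out
    return output + FINAL_OUTPUT[state] if state in FINAL_OUTPUT else None

print(transduce("dogs"))  # dog+N+PL
print(transduce("cat"))   # cat+N+SG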
Morphological Parsing
Morphological parsing is the process of analyzing a word to identify its morphemes, the smallest
meaning-bearing units of language. This includes recognizing the stem (or root) of the word and
identifying any affixes (prefixes, suffixes, infixes, circumfixes) that modify its meaning or
grammatical function. The goal is to map a word’s surface form (as it appears in text) to its
canonical form (or lemma) along with its morphological features (e.g., part of speech, number,
tense, gender).
Morphological parsing is essential for various Natural Language Processing (NLP) tasks such as
machine translation, information retrieval (IR), text-to-speech systems, and question answering.
Stemming is a rule-based process that removes known affixes from words to reduce them to a
common root form, known as a stem. The result may not always be a valid word in the language.
Goal:
To collapse morphological variants of a word into a common base form to aid in text normalization.
Method:
Stemmers typically apply a set of rewrite rules or heuristics. Common stemming algorithms
include:
● Porter Stemmer (1980): A widely used, multi-stage rule-based stemmer that uses suffix
stripping and rule-based transformations.
Limitations:
● Overstemming: Reducing different words to the same stem (e.g., universe and university
→ univers).
● Understemming: Failing to reduce related words to the same stem (e.g., relational and
relation may result in different stems).
● Produces stems that may not be valid words (e.g., univers, organiz).
Applications:
● Information retrieval (e.g., matching play, playing, and played in a search query).
● Text classification and clustering.
Lemmatization
Definition:
Lemmatization is the process of reducing a word to its dictionary base form, called the lemma, using vocabulary knowledge and morphological analysis, so that the output is always a valid word (e.g., better → good, running → run).
Goal:
To convert different inflected forms of a word into a linguistically correct base form.
Method:
● Morphological analysis
● Lexicons/dictionaries
● Part-of-speech tagging
Advantages:
● Produces linguistically valid base forms and is more accurate than stemming.
Limitations:
● Slower than stemming and requires lexicons, morphological rules, and often part-of-speech information.
Applications:
● Machine translation.
● Sentiment analysis.
● Knowledge extraction.
● Stemming is preferred in IR systems and search engines, where exact meaning is less
important than lexical similarity.
● Lemmatization is used in linguistically intensive tasks, where accurate word forms and
meanings are necessary (e.g., machine translation, syntactic parsing).
In more advanced systems, finite-state transducers (FSTs) and two-level morphology (e.g.,
Koskenniemi's model) are used to perform both analysis (lemmatization) and generation tasks
with formal precision.
4. What is spelling error detection and correction? Explain common techniques
1. Substitution: Replacing one letter with another (e.g., cat → bat).
2. Insertion: Adding an extra letter (e.g., cat → catt).
3. Deletion: Omitting a letter (e.g., cat → ct).
4. Transposition: Swapping two adjacent letters (e.g., cat → cta).
5. Reversal errors: A specific case of transposition where adjacent letters are reversed.
● OCR (Optical Character Recognition) and similar devices introduce errors such as:
○ Substitution
○ Space deletion/insertion
● Speech recognition systems process phoneme strings and attempt to match them to
known words. These errors are often phonetic in nature, leading to non-trivial
distortions of words.
1. Non-word errors: The incorrect word does not exist in the language (e.g., freind instead of
friend).
2. Real-word errors: The incorrect word is a valid word, but incorrect in the given context
(e.g., their instead of there).
Common spelling correction techniques include:
1. Minimum Edit Distance Techniques:
○ Compute the minimum number of insertions, deletions, and substitutions required to transform the misspelt word into a dictionary word, and suggest the closest candidates.
2. Similarity Key Techniques:
○ Generate a phonetic or structural key for a word and match it against similarly keyed dictionary entries.
○ Example: Soundex
3. N-gram Based Techniques:
○ Use character or word n-gram statistics to detect unlikely letter sequences or contextually unlikely words.
○ Useful for real-word errors (e.g., correcting there to their based on sentence meaning).
4. Neural and Probabilistic Techniques:
○ Use machine learning models (e.g., RNNs, Transformers) trained on large corpora to detect and correct errors.
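Minimum edit distance underlies several of these techniques; a minimal dynamic-programming sketch in Python (function name my own):

def edit_distance(source, target):
    # Levenshtein distance: minimum number of insertions, deletions,
    # and substitutions needed to turn source into target.
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all remaining source characters
    for j in range(n + 1):
        dp[0][j] = j                      # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

print(edit_distance("freind", "friend"))  # 2 (or 1 if transpositions are counted as one edit)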
5. Define POS tagging. Compare rule based, statistical based and hybrid approaches
Part-of-Speech Tagging
Part-of-Speech tagging is the process of assigning an appropriate grammatical category (such as
noun, verb, adjective, etc.) to each word in a given sentence. It is a fundamental task in Natural
Language Processing (NLP), which plays a crucial role in syntactic parsing, information extraction,
machine translation, and other language processing tasks.
POS tagging helps in resolving syntactic ambiguity and understanding the grammatical structure
of a sentence. Since many words in English and other natural languages can serve multiple
grammatical roles depending on the context, POS tagging is necessary to identify the correct
category for each word.
There are several approaches to POS tagging, which are broadly categorized as: (i) Rule-based
POS tagging, (ii) Stochastic POS tagging, and (iii) Hybrid POS tagging.
The rule-based taggers make use of rules that consider the tags of neighboring words and the
morphological structure of the word. For example, a rule might state that if a word is preceded by
a determiner and is a noun or verb, it should be tagged as a noun. Another rule might say that if a
word ends in "-ly", it is likely an adverb.
The effectiveness of this approach depends heavily on the quality and comprehensiveness of the
hand-written rules. Although rule-based taggers can be accurate for specific domains, they are
difficult to scale and maintain, especially for languages with rich morphology or free word order.
Stochastic or statistical POS tagging makes use of probabilistic models to determine the most
likely tag for a word based on its occurrence in a tagged corpus. These taggers are trained on
annotated corpora where each word has already been tagged with its correct part of speech.
In the simplest form, a unigram tagger assigns the most frequent tag to a word, based on the maximum likelihood estimate computed from the training data:
P(t | w) = f(w, t) / f(w)
where f(w,t) is the frequency of word w being tagged as t, and f(w) is the total frequency of the
word w in the corpus. This approach, however, does not take into account the context in which the
word appears.
To incorporate context, bigram and trigram models are used. In a bigram model, the tag assigned to a word depends on the tag of the previous word. The probability of a sequence of tags is given by:
P(t₁ t₂ ... tₙ) ≈ ∏ᵢ P(tᵢ | tᵢ₋₁)
The probability of the word sequence given the tag sequence is:
P(w₁ w₂ ... wₙ | t₁ t₂ ... tₙ) ≈ ∏ᵢ P(wᵢ | tᵢ)
Thus, the best tag sequence is the one that maximizes the product:
argmax over (t₁ ... tₙ) of ∏ᵢ P(wᵢ | tᵢ) · P(tᵢ | tᵢ₋₁)
This is known as the Hidden Markov Model (HMM) approach to POS tagging. Since the actual
tag sequence is hidden and only the word sequence is observed, the Viterbi algorithm is used to
compute the most likely tag sequence.
Bayesian inference is also used in stochastic tagging. Based on Bayes' theorem, the posterior probability of a tag given a word is:
P(t | w) = P(w | t) · P(t) / P(w)
Since P(w) is constant for all tags, we can choose the tag that maximizes P(w | t) · P(t).
Statistical taggers can be trained automatically from large annotated corpora and tend to
generalize better than rule-based systems, especially in handling noisy or ambiguous data.
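A compact sketch of Viterbi decoding for such an HMM tagger (the toy tag set and probability tables are invented purely for illustration):

import math

# Toy HMM: transition P(tag | prev_tag), emission P(word | tag).
TAGS = ["N", "V"]
TRANS = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
EMIT = {("N", "flight"): 0.4, ("N", "book"): 0.1,
        ("V", "book"): 0.3, ("V", "flight"): 0.01}

def viterbi(words):
    # best[t] = (log-probability, tag sequence) of the best path ending in tag t
    best = {t: (math.log(TRANS[("<s>", t)] * EMIT.get((t, words[0]), 1e-8)), [t])
            for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            emit = EMIT.get((t, word), 1e-8)
            score, prev = max(
                (best[p][0] + math.log(TRANS[(p, t)] * emit), p) for p in TAGS)
            new_best[t] = (score, best[prev][1] + [t])
        best = new_best
    return max(best.values())[1]

print(viterbi(["book", "flight"]))  # ['V', 'N'] for this toy model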
Hybrid approaches combine rule-based and statistical methods to take advantage of the strengths
of both. One of the most popular hybrid methods is Transformation-Based Learning (TBL),
introduced by Eric Brill, commonly referred to as Brill’s Tagger.
In this approach, an initial tagging is done using a simple method, such as assigning the most
frequent tag to each word. Then, a series of transformation rules are applied to improve the
tagging. These rules are automatically learned from the training data by comparing the initial
tagging to the correct tagging and identifying patterns where the tag should be changed.
Each transformation is of the form: "Change tag A to tag B when condition C is met". For example,
a rule might say: "Change the tag from VB to NN when the word is preceded by a determiner".
The transformation rules are applied iteratively to correct errors in the tagging, and each rule is
chosen based on how many errors it corrects in the training data. This approach is robust,
interpretable, and works well across different domains.
6. Write the CFG rules for a sentence and parse it using top-down parsing.
Parsing is the process of analyzing a string of symbols (typically a sentence) according to the
rules of a formal grammar. In Natural Language Processing (NLP), parsing determines the
syntactic structure of a sentence by identifying its grammatical constituents (like noun phrases,
verb phrases, etc.). It checks whether the sentence follows the grammatical rules defined by a
grammar, often a Context-Free Grammar (CFG). The result of parsing is typically a parse tree or
syntax tree, which shows how a sentence is hierarchically structured. Parsing helps in
disambiguating sentences with multiple meanings. It is essential for understanding, translation,
and information extraction. There are two main types: syntactic parsing, which focuses on
structure, and semantic parsing, which focuses on meaning. Parsing algorithms include
top-down, bottom-up, and chart parsing. Efficient parsing is crucial for developing
grammar-aware NLP applications.
Top-down Parsing
As the name suggests, top-down parsing starts its search from the root node S and works
downwards towards the leaves. The underlying assumption here is that the input can be derived
from the designated start symbol, S, of the grammar. The next step is to find all sub-trees which
can start with S. To generate the sub-trees of the second-level search, we expand the root node
using all the grammar rules with S on their left hand side. Likewise, each non-terminal symbol in
the resulting sub-trees is expanded next using the grammar rules having a matching non-terminal
symbol on their left hand side. The right hand side of the grammar rules provide the nodes to be
generated, which are then expanded recursively. As the expansion continues, the tree grows
downward and eventually reaches a state where the bottom of the tree consists only of
part-of-speech categories. At this point, all trees whose leaves do not match words in the input
sentence are rejected, leaving only trees that represent successful parses. A successful parse
corresponds to a tree which matches exactly with the words in the input sentence.
Sample grammar
● S → NP VP
● S → VP
● NP → Det Nominal
● NP → NP PP
● Nominal → Noun
● Nominal → Nominal Noun
● VP → Verb
● VP → Verb NP
● VP → Verb NP PP
● PP → Preposition NP
● Det → this | that | a | the
● Noun → book | flight | meal | money
● Verb → book | include | prefer
● Pronoun → I | he | she | me | you
● Preposition → from | to | on | near | through
A top-down search begins with the start symbol of the grammar. Thus, the first level (ply) of the search tree consists of a single node labelled S. The grammar above has two rules with S on their left hand side. These rules are used to expand the tree, giving two partial trees at the second level of the search. The third level is generated by expanding the non-terminals at the bottom of the search tree in the previous ply, usually starting with the left-most non-terminal. Expansion continues in this way until the leaves of the trees consist only of part-of-speech categories; trees whose leaves do not match the words of the input sentence are rejected, and the correct parse is the tree whose leaves match the input exactly.
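A minimal recursive-descent (top-down) sketch in Python over a simplified, non-left-recursive subset of the grammar above (the simplification and function are my own):

# Simplified, non-left-recursive subset of the sample grammar.
GRAMMAR = {
    "S":       [["NP", "VP"], ["VP"]],
    "NP":      [["Det", "Nominal"]],
    "Nominal": [["Noun"]],
    "VP":      [["Verb", "NP"], ["Verb"]],
    "Det":     [["that"], ["the"], ["a"]],
    "Noun":    [["flight"], ["book"], ["meal"]],
    "Verb":    [["book"], ["include"], ["prefer"]],
}

def parse(symbol, tokens, pos):
    # Try to derive `symbol` from tokens starting at `pos`.
    # Yields every position reached by a successful derivation (backtracking search).
    if symbol not in GRAMMAR:                       # terminal word
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for rhs_symbol in production:
            positions = [q for p in positions for q in parse(rhs_symbol, tokens, p)]
        for q in positions:
            yield q

tokens = "book that flight".split()
print(any(q == len(tokens) for q in parse("S", tokens, 0)))  # True: the sentence is grammatical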
7. What is bottom-up parsing? How does it differ from top-down parsing?
Bottom-up Parsing
A bottom-up parser starts with the words in the input sentence and attempts to construct a parse
tree in an upward direction towards the root. At each step, the parser looks for rules in the
grammar where the right hand side matches some of the portions in the parse tree constructed so
far, and reduces it using the left hand side of the production. The parse is considered successful if
the parser reduces the tree to the start symbol of the grammar. As an illustration, for the sentence "Paint the door", a bottom-up parser starts with the words paint, the, and door, assigns them their possible categories, and successively reduces them using the grammar rules until the start symbol S is reached.
Each of these parsing strategies has its advantages and disadvantages. As the top-down search
starts generating trees with the start symbol of the grammar, it never wastes time exploring a tree
leading to a different root. However, it wastes considerable time exploring S trees that eventually
result in words that are inconsistent with the input. This is because a top-down parser generates
trees before seeing the input. On the other hand, a bottom-up parser never explores a tree that
does not match the input. However, it wastes time generating trees that have no chance of leading
to an S-rooted tree. A branch of the search space that explores a sub-tree treating paint as a noun is an example of such wasted effort. A practical compromise is a search strategy that uses the top-down method to generate trees and augments it with bottom-up filtering to reject bad parses early.
CYK Parsing
The CYK (Cocke–Younger–Kasami) algorithm is a dynamic programming parsing algorithm for context-free grammars. It requires the grammar to be in Chomsky Normal Form (CNF), in which every production has one of the following two forms:
A → BC
A → w, where w is a word.
The algorithm first builds parse trees of length one by considering all rules which could produce
words in the sentence being parsed. Then, it finds the most probable parse for all the constituents
of length two. The parse of shorter constituents constructed in earlier iterations can now be used
in constructing the parse of longer constituents.
A non-terminal A derives the substring wᵢⱼ of length j starting at position i (written A ⇒* wᵢⱼ) if:
1. A → B C is a rule in the grammar,
2. B ⇒* wᵢₖ, and
3. C ⇒* w₍ₖ₊₁₎ⱼ, for some split point k.
For a sub-string wᵢⱼ of length j starting at i, the algorithm considers all possible ways of breaking it into two parts wᵢₖ and w₍ₖ₊₁₎ⱼ. Finally, we have to verify that S ⇒* w₁ₙ, i.e., that the start symbol of the grammar derives the entire string w₁ₙ.
CYK ALGORITHM
Let the input be w = w₁ w₂ w₃ ... wₙ, and let chart[i, j] denote the set of non-terminals that derive the span of words from position i to position j.
// Initialization step
for i := 1 to n do
chart[i, i] := { A | A → wᵢ is a production }
// Recursive step
for span := 2 to n do
for i := 1 to n - span + 1 do
begin
j := i + span - 1
chart[i, j] := ∅
for k := i to j - 1 do
chart[i, j] := chart[i, j] ∪ { A | A → B C is a production and
B ∈ chart[i, k] and C ∈ chart[k+1, j] }
end
// The input is accepted if S ∈ chart[1, n]
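A runnable Python sketch of the same chart computation (the CNF grammar and test sentences below are toy examples of my own):

# Toy grammar in Chomsky Normal Form.
BINARY_RULES = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL_RULES = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "bit": {"V"}}

def cyk(words):
    n = len(words)
    # chart[i][j] = non-terminals deriving words i..j (0-based, inclusive).
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):                       # initialization
        chart[i][i] = set(LEXICAL_RULES.get(word, set()))
    for span in range(2, n + 1):                           # span length
        for i in range(0, n - span + 1):
            j = i + span - 1
            for k in range(i, j):                          # split point
                for B in chart[i][k]:
                    for C in chart[k + 1][j]:
                        chart[i][j] |= BINARY_RULES.get((B, C), set())
    return "S" in chart[0][n - 1]

print(cyk("the dog bit the cat".split()))   # True
print(cyk("dog the bit cat the".split()))   # False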
Text Classification and the Naive Bayes Classifier
Text classification tasks such as spam detection, sentiment analysis, and topic categorization are typically framed as supervised machine learning problems, where we
are given a set of labeled examples (training data) and must build a model that generalizes to unseen
examples. These tasks rely on representing text in a numerical feature space and applying statistical
models to predict classes.
One of the most commonly used classifiers in NLP is the Naive Bayes classifier, a probabilistic
model that applies Bayes’ theorem under strong independence assumptions. Despite its simplicity, it
is robust and surprisingly effective in many domains including spam filtering, sentiment analysis,
and language identification. Naive Bayes is categorized as a generative model, because it models
the joint distribution of inputs and classes to "generate" data points, in contrast with discriminative
models that directly estimate the class boundary.
Before applying any classifier, text must be converted into a format suitable for machine learning
algorithms. In Naive Bayes, we represent a document as a bag of words (BoW), which treats the
document as an unordered collection of words, discarding grammar and word order. This assumption
simplifies modeling, reducing a complex structured input into a feature vector.
The text classification pipeline using Naive Bayes typically consists of the following steps:
1. Data Collection
Gather labeled training data where each document is already assigned a class (e.g., spam or not
spam, positive or negative).
2. Text Preprocessing
Convert raw text into a clean format for analysis (tokenization, lowercasing, stopword removal, and optionally stemming or lemmatization).
3. Feature Extraction
Represent each document as a bag-of-words feature vector (word counts or binary presence indicators).
4. Training
Estimate the class priors P(c) and the word likelihoods P(w | c) from the labeled data.
5. Classification
Assign a new document the class c that maximizes P(c) · ∏ P(wᵢ | c).
6. Evaluation
Measure performance on held-out data using metrics such as:
● Accuracy
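A minimal sketch of this pipeline using scikit-learn (the tiny training set is invented for illustration; a real system needs a large labeled corpus):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Steps 1-3: tiny labeled corpus, bag-of-words features.
train_docs = ["I loved this movie", "great acting and plot",
              "terrible film", "boring and predictable plot"]
train_labels = ["pos", "pos", "neg", "neg"]

vectorizer = CountVectorizer(lowercase=True)        # preprocessing + BoW counts
X_train = vectorizer.fit_transform(train_docs)

# Step 4: train a multinomial Naive Bayes model (add-one smoothing by default).
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

# Step 5: classify new documents.
X_test = vectorizer.transform(["predictable and boring", "great movie"])
print(model.predict(X_test))                         # e.g., ['neg' 'pos']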
This approach is typically used in supervised learning tasks like spam detection, sentiment analysis,
or topic categorization, where the classifier is trained on documents that are already labeled.
This is purely a generative model where the sentence is generated from the class model, and the
focus is on calculating likelihood rather than making a classification.
Such language modeling is useful in applications like document ranking, spell correction, machine
translation, or even language identification, where we need to evaluate which class (or language) is
most likely to produce the given sentence.
1. Context-Dependent Polarity
● Example: “The plot was predictable” (negative), but “Predictable in a good way”
(positive).
This ambiguity makes it hard for a classifier to generalize without deep contextual
understanding.
2. Sarcasm and Irony
● Example: “Great! Another bug in the software.” is sarcastic, but appears positive to a
naive classifier.
● Sarcasm detection typically requires pragmatic cues, which are hard to model in
traditional supervised settings.
3. Domain Dependence
Supervised sentiment models are domain-specific. A model trained on movie reviews may
perform poorly on product reviews or political opinions.
● Sentiment expressions vary across domains. For instance, the word “unpredictable” is
positive in movie reviews but negative in car reviews.
● Domain adaptation is difficult due to vocabulary shifts and changing sentiment cues.
4. Data Sparsity and Imbalanced Classes
Many datasets have an uneven distribution of sentiment classes. Neutral or majority classes
dominate, while minority classes (e.g., strong negatives) are underrepresented.
● Also, certain sentiments (like anger or sarcasm) may have few examples, making it
difficult for the model to learn meaningful patterns.
5. Negation Handling
For instance, negation words like “not”, “never” change the polarity, but only if correctly linked
to the word they modify.
Example: “I do not like this movie” → model must connect not with like.
8. How is accuracy of a Naive Bayes sentiment classifier optimized
Text Preprocessing
Proper preprocessing ensures that the input to the classifier is clean, normalized, and informative.
● Lowercasing: Convert all text to lowercase to avoid treating “Good” and “good” as different
words.
● Stopword Removal: Common words (e.g., the, is, in) that carry little sentiment can be
removed.
● Stemming/Lemmatization: Reduces words to their base form (e.g., loved → love) so that
variants are treated as the same feature.
● Bag-of-Words (BoW): The standard approach, but improvements can be made by:
● Binarization: For Naive Bayes, converting word counts to 0/1 (presence/absence) often
improves accuracy. This is called Binary Multinomial Naive Bayes.
● N-grams: Using bigrams or trigrams captures short phrases like "not good", which are
important in sentiment detection.
Handling Negation
Naive Bayes fails to capture negation unless explicitly handled. A common trick is:
● Negation Tagging: Add a “NOT_” prefix to every word following a negation word (e.g., not,
never, didn't) until a punctuation mark is found.
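A rough sketch of such negation tagging in Python (the negation word list and tokenizer are simplified assumptions):

import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}

def tag_negation(text):
    # Prefix "NOT_" to every token after a negation word, until punctuation.
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    tagged, negate = [], False
    for tok in tokens:
        if tok in ".,!?;":
            negate = False
            tagged.append(tok)
        elif negate:
            tagged.append("NOT_" + tok)
        else:
            tagged.append(tok)
            if tok in NEGATIONS:
                negate = True
    return " ".join(tagged)

print(tag_negation("I did not like this movie, but the acting was good."))
# i did not NOT_like NOT_this NOT_movie , but the acting was good .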
Smoothing Techniques
To prevent zero probabilities when a test word wasn't seen in training, Laplace Smoothing (Add-1 smoothing) is applied:
P(w | c) = (count(w, c) + 1) / (Σ count(w′, c) + |V|), where |V| is the vocabulary size.
Careful Stopword Handling
● Words like "not", "never", "no" appear in standard stopword lists but are crucial for detecting negative sentiment, so they should be retained.
Feature Selection
Remove irrelevant or misleading features using:
● Chi-square test, Mutual Information, or Information Gain to keep only the most
sentiment-informative words.
● These help identify words with strong positive or negative polarity, even in small training sets.
Data Balancing
In imbalanced datasets (e.g., more positive reviews than negative), the model becomes biased.
Accuracy is improved by:
● Oversampling the minority class or undersampling the majority class.
● Using class priors that reflect the true class distribution.
● Using a held-out validation set or cross-validation for reliable error estimation and model selection.
Module 4:
1. Explain the architecture and design features of Information retrieval
1. Indexing
Indexing is the process of organizing data to enable rapid search and retrieval. In IR, an inverted
index is commonly used. This structure maps each term in the document collection to a list of
documents (or document IDs) where that term occurs. It typically includes additional information like
term frequency, position, and weight (e.g., TF-IDF score).
Efficient indexing allows the system to avoid scanning all documents for every query, dramatically
reducing search time and computational cost. Index construction involves tokenizing documents,
normalizing text, and storing index entries in a sorted and optimized structure, often with compression
techniques to reduce storage requirements.
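A minimal sketch of building and querying an inverted index in Python (the document collection and whitespace tokenization are simplified assumptions):

from collections import defaultdict

docs = {
    1: "machine translation of natural language",
    2: "information retrieval and search engines",
    3: "statistical machine learning for language",
}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A conjunctive (AND) query intersects the posting lists of its terms.
def search(query):
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("machine language"))   # {1, 3}
print(search("machine retrieval"))  # set()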
2. Stop Word Elimination
Stop words are extremely common words that appear in almost every document, such as "the", "is",
"at", "which", "on", and "and". These words usually add little value to understanding the main content
or differentiating between documents.
Removing stop words reduces the size of the index, speeds up the search process, and minimizes
noise in results. However, careful handling is required because some stop words may be semantically
important depending on the domain (e.g., "to be or not to be" in literature, or "in" in legal texts). Most
IR systems use a predefined stop word list, though it can be customized based on corpus analysis.
3. Stemming
Stemming is a form of linguistic normalization used to reduce related words to a common base or root
form. For example, "connect", "connected", "connecting", and "connection" are all reduced to the common stem "connect".
Stemming improves recall in IR systems by ensuring that different inflected or derived forms of a
word are matched to the same root term in the index. This is particularly important in languages with
rich morphology.
Common stemming algorithms include the Porter stemmer, the Lovins stemmer, and the Snowball (Porter2) family of stemmers.
Stemming is different from lemmatization, which uses vocabulary and grammar rules to derive the
base form.
4. Zipf’s Law
Zipf’s Law is a statistical principle that describes the frequency distribution of words in natural
language corpora. It states that the frequency f of any word is inversely proportional to its rank r:
f ∝ 1/r
This means that the most frequent word occurs roughly twice as often as the second most frequent
word, three times as often as the third, and so on.
For example, in English corpora, words like "the", "of", "and", and "to" dominate the frequency list.
Meanwhile, the majority of words occur rarely (called the "long tail").
In IR, Zipf's Law justifies practices such as removing very high-frequency stop words (which carry little discriminative power) and giving greater weight to rarer, more informative terms.
Understanding this law helps in designing efficient indexing and retrieval strategies that focus on the
more informative, lower-frequency words.
Introduction to IR Models
Information Retrieval (IR) models are mathematical frameworks or algorithms designed to retrieve
relevant documents from a large corpus in response to a user’s information need, often expressed as
a query. These models serve as the backbone of search engines, recommendation systems, and
other knowledge discovery platforms. They function by transforming both the documents and queries
into a formal representation and then applying a matching or ranking function to determine relevance.
Based on foundational principles, IR models are broadly classified into Classical and Non-Classical
models. Each category employs distinct strategies for matching queries with relevant content.
1. Boolean Model
● Document and Query Representation: Each document is encoded as a binary vector where
each term is either present (1) or absent (0). Queries are Boolean expressions using operators
like AND, OR, and NOT.
● Operators:
○ AND: Intersection of sets; retrieves documents containing all of the query terms.
○ OR: Union of sets; retrieves documents containing at least one of the terms.
○ NOT: Set complement; excludes documents containing the term.
● Example:
Query: AI AND Robotics
Documents:
D1: "AI and Robotics in Industry" → Retrieved
D2: "AI in Healthcare" → Not Retrieved
● Advantages:
● Limitations:
2. Vector Space Model
● Document and Query Representation: Documents and queries are represented as term-weight vectors (e.g., using TF-IDF weights).
● Similarity Measurement:
○ Cosine Similarity: Measures the cosine of the angle between document and query
vectors. A higher cosine value indicates higher relevance.
○ Jaccard Coefficient: Used when binary vectors are applied; computes the intersection
over union of term sets.
● Solved Example:
○ Documents:
D1: “AI is shaping the future of technology”
D2: “History of AI and computing”
○ Cosine Similarity:
D1: 0.578 → Ranked 1
D2: 0.000 → Ranked 2
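The query used in the solved example is not shown above; as a generic illustration, a scikit-learn sketch of TF-IDF weighting and cosine ranking over the same two documents (the query here is my own, so the scores will differ from the example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["AI is shaping the future of technology",
        "History of AI and computing"]
query = ["future technology"]          # illustrative query, not the one from the example

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)       # documents ranked by cosine similarity to the query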
● Advantages:
3. Probabilistic Model
● Key Idea: Documents are ranked by the estimated probability of their relevance to the query.
● Key Assumptions:
○ Terms are assumed to occur independently of one another (term independence).
○ The relevance of a document is assumed to be independent of other documents.
● Advantages:
○ Statistically grounded.
● Limitations:
Non-Classical IR Models
1. Information Logic Model
● Key Concept:
Introduces uncertainty measures to evaluate how strongly a document term supports or
contradicts a query. This is derived from van Rijsbergen’s logical inference framework.
● Example:
Sentence: “Atul is serving a dish”
Term: “dish”
If "dish" logically supports the truth of the sentence, then it contributes positively to retrieval.
● Advantages:
2. Situation Theory Model
● Key Elements:
○ Information is represented in terms of infons, discrete units of information.
○ Infons are associated with polarity: 1 (true), 0 (false).
● Working Mechanism:
● Applications:
3. Interaction Model
● Key Features:
● Applications:
○ Conversational agents.
3. Discuss the cluster model, Fuzzy model and LSTM model in IR
Cluster Model
The cluster model is based on the cluster hypothesis, which states that closely associated documents tend to be relevant to the same queries. This hypothesis suggests that related documents are likely to be retrieved together. Therefore, by grouping related documents into clusters, the search time can be significantly reduced. Rather than
matching a query with every document in the collection, it is first matched with representative
vectors of each cluster. Only the documents from clusters whose representative is similar to the
query vector are then examined in detail.
Clustering can also be applied to terms, rather than documents. In such cases, terms are grouped
into co-occurrence classes, which are useful in dimensionality reduction and thesaurus
construction.
LSTMs are widely used in natural language processing (NLP) tasks, particularly in Machine
Translation, speech recognition, and information retrieval, where understanding the context
across sequences is essential.
2. Structure of LSTM
An LSTM network consists of a chain of repeating modules, where each module contains four key
components:
● a cell state, which carries long-term information across time steps;
● a forget gate, which decides what information to discard from the cell state;
● an input gate, which decides what new information to add to the cell state; and
● an output gate, which decides what part of the cell state to expose as the hidden state.
Each gate is a neural network layer that controls how information flows through the network.
● Resistant to vanishing gradients: The gating mechanisms preserve gradients across many
time steps.
● The encoder LSTM reads the source language sentence and encodes it into a context vector.
● The decoder LSTM uses this context to generate the target language sentence, one word at a
time.
This architecture works well in sequence-to-sequence tasks and has been a standard approach
before the introduction of Transformer models.
● Contextual document ranking: Ranking results based on the semantic relationship between
query and document.
6. Limitations of LSTM
● Training complexity: LSTM networks are computationally intensive.
● Long training times: Due to their sequential nature, LSTMs cannot be easily parallelized.
1. Relevance of Results
One of the core challenges in IR is identifying which documents are genuinely relevant to a user’s
query. Relevance is subjective and context-dependent, making it difficult to model computationally.
● Example: The query “jaguar” may refer to a car brand, an animal, or a sports team.
● Impact: Systems may retrieve either too few or too many irrelevant results.
2. Vocabulary Mismatch
Often, users and documents refer to the same concept using different words.
● Example: A user searching for “automobile” may not retrieve documents containing only the
word “car.”
● Solutions:
○ Query Expansion: Adds synonyms and related terms using tools like WordNet or
embeddings.
3. Query Ambiguity
● Queries are typically short, and many query words are polysemous, so the intended sense is often unclear.
● Need: IR systems must infer the correct sense from limited query context.
Structure of WordNet
WordNet organizes words into sets of synonyms known as synsets, each representing one distinct
concept or sense. These synsets are interlinked by various lexical and semantic relations, which
allows WordNet to model complex linguistic relationships.
● A word may appear in multiple synsets, depending on how many senses it has.
Example:
The word “read” appears in different synsets, one for each of its distinct senses.
WordNet organizes words from four syntactic categories:
● Nouns
● Verbs
● Adjectives
● Adverbs
Separate databases exist for each category, though some relations may cross between categories
(e.g., derivationally related forms).
4. Example of a WordNet Entry: “read”
WordNet lists 1 noun sense and 11 verb senses for “read.”
Each verb sense includes a synset, a gloss (definition), and example usage.
Applications of WordNet
1. Word Sense Disambiguation (WSD)
Example:
In “The python bit him,” WordNet helps determine if “python” means a snake or a
programming language.
2. Text Categorization
Example:
A document mentioning "poodle" and "bulldog" can be categorized under "dog" →
"animal".
3. Information Retrieval and Query Expansion
Example:
Query for “car” can be expanded to include “automobile” and “vehicle.”
● Relevance Improvement: Matching documents that use different but semantically related
terms.
4. Text Summarization
● WordNet is used to form lexical chains to identify salient words or concepts.
6. What are the Framenet and Stemmers? Explain their usage in NLP tasks.
1. FrameNet
1.1 Introduction
FrameNet is a lexical database of English based on the theory of frame semantics, developed by
Charles J. Fillmore. It annotates English sentences with semantic frames that represent common
situations or events and the roles (participants) involved in them.
● Core Roles: the frame elements that are essential to the meaning of a frame (for example, SENDER, RECIPIENT, and THEME in a TRANSFER frame).
● Helps understand who did what to whom, when, where, and how.
b) Question Answering
● Frame semantics helps identify the relevant parts of a sentence to extract accurate answers.
Example:
Q: “Who sent the packet to Khushbu?”
A: “Khushbu received a packet from the examination cell.”
Using the TRANSFER frame, roles like SENDER and RECIPIENT are matched.
c) Machine Translation
d) Text Summarization
e) Information Retrieval
● Retrieves documents by matching frames and roles, not just surface words.
2. Stemmers
2.1 Introduction
Stemming is the process of reducing inflected or derived words to their base or root form, known as
the stem. It is a fundamental preprocessing step in NLP and IR to standardize word forms, improving
matching and reducing vocabulary size.
Example:
“connect”, “connected”, “connection” → “connect”
Stemming differs from lemmatization as it doesn’t ensure the output is a valid dictionary word—it just
reduces the word to a common base form.
● Snowball stemmer supports: English, French, Spanish, Portuguese, German, Hungarian, etc.
b) Indian Languages
● Hindi and Bengali stemming research by Ramanathan & Rao (2003) and Majumder et al.
(2007).
b) Text Categorization
c) Text Summarization
d) Query Expansion
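To make stemmer behaviour concrete, a small usage sketch with NLTK's Porter and Snowball stemmers (assuming NLTK is installed; the commented output is the typical result, including the over-stemming of "university" noted earlier):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")      # Snowball also supports French, German, etc.

words = ["connect", "connected", "connection", "university", "universe"]
print([porter.stem(w) for w in words])
# ['connect', 'connect', 'connect', 'univers', 'univers']  <- note the over-stemming
print([snowball.stem(w) for w in words])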
Different types of POS taggers are employed based on the underlying algorithmic approach. These
taggers vary in complexity, accuracy, and their reliance on linguistic or statistical resources.
1. Rule-Based POS Taggers
Rule-based POS taggers rely on a set of handcrafted grammatical rules and lexicons to determine
the POS tags of words.
Working:
● They use a dictionary that lists words and their possible tags.
Example:
If a word follows a determiner and is not a verb, it is likely a noun (e.g., “the cat”).
Advantages:
● Linguistically interpretable.
Disadvantages:
● Difficult to maintain and scale.
Types:
a) Unigram Tagger
● Assigns the most frequent tag for each word based on training data.
Advantages:
● Adaptable to real-world data.
Disadvantages:
● Require labeled training data.
Working:
● Initially assign tags using a simple method (e.g., unigram tagging).
Key Features:
● Rules are automatically learned from training data.
● Rules are ordered and applied only if specific patterns are matched.
Advantages:
● Interpretable like rule-based systems.
Disadvantages:
● Slower than pure statistical taggers.
Common Approaches:
a) Decision Trees
Predict tags by learning rules from feature splits based on context.
Advantages:
● High accuracy on diverse and large datasets.
Disadvantages:
● Require significant computation and data.
● Less interpretable.
5. Hybrid Taggers
Hybrid taggers combine two or more techniques to balance accuracy, speed, and robustness.
Examples:
● Combining a rule-based system with a statistical backoff.
Use Case:
Especially useful in low-resource languages or domain-specific corpora where combining
resources improves coverage and accuracy.
8. What are research corpora? Discuss their types and use in NLP experiments
Research Corpora
1. Introduction to Research Corpora
A research corpus (plural: corpora) is a large and structured set of textual or speech data,
systematically collected and used for linguistic research and Natural Language Processing (NLP)
tasks. It acts as the empirical foundation for designing, training, testing, and evaluating NLP models.
Corpora are essential in enabling data-driven approaches in computational linguistics. They offer
real-world linguistic evidence for studying patterns, building models, and conducting experiments in
areas such as part-of-speech tagging, syntactic parsing, named entity recognition, sentiment analysis,
and machine translation.
● Annotation: Includes linguistic labels such as POS tags, parse trees, or semantic roles.
a) Raw (Unannotated) Text Corpora
● Consist of plain text collected without linguistic annotation.
Examples:
● Wikipedia dump
● News articles
● Web-crawled corpora
b) POS-Tagged / Morphologically Annotated Corpora
● Includes annotations like part-of-speech (POS) tags, lemmas, and morphological features.
Example:
● Brown Corpus
c) Semantically Annotated Corpora
● Include annotations for named entities, semantic roles, word senses, etc.
Examples:
d) Discourse-Annotated Corpora
Examples:
● OntoNotes
e) Speech Corpora
● Used for automatic speech recognition (ASR), text-to-speech (TTS), and speaker
identification.
Examples:
● TIMIT
● LibriSpeech
● Switchboard Corpus
f) Parallel Corpora
● Contain sentence-aligned text in two or more languages, used mainly for machine translation.
Examples:
● Europarl Corpus
● OpenSubtitles
● UN Corpus
g) Domain-Specific Corpora
● Contain text from a specialized domain such as biomedicine, law, or finance.
Examples:
● MedLine abstracts
Challenges in using corpora include:
● Domain mismatch: Training on one domain (e.g., news) may not generalize to another (e.g.,
tweets).
1. Define Machine Translation. List and explain its types and applications
Machine Translation (MT) is the process of automatically converting text or speech from one natural language (the source language) into another (the target language) using computer software.
MT systems aim to analyze the structure and meaning of the source language and generate an
equivalent expression in the target language. This process involves understanding grammar,
vocabulary, context, and even cultural nuances.
1. Rule-Based Machine Translation (RBMT)
RBMT translates using bilingual dictionaries and hand-written linguistic rules (morphological, syntactic, and semantic). RBMT is highly dependent on the quality and coverage of its rule sets and dictionaries. It provides
more interpretable results but requires extensive human effort to create and maintain the linguistic
resources for each language pair.
2. Statistical Machine Translation (SMT)
SMT learns translation correspondences automatically from large parallel corpora using probabilistic models. SMT systems require large amounts of parallel data to train effectively. While they can generalize well
from data, they often struggle with grammar and context, leading to translations that may sound
unnatural.
3. Example-Based Machine Translation (EBMT)
EBMT translates new input by retrieving and adapting stored examples of previously translated sentences. This approach is based on the concept of translation by analogy. Its performance depends on the
quantity and quality of stored examples. EBMT systems work well in domains with repetitive or
formulaic language.
4. Neural Machine Translation (NMT)
NMT uses deep neural networks, typically encoder-decoder architectures, trained end-to-end on parallel data. NMT systems produce translations that are generally more fluent and contextually appropriate than
previous methods. They require substantial computational resources and large datasets but represent
the current state-of-the-art in machine translation.
5. Hybrid Machine Translation
Hybrid systems may use rules for handling certain linguistic constructs while relying on statistical or
neural components for general translation. They are often employed in specialized domains where
accuracy and control are crucial.
Applications of Machine Translation
One of the most prominent applications is in cross-lingual communication, where MT systems are used
in real-time messaging and voice translation tools to bridge language barriers. These systems are
widely adopted in social media platforms, customer service chatbots, and mobile translation apps.
In the domain of content localization, machine translation helps adapt websites, software interfaces,
and multimedia content to different languages and cultures. This is particularly useful for multinational
companies seeking to reach international audiences.
Another key area of application is in customer support, where automated translation enables
companies to provide assistance in multiple languages without requiring a large multilingual staff. This
enhances efficiency and improves user satisfaction.
Machine translation is also used in e-commerce to translate product descriptions, user reviews, and
transaction messages, thereby improving the shopping experience for users around the world.
In the field of education and research, MT tools help students and scholars access academic materials
published in foreign languages, broadening the scope of knowledge and collaboration.
The media and journalism industries benefit from machine translation by distributing news content
across linguistic regions rapidly and efficiently. Similarly, the travel and tourism industry relies on
translation tools to help tourists navigate foreign environments.
2. What are translation divergences? Explain the different types with examples
When translating between languages, especially those that are typologically different (e.g., English and
Hindi, English and Japanese), divergences must be handled carefully to maintain meaning, fluency,
and grammatical correctness.
1. Lexical Divergence
Occurs when a word in one language has no single, direct equivalent in the other and the idea must
be expressed with a different lexical choice.
● Example:
○ English: I missed the train.
○ Hindi: मेरी ट्रेन छूट गई। (Literal: My train left.)
Here, "missed" doesn't translate directly; Hindi uses a construction meaning "the train left."
2. Syntactic Divergence
This involves differences in sentence structure or word order between languages.
● Example:
○ English: I eat apples.
○ Japanese: 私はリンゴを食べます。
(Literal: I apples eat.)
Japanese uses Subject–Object–Verb (SOV) order, unlike English's Subject–Verb–Object (SVO) order.
3. Morphological Divergence
Occurs when the languages differ in how they express grammatical information (e.g., tense,
number, case) through word forms.
● Example:
○ English: They ran.
○ Chinese: 他们跑了。
(Pinyin: Tāmen pǎo le.)
Chinese does not inflect verbs for tense like English does; it uses aspect markers (like 了) instead.
4. Categorical Divergence
Happens when the same idea is expressed using different grammatical categories in the two
languages.
● Example:
○ English: I am hungry.
○ French: J'ai faim. (Literal: I have hunger.)
Here, "hungry" is an adjective in English, while the equivalent concept ("faim") is expressed as a noun
in French.
5. Structural Divergence
Arises when an extra phrase or clause is needed in the target language to preserve the original
meaning.
● Example:
○ English: I entered the house.
○ Spanish: Entré en la casa. (Literal: I entered in the house.)
Spanish requires the prepositional phrase "en la" that has no direct English counterpart in this context.
6. Idiomatic Divergence
Involves idioms or expressions that cannot be translated literally without losing meaning.
● Example:
3. Explain the encoder-decoder architecture used in Neural Machine Translation
Neural Machine Translation is typically built on the encoder-decoder (sequence-to-sequence)
architecture. This architecture handles variable-length sequences and allows the system to model
complex relationships between words, making it suitable for capturing the contextual and semantic
information necessary for accurate translation.
2.1 Encoder
The encoder is responsible for processing the input sentence (in the source language). It reads the
entire sentence word by word and converts it into a fixed-length context vector (also called a thought
vector or hidden representation).
● Typically implemented using Recurrent Neural Networks (RNNs), Long Short-Term Memory
(LSTM), or Gated Recurrent Units (GRUs).
● In modern systems, Transformer encoders are used for parallel processing and better context
handling.
Function:
The encoder learns to represent the entire input sentence in a compressed vector form that captures
the overall meaning.
2.2 Decoder
The decoder takes the context vector produced by the encoder and generates the translated
sentence (in the target language) word by word.
● Like the encoder, the decoder is usually an RNN, LSTM, GRU, or Transformer model.
● At each time step, the decoder predicts the next word in the sequence, using the context
vector, its current hidden state, and the previously generated words.
Function:
The decoder learns to generate fluent and grammatically correct output based on the encoded source
sentence.
3. Working Mechanism
The basic steps are as follows (a code sketch follows this list):
● The encoder processes each token and outputs a context vector summarizing the
entire sentence.
● The decoder uses this vector to generate the first word of the target sentence.
● This predicted word is then fed back into the decoder to predict the next word.
● This process continues until the decoder outputs an end-of-sentence token (<EOS>).
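The sketch below shows this encode-then-decode loop as a toy PyTorch model (PyTorch is assumed to be installed; the vocabulary sizes, dimensions, and the <SOS>/<EOS> token ids are arbitrary assumptions, not taken from any particular system):

# Minimal sketch of an RNN encoder-decoder (seq2seq) -- for illustration only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))      # hidden acts as the fixed-length context vector
        return hidden

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, prev_token, hidden):         # prev_token: (batch, 1)
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden            # logits over the target vocabulary

# One greedy decoding pass: encode the source, then generate word by word.
encoder, decoder = Encoder(1000), Decoder(1200)
src = torch.randint(0, 1000, (1, 7))               # a dummy source sentence of 7 token ids
hidden = encoder(src)                              # context vector summarizing the sentence
token = torch.tensor([[1]])                        # assume id 1 = <SOS>
for _ in range(10):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(-1)                      # feed the predicted word back into the decoder
    if token.item() == 2:                          # assume id 2 = <EOS>
        break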
4. Limitations of the Basic Model
● Lack of Focus: The model has no mechanism to focus on relevant parts of the input while
generating each word, since the whole sentence is compressed into a single context vector.
5. Attention Mechanism
The attention mechanism addresses this limitation by letting the decoder look at all encoder states
and weight them differently at every decoding step, instead of relying on one fixed vector.
● Widely used in Transformer models and modern NMT systems like Google Translate
and DeepL.
The Transformer has become the standard architecture for modern NMT systems.
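To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside the Transformer (toy dimensions; PyTorch is assumed):

# Scaled dot-product attention -- illustrative sketch with random toy tensors.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query with each key
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ V                              # weighted sum of the values

Q = torch.randn(1, 5, 16)   # 5 decoder positions (queries)
K = torch.randn(1, 7, 16)   # 7 encoder positions (keys)
V = torch.randn(1, 7, 16)   # encoder values
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 16])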
4. Discuss the challenges and techniques in low-resource machine translation
Low-resource machine translation refers to settings in which little or no parallel (bilingual) training data
is available for a language pair. These low-resource settings are common among indigenous, regional,
and less commonly taught languages, making this an important area for linguistic diversity and digital
inclusion.
3.3 Back-Translation
Involves training a model to translate from the target language to the source language, then using it
to generate synthetic parallel data from monolingual target-language text. This synthetic data
augments the training set of the forward translation model.
3.4 Data Augmentation
Other augmentation methods include:
● Word reordering
● Sentence paraphrasing
These methods artificially increase the volume and diversity of training data; a toy sketch follows.
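The sketch below illustrates the word-reordering idea only (a deliberately simple toy; real augmentation pipelines are more careful about preserving grammaticality):

# Toy data augmentation: create a perturbed copy of a sentence by swapping adjacent words.
import random

def reorder_words(sentence, max_swaps=1):
    tokens = sentence.split()
    for _ in range(max_swaps):
        if len(tokens) < 2:
            break
        i = random.randrange(len(tokens) - 1)        # pick a position to swap with its neighbour
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

print(reorder_words("the cat sat on the mat"))        # e.g. "the cat on sat the mat"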
5. How is machine translation evaluated? Discuss BLEU and METEOR scores
Evaluating MT output is difficult because a single source sentence can have many acceptable
translations. To address this, both automatic and human evaluation methods are used.
Common automatic metrics include BLEU, METEOR, TER, and ChrF. Among these, BLEU and
METEOR are the most widely used.
BLEU (Bilingual Evaluation Understudy) measures the n-gram overlap between the system output and
one or more reference translations.
Advantages:
● Produces a score between 0 and 1, where 1 means a perfect match with the reference.
3.5 Limitations
● Ignores synonymy and paraphrasing
METEOR (Metric for Evaluation of Translation with Explicit ORdering) aligns the system output with the
reference using exact matches, stems, and synonyms, and combines precision and recall.
4.5 Limitations
● More computationally intensive than BLEU
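A minimal sketch of computing a sentence-level BLEU score with NLTK (assuming NLTK is installed; the sentences are made-up examples):

# Sentence-level BLEU with NLTK -- illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of tokenized reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # tokenized system output

smooth = SmoothingFunction().method1                     # avoids zero scores on short sentences
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")                              # a value between 0 and 1
# NLTK also provides nltk.translate.meteor_score for METEOR (it requires WordNet data).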
6. Discuss bias and ethical issues in Machine Translation
Bias in MT can affect the fairness, accuracy, and trustworthiness of translations, while ethical
issues raise questions about accountability, transparency, and inclusivity. Addressing these
challenges is critical to ensuring that MT systems serve all users equitably and responsibly.
● Example (ideological and cultural bias):
○ Religious or political terms may be translated in ways that reflect one ideology or
worldview.
● Example (unequal language coverage):
○ High-quality translations exist for English-French, but poor or non-existent support for
languages like Quechua or Xhosa.
● Example:
4. Mitigation Strategies
To address bias and ethical concerns, developers and researchers can implement the following
strategies:
● Enable user feedback mechanisms to report and correct biased or harmful translations.
● Increase support for low-resource languages through community collaboration and open
data initiatives.
● Apply fairness audits regularly to assess the performance of systems across different
languages and groups.
7. Compare Rule-Based, Statistical, and Neural Machine Translation approaches
Rule-Based Machine Translation (RBMT) relies on hand-crafted linguistic rules and bilingual
dictionaries, and typically works in three phases: analysis, transfer, and generation. In the analysis
phase, the source sentence is parsed according to its grammatical structure. The transfer phase
applies rules to convert the syntactic and lexical elements of the source language into the target
language. Finally, the generation phase reconstructs the sentence in the grammatical format of the
target language.
RBMT offers high interpretability and control over translation output. However, it requires intensive
manual effort to develop language-specific rules and resources. It also struggles with ambiguity,
idiomatic expressions, and scalability to new languages.
Statistical Machine Translation (SMT) learns translation patterns automatically from large parallel
corpora using probabilistic models. The typical SMT system includes phrase-based models that
translate segments of words rather than individual words. Word alignments, language models, and
phrase tables are used to construct translations based on statistical likelihood.
This approach significantly reduces the need for linguistic expertise and is easier to scale. However,
SMT tends to produce translations that are grammatically awkward or contextually inaccurate,
especially for complex sentences. It also struggles with capturing long-range dependencies and
maintaining sentence coherence.
Neural Machine Translation (NMT) uses deep neural networks, typically encoder-decoder models
trained end-to-end on parallel data. Modern NMT systems use attention mechanisms and transformer
models to improve context handling and translation quality. These models are capable of learning
semantic relationships and producing fluent, human-like translations.
NMT offers significant improvements over previous approaches in terms of fluency, accuracy, and the
ability to model complex language patterns. However, it requires large amounts of training data and
computational power. Additionally, NMT systems are often less interpretable and may generate
incorrect or biased translations if not properly trained.
4. Conclusion
Rule-Based Machine Translation emphasizes linguistic rules and provides control but lacks scalability
and adaptability. Statistical Machine Translation introduces automation through data but often
sacrifices fluency and semantic depth. Neural Machine Translation, by leveraging deep learning,
achieves high levels of translation quality and context awareness, though at the cost of requiring more
data and computational resources.
Each approach represents a key stage in the evolution of machine translation technology and
contributes to the continuous advancement of language processing systems.
8. What are some techniques used to improve MT in low-resource conditions?
2. Transfer Learning
Transfer learning involves training an MT model on a high-resource language pair and then
fine-tuning it on a related low-resource pair. This is especially effective when the languages share
linguistic characteristics such as syntax or vocabulary. Knowledge gained from high-resource data
helps improve performance in the low-resource setting.
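A schematic PyTorch sketch of one common transfer-learning recipe: reuse components trained on the high-resource pair and fine-tune only part of the network on the low-resource pair (the toy modules and names below are assumptions for illustration, not a specific published setup):

# Toy transfer-learning setup: keep the "high-resource" encoder frozen, fine-tune the decoder.
import torch
import torch.nn as nn

# Pretend these modules carry weights learned on a high-resource language pair.
encoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
decoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

for p in encoder.parameters():        # freeze the transferred knowledge
    p.requires_grad = False

# Only the decoder's parameters are updated on the small low-resource dataset,
# typically with a small learning rate to avoid catastrophic forgetting.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)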
4. Back-Translation
Back-translation is a widely used data augmentation technique. It involves the following steps
(illustrated in the sketch after this list):
● A reverse translation model is trained to translate from the target to the source language.
● Monolingual data in the target language is translated back into the source language.
● The resulting synthetic parallel data is combined with real parallel data to train the forward
translation model.
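A minimal sketch of these steps using pre-trained MarianMT models from the Hugging Face transformers library (assumptions: transformers is installed and the Helsinki-NLP/opus-mt-fr-en checkpoint, used here purely as an example reverse model, can be downloaded):

# Back-translation sketch: use a target->source model (French->English) to turn monolingual
# French text into synthetic English, creating synthetic parallel data for an English->French system.
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-fr-en"                   # example reverse (target->source) model
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

# Monolingual target-language text ("The cat sleeps on the mat.", "It is raining a lot today.")
monolingual_fr = ["Le chat dort sur le tapis.", "Il pleut beaucoup aujourd'hui."]

batch = tokenizer(monolingual_fr, return_tensors="pt", padding=True)
generated = model.generate(**batch)
synthetic_en = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Each (synthetic English, real French) pair is added to the training data of the
# forward English->French model.
synthetic_parallel = list(zip(synthetic_en, monolingual_fr))
print(synthetic_parallel)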
Unsupervised machine translation trains translation models using only monolingual data in each
language, with no parallel corpus at all. Though still a developing area, unsupervised MT has shown
promising results for certain language pairs.
7. Data Augmentation
Data augmentation techniques artificially increase the amount and diversity of training data. Common
methods include: