NLP Important Question and Answers Module Wise

Natural Language Processing (NLP) is a subfield of AI focused on the interaction between computers and human languages, enabling tasks like translation and sentiment analysis. Key applications include machine translation, speech recognition, and information retrieval, while challenges involve ambiguity, semantics, and grammar representation. The document also discusses the role of grammar in NLP, various statistical language models, and the Paninian framework for sentence analysis.

Module 1: Introduction & Language Modeling
1.​ Define NLP. Discuss its real-world applications

Natural Language Processing:

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Linguistics
concerned with the interaction between computers and human (natural) languages. It involves
designing algorithms and systems that allow computers to process, understand, interpret, and
generate human language in a meaningful way.

NLP focuses on enabling machines to comprehend and respond to language as it is naturally spoken or written by humans. By doing so, it bridges the gap between human communication and computer understanding, making it possible for computers to perform tasks such as language translation, sentiment analysis, and text summarization using natural language input.

Applications of NLP
Natural Language Processing (NLP) has a wide range of applications that aim to bridge the gap
between human language and computational systems. One of the major applications of NLP is
Machine Translation (MT), which involves automatically converting text or speech from one
language to another. MT systems analyze the source language for syntax and semantics and
generate equivalent content in the target language. Examples include Google Translate and
Microsoft Translator. The challenge in MT lies in handling grammar, idioms, context, and word
order, especially for Indian languages, which have a free word order.

Speech Recognition is another significant application where spoken language is converted into
text. This is used in systems like voice assistants (e.g., Google Assistant, Siri) and dictation tools.
It involves acoustic modeling, language modeling, and phonetic transcription. Speech recognition
must account for accents, background noise, and spontaneous speech.

Speech Synthesis, also known as Text-to-Speech (TTS), is the reverse process, where written
text is converted into spoken output. TTS systems are used in applications for visually impaired
users, public announcement systems, and interactive voice response (IVR) systems. These
systems require natural-sounding voice output, correct intonation, and pronunciation.
Natural Language Interfaces to Databases (NLIDB) allow users to interact with databases
using natural language queries instead of structured query languages like SQL. For example, a
user can ask “What is the balance in my savings account?” and the system translates it into a
database query. This application requires robust parsing, semantic interpretation, and domain
understanding.

Information Retrieval (IR) deals with finding relevant documents or data in response to a user
query. Search engines like Google, Bing, and academic databases are practical implementations
of IR. NLP techniques help in query expansion, stemming, and ranking results by relevance.

Information Extraction (IE) refers to the automatic identification of structured information such
as names, dates, locations, and relationships from unstructured text. IE is useful in fields like
journalism, business intelligence, and biomedical research. Named Entity Recognition (NER) and
Relation Extraction are key components of IE.

Question Answering (QA) systems provide direct answers to user questions instead of listing
documents. For example, a QA system can answer “Who is the President of India?” by retrieving
the exact answer from a knowledge base or corpus. These systems require deep linguistic
analysis, context understanding, and often integrate IR and IE.

Text Summarization involves automatically generating a condensed version of a given text while
preserving its key information. Summarization can be extractive (selecting key sentences) or
abstractive (generating new sentences). It is useful in generating news digests, executive
summaries, and academic reviews. Summarization systems must preserve coherence,
grammaticality, and meaning.

2.​ List and Explain the challenges of NLP

Challenges in Natural Language Processing


Natural Language Processing (NLP) deals with the inherently complex and ambiguous nature of
human language. One of the key challenges is representation and interpretation, which refers
to how machines can represent the structure and meaning of language in a formal way that
computers can manipulate. Unlike numbers or code, natural language involves abstract concepts,
emotions, and context, making it difficult to represent using fixed logical forms or algorithms.
Interpretation becomes even harder when the same sentence can carry different meanings
depending on the speaker’s intent, cultural background, or tone.

Another major challenge is identifying semantics, especially in the presence of idioms and
metaphors. Idioms such as "kick the bucket" or "spill the beans" have meanings that cannot be
derived from the literal meaning of the words. Similarly, metaphors like "time is a thief" require
deep contextual and cultural understanding, which machines struggle to grasp. These figurative
expressions pose a serious problem for semantic analysis since they don't follow regular linguistic
patterns.

Quantifier scoping is another subtle issue, dealing with how quantifiers (like “all,” “some,”
“none”) affect the meaning of sentences. For example, the sentence “Every student read a book”
can mean either that all students read the same book or that each student read a different one.
Disambiguating such sentences requires complex logical reasoning and context awareness.

Ambiguity is one of the most persistent challenges in NLP. At the word level, there are two main
types: part-of-speech ambiguity and semantic ambiguity. In part-of-speech ambiguity, a word
like “book” can be a noun (“a book”) or a verb (“to book a ticket”), and the correct tag must be
determined based on context. This ties into the task of Part-of-Speech (POS) tagging, where
the system must assign correct grammatical labels to each word in a sentence, often using
probabilistic models like Hidden Markov Models or neural networks.

In terms of semantic ambiguity, many words have multiple meanings—a problem known as
polysemy. For instance, the word “bat” can refer to a flying mammal or a piece of sports
equipment. Resolving this is the goal of Word Sense Disambiguation (WSD), which attempts to
determine the most appropriate meaning of a word in a given context. WSD is particularly difficult
in resource-poor languages or when the context is vague.

Another type of complexity arises from structural ambiguity, where a sentence can be parsed in
more than one grammatical way. For example, in “I saw the man with a telescope,” it is unclear
whether the telescope was used by the speaker or the man. Structural ambiguity can lead to
multiple interpretations and is a major hurdle in syntactic and semantic parsing.

3.​ What is the role of grammar in NLP? How is it different from Language?

Language and Grammar


Automatic processing of natural language requires that the rules and exceptions of a language be
explicitly described so that a computer can understand and manipulate them. Grammar plays a
central role in this specification, as it provides a set of formal rules that allow both parsing and
generation of sentences. Grammar, thus, defines the structure of language at the linguistic level,
rather than at the level of world knowledge. However, due to the influence of world knowledge on
both the selection of words (lexical items) and the conventions of structuring them, the boundary
between syntax and semantics often becomes blurred. Nonetheless, maintaining a separation
between the two is considered beneficial for ease of grammar writing and language processing.

One of the main challenges in defining the structure of natural language is its dynamic nature and
the presence of numerous exceptions that are difficult to capture formally. Over time, several
grammatical frameworks have been proposed to address these challenges. Prominent among
them are transformational grammar (Chomsky, 1957), lexical functional grammar (Kaplan and
Bresnan, 1982), government and binding theory (Chomsky, 1981), generalized phrase structure
grammar, dependency grammar, Paninian grammar, and tree-adjoining grammar (Joshi, 1985).
While some of these grammars focus on the derivational aspects of sentence formation (e.g.,
phrase structure grammar), others emphasize relational properties (e.g., dependency grammar,
lexical functional grammar, Paninian grammar, and link grammar).

The most significant contribution in this area has been made by Noam Chomsky, who proposed a
formal hierarchy of grammars based on their expressive power. These grammars employ phrase
structure or rewrite rules to generate well-formed sentences in a language. The general
framework proposed by Chomsky is referred to as generative grammar, which consists of a finite
set of rules capable of generating all and only the grammatical sentences of a language.
Chomsky also introduced transformational grammar, asserting that natural languages cannot be
adequately represented using phrase structure rules alone. In his work Syntactic Structures
(1957), he proposed that each sentence has two levels of representation: the deep structure,
which captures the sentence's core meaning, and the surface structure, which represents the
actual form of the sentence. The transformation from deep to surface structure is accomplished
through transformational rules.

Transformational grammar comprises three main components: phrase structure grammar, transformational rules, and morphophonemic rules. Phrase structure grammar provides the base structure of a sentence using rules such as:

S → NP + VP
NP → Det + Noun
VP → V + NP
V → Aux + Verb, etc.
Here, S represents a sentence, NP a noun phrase, VP a verb phrase, Det a determiner, and so
on. The lexicon includes determiners like “the,” “a,” and “an,” verbs like “catch,” “eat,” “write,”
nouns such as “police,” “snatcher,” and auxiliaries like “is,” “will,” and “can.” Sentences generated
using these rules are said to be grammatical, and the structure they are assigned is known as
constituent or phrase structure.

For example, for the sentence “The police will catch the snatcher,” the phrase structure rules
generate the following parse tree:
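Sketched in plain text (reconstructed using only the phrase structure rules listed above, since the original figure is a drawing), the tree is:

S
├── NP
│   ├── Det: The
│   └── Noun: police
└── VP
    ├── V
    │   ├── Aux: will
    │   └── Verb: catch
    └── NP
        ├── Det: the
        └── Noun: snatcher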

This tree represents the syntactic structure of the sentence as derived from phrase structure
rules.

Transformational rules are applied to the output of the phrase structure grammar and are used to
modify sentence structures. These rules may have multiple symbols on the left-hand side and
enable transformations such as changing an active sentence into a passive one. For example,
Chomsky provided a rule for converting active to passive constructions:​
NP₁ + Aux + V + NP₂ → NP₂ + Aux + be + en + V + by + NP₁.​
This rule inserts the strings “be” and “en” and rearranges sentence constituents to reflect a
passive construction. Transformational rules can be either obligatory, ensuring grammatical
agreement (such as subject-verb agreement), or optional, allowing for structural variations while
preserving meaning.

The third component, morphophonemic rules, connects the sentence representation to a string of
phonemes. For instance, in the transformation of the sentence “The police will catch the
snatcher,” the passive transformation results in “The snatcher will be caught by the police.” A
morphophonemic rule then modifies “catch + en” to its correct past participle form “caught.”

However, phrase structure rules often struggle to account for more complex linguistic phenomena
such as embedded noun phrases containing adjectives, modifiers, or relative clauses. These
phenomena give rise to what are known as long-distance dependencies, where related
elements like a verb and its object may be separated by arbitrary amounts of intervening text.
Such dependencies are not easily handled at the surface structure level. A specific case of
long-distance dependency is wh-movement, where interrogative words like “what” or “who” are
moved to the front of a sentence, creating non-local syntactic relationships. These limitations
highlight the need for more advanced grammatical frameworks like tree-adjoining grammars
(TAGs), which can effectively model such syntactic phenomena due to their capacity to represent
recursion and long-distance dependencies more naturally than standard phrase structure rules.

4.​ Explain the architecture of a statistical language model

Statistical Language Model


A statistical language model is a probability distribution P(S) over all possible word sequences. A number of statistical language models have been proposed in the literature. The dominant approach in statistical language modelling is the n-gram model.

N-gram model
The goal of statistical language models is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing the sentence probability into a product of conditional probabilities using the chain rule:

P(S) = P(w₁ w₂ ... wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁ w₂) · ... · P(wₙ | w₁ ... wₙ₋₁)

In order to calculate the sentence probability, we need to calculate the probability of a word given the sequence of words preceding it (its history hᵢ). An n-gram model simplifies this task by approximating the probability of a word given all the previous words by the conditional probability given only the previous n−1 words:

P(wᵢ | w₁ ... wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁)

Thus, an n-gram model calculates P(wᵢ | hᵢ) by modeling language as a Markov model of order (n−1), i.e., it looks at only the previous (n−1) words.

●​ A model that considers only the previous one word is called a bigram model (n = 2).​

●​ A model that considers the previous two words is called a trigram model (n = 3).​

Using bigram and trigram models, the probability of a sentence w₁, w₂, ..., wₙ can be estimated as:

Bigram:  P(w₁ w₂ ... wₙ) ≈ Πᵢ P(wᵢ | wᵢ₋₁)
Trigram: P(w₁ w₂ ... wₙ) ≈ Πᵢ P(wᵢ | wᵢ₋₂ wᵢ₋₁)

A special (pseudo) word <s> is introduced to mark the beginning of the sentence in bigram estimation. The probability of the first word in a sentence is conditioned on <s>. Similarly, in trigram estimation, we introduce two pseudo-words <s₁> and <s₂>.

Estimation of probabilities is done by training the n-gram model on a training corpus. We estimate n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e., using relative frequencies. We count a particular n-gram in the training corpus and divide it by the sum of all n-grams that share the same prefix:

P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁) = C(wᵢ₋ₙ₊₁ ... wᵢ) / Σ_w C(wᵢ₋ₙ₊₁ ... wᵢ₋₁ w)

The sum of all n-grams that share the first n−1 words is equal to the count of the common prefix, so we rewrite the previous expression as:

P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁) = C(wᵢ₋ₙ₊₁ ... wᵢ) / C(wᵢ₋ₙ₊₁ ... wᵢ₋₁)

The model parameters obtained using these estimates maximize the probability of the training set
T given the model M, i.e., P(T|M). However, the frequency with which a word occurs in a text may
differ from its frequency in the training set. Therefore, the model only provides the most likely
solution based on the training data.

Several improvements have been proposed for the standard n-gram model to address its weaknesses. The n-gram model suffers from data sparseness: n-grams that do not occur in the training data are assigned zero probability, even when the training corpus is large. A separate limitation comes from the assumption that a word's probability depends only on the preceding word(s), which is often not true; natural language contains long-distance dependencies that this model cannot capture.

To handle data sparseness, various smoothing techniques have been developed, such as
add-one smoothing. As Jurafsky and Martin (2000) state:

"Smoothing refers to re-evaluating zero or low-probability n-grams and assigning


them non-zero values."

The term smoothing reflects how these techniques adjust probabilities toward more uniform
distributions.
Add-one Smoothing
Add-One Smoothing is a simple technique used to handle the data sparseness problem in
n-gram language models by avoiding zero probabilities for unseen n-grams.

In an n-gram model, if a particular n-gram (like a word or word pair) does not occur in the
training data, it is assigned a probability of zero, which can negatively affect the overall probability
of a sentence. Add-One Smoothing helps by assigning a small non-zero probability to these
unseen events.
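Concretely, for a bigram model the add-one (Laplace) estimate adds 1 to every bigram count and adds the vocabulary size V to the denominator:

P_add-1(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁ wᵢ) + 1) / (C(wᵢ₋₁) + V)

where C(·) denotes a count in the training corpus and V is the number of distinct words in the vocabulary. Because every n-gram now has a count of at least 1, no n-gram receives zero probability.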

5.​ Derive and explain Unigram and bigram models

Derivation and Explanation of Unigram and Bigram Models


Example: Unigram Language Model

Sample Corpus (training data):

Let’s say we have the following 3 sentences in our training corpus:

1.​ the cat sat​

2.​ the cat meowed​

3.​ the dog barked

Step 1: Build Vocabulary and Count Frequencies

List all words and count how many times each word appears:

the → 3, cat → 2, sat → 1, meowed → 1, dog → 1, barked → 1 (total tokens N = 9)

Step 2: Calculate Unigram Probabilities (Using MLE)

Use the formula:

P(w) = count(w) / N

Example: P(the) = 3/9 ≈ 0.33, P(cat) = 2/9 ≈ 0.22, P(dog) = 1/9 ≈ 0.11
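The same estimates can be reproduced with a short Python sketch (plain Python 3, no external libraries; the corpus is the three training sentences above, and a <s> pseudo-word marks sentence beginnings for the bigram model, as described earlier):

from collections import Counter

corpus = ["the cat sat", "the cat meowed", "the dog barked"]

# Unigram counts over all tokens in the corpus
tokens = [w for sent in corpus for w in sent.split()]
unigram_counts = Counter(tokens)
total = len(tokens)
unigram_prob = {w: c / total for w, c in unigram_counts.items()}

# Bigram counts; each sentence is prefixed with the pseudo-word <s>
bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    for prev, cur in zip(words, words[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, cur):
    # MLE estimate: count of the bigram divided by the count of its prefix
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(unigram_prob["the"])        # 3/9 ≈ 0.333
print(bigram_prob("the", "cat"))  # 2/3 ≈ 0.667
print(bigram_prob("<s>", "the"))  # 3/3 = 1.0

Running it prints P(the) ≈ 0.333, P(cat | the) ≈ 0.667, and P(the | <s>) = 1.0, matching the hand calculation above.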
6. What is the Paninian Framework

Paninian Framework
Paninian Grammar is a highly influential grammatical framework, based on the ancient Sanskrit
grammarian Panini. It provides a rule-based structure for analyzing sentence formation using
deep linguistic features such as vibhakti (case suffixes) and karaka (semantic roles). Unlike
many Western grammars which focus on syntax, Paninian grammar emphasizes the relationship
between semantics and syntax, making it well-suited for Indian languages with free word order.

Architecture of Paninian Grammar

Paninian grammar processes a sentence through four levels of representation:

1. Surface Level (Shabda or Morphological Level)

●​ This is the raw form of the sentence as it appears in speech or writing.​

●​ It contains inflected words with suffixes (like tense, case markers, gender, number,
etc.).​

●​ The sentence is in linear word form but doesn’t reveal deeper structure.

Example (Hindi):​
राम ने फल खाया (Rām ne phal khāyā)​
At surface level:​
राम + ने | फल | खाया​
Here, "ने" is a vibhakti marker.

2. Vibhakti Level (Case-Level or Morphosyntactic Level)

●​ Focuses on vibhaktis (postpositions or case suffixes).​

●​ These vibhaktis provide syntactic cues about the role of each noun in the sentence.​

●​ Different vibhaktis (e.g., ने, को, से, का) indicate nominative, accusative, instrumental,
genitive etc.

Example:

"ने" → Ergative marker (agent in perfective aspect)

"को" → Dative/Accusative

"से" → Instrumental

3. Karaka Level (Semantic Role Level)

●​ Most critical part of Paninian framework.​

●​ Karaka relations assign semantic roles to nouns, like agent, object, instrument,
source, etc.​

Karaka roles are assigned based on the verb and semantic dependency, not fixed
word order.

Example:​
राम ने फल खाया

राम (Rām) → Karta (Agent)

फल (phal) → Karma (Object)


खाया (khāyā) → Verb (Action)

4. Semantics Level (Meaning Representation)

●​ Captures the final meaning of the sentence by combining all the karaka roles.​

●​ Helps in machine translation, question answering, and information extraction.​

●​ This level handles ambiguities, idioms, metaphors, and coreference.​

Example meaning:​
"Rām ate the fruit."​
→ {Agent: Ram, Action: eat, Object: fruit}

7. Describe Karaka Theory and its relevance to Indian NLP

Karaka Theory is a fundamental concept in Paninian grammar that explains the semantic
roles of nouns in relation to the verb in a sentence. It helps identify who is doing what to whom,
with what, for whom, where, and from where, etc.

●​ The term "Karaka" means "causal relation".


●​ It refers to the role a noun (or noun phrase) plays in relation to the verb.
●​ The karaka relation is verb-dependent and semantic (not just syntactic).

Unlike English grammar which focuses on Subject/Object, karaka theory goes deeper into
semantic functions.

The Six Main Karakas:

1. Karta (Doer/Agent of the action): typical marker (vibhakti) Ergative/Instrumental (ने/Ø); example (Hindi): राम ने खाया ("Ram ate")
2. Karma (Object; what the action is done on): typical marker Accusative (को/Ø); example: फल खाया ("Ate fruit")
3. Karana (Instrument of the action): typical marker Instrumental (से); example: चाकू से काटा ("Cut with a knife")
4. Sampradana (Recipient/Beneficiary): typical marker Dative (को); example: मोहन को दिया ("Gave to Mohan")
5. Apadana (Source/Point of separation): typical marker Ablative (से); example: दिल्ली से आया ("Came from Delhi")
6. Adhikarana (Location/Context): typical marker Locative (में, पर); example: कुर्सी पर बैठा ("Sat on the chair")

Features:

●​ One verb can have multiple karaka relations.​

●​ Karakas are decided based on verb meaning, not word position.​

●​ They are language-independent roles; similar roles exist in many world languages.​

Example Sentence (Hindi):

राम ने चाकू से फल मोहन को प्लेट में दिया।
(Ram gave the fruit to Mohan with a knife, on a plate.)

Word → Karaka (Role):

राम ने → Karta (Doer)
फल → Karma (Object)
मोहन को → Sampradana (Recipient)
चाकू से → Karana (Instrument)
प्लेट में → Adhikarana (Location/Context)


The two problems challenging linguists are:​
(i) Computational implementation of PG, and​
(ii) Adaptation of PG to Indian and other similar languages.

An approach to implementing PG has been discussed in Bharati et al. (1995). This is a multilayered implementation. The approach is named 'Utsarga-Apavāda' (default-exception),
where rules are arranged in multiple layers in such a way that each layer consists of rules which
are exceptions to the rules in the higher layer. Thus, as we go down the layers, more specific
information is derived. Rules may be represented in the form of charts (such as the Karaka chart
and Lakṣaṇa chart).

However, many issues remain unresolved, especially in cases of shared Karaka relations.
Another difficulty arises when the mapping between the Vibhakti (case markers and
postpositions) and the semantic relation (with respect to the verb) is not one-to-one. Two different
Vibhaktis can represent the same relation, or the same Vibhakti can represent different relations
in different contexts. The strategy to disambiguate the various senses of words or word groupings
remains a challenging issue.

As the system of rules differs across languages, the framework requires adaptation to handle
various applications in different languages. Only some general features of the PG framework
have been described here.

8. Compare Rule based and statistical based approaches to NLP

Natural Language Processing (NLP) focuses on enabling machines to understand and interact using human languages. It supports a variety of applications such as translation, summarization, speech recognition, and sentiment analysis. NLP methodologies have broadly evolved under two major approaches: Rule-Based and Statistical (Data-Driven).

1. Rule-Based Approach (Rational/Symbolic)

The rule-based approach is grounded in linguistic knowledge and relies on explicitly defined grammatical and syntactic rules. These systems are manually designed by experts who understand the structure and function of a language. For instance, parsing a sentence using Context-Free Grammar (CFG) rules falls under this category.

Features:

●​ Uses predefined rules for morphology, syntax, and semantics.​

●​ Requires extensive expert input and linguistic knowledge.​

●​ More interpretable and controllable due to human-defined logic.​

●​ Often language-specific and less adaptive to new data.​

Advantages:

●​ High accuracy on well-defined, grammatically correct input.​

●​ Useful for tasks that demand transparency and rule-explainability (e.g., legal or
medical domains).​

Disadvantages:
●​ Poor scalability to diverse or informal language (like social media text).​

●​ Time-consuming to develop and maintain for each new language or domain.​

2. Statistical Approach (Empirical/Data-Driven)

The statistical approach emerged with the rise of computational power and access to
large language datasets (corpora). These systems learn patterns and relationships from
data using machine learning and probabilistic models, such as Hidden Markov Models
(HMMs), Naive Bayes, or deep learning architectures.

Features:

●​ Learns from examples rather than relying on hand-crafted rules.​

●​ Flexible and adaptive to noisy, ambiguous, or incomplete data.​

●​ Relies on statistical measures like probability, frequency, and co-occurrence.​

Advantages:

●​ Scales well across languages and domains.​

●​ Capable of handling real-world complexities and informal language.​

●​ Continuously improves with more data.​

Disadvantages:

●​ Requires large annotated datasets for training.​

●​ May produce unpredictable or uninterpretable outputs.​

●​ Performance is data-dependent; less reliable on rare linguistic phenomena.


Module 2: Word-Level and Syntactic Analysis

1.​ Explain the use of regular expressions in NLP with examples

Regular expressions, or regexes for short, are a pattern-matching standard for string parsing
and replacement. They are a powerful way to find and replace strings that follow a defined format.
For example, regular expressions can be used to parse dates, URLs, email addresses, log
files, configuration files, command line switches, or programming scripts. They are useful
tools for the design of language compilers and have been used in NLP for tokenization,
describing lexicons, morphological analysis, etc.

We have all used simplified forms of regular expressions, such as the file search patterns used by
MS-DOS, e.g., dir*.txt.

The use of regular expressions in computer science was made popular by a Unix-based editor,
'ed'. Perl was the first language that provided integrated support for regular expressions. It used a
slash around each regular expression; we will follow the same notation in this book. However,
slashes are not part of the regular expression itself.

Regular expressions were originally studied as part of the theory of computation. They were first
introduced by Kleene (1956). A regular expression is an algebraic formula whose value is a
pattern consisting of a set of strings, called the language of the expression. The simplest kind
of regular expression contains a single symbol.

For example, the expression /a/ denotes the set containing the string 'a'. A regular expression
may specify a sequence of characters also. For example, the expression /supernova/ denotes
the set that contains the string "supernova" and nothing else.

In a search application, the first instance of each match to a regular expression is typically highlighted or underlined.

Character Classes

Characters are grouped by putting them between square brackets [ ]. Any character in the class
will match one character in the input. For example, the pattern /[abcd]/ will match a, b, c, or d.
This is called disjunction of characters.

●​ /[0123456789]/ specifies any single digit.​


●​ /[a-z]/ specifies any lowercase letter (you can use - for range).​

●​ /[5-9]/ matches any of the characters 5, 6, 7, 8, or 9.​

●​ /[mnop]/ matches any one of the letters m, n, o, or p.​

Regular expressions can also specify what a character cannot be, using a caret (^) at the
beginning of the brackets.

●​ /[^x]/ matches any character except x.​

●​ This interpretation is true only when the caret is the first character inside brackets.​

Table: Use of square brackets

RE       Match description                                    Example patterns matched
[abc]    Match any of a, b, or c                              Refresher course will start soon
[A-Z]    Match any character between A and Z                  TREC Conference
[^A-Z]   Match any character other than uppercase letters     TREC Conference
[ate]    Match a, t, or e                                     3+2=5
[2]      Match 2                                              The day has three different slots
[or]     Match o or r                                         Match a vowel

Case Sensitivity:

●​ Regex is case-sensitive.​
●​ /s/ matches lowercase s but not uppercase S.​

●​ To solve this, use: [sS].​

○​ /[sS]ana/ matches both "sana" and "Sana".

To match both supernova and supernovas:

●​ Use /supernovas?/ — the ? makes "s" optional (0 or 1 time).​

Repetition using Kleene Star (*) and Plus (+):

●​ /b*/ → matches '', 'b', 'bb', 'bbb' — 0 or more occurrences​

●​ /b+/ → matches 'b', 'bb', 'bbb' — 1 or more occurrences​

●​ /[ab]*/ → matches combinations like 'aa', 'bb', 'abab'​

●​ /a+/ → matches one or more as​

●​ /[0-9]+/ → matches one or more digits​

Anchors:

●​ ^ → matches at beginning of line​

●​ $ → matches at end of line​

●​ Example: /^The nature.$/ matches the entire line “The nature.”​

Wildcard Character (.):

●​ Matches any single character​

●​ /.at/ → matches cat, mat, rat, etc.​

●​ /.....berry/ → matches 10-letter words ending in berry like:​

○​ strawberry​

○​ blackberry​
○​ sugarberry​

○​ But not: blueberry (9 letters)​

Special Characters (Table 3.3)

RE Description

. Matches any single character

\n Newline character

\t Tab character

\d Digit (0-9)

\D Non-digit

\w Alphanumeric character (a-z, A-Z, 0-9, _)

\W Non-alphanumeric character

\s Whitespace

\S Non-whitespace

\ Escape special characters (e.g., \. matches a literal dot)

Disjunction using Pipe |:

●​ /blackberry|blackberries/ matches either one​

●​ Wrong: /blackberry|ies/ → matches either blackberry or ies, not blackberries​

●​ Correct way: /black(berry|berries)/​

Real-world use:

●​ Searching for Tanveer or Siddiqui → /Tanveer|Siddiqui/
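The short Python sketch below (using the standard re module) illustrates several of the constructs above: character classes, the optional ?, the + repetition operator, anchors, and disjunction with grouping. The sample strings are made up purely for illustration.

import re

text = "Sana booked a flight. sana read about supernovas and blackberries."

# Character class: match "sana" or "Sana"
print(re.findall(r"[sS]ana", text))                          # ['Sana', 'sana']

# Optional character: "supernova" with or without a trailing "s"
print(re.findall(r"supernovas?", text))                      # ['supernovas']

# Kleene plus: one or more digits
print(re.findall(r"[0-9]+", "Flight 370 departs at 0930"))   # ['370', '0930']

# Disjunction with grouping
print(re.findall(r"black(?:berry|berries)", "a blackberry pie"))  # ['blackberry']

# Anchors: a line consisting exactly of "The nature."
print(bool(re.match(r"^The nature\.$", "The nature.")))      # True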


2.​ Describe Finite-State Automata and its role in morphological parsing

A Finite-State Automaton (FSA) is a theoretical model of computation used to recognize


patterns within input data. It consists of:

●​ A finite set of states (Q)​

●​ An input alphabet (Σ) made up of valid symbols​

●​ A start state (S ∈ Q)​

●​ A set of final or accepting states (F ⊆ Q)​

●​ A transition function (δ) that maps a state and input symbol to the next state​

There are two main types of FSAs:

●​ Deterministic Finite Automaton (DFA) – Only one transition is allowed for a given input
from a state.​

●​ Non-Deterministic Finite Automaton (NFA) – Multiple transitions can occur from the
same state on the same input.​

FSAs are used to accept regular languages. They process strings symbol by symbol and
determine whether the input belongs to the defined language by checking if the final state reached
is an accepting state.
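As a minimal illustration (a hypothetical example, not taken from the text), the Python sketch below simulates a DFA over the alphabet {a, b} that accepts exactly the strings ending in "ab"; it shows how the transition function drives state changes and how acceptance is decided by the state reached at the end of the input.

# DFA that accepts strings over {a, b} that end with "ab"
# States: q0 (start), q1 (just seen 'a'), q2 (accepting: just seen "ab")
transitions = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
accepting = {"q2"}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = transitions[(state, symbol)]   # transition function δ
    return state in accepting

print(accepts("aab"))   # True
print(accepts("abba"))  # False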

Morphology is the study of the internal structure of words. Morphological parsing refers to
breaking a word into its meaningful units, called morphemes (e.g., root + suffix).​
Example:

●​ "eggs" → egg (root) + s (plural suffix)​

●​ "unhappiness" → un (prefix) + happy (root) + ness (suffix)​

Morphological parsing identifies such components and their grammatical roles.

Role of Finite-State Automata in Morphological Parsing

Morphological analysis can be effectively modeled using a special form of finite automaton called
a Finite-State Transducer (FST).

Finite-State Transducer (FST)


An FST is a six-tuple:​
(Σ₁, Σ₂, Q, q₀, F, δ), where:

●​ Σ₁ = Input alphabet (surface form)​

●​ Σ₂ = Output alphabet (lexical/morphemic form)​

●​ Q = States​

●​ q₀ = Initial state​

●​ F = Final states​

●​ δ = Transition function: Q × (Σ₁ ∪ {ε}) × (Σ₂ ∪ {ε}) → P(Q)​

An FST reads an input string and produces an output string, mapping surface forms to lexical
representations or vice versa.

Use in Morphological Parsing

FSTs are widely used for two-level morphology, where they help in:

1.​ Analyzing surface forms into valid morphemes (e.g., "dogs" → "dog+N+PL")​

2.​ Generating surface forms from morphemes (e.g., "dog+N+PL" → "dogs")​

For example:

●​ Input: lesser​

●​ FST processes: less + er (comparative suffix)​

●​ Output: less+ADJ+COMP​

The FST maps the surface string to the morphological structure while also handling spelling
variations (e.g., dropping of ‘e’ in hope + ing → hoping).

Example of FST in Morphological Parsing

Imagine a simple FST that maps:

●​ Input: "hot" → Output: "cot"​


●​ Input: "cat" → Output: "bat"​

This FST has transitions labeled with symbol pairs (input:output), such as:

●​ h:c​

●​ o:o​

●​ t:t​

This shows that an FST simultaneously reads input and writes output as it moves through
states.
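A toy version of such a transducer can be written directly as transitions labeled with input:output pairs. The sketch below is an illustrative assumption (not from the text); it walks an input string through the pairs and emits the corresponding output string, mirroring the h:c, o:o, t:t example.

# Toy FST: each transition consumes one input symbol and emits one output symbol.
# Encoded as (state, input, output, next_state) tuples.
edges = [
    ("q0", "h", "c", "q1"),   # h:c
    ("q1", "o", "o", "q2"),   # o:o
    ("q2", "t", "t", "q3"),   # t:t
]
final_states = {"q3"}

def transduce(word):
    state, output = "q0", []
    for ch in word:
        for (s, i, o, nxt) in edges:
            if s == state and i == ch:
                output.append(o)
                state = nxt
                break
        else:
            return None                 # no matching transition: input rejected
    return "".join(output) if state in final_states else None

print(transduce("hot"))   # cot
print(transduce("hat"))   # None (no path through the transducer)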

Advantages of Using FSAs and FSTs in Morphology

●​ Efficiency: FSAs and FSTs are computationally efficient and well-suited for real-time
applications.​

●​ Expressiveness: They can represent complex morphological patterns compactly.​

●​ Bidirectionality: A single FST can perform both analysis (surface → lexical) and
generation (lexical → surface).​

●​ Rule encoding: Morphophonological rules (e.g., spelling changes) can be embedded in


the transducer.

3.​ Explain morphological parsing with focus on stemming and lemmatization

Morphological Parsing
Morphological parsing is the process of analyzing a word to identify its morphemes, the smallest
meaning-bearing units of language. This includes recognizing the stem (or root) of the word and
identifying any affixes (prefixes, suffixes, infixes, circumfixes) that modify its meaning or
grammatical function. The goal is to map a word’s surface form (as it appears in text) to its
canonical form (or lemma) along with its morphological features (e.g., part of speech, number,
tense, gender).

Morphological parsing is essential for various Natural Language Processing (NLP) tasks such as
machine translation, information retrieval (IR), text-to-speech systems, and question answering.

Stemming and Lemmatization in Morphological Parsing


Two widely used techniques in morphological processing are stemming and lemmatization. Both
aim to reduce inflected or derived words to a base form, but they differ in methodology, accuracy,
and application.
Stemming
Definition:

Stemming is a rule-based process that removes known affixes from words to reduce them to a
common root form, known as a stem. The result may not always be a valid word in the language.

Goal:

To collapse morphological variants of a word into a common base form to aid in text normalization.

Method:

Stemmers typically apply a set of rewrite rules or heuristics. Common stemming algorithms
include:

●​ Lovins Stemmer (1968): A single-pass, dictionary-based stemmer.​

●​ Porter Stemmer (1980): A widely used, multi-stage rule-based stemmer that uses suffix
stripping and rule-based transformations.

Limitations:

●​ Overstemming: Reducing different words to the same stem (e.g., universe and university
→ univers).​

●​ Understemming: Failing to reduce related words to the same stem (e.g., relational and
relation may result in different stems).​

●​ Produces stems that may not be valid words (e.g., univers, organiz).​

●​ Does not handle prefixes, compound words, or irregular forms well.​

Applications:

●​ Information retrieval (e.g., matching play, playing, and played in a search query).​
●​ Text classification and clustering.

Lemmatization
Definition:

Lemmatization is the process of reducing a word to its lemma—its dictionary or canonical


form—while considering its part-of-speech and context.

Goal:

To convert different inflected forms of a word into a linguistically correct base form.

Method:

Lemmatizers rely on:

●​ Morphological analysis​

●​ Lexicons/dictionaries​

●​ Part-of-speech tagging

Advantages:

●​ Produces valid words.​

●​ Context-aware (considers grammatical role).​

●​ Handles irregular words and inflectional morphology effectively.​

Limitations:

●​ Requires linguistic resources (e.g., POS taggers, morphological databases).​

●​ Computationally heavier than stemming.​

Applications:

●​ Machine translation.​
●​ Sentiment analysis.​

●​ Knowledge extraction.​

●​ Question answering and chatbot systems.​

Role of Stemming and Lemmatization in Morphological Parsing


Morphological parsing benefits from both techniques, depending on the use case:

●​ Stemming is preferred in IR systems and search engines, where exact meaning is less
important than lexical similarity.​

●​ Lemmatization is used in linguistically intensive tasks, where accurate word forms and
meanings are necessary (e.g., machine translation, syntactic parsing).​

In more advanced systems, finite-state transducers (FSTs) and two-level morphology (e.g.,
Koskenniemi's model) are used to perform both analysis (lemmatization) and generation tasks
with formal precision.
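As a concrete illustration (assuming NLTK is installed and its WordNet data has been downloaded), the snippet below contrasts the Porter stemmer with a WordNet-based lemmatizer; note how the stemmer can produce non-words while the lemmatizer returns dictionary forms when given the part of speech.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["playing", "played", "studies", "organization", "better"]

for w in words:
    # Stemming: rule-based suffix stripping; the output may not be a valid word
    print(w, "->", stemmer.stem(w))

# Lemmatization: needs the part of speech to resolve forms correctly
print(lemmatizer.lemmatize("studies", pos="v"))   # study
print(lemmatizer.lemmatize("better", pos="a"))    # good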

4.​ What is spelling error detection and correction? Explain common techniques

Spelling Error Detection and Correction


In computer-based information systems, especially those involving text entry or automatic
recognition systems (like OCR or speech recognition), errors in typing and spelling are a
major source of variation between input strings.

Common Typing Errors (80% are single-error misspellings):

1.​ Substitution: Replacing one letter with another (e.g., cat → bat).​

2.​ Omission: Leaving out a letter (e.g., blue → bue).​

3.​ Insertion: Adding an extra letter (e.g., car → caar).​

4.​ Transposition: Switching two adjacent letters (e.g., form → from).​

5.​ Reversal errors: A specific case of transposition where letters are reversed.​

Errors from OCR and Speech Recognition:

●​ OCR (Optical Character Recognition) and similar devices introduce errors such as:​
○​ Substitution​

○​ Multiple substitutions (framing errors)​

○​ Space deletion/insertion​

○​ Character omission or duplication​

●​ Speech recognition systems process phoneme strings and attempt to match them to
known words. These errors are often phonetic in nature, leading to non-trivial
distortions of words.

Spelling Errors: Two Categories

1.​ Non-word errors: The incorrect word does not exist in the language (e.g., freind instead of
friend).​

2.​ Real-word errors: The incorrect word is a valid word, but incorrect in the given context
(e.g., their instead of there).

Spelling Correction Process

●​ Error Detection: Identifying words that are likely misspelled.​

●​ Error Correction: Suggesting valid alternatives for the detected errors.​

Two Approaches to Spelling Correction:

1.​ Isolated Error Detection and Correction:​

○​ Focuses on individual words without considering context.​

○​ Example: Spell-checker highlighting recieve and suggesting receive.​

2.​ Context-Dependent Error Detection and Correction:​

○​ Uses surrounding words to detect and correct errors.​

○​ Useful for real-word errors (e.g., correcting there to their based on sentence
meaning).

Spelling Correction Algorithms

1.​ Minimum Edit Distance:​

○​ Measures the least number of edit operations (insertions, deletions, substitutions) needed to convert one word into another (see the Python sketch after this list).
2.​ Similarity Key Techniques:​

○​ Generates a phonetic or structural key for a word and matches against similarly
keyed dictionary entries.
○​ Example: Soundex
3.​ N-gram Based Techniques:​

○​ Break words into sequences of characters (n-grams) and compare overlaps to


identify similar words.​

4.​ Neural Nets:​

○​ Use machine learning models (e.g., RNNs, Transformers) trained on large corpora
to detect and correct errors.​

○​ Can learn both spelling patterns and contextual usage.​

5.​ Rule-Based Techniques:​

○​ Apply handcrafted or learned rules about common misspellings, phonetic


confusions, and grammar patterns.​

○​ Often used in conjunction with dictionaries.
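Referenced from item 1 above, here is a minimal dynamic-programming sketch of minimum edit distance (Levenshtein distance) in Python, counting insertions, deletions, and substitutions at unit cost; the test words are illustrative.

def edit_distance(source, target):
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all remaining characters of source
    for j in range(n + 1):
        dp[0][j] = j                    # insert all remaining characters of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("freind", "friend"))    # 2
print(edit_distance("recieve", "receive"))  # 2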

5.​ Define POS tagging. Compare rule based, statistical based and hybrid approaches

Part-of-Speech Tagging
Part-of-Speech tagging is the process of assigning an appropriate grammatical category (such as
noun, verb, adjective, etc.) to each word in a given sentence. It is a fundamental task in Natural
Language Processing (NLP), which plays a crucial role in syntactic parsing, information extraction,
machine translation, and other language processing tasks.

POS tagging helps in resolving syntactic ambiguity and understanding the grammatical structure
of a sentence. Since many words in English and other natural languages can serve multiple
grammatical roles depending on the context, POS tagging is necessary to identify the correct
category for each word.

There are several approaches to POS tagging, which are broadly categorized as: (i) Rule-based
POS tagging, (ii) Stochastic POS tagging, and (iii) Hybrid POS tagging.

Rule-Based POS Tagging


Rule-based POS tagging uses a set of hand-written linguistic rules to determine the correct tag for
a word in a given context. The approach starts by assigning each word a set of possible tags
based on a lexicon. Then, contextual rules are applied to eliminate unlikely tags.

The rule-based taggers make use of rules that consider the tags of neighboring words and the
morphological structure of the word. For example, a rule might state that if a word is preceded by
a determiner and is a noun or verb, it should be tagged as a noun. Another rule might say that if a
word ends in "-ly", it is likely an adverb.

The effectiveness of this approach depends heavily on the quality and comprehensiveness of the
hand-written rules. Although rule-based taggers can be accurate for specific domains, they are
difficult to scale and maintain, especially for languages with rich morphology or free word order.

Stochastic POS Tagging

Stochastic or statistical POS tagging makes use of probabilistic models to determine the most
likely tag for a word based on its occurrence in a tagged corpus. These taggers are trained on
annotated corpora where each word has already been tagged with its correct part of speech.

In the simplest form, a unigram tagger assigns the most frequent tag to a word, based on the maximum likelihood estimate computed from the training data:

P(t | w) = f(w, t) / f(w)

where f(w, t) is the frequency of word w being tagged as t, and f(w) is the total frequency of the word w in the corpus. This approach, however, does not take into account the context in which the word appears.

To incorporate context, bigram and trigram models are used. In a bigram model, the tag assigned to a word depends on the tag of the previous word. The probability of a sequence of tags is given by:

P(t₁ t₂ ... tₙ) ≈ Πᵢ P(tᵢ | tᵢ₋₁)

The probability of the word sequence given the tag sequence is:

P(w₁ w₂ ... wₙ | t₁ t₂ ... tₙ) ≈ Πᵢ P(wᵢ | tᵢ)

Thus, the best tag sequence is the one that maximizes the product:

argmax over t₁ ... tₙ of  Πᵢ P(wᵢ | tᵢ) · P(tᵢ | tᵢ₋₁)

This is known as the Hidden Markov Model (HMM) approach to POS tagging. Since the actual tag sequence is hidden and only the word sequence is observed, the Viterbi algorithm is used to compute the most likely tag sequence.

Bayesian inference is also used in stochastic tagging. Based on Bayes' theorem, the posterior probability of a tag given a word is:

P(t | w) = P(w | t) · P(t) / P(w)

Since P(w) is constant for all tags, we can choose the tag that maximizes P(w | t) · P(t).
Statistical taggers can be trained automatically from large annotated corpora and tend to
generalize better than rule-based systems, especially in handling noisy or ambiguous data.
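To illustrate the HMM idea, the toy Viterbi decoder below uses two tags (NOUN, VERB) and made-up transition and emission probabilities (they are not estimated from any corpus); it finds the best tag sequence for a short word sequence such as ["police", "book"].

# Toy HMM POS tagger: the probabilities below are illustrative only.
tags = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("NOUN", "police"): 0.01,  ("VERB", "police"): 0.001,
        ("NOUN", "book"):   0.003, ("VERB", "book"):   0.004}

def viterbi(words):
    # best[t] = (probability, tag sequence) of the best path ending in tag t
    best = {t: (trans[("<s>", t)] * emit.get((t, words[0]), 1e-8), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            p, prev = max((best[pt][0] * trans[(pt, t)], pt) for pt in tags)
            new_best[t] = (p * emit.get((t, w), 1e-8), best[prev][1] + [t])
        best = new_best
    return max(best.values())[1]

print(viterbi(["police", "book"]))   # ['NOUN', 'VERB'] with these toy numbers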

Hybrid POS Tagging

Hybrid approaches combine rule-based and statistical methods to take advantage of the strengths
of both. One of the most popular hybrid methods is Transformation-Based Learning (TBL),
introduced by Eric Brill, commonly referred to as Brill’s Tagger.

In this approach, an initial tagging is done using a simple method, such as assigning the most
frequent tag to each word. Then, a series of transformation rules are applied to improve the
tagging. These rules are automatically learned from the training data by comparing the initial
tagging to the correct tagging and identifying patterns where the tag should be changed.

Each transformation is of the form: "Change tag A to tag B when condition C is met". For example,
a rule might say: "Change the tag from VB to NN when the word is preceded by a determiner".

The transformation rules are applied iteratively to correct errors in the tagging, and each rule is
chosen based on how many errors it corrects in the training data. This approach is robust,
interpretable, and works well across different domains.

POS tagging is an essential component in many natural language processing systems.


Rule-based taggers rely on linguistic knowledge encoded in rules, stochastic taggers use
statistical models learned from data, and hybrid taggers attempt to combine the two for better
performance. Among these, stochastic methods are particularly effective when large annotated
corpora are available, while rule-based methods are useful for low-resource languages or when
domain-specific rules are well known. Hybrid methods, especially those using
transformation-based learning, provide a good balance between accuracy and interpretability.

6.​ Write the CFG rules for a sentence and parse it using top-down parsing.
Parsing is the process of analyzing a string of symbols (typically a sentence) according to the
rules of a formal grammar. In Natural Language Processing (NLP), parsing determines the
syntactic structure of a sentence by identifying its grammatical constituents (like noun phrases,
verb phrases, etc.). It checks whether the sentence follows the grammatical rules defined by a
grammar, often a Context-Free Grammar (CFG). The result of parsing is typically a parse tree or
syntax tree, which shows how a sentence is hierarchically structured. Parsing helps in
disambiguating sentences with multiple meanings. It is essential for understanding, translation,
and information extraction. There are two main types: syntactic parsing, which focuses on
structure, and semantic parsing, which focuses on meaning. Parsing algorithms include
top-down, bottom-up, and chart parsing. Efficient parsing is crucial for developing
grammar-aware NLP applications.

Top-down Parsing
As the name suggests, top-down parsing starts its search from the root node S and works
downwards towards the leaves. The underlying assumption here is that the input can be derived
from the designated start symbol, S, of the grammar. The next step is to find all sub-trees which
can start with S. To generate the sub-trees of the second-level search, we expand the root node
using all the grammar rules with S on their left hand side. Likewise, each non-terminal symbol in
the resulting sub-trees is expanded next using the grammar rules having a matching non-terminal
symbol on their left hand side. The right hand side of the grammar rules provide the nodes to be
generated, which are then expanded recursively. As the expansion continues, the tree grows
downward and eventually reaches a state where the bottom of the tree consists only of
part-of-speech categories. At this point, all trees whose leaves do not match words in the input
sentence are rejected, leaving only trees that represent successful parses. A successful parse
corresponds to a tree which matches exactly with the words in the input sentence.
Sample grammar
●​ S → NP VP
●​ S → VP
●​ NP → Det Nominal
●​ NP → NP PP
●​ Nominal → Noun
●​ Nominal → Nominal Noun
●​ VP → Verb
●​ VP → Verb NP
●​ VP → Verb NP PP
●​ PP → Preposition NP
●​ Det → this | that | a | the
●​ Noun → book | flight | meal | money
●​ Verb → book | include | prefer
●​ Pronoun → I | he | she | me | you
●​ Preposition → from | to | on | near | through
A top-down search begins with the start symbol of the grammar. Thus, the first level (ply) of the search tree consists of a single node labelled S. The grammar above has two rules with S on their left-hand side. These rules are used to expand the tree, which gives two partial trees at the second level of the search. The third level is generated by expanding the non-terminals at the bottom of the search trees in the previous ply, and the expansion continues in this way until the leaves of the trees are part-of-speech categories. Trees whose leaves do not match the words of the input sentence are then rejected; the tree whose leaves match the input exactly is the successful parse.
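To make this concrete, here is a short sketch using NLTK's CFG and its recursive-descent (top-down) parser with a subset of the rules above. The left-recursive rules (NP → NP PP and Nominal → Nominal Noun) are omitted because a plain recursive-descent parser cannot handle left recursion, and the input sentence "book that flight" is an assumed example that the lexicon covers.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | VP
NP -> Det Nominal
Nominal -> Noun
VP -> Verb | Verb NP | Verb NP PP
PP -> Preposition NP
Det -> 'this' | 'that' | 'a' | 'the'
Noun -> 'book' | 'flight' | 'meal' | 'money'
Verb -> 'book' | 'include' | 'prefer'
Preposition -> 'from' | 'to' | 'on' | 'near' | 'through'
""")

parser = nltk.RecursiveDescentParser(grammar)   # top-down, depth-first expansion
for tree in parser.parse("book that flight".split()):
    print(tree)
# (S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))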

7.​ What is bottom-up parsing? How does it differ from top-down parsing?

Parsing is the process of analyzing a string of symbols (typically a sentence) according to the
rules of a formal grammar. In Natural Language Processing (NLP), parsing determines the
syntactic structure of a sentence by identifying its grammatical constituents (like noun phrases,
verb phrases, etc.). It checks whether the sentence follows the grammatical rules defined by a
grammar, often a Context-Free Grammar (CFG). The result of parsing is typically a parse tree or
syntax tree, which shows how a sentence is hierarchically structured. Parsing helps in
disambiguating sentences with multiple meanings. It is essential for understanding, translation,
and information extraction. There are two main types: syntactic parsing, which focuses on
structure, and semantic parsing, which focuses on meaning. Parsing algorithms include
top-down, bottom-up, and chart parsing. Efficient parsing is crucial for developing
grammar-aware NLP applications.

Bottom-up Parsing
A bottom-up parser starts with the words in the input sentence and attempts to construct a parse
tree in an upward direction towards the root. At each step, the parser looks for rules in the
grammar where the right hand side matches some of the portions in the parse tree constructed so
far, and reduces it using the left hand side of the production. The parse is considered successful if
the parser reduces the tree to the start symbol of the grammar. For example, for the sentence "Paint the door", the parser first reduces the words to their lexical categories (Verb, Det, Noun) and then applies grammar rules in reverse until it reaches S.

Each of these parsing strategies has its advantages and disadvantages. As the top-down search
starts generating trees with the start symbol of the grammar, it never wastes time exploring a tree
leading to a different root. However, it wastes considerable time exploring S trees that eventually
result in words that are inconsistent with the input. This is because a top-down parser generates
trees before seeing the input. On the other hand, a bottom-up parser never explores a tree that
does not match the input. However, it wastes time generating trees that have no chance of leading
to an S-rooted tree. A branch of the search space that explores a sub-tree assuming paint is a noun is an example of such wasted effort. We now present a basic search
strategy that uses the top-down method to generate trees and augments it with bottom-up
constraints to filter bad parses.
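The bottom-up idea can be sketched as a naive reducer in Python: it repeatedly replaces the first substring that matches the right-hand side of a rule with the rule's left-hand side until only S remains. The grammar and lexicon below are a toy extension of the "Paint the door" example (an assumption, not the book's grammar), and the reducer has no backtracking, which is acceptable only because this toy grammar is unambiguous.

grammar_rules = [
    ("S",    ["VP"]),
    ("VP",   ["Verb", "NP"]),
    ("NP",   ["Det", "Noun"]),
    ("Verb", ["paint"]),
    ("Det",  ["the"]),
    ("Noun", ["door"]),
]

def bottom_up_parse(tokens):
    symbols = list(tokens)
    while True:
        for lhs, rhs in grammar_rules:
            n = len(rhs)
            for i in range(len(symbols) - n + 1):
                if symbols[i:i + n] == rhs:
                    print(symbols, "-> reduce", rhs, "to", lhs)
                    symbols[i:i + n] = [lhs]
                    break
            else:
                continue     # this rule did not apply; try the next one
            break            # a reduction happened; rescan the rules
        else:
            break            # no rule applied anywhere; stop
    return symbols == ["S"]

print(bottom_up_parse(["paint", "the", "door"]))   # True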

8.​ Explain the CYK(Cocke-Younger-Kasami) parsing algorithm with a suitable example

Parsing is the process of analyzing a string of symbols (typically a sentence) according to the
rules of a formal grammar. In Natural Language Processing (NLP), parsing determines the
syntactic structure of a sentence by identifying its grammatical constituents (like noun phrases,
verb phrases, etc.). It checks whether the sentence follows the grammatical rules defined by a
grammar, often a Context-Free Grammar (CFG). The result of parsing is typically a parse tree or
syntax tree, which shows how a sentence is hierarchically structured. Parsing helps in
disambiguating sentences with multiple meanings. It is essential for understanding, translation,
and information extraction. There are two main types: syntactic parsing, which focuses on
structure, and semantic parsing, which focuses on meaning. Parsing algorithms include
top-down, bottom-up, and chart parsing. Efficient parsing is crucial for developing
grammar-aware NLP applications.

The CYK Parser


Like the Earley algorithm, the CYK (Cocke–Younger–Kasami) is a dynamic programming parsing
algorithm. However, it follows a bottom-up approach in parsing. It builds a parse tree
incrementally. Each entry in the table is based on previous entries. The process is iterated until
the entire sentence has been parsed. The CYK parsing algorithm assumes the grammar to be in
Chomsky Normal Form (CNF). A CFG is in CNF if all the rules are of only two forms:

A → BC​
A → w, where w is a word.

The algorithm first builds parse trees of length one by considering all rules which could produce
words in the sentence being parsed. Then, it finds the most probable parse for all the constituents
of length two. The parse of shorter constituents constructed in earlier iterations can now be used
in constructing the parse of longer constituents.
A ⇒* wᵢⱼ (i.e., A derives the sub-string of the input from position i to position j) if, for some k with i ≤ k < j:
1. A → B C is a rule in the grammar,
2. B ⇒* wᵢₖ, and
3. C ⇒* w₍ₖ₊₁₎ⱼ

For a sub-string wᵢⱼ starting at position i, the algorithm considers all possible ways of breaking it into two parts wᵢₖ and w₍ₖ₊₁₎ⱼ. Finally, we have to verify that S ⇒* w₁ₙ, i.e., that the start symbol of the grammar derives the entire string w₁ₙ.

CYK ALGORITHM
Let w = w₁ w₂ w₃ ... wₙ

// Initialization step
for i := 1 to n do
    for all rules A → wᵢ in the grammar do
        chart[i, i] := chart[i, i] ∪ {A}

// Recursive step
for length := 2 to n do
    for i := 1 to n - length + 1 do
    begin
        j := i + length - 1
        chart[i, j] := ∅
        for k := i to j - 1 do
            chart[i, j] := chart[i, j] ∪ { A | A → BC is a production and
                                               B ∈ chart[i, k] and C ∈ chart[k+1, j] }
    end

if S ∈ chart[1, n] then accept else reject
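Since the question asks for an example, here is a small runnable Python sketch of the same recognizer. The CNF grammar and the sentence "the dog saw a cat" are illustrative assumptions, not taken from the text.

# CYK recognizer for a grammar in Chomsky Normal Form (CNF).
unary = {          # A -> w rules
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}
binary = [         # A -> B C rules
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("VP", "V", "NP"),
]

def cyk(words):
    n = len(words)
    # chart[i][j] = set of non-terminals deriving words[i..j] (0-based, inclusive)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] = set(unary.get(w, set()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (a, b, c) in binary:
                    if b in chart[i][k] and c in chart[k + 1][j]:
                        chart[i][j].add(a)
    return "S" in chart[0][n - 1]

print(cyk("the dog saw a cat".split()))   # True
print(cyk("dog the saw".split()))         # False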


Module 3: Naive Bayes, Text Classification and Sentiment Analysis

1.​ Explain The Naive Bayes Classification algorithm in detail.

Naive Bayes Classification Algorithm


One of the core goals in natural language processing is to build systems that can understand,
categorize, and respond to human language. Text classification, also called text categorization, is
the task of assigning a predefined label or class to a text segment. Examples include identifying
whether a review is positive or negative, whether a document is about sports or politics, or whether an
email is spam or not spam.

Such classification tasks are typically framed as supervised machine learning problems, where we
are given a set of labeled examples (training data) and must build a model that generalizes to unseen
examples. These tasks rely on representing text in a numerical feature space and applying statistical
models to predict classes.

One of the most commonly used classifiers in NLP is the Naive Bayes classifier, a probabilistic
model that applies Bayes’ theorem under strong independence assumptions. Despite its simplicity, it
is robust and surprisingly effective in many domains including spam filtering, sentiment analysis,
and language identification. Naive Bayes is categorized as a generative model, because it models
the joint distribution of inputs and classes to "generate" data points, in contrast with discriminative
models that directly estimate the class boundary.

Naive Bayes Classifiers


Text Documents as Bags of Words

Before applying any classifier, text must be converted into a format suitable for machine learning
algorithms. In Naive Bayes, we represent a document as a bag of words (BoW), which treats the
document as an unordered collection of words, discarding grammar and word order. This assumption
simplifies modeling, reducing a complex structured input into a feature vector.

Let a document d be represented as a set of features (f₁, f₂, ..., fₙ), where each fᵢ is a feature (usually a word or term). These features could be binary (word presence), count-based (word frequency), or even TF-IDF weighted.
2.​ How do you train a Naive Bayes Classifier? Illustrate with an example
3.​ Describe the use of Naive Bayes in Sentimental Analysis
4.​ What is add1(Laplace) smoothing? Why is it used?
5.​ Explain the text classification pipeline using Naive Bayes.

Text Classification Pipeline using Naive Bayes


Text classification is the task of assigning a category label to a given text document. Naive Bayes is a
popular algorithm for this purpose because it is simple, efficient, and performs well on
high-dimensional data like text.

The text classification pipeline using Naive Bayes typically consists of the following steps:

1. Data Collection
Gather labeled training data where each document is already assigned a class (e.g., spam or not
spam, positive or negative).

2. Text Preprocessing
Convert raw text into a clean format for analysis:

●​ Tokenization: Split text into individual words or tokens.​

●​ Lowercasing: Convert all text to lowercase.​

●​ Stopword Removal: Remove common words like “the”, “is”, “and”.​

●​ Stemming/Lemmatization (optional): Reduce words to their root form.​

3. Feature Extraction (Vectorization)


Transform text into numerical features using the Bag-of-Words (BoW) model:

●​ Each document is represented as a vector of word counts.


●​ The vocabulary is built from all unique words in the training set.
4. Model Training
Estimate the prior probability P(c) of each class from the proportion of training documents in that class, and the likelihood P(w | c) of each word given the class from the word counts, typically with Laplace (add-1) smoothing to avoid zero probabilities.

5. Prediction
For a new document, compute the score of each class as its prior multiplied by the likelihoods of the document's words (usually summed in log space) and assign the class with the highest score.

6. Evaluation
Evaluate the classifier using metrics like:

●​ Accuracy​

●​ Precision, Recall, F1-Score​

●​ Use cross-validation to ensure robustness.
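The whole pipeline can be sketched end to end with scikit-learn, assuming it is available; the toy documents, labels, and test sentences below are hypothetical:

# Bag-of-words vectorization + Multinomial Naive Bayes with add-1 smoothing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free prize now", "meeting at noon", "win money free", "project review meeting"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB(alpha=1.0))
model.fit(docs, labels)

print(model.predict(["free money prize"]))       # expected: ['spam']
print(model.predict(["review at the meeting"]))  # expected: ['ham']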

6.​ Compare Naive Bayes in language modeling vs classification.

Comparison of Naive Bayes in Language Modeling and Classification
Naive Bayes is a probabilistic algorithm that applies Bayes' Theorem with the assumption of
conditional independence among features. It is used in two major contexts within Natural Language
Processing: classification and language modeling. Though the core mathematical principles are the
same, the purposes and interpretations in these two cases differ significantly.

Naive Bayes in Text Classification


In text classification, the goal is to assign a document or sentence to a predefined class or category.
The Naive Bayes classifier uses labeled training data to learn the prior probability of each class and
the likelihood of each word given the class. When a new document is given, it computes the
posterior probability for each class and selects the class with the highest probability.

Mathematically, the classification rule is:

ĉ = argmax over c ∈ C of  P(c) × ∏ᵢ P(wᵢ | c)

where P(c) is the prior probability of class c and P(wᵢ | c) is the likelihood of the i-th word of the document given that class.
This approach is typically used in supervised learning tasks like spam detection, sentiment analysis,
or topic categorization, where the classifier is trained on documents that are already labeled.

Naive Bayes in Language Modeling


In contrast, when used as a language model, Naive Bayes helps compute the probability of a
sentence or sequence of words under a specific class-based language model. Instead of making a
classification decision, the model estimates how likely a sentence is to be generated by a particular
class. Each class has its own unigram language model, which assigns a probability to each word.

This is purely a generative model where the sentence is generated from the class model, and the
focus is on calculating likelihood rather than making a classification.

Such language modeling is useful in applications like document ranking, spell correction, machine
translation, or even language identification, where we need to evaluate which class (or language) is
most likely to produce the given sentence.
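As a small illustration (with hypothetical probabilities), each class can be treated as a unigram language model that scores a sentence by the sum of the log probabilities of its words:

import math

# Toy unigram probabilities per class; unseen words would need smoothing in practice
models = {
    "sports":   {"match": 0.04, "team": 0.05, "won": 0.03},
    "politics": {"match": 0.001, "team": 0.002, "won": 0.01},
}

def sentence_log_likelihood(words, unigram):
    # log P(w1 ... wn | class) = sum of log P(wi | class) under the class unigram model
    return sum(math.log(unigram.get(w, 1e-6)) for w in words)

sentence = ["team", "won", "match"]
scores = {c: sentence_log_likelihood(sentence, m) for c, m in models.items()}
print(max(scores, key=scores.get))   # -> 'sports'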

7.​ Discuss the challenges of sentiment classification using supervised learning

Challenges of Sentiment Classification Using Supervised Learning
Sentiment classification is the task of automatically identifying the emotional tone (e.g.,
positive, negative, or neutral) expressed in text. Supervised learning is a common approach
where models are trained on labeled datasets. Despite its popularity, sentiment classification
using supervised learning poses several linguistic, statistical, and practical challenges.

1. Subjectivity and Ambiguity in Language


Language used in expressing sentiment is often subjective, nuanced, and
context-dependent. The same word or phrase can express different sentiments depending on
usage.

●​ Example: “The plot was predictable” (negative), but “Predictable in a good way”
(positive).​

●​ Words like “unbelievable” can be positive (“unbelievable performance”) or negative


(“unbelievable failure”).​

This ambiguity makes it hard for a classifier to generalize without deep contextual
understanding.

2. Presence of Sarcasm and Irony


Sarcasm in text reverses the literal sentiment. Models trained on surface-level features often
fail to detect this.

●​ Example: “Great! Another bug in the software.” is sarcastic, but appears positive to a
naive classifier.​

●​ Sarcasm detection typically requires pragmatic cues, which are hard to model in
traditional supervised settings.​

3. Domain Dependence
Supervised sentiment models are domain-specific. A model trained on movie reviews may
perform poorly on product reviews or political opinions.

●​ Sentiment expressions vary across domains. For instance, the word “unpredictable” is
positive in movie reviews but negative in car reviews.​

●​ Domain adaptation is difficult due to vocabulary shifts and changing sentiment cues.​
4. Data Sparsity and Imbalanced Classes
Many datasets have an uneven distribution of sentiment classes. Neutral or majority classes
dominate, while minority classes (e.g., strong negatives) are underrepresented.

●​ This biases the model toward predicting the majority class.​

●​ Also, certain sentiments (like anger or sarcasm) may have few examples, making it
difficult for the model to learn meaningful patterns.​

5. Dependence on High-Quality Labeled Data


Supervised learning requires large volumes of accurately labeled data, which is expensive
and time-consuming to create.

●​ Human annotations can be inconsistent due to subjective interpretations.​

●​ Labeling complex sentences with mixed sentiment or irony is particularly challenging.​

6. Handling Informal Language and Noise


In social media and real-world data, text is often noisy:

●​ Misspellings, emojis, abbreviations (e.g., “lol”, “gr8”, “idk”), code-switching.​

●​ These reduce the effectiveness of standard models trained on clean data.​

Special preprocessing or robust embeddings are required to handle such variability.

7. Inability to Capture Long-Range Dependencies


Basic supervised models like Naive Bayes assume independence of features, and simple bag-of-words classifiers such as logistic regression ignore word order, so neither captures contextual or syntactic dependencies.

For instance, negation words like “not”, “never” change the polarity, but only if correctly linked
to the word they modify.​

Example: “I do not like this movie” → model must connect not with like.​
8.​ How is accuracy of a Naive Bayes sentiment classifier optimized

Optimizing the Accuracy of a Naive Bayes Sentiment Classifier
Naive Bayes is a fast and effective classifier for sentiment analysis, but its performance can be
significantly improved through various data preprocessing, feature engineering, and model-level
enhancements. Below are key strategies to optimize its accuracy in sentiment classification tasks:

Text Preprocessing
Proper preprocessing ensures that the input to the classifier is clean, normalized, and informative.

●​ Lowercasing: Convert all text to lowercase to avoid treating “Good” and “good” as different
words.​

●​ Stopword Removal: Common words (e.g., the, is, in) that carry little sentiment can be
removed.​

●​ Punctuation and Number Removal: Removing unnecessary characters reduces noise.​

●​ Stemming/Lemmatization: Reduces words to their base form (e.g., loved → love) so that
variants are treated as the same feature.

Feature Engineering and Representation


The way text is converted into numerical features directly affects model performance.

●​ Bag-of-Words (BoW): The standard approach, but improvements can be made by:​

○​ Using term frequency (TF) or TF-IDF instead of raw counts.​

○​ Removing rare or overly frequent words.​

●​ Binarization: For Naive Bayes, converting word counts to 0/1 (presence/absence) often
improves accuracy. This is called Binary Multinomial Naive Bayes.​

●​ N-grams: Using bigrams or trigrams captures short phrases like "not good", which are
important in sentiment detection.

Handling Negation
Naive Bayes fails to capture negation unless explicitly handled. A common trick is:

●​ Negation Tagging: Add a “NOT_” prefix to every word following a negation word (e.g., not,
never, didn't) until a punctuation mark is found.​

○​ Example: “didn’t like the movie” becomes “NOT_like NOT_the NOT_movie”.​


This transforms sentiment-flipping phrases into new features learned by the model.
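A minimal sketch of this negation-tagging trick is shown below; the list of negation words and the punctuation pattern are simplified assumptions:

import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}

def add_negation_tags(text):
    tagged, negating = [], False
    for token in text.lower().split():
        if re.search(r"[.,!?;]$", token):            # punctuation ends the negation scope
            tagged.append("NOT_" + token if negating else token)
            negating = False
        elif token in NEGATIONS:
            tagged.append(token)
            negating = True
        else:
            tagged.append("NOT_" + token if negating else token)
    return " ".join(tagged)

print(add_negation_tags("didn't like the movie, but loved the cast"))
# -> "didn't NOT_like NOT_the NOT_movie, but loved the cast"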

Smoothing Techniques
To prevent zero probabilities when a test word was not seen with a class during training, Laplace smoothing (add-1 smoothing) is applied:

P(w | c) = (count(w, c) + 1) / (Σ_w′ count(w′, c) + |V|)

where count(w, c) is the number of times word w occurs in documents of class c and |V| is the vocabulary size.
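As a quick worked example (with hypothetical figures): if the word “superb” occurs 3 times in the positive-class training text, the positive class contains 1,000 word tokens in total, and the vocabulary has 2,000 distinct words, then P(superb | positive) = (3 + 1) / (1,000 + 2,000) ≈ 0.0013, while a word never seen with the positive class still receives the small non-zero probability 1 / 3,000.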

Stopword and Noise Control


Though stopword removal is common, retaining some stopwords can help in sentiment analysis:

●​ Words like "not", "never", "no" are stopwords but crucial for detecting negative sentiment.​

●​ Consider using custom stopword lists based on experimentation.

Feature Selection
Remove irrelevant or misleading features using:

●​ Chi-square test, Mutual Information, or Information Gain to keep only the most
sentiment-informative words.​

●​ This reduces overfitting and speeds up training.

Using Sentiment Lexicons


Incorporate external sentiment lexicons such as:

●​ MPQA, SentiWordNet, or Opinion Lexicon to boost feature quality.​

●​ These help identify words with strong positive or negative polarity, even in small training sets.

Data Balancing
In imbalanced datasets (e.g., more positive reviews than negative), the model becomes biased.
Accuracy is improved by:

●​ Resampling (oversampling minority or undersampling majority class),​

●​ Or by using class weights in loss calculations.​

Parameter Tuning and Evaluation


Though Naive Bayes has few parameters, some, like the smoothing factor α, can be tuned using:

●​ Cross-validation on the training set to find the optimal value.​

●​ This helps in generalizing better on unseen data.​

Regular Evaluation with Proper Metrics


Accuracy alone can be misleading. Use:

●​ Precision, Recall, and F1-score, especially in imbalanced sentiment datasets.​

●​ Use a held-out validation set or cross-validation for reliable error estimation and model
selection.
Module 4:
1.​ Explain the architecture and design features of Information retrieval

Design Features in Information Retrieval


Information Retrieval (IR) systems aim to efficiently locate relevant documents or information from
large datasets. Several key design features play a crucial role in enhancing the performance,
efficiency, and relevance of such systems. These include Indexing, Stop Word Elimination,
Stemming, and understanding word distributions through Zipf’s Law.

1. Indexing

Indexing is the process of organizing data to enable rapid search and retrieval. In IR, an inverted
index is commonly used. This structure maps each term in the document collection to a list of
documents (or document IDs) where that term occurs. It typically includes additional information like
term frequency, position, and weight (e.g., TF-IDF score).​
Efficient indexing allows the system to avoid scanning all documents for every query, dramatically
reducing search time and computational cost. Index construction involves tokenizing documents,
normalizing text, and storing index entries in a sorted and optimized structure, often with compression
techniques to reduce storage requirements.
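A minimal sketch of how such an inverted index might be built and queried is shown below; the tokenizer, stop-word list, and toy documents are simplified assumptions, not part of the original text:

from collections import defaultdict

docs = {1: "AI and Robotics in Industry", 2: "AI in Healthcare", 3: "Robotics for Healthcare"}
STOP_WORDS = {"and", "in", "for", "the"}

def build_inverted_index(documents):
    index = defaultdict(dict)                  # term -> {doc_id: term frequency}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

index = build_inverted_index(docs)
print(index["robotics"])                            # {1: 1, 3: 1}
print(set(index["ai"]) & set(index["robotics"]))    # Boolean AND: documents containing both -> {1}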

2. Eliminating Stop Words

Stop words are extremely common words that appear in almost every document, such as "the", "is",
"at", "which", "on", and "and". These words usually add little value to understanding the main content
or differentiating between documents.​
Removing stop words reduces the size of the index, speeds up the search process, and minimizes
noise in results. However, careful handling is required because some stop words may be semantically
important depending on the domain (e.g., "to be or not to be" in literature, or "in" in legal texts). Most
IR systems use a predefined stop word list, though it can be customized based on corpus analysis.

3. Stemming

Stemming is a form of linguistic normalization used to reduce related words to a common base or root
form. For example:

●​ "connect", "connected", "connection", "connecting" → "connect"​

Stemming improves recall in IR systems by ensuring that different inflected or derived forms of a
word are matched to the same root term in the index. This is particularly important in languages with
rich morphology.​
Common stemming algorithms include:

●​ Porter Stemmer: Lightweight and widely used, based on heuristic rules.​

●​ Snowball Stemmer: An improvement over Porter, supporting multiple languages.​

●​ Lancaster Stemmer: More aggressive but sometimes over-stems words.​

Stemming is different from lemmatization, which uses vocabulary and grammar rules to derive the
base form.

4. Zipf’s Law

Zipf’s Law is a statistical principle that describes the frequency distribution of words in natural
language corpora. It states that the frequency f of any word is inversely proportional to its rank r:

f ∝ 1/r

This means that the most frequent word occurs roughly twice as often as the second most frequent
word, three times as often as the third, and so on.​
For example, in English corpora, words like "the", "of", "and", and "to" dominate the frequency list.
Meanwhile, the majority of words occur rarely (called the "long tail").​
In IR, Zipf’s Law justifies:

●​ Stop word elimination (high-frequency terms contribute little to relevance)​

●​ TF-IDF weighting (rare terms are more informative)​

●​ Optimizing index structures for space and search​

Understanding this law helps in designing efficient indexing and retrieval strategies that focus on the
more informative, lower-frequency words.

2.​ Compare Classical and non classical IR models with examples

Introduction to IR Models
Information Retrieval (IR) models are mathematical frameworks or algorithms designed to retrieve
relevant documents from a large corpus in response to a user’s information need, often expressed as
a query. These models serve as the backbone of search engines, recommendation systems, and
other knowledge discovery platforms. They function by transforming both the documents and queries
into a formal representation and then applying a matching or ranking function to determine relevance.
Based on foundational principles, IR models are broadly classified into Classical and Non-Classical
models. Each category employs distinct strategies for matching queries with relevant content.

1. Classical Models of Information Retrieval


Classical models are based on formal mathematical or logical foundations such as Boolean logic,
algebraic vector operations, and probabilistic estimation. These models have traditionally dominated
IR research and system implementations due to their simplicity, interpretability, and computational
efficiency.

1.1 Boolean Model


The Boolean Model is the earliest and simplest IR model based on Boolean algebra and set theory. It
represents documents and queries as sets of terms and retrieves only those documents that exactly
satisfy the logical conditions specified in the query.

●​ Document and Query Representation: Each document is encoded as a binary vector where
each term is either present (1) or absent (0). Queries are Boolean expressions using operators
like AND, OR, and NOT.​

●​ Operators:​

○​ AND: Intersection of sets — retrieves documents containing all query terms.​

○​ OR: Union of sets — retrieves documents containing at least one of the terms.​

○​ NOT: Complement of set — excludes documents containing certain terms.​

●​ Example:​
Query: AI AND Robotics​
Documents:​
D1: "AI and Robotics in Industry" → Retrieved​
D2: "AI in Healthcare" → Not Retrieved​

●​ Advantages:​

○​ Simple and fast.​

○​ Complete control over query specification.​

●​ Limitations:​

○​ No concept of partial relevance.​

○​ Results are unranked.​

○​ Users must construct complex Boolean queries.

1.2 Vector Space Model (VSM)


The Vector Space Model is an algebraic IR model that represents both documents and queries as
vectors in a high-dimensional term space. It introduces the concept of partial matching and
document ranking, thus offering more nuanced retrieval.

●​ Document Representation: Each document is a vector of term weights, usually calculated


using TF-IDF (Term Frequency-Inverse Document Frequency). Higher weight indicates higher
importance of a term in that document.​

●​ Similarity Measurement:​

○​ Cosine Similarity: Measures the cosine of the angle between document and query vectors. A higher cosine value indicates higher relevance (a small computational sketch is given at the end of this subsection).​

○​ Jaccard Coefficient: Used when binary vectors are applied; computes the intersection
over union of term sets.​

●​ Solved Example:​

○​ Query: “AI Future”​

○​ Documents:​
D1: “AI is shaping the future of technology”​
D2: “History of AI and computing”​

○​ Preprocessing: Stop-word removal and stemming.​

○​ Cosine Similarity:​
D1: 0.578 → Ranked 1​
D2: 0.000 → Ranked 2​

●​ Advantages:​

○​ Supports ranked retrieval.​

○​ Handles term importance through weighting.​

○​ Suitable for large-scale, noisy data.​
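To make the ranking step concrete, the sketch below computes cosine similarity over raw term-count vectors (a simplification; the solved example above uses TF-IDF weights, so its exact numbers differ):

import math
from collections import Counter

def cosine(a, b):
    # Cosine of the angle between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = Counter("ai future".split())
docs = {
    "D1": Counter("ai shaping future technology".split()),
    "D2": Counter("history ai computing".split()),
}
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print([(d, round(cosine(query, docs[d]), 3)) for d in ranking])   # D1 ranked above D2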

1.3 Probabilistic Model


The Probabilistic Model ranks documents based on the probability that a document is relevant to a
given query. It considers relevance as an uncertain event and attempts to estimate this uncertainty
using probabilistic inference.

●​ Key Assumptions:​

○​ Each document is either relevant or non-relevant.​

○​ The model seeks to maximize P(R | D, Q) — the probability of relevance R for a


document D given query Q.​

●​ Binary Independence Model (BIM):​

○​ Represents documents and queries as binary vectors.​


○​ Assumes independence among terms.​

○​ Calculates Retrieval Status Value (RSV) for ranking.​

●​ Advantages:​

○​ Statistically grounded.​

○​ Incorporates feedback and uncertainty.​

●​ Limitations:​

○​ Requires prior relevance judgments.​

○​ Independence assumption often violated.​

2. Non-Classical Models of Information Retrieval


Non-Classical models extend beyond similarity and probability-based approaches. They draw from
logic, semantics, and cognitive theories, offering deeper understanding of documents and more
flexible retrieval strategies. These models are more adaptive to the ambiguity and complexity of
natural language.

2.1 Information Logic Model


This model uses logical imaging, a logic-based method for determining document relevance. Rather
than relying on mere term matching, it infers the truth value of a query based on logical support from
document content.

●​ Key Concept:​
Introduces uncertainty measures to evaluate how strongly a document term supports or
contradicts a query. This is derived from van Rijsbergen’s logical inference framework.​

●​ Example:​
Sentence: “Atul is serving a dish”​
Term: “dish”​
If "dish" logically supports the truth of the sentence, then it contributes positively to retrieval.​

●​ Advantages:​

○​ Incorporates semantic inference.​

○​ Handles ambiguous or incomplete queries better.

2.2 Situational Logic Model


This model treats each document as a situation and represents knowledge in terms of infons —
basic informational units. Retrieval is seen as inferring whether the document’s situation supports the
query situation.

●​ Key Elements:​
○​ Infons are associated with polarity: 1 (true), 0 (false).​

○​ Documents support an infon if they provide evidence for its truth.​

●​ Working Mechanism:​

○​ Documents are processed structurally using logical calculus.​

○​ Retrieval involves logical matching of query infons with document infons.​

●​ Applications:​

○​ Semantic web IR.​

○​ Query expansion based on logic-based entailment.

2.3 Information Interaction Model


Proposed by Ingwersen, this model emphasizes the interactive and cognitive aspects of information
retrieval. It acknowledges that relevance is subjective and shaped by the user’s knowledge,
intentions, and context.

●​ Key Features:​

○​ Relevance is dynamic and influenced by user feedback.​

○​ Semantic transformations such as synonym expansion and concept mapping


improve matching.​

○​ Neural Networks may simulate user cognition to improve document relevance


judgments.​

●​ Applications:​

○​ Personalized search systems.​

○​ Conversational agents.​

○​ Adaptive IR systems with feedback loops.​

3.​ Discuss the cluster model, Fuzzy model and LSTM model in IR

Cluster Model in Information Retrieval


The cluster model is an approach used in Information Retrieval (IR) to reduce the number of
document matches during retrieval. The need for clustering was first highlighted by Gerard Salton.
Before discussing the cluster-based IR model, it's important to state the cluster hypothesis:

Cluster Hypothesis: Closely associated documents tend to be relevant to the same


queries.

This hypothesis suggests that related documents are likely to be retrieved together. Therefore, by
grouping related documents into clusters, the search time can be significantly reduced. Rather than
matching a query with every document in the collection, it is first matched with representative
vectors of each cluster. Only the documents from clusters whose representative is similar to the
query vector are then examined in detail.

Clustering can also be applied to terms, rather than documents. In such cases, terms are grouped
into co-occurrence classes, which are useful in dimensionality reduction and thesaurus
construction.

Clustering Based on Similarity Matrix

Let D = {d₁, d₂, ..., dₙ} be the document collection and let S be an n × n similarity matrix whose entry sᵢⱼ gives the similarity between documents dᵢ and dⱼ (for example, the cosine similarity of their term vectors). Documents whose mutual similarity exceeds a chosen threshold are placed in the same cluster, and each cluster is represented by a centroid (representative) vector against which queries are matched first.

Fuzzy Model in Information Retrieval
In the fuzzy model, a document is represented as a fuzzy set of terms, i.e., a set of pairs (t, μ(t)), where μ(t) ∈ [0, 1] is the degree of membership of term t in the document, usually derived from normalized term weights. Queries are evaluated with fuzzy set operations: the AND of query terms corresponds to the minimum of their membership degrees and the OR to the maximum, and documents are ranked by the resulting degree of match.
Long Short-Term Memory (LSTM) Model
1. Introduction
The Long Short-Term Memory (LSTM) model is a type of Recurrent Neural Network (RNN)
designed to effectively learn long-range dependencies in sequential data. It was introduced to
address the limitations of standard RNNs, such as the vanishing and exploding gradient
problems, which make them ineffective in modeling long-term dependencies.

LSTMs are widely used in natural language processing (NLP) tasks, particularly in Machine
Translation, speech recognition, and information retrieval, where understanding the context
across sequences is essential.

2. Structure of LSTM
An LSTM network consists of a chain of repeating modules, where each module contains four key
components:

1.​ Forget Gate​

2.​ Input Gate​

3.​ Cell State (Memory Cell)​

4.​ Output Gate​

Each gate is a neural network layer that controls how information flows through the network.

2.1 Forget Gate


This gate decides what information to discard from the previous cell state. It uses a sigmoid activation function to output values between 0 and 1, where 0 means “forget completely” and 1 means “keep completely”.

2.2 Input Gate

This gate decides which new information from the current input should be written to the cell state, using a sigmoid layer combined with a tanh layer that proposes candidate values.

2.3 Cell State (Memory Cell)

The cell state acts as the long-term memory of the network. At each time step it is updated by forgetting part of the old state (forget gate) and adding the new candidate information (input gate).

2.4 Output Gate

This gate decides which part of the updated cell state is exposed as the hidden state output at the current time step.
3. Advantages of LSTM in NLP and MT
●​ Handles long-term dependencies: Ideal for capturing the context in long sentences or
documents.​

●​ Resistant to vanishing gradients: The gating mechanisms preserve gradients across many
time steps.​

●​ Flexible input/output: Supports various sequence configurations (e.g.,


sequence-to-sequence).

4. LSTM in Machine Translation


In Machine Translation, LSTM is often used within the Encoder–Decoder architecture:

●​ The encoder LSTM reads the source language sentence and encodes it into a context vector.​

●​ The decoder LSTM uses this context to generate the target language sentence, one word at a
time.​

This architecture works well in sequence-to-sequence tasks and has been a standard approach
before the introduction of Transformer models.

5. LSTM in Information Retrieval


In the context of Information Retrieval (IR), LSTM can be used to model queries and documents as
sequences. Applications include:

●​ Query understanding: Capturing semantic meaning of user queries.​

●​ Contextual document ranking: Ranking results based on the semantic relationship between
query and document.​

●​ Session-based retrieval: Tracking user interactions across multiple queries.​

6. Limitations of LSTM
●​ Training complexity: LSTM networks are computationally intensive.​

●​ Long training times: Due to their sequential nature, LSTMs cannot be easily parallelized.​

●​ Superseded in some areas: Transformers have largely replaced LSTMs in state-of-the-art


NLP tasks due to better scalability and performance.

4.​ What are the Key challenges in information Retrieval

Major Challenges in Information Retrieval (IR)
Information Retrieval (IR) is the science of locating relevant information from vast, mostly unstructured
collections of text. While advances in IR have revolutionized access to information, several persistent
challenges continue to affect the accuracy, efficiency, and user satisfaction of IR systems. These
challenges stem from linguistic ambiguity, computational limitations, user behavior, and ethical
considerations.

1. Relevance of Results
One of the core challenges in IR is identifying which documents are genuinely relevant to a user’s
query. Relevance is subjective and context-dependent, making it difficult to model computationally.

●​ Example: The query “jaguar” may refer to a car brand, an animal, or a sports team.​

●​ Problem: Users often express queries vaguely or imprecisely.​

●​ Impact: Systems may retrieve either too few or too many irrelevant results.

2. Vocabulary Mismatch
Often, users and documents refer to the same concept using different words.

●​ Example: A user searching for “automobile” may not retrieve documents containing only the
word “car.”​

●​ Solutions:​

○​ Stemming/Lemmatization: Reduces words to their base form.​

○​ Query Expansion: Adds synonyms and related terms using tools like WordNet or
embeddings.​

3. Word Sense Disambiguation (WSD)


WSD is essential to resolve semantic ambiguity in terms.

●​ Challenge: Words often have multiple meanings depending on the context.​

●​ Example: The term “python” can mean a snake or a programming language.​

●​ Need: IR systems must infer the correct sense from limited query context.​

5.​ Explain the structure and use of wordnet

WordNet: Structure and Use


Introduction to WordNet
WordNet is a large lexical database for the English language, developed under the direction of
George A. Miller at Princeton University. It is inspired by psycholinguistic theories and aims to model
the way humans organize and use words semantically and cognitively. WordNet combines
dictionary-style definitions with semantic relationships between words, making it a valuable
resource for both Natural Language Processing (NLP) and Information Retrieval (IR) systems.

Structure of WordNet
WordNet organizes words into sets of synonyms known as synsets, each representing one distinct
concept or sense. These synsets are interlinked by various lexical and semantic relations, which
allows WordNet to model complex linguistic relationships.

1. Synsets (Synonym Sets)


●​ A synset is a set of words (lemmas) that share the same meaning or concept.​
●​ Each sense of a word belongs to a different synset.​

●​ A word may appear in multiple synsets, depending on how many senses it has.​

Example:​
The word “read” can appear in different synsets depending on its meaning:

●​ read (interpret written text)​

●​ read (study something)​

●​ read (have certain content, e.g., "The sign reads STOP")​

Each synset includes:

●​ A set of synonyms (lemmas),​

●​ A gloss (brief definition), and​

●​ One or more example sentences.​
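As a quick illustration of synsets, glosses, and examples, the snippet below queries WordNet through NLTK (assuming nltk and its wordnet corpus are installed, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# Print the first few verb senses of "read" with their lemmas, glosses, and examples
for synset in wn.synsets("read", pos=wn.VERB)[:3]:
    print(synset.name(), "-", synset.definition())
    print("  lemmas:", [l.name() for l in synset.lemmas()])
    print("  examples:", synset.examples())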

2. Parts of Speech Organization


WordNet is divided into four main lexical categories:

●​ Nouns​

●​ Verbs​

●​ Adjectives​

●​ Adverbs​

Separate databases exist for each category, though some relations may cross between categories
(e.g., derivationally related forms).
3. Semantic Relations
Synsets are linked to one another through semantic and lexical relations such as hypernymy/hyponymy (is-a hierarchies), meronymy/holonymy (part–whole relations), antonymy, and, for verbs, troponymy and entailment. These relations form the conceptual hierarchies (e.g., poodle → dog → animal) exploited by the applications described below.

4. Example of a WordNet Entry: “read”
WordNet lists 1 noun sense and 11 verb senses for “read.”​
Each verb sense includes a synset, definition, and example usage:

●​ read.v.01: Interpret something written or printed.​


“Have you read the instructions?”​

●​ read.v.04: Interpret a meaning in a particular way.​


“She read the silence as criticism.”

Uses of WordNet in NLP and IR


WordNet serves as a semantic backbone for multiple language processing tasks:

1. Word Sense Disambiguation (WSD)


●​ Helps determine the correct sense of a word based on context.​

●​ Uses gloss and synset structure to match the intended meaning.​

Example:​
In “The python bit him,” WordNet helps determine if “python” means a snake or a
programming language.

2. Text Categorization and Document Structuring


●​ Synsets and hypernym trees enable semantic classification of documents.​

●​ Supports topic categorization using conceptual hierarchies.​

Example:​
A document mentioning "poodle" and "bulldog" can be categorized under "dog" →
"animal".

3. Information Retrieval (IR)


●​ Query Expansion: Adds synonyms or related terms from synsets to improve retrieval.​

Example:​
Query for “car” can be expanded to include “automobile” and “vehicle.”

●​ Relevance Improvement: Matching documents that use different but semantically related
terms.​

4. Text Summarization
●​ WordNet is used to form lexical chains to identify salient words or concepts.​

●​ Helps group related ideas and condense them into summaries.

5. Machine Translation and Question Answering


●​ Enables mapping words across languages based on conceptual meaning.​
●​ In QA systems, helps relate question terms to relevant answer terms using synsets and
hypernyms.​

6.​ What are the Framenet and Stemmers? Explain their usage in NLP tasks.

FrameNet and Stemmers in NLP


Natural Language Processing (NLP) relies on lexical resources and preprocessing tools to analyze,
interpret, and extract meaning from human language. Two important components in this domain are
FrameNet and Stemmers. FrameNet contributes to the semantic analysis of language, while
stemmers assist in text normalization during preprocessing.

1. FrameNet
1.1 Introduction
FrameNet is a lexical database of English based on the theory of frame semantics, developed by
Charles J. Fillmore. It annotates English sentences with semantic frames that represent common
situations or events and the roles (participants) involved in them.

1.2 Structure of FrameNet


●​ Each frame is a conceptual structure that describes a particular type of event, relation, or
object.​

●​ A lexical unit (a word in a specific sense) evokes a frame.​

●​ Frame Elements are semantic roles that participants in a frame play.​

●​ Frames are defined for verbs, nouns, adjectives, and adverbs.​

Example: The word "nab" evokes the ARREST frame.​


Frame elements:

●​ AUTHORITIES (agent performing the arrest)​

●​ SUSPECT (person being arrested)​

Sentence: “The police nabbed the snatcher.”​


Here, “police” → AUTHORITIES, “snatcher” → SUSPECT.

1.3 Types of Frame Elements


●​ Core Frame Elements: Essential roles, like AGENT, PATIENT, THEME.​

●​ Non-core Elements: Optional roles like TIME, MANNER, LOCATION.​


1.4 Example: COMMUNICATION Frame
●​ Lexical Unit: “say”, “tell”, “announce”​

●​ Core Roles:​

○​ Communicator: The speaker or sender of a message.​

○​ Message: The content communicated.​

○​ Addressee: The receiver of the message.​

●​ Non-core Roles: Medium, Time, Manner, etc.​

1.5 Applications of FrameNet in NLP


FrameNet is used in several high-level NLP applications:

a) Semantic Role Labeling (SRL)

●​ Identifies predicate-argument structures and labels them with frame elements.​

●​ Helps understand who did what to whom, when, where, and how.​

b) Question Answering

●​ Frame semantics helps identify the relevant parts of a sentence to extract accurate answers.​

Example:​
Q: “Who sent the packet to Khushbu?”​
A: “Khushbu received a packet from the examination cell.”​
Using the TRANSFER frame, roles like SENDER and RECIPIENT are matched.

c) Machine Translation

●​ Frames abstract over surface form and enable meaning-preserving translation.​

d) Text Summarization

●​ Semantic roles guide sentence selection based on key frame elements.​

e) Information Retrieval

●​ Retrieves documents by matching frames and roles, not just surface words.​
2. Stemmers
2.1 Introduction
Stemming is the process of reducing inflected or derived words to their base or root form, known as
the stem. It is a fundamental preprocessing step in NLP and IR to standardize word forms, improving
matching and reducing vocabulary size.

Example:​
“connect”, “connected”, “connection” → “connect”

Stemming differs from lemmatization as it doesn’t ensure the output is a valid dictionary word—it just
reduces the word to a common base form.

2.2 Common Stemming Algorithms


a) Porter Stemmer (1980)

●​ Most widely used for English.​

●​ Uses rule-based suffix stripping.​

●​ Effective for search and indexing.​

b) Lovins Stemmer (1968)

●​ One of the earliest stemmers.​

●​ Uses longest-match suffix removal.​

c) Paice/Husk Stemmer (1990)

●​ Iterative approach with high flexibility and aggressiveness.​
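As a brief illustration (assuming NLTK is installed), the Porter and Snowball stemmers can be applied as follows:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["connect", "connected", "connection", "connecting", "astronauts"]
print([porter.stem(w) for w in words])     # e.g. ['connect', 'connect', 'connect', 'connect', 'astronaut']
print([snowball.stem(w) for w in words])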

2.3 Language-Specific Stemmers


a) European Languages

●​ Snowball stemmer supports: English, French, Spanish, Portuguese, German, Hungarian, etc.​

b) Indian Languages

●​ Hindi and Bengali stemming research by Ramanathan & Rao (2003) and Majumder et al.
(2007).​

●​ Use handcrafted suffix lists or cluster-based root detection.​


2.4 Applications of Stemming in NLP
a) Information Retrieval (IR)

●​ Stemming reduces word variants, improving recall in search results.​

●​ E.g., a search for “astronaut” also retrieves “astronauts”.​

b) Text Categorization

●​ Frequency counts of stems help classify documents by topic.​

c) Text Summarization

●​ Common stems are used to detect key ideas across sentences.​

d) Query Expansion

●​ Stems allow broader matching across similar words.

2.5 Limitations of Stemming


●​ May over-stem: Different words may collapse into the same root.​

●​ May under-stem: Related forms might be left as separate.​

●​ Stemming is language-dependent and needs careful tuning.​

7.​ Describe the types of POS taggers used in lexical analysis

Types of POS Taggers Used in Lexical Analysis
Part-of-Speech (POS) tagging is a fundamental step in lexical analysis and Natural Language
Processing (NLP). It involves assigning a grammatical category (such as noun, verb, adjective, etc.)
to each word in a given text, based on its context. POS tagging enables higher-level tasks such as
parsing, information extraction, machine translation, and sentiment analysis.

Different types of POS taggers are employed based on the underlying algorithmic approach. These
taggers vary in complexity, accuracy, and their reliance on linguistic or statistical resources.
1. Rule-Based POS Taggers
Rule-based POS taggers rely on a set of handcrafted grammatical rules and lexicons to determine
the POS tags of words.

Working:
●​ They use a dictionary that lists words and their possible tags.​

●​ Contextual rules (syntactic/linguistic) are applied to disambiguate the correct tag.​

●​ Rules may consider surrounding words, suffixes, prefixes, or sentence structure.​

Example:
If a word follows a determiner and is not a verb, it is likely a noun (e.g., “the cat”).

Advantages:
●​ Linguistically interpretable.​

●​ No need for large annotated corpora.​

Disadvantages:
●​ Difficult to maintain and scale.​

●​ Limited coverage for ambiguous or unseen data.

2. Stochastic (Statistical) POS Taggers


These taggers use probabilistic models and learn from annotated corpora. They predict the most
probable POS tag based on statistics derived from large text datasets.

Types:

a) Unigram Tagger
●​ Assigns the most frequent tag for each word based on training data.​

●​ Simple and fast but ignores context.​

b) Bigram and Trigram Taggers


●​ Consider the previous one or two tags to predict the current word’s tag.​

●​ Capture contextual dependencies better than unigrams.​


c) Hidden Markov Model (HMM) Tagger
●​ Models POS tagging as a sequence prediction task.​

●​ Uses transition probabilities (tag-to-tag) and emission probabilities (word-to-tag).​

Advantages:
●​ Adaptable to real-world data.​

●​ Higher accuracy than rule-based systems for large corpora.​

Disadvantages:
●​ Require labeled training data.​

●​ Struggle with out-of-vocabulary (OOV) words.
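As a small illustration of stochastic tagging (assuming NLTK and its treebank sample are installed via nltk.download('treebank')), a unigram tagger can be combined with a bigram tagger using backoff:

import nltk
from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()
train, test = tagged_sents[:3000], tagged_sents[3000:]

unigram = nltk.UnigramTagger(train)                   # most frequent tag per word
bigram = nltk.BigramTagger(train, backoff=unigram)    # uses the previous tag, falls back to unigram

print(round(bigram.accuracy(test), 3))                # use .evaluate() on older NLTK versions
print(bigram.tag("the cat sat on the mat".split()))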

3. Transformation-Based Taggers (Brill Taggers)


Transformation-based taggers combine rule-based and statistical techniques.

Working:
●​ Initially assign tags using a simple method (e.g., unigram tagging).​

●​ Apply transformation rules iteratively to correct errors.​

Key Features:
●​ Rules are automatically learned from training data.​

●​ Rules are ordered and applied only if specific patterns are matched.​

Advantages:
●​ Interpretable like rule-based systems.​

●​ Adaptive like statistical models.​

Disadvantages:
●​ Slower than pure statistical taggers.​

●​ Requires labeled training data.​

4. Machine Learning-Based Taggers


These taggers treat POS tagging as a supervised learning problem and use classifiers trained on
annotated corpora.

Common Approaches:

a) Decision Trees
Predict tags by learning rules from feature splits based on context.

b) Maximum Entropy Models


Estimate probabilities using features without assuming independence (unlike HMMs).

c) Support Vector Machines (SVM)


Classify based on high-dimensional feature space, often with rich linguistic features.

d) Neural Network-Based Taggers


●​ Use deep learning models like Recurrent Neural Networks (RNNs), LSTMs, or
transformers.​

●​ Model long-range dependencies and complex linguistic patterns.​

Advantages:
●​ High accuracy on diverse and large datasets.​

●​ Handle ambiguity and unseen data effectively.​

Disadvantages:
●​ Require significant computation and data.​

●​ Less interpretable.​

5. Hybrid Taggers
Hybrid taggers combine two or more techniques to balance accuracy, speed, and robustness.
Examples:
●​ Combining a rule-based system with a statistical backoff.​

●​ Using HMM tagging followed by neural correction.​

Use Case:
Especially useful in low-resource languages or domain-specific corpora where combining
resources improves coverage and accuracy.

8.​ What are research corpora? Discuss their types and use in NLP experiments

Research Corpora
1. Introduction to Research Corpora
A research corpus (plural: corpora) is a large and structured set of textual or speech data,
systematically collected and used for linguistic research and Natural Language Processing (NLP)
tasks. It acts as the empirical foundation for designing, training, testing, and evaluating NLP models.

Corpora are essential in enabling data-driven approaches in computational linguistics. They offer
real-world linguistic evidence for studying patterns, building models, and conducting experiments in
areas such as part-of-speech tagging, syntactic parsing, named entity recognition, sentiment analysis,
and machine translation.

2. Characteristics of a Good Research Corpus


An ideal research corpus has:

●​ Diversity: Covers multiple genres, domains, and linguistic phenomena.​

●​ Representativeness: Reflects actual language use.​

●​ Annotation: Includes linguistic labels such as POS tags, parse trees, or semantic roles.​

●​ Standardization: Follows agreed annotation schemes (e.g., Penn Treebank tags).​

●​ Accessibility: Public availability or license for research purposes.

3. Types of Research Corpora


Research corpora can be classified based on structure, modality, annotation level, or application
domain. The major types include:
3.1 Raw Corpus (Unannotated Corpus)
●​ Contains only raw text without any linguistic annotations.​

●​ Used in tasks like language modeling, unsupervised learning, or frequency analysis.​

Examples:

●​ Wikipedia dump​

●​ News articles​

●​ Web-crawled corpora

3.2 Annotated Corpus


These corpora include linguistic annotations, either manual or automatically generated. They are
further divided based on the annotation level:

a) Morphologically Annotated Corpora

●​ Includes annotations like part-of-speech (POS) tags, lemmas, and morphological features.​

Example:

●​ Brown Corpus​

●​ Penn Treebank (POS layer)​

b) Syntactically Annotated Corpora (Treebanks)

●​ Annotated with constituency or dependency structures.​

●​ Useful for parsing, chunking, and grammar induction.​

Examples:

●​ Penn Treebank (constituency)​

●​ Universal Dependencies (UD) (dependency-based)​

c) Semantically Annotated Corpora

●​ Include annotations for named entities, semantic roles, word senses, etc.​

Examples:

●​ SemCor (word sense annotations using WordNet)​


●​ FrameNet corpus (frame semantics)​

●​ PropBank (predicate-argument structure)​

d) Discourse-Annotated Corpora

●​ Capture coherence relations, discourse markers, and coreference chains.​

Examples:

●​ RST Discourse Treebank​

●​ OntoNotes​

3.3 Speech Corpora


●​ Contain audio recordings and often their corresponding transcriptions, along with phonetic
or prosodic annotations.​

●​ Used for automatic speech recognition (ASR), text-to-speech (TTS), and speaker
identification.​

Examples:

●​ TIMIT​

●​ LibriSpeech​

●​ Switchboard Corpus

3.4 Multilingual and Parallel Corpora


●​ Include the same content in multiple languages, aligned at the sentence or phrase level.​

●​ Used for machine translation, cross-lingual retrieval, and multilingual NLP.​

Examples:

●​ Europarl Corpus​

●​ OpenSubtitles​

●​ UN Corpus

3.5 Specialized Domain Corpora


Designed for specific fields such as law, medicine, or social media.

Examples:

●​ GENIA (biomedical texts)​

●​ MedLine abstracts​

●​ Twitter Corpus for sentiment analysis

4. Uses of Research Corpora in NLP Experiments


4.1 Training Machine Learning Models
●​ Supervised models (e.g., for POS tagging or NER) need annotated corpora as training data.​

4.2 Evaluation and Benchmarking


●​ Corpora provide test sets to measure accuracy, F1-score, and other metrics.​

●​ Help compare different NLP systems under standardized conditions.​

4.3 Knowledge Extraction


●​ Unannotated corpora are mined to derive lexicons, word embeddings, or collocation
patterns.​

4.4 Linguistic Analysis


●​ Study frequency, grammar, and usage patterns.​

●​ Identify language change or dialectal variation.​

4.5 Pretraining Language Models


●​ Large unannotated corpora like Wikipedia, Common Crawl, or BooksCorpus are used to
pretrain models like BERT, GPT, etc.​

4.6 Data Augmentation and Synthesis


●​ Used to generate synthetic datasets through translation, paraphrasing, or entity
replacement.

5. Challenges in Building and Using Corpora


●​ Annotation cost and consistency: Manual annotation is time-consuming and may vary
across annotators.​

●​ Domain mismatch: Training on one domain (e.g., news) may not generalize to another (e.g.,
tweets).​

●​ Privacy and ethics: User-generated content needs anonymization.​

●​ Multilingual scarcity: Many languages lack high-quality annotated corpora.


Module 5: Machine Translation

1.​ Define Machine Translation. List and explain its types and applications

Machine Translation (MT)


1. Introduction
Machine Translation (MT) is a subfield of Computational Linguistics and Artificial Intelligence that
focuses on the automatic translation of text or speech from one natural language to another using
computer systems. The objective of MT is to produce translations that are both semantically accurate
and syntactically correct.

MT systems aim to analyze the structure and meaning of the source language and generate an
equivalent expression in the target language. This process involves understanding grammar,
vocabulary, context, and even cultural nuances.

2. Types of Machine Translation


Over time, various approaches to machine translation have been developed. Each type has its own
methodology, strengths, and limitations. The main types are as follows:

2.1 Rule-Based Machine Translation (RBMT)


Rule-Based Machine Translation operates using a set of linguistic rules and bilingual dictionaries. The
system processes the grammatical structure of the source sentence, performs morphological and
syntactic analysis, and applies pre-defined translation rules to generate the target sentence.

RBMT is highly dependent on the quality and coverage of its rule sets and dictionaries. It provides
more interpretable results but requires extensive human effort to create and maintain the linguistic
resources for each language pair.

2.2 Statistical Machine Translation (SMT)


Statistical Machine Translation relies on statistical models derived from large bilingual corpora. It
translates text based on the probability of a target sentence given a source sentence. Phrase-based
models are commonly used, which translate segments of text rather than word-by-word.

SMT systems require large amounts of parallel data to train effectively. While they can generalize well
from data, they often struggle with grammar and context, leading to translations that may sound
unnatural.

2.3 Example-Based Machine Translation (EBMT)


Example-Based Machine Translation uses a database of previously translated sentence pairs. When a
new sentence is input, the system searches for similar sentences in the database and uses parts of
their translations to construct the output.

This approach is based on the concept of translation by analogy. Its performance depends on the
quantity and quality of stored examples. EBMT systems work well in domains with repetitive or
formulaic language.

2.4 Neural Machine Translation (NMT)


Neural Machine Translation uses deep learning techniques, particularly sequence-to-sequence models
and transformer architectures, to translate entire sentences in a single integrated process. These
systems are capable of learning complex patterns and contextual relationships between words.

NMT systems produce translations that are generally more fluent and contextually appropriate than
previous methods. They require substantial computational resources and large datasets but represent
the current state-of-the-art in machine translation.

2.5 Hybrid Machine Translation


Hybrid Machine Translation combines elements of different translation approaches, such as RBMT with
SMT or SMT with NMT. This integration aims to leverage the strengths of each method to enhance
overall translation quality.

Hybrid systems may use rules for handling certain linguistic constructs while relying on statistical or
neural components for general translation. They are often employed in specialized domains where
accuracy and control are crucial.

3. Applications of Machine Translation


Machine Translation has numerous applications across a wide range of industries and everyday life. It
plays a vital role in facilitating global communication, information access, and commercial transactions.

One of the most prominent applications is in cross-lingual communication, where MT systems are used
in real-time messaging and voice translation tools to bridge language barriers. These systems are
widely adopted in social media platforms, customer service chatbots, and mobile translation apps.

In the domain of content localization, machine translation helps adapt websites, software interfaces,
and multimedia content to different languages and cultures. This is particularly useful for multinational
companies seeking to reach international audiences.

Another key area of application is in customer support, where automated translation enables
companies to provide assistance in multiple languages without requiring a large multilingual staff. This
enhances efficiency and improves user satisfaction.

Machine translation is also used in e-commerce to translate product descriptions, user reviews, and
transaction messages, thereby improving the shopping experience for users around the world.

In the field of education and research, MT tools help students and scholars access academic materials
published in foreign languages, broadening the scope of knowledge and collaboration.

The media and journalism industries benefit from machine translation by distributing news content
across linguistic regions rapidly and efficiently. Similarly, the travel and tourism industry relies on
translation tools to help tourists navigate foreign environments.

2.​ What are language divergences? Give examples in translation

Language Divergences in Machine Translation


Definition:

Language divergences refer to systematic differences in structure, grammar, or usage between


two languages that cause mismatches or challenges in translation. These divergences can occur at
various linguistic levels and often require more than a word-for-word substitution for accurate
translation.

When translating between languages, especially those that are typologically different (e.g., English and
Hindi, English and Japanese), divergences must be handled carefully to maintain meaning, fluency,
and grammatical correctness.

Types of Language Divergences with Examples


1. Lexical Divergence
This occurs when a word in the source language does not have a direct one-to-one equivalent in the
target language.

●​ Example:​

○​ English: He missed the train.​

○​ Hindi: वह ट्रे न छूट गया।​


(Literal: The train left him.)​

Here, "missed" doesn't translate directly; Hindi uses a construction meaning "the train left."

2. Syntactic Divergence
This involves differences in sentence structure or word order between languages.

●​ Example:​

○​ English: I eat apples.​

○​ Japanese: 私はリンゴを食べます。​
(Literal: I apples eat.)​

Japanese uses Subject–Object–Verb (SOV) order, unlike English's Subject–Verb–Object (SVO) order.

3. Morphological Divergence
Occurs when the languages differ in how they express grammatical information (e.g., tense,
number, case) through word forms.

●​ Example:​

○​ English: They ran.​

○​ Chinese: 他们跑了。​
(Pinyin: Tāmen pǎo le.)​

Chinese does not inflect verbs for tense like English does; it uses aspect markers (like 了) instead.

4. Categorical Divergence
Happens when the same idea is expressed using different grammatical categories in the two
languages.

●​ Example:​

○​ English: I am hungry. (adjective)​

○​ French: J’ai faim. (Literal: I have hunger. — noun)​

Here, "hungry" is an adjective in English, while the equivalent concept is expressed as a noun in
French.

5. Structural Divergence
Arises when an extra phrase or clause is needed in the target language to preserve the original
meaning.

●​ Example:​

○​ English: He entered the room.​

○​ Spanish: Él entró en la habitación.​


(Literal: He entered into the room.)​

Spanish requires a prepositional phrase “en la” that has no direct English counterpart in this context.

6. Idiomatic Divergence
Involves idioms or expressions that cannot be translated literally without losing meaning.

●​ Example:​

○​ English: Kick the bucket. (means "to die")​

○​ German: Den Löffel abgeben. (Literal: To give up the spoon.)​


Both are idioms meaning "to die," but direct translations would confuse the meaning.

Importance in Machine Translation:


Language divergences pose significant challenges for machine translation systems, especially those
based on word alignment or surface forms. Rule-based and statistical systems often struggle with
these, while Neural Machine Translation (NMT) systems handle them better due to their
context-awareness and ability to learn patterns across whole sentences.

3.​ Explain encoder-decoder architecture used in Neural Machine Translation

Encoder–Decoder Architecture in Neural Machine Translation
1. Introduction
The Encoder–Decoder Architecture is a foundational framework used in Neural Machine
Translation (NMT). It enables the translation of a sentence from a source language to a target
language by learning an end-to-end mapping through neural networks.

This architecture handles variable-length sequences and allows the system to model complex
relationships between words, making it suitable for capturing contextual and semantic information
necessary for accurate translation.

2. Components of Encoder–Decoder Architecture


The architecture consists of two primary components:

2.1 Encoder
The encoder is responsible for processing the input sentence (in the source language). It reads the
entire sentence word by word and converts it into a fixed-length context vector (also called a thought
vector or hidden representation).

●​ Typically implemented using Recurrent Neural Networks (RNNs), Long Short-Term Memory
(LSTM), or Gated Recurrent Units (GRUs).​

●​ In modern systems, Transformer encoders are used for parallel processing and better context
handling.​

Function:​
The encoder learns to represent the entire input sentence in a compressed vector form that captures
the overall meaning.

2.2 Decoder
The decoder takes the context vector produced by the encoder and generates the translated
sentence (in the target language) word by word.

●​ Like the encoder, the decoder is usually an RNN, LSTM, GRU, or Transformer model.​

●​ At each time step, the decoder predicts the next word in the sequence, using:​

○​ The context vector​

○​ The previously generated words​

Function:​
The decoder learns to generate fluent and grammatically correct output based on the encoded source
sentence.

3. Working Mechanism
The basic steps are as follows:

●​ The input sentence is tokenized and fed into the encoder.​

●​ The encoder processes each token and outputs a context vector summarizing the
entire sentence.​

●​ The decoder uses this vector to generate the first word of the target sentence.​

●​ This predicted word is then fed back into the decoder to predict the next word.​

●​ This process continues until the decoder outputs an end-of-sentence token (<EOS>).
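As a schematic illustration (not from the original text), the sketch below implements a bare-bones LSTM encoder–decoder with greedy decoding in PyTorch, assuming PyTorch is available; the dimensions, vocabulary handling, and special-token IDs are hypothetical choices, not the architecture of any specific system:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len) of token ids
        outputs, (h, c) = self.rnn(self.embed(src))
        return h, c                               # context: final hidden and cell states

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, prev_token, h, c):          # prev_token: (batch, 1)
        output, (h, c) = self.rnn(self.embed(prev_token), (h, c))
        return self.out(output), h, c             # logits over the target vocabulary

def greedy_translate(encoder, decoder, src, sos_id, eos_id, max_len=30):
    """Encode the source sentence, then generate target tokens one at a time."""
    h, c = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    result = []
    for _ in range(max_len):
        logits, h, c = decoder(token, h, c)
        token = logits.argmax(dim=-1)             # pick the most probable next word
        result.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(result, dim=1)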

4. Limitations of Basic Encoder–Decoder Architecture


●​ Fixed-Length Bottleneck: Compressing an entire sentence into a single vector can lead to
information loss, especially for long or complex sentences.​

●​ Lack of Focus: The model has no mechanism to focus on relevant parts of the input while
generating each word.​

To address these issues, the Attention Mechanism was introduced.

5. Encoder–Decoder with Attention (Enhanced Version)


The attention mechanism allows the decoder to look at different parts of the input sentence at
each time step, instead of relying solely on a fixed-length vector.
●​ It creates a dynamic context vector for each output word.​

●​ This improves performance, especially for long or syntactically complex sentences.​

●​ Widely used in Transformer models and modern NMT systems like Google Translate
and DeepL.​

6. Transformer Architecture (Modern Implementation)


Introduced by Vaswani et al. (2017), the Transformer is a specialized encoder–decoder architecture
that replaces recurrence with self-attention mechanisms, allowing for parallel processing and better
long-range dependency modeling.

●​ Transformer Encoder: Consists of multiple self-attention and feedforward layers.​

●​ Transformer Decoder: Uses masked self-attention and cross-attention to generate output.​

The Transformer has become the standard architecture for modern NMT systems.

4.​ Discuss the challenges and techniques in low-resource machine translation section

Low-Resource Machine Translation


1. Introduction
Low-resource machine translation refers to the development of MT systems for language pairs
with limited parallel corpora, linguistic resources, or technological infrastructure. While major
languages like English, Spanish, and Chinese benefit from large datasets and robust tools, many
languages lack sufficient data, posing unique challenges for building accurate and reliable translation
systems.

These low-resource settings are common among indigenous, regional, and less commonly taught
languages, making this an important area for linguistic diversity and digital inclusion.

2. Challenges in Low-Resource Machine Translation


2.1 Lack of Parallel Corpora
The most critical challenge is the absence of large bilingual datasets, which are essential for training
data-driven models such as Statistical Machine Translation (SMT) and Neural Machine Translation
(NMT).

2.2 Poor Quality or Noisy Data


Available data for low-resource languages is often incomplete, noisy, or inconsistently aligned,
leading to poor model performance.

2.3 Morphological Complexity


Many low-resource languages have rich morphology, meaning a single word may represent complex
grammatical information. This increases vocabulary size and sparsity issues in training.

2.4 Lack of Pretrained Models


Pretrained language models (e.g., BERT, GPT) are typically unavailable or underdeveloped for
low-resource languages, limiting transfer learning opportunities.

2.5 Limited Linguistic Resources


Many low-resource languages lack tools such as POS taggers, parsers, and dictionaries, which
hinders the development of rule-based or hybrid systems.

2.6 Dialectal and Script Variation


Some languages exhibit multiple dialects or non-standardized scripts, complicating data collection
and preprocessing.

3. Techniques to Address Low-Resource MT


To overcome the above challenges, researchers and practitioners have developed several effective
techniques:

3.1 Transfer Learning


Leverages knowledge from high-resource language pairs by training a model on related languages
and then fine-tuning it for the low-resource pair. This approach is particularly effective when the
languages are linguistically similar.

3.2 Multilingual Neural Machine Translation (MNMT)


A single model is trained to handle multiple language pairs, allowing shared representations across
languages. This enables knowledge transfer even if direct parallel data is not available for the target
language.

3.3 Back-Translation
Involves training a model to translate from the target language to the source language, then using it
to generate synthetic parallel data from monolingual target-language text. This synthetic data
augments the training set of the forward translation model.

3.4 Data Augmentation


Includes techniques such as:
●​ Synonym replacement​

●​ Word reordering​

●​ Sentence paraphrasing​

These methods artificially increase the volume and diversity of training data.

3.5 Pivot Translation


When direct translation between two languages is not feasible due to lack of data, a third language
(pivot language) is used. For example, to translate from Nepali to French, the system might first
translate Nepali to English, then English to French.
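A minimal sketch of the pivoting logic is given below; the two translation functions are hypothetical stand-ins for whatever Nepali–English and English–French systems are actually available.

# Hypothetical stand-ins for two direction-specific MT systems; in practice these
# would call real Nepali->English and English->French models or APIs.
def translate_ne_to_en(text: str) -> str:
    return f"<English translation of: {text}>"

def translate_en_to_fr(text: str) -> str:
    return f"<French translation of: {text}>"

def pivot_translate(nepali_sentence: str) -> str:
    """Translate Nepali -> French using English as the pivot language."""
    english = translate_ne_to_en(nepali_sentence)   # step 1: source -> pivot
    return translate_en_to_fr(english)              # step 2: pivot -> target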

3.6 Unsupervised and Semi-Supervised Learning


Unsupervised MT uses only monolingual corpora and no parallel data, relying on shared embeddings
and language modeling. Semi-supervised approaches combine small parallel corpora with large
monolingual data.

3.7 Cross-Lingual Word Embeddings


These map words from different languages into a shared vector space, allowing a model to
generalize translation knowledge from high-resource to low-resource languages.

3.8 Community and Crowdsourced Data Collection


Encouraging native speakers and linguists to contribute translations and linguistic annotations can
help create new datasets for underrepresented languages.

5.​ How is machine translation evaluated? Discuss BLEU and METEOR scores

Evaluation of Machine Translation


1. Introduction
Evaluating machine translation (MT) systems is a crucial step in determining the quality, accuracy,
and fluency of the translated output. Unlike other NLP tasks with clearly defined answers, translation
can be highly subjective, as multiple correct translations may exist for a single source sentence.

To address this, both automatic and human evaluation methods are used.

2. Types of Evaluation Methods


2.1 Human Evaluation
Involves linguists or bilingual speakers rating translations based on:

●​ Adequacy (Does it preserve meaning?)​

●​ Fluency (Is it grammatically and stylistically natural?)​

●​ Faithfulness (Is it true to the source text?)​

Human evaluation is accurate but time-consuming, costly, and not scalable.

2.2 Automatic Evaluation


Uses algorithms to compare machine-generated translations with one or more reference
translations created by humans.

Advantages:

●​ Fast and scalable​

●​ Objective and repeatable​

Common automatic metrics include BLEU, METEOR, TER, and ChrF. Among these, BLEU and
METEOR are the most widely used.

3. BLEU Score (Bilingual Evaluation Understudy)


3.1 Definition
BLEU is one of the earliest and most popular automatic metrics for MT evaluation. It measures the
overlap of n-grams (typically 1-gram to 4-gram) between the machine translation and one or more
reference translations.

3.2 Working Mechanism


●​ Counts how many n-grams in the candidate translation also appear in the reference
translation(s).​

●​ Applies a brevity penalty if the machine translation is too short.​

●​ Produces a score between 0 and 1, where 1 means a perfect match with the reference.​

3.3 Formula Overview


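The standard BLEU formulation combines the modified n-gram precisions with a brevity penalty:

\text{BLEU} = BP \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}

where p_n is the modified n-gram precision, w_n is the weight for each n-gram order (typically w_n = 1/N with N = 4), c is the candidate length, and r is the effective reference length.

As a small, hedged illustration of computing BLEU in practice, the snippet below uses NLTK's sentence-level implementation; the token lists are invented examples, and smoothing is applied because short sentences often have zero higher-order n-gram matches.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # candidate (system) tokens

# Default weights (0.25, 0.25, 0.25, 0.25) weight 1- to 4-gram precision equally.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")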
3.4 Strengths
●​ Language-independent​

●​ Correlates well with human judgment for high-quality systems​

3.5 Limitations
●​ Ignores synonymy and paraphrasing​

●​ Does not consider word order beyond n-gram level​

●​ Penalizes legitimate variations in translation​

4. METEOR Score (Metric for Evaluation of Translation with Explicit ORdering)
4.1 Definition
METEOR is an automatic metric developed to address BLEU's limitations. It aligns machine and
reference translations by considering exact matches, stem matches, synonym matches, and
paraphrase matches.

4.2 Working Mechanism


●​ Computes precision and recall based on unigram matches.​

●​ Uses F-measure to combine them, with more weight on recall.​

●​ Applies penalties for word order differences (fragmentation penalty).​

4.3 Formula Overview


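In the original METEOR formulation (Banerjee and Lavie, 2005), unigram precision P and recall R are combined into a recall-weighted F-measure, and a fragmentation penalty accounts for word-order differences:

F_{mean} = \frac{10\,P\,R}{R + 9\,P}, \qquad \text{Penalty} = 0.5 \left( \frac{\#\text{chunks}}{\#\text{matched unigrams}} \right)^{3}, \qquad \text{METEOR} = F_{mean}\,(1 - \text{Penalty})

where a "chunk" is a maximal run of matched unigrams that are adjacent and in the same order in both the candidate and the reference.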
4.4 Strengths
●​ Accounts for synonymy and morphological variants​

●​ Better correlates with human judgments, especially for adequacy​

●​ Considers word order and alignment​

4.5 Limitations
●​ More computationally intensive than BLEU​

●​ Language-specific resources (e.g., WordNet) are needed for synonym matching

6.​ Explain bias and ethical issues associated with MT systems

Bias and Ethical Issues in Machine Translation
1. Introduction
Machine Translation (MT) systems are widely used to break language barriers in global
communication, education, business, and government services. While these systems have made
tremendous progress, especially with the advent of Neural Machine Translation (NMT), they are not
free from bias and ethical concerns.

Bias in MT can affect the fairness, accuracy, and trustworthiness of translations, while ethical
issues raise questions about accountability, transparency, and inclusivity. Addressing these
challenges is critical to ensuring that MT systems serve all users equitably and responsibly.

2. Types of Bias in Machine Translation


2.1 Gender Bias
MT systems often default to gender stereotypes, especially when translating from gender-neutral
languages (like Turkish or Finnish) into gendered ones (like English or Spanish).
●​ Example:​

○​ Turkish: O bir doktor.​

○​ MT Output (English): He is a doctor.​

○​ Issue: The system assumes "doctor" is male, reinforcing stereotypes.​

2.2 Cultural Bias


Translations may reflect the cultural norms or assumptions of the data used to train the model, often
favoring majority or Western cultures and underrepresenting others.

●​ Example:​

○​ Religious or political terms may be translated in ways that reflect one ideology or
worldview.​

2.3 Linguistic Bias


Languages spoken by larger populations or more economically developed regions are often better
supported, while low-resource or indigenous languages are neglected or inaccurately translated.

●​ Example:​

○​ High-quality translations exist for English-French, but poor or non-existent support for
languages like Quechua or Xhosa.​

2.4 Representation Bias


MT systems trained on biased data may reproduce and amplify existing social inequalities, including
race, nationality, or religion.

●​ Example:​

○​ Automatically translating certain nationalities with negative adjectives if that pattern appears in training data.​

3. Ethical Issues in Machine Translation


3.1 Lack of Transparency
Many MT systems, especially neural models, are black boxes. Users and developers may not
understand how translation decisions are made, making it hard to audit or correct errors and biases.

3.2 Accountability and Responsibility


If an MT system provides an offensive, misleading, or incorrect translation, it may result in legal or
reputational consequences. However, it is often unclear who is accountable — the developers, the
users, or the data providers.

3.3 Privacy and Data Usage


MT systems that use cloud-based translation services may process sensitive or personal data
without user consent, raising concerns about data privacy and security.

3.4 Misinformation and Miscommunication


In critical domains like healthcare, law, or diplomacy, mistranslations can lead to serious
consequences, such as legal misinterpretation, medical errors, or diplomatic misunderstandings.

3.5 Marginalization of Minority Languages


The dominance of high-resource languages in MT can lead to the neglect or extinction of minority
languages, further marginalizing already vulnerable communities.

4. Mitigation Strategies
To address bias and ethical concerns, developers and researchers can implement the following
strategies:

●​ Diversify training data to include multiple perspectives, genders, and dialects.​

●​ Evaluate translations for bias using controlled test sets.​

●​ Enable user feedback mechanisms to report and correct biased or harmful translations.​

●​ Increase support for low-resource languages through community collaboration and open
data initiatives.​

●​ Use explainable AI techniques to make MT systems more transparent and accountable.​

●​ Apply fairness audits regularly to assess the performance of systems across different
languages and groups.

7.​ Compare traditional rule-based,statistical and neural approaches to MT

Comparison of Rule-Based, Statistical, and Neural Machine Translation Approaches
1. Rule-Based Machine Translation (RBMT)
Rule-Based Machine Translation is the earliest approach to automated translation. It relies on
comprehensive sets of linguistic rules and bilingual dictionaries to translate text from a source
language to a target language. The process typically involves three stages: analysis, transfer, and
generation.

In the analysis phase, the source sentence is parsed according to its grammatical structure. The
transfer phase applies rules to convert the syntactic and lexical elements of the source language into
the target language. Finally, the generation phase reconstructs the sentence in the grammatical format
of the target language.

RBMT offers high interpretability and control over translation output. However, it requires intensive
manual effort to develop language-specific rules and resources. It also struggles with ambiguity,
idiomatic expressions, and scalability to new languages.

2. Statistical Machine Translation (SMT)


Statistical Machine Translation introduced a data-driven approach to translation. Instead of relying on
hand-crafted rules, SMT uses large bilingual corpora to learn translation patterns. It operates on the
principle of maximizing the probability of a target sentence given a source sentence, using statistical
models.
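Formally, this is the noisy-channel formulation: for a source sentence f, the system searches for the target sentence e with the highest probability,

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\,P(e)

where P(f \mid e) is the translation model estimated from parallel data and P(e) is the target-language model estimated from monolingual data.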

The typical SMT system includes phrase-based models that translate segments of words rather than
individual words. Word alignments, language models, and phrase tables are used to construct
translations based on statistical likelihood.

This approach significantly reduces the need for linguistic expertise and is easier to scale. However,
SMT tends to produce translations that are grammatically awkward or contextually inaccurate,
especially for complex sentences. It also struggles with capturing long-range dependencies and
maintaining sentence coherence.

3. Neural Machine Translation (NMT)


Neural Machine Translation is the most recent and advanced method. It employs deep learning
techniques, specifically neural networks, to perform translation in an end-to-end fashion. The core
architecture is typically an encoder–decoder model, where the encoder converts the input sentence
into a fixed or variable-length vector, and the decoder generates the translated output.

Modern NMT systems use attention mechanisms and transformer models to improve context handling
and translation quality. These models are capable of learning semantic relationships and producing
fluent, human-like translations.

NMT offers significant improvements over previous approaches in terms of fluency, accuracy, and the
ability to model complex language patterns. However, it requires large amounts of training data and
computational power. Additionally, NMT systems are often less interpretable and may generate
incorrect or biased translations if not properly trained.

4. Conclusion
Rule-Based Machine Translation emphasizes linguistic rules and provides control but lacks scalability
and adaptability. Statistical Machine Translation introduces automation through data but often
sacrifices fluency and semantic depth. Neural Machine Translation, by leveraging deep learning,
achieves high levels of translation quality and context awareness, though at the cost of requiring more
data and computational resources.
Each approach represents a key stage in the evolution of machine translation technology and
contributes to the continuous advancement of language processing systems.

8.​ What are some low resource techniques used to improve MT in low-resource conditions?

Techniques to Improve Machine Translation in Low-Resource Conditions
1. Introduction
Low-resource machine translation refers to the task of building MT systems for language pairs that lack
sufficient parallel corpora or linguistic resources. In such scenarios, traditional data-hungry methods
like standard neural machine translation perform poorly due to insufficient training data. To address this
challenge, researchers and developers employ various specialized techniques to enhance translation
quality.

2. Transfer Learning
Transfer learning involves training an MT model on a high-resource language pair and then
fine-tuning it on a related low-resource pair. This is especially effective when the languages share
linguistic characteristics such as syntax or vocabulary. Knowledge gained from high-resource data
helps improve performance in the low-resource setting.

3. Multilingual Neural Machine Translation (MNMT)


Multilingual NMT trains a single model on multiple language pairs simultaneously, allowing the
model to learn shared representations across languages. This approach enables knowledge transfer
from high-resource languages to low-resource ones, even if direct parallel data is limited or
unavailable.

4. Back-Translation
Back-translation is a widely used data augmentation technique. It involves the following steps:

●​ A reverse translation model is trained to translate from the target to the source language.​

●​ Monolingual data in the target language is translated back into the source language.​

●​ The resulting synthetic parallel data is combined with real parallel data to train the forward
translation model.​

Back-translation significantly improves fluency and adequacy, especially in low-resource settings where target-language monolingual data is more readily available.
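The data side of this procedure is simple enough to sketch. The snippet below is an illustrative outline in which the reverse (target-to-source) translator is passed in as an ordinary function; it builds the augmented training set on which the forward model is then trained.

def build_backtranslated_corpus(reverse_translate, target_monolingual, real_parallel):
    # reverse_translate:  callable mapping a target-language sentence to a synthetic
    #                     source-language sentence (the target->source model)
    # target_monolingual: list of target-language sentences
    # real_parallel:      list of (source, target) sentence pairs
    synthetic_pairs = [(reverse_translate(tgt), tgt) for tgt in target_monolingual]
    return real_parallel + synthetic_pairs   # train the forward model on this mixture

# Illustrative usage with a dummy reverse translator:
augmented = build_backtranslated_corpus(
    lambda s: f"<synthetic source for: {s}>",
    ["target sentence 1", "target sentence 2"],
    [("real source", "real target")],
)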
5. Pivot-Based Translation
In pivot translation, a third language (pivot language) is used as an intermediary when no direct
parallel corpus exists between the source and target languages. The source text is first translated into
the pivot language, and then into the target language. This technique is useful when both the source
and target languages are well-connected to a high-resource language like English.

6. Unsupervised Machine Translation


Unsupervised MT eliminates the need for parallel corpora entirely. It relies solely on monolingual
corpora from both the source and target languages. The system learns to translate by aligning the
latent representations of both languages and using denoising autoencoders, language modeling, and
iterative back-translation.

Though still a developing area, unsupervised MT has shown promising results for certain language
pairs.

7. Data Augmentation
Data augmentation techniques artificially increase the amount and diversity of training data. Common
methods include:

●​ Synonym replacement: Replacing words with their synonyms.​

●​ Word order variation: Changing the order of words to generate paraphrases.​

●​ Sentence shuffling or recombination: Creating new examples by combining fragments from existing sentences.​

These techniques improve the model’s robustness and generalization capability.
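For example, a very small synonym-replacement augmenter might look like the sketch below; the synonym table is a toy placeholder, and a real system would draw on a thesaurus such as WordNet or on embedding neighbours.

import random

SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}   # toy lexicon

def augment(sentence: str, p: float = 0.3) -> str:
    """Randomly replace known words with one of their synonyms."""
    words = sentence.split()
    out = [random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < p else w
           for w in words]
    return " ".join(out)

print(augment("the quick dog is happy"))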

8. Use of Cross-Lingual Embeddings


Cross-lingual word embeddings map words from different languages into a shared vector space,
allowing models to transfer learned representations from high-resource to low-resource languages.
These embeddings help the model understand semantic similarities across languages even with
minimal data.
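One common recipe (used, for instance, in supervised variants of systems like MUSE) learns an orthogonal mapping between the two embedding spaces from a small seed dictionary; the closed-form Procrustes solution is a few lines of NumPy, sketched below.

import numpy as np

def procrustes_mapping(X, Y):
    # X: (n, d) embeddings of n seed-dictionary words in the low-resource language
    # Y: (n, d) embeddings of their translations in the high-resource language
    # Returns the orthogonal matrix W minimising ||X @ W - Y||.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# After learning W, any low-resource word vector v can be compared with
# high-resource vectors as v @ W (e.g. via cosine similarity).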

9. Crowdsourcing and Community Involvement


Human-in-the-loop methods, including crowdsourcing and community contributions, are employed to
create parallel corpora or validate machine-generated translations. This is particularly effective for
endangered or indigenous languages where digital resources are scarce.

10. Linguistic Rule Integration


For extremely low-resource languages, incorporating linguistic knowledge, such as morphological
analyzers, part-of-speech taggers, or grammar rules, can enhance performance. Hybrid models that
blend rule-based and neural approaches are sometimes used to leverage this knowledge effectively.
