Natural Language Processing
Course Objectives:
1. To introduce the fundamental concepts and ideas in Natural Language Processing (NLP)
2. To introduce some of the problems and solutions of NLP and their relation to linguistics and
statistics
3. To provide an understanding of the algorithms available for the processing of linguistic
information and the underlying computational properties of natural languages
4. To study and compare various NLP algorithms and design modelling techniques
Course Outcomes:
After successful completion of the course the student will be able to
1. Describe the underlying concepts of Natural Language, Language Model Evaluation,
Morphological Models and Issues and Challenges in finding the structure of a word and
documents. (K2)
2. Explain Parsing Natural Language and Multilingual Issues in Syntax Analysis. (K3)
3. Explain syntactic structure and language-specific modelling problems. (K3)
4. Formulate various predicate techniques and analyze discourse processing. (K4)
5. Analyze various language modeling techniques. (K4)
UNIT-I (11Hrs)
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological
Models. Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches,
Performances of the Approaches.
UNIT-II (08Hrs)
Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation
of Syntactic Structure, Parsing Algorithms, Models for Ambiguity Resolution in Parsing, Multilingual Issues.
UNIT-III (09Hrs)
Semantic Parsing: Introduction, Semantic Interpretation, System Paradigms, Word Sense Systems,
Software.
UNIT-IV (13Hrs)
Predicate-Argument Structure, Meaning Representation Systems, Software.
Discourse Processing: Cohesion, Reference Resolution, Discourse Cohesion and Structure
UNIT-V (09Hrs)
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Parameter Estimation,
Language Model Adaptation, Types of Language Models, Language-Specific Modeling Problems, Multilingual
and Cross-Lingual Language Modeling.
Text Books:
1. Multilingual Natural Language Processing Applications: From Theory to Practice – Daniel M.
Bikel and Imed Zitouni, Pearson Publication
2. Natural Language Processing and Information Retrieval: Tanveer Siddiqui, U. S. Tiwary
Reference Books:
1. Speech and Language Processing - Daniel Jurafsky & James H. Martin, Pearson Publications
UNIT-I
Finding the Structure of Words: Words and Their Components, Issues and Challenges,
Morphological Models. Finding the Structure of Documents: Introduction, Methods, Complexity of
the Approaches, Performances of the Approaches.
NLP divides broadly into Natural Language Understanding (NLU) and Natural Language Generation (NLG). Its goals include:
Computational models of human language processing: programs that operate internally the way humans do.
Computational models of human communication: programs that interact like humans.
Computational systems that efficiently process text and speech.
Components of NLP
There are two components of NLP, Natural Language Understanding (NLU) and Natural
Language Generation (NLG).
Natural Language Understanding (NLU) involves transforming human language into a
machine-readable format.
It helps the machine understand and analyze human language by extracting information such as
keywords, emotions, relations, and semantics from large bodies of text.
Natural Language Generation (NLG) acts as a translator that converts the computerized data into
natural language representation.
It mainly involves Text planning, Sentence planning, and Text realization.
NLU is harder than NLG.
NLP Terminology
Phonology: the study of organizing sounds systematically.
Morphology: the study of the formation and internal structure of words.
Morpheme: the primitive unit of meaning in a language.
Syntax: the study of the formation and internal structure of sentences.
Semantics: the study of the meaning of sentences.
Pragmatics: deals with using and understanding sentences in different situations and how the
interpretation of the sentence is affected.
Discourse: deals with how the immediately preceding sentence can affect the interpretation of
the next sentence.
World Knowledge: includes general knowledge about the world.
Steps in NLP
There are five steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis–
The first phase of NLP is lexical analysis.
This phase scans the input text as a stream of characters and converts it into meaningful
lexemes.
It divides the whole text into paragraphs, sentences, and words.
Syntactic Analysis (Parsing)–
Syntactic Analysis is used to check grammar, word arrangements, and shows the relationship
among the words.
A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
Semantic Analysis–
Semantic analysis is concerned with the meaning representation.
It mainly focuses on the literal meaning of words, phrases, and sentences.
The semantic analyzer disregards sentences such as “hot ice-cream”.
Discourse Integration–
The interpretation of a sentence depends upon the sentences that precede it and can also invoke
the meaning of the sentences that follow it.
Pragmatic Analysis–
During this phase, what was said is re-interpreted based on what it actually meant.
It involves deriving those aspects of language which require real world knowledge.
Example: "Open the door" is interpreted as a request rather than an order.
Here, first we explore how to identify words of distinct types in human languages, and how the
internal structure of words can be modelled in connection with the grammatical properties and
lexical concepts the words should represent.
The discovery of word structure is morphological parsing.
In many languages, words are delimited in the orthography by whitespace and punctuation.
But in many other languages, the writing system leaves it up to the reader to tell words apart or
determine their exact phonological forms.
Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punctuation
(the marks, such as full stop, comma, and brackets)
Example: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from syntax, we notice two interesting cases here: the
words newspaper and won’t.
Being a compound word, newspaper has an interesting derivational structure.
In writing, newspaper and the associated concept is distinguished from the isolated news and
paper.
For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or tokens, each
of which has its independent role and can be reverted to its normalized form.
The structure of won’t could be parsed as will followed by not.
In English, this kind of tokenization and normalization may apply to just a limited set of cases,
but in other languages, these phenomena have to be treated in a less trivial manner.
In Arabic or Hebrew, certain tokens are concatenated in writing with the preceding or the
following ones, possibly changing their forms as well.
The underlying lexical or syntactic units are thereby blurred into one compact string of letters
and no longer appear as distinct words.
Tokens behaving in this way can be found in various languages and are often called clitics.
In the writing systems of Chinese, Japanese, and Thai, whitespace is not used to separate words.
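As a rough illustration of this kind of tokenization and normalization, here is a minimal Python sketch; the contraction table is a small hand-written assumption, and a real tokenizer would need a much richer lexicon and rules:

    import re

    # Hypothetical expansion table for a few English clitics/contractions.
    CONTRACTIONS = {
        "won't": ["will", "not"],
        "didn't": ["did", "not"],
    }

    def tokenize(text):
        # Split into word-like tokens (keeping apostrophes) and basic punctuation,
        # then normalize known contractions into their syntactic tokens.
        tokens = []
        for raw in re.findall(r"[\w']+|[.,!?;]", text):
            tokens.extend(CONTRACTIONS.get(raw, [raw]))
        return tokens

    print(tokenize("Will you read it? I won't read it."))
    # ['Will', 'you', 'read', 'it', '?', 'I', 'will', 'not', 'read', 'it', '.']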
Lexemes
By the term word, we often denote not just the one linguistic form in the given context but also
the concept behind the form and the set of alternative forms that can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a language.
Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns, adjectives,
conjunctions, particles, or other parts of speech.
The citation form of a lexeme, by which it is commonly identified, is also called its lemma.
When we convert a word into its other forms, such as turning the singular mouse into the plural
mice or mouses, we say we inflect the lexeme.
When we transform a lexeme into another one that is morphologically related, regardless of its
lexical category, we say we derive the lexeme: for instance, the nouns receiver and reception are
derived from the verb to receive.
Example: Did you see him? I didn’t see him. I didn’t see anyone.
Example presents the problem of tokenization of didn’t and the investigation of the internal
structure of anyone.
In the paraphrase I saw no one, the lexeme to see would be inflected into the form saw to reflect
its grammatical function of expressing positive past tense.
Likewise, him is the oblique case form of he or even of a more abstract lexeme representing all
personal pronouns.
In the paraphrase, no one can be perceived as the minimal word synonymous with nobody.
NATURAL LANGUAGE PROCESSING
Downloaded by Praneeth Reddy (npraneeth255@gmail.com)
lOMoARcPSD|59088236
The difficulty with the definition of what counts as a word need not pose a problem for the
syntactic description if we understand no one as two closely connected tokens treated as one
fixed element.
Morphemes
Morphological theories differ on whether and how to associate the properties of word forms with
their structural components.
These components are usually called segments or morphs.
The morphs that by themselves represent some aspect of the meaning of a word are called
morphemes of some function.
Human languages employ a variety of devices by which morphs and morphemes are combined
into word forms.
Morphology
Morphology is the domain of linguistics that analyzes the internal structure of words.
Morphological analysis–exploring the structure of words
Words are built up of minimal meaningful elements called morphemes:
played=play-ed
cats = cat-s
unfriendly = un-friend-ly
Two types of morphemes:
i. Stems: play, cat, friend
ii. Affixes: -ed, -s, un-, -ly
Two main types of affixes:
i. Prefixes precede the stem: un-
ii. Suffixes follow the stem: -ed, -s, -ly
Stemming = find the stem by stripping off affixes
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
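As a hedged illustration of stemming in practice, the following sketch uses NLTK's Porter stemmer, a standard rule-based suffix stripper (assuming NLTK is installed; its heuristic output will not always match the hand segmentations above):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["played", "cats", "replayed", "computerized", "unfriendly"]:
        # Porter stemming strips common suffixes heuristically; prefixes are kept.
        print(word, "->", stemmer.stem(word))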
Problems in morphological processing
Inflectional morphology: inflected forms are constructed from base forms and inflectional affixes.
Inflection relates different forms of the same word
Lemma Singular Plural
cat cat cats
dog dog dogs
knife knife knives
sheep sheep sheep
mouse mouse mice
Derivational morphology: words are constructed from roots (or stems) and derivational affixes:
inter+national= international
international+ize= internationalize
internationalize+ation= internationalization
The simplest morphological process concatenates morphs one by one, as in dis-agree-ment-s,
where agree is a free lexical morpheme and the other elements are bound grammatical
morphemes contributing some partial meaning to the whole word.
In a more complex scheme, morphs can interact with each other, and their forms may become
subject to additional phonological and orthographic changes denoted as morphophonemic.
The alternative forms of a morpheme are termed allomorphs.
Typology
Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
It can consider various criteria, and during the history of linguistics, different classifications have
been proposed.
Let us outline the typology that is based on quantitative relations between words, their
morphemes, and their features:
Isolating, or analytic, languages include no or relatively few words that would comprise more
than one morpheme (typical members are Chinese, Vietnamese, and Thai; analytic tendencies are
also found in English).
Synthetic languages can combine more morphemes in one word and are further divided into
agglutinative and fusional languages.
Agglutinative languages have morphemes associated with only a single function at a time (as in
Korean, Japanese, Finnish, and Tamil, etc.)
Fusional languages are defined by their feature-per-morpheme ratio higher than one (as in
Arabic, Czech, Latin, Sanskrit, German, etc.).
In accordance with the notions about word formation processes mentioned earlier, we can also
distinguish concatenative and nonlinear languages:
Concatenative languages link morphs and morphemes one after another.
Nonlinear languages allow structural components to merge nonsequentially, applying tonal
morphemes or changing the consonantal or vocalic templates of words.
Morphological Typology
Morphological typology is a way of classifying the languages of the world that groups languages
according to their common morphological structures.
The field organizes languages on the basis of how those languages form words by combining
morphemes.
The morphological typology classifies languages into two broad classes of synthetic languages
and analytical languages.
The synthetic class is then further sub classified as either agglutinative languages or fusional
languages.
Analytic languages contain very little inflection, instead relying on features like word order and
auxiliary words to convey meaning.
Synthetic languages, ones that are not analytic, are divided into two categories: agglutinative and
fusional languages.
Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and infixes) for
inflection, ex: inter+national= international, international+ize= internationalize.
While fusional languages "fuse" inflectional categories together, often allowing one word ending
to contain several categories, such that the original root can be difficult to extract (anybody,
newspaper).
Issues and Challenges
Irregularity: word forms are not described by a prototypical linguistic model.
Ambiguity: word forms can be understood in multiple ways out of the context of their discourse.
Productivity: is the inventory of words in a language finite, or is it unlimited?
Morphological parsing tries to eliminate the variability of word forms to provide higher-level
linguistic units whose lexical and morphological properties are explicit and well defined.
It attempts to remove unnecessary irregularity and give limits to ambiguity, both of which are
present inherently in human language.
By irregularity, we mean existence of such forms and structures that are not described
appropriately by a prototypical linguistic model.
Some irregularities can be understood by redesigning the model and improving its rules, but
other lexically dependent irregularities often cannot be generalized
Ambiguity is indeterminacy in the interpretation of expressions of language.
Morphological modelling also faces the problem of productivity and creativity in language, by
which unconventional but perfectly meaningful new words or new senses are coined.
Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the world
of words.
Immediate descriptions of given linguistic data may not be the ultimate ones, due to either their
inadequate accuracy or inappropriate complexity, and better formulations may be needed.
The design principles of the morphological model are therefore very important.
In Arabic, the deeper study of the morphological processes that are in effect during inflection and
derivation, even for the so-called irregular words, is essential for mastering the whole
morphological and phonological system.
With the proper abstractions made, irregular morphology can be seen as merely enforcing some
extended rules, the nature of which is phonological, over the underlying or proto typical regular
word forms.
The table illustrates differences between a naive model of word structure in Arabic and the model
proposed in Smrž and in Smrž and Bielický, where morphophonemic merge rules and templates
are involved.
Morphophonemic templates capture morphological processes by just organizing stem patterns
and generic affixes without any context-dependent variation of the affixes or ad hoc modification
of the stems.
The merge rules, indeed very terse, then ensure that such structured representations can be
converted into exactly the surface forms, both orthographic and phonological, used in the natural
language.
Applying the merge rules is independent of and irrespective of any grammatical parameters or
information other than that contained in a template.
Most morphological irregularities are thus successfully removed.
Table: Discovering the regularity of Arabic morphology using morphophonemic templates, where
uniform structural operations apply to different kinds of stems. In rows, surface forms S of qaraʾa ‘to
read’ and raʾā ‘to see’ and their inflections are analyzed into immediate I and morphophonemic M
templates, in which dashes mark the structural boundaries where merge rules are enforced. The
outer columns of the table correspond to P perfective and I imperfective stems declared in the
lexicon; the inner columns treat active verb forms of the following morphosyntactic properties: I
indicative, S subjunctive, J jussive mood; 1 first, 2 second, 3 third person; M masculine, F feminine
gender; S singular, P plural number.
Ambiguity
Morphological ambiguity is the possibility that word forms can be understood in multiple ways out
of the context of their discourse.
Word forms that look the same but have distinct functions or meanings are called homonyms.
Ambiguity is present in all aspects of morphological processing and language processing at large.
Table arranges homonyms on the basis of their behavior with different endings.
Three nominal cases can be expressed by the same word form, as with ‘my study’ and
‘my teachers’, but the original case endings are distinct.
Productivity
Is the inventory of words in a language finite, or is it unlimited?
This question leads directly to discerning two fundamental approaches to language, summarized
in the distinction between langue and parole by Ferdinand de Saussure, or in the competence
versus performance duality by Noam Chomsky.
In one view, language can be seen as simply a collection of utterances (parole) actually pronounced
or written (performance).
This ideal data set can in practice be approximated by linguistic corpora, which are finite
collections of linguistic data that are studied with empirical methods and can be used for
comparison when linguistic models are developed.
Yet, if we consider language as a system (langue), we discover in it structural devices like recursion,
iteration, or compounding that allow speakers to produce (competence) an infinite set of concrete
linguistic utterances.
This general potential holds for morphological processes as well and is called morphological
productivity.
We denote the set of word forms found in a corpus of a language as its vocabulary.
The members of this set are word types, whereas every original instance of a word form is a word
token.
The distribution of words or other elements of language follows the “80/20 rule”, also known as
the law of the vital few.
It says that most of the word tokens in a given corpus can be identified with just a couple of word
types in its vocabulary, and words from the rest of the vocabulary occur much less commonly if not
rarely in the corpus.
Furthermore, new, unexpected words will always appear as the collection of linguistic data is
enlarged.
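A minimal sketch of measuring word types versus word tokens on a plain-text corpus, illustrating the skewed distribution described above (the file name here is hypothetical):

    from collections import Counter

    words = open("corpus.txt", encoding="utf-8").read().lower().split()
    counts = Counter(words)            # word type -> token frequency
    tokens = sum(counts.values())      # total number of word tokens
    types = len(counts)                # vocabulary size (word types)

    # Share of all tokens covered by the 20% most frequent types (the "vital few").
    top = counts.most_common(max(1, types // 5))
    coverage = sum(freq for _, freq in top) / tokens
    print(f"{types} types, {tokens} tokens, top-20% type coverage: {coverage:.1%}")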
In Czech, negation is a productive morphological operation. Verbs, nouns, adjectives, and adverbs
can be prefixed with ne- to define the complementary lexical concept.
Morphological Models
There are many possible approaches to designing and implementing morphological models.
Over time, computational linguistics has witnessed the development of a number of formalisms
and frameworks, in particular grammars of different kinds and expressive power, with which to
address whole classes of problems in processing natural as well as formal languages.
Let us now look at the most prominent types of computational approaches to morphology.
Dictionary Lookup
Morphological parsing is a process by which word forms of a language are associated with
corresponding linguistic descriptions.
Morphological systems that specify these associations by merely enumerating them case by case
(listing each word form and its description one after another) do not offer any means of
generalization.
Likewise for systems in which analyzing a word form is reduced to looking it up verbatim in word
lists, dictionaries, or databases, unless they are constructed by and kept in sync with more
sophisticated models of the language.
In this context, a dictionary is understood as a data structure that directly enables obtaining some
precomputed results, in our case word analyses.
The data structure can be optimized for efficient lookup, and the results can be shared. Lookup
operations are relatively simple and usually quick.
Dictionaries can be implemented, for instance, as lists, binary search trees, tries, hash tables, and
so on.
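A minimal sketch of such an enumerative, dictionary-based analyzer implemented as a hash table; the entries are invented for illustration:

    # Hypothetical enumerated associations: word form -> list of analyses.
    ANALYSES = {
        "mice":  [("mouse", "+Noun+Plural")],
        "saw":   [("see", "+Verb+Past"), ("saw", "+Noun+Singular")],
        "women": [("woman", "+Noun+Plural")],
    }

    def lookup(form):
        # Return precomputed analyses; out-of-vocabulary forms get nothing,
        # since plain enumeration cannot generalize beyond the listed entries.
        return ANALYSES.get(form, [])

    print(lookup("saw"))   # both enumerated readings
    print(lookup("ran"))   # [] -- not covered by the dictionary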
Because the set of associations between word forms and their desired descriptions is declared by
plain enumeration, the coverage of the model is finite and the generative potential of the language
is not exploited.
Developing as well as verifying the association list is tedious, liable to errors, and likely inefficient
and inaccurate unless the data are retrieved automatically from large and reliable linguistic
resources.
Despite all that, an enumerative model is often sufficient for the given purpose, deals easily with
exceptions, and can implement even complex morphology.
For instance, dictionary based approaches to Korean depend on a large dictionary of all possible
combinations of allomorphs and morphological alternations.
These approaches do not allow development of reusable morphological rules, though.
Finite-State Morphology
By finite-state morphological models, we mean those in which the specifications written by human
programmers are directly compiled into finite-state transducers.
The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
Finite-state transducers are computational devices extending the power of finite-state automata.
They consist of a finite set of nodes connected by directed edges labeled with pairs of input and
output symbols.
In such a network or graph, nodes are also called states, while edges are called arcs.
Traversing the network from the set of initial states to the set of final states along the arcs is
equivalent to reading the sequences of encountered input symbols and writing the sequences of
corresponding output symbols.
The set of possible sequences accepted by the transducer defines the input language; the set of
possible sequences emitted by the transducer defines the output language.
For example, a finite-state transducer could translate the infinite regular language consisting of the
words vnuk, pravnuk, prapravnuk, . . . to the matching words in the infinite regular language
defined by grandson, great-grandson, great-great-grandson, . . .
In finite-state computational morphology, it is common to refer to the input word forms as surface
strings and to the output descriptions as lexical strings, if the transducer is used for
morphological analysis, or vice versa, if it is used for morphological generation.
In English, a finite-state transducer could analyze the surface string children into the lexical string
child [+plural], for instance, or generate women from woman [+plural].
Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote
by [Σ] the set of all sequences over some set of symbols Σ, so that the domain and the range of R
are subsets of [Σ].
We can then consider R as a function mapping an input string into a set of output strings, formally
denoted by this type signature, where [Σ] equals String:

R :: [Σ] → {[Σ]}, i.e., R :: String → {String}    (1.1)
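A toy sketch of such a transducer in Python, hard-wired to analyze cat/cats in the spirit of signature (1.1); real systems compile large rule sets into optimized networks:

    # Arcs of a tiny transducer: (state, input symbol) -> (next state, output).
    # "" as the input symbol marks an epsilon arc that consumes no input.
    ARCS = {
        (0, "c"): (1, "c"), (1, "a"): (2, "a"), (2, "t"): (3, "t"),
        (3, "s"): (4, "+Plural"),     # reading 's' emits the plural feature
        (3, ""):  (4, "+Singular"),   # epsilon arc for the bare singular form
    }
    FINAL = {4}

    def transduce(surface):
        # Return the set of lexical strings for a surface string.
        results, agenda = set(), [(0, 0, "")]        # (state, position, output)
        while agenda:
            state, i, out = agenda.pop()
            if i == len(surface) and state in FINAL:
                results.add(out)
            if (state, "") in ARCS:                   # follow epsilon arcs
                nxt, sym = ARCS[(state, "")]
                agenda.append((nxt, i, out + sym))
            if i < len(surface) and (state, surface[i]) in ARCS:
                nxt, sym = ARCS[(state, surface[i])]
                agenda.append((nxt, i + 1, out + sym))
        return results

    print(transduce("cats"))  # {'cat+Plural'}
    print(transduce("cat"))   # {'cat+Singular'}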
A theoretical limitation of finite-state models of morphology is the problem of capturing
reduplication of words or their elements (e.g., to express plurality) found in several human
languages.
Finite-state technology can be applied to the morphological modeling of isolating and agglutinative
languages in a quite straightforward manner. Korean finite-state models are discussed by Kim, Lee
and Rim, and Han, to mention a few.
Unification-Based Morphology
The concepts and methods of these formalisms are often closely connected to those of logic
programming.
In finite-state morphological models, both surface and lexical forms are by themselves
unstructured strings of atomic symbols.
In higher-level approaches, linguistic information is expressed by more appropriate data
structures that can include complex values or can be recursively nested if needed.
Morphological parsing P thus associates linear forms φ with alternatives of structured content ψ,
cf. this type signature:

P :: φ → {ψ}, i.e., P :: form → {content}    (1.2)
Erjavec argues that for morphological modeling, word forms are best captured by regular
expressions, while the linguistic content is best described through typed feature structures.
Feature structures can be viewed as directed acyclic graphs.
A node in a feature structure comprises a set of attributes whose values can be feature structures
again.
Nodes are associated with types, and atomic values are attributeless nodes distinguished by their
type.
Instead of unique instances of values everywhere, references can be used to establish value
instance identity.
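A minimal sketch of unification over feature structures represented as nested Python dicts; types and shared references are omitted for brevity:

    def unify(a, b):
        # Recursively unify two feature structures; return None on a clash.
        if not isinstance(a, dict) or not isinstance(b, dict):
            return a if a == b else None     # atomic values must match exactly
        result = dict(a)
        for attr, value in b.items():
            if attr in result:
                merged = unify(result[attr], value)
                if merged is None:
                    return None              # incompatible values: failure
                result[attr] = merged
            else:
                result[attr] = value
        return result

    noun = {"cat": "noun", "agr": {"num": "pl"}}
    plural = {"agr": {"num": "pl", "per": "3"}}
    print(unify(noun, plural))
    # {'cat': 'noun', 'agr': {'num': 'pl', 'per': '3'}}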
Finding the Structure of Documents
1.1 Sentence Boundary Detection
When automatic systems convert handwritten or printed text or spoken utterances into machine-editable
text, the finding of sentence boundaries must deal with the errors of those systems as well.
On the other hand, for conversational speech or text or multiparty meetings with ungrammatical
sentences and disfluencies, in most cases it is not clear where the boundaries are.
Code switching-that is, the use of words, phrases, or sentences from multiple languages by
multilingual speakers- is another problem that can affect the characteristics of sentences.
For example, when switching to a different language, the writer can either keep the punctuation
rules from the first language or resort to the code of the second language.
Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns to
identify potential ends of sentences and lists of abbreviations for disambiguating them.
For example, if the word before the boundary is a known abbreviation, such as “Mr.” or “Gov.,” the
text is not segmented at that position even though some periods are exceptions.
To improve on such a rule-based approach, sentence segmentation can be stated as a classification
problem: given training data where all sentence boundaries are marked, we can train a classifier to
recognize them.
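A hedged sketch of this classification setup with scikit-learn; the features and training examples are invented purely for illustration:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(prev_word, next_word):
        # Simple contextual features for a candidate boundary after prev_word.
        return {
            "prev_is_abbrev": prev_word in {"Mr.", "Gov.", "Dr."},
            "next_capitalized": next_word[:1].isupper(),
            "prev_length": len(prev_word),
        }

    X = [features("Mr.", "Smith"), features("home.", "The"),
         features("Gov.", "Brown"), features("late.", "She")]
    y = [0, 1, 0, 1]   # 1 = true sentence boundary

    model = make_pipeline(DictVectorizer(), LogisticRegression())
    model.fit(X, y)
    print(model.predict([features("Dr.", "Jones")]))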
1.2 Topic Boundary Detection
Segmentation (Discourse or text segmentation) is the task of automatically dividing a stream of
text or speech into topically homogenous blocks.
That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find
the boundaries where topics change.
Topic segmentation is an important task for various language understanding applications, such as
information extraction and retrieval and text summarization.
For example, in information retrieval, if long documents can be segmented into shorter, topically
coherent segments, then only the segment that is about the user’s query could be retrieved.
During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the
Topic Detection and Tracking (TDT) program to further the state of the art in finding and following
new topics in a stream of broadcast news stories.
One of the tasks in the TDT effort was segmenting a news stream into individual stories.
2 Methods
Sentence segmentation and topic segmentation have both been considered as boundary
classification problems.
Given a boundary candidate (between two word tokens for sentence segmentation and between
two sentences for topic segmentation), the goal is to predict whether or not the candidate is an
actual boundary (sentence or topic boundary).
Formally, let x ∈ X be the vector of features (the observation) associated with a candidate, and let
y ∈ Y be the label predicted for that candidate.
The label y can be b for boundary and b̄ for non-boundary.
Classification problem: given a set of training examples (x, y)_train, find a function that will assign
the most accurate possible label ŷ to unseen examples x_unseen.
Alternatively to the binary classification problem, it is possible to model boundary types using
finer-grained categories.
Sentence segmentation in text can be framed as a three-class problem: sentence boundary with an
abbreviation b_a, sentence boundary without an abbreviation b_ā, and abbreviation not at a boundary b̄_a.
Similarly, for spoken language, a three-way classification can be made between non-boundaries,
statement boundaries b_s, and question boundaries b_q.
For sentence or topic segmentation, the problem is defined as finding the most probable sentence
or topic boundaries.
The natural unit of sentence segmentation is the word, and that of topic segmentation is the
sentence, as we can assume that topics typically do not change in the middle of a sentence.
The words or sentences are then grouped into stretches belonging to one sentence or topic; that is,
word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
The classification can be done at each potential boundary i (local modelling); then, the aim is to
estimate the most probable boundary type ŷ_i for each candidate x_i:

ŷ_i = argmax_{y ∈ Y} P(y | x_i)
Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show
possible categories.
In this formulation, a category is assigned to each example in isolation; hence, decision is made
locally.
However, the consecutive types can be related to each other. For example, in broadcast news
speech, two consecutive sentences boundaries that form a single word sentence are very
infrequent.
In local modelling, features can be extracted from surrounding example context of the candidate
boundary to model such dependencies.
It is also possible to see the candidate boundaries as a sequence and search for the sequence of
boundary types Ŷ = ŷ_1 … ŷ_n that has the maximum probability given the candidate examples
X = x_1 … x_n:

Ŷ = argmax_Y P(Y | X) = argmax_Y P(X | Y) P(Y) / P(X)
P(X) in the denominator is dropped because it is fixed for different Y and hence does not change
the argument of max.
P(X|Y) and P(Y) can be estimated from the training data; in one topic segmentation model,
NumNewTerms(b) returns the number of terms in block b seen for the first time in the text.
2.3 Discriminative Sequence Classification Methods
In segmentation tasks, the sentence or topic decision for a given example (word, sentence,
paragraph) highly depends on the decision for the examples in its vicinity.
Discriminative sequence classification methods are in general extensions of local discriminative
models with additional decoding stages that find the best assignment of labels by looking at
neighbouring decisions.
Conditional random fields (CRFs) are extensions of maximum entropy models, SVM-struct is an
extension of SVMs, and maximum margin Markov networks (M3Ns) are extensions of HMMs.
CRFs are a class of log-linear models for labelling structures.
Contrary to local classifiers that predict sentence or topic boundaries independently, CRFs can
oversee the whole sequence of boundary hypotheses to make their decisions.
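A minimal sketch of a CRF for boundary labelling using the sklearn-crfsuite package (one CRF implementation among several); the sequence and features are toy examples:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    # One training sequence: a feature dict per token, a boundary label per token.
    X_train = [[{"word": "hello"}, {"word": "world"}, {"word": "."},
                {"word": "How"}, {"word": "are"}, {"word": "you"}]]
    y_train = [["nb", "nb", "b", "nb", "nb", "b"]]  # b = boundary after token

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))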
Hybrid Approaches
Non-sequential discriminative classification algorithms ignore the context, which is critical for the
segmentation task.
We can add context as a feature or use CRFs which consider context.
An alternative is to use a hybrid classification approach.
In this approach, the posterior probabilities P_C(y_i | x_i) for each candidate, obtained from
classifiers such as boosting or CRFs, are used.
The posterior probabilities are converted to state observation likelihoods by dividing them by their
priors, following Bayes' rule.
Generative sequence algorithms need to handle the complexity of finding the best sequence of
decisions, which requires evaluating all possible sequences of decisions.
The assumption of conditional independence in generative sequence algorithms allows the use of
dynamic programming to trade time for memory and decode in polynomial time.
This complexity is measured in:
The number of boundary candidates processed together
The number of boundary states
Discriminative sequence classifiers like CRFs need to repeatedly perform inference on the
training data which might become expensive.
Performances of the Approaches
For sentence segmentation in speech, performance is evaluated using:
Error rate: the ratio of the number of errors to the number of examples
F1 measure
National Institute of Standards and Technology (NIST) error rate: the number of candidates
wrongly labelled divided by the number of actual boundaries
For sentence segmentation in text, reported error rates on a subset of the Wall Street Journal
corpus of about 27,000 sentences are as follows:
A typical rule-based system performs at an error rate of 1.41%.
Adding an abbreviation list lowers the error rate to 0.45%.
Combining it with a supervised classifier using POS tag features leads to an error rate of 0.31%.
An SVM-based system obtains an error rate of 0.25%.
Even though these error rates seem very low, they can still affect downstream tasks such as
extractive summarization that depend on sentence segmentation.
For sentence segmentation in speech, reported results on the Mandarin TDT4 Multilingual Broadcast
News Speech Corpus are as follows:
F1 measure of 69.1% for a MaxEnt classifier, 72.6% with AdaBoost, and 72.7% with SVM using
the same features.
On a Turkish Broadcast News Corpus, reported results are as follows:
F1 measure of 78.2% with HELM, 86.2% with fHELM using morphology features, 86.9% with
AdaBoost, and 89.1% with CRFs.
Reports show that on the English TDT4 broadcast news corpus, AdaBoost combined with HELM
performs at an F1 measure of 67.3%.
UNIT-II
Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax,
Representation of Syntactic Structure, Parsing Algorithms, Models for Ambiguity Resolution in
Parsing, Multilingual Issues.
Syntax Analysis:
Parsing in NLP is the process of determining the syntactic structure of a text by analysing its
constituent words based on an underlying grammar.
Example grammar:
sentence → noun_phrase verb_phrase
noun_phrase → 'Tom' | article noun
verb_phrase → verb noun_phrase
article → 'an'
noun → 'apple'
verb → 'ate'
Then, the outcome of the parsing process would be a parse tree, where sentence is the root;
intermediate nodes such as noun_phrase and verb_phrase have children and hence are called
non-terminals; and the leaves of the tree ‘Tom’, ‘ate’, ‘an’, ‘apple’ are called terminals.
Parse tree: (sentence (noun_phrase Tom) (verb_phrase (verb ate) (noun_phrase (article an) (noun apple))))
A treebank can be defined as a linguistically annotated corpus that includes some kind of syntactic
analysis over and above part-of-speech tagging.
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The syntactic parsing of a sentence consists of finding the correct syntactic structure of that
sentence in the given formalism/grammar.
Dependency grammar (DG) and phrase structure grammar (PSG) are two such formalisms.
PSG breaks a sentence into constituents (phrases), which are then broken into smaller constituents;
it describes phrase and clause structure, for example NP, PP, VP.
DG: syntactic structure consists of lexical items linked by binary asymmetric relations called
dependencies.
DG is interested in grammatical relations between individual words.
It does not propose a recursive structure but rather a network of relations.
These relations can also have labels.
NLP is the capability of computer software to understand natural language.
There is a variety of languages in the world.
Each language has its own structure (e.g., SVO or SOV), called its grammar; a grammar has a
certain set of rules that determine what is allowed and what is not.
English is S V O (I eat mango); other languages are S O V or O S V.
Grammar is defined as the rules for forming well-structured sentences.
Different Types of Grammar in NLP
1. Context-Free Grammar (CFG)
2. Constituency Grammar (CG) or Phrase structure grammar
3. Dependency Grammar (DG)
Context-Free Grammar (CFG)
Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P), where:
N or VN = set of non-terminal symbols, or variables
T or ∑ = set of terminal symbols
S = start symbol, where S ∈ N
P = production rules for terminals as well as non-terminals, of the form α → β, where α and β are
strings over VN ∪ ∑ and at least one symbol of α belongs to VN
In other words: natural language sentences are the input, and the NLP software parses them using
the set of rules of the language's grammar, such as a CFG.
Example: John hit the ball
S → NP VP
VP → V NP
NP → D N | N
D → the
N → John | ball
V → hit
Parse: (S (NP (N John)) (VP (V hit) (NP (D the) (N ball))))
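This toy grammar can be tried directly with NLTK's chart parser (assuming NLTK is installed); the NP → N rule lets a bare proper noun serve as the subject:

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        VP -> V NP
        NP -> D N | N
        D -> 'the'
        N -> 'John' | 'ball'
        V -> 'hit'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("John hit the ball".split()):
        print(tree)  # (S (NP (N John)) (VP (V hit) (NP (D the) (N ball))))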
In a dependency tree analysis, each word depends on exactly one parent, either another word or a
dummy root symbol.
By convention, in a dependency tree the 0 index is used to indicate the root symbol, and the directed
arcs are drawn from the head word to the dependent word.
The figure shows a dependency tree for a Czech sentence taken from the Prague Dependency
Treebank.
Each node in the graph is a word, its part of speech and the position of the word in the sentence.
For example [fakulte,N3,7] is the seventh word in the sentence with POS tag N3.
The node [#,ZSB,0] is the root node of the dependency tree.
There are many variations of dependency syntactic analysis, but the basic textual format for a
dependency tree can be written in the following form, where each dependent word specifies its
head word in the sentence, and exactly one word is dependent on the root of the sentence.
NNP: proper noun, singular; VBZ: verb, third person singular present;
ADJP: adjective phrase; RB: adverb; JJ: adjective
The same sentence gets the following dependency tree analysis: some of the information from the
bracketing labels from the phrase structure analysis gets mapped onto the labelled arcs of the
dependency analysis.
To explain some details of phrase structure analysis, consider the Penn Treebank, a project that
annotated 40,000 sentences from the Wall Street Journal with phrase structure trees.
The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore require a trace.
Wh-moved noun phrases are labeled WHNP and put inside SBARQ. They bear an identity index
that matches the reference index on the *T* in the position of the gap.
However, questions that are missing both subject and auxiliary are labeled SQ.
NP-SBJ marks noun phrases that are subjects.
*T* marks traces for wh-movement; this empty trace has an index (here, 1) and is associated with
the WHNP constituent bearing the same index.
Parsing Algorithms
Given an input sentence, a parser produces an output analysis of that sentence.
Treebank parsers do not need to have an explicit grammar, but to keep the discussion of parsing
algorithms simple, we use a CFG.
Consider a simple CFG G that can be used to derive strings such as ‘a and b or c’ from the start
symbol N. A shift-reduce parser proceeds as follows (a code sketch of this procedure follows the list):
1. Start with an empty stack and the buffer contains the input string.
2. Exit with success if the top of the stack contains the start symbol of the grammar and if the buffer
is empty
3. Choose between the following two steps (if the choice is ambiguous, choose one based on an
oracle):
Shift a symbol from the buffer onto the stack.
If the top k symbols of the stack are α1 … αk, corresponding to the right-hand side of a CFG
rule A → α1 … αk, then replace the top k symbols with the left-hand side non-terminal A.
4. Exit with failure if no action can be taken in previous step.
5. Else, go to step 2.
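A minimal sketch of this shift-reduce procedure in Python, with a trivial greedy oracle that always reduces when possible; the toy grammar derives strings like 'a and b or c':

    # Toy grammar: right-hand side tuple -> left-hand side non-terminal.
    RULES = {
        ("N", "and", "N"): "N",
        ("N", "or", "N"): "N",
        ("a",): "N", ("b",): "N", ("c",): "N",
    }

    def shift_reduce(words, start="N"):
        stack, buffer = [], list(words)
        while True:
            # Greedy oracle: reduce whenever a rule matches the top of the stack.
            for k in (3, 2, 1):
                if len(stack) >= k and tuple(stack[-k:]) in RULES:
                    stack[-k:] = [RULES[tuple(stack[-k:])]]
                    break
            else:
                if buffer:
                    stack.append(buffer.pop(0))     # shift
                elif stack == [start]:
                    return True                     # success: start symbol alone
                else:
                    return False                    # failure: no action possible

    print(shift_reduce("a and b or c".split()))  # True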
Hypergraphs and Chart Parsing (CYK Parsing)
For general CFGs, in the worst case such a parser might have to resort to backtracking, which means
re-parsing the input, leading to a running time that is exponential in the grammar size in the worst case.
Variants of the CYK algorithm are often used in statistical parsers that attempt to search the
space of possible parse trees without the limitation of purely left-to-right parsing.
One of the earliest recognition parsing algorithms is the CYK (Cocke, Kasami, and Younger) parsing
algorithm, and it works only with grammars in CNF (Chomsky normal form).
CYK example:
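A compact sketch of CYK recognition for a toy grammar in Chomsky normal form (the grammar and sentence are invented for illustration):

    from itertools import product

    # CNF grammar: binary rules A -> B C and lexical rules A -> 'w'.
    BINARY = {("NP", "VP"): {"S"}, ("D", "N"): {"NP"}, ("V", "NP"): {"VP"}}
    LEXICAL = {"the": {"D"}, "dog": {"N"}, "ball": {"N"}, "chased": {"V"}}

    def cyk(words):
        n = len(words)
        # table[i][j] holds the non-terminals deriving words[i..j] inclusive.
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][i] = set(LEXICAL.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):                 # try every split point
                    for b, c in product(table[i][k], table[k + 1][j]):
                        table[i][j] |= BINARY.get((b, c), set())
        return "S" in table[0][n - 1]

    print(cyk("the dog chased the ball".split()))  # True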
Models for Ambiguity Resolution in Parsing
Here we want to provide a model that matches the intuition that the second tree above is preferred
over the first.
The parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following CFG:
We assign scores or probabilities to the rules in the CFG in order to provide a score or probability
for each derivation.
Given these rule probabilities, the only deciding factor for choosing between the two parses for
John bought a shirt with pockets is the pair of rules NP → NP PP and VP → VP PP; the probability
for NP → NP PP is set higher in the preceding PCFG.
The rule probabilities can be derived from a treebank. Consider a treebank with three trees t1, t2, t3.
If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times, and t3 occurred
50 times, then the PCFG we obtain from this treebank is as shown.
For input a a a there are two parses using the above PCFG, with probabilities
P1 = 0.125 × 0.334 × 0.285 = 0.0119 and P2 = 0.25 × 0.667 × 0.714 = 0.119.
The parse tree p2 is the most likely tree for that input.
Generative models
To find the most plausible parse tree, the parser has to choose between the possible derivations
each of which can be represented as a sequence of decisions.
Let each derivation be D = d1, d2, …, dn, the sequence of decisions used to build the parse tree.
Then for an input sentence x, the output parse tree y is defined by the sequence of steps in the
derivation.
The probability for each derivation is:

P(D) = ∏_{i=1..n} P(di | d1, …, di−1)

The conditioning context in the probability P(di | d1, …, di−1) is called the history and corresponds
to a partially built parse tree (as defined by the derivation sequence).
We make a simplifying assumption that keeps the conditioning context to a finite set by grouping
the histories into equivalence classes using a mapping function.
There are two general approaches to parsing: 1. top-down parsing (start with the start symbol);
2. bottom-up parsing (start from the terminals).
Multilingual Issues:
Multilingualism is the ability of an individual speaker or a community of speakers to communicate
effectively in three or more languages. Contrast with monolingualism, the ability to use only one
language.
A person who can speak multiple languages is known as a polyglot or a multilingual.
The original language a person grows up speaking is known as their first language or mother tongue.
Someone who is raised speaking two first languages or mother tongues is called a simultaneous
bilingual. If they learn a second language later, they are called a sequential bilingual.
Multilingual NLP is a technology that integrates linguistics, artificial intelligence, and computer
science to serve the purpose of processing and analyzing substantial amounts of natural human
language in numerous settings.
There are many different forms of multilingual NLP, but in general, it enables computational software
to understand the language of certain texts, along with contextual nuances. Multilingual NLP is also
capable of obtaining specific data and delivering key insights. In short, multilingual NLP technology
makes the impossible possible which is to process and analyze large amounts of data. Without it, this
kind of task can probably only be executed by employing a very labor- and time-intensive approach.
One of the biggest obstacles preventing multilingual NLP from scaling quickly is the low
availability of labelled data in low-resource languages.
Among the 7,100 languages that are spoken worldwide, each of them has its own linguistic rules and
some languages simply work in different ways. For instance, there are undeniable similarities
between Italian, French and Spanish, whilst on the other hand, these three languages are totally
different from a specific Asian language group, that is Chinese, Japanese, and Korean which share
some similar symbols and ideographs.
Solutions for tackling multilingual NLP challenges
1. Training specific non-English NLP models
The first suggested solution is to train an NLP model for a specific language. A well-known example
would be a few new versions of Bidirectional Encoder Representations from Transformers (BERT)
that have been trained in numerous languages.
However, the biggest problem with this approach is its low success rate of scaling. It takes lots of
time and money to train a new model, let alone many models. NLP systems require various large
models, hence the processes can be very expensive and time-consuming.
This technique also does not scale effectively in terms of inference. Using NLP in different
languages means the business would have to sustain different models and provision several
servers and GPUs. Again, this can be extremely costly for the business.
2. Leveraging multilingual models
Recent years have shown that newly emerging multilingual NLP models can be incredibly accurate,
at times even more accurate than specific, dedicated non-English language models.
Whilst there are several high-quality pre-trained models for text classification, so far there has not
been a multilingual model for text generation with impressive performance.
3. Utilizing translation
The last solution some businesses benefit from is to use translation. Companies can translate their
non-English content to English, provide the NLP model with that English content, then translate
the result back to the needed language.
As manual as it may sound, this solution has several advantages, including cost-effective workflow
maintenance and easily supported worldwide languages.
Translation may not be suitable if your business is after quick results, as the overall workflow’s
response time must increase to include the translation process.
UNIT-III
Semantic Parsing: Introduction, Semantic Interpretation, System Paradigms, Word Sense Systems,
Software.
Semantic Parsing
3.1 Introduction:
A central goal in language understanding is the identification of a meaning representation that is
detailed enough to allow a reasoning system to make deductions (the process of reaching a decision
or answer by thinking about the known facts), but at the same time general enough that it can be
used across many domains with little to no adaptation.
It is not clear whether a final, low-level, detailed semantic representation covering the various
applications that use some form of language interface can be achieved, or whether an ontology
(a branch of metaphysics concerned with the nature and relations of being) can be created that
captures the various granularities and aspects of meaning embodied in such a variety of applications.
Since neither has been achieved so far, two compromise approaches have emerged in NLP for
language understanding.
NLP for language understanding.
In the first approach, a specific, rich meaning representation is created for a limited domain for
use by application that are restricted to that domain, such as travel reservations, football game
simulations, or querying a geographic database.
In the second approach, a related set of intermediate meaning representations is created, going
from low-level analysis to a middle-level analysis, and the bigger understanding task is divided
into multiple smaller pieces that are more manageable, such as word sense disambiguation
followed by predicate-argument structure recognition.
This yields two types of meaning representations: a domain-dependent, deeper representation and a
set of relatively shallow but general-purpose, low-level, and intermediate representations.
The task of producing the output of the first type is often called deep semantic parsing, and the
task of producing the output of the second type is often called shallow semantic parsing.
The first approach is so specific that porting it to every new domain can require anywhere from a
few modifications to almost reworking the solution from scratch.
In other words, the reusability of the representation across domains is very limited.
The problem with the second approach is that it is extremely difficult to construct a general-purpose
ontology and create symbols that are shallow enough to be learnable but detailed enough to be
useful for all possible applications.
Therefore, an application specific translation layer between the more general representation and
the more specific representation becomes necessary.
3.2 Semantic Interpretation:
Semantic parsing can be considered part of semantic interpretation, which involves various
components that together define a representation of text that can be fed into a computer to allow
further computational manipulation and search, prerequisites for any language understanding
system or application. Here we start by discussing the structure of a semantic theory.
A Semantic theory should be able to:
1. Explain sentences having ambiguous meanings: ‘The bill is large’ is ambiguous in that it could
refer to money or to the beak of a bird.
2. Resolve the ambiguities of words in context: in ‘The bill is large but need not be paid’, the theory
should be able to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences: ‘Colourless green ideas sleep
furiously.’
4. Identify syntactically or transformationally unrelated paraphrases of a concept having the same
semantic content.
Here we look at some requirements for achieving a semantic representation.
3.2.1. Structural ambiguity:
Structure means syntactic structure of sentences.
Finding the syntactic structure means transforming the sentence into its underlying syntactic
representation, which theories of semantic interpretation then refer to.
3.2.2. Word Sense:
In any given language, the same word type is used in different contexts and with different
morphological variants to represent different entities or concepts in the world.
For example, we use the word nail to represent a part of the human anatomy and also to represent
the generally metallic object used to secure other objects.
Context enables identifying the sense intended by the author or speaker. Consider the following
four examples: the presence of words such as hammer and hardware store in sentences 1 and 2, and
of clipped and manicure in sentences 3 and 4, enables humans to easily disambiguate the sense in
which nail is used:
1. He nailed the loose arm of the chair with a hammer.
2. He bought a box of nails from the hardware store.
3. He went to the beauty salon to get his nails clipped.
4. He went to get a manicure. His nails had grown very long.
Resolving the sense of words in a discourse, therefore, constitutes one of the steps in the process of
semantic interpretation.
3.2.3. Entity and Event Relation:
The next important component of semantic interpretation is the identification of the various entities
that are scattered across the discourse, referred to using the same or different phrases.
Two predominant tasks have become popular over the years: named entity recognition and
coreference resolution.
3.2.4. Predicate-Argument Structure:
Once we have the word senses, entities, and events identified, another level of semantic structure
comes into play: identifying the participation of the entities in these events.
Resolving the argument structure of a predicate in the sentence is where we identify which entities
play what part in which event.
A word which functions as the verb does here we call a predicate, and words which function as
the nouns do are called arguments. Here are some other predicates and arguments:
Argument    Predicate
Selena      slept
Tom         is tall
Word sense ambiguities can be of three principal types: i. homonymy, ii. polysemy, iii. categorial
ambiguity.
Homonymy is defined as words having the same spelling or form but different and unrelated
meanings. For example, the word “bat” is a homonym because a bat can be an implement to hit a
ball or a nocturnal flying mammal.
Polysemy is a Greek word, which means “many signs”. Polysemy has the same spelling but
different and related meaning.
Both polysemy and homonymy words have the same syntax or spelling. The main difference
between them is that in polysemy, the meanings of the words are related but in homonymy, the
meanings of the words are not related.
For example, bank. Homonymy: financial bank and river bank.
Polysemy: financial bank, bank of clouds, and book bank all indicate a collection of things.
Categorial ambiguity: the word book can be a noun, a volume that contains chapters, or a verb, as
in the police register used to enter (book) charges against someone.
The noun sense and the verb sense belong to different grammatical categories, and
distinguishing between these two categories effectively helps disambiguate these two senses.
Therefore categorial ambiguity can be resolved with syntactic information (part of speech) alone,
but polysemy and homonymy need more than syntax.
Traditionally, in English, word senses have been annotated for each part of speech separately,
whereas in Chinese, the sense annotation has been done per lemma.
Resources:
As with any language understanding task, the availability of resources is a key factor in the
disambiguation of word senses in corpora.
Early work on word sense disambiguation used machine-readable dictionaries or thesauruses as
knowledge sources.
Two prominent sources were the Longman Dictionary of Contemporary English (LDOCE) and
Roget’s Thesaurus.
The biggest sense-annotated corpus is OntoNotes, released through the Linguistic Data Consortium
(LDC). The Chinese sense-annotated corpus is HowNet.
Systems:
Researchers have explored various system architectures to address the sense disambiguation
problem.
We can classify these systems into four main categories: i. rule-based or knowledge-based,
ii. supervised, iii. unsupervised, iv. semi-supervised.
Rule Based:
The first generation of word sense disambiguation systems was primarily based on dictionary
sense definitions.
Much of this work is historical and cannot readily be translated into systems built today, but some
of the techniques and algorithms are still applicable.
The simplest and oldest dictionary-based sense disambiguation algorithm was introduced by Lesk.
The core of the algorithm is to choose the dictionary sense whose definition terms most closely
overlap with the terms in the context of the target word.
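A minimal sketch of the Lesk-style overlap count; the mini sense dictionary is invented for illustration (NLTK also ships a ready-made implementation as nltk.wsd.lesk):

    # Hypothetical mini dictionary: sense -> set of definition terms.
    SENSES = {
        "nail.anatomy":  {"finger", "toe", "keratin", "manicure", "clip"},
        "nail.fastener": {"metal", "hammer", "wood", "hardware", "drive"},
    }

    def lesk(context_words):
        # Pick the sense whose definition overlaps most with the context.
        context = {w.lower() for w in context_words}
        return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

    print(lesk("He bought a box of nails from the hardware store".split()))
    # nail.fastener -- 'hardware' overlaps with that sense's definition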
Supervised
The simplest form of word sense disambiguation system is the supervised approach, which tends to
transfer all the complexity to the machine learning machinery; while it still requires hand
annotation, it tends to be superior to unsupervised methods and performs best when tested on
annotated data.
These systems typically consist of a classifier trained on various features extracted for words that
have been manually disambiguated in a given corpus, and the application of the resulting models
to disambiguating words in unseen test sets.
A good feature of these systems is that the user can incorporate rules and knowledge in the form
of features.
Classifier:
Probably the most common and best performing classifiers are support vector machines (SVMs)
and maximum entropy classifiers.
Features: Here we discuss a more commonly found subset of features that have been useful in
supervised learning of word sense.
Lexical context: this feature comprises the words and lemmas of words occurring in the entire
paragraph or in a smaller window, usually of five words.
Parts of speech: the feature comprises the POS information for words in the window surrounding
the word that is being sense tagged.
Bag of words context: this feature comprises using an unordered set of words in the context
window.
Local collocations: local collocations are an ordered sequence of phrases near the target word that
provide semantic context for disambiguation. Usually, a very small window of about three tokens
on each side of the target word, most often in contiguous pairs or triplets, is added as a list of
features.
Syntactic relations: if the parse of the sentence containing the target word is available, then we
can use syntactic features.
Topic features: the broad topic, or domain, of the article the word belongs to is also a good
indicator of which sense of the word might be most frequent.
Chen and Palmer [51] recently proposed some additional rich features for disambiguation:
Voice of the sentence: this ternary feature indicates whether the sentence in which the word
occurs is passive, semi-passive, or active.
Presence of subject/object- This binary feature indicates whether the target word has a subject
or object. Given a large amount of training data, we could also use the actual lexemes and possibly
the semantic roles rather than the syntactic subjects/objects.
Sentential complement- This binary feature indicates whether the word has a sentential
complement.
Prepositional phrase adjunct- This feature indicates whether the target word has a prepositional
phrase adjunct and, if so, selects the head of the noun phrase inside the prepositional phrase.
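To make the supervised setup concrete, here is the minimal sketch referenced above: a feature-based sense classifier built with scikit-learn. The training sentences and sense labels are invented for illustration, and only the bag-of-words context feature is used; a real system would train on a corpus such as OntoNotes and add the other features listed.

# A minimal supervised WSD sketch: bag-of-words context features
# fed to a linear SVM. The tiny training set is invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Context sentences for the ambiguous word "bank", with gold sense labels.
train_contexts = [
    "he opened a savings account at the bank",
    "the bank approved the loan yesterday",
    "they fished from the bank of the river",
    "the river overflowed its bank after the rain",
]
train_senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

# CountVectorizer supplies the bag-of-words context feature;
# LinearSVC is the SVM classifier trained on it.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_contexts, train_senses)

print(clf.predict(["she deposited cash at the bank"]))   # likely FINANCE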
Semantic analysis starts with lexical semantics, which studies the meanings of individual words
(i.e., dictionary definitions).
Semantic analysis then examines relationships between individual words and analyzes the
meaning of words that come together to form a sentence.
This analysis provides a clear understanding of words in context. For Example:
"The boy ate the apple" defines an apple as a fruit.
"The boy went to Apple" defines Apple as a brand or store.
It draws the exact meaning, or the dictionary meaning, from the text, and the text is checked for
meaningfulness.
Semantic analysis represents sentences in meaningful parts and maps syntactic structures to
objects in the task domain.
It disregards sentences such as “hot ice-cream”.
Similarly, “colorless green idea” would be rejected by semantic analysis, because “colorless
green” does not make sense.
https://www.youtube.com/watch?v=eEjU8oY_7DE
https://www.youtube.com/watch?v=W7QdqCrX_mY
https://www.youtube.com/watch?v=XLvv_5meRNM
https://www.geeksforgeeks.org/understanding-semantic-analysis-nlp/
UNIT-IV
Predicate-Argument Structure, Meaning Representation Systems, Software.
Discourse Processing: Cohesion, Reference Resolution, Discourse Cohesion and structure
Predicate
Predicate Argument Structure:
Shallow semantic parsing, or semantic role labelling, is the process of identifying the various
arguments of predicates in a sentence.
In linguistics, the predicate refers to the main verb in the sentence, and a predicate takes arguments.
The role of Semantic Role Labelling (SRL) is to determine how these arguments are semantically
related to the predicate.
Consider the sentence "Mary loaded the truck with hay at the depot on Friday".
'Loaded' is the predicate. Mary, the truck, and hay have the respective semantic roles of loader,
bearer, and cargo: Mary -> loader, truck -> bearer, hay -> cargo.
We can identify additional roles of location (depot) and time (Friday). The job of SRL is to identify
these roles so that downstream NLP tasks can "understand" the sentence. Often an idea can be
expressed in multiple ways. Consider these sentences that all mean the same thing: "Yesterday,
Kristina hit Scott with a baseball"; "Scott was hit by Kristina yesterday with a baseball"; "With a
baseball, Kristina hit Scott yesterday"; "Kristina hit Scott with a baseball yesterday".
Either constituent or dependency parsing will analyse these sentences syntactically, but the
resulting syntactic relations don't by themselves determine the semantic roles.
SRL is useful in any NLP application that requires semantic understanding: machine translation,
information extraction, text summarization, question answering, and more.
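To see what SRL output looks like, the following toy Python snippet encodes a PropBank-style analysis of the example sentences as plain data. The label names (ARG0, ARG1, ARG2, ARGM-TMP) follow PropBank conventions, but the assignment here is illustrative rather than gold annotation.

# PropBank-style role labelling for the example sentences, shown as plain
# data. All the paraphrases receive the same argument structure even
# though their syntax differs.
srl_frame = {
    "predicate": "hit",
    "ARG0": "Kristina",          # agent: the hitter
    "ARG1": "Scott",             # patient: the entity hit
    "ARG2": "with a baseball",   # instrument
    "ARGM-TMP": "yesterday",     # temporal adjunct
}

sentences = [
    "Yesterday, Kristina hit Scott with a baseball",
    "Scott was hit by Kristina yesterday with a baseball",
    "With a baseball, Kristina hit Scott yesterday",
]
# The same frame abstracts over all the syntactic variants above.
for s in sentences:
    print(s, "->", srl_frame["ARG0"], srl_frame["predicate"], srl_frame["ARG1"])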
Resources:
The late 1990s saw the emergence of two important semantically tagged corpora: one is
FrameNet and the other is PropBank.
These resources have begun a transition from a long tradition of predominantly rule-based
approaches toward more data-oriented approaches.
These approaches focus on transforming linguistic insights into features rather than into rules,
letting a machine learning framework use those features to learn a model that automatically tags
the semantic information encoded in such resources.
FrameNet is based on the theory of frame semantics, where a given predicate invokes a semantic
frame, thus instantiating some or all of the possible semantic roles belonging to that frame.
PropBank, on the other hand, is based on Dowty's prototype theory and uses a more
linguistically neutral view in which each predicate has a set of core arguments that are predicate
dependent, while all predicates share a set of non-core, or adjunctive, arguments.
FrameNet:
FrameNet contains frame-specific semantic annotation of a number of predicates in English.
It contains tagged sentences extracted from the British National Corpus (BNC).
The process of FrameNet annotation consists of identifying specific semantic frames and creating
a set of frame-specific roles called frame elements.
Then, a set of predicates that instantiate the semantic frame, irrespective of their grammatical
category, is identified, and a variety of sentences are labelled for those predicates.
The labelling process entails identifying the frame that an instance of the predicate lemma
invokes, then identifying the semantic arguments for that instance and tagging them with one of
the predetermined set of frame elements for that frame.
The combination of the predicate lemma and the frame that its instance invokes is called a lexical
unit (LU).
This is therefore the pairing of a word with its meaning.
Each sense of a polysemous word tends to be associated with a unique frame.
Coherence Relations
The coherence of a discourse can be described by relations holding between segments S0 and S1,
such as the following.
Explanation
It infers that the state asserted by S1 could cause the state asserted by S0. For example, two statements
show the relationship − Ram fought with Shyam’s friend. He was drunk.
Parallel
It infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1, where ai
and bi are similar for all i. For example, two statements are parallel − Ram wanted a car. Shyam
wanted money.
Elaboration
It infers the same proposition P from both assertions, S0 and S1. For example, two statements
show the relation elaboration − Ram was from Chandigarh. Shyam was from Kerala.
Occasion
It happens when a change of state can be inferred from the assertion of S0, whose final state can
be inferred from S1, and vice versa. For example, the two statements show the relation occasion −
Ram picked up the book. He gave it to Shyam.
Building Hierarchical Discourse Structure
The coherence of entire discourse can also be considered by hierarchical structure between
coherence relations. For example, the following passage can be represented as hierarchical structure
S1 − Ram went to the bank to deposit money.
S2 − He then took a train to Shyam’s cloth shop.
S3 − He wanted to buy some clothes.
S4 − He did not have new clothes for the party.
S5 − He also wanted to talk to Shyam regarding his health.
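As promised above, the hierarchy can be encoded as nested relations. The following toy Python encoding shows one plausible grouping of S1–S5; the exact tree is an illustrative assumption, since more than one analysis is defensible.

# One plausible hierarchical grouping of S1-S5 as nested coherence
# relations. The tree shape is an illustrative assumption.
discourse_tree = (
    "Occasion",
    "S1: Ram went to the bank to deposit money.",
    ("Explanation",
     "S2: He then took a train to Shyam's cloth shop.",
     ("Parallel",
      ("Explanation",
       "S3: He wanted to buy some clothes.",
       "S4: He did not have new clothes for the party."),
      "S5: He also wanted to talk to Shyam regarding his health.")),
)

def show(node, depth=0):
    if isinstance(node, tuple):
        print("  " * depth + node[0])           # relation label
        for child in node[1:]:
            show(child, depth + 1)
    else:
        print("  " * depth + node)              # leaf discourse segment

show(discourse_tree)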
Reference Resolution
Interpreting the sentences of a discourse is another important task, and to achieve this we need
to know who or what entity is being talked about. Here, reference interpretation is the key
element. Reference may be defined as a linguistic expression that denotes an entity or individual.
For example, in the passage Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went
to meet him, the linguistic expressions Ram, his, and he are references.
On the same note, reference resolution may be defined as the task of determining what entities are
referred to by which linguistic expression.
Terminology Used in Reference Resolution
We use the following terminologies in reference resolution −
Referring expression − The natural language expression that is used to perform reference is
called a referring expression. For example, in the passage above, Ram, his, and he are referring
expressions.
Referent − It is the entity that is referred. For example, in the last given example Ram is a
referent.
Corefer − When two expressions are used to refer to the same entity, they are called corefers.
For example, Ram and he are corefers.
Antecedent − The term that licenses the use of another term. For example, Ram is the
antecedent of the reference he.
Anaphora & Anaphoric − An anaphor is a reference to an entity that has been previously
introduced into the discourse, and the referring expression used in this way is called anaphoric.
Discourse model − The model that contains the representations of the entities that have been
referred to in the discourse and the relationship they are engaged in.
Types of Referring Expressions
Let us now see the different types of referring expressions. The five types of referring expressions are
described below −
Indefinite Noun Phrases
Such a reference represents an entity that is new to the hearer in the discourse context.
For example, in the sentence Ram had gone around one day to bring him some food − some food is
an indefinite reference.
Definite Noun Phrases
In contrast to the above, such a reference represents an entity that is not new and is identifiable
to the hearer in the discourse context. For example, in the sentence I used to read The Times of
India − The Times of India is a definite reference.
Pronouns
A pronoun is a form of definite reference. For example, in Ram laughed as loud as he could, the
word he is a pronominal referring expression.
Demonstratives
These demonstrate and behave differently than simple definite pronouns. For example, this and that
are demonstrative pronouns.
Names
It is the simplest type of referring expression. It can be the name of a person, organization, or
location. For example, in the examples above, Ram is a name referring expression.
Reference Resolution Tasks
The two reference resolution tasks are described below.
Coreference Resolution
It is the task of finding referring expressions in a text that refer to the same entity. In simple words, it
is the task of finding corefer expressions. A set of coreferring expressions are called coreference chain.
For example - He, Chief Manager and His - these are referring expressions in the first passage given
as example.
Constraint on Coreference Resolution
In English, the main problem for coreference resolution is the pronoun it, because it has many
uses. It can refer to entities much as he and she do, but it can also be used pleonastically, referring
to no specific thing, as in It's raining. It is really good.
Pronominal Anaphora Resolution
Unlike coreference resolution, pronominal anaphora resolution may be defined as the task of
finding the antecedent of a single pronoun. For example, for the pronoun his in the passage above,
the task of pronominal anaphora resolution is to find the antecedent Ram.
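A toy sketch shows the flavor of the task. The following naive Python resolver walks backwards from the pronoun and picks the most recent mention that agrees in gender; the gender lexicon and the recency-only strategy are simplifying assumptions, not a production algorithm.

# A naive pronominal anaphora resolver: walk backwards from the pronoun
# and return the most recent mention that agrees in gender.
# The gender lexicon is a toy assumption for illustration.
GENDER = {"ram": "male", "shyam": "male", "sita": "female"}

def resolve_pronoun(pronoun, preceding_mentions):
    """preceding_mentions: candidate antecedents, oldest first."""
    wanted = {"he": "male", "him": "male", "his": "male",
              "she": "female", "her": "female"}[pronoun.lower()]
    for mention in reversed(preceding_mentions):   # recency preference
        if GENDER.get(mention.lower()) == wanted:
            return mention
    return None

# "Ram saw his friend Shyam at a shop. He went to meet him."
print(resolve_pronoun("He", ["Ram", "Shyam"]))   # -> Shyam (most recent male)

Note that recency alone resolves He to Shyam, whereas the intended antecedent is Ram; this is exactly why practical systems add syntactic, semantic, and discourse constraints on top of such heuristics.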
UNIT-V
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Parameter
Estimation, Language Model Adaptation, Types of Language Models, Language-Specific Modeling
Problems, Multilingual and Cross lingual Language Modeling.
Language Modeling
Introduction
A statistical language model is a model that specifies the a priori probability of a particular word
sequence in the language of interest.
Given an alphabet or inventory of units Σ and a sequence W = w1 w2 … wt ∈ Σ*, a language model
can be used to compute the probability of W based on parameters previously estimated from a
training set.
The inventory ∑ is the list of unique words encountered in the training data.
Selecting the units over which a language model should be defined is a difficult problem
particularly in languages other than English.
A language model is typically combined with one or more other models that hypothesize possible
word sequences.
In speech recognition a speech recognizer combines acoustic model scores with language model
scores to decode spoken word sequences from an acoustic signal.
Language models have also become a standard tool in information retrieval, authorship
identification, and document classification.
N-Gram Models
Directly estimating the probability of a word sequence of arbitrary length is not possible, because
natural language permits an infinite number of word sequences of variable length.
The probability P(W) can be decomposed into a product of component probabilities according to
the chain rule of probability:
P(W) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) ⋯ P(wt | w1 … wt-1)
Since the individual terms in the above product are difficult to compute directly n-gram
approximation was introduced.
The assumption is that all the preceding words except the n-1 words directly preceding the
current word are irrelevant for predicting the current word.
Hence P(W) is approximated as:
P(W) ≈ ∏ P(wi | wi-n+1 … wi-1), with the product taken over i = 1, …, t
This model is also called an (n-1)th-order Markov model, because of the assumption that the
current word is independent of all words except the n-1 preceding ones.
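A minimal sketch of maximum-likelihood n-gram estimation (here a bigram model) in Python, using a toy corpus invented for illustration:

# Minimal bigram model: maximum-likelihood estimates from raw counts.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """P(w | prev) = c(prev, w) / c(prev), the ML estimate."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("cat", "the"))   # c(the,cat)=2, c(the)=3 -> 0.666...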
In language modeling we are more interested in the performance of a language model q on a test
set of fixed size, say t words (w1 w2 … wt).
The language model perplexity can be computed as:
PP(q) = q(w1 w2 … wt)^(-1/t) = 2^H, where H = -(1/t) Σ log2 q(wi | w1 … wi-1), summed over i = 1, …, t
When comparing different language models, their perplexities must be normalized with respect
to the same number of units in order to obtain a meaningful comparison.
Perplexity is the average number of equally likely successor words when transitioning from one
position in the word string to the next.
If the model has no predictive power, perplexity is equal to the vocabulary size.
A model achieving perfect prediction has a perplexity of one.
The goal in language model development is to minimize the perplexity on a held-out data set
representative of the domain of interest.
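A small sketch of the perplexity computation for a toy unigram model; the probabilities are invented for illustration:

# Computing perplexity of a (toy) unigram model q on a test set of t words.
import math

q = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}   # toy model
test = ["the", "cat", "sat"]

log_prob = sum(math.log2(q[w]) for w in test)
perplexity = 2 ** (-log_prob / len(test))
print(perplexity)   # ~ 3.97: average number of equally likely successors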
Sometimes the goal of language modeling might be to distinguish between “good” and “bad” word
sequences.
Optimization in such cases may not be minimizing the perplexity.
Parameter Estimation
Maximum-Likelihood Estimation and Smoothing
Bayesian Parameter Estimation
Large-Scale Language Models
Under maximum-likelihood estimation, a trigram probability, for example, is estimated as
PML(wi | wi-2 wi-1) = c(wi-2, wi-1, wi) / c(wi-2, wi-1)
where c(wi-2, wi-1, wi) is the count of the trigram wi-2 wi-1 wi in the training data.
This method fails to assign nonzero probabilities to word sequences that have not been observed
in the training data.
The probability of sequences that were observed might also be overestimated.
The process of redistributing probability mass such that peaks in the n-gram probability
distribution are flattened and zero estimates are floored to some small nonzero value is called
smoothing.
The most common smoothing technique is backoff.
Backoff involves splitting n-grams into those whose counts in the training data fall below a
predetermined threshold Ʈ and those whose counts exceed the threshold.
In the former case the maximum-likelihood estimate of the n-gram probability is replaced with an
estimate derived from the probability of the lower-order (n-1)-gram and a backoff weight.
In the latter case, n-grams retain their maximum-likelihood estimates, discounted by a factor that
redistributes probability mass to the lower-order distribution.
The back-off probability PBO for wi given wi-1, wi-2 is computed as follows:
PBO(wi | wi-2 wi-1) = dc · c(wi-2, wi-1, wi) / c(wi-2, wi-1)    if c(wi-2, wi-1, wi) > Ʈ
PBO(wi | wi-2 wi-1) = α(wi-1, wi-2) · PBO(wi | wi-1)            otherwise
where c is the count of (wi, wi-1, wi-2) and dc is a discounting factor that is applied to the
higher-order distribution.
The normalization factor α(wi-1, wi-2) ensures that the entire distribution sums to one and is
computed as:
α(wi-1, wi-2) = [1 - Σ{wi : c > Ʈ} dc · c(wi-2, wi-1, wi) / c(wi-2, wi-1)] / [1 - Σ{wi : c > Ʈ} PBO(wi | wi-1)]
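The backoff recursion can be sketched in a few lines of Python; the counts, threshold, discount, and α value below are toy assumptions, and the lower-order model is passed in as a stand-in function rather than computed from data:

# Sketch of backoff for a trigram model: use the discounted ML estimate
# when the trigram was seen more than TAU times; otherwise back off to
# the bigram distribution scaled by alpha. All numbers are toy values.
TAU = 0          # count threshold
D = 0.9          # discount factor d_c (assumed constant here)

trigram_counts = {("mary", "loaded", "the"): 2}
bigram_counts = {("mary", "loaded"): 2}

def p_bo(w, w1, w2, p_lower, alpha):
    """Backoff probability P_BO(w | w2 w1); w1, w2 are the two previous words."""
    c = trigram_counts.get((w2, w1, w), 0)
    if c > TAU:
        return D * c / bigram_counts[(w2, w1)]      # discounted ML estimate
    return alpha * p_lower(w, w1)                   # back off to (n-1)-gram

stand_in_bigram = lambda w, h: 0.1                  # placeholder lower-order model
print(p_bo("the", "loaded", "mary", stand_in_bigram, 0.5))    # seen: 0.9
print(p_bo("truck", "loaded", "mary", stand_in_bigram, 0.5))  # unseen: 0.05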
The way in which the discounting factor is computed determines the precise smoothing technique.
Well-known techniques include:
Good-Turing
Witten-Bell
Kneser-Ney
In Kneser-Ney smoothing, a fixed discounting parameter D is applied to the raw n-gram counts
before computing the probability estimates:
P(wi | wi-2 wi-1) = max(c(wi-2, wi-1, wi) - D, 0) / c(wi-2, wi-1) + λ(wi-2, wi-1) · P(wi | wi-1)
where λ(wi-2, wi-1) is a normalization weight that makes the distribution sum to one.
In modified Kneser-Ney smoothing, which is one of the most widely used techniques, different
discounting factors D1, D2, D3+ are used for n-grams with exactly one, two, or three or more
counts:
Y = n1 / (n1 + 2·n2)
D1 = 1 - 2Y·(n2 / n1)
D2 = 2 - 3Y·(n3 / n2)
D3+ = 3 - 4Y·(n4 / n3)
where n1, n2, … are the numbers of n-grams occurring exactly once, twice, and so on in the
training data.
Another common way of smoothing language model estimates is linear model interpolation.
In linear interpolation, M models are combined as:
P(w | h) = Σ λm · Pm(w | h), summed over m = 1, …, M, where the interpolation weights satisfy λm ≥ 0 and Σ λm = 1.
In language modeling, θ = <P(w1), …, P(wK)> (where K is the vocabulary size) for a unigram model.
For an n-gram model, θ = <P(w1 | h), …, P(wK | h)>, with one such distribution for each history h
of the specified length.
The Bayesian criterion finds the expected value of θ given the sample S:
θ̂ = E[θ | S] = ∫ θ · p(θ | S) dθ, where p(θ | S) ∝ P(S | θ) · p(θ)
Assuming that the prior distribution is uniform, the MAP estimate is equivalent to the
maximum-likelihood estimate.
The Bayesian estimate is equivalent to the maximum-likelihood estimate with Laplace smoothing:
P(wi) = (c(wi) + 1) / (N + K), where N is the number of words in the training sample and K is the
vocabulary size.
Different choices for the prior distribution lead to different estimation functions.
The most commonly used prior distribution in language modeling is the Dirichlet distribution.
The Dirichlet distribution is the conjugate prior of the multinomial distribution. It is defined as:
p(θ | α1, …, αK) = [Γ(Σ αi) / ∏ Γ(αi)] · ∏ θi^(αi - 1)
where Γ is the gamma function and α1, …, αK are the parameters of the Dirichlet distribution.
It can also be thought of as counts derived from an a priori training sample.
The estimate under the Dirichlet prior is:
P(wi) = (c(wi) + αi) / (N + Σ αj)
Large-Scale Language Models: For web-scale training data, exact smoothing is often replaced with
simpler schemes such as "stupid backoff", in which the relative frequency is used directly whenever
the n-gram count exceeds the minimum threshold (in this case 0), and otherwise the score backs off
to the lower-order estimate scaled by α.
The α parameter is fixed for all contexts rather than being dependent on the lower-order n-gram.
An alternative possibility is to use large-scale distributed language models at a second pass
rescoring stage only, after first-pass hypotheses have been generated using a smaller language
model.
The overall trend in large-scale language modeling is to abandon exact parameter estimation of
the type described in favor of approximate techniques.
Continuous Space: In this type of statistical model, words are represented as a non-linear combination
of weights in a neural network. The process of assigning a weight vector to a word is known as word
embedding. This type of model proves helpful in scenarios where the data set of words continues to
grow and includes unique words.
In cases where the data set is large and consists of rarely used or unique words, linear models such
as n-grams do not work well. This is because, as the vocabulary grows, the number of possible word
sequences increases, and thus the patterns predicting the next word become weaker.
2. Neural Language Models
These language models are based on neural networks and are often considered as an advanced
approach to execute NLP tasks. Neural language models overcome the shortcomings of classical
models such as n-gram and are used for complex tasks such as speech recognition or machine
translation.
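As a minimal illustration of the neural approach, the sketch below trains a feed-forward bigram language model in PyTorch on a toy corpus. Real neural language models use longer contexts, recurrence, or attention, so this is only a schematic example with invented data.

# A minimal feed-forward neural bigram LM in PyTorch: embed the previous
# word, project to vocabulary logits, train with cross-entropy.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
V, EMB = len(vocab), 16

model = nn.Sequential(nn.Embedding(V, EMB), nn.Linear(EMB, V))
optim = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# (previous word -> next word) training pairs
x = torch.tensor([vocab[w] for w in corpus[:-1]])
y = torch.tensor([vocab[w] for w in corpus[1:]])

for _ in range(100):                 # a few training steps
    optim.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optim.step()

probs = torch.softmax(model(torch.tensor([vocab["the"]])), dim=-1)
# "the" precedes both "cat" and "mat" once, so P(cat | the) should
# approach about 0.5 after training.
print(probs[0, vocab["cat"]].item())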
Language is significantly complex and keeps on evolving. Therefore, the more complex the language
model is, the better it would be at performing NLP tasks. Compared to the n-gram model, an
exponential or continuous space model proves to be a better option for NLP tasks because they are
designed to handle ambiguity and language variation.
Meanwhile, language models should be able to manage dependencies. For example, a model should
be able to understand words derived from different languages.
Some Common Examples of Language Models
Language models are the cornerstone of Natural Language Processing (NLP) technology. We have
been making the best of language models in our routine, without even realizing it. Let’s take a look at
some of the examples of language models.
1. Speech Recognition
Voice assistants such as Siri and Alexa are examples of how language models help machines in
processing speech audio.
2. Machine Translation
Google Translate and Microsoft Translator are examples of how NLP models can help in translating
one language to another.
3. Sentiment Analysis
This helps in analyzing the sentiment behind a phrase. This use case of NLP models appears in
products that allow businesses to understand a customer's intent behind opinions or attitudes
expressed in text. HubSpot's Service Hub is an example of how language models can help in
sentiment analysis.
4. Text Suggestions
Google services such as Gmail or Google Docs use language models to help users get text suggestions
while they compose an email or create long text documents, respectively.
5. Parsing Tools
Parsing involves analyzing sentences or words that comply with syntax or grammar rules. Spell
checking tools are perfect examples of language modelling and parsing.