UNIT - I
Finding the Structure of Words: Words and Their
Components, Issues and Challenges, Morphological Models
NLP
Natural Language Processing
NLP stands for Natural Language Processing, a field at the intersection of Computer
Science, human language, and Artificial Intelligence.
It is the ability of a computer program to understand human language, referred to
as natural language.
It is a component of Artificial Intelligence.
It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
Spam Detection
Sentiment Analysis
Machine Translation
Spelling correction
Speech Recognition
Chatbot
Information extraction
Autocorrection
NLP Process or NLP Pipeline
Segmentation:
The first process of the pipeline is segmentation. Sentence segmentation or text
segmentation is dividing the given text into logically decipherable units of
information. We divide the text into its constituent sentences,
usually around punctuation like full stops or commas, or along line
breaks.
Dividing a document into its constituent sentences allows us to process it
without losing its essence and the necessary information that it contains.
Example:
Before segmentation
This is the first sentence. This is the second sentence. This is the third
sentence.
After segmentation
This is the first sentence.
This is the second sentence.
This is the third sentence.
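As a rough sketch, the segmentation step above can be approximated by splitting on sentence-final punctuation. This is a toy illustration; real segmenters (e.g. NLTK's sent_tokenize) also handle abbreviations such as "Dr." that this regex would split wrongly.

```python
import re

def segment(text):
    # Split after sentence-final punctuation (., !, ?) followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "This is the first sentence. This is the second sentence. This is the third sentence."
print(segment(text))
# → ['This is the first sentence.', 'This is the second sentence.', 'This is the third sentence.']
```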
Tokenization:
Tokenization is the process of dividing a sentence into its constituent words.
The sentence that is given to us will be separated and all the words in the
sentence will be stored separately. This is done so we can understand the
syntactic and semantic information contained in each sentence. Thus, we
analyse the sentence word by word.
The computer does not understand punctuation and special characters.
Hence, we can remove any punctuation and special characters which may
occur.
Before tokenization
This is the first sentence.
After tokenization
This
is
the
first
sentence
.
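The tokenization and punctuation-removal steps above can be sketched with a simple regular expression (a toy version; library tokenizers handle contractions, hyphens, and other edge cases more carefully):

```python
import re

def tokenize(sentence):
    # Keep runs of letters, digits, and apostrophes; drop punctuation and spaces.
    return re.findall(r"[A-Za-z0-9']+", sentence)

print(tokenize("This is the first sentence."))
# → ['This', 'is', 'the', 'first', 'sentence']
```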
Stemming
Stemming is a process of obtaining the word stem of a word. Word stems are
also known as the base form of a word; new words are created by attaching
affixes to a stem in a process known as inflection. Stemming removes affixes
such as -ing, -s, -ed.
Example
Word Stem
program program
programming program
programmer program
programs program
programmed program
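The table above can be reproduced with a toy suffix-stripping stemmer. This is only an illustration of the idea, not the Porter algorithm; real stemmers apply ordered rewrite rules with conditions on the remaining stem.

```python
def toy_stem(word):
    # Strip one common inflectional suffix, then undo consonant doubling
    # (programm- -> program). A sketch, not a real stemming algorithm.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

for w in ["program", "programming", "programmer", "programs", "programmed"]:
    print(w, "->", toy_stem(w))
```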
Lemmatization
Lemmatization is the process of figuring out the root form or root word, which
is nothing but the most basic form, also known as the lemma of each word in
the sentence. Lemmatization is very similar to stemming, where we remove
word affixes to get the base form of a word.
The difference is that the root word (lemma) is always a word which is present
in the dictionary, but the stem may not be. Lemmatization uses a knowledge
base called WordNet. Let's consider three different words.
Example
Going
Went
Gone
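The idea can be sketched with a hand-written lemma table; the table below is a hypothetical stand-in for a real knowledge base such as WordNet, which a real lemmatizer would consult instead.

```python
# Minimal lemma lookup: all inflected forms of "go" map to the dictionary word "go".
LEMMAS = {"going": "go", "goes": "go", "went": "go", "gone": "go"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word.lower(), word)

for w in ["Going", "Went", "Gone"]:
    print(w, "->", lemmatize(w))
# → Going -> go / Went -> go / Gone -> go
```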
POS tagging
Part of speech tagging identifies which part of speech each word belongs to: it
tags a word as a verb, noun, pronoun, etc.
The output can be a list of words or a list of tuples, where the tag is a part
of speech tag and signifies whether the word is a noun, adjective, verb, and
so on.
Example:
Learning NLP subject is not that easy
[('Learning', 'VBG'), ('NLP', 'NNP'), ('subject', 'NN'), ('is', 'VBZ'), ('not', 'RB'),
('that', 'IN'), ('easy', 'JJ')]
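A toy lexicon-lookup tagger can produce the tuple list above. This is only a sketch: real taggers (HMMs, neural models) use context, since the same word can take different tags in different sentences.

```python
# Hypothetical hand-written lexicon mapping each word to its most likely tag.
LEXICON = {
    "learning": "VBG", "nlp": "NNP", "subject": "NN",
    "is": "VBZ", "not": "RB", "that": "IN", "easy": "JJ",
}

def pos_tag(tokens):
    # Default to NN (noun) for unknown words, a common baseline choice.
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]

print(pos_tag("Learning NLP subject is not that easy".split()))
# → [('Learning', 'VBG'), ('NLP', 'NNP'), ('subject', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('that', 'IN'), ('easy', 'JJ')]
```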
Named Entity Recognition
Named Entity Recognition classifies words into subcategories such as person,
quantity, location, organization, movie, etc.
Example
The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was
launched into Earth orbit on 5 November 2013 by the Indian Space Research
Organisation (ISRO) and entered Mars orbit on 24 September 2014. India
thus became the first country to enter Mars orbit on its first attempt. It was
completed at a record low cost of $74 million.
The Mars Orbiter Mission PRODUCT
MOM ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY
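Two of the easier entity types in the example, DATE and MONEY, can be matched with patterns. This is only a sketch: real NER systems (e.g. spaCy, whose label set the example above uses) rely on trained statistical models for PERSON, ORG, GPE, and the rest.

```python
import re

# Hypothetical patterns for two surface-regular entity types.
PATTERNS = {
    "DATE": r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
            r"August|September|October|November|December) \d{4}\b",
    "MONEY": r"\$\d+(?:\.\d+)?(?: (?:million|billion))?",
}

def find_entities(text):
    hits = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((m.group(), label))
    return hits

text = "It was launched on 5 November 2013 at a record low cost of $74 million."
print(find_entities(text))
# → [('5 November 2013', 'DATE'), ('$74 million', 'MONEY')]
```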
Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
There are two components of NLP: Natural Language Understanding
(NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU) involves transforming
human language into a machine-readable format. It helps the machine to
understand and analyse human language by extracting information from
large data, such as keywords, emotions, relations, and semantics.
Natural Language Generation (NLG) acts as a translator that converts the
computerized data into natural language representation.
The challenges of NLU
Lexical Ambiguity
Lexical ambiguity exists when a single word in a sentence has two or more
possible meanings. It is word-level ambiguity.
Example:
Manya is looking for a match.
Syntactic Ambiguity
Syntactic ambiguity exists when a sentence has two or more possible
meanings due to its structure. It is sentence-level ambiguity.
Example:
I saw the girl with the binoculars.
Referential Ambiguity
Referential ambiguity exists when it is unclear what a pronoun refers to.
Example: Kiran went to Suresh. He eats an apple. (Who does "he" refer to?)
Natural Language Generation (NLG) converts computerized data into a
natural language representation.
NLG has the following levels
Text planning
o Retrieving relevant content from the knowledge base
Sentence planning
o Choosing words, forming meaningful phrases, and setting the tone
Text realization
o Mapping the sentence plan into sentence structure
NLU is harder than NLG.
LEXICAL ANALYSIS
It is the fundamental stage
Identifies and analyses the structure of words
It is word-level processing
Divides the whole text into paragraphs, sentences, and words
It involves stemming and lemmatization
SYNTACTIC ANALYSIS
Requires syntactic knowledge
Finds the roles played by words in a sentence
Interprets the relationships between words
Interprets the grammatical structure of sentences
SEMANTIC ANALYSIS
Extracts the exact meaning or dictionary meaning from the text
Checks the text for meaningfulness
DISCOURSE ANALYSIS
Requires discourse knowledge: the meaning of a sentence may depend on the
sentences that precede it
PRAGMATIC ANALYSIS
Examines how people communicate with each other and in which context they
are talking
Requires knowledge of the world
NLP Challenges
Elongated words
Shortcuts
Emojis
Mixed use of languages
Ellipsis
Finding the Structure of Words
Words and Their Components
Words are the basic building blocks of a language. We have the following components of words:
Tokens
Lexemes
Morphemes
Typology
Tokens
Tokens are the smaller units created by dividing the text
The process of identifying tokens from the given text is known as tokenization
Tokenization involves segmenting text into smaller units that are analysed individually
The input is text and the output is tokens
Types of Tokenization
Character Tokenization
Word Tokenization
Sentence Tokenization
Sub word Tokenization
Number Tokenization
Character Tokenization
Input: "Today is Monday"
Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace based, punctuation
based)
Sentence Tokenization
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
Sub word Tokenization (frequently used words,
infrequently used words)
Input: unusual
Output: ["un", "usual"]
Number tokenization
She had 100 pencils
LEXEMES
The base or canonical form of a word
The process of finding lexemes is known as lemmatization
MORPHEMES
Words are formed by combining one or more morphemes
The process of finding morphemes in text is known as morphological processing
We have following types of morphemes
1. Free morphemes
2. Bound morphemes
TYPOLOGY
Typology refers to the categorization or classification of languages based on their structural and grammatical features
We have following categories
1. Isolating or analytic languages
2. Synthetic languages
3. Agglutinative languages
Issues and Challenges
Irregularity
Ambiguity
Productivity
Irregularity
If words or word forms follow regular patterns, that is regularity
If words or word forms do not follow regular patterns, that is
irregularity
Ambiguity
A word or word form can have more than one meaning
irrespective of context.
Word forms that look the same but whose meaning is not unique
Ambiguity occurs during morphological processing
Ambiguity can be
o Word sense ambiguity
Meaning depends on the context
o Parts of speech ambiguity
Different part of speech
o Structural ambiguity
Multiple valid syntactic structure
o Referential ambiguity
A pronoun may refer to more than one person or noun
Productivity
Forming new words or word forms using productive rules
e.g. person names, location names, organization names
Morphological Models
Morphological models are used to analyse the structure and formation of words
We have 5 morphological models
Dictionary Lookup
Finite state morphology
Unification based morphology
Functional morphology
Morphological Induction
Dictionary Lookup
Includes
search for the word's base or canonical form in the dictionary
retrieve the associated information
Finite state morphology
Based on formal language theory
Implemented using finite state transducers (FSTs)
Morphemes
Root word affixes
Prefix
In fix
Suffix
success (stem)
un + success (prefix + stem)
success + ful (stem + suffix)
un + success + ful (prefix + stem + suffix)
STEM CHANGES
Some irregular words require stem changes. For example, mouse → mice:
the lexical tape m o u s e aligns with the surface tape m i c e ε
(ε = epsilon, the empty symbol).
An FST has two types of tapes
Surface tape
Lexical tape
Surface tape: c a t s
Lexical tape: c a t +N +Pl
Formally, an FST is defined as a 7-tuple (states, input alphabet, output
alphabet, start state, final states, transition function, output function).
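The two-tape idea can be sketched as a hand-aligned list of (surface, lexical) symbol pairs for "cats" ↔ "cat +N +Pl", where an empty string stands in for epsilon. A real FST would encode this as states and transitions rather than a fixed table.

```python
# Each pair aligns one surface symbol with one lexical symbol ("" = epsilon).
ALIGNED = [("c", "c"), ("a", "a"), ("t", "t"), ("", "+N"), ("s", "+Pl")]

# Reading off the two tapes from the alignment:
surface = "".join(s for s, _ in ALIGNED)
lexical = " ".join(l for _, l in ALIGNED if l)

print(surface)   # → cats
print(lexical)   # → c a t +N +Pl
```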
MORPHEMES TYPES
Basically, two types of morphemes
o Free morphemes
Lexical (open class)
example: cat, run, happy
Functional (closed class)
example: the, and, but
o Bound morphemes
Inflectional
example: -s (plural), -ed (past tense)
Derivational
Class changing: -ness (happy → happiness)
Class maintaining: un- (happy → unhappy)
Finding the Structure of Documents
Corpus
Documents/sentences
Word/tokens
Vocabulary
Segmentation is chunking the input text or speech into blocks: the process of segmenting
a sequence of words into units
I talked to Dr. XYZ and my house is on Mountain Dr. (the same token "Dr."
means Doctor in one place and Drive in the other)
I met Dr. XYZ and he suggested some medicines.
Types of segmentation
Sentence boundary detection
o Optical character recognition
Confused with . and ,
o Automatic speech recognition
Code switching problem
Topic boundary detection
Discourse segmentation / text segmentation
The process of dividing speech or text into homogeneous blocks is
called topic segmentation
Two ways (text segmentation)
1. By following headlines
2. By paragraph breaks
Two ways (speech segmentation)
1. Pause duration
2. Speaker changes
METHODS for sentence boundary and topic boundary detection
1. Generative sequence classification method
2. Discriminative local classification method
Generative sequence classification method
Observations: words & punctuation
Labels: sentence boundary, topic boundary
The Hidden Markov Model (HMM) is a method of the generative approach
How it works
Learn from the data the observations and the corresponding
hidden states (e.g. POS tags)
Predict the label or sequence
Classify new sequences
Ex:
I love coding
After assigning hidden states:
I|pronoun love|verb coding|noun
She|pronoun sells|verb apples|noun
The|determiner quick|adjective brown|adjective fox|noun jumps|verb over|preposition the|determiner lazy|adjective dog|noun
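The generative idea can be sketched by estimating transition and emission counts from a tiny hand-tagged corpus and then decoding greedily. This is a minimal illustration: a full HMM tagger would use Viterbi decoding over the whole sequence and smoothed probabilities, and the corpus here is a hypothetical two-sentence toy.

```python
from collections import defaultdict

corpus = [
    [("i", "PRON"), ("love", "VERB"), ("coding", "NOUN")],
    [("she", "PRON"), ("sells", "VERB"), ("apples", "NOUN")],
]

trans = defaultdict(lambda: defaultdict(int))  # counts for P(tag | previous tag)
emit = defaultdict(lambda: defaultdict(int))   # counts for P(word | tag)
for sent in corpus:
    prev = "<s>"
    for word, tag_ in sent:
        trans[prev][tag_] += 1
        emit[tag_][word] += 1
        prev = tag_

def tag(words):
    prev, out = "<s>", []
    for w in words:
        # Pick the tag maximizing transition count * emission count (greedy).
        best = max(emit, key=lambda t: trans[prev].get(t, 0) * emit[t].get(w, 0))
        out.append((w, best))
        prev = best
    return out

print(tag(["i", "sells", "coding"]))
# → [('i', 'PRON'), ('sells', 'VERB'), ('coding', 'NOUN')]
```

Note how the model generalizes across the two training sentences: "i", "sells", and "coding" never co-occur in one sentence, yet each gets its correct hidden state.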
Discriminative local classification method
Local features: word, prefixes, suffixes, nearby POS tags
Labels: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model (MEMM), SVM
Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition
Complexity of approaches
Quality
Quantity
Computational complexity
Structural complexity
Space
Time
Training
Prediction
Performance of the approaches
Precision
Recall
Accuracy
F1 measure/F1 score
Confusion matrix
(The numbers below assume a confusion matrix with TP = 100, TN = 50, FP = 10, FN = 5.)
Precision
When it predicts Yes, how often is it correct?
TP / predicted Yes = 100/110 = 90.9%
True Negative Rate (Specificity)
When it is actually No, how often does it predict No?
TN / actual No = 50/60 = 83.3%
True Positive Rate (Recall / Sensitivity)
When it is actually Yes, how often does it predict Yes?
TP / actual Yes = 100/105 = 95.2%
Accuracy
How often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
= (50 + 100) / (50 + 10 + 5 + 100) = 150/165 = 90.9%
Misclassification Rate
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
= (5 + 10) / 165 = 15/165 = 9.1%
PRECISION = total number of correct positive predictions / total number of
positive predictions
RECALL = total number of correct positive predictions / total number of
actual positive instances
ACCURACY
How often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
F1 SCORE = 2 × precision × recall / (precision + recall)
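The worked example above (TP = 100, TN = 50, FP = 10, FN = 5) can be recomputed directly from the formulas:

```python
TP, TN, FP, FN = 100, 50, 10, 5

precision = TP / (TP + FP)                      # 100/110
recall = TP / (TP + FN)                         # 100/105
accuracy = (TP + TN) / (TP + TN + FP + FN)      # 150/165
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
# → precision=0.909 recall=0.952 accuracy=0.909 f1=0.930
```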
UNIT -II
Prerequisite: CFG
Syntax Analysis: Parsing Natural Language, Treebanks: A
Data-Driven Approach to Syntax, Representation of Syntactic
Structure, Parsing Algorithms, Models for Ambiguity Resolution
in Parsing, Multilingual Issues
Syntax Analysis /syntactic Analysis
Syntax vs grammar
return_type function_name(parameters); (valid C declaration syntax)
function_name(parameters) return_type; (invalid: wrong order)
function_name(parameters); (function call)
return_type function_name(parameters) (function definition header)
Ramu eats apple (valid English word order)
Eats Ramu apple (invalid word order)
Tree
Parsing
CFG
G = (N, T, P, S)
Productions have the form A → α, where α ∈ (N ∪ T)*
A → B
S → NP VP
NP → {Article (Adj) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → Det N
Ate Ramu apple the (ungrammatical order)
Ramu ate the apple (grammatical)
Treebanks: Brown, Switchboard
Brown corpus tagged example:
At/in the/at same/ap time/nn reaction/nn among/in anti-
organization/jj
At the same time reaction among anti-organization
Penn Treebank
Parsing Natural Language
Examples:
The child ate the cake with the fork
He gave the book to his sister
The dog saw a man in the park
Common Types of Determiners (Det) in NLP:
1. Articles: Specify definiteness.
o Definite: the
Example: the book
o Indefinite: a, an
Example: a cat
2. Demonstratives: Point to specific nouns.
o this, that, these, those
Example: this car, those apples
3. Possessives: Indicate ownership.
o my, your, his, her, its, our, their
Example: my book, their house
4. Quantifiers: Specify quantity or amount.
o some, many, few, all, several, no
Example: many people, few options
5. Interrogative Determiners: Used in questions.
o which, what, whose
Example: which color, whose idea
6. Numbers: Cardinal numbers used before nouns.
o one, two, three
Example: two books, three dogs
Representation of Syntactic Structure
Two types of approaches
Phrase structure graph
o Example
Dependency graph
o Example
Parsing Algorithms
Recursive descent parser
Shift reduce parser
Chart parser
RegExp parser
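As a sketch of the shift-reduce idea, the toy parser below works over POS categories with the grammar S → NP VP, NP → Det N, NP → PN, VP → V NP (a hypothetical mini-grammar in the spirit of the rules above). It greedily shifts one tag at a time and reduces whenever the top of the stack matches a rule; real parsers (e.g. chart parsers) also handle ambiguity and backtracking.

```python
# (right-hand side, left-hand side) pairs of the toy grammar.
RULES = [(("NP", "VP"), "S"), (("Det", "N"), "NP"),
         (("PN",), "NP"), (("V", "NP"), "VP")]

def parse(tags):
    stack = []
    for t in tags:
        stack.append(t)                      # shift
        reduced = True
        while reduced:                       # reduce as long as a rule fires
            reduced = False
            for rhs, lhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    reduced = True
                    break
    return stack

# "Ramu ate the apple" tagged as PN V Det N reduces all the way to S.
print(parse(["PN", "V", "Det", "N"]))
# → ['S']
```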
Models for Ambiguity Resolution in Parsing
John bought a shirt with pockets
Probabilistic Context Free Grammar (PCFG)
Generative model
Discriminative model
HOMONYMY
UNIT - III
Semantic Parsing: Introduction, Semantic Interpretation, System Paradigms, Word
Sense, Systems, Software.
UNIT – IV
Predicate-Argument Structure, Meaning Representation
Systems, Software.
SENTENCE:
A sentence is a group of words which makes complete
sense. A sentence has a subject and a predicate.
Examples
1. He is a student.
2. He goes to school daily.
3. A large number of spectators watched the match.
4. The cat chased the rats
5. John slept
6. The boy ate pizza
SUBJECT:
A subject is a word or a group of words that performs
the action, or that is in the state of being, described by
the verb.
The subject is the person, animal, place, or thing we talk
about in the sentence.
PREDICATE:
The part of the sentence which tells something about
the subject is called the predicate.
Semantic Rules
Agent:
The agent is the doer of an action.
Usually agents are human beings, but they can
also be non-human, such as machines or
creatures.
Patient:
The patient is an entity that undergoes or receives the
action to whom action is passed over
Or
Patient is an entity undergoing the effect of some
action and often undergoing some change in state.
Theme:
Theme is an entity that is moved by the action or
whose location is described
Mary threw the ball
Experiencer:
Experiencer is one who receives, accepts, experiences
or undergoes the effect of an action
John looked at the moon
John saw the moon
Location:
The place where the action happens
Source:
The direction from which the action originated.
Goal:
The place to which something moves, or the thing
towards which an action is directed.
Instrument:
an inanimate thing that an agent uses to implement an
event
Cause:
The entity that causes an event (but does not
deliberately perform the action)
Time:
When an action is performed or when an event took
place
Beneficiary:
the entity that is advantaged or disadvantaged by an
action
Semantic roles
Theta roles
Thematic roles
Theta theory
Thematic theory
Examples:
The baker baked a cake
Mark cooked a banana bread.
The detective interrogated the suspect.
The storm broke a tree.
A large stone crashed the car.
A rock hit Bob.
Dan played guitar.
Mary hit the car.
Mary hit her boss.
Richard saw the eclipse.
Andrei gave students an assignment.
Fred sends a package to Russia.
SRL (Semantic Role Labelling)
Meaning Representation System
Abstract Meaning Representation (AMR):
AMR aims to assign the same AMR to sentences that
have the same basic meaning
AMR makes extensive use of PropBank framesets
AMR is heavily biased towards English; it is not an
interlingua.
AMR have 3 equivalent forms
1. Logical format
A formal representation
2. AMR format
Based on PENMAN notation, easy for human
reading and writing
3. Graph format
For visualization and used by programs
The AMR annotation is more frequently represented as
a single-rooted directed acyclic graph with labelled
nodes (concepts) and edges (relations) among them
Nodes represent the main events and entities that
occur in a sentence, and edges represent semantic
relationships among nodes
Every AMR has a single root node at the top of the
graph, which is considered to be the focus
Each node in the graph
has a variable and represents a semantic concept
(the variable is an instance of the concept, written
variable / concept)
variables are reused if something is referenced
multiple times: re-entrancy
Semantic concepts include PB (PropBank)
framesets and English words
Graph edges denote relations between concepts
Semantic relations include different types of roles,
marked by a colon prefix (:)
Some relations known as constants – get no
variable, just a value
Relation (role) can be inverted (useful for
maintaining a single rooted structure)
It is also possible to convert a role into a concept
by reification (useful to make a relation the focus
of an AMR fragment)
I beg you to excuse me
AMR Notations:
Graph format
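For a concrete picture of the AMR format, here is the standard textbook example for "The boy wants to go" in PENMAN notation (a widely cited example from the AMR literature, shown here as an illustration rather than an annotation of the sentence above). Note the re-entrancy: the variable b is reused as the :ARG0 of both want-01 and go-01.

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```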