NLP Notes
The document provides an overview of Natural Language Processing (NLP), including its definition, applications, and key processes such as segmentation, tokenization, stemming, and lemmatization. It discusses components of NLP like Natural Language Understanding (NLU) and Natural Language Generation (NLG), as well as challenges such as ambiguity and irregularity in language. Additionally, it covers morphological models and syntax analysis, highlighting the importance of understanding word structure and parsing in NLP.


UNIT - I

Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models

NLP
Natural Language Processing
NLP stands for Natural Language Processing, which sits at the intersection of Computer
Science, Linguistics, and Artificial Intelligence.
 The ability of a computer program to understand human language is referred to
as natural language processing.
 It is a component of artificial intelligence.
 It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
 Spam Detection
 Sentiment Analysis
 Machine Translation
 Spelling correction
 Speech Recognition
 Chatbot
 Information extraction
 Autocorrection
NLP Process or NLP Pipeline

Segmentation:

The first step of the pipeline is segmentation. Sentence segmentation or text
segmentation is dividing the given text into logically decipherable units of
information. We divide the text into its constituent sentences, usually
around punctuation such as full stops and commas, or along line breaks.

Dividing a document into its constituent sentences allows us to process it
without losing its essence and the necessary information that it contains.

Example:

Before segmentation
This is the first sentence. This is the second sentence. This is the third
sentence.

After segmentation
This is the first sentence.
This is the second sentence.
This is the third sentence.
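The step above can be sketched in a few lines of Python. This is a minimal illustration, not a production rule set; real segmenters must also handle abbreviations such as "Dr." and decimal points.

```python
import re

def segment_sentences(text):
    """Split text into sentences after terminal punctuation (. ! ?)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("This is the first sentence. This is the second sentence. "
        "This is the third sentence.")
for sentence in segment_sentences(text):
    print(sentence)
```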

Tokenization:

Tokenization is the process of dividing a sentence into its
constituent words. The sentence that is given to us will be separated and all
the words in it will be stored separately. This is done so we can
understand the syntactic and semantic information contained in each
sentence; thus we analyse the sentence word by word.

The computer does not understand punctuation and special characters.
Hence, we can remove any punctuation and special characters that may
occur.

Before tokenization

This is the first sentence.

After tokenization

This
is
the
first
sentence
.
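This step can be sketched with a small regular-expression tokenizer (an illustration; libraries such as NLTK provide word_tokenize for this):

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens; punctuation becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("This is the first sentence."))
# -> ['This', 'is', 'the', 'first', 'sentence', '.']
```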

Stemming
Stemming is a process of obtaining the word stem of a word. Word stems are
also known as the base form of a word; new words are created by attaching
affixes to a stem in a process known as inflection.

Example

Word         Stem

program      program

programming  program

programmer   program

programs     program

programmed   program
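A naive suffix-stripping stemmer reproduces the table above. This is a sketch only; real systems use the Porter stemmer's ordered rule phases.

```python
def stem(word):
    """Strip a common suffix, then collapse a doubled final consonant."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)]
            break
    # programm -> program after a suffix has been removed
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

for w in ["program", "programming", "programmer", "programs", "programmed"]:
    print(w, "->", stem(w))
```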

Lemmatization

Lemmatization is the process of finding the root form, or lemma, of each word in
the sentence: its most basic dictionary form. Lemmatization is very similar to
stemming, where we remove word affixes to get the base form of a word.

The difference is that the lemma is always a word present in the
dictionary, while a stem may not be. Lemmatization therefore uses a knowledge
base called WordNet. Let's consider three different words.

Example

Going

Went

Gone
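The idea can be sketched with a tiny hand-made lemma dictionary standing in for WordNet. The entries below are assumptions for illustration; NLTK's WordNetLemmatizer does this lookup against the real WordNet database.

```python
# Tiny stand-in for a WordNet-backed lemma lookup.
LEMMAS = {"going": "go", "went": "go", "gone": "go",
          "mice": "mouse", "better": "good"}

def lemmatize(word):
    """Return the dictionary lemma of a word, else the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["Going", "Went", "Gone"]:
    print(w, "->", lemmatize(w))
# all three map to the single lemma "go"
```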
POS tagging

Part-of-speech tagging identifies which part of speech a word belongs to. It tags a
word as a verb, noun, pronoun, etc.

Part-of-speech tagging converts a sentence into a different
form: a list of words or a list of (word, tag) tuples. The tag is a part-of-speech
tag and signifies whether the word is a noun, adjective, verb, and
so on.

Example:

Learning NLP subject is not that easy

[('learning', 'VBG'), ('NLP', 'NNP'), ('subject', 'NN'), ('is', 'VBZ'), ('not', 'RB'),
('that', 'IN'), ('easy', 'JJ')]
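A lookup tagger is the simplest sketch of this step. The lexicon below is a hand-made assumption; real taggers such as NLTK's pos_tag use trained statistical models.

```python
# Hand-made mini lexicon (assumed tags for illustration only).
LEXICON = {"learning": "VBG", "NLP": "NNP", "subject": "NN",
           "is": "VBZ", "not": "RB", "that": "IN", "easy": "JJ"}

def tag(tokens):
    """Tag each token from the lexicon; unknown words default to NN."""
    return [(t, LEXICON.get(t, "NN")) for t in tokens]

print(tag("learning NLP subject is not that easy".split()))
```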
Named Entity Recognition

Named entity recognition classifies words into subcategories such as person,
quantity, location, organization, movie, etc.

Example

The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was
launched into Earth orbit on 5 November 2013 by the Indian Space Research
Organisation (ISRO) and entered Mars orbit on 24 September 2014. India
thus became the first country to enter Mars orbit on its first attempt. It was
completed at a record low cost of $74 million.

The Mars Orbiter Mission PRODUCT
MOM ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY
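Two of the easier entity classes above (DATE, MONEY) can even be spotted with regular expressions. This is a sketch; real NER systems use trained sequence models, not patterns.

```python
import re

DATE = re.compile(r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4}\b")
MONEY = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")

text = ("The mission was launched on 5 November 2013 and completed "
        "at a record low cost of $74 million.")
entities = ([(m.group(), "DATE") for m in DATE.finditer(text)] +
            [(m.group(), "MONEY") for m in MONEY.finditer(text)])
print(entities)
```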

Components of NLP

o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
There are two components of NLP: Natural Language Understanding
(NLU) and Natural Language Generation (NLG).

Natural Language Understanding (NLU) involves transforming
human language into a machine-readable format. It helps the machine to
understand and analyse human language by extracting from
large data such things as keywords, emotions, relations, and semantics.

Natural Language Generation (NLG) acts as a translator that converts
computerized data into a natural language representation.

The challenges of NLU

Lexical Ambiguity
Lexical ambiguity is the presence of two or more possible meanings
within a single word. It is word-level ambiguity.

Example:

Manya is looking for a match.

Syntactic Ambiguity
Syntactic ambiguity is the presence of two or more possible
meanings within a sentence. It is sentence-level ambiguity.

Example:

I saw the girl with the binoculars.

Referential Ambiguity
Referential ambiguity arises when something is referred to using a
pronoun.

Example: Kiran went to Suresh. He ate an apple.


NLG has the following levels:

 Text planning
o Retrieve relevant content from the knowledge base

 Sentence planning
o Choosing words, forming meaningful phrases, and setting the tone

 Text realization
o Mapping the sentence plan into sentence structure

NLU is harder than NLG.


LEXICAL ANALYSIS
 It is the fundamental stage
 Identifying and analysing the structure of words
 It is word-level processing
 Dividing the whole text into paragraphs, sentences, and words
 It involves stemming and lemmatization

SYNTACTIC ANALYSIS
 Requires syntactic knowledge
 Find the roles played by words in a sentence
 Interpret the relationships between words
 Interpret the grammatical structure of sentences
SEMANTIC ANALYSIS
 Extracts the exact meaning or dictionary meaning from the text
 Checks the text for meaningfulness
DISCOURSE ANALYSIS
 Requires discourse knowledge
PRAGMATIC ANALYSIS
 How people communicate with each other, and in which context they are
talking
 Requires knowledge of the world

NLP Challenges

 Elongated words
 Shortcuts
 Emojis
 Mixed use of languages
 Ellipsis
Finding the Structure of Words

Words and Their Components

Words are the basic building blocks of a language. We have the following components of words:

 Tokens
 Lexemes
 Morphemes
 Typology

Tokens

 Tokens are words that are created by dividing the text into smaller units
 The process of identifying tokens in the given text is known as tokenization
 Tokenization involves segmenting text into smaller units that are analysed individually
 The input is text and the output is tokens

Types of Tokenization

 Character Tokenization
 Word Tokenization
 Sentence Tokenization
 Sub word Tokenization
 Number Tokenization

Character Tokenization

Input: "Today is Monday"

Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace-based, punctuation-based)

Sentence Tokenization

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."

Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
Sub word Tokenization (keeps frequently used words whole, splits infrequently used words)

Input: unusual

Output: ["un", "usual"]

Number tokenization

She had 100 pencils

LEXEMES

 The base or canonical form of a word
 The process of finding lexemes is known as lemmatization.

MORPHEMES

 Words are formed by combining one or more morphemes
 The process of finding morphemes in text is known as morphological processing
 We have the following types of morphemes
1. Free morphemes
2. Bound morphemes

TYPOLOGY

 Typology refers to the categorization or classification of languages based on structural and grammatical features
 We have the following categories
1. Isolating or analytic languages
2. Synthetic languages
3. Agglutinative languages

Issues and Challenges

 Irregularity
 Ambiguity
 Productivity
Irregularity
 If words or word forms follow regular patterns, that is regularity
 If words or word forms do not follow regular patterns, that is
irregularity

Ambiguity
 Words or word forms that can have more than one meaning
irrespective of context
 Word forms that look the same but whose meaning is not unique
 Occurs during morphological processing
 Ambiguity can be
o Word sense ambiguity
 Meaning depends on the context
 Example: "bank" (river bank vs. financial institution)
o Parts of speech ambiguity
 Different part of speech
o Structural ambiguity
 Multiple valid syntactic structure
o Referential ambiguity
 Referring person or noun

Productivity
 Forming new words or word forms using productive rules
 Examples: person names, location names, organization names.

Morphological Models
Morphological models are used to analyse the structure and formation of words
We have 5 morphological models
 Dictionary Lookup
 Finite state morphology
 Unification based morphology
 Functional morphology
 Morphological Induction

Dictionary Lookup
Includes

the word's base form (canonical form) searched in a dictionary

retrieving the associated information

Finite state morphology

Based on formal language theory

Implemented using finite state transducers (FSTs)

Morphemes

Root word (stem) and affixes:

Prefix
Infix
Suffix

Examples:

success = stem
successful = stem + suffix (success + ful)
unsuccessful = prefix + stem + suffix (un + success + ful)

STEM CHANGES

Some irregular words require stem changes. Example: mouse → mice. The
plural is not formed by adding a suffix; the vowels inside the stem change
(m o u s e → m i c e), so the transducer must rewrite characters within the
stem rather than append an affix.

An FST has two tapes:

 Surface tape
 Lexical tape

Surface tape

c a t s

Lexical tape

c a t +N +Pl

Formally, an FST is defined as a 7-tuple.
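The surface/lexical tape pairing can be sketched as a lookup table. This is a toy stand-in: a real FST composes character-level transitions rather than listing whole forms, and the analyses below are illustrative.

```python
# Toy analyser: surface form -> lexical-tape analysis.
ANALYSES = {"cats": "cat +N +Pl", "cat": "cat +N +Sg", "mice": "mouse +N +Pl"}

def analyse(surface):
    """Map a surface form to its lexical tape, or None if unknown."""
    return ANALYSES.get(surface)

print(analyse("cats"))  # cat +N +Pl
print(analyse("mice"))  # mouse +N +Pl
```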


MORPHEMES TYPES

Basically, there are two types of morphemes:

o Free morphemes
 Lexical (open class)
 Example: nouns, verbs, adjectives
 Functional (closed class)
 Example: prepositions, conjunctions, determiners
o Bound morphemes
 Inflectional
 Example: -s, -ed, -ing
 Derivational
 Class changing
 Class maintaining
Finding structure of Document
Corpus

Documents/sentences

Words/tokens

Vocabulary

Segmentation is chunking the input text or speech into blocks; it is the process of
segmenting a sequence of words into units.

The period is ambiguous between an abbreviation marker and a sentence boundary:

I talked to Dr. XYZ and my house is on Mountain Dr.

I met Dr. XYZ and he suggested some medicines.

Types of segmentation

 Sentence boundary detection
o Optical character recognition
 Confusion between . and ,
o Automatic speech recognition
 Code-switching problem
 Topic boundary detection

Topic boundary detection

 Also called discourse segmentation / text segmentation
 The process of dividing speech or text into homogeneous blocks is
called topic segmentation
 Two cues (text segmentation)
1. Following headlines
2. Paragraph breaks
 Two cues (speech segmentation)
1. Pause duration
2. Speaker changes

METHODS for sentence boundary and topic boundary detection

1. Generative sequence classification method
2. Discriminative local classification method

Generative sequence classification method

 Observations: words & punctuation
 Labels: sentence boundary, topic boundary
The Hidden Markov Model (HMM) is a generative method.
How it works
 Learn from the data the observations and the corresponding
hidden states (e.g. POS tags)
 Predict the label or generate the sequence
 Classify new sequences

Ex:
I love coding

After hidden states

I|pronoun love|verb coding|noun

She|pronoun sells|verb apples|noun

The|determiner quick|adjective brown|adjective fox|noun jumps|verb over|preposition the|determiner lazy|adjective dog|noun
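The generative scoring can be sketched with hand-made HMM probabilities. The numbers below are assumptions for illustration, not values learned from data.

```python
# Transition P(tag | previous tag) and emission P(word | tag), hand-made.
TRANS = {("<s>", "pronoun"): 0.6, ("pronoun", "verb"): 0.7, ("verb", "noun"): 0.5}
EMIT = {("pronoun", "i"): 0.3, ("verb", "love"): 0.1, ("noun", "coding"): 0.05}

def joint_prob(words, tags):
    """P(words, tags) = product of transition * emission probabilities."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= TRANS.get((prev, t), 0.0) * EMIT.get((t, w.lower()), 0.0)
        prev = t
    return p

print(joint_prob(["I", "love", "coding"], ["pronoun", "verb", "noun"]))
```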

Discriminative local classification method

Local features: word, prefixes, suffixes, nearby POS tags
Labels: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model, SVM

Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition

Complexity of approaches
 Quality
 Quantity
 Computational complexity
 Structural complexity
 Space
 Time
 Training
 Prediction
Performance of the approaches
 Precision
 Recall
 Accuracy
 F1 measure/F1 score

Confusion matrix

Precision
When it predicts yes, how often is it correct?
TP / predicted Yes
100/110 ≈ 90.9%

True Negative Rate:
When it is actually no, how often does it predict no?
TN / actual No = 50/60 ≈ 83.3%
True Positive Rate (Recall/Sensitivity):
When it is actually yes, how often does it predict yes?
TP / actual Yes
100/105 ≈ 95.2%

Accuracy
How often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 ≈ 90.9%

Misclassification Rate:
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 ≈ 9.1%
PRECISION = total no. of correct positive predictions / total no. of
positive predictions
RECALL = total no. of correct positive predictions / total no. of actual
positive instances
ACCURACY
How often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
F1 SCORE = 2 × precision × recall / (precision + recall)
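The computations above can be checked in a few lines, using the same confusion-matrix counts (TP=100, TN=50, FP=10, FN=5):

```python
TP, TN, FP, FN = 100, 50, 10, 5

precision = TP / (TP + FP)                   # 100/110 ~ 0.909
recall = TP / (TP + FN)                      # 100/105 ~ 0.952
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 150/165 ~ 0.909
error = (FP + FN) / (TP + TN + FP + FN)      # 15/165  ~ 0.091
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} error={error:.3f} f1={f1:.3f}")
```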

UNIT - II
Prerequisite: CFG

Syntax Analysis: Parsing Natural Language, Treebanks: A
Data-Driven Approach to Syntax, Representation of Syntactic
Structure, Parsing Algorithms, Models for Ambiguity Resolution
in Parsing, Multilingual Issues

Syntax Analysis / Syntactic Analysis

Syntax vs grammar
A C-style function declaration must follow the order
return_type function_name(parameters);
Orderings such as
function_name(parameters) return_type;
violate the syntax, even though the same symbols appear.

Likewise, "Ramu eats apple" follows English word order, while
"Eats ramu apple" does not, although it uses the same words.

Tree

Parsing
CFG

G = (N, T, P, S)

A → α
α ∈ (N ∪ T)*

A → B

S → NP VP
NP → {Article (Adj) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → D N

ate ramu apple the (ungrammatical order)

RAMU ATE THE APPLE (grammatical)

Treebank corpora: Brown, Switchboard, Penn Treebank

Tagged text example (Brown-style tags):
At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj
(At the same time reaction among anti-organization)
Parsing Natural Language

Examples:
The child ate the cake with the fork
He gave the book to his sister
The dog saw a man in the park
Common Types of Determiners (Det) in NLP:
1. Articles: Specify definiteness.
o Definite: the
 Example: the book
o Indefinite: a, an
 Example: a cat
2. Demonstratives: Point to specific nouns.
o this, that, these, those
 Example: this car, those apples
3. Possessives: Indicate ownership.
o my, your, his, her, its, our, their
 Example: my book, their house
4. Quantifiers: Specify quantity or amount.
o some, many, few, all, several, no
 Example: many people, few options
5. Interrogative Determiners: Used in questions.
o which, what, whose
 Example: which color, whose idea
6. Numbers: Cardinal numbers used before nouns.
o one, two, three
 Example: two books, three dogs

Representation of Syntactic Structure

Two types of approaches:
 Phrase structure graph
o Example
 Dependency graph
o Example
Parsing Algorithms

 Recursive descent parser


 Shift reduce parser
 Chart parser
 RegExp parser
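The first of these can be sketched for a toy grammar (S → NP VP, NP → PN | Det N, VP → V NP; the four-word lexicon is a hand-made assumption):

```python
LEX = {"PN": {"ramu"}, "Det": {"the"}, "N": {"apple"}, "V": {"ate"}}

def parse_NP(tokens, i):
    if i < len(tokens) and tokens[i] in LEX["PN"]:
        return i + 1  # NP -> PN
    if i + 1 < len(tokens) and tokens[i] in LEX["Det"] and tokens[i + 1] in LEX["N"]:
        return i + 2  # NP -> Det N
    raise SyntaxError(f"no NP at position {i}")

def parse_VP(tokens, i):
    if i < len(tokens) and tokens[i] in LEX["V"]:
        return parse_NP(tokens, i + 1)  # VP -> V NP
    raise SyntaxError(f"no VP at position {i}")

def accepts(sentence):
    """True iff the whole sentence derives from S -> NP VP."""
    tokens = sentence.lower().split()
    try:
        return parse_VP(tokens, parse_NP(tokens, 0)) == len(tokens)
    except SyntaxError:
        return False

print(accepts("ramu ate the apple"))  # True
print(accepts("ate ramu apple the"))  # False
```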

Models for Ambiguity Resolution in Parsing

John bought a shirt with pockets

Probabilistic Context-Free Grammar (PCFG)
Generative model
Discriminative model

HOMONYMY

UNIT - III
Semantic Parsing: Introduction, Semantic Interpretation, System Paradigms, Word
Sense, Systems, Software.

UNIT – IV
Predicate-Argument Structure, Meaning Representation
Systems, Software.

SENTENCE:

A sentence is a group of words which makes complete
sense. A sentence has a subject and a predicate.

Examples

1. He is a student.
2. He goes to school daily.
3. A large number of spectators watched the match.
4. The cat chased the rats.
5. John slept.
6. The boy ate pizza.

SUBJECT:
The subject is a word or a group of words that performs
the action, or that is in the state of being, described by
the verb.
The subject is the person, animal, place, or thing we talk
about in the sentence.
PREDICATE:

The part of the sentence which tells something about
the subject is called the predicate.

Semantic Roles
Agent:
 The agent is the doer of an action
 Usually agents are human beings, but they can
also be non-human, such as machines or
creatures

Patient:

The patient is the entity that undergoes or receives the
action; the action is passed over to it.
Or

The patient is an entity undergoing the effect of some
action, often undergoing some change in state.

Theme:

Theme is an entity that is moved by the action or


whose location is described

Mary threw the ball

Experiencer:
Experiencer is one who receives, accepts, experiences
or undergoes the effect of an action
John looked at the moon

John saw the moon

Location:
The place where the action happens

Source:

The direction from which the action originated.

Goal:

The place to which something moves, or the thing
towards which an action is directed.

Instrument:

An inanimate thing that an agent uses to bring about an
event.
Cause:

The entity or force that brings an event about without
acting deliberately (it causes, but does not do); answers
"because of what?"

Time:

When an action is performed or when an event took


place

Beneficiary:

The entity that is advantaged or disadvantaged by an
action.

Semantic roles
Theta roles
Thematic roles
Theta theory
Thematic theory

Examples:
The baker baked a cake
Mark cooked a banana bread.
The detective interrogated the suspect.
The storm broke a tree.
A large stone crashed the car.
A rock hit Bob.
Dan played guitar.
Mary hit the car.
Mary hit her boss.
Richard saw the eclipse.
Andrei gave students an assignment.
Fred sends a package to Russia.

SRL (Semantic Role Labelling)

Meaning Representation System

Abstract Meaning Representation (AMR):

AMR aims to assign the same AMR to sentences that
have the same basic meaning.

AMR makes extensive use of PropBank framesets.

AMR is heavily biased towards English; it is not an
interlingua.

AMR has 3 equivalent forms:

1. Logical format
A formal representation
2. AMR format
Based on PENMAN notation, easy for humans to
read and write
3. Graph format
For visualization; used by programs

The AMR annotation is most frequently represented as
a single-rooted directed acyclic graph with labelled
nodes (concepts) and edges (relations) among them.
Nodes represent the main events and entities that
occur in a sentence, and edges represent semantic
relationships among nodes.

Every AMR has a single root node at the top of the
graph, which is considered to be the focus.

Each node in the graph

 has a variable and represents a semantic concept
(variable = instance of concept), written with a slash (/)
 variables are reused if something is referenced
multiple times: re-entrancy
 Semantic concepts include PB (PropBank)
framesets and English words

Graph edges denote relations between concepts


 Semantic relations include different types of roles,
marked by a colon prefix (:)
 Some relations known as constants – get no
variable, just a value
 Relation (role) can be inverted (useful for
maintaining a single rooted structure)
 It is also possible to convert a role into a concept
by reification (useful to make a relation the focus
of an AMR fragment)
I beg you to excuse me
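The sentence above can be written out in AMR (PENMAN) format. The frameset names and role assignments below follow common AMR practice but are a sketch, not an official gold annotation:

```
(b / beg-01
   :ARG0 (i / i)
   :ARG1 (y / you)
   :ARG2 (e / excuse-01
            :ARG0 y
            :ARG1 i))
```

Note the re-entrancy: the variables y and i are reused, since "you" is both the one being begged (:ARG1 of beg-01) and the excuser (:ARG0 of excuse-01).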

AMR Notations:

Graph format