UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the
Approaches, Performances of the Approaches
NLP
Natural Language Processing
NLP stands for Natural Language Processing, a field at the intersection of Computer
Science, human language (linguistics), and Artificial Intelligence.
It is the ability of a computer program to understand human language, also referred to
as natural language.
It is a component of artificial intelligence.
It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
Question Answering
Spam Detection
Sentiment Analysis
Machine Translation
Spelling correction
Speech Recognition
Chatbot
Information extraction
Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
NLU (Natural Language Understanding)
Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings.
Example:
Manya is looking for a match.
Syntactic Ambiguity
Syntactic ambiguity exists when a sentence has two or more possible meanings
because of its structure.
Example:
I saw the girl with the binoculars.
Referential Ambiguity
Referential ambiguity exists when it is unclear what a pronoun refers to.
Example: Kiran went to Suresh. He eats an apple.
In the above sentence, you do not know who "He" refers to: Kiran or Suresh.
Phases of NLP
NLP Challenges
Elongated words
Shortcuts
Emojis
Mixed use of languages
Ellipsis
LEXICAL ANALYSIS
It is the fundamental stage
Identifying and analysing the structure of words
It is word-level processing
Dividing the whole text into paragraphs, sentences, and words
It involves stemming and lemmatization
SYNTACTIC ANALYSIS
Requires syntactic knowledge
Find the roles played by words in a sentence,
Interpret the relationship between words,
Interpret the grammatical structure of sentences.
SEMANTIC ANALYSIS
Finds the exact meaning or dictionary meaning of the text.
Checks the text for meaningfulness.
DISCOURSE ANALYSIS
Requires discourse knowledge
PRAGMATIC ANALYSIS
Deals with how people communicate with each other and the context in which they are talking
Requires knowledge of the world
Finding the Structure of Words
Words and Their Components
Words are the basic building blocks of a language. Words have the following components:
Tokens
Lexemes
Morphemes
Typology
Tokens
Tokens are the units (typically words) created by dividing the text into smaller pieces
The process of identifying tokens in a given text is known as tokenization
Tokenization involves segmenting text into smaller units that are analysed individually.
The input is text and the output is a list of tokens
Types of Tokenization
Character Tokenization
Word Tokenization
Sentence Tokenization
Subword Tokenization
Number Tokenization
Character Tokenization
Input: "Today is Monday"
Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace based, punctuation based)
Sentence Tokenization
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
Subword Tokenization (frequently used words are kept whole; infrequently used
words are split into smaller pieces)
Input: "unusual"
Output: ["un", "usual"]
Morphological Process
Morphemes
Number tokenization
She had 100 pencils
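The tokenization types above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names and the tiny subword vocabulary are assumptions made for the example, the sentence splitter is deliberately naive, and number tokenization is read here as pulling out the numeric tokens.

```python
import re

def char_tokens(text):
    # Character tokenization: every non-space character is a token.
    return [ch for ch in text if not ch.isspace()]

def word_tokens(text):
    # Word tokenization: runs of word characters; punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Naive sentence tokenization: split on whitespace that follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())

def subword_tokens(word, vocab=("un", "usual")):
    # Greedy longest match against a hypothetical subword vocabulary;
    # characters not covered by the vocabulary fall back to single-character pieces.
    pieces, i = [], 0
    while i < len(word):
        match = max((v for v in vocab if word.startswith(v, i)), key=len, default=word[i])
        pieces.append(match)
        i += len(match)
    return pieces

def number_tokens(text):
    # Number tokenization: extract digit sequences.
    return re.findall(r"\d+", text)

print(char_tokens("Today is Monday"))
print(word_tokens("She had 100 pencils."))
print(sentence_tokens("Tokenization is an important NLP task. It helps break down text into smaller units."))
print(subword_tokens("unusual"))
print(number_tokens("She had 100 pencils"))
```

Real subword tokenizers learn their vocabulary from data; the fixed vocabulary here only reproduces the "unusual" example.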
LEXEMES
Base or canonical form of a word
The process of finding lexemes is known as lemmatization.
MORPHEMES
Words are formed by combining one or more morphemes
The process of finding morphemes in text is known as the morphological process
We have the following types of morphemes
1. Free morphemes
2. Bound morphemes
TYPOLOGY
It refers to the categorization or classification of languages based on their structural and grammatical features
We have the following categories
1. Isolating or analytic languages
2. Synthetic languages
3. Agglutinative languages
Issues and Challenges
Irregularity
Ambiguity
Productivity
Irregularity
If words or word forms follow regular patterns, it is regularity
If words or word forms do not follow regular patterns, it is irregularity
Ambiguity
Words or word forms that can have more than one meaning
irrespective of context.
Word forms that look the same but do not have a unique meaning
Occurs during morphological processing
Ambiguity can be
o Word sense ambiguity
Meaning depending on the context
o Parts of speech ambiguity
Different part of speech
o Structural ambiguity
Multiple valid syntactic structures
o Referential ambiguity
Unclear which person or noun is being referred to
Productivity
Forming new words or word forms using productive rules
For example: person names, location names, organization names.
Morphological Models
Morphological models are used to analyse the structure and formation of words
We have 5 morphological models
Dictionary Lookup
Finite state morphology
Unification based morphology
Functional morphology
Morphological Induction
Morphemes
Root word and affixes
Prefix
Infix
Suffix
Dictionary Lookup
Includes
searching the dictionary for the word's base form or canonical form
retrieving information about the word
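Dictionary lookup can be sketched as a plain table from word forms to their base form and stored information. The lexicon entries and the feature labels below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical toy lexicon: word form -> (base form, stored information).
LEXICON = {
    "mouse": ("mouse", "N Sg"),
    "mice": ("mouse", "N Pl"),
    "cats": ("cat", "N Pl"),
    "went": ("go", "V Past"),
}

def lookup(word):
    # Return (base form, information) for a known word form, or None if it is not listed.
    return LEXICON.get(word.lower())

print(lookup("Mice"))
print(lookup("dogs"))
```

The second call returns None, which shows the main limitation of this model: a word form that is not in the dictionary cannot be analysed.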
Finite state morphology
Based on formal language theory
Implemented using finite state transducers (FSTs)
Example: building words from a stem and affixes
success: stem
un + success: prefix + stem
success + ful: stem + suffix (successful)
un + success + ful: prefix + stem + suffix (unsuccessful)
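The prefix + stem + suffix breakdown can be sketched with simple string matching. The affix lists and stem set below are hypothetical and cover only the "success" example; a real analyser would use a full lexicon or an FST.

```python
# Hypothetical affix and stem inventories for the "success" example.
PREFIXES = ("un",)
SUFFIXES = ("ful",)
STEMS = {"success"}

def segment(word):
    # Strip one known prefix and one known suffix, then check that a known stem remains.
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return (prefix, stem, suffix) if stem in STEMS else None

print(segment("success"))
print(segment("successful"))
print(segment("unsuccessful"))
```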
STEM CHANGES
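Some irregular words require stem changes
Example: mouse (lexical form) -> mice (surface form)
[Figure: FST arcs for the regular stem d o g followed by epsilon, and for the irregular pair mouse / mice]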
FST has two types of tapes
Surface tape
Lexical tape
Surface tape
c a t s
Lexical tape
c a t N Pl
An FST is defined as a 7-tuple
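The relation between the two tapes can be sketched as a small transducer that reads the lexical tape and writes the surface tape. The state numbers, the arc table, and the extra Sg symbol are assumptions made for this example; only the cat / N / Pl symbols come from the tapes above.

```python
# Arc table: state -> {lexical symbol: (surface output, next state)}.
# An empty output string plays the role of epsilon (nothing is written).
ARCS = {
    0: {"c": ("c", 1)},
    1: {"a": ("a", 2)},
    2: {"t": ("t", 3)},
    3: {"N": ("", 4)},                    # the noun tag has no surface symbol
    4: {"Pl": ("s", 5), "Sg": ("", 5)},   # plural is realised as "s"
}
FINAL_STATES = {5}

def generate(lexical_symbols):
    # Walk the arcs, collecting surface output; fail if no arc matches.
    state, surface = 0, []
    for symbol in lexical_symbols:
        arcs = ARCS.get(state, {})
        if symbol not in arcs:
            return None
        output, state = arcs[symbol]
        surface.append(output)
    return "".join(surface) if state in FINAL_STATES else None

print(generate(["c", "a", "t", "N", "Pl"]))
print(generate(["c", "a", "t", "N", "Sg"]))
```

Running the same arcs in the opposite direction (surface to lexical) would give analysis instead of generation.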
MORPHEMES TYPES
Basically, two types of morphemes
o Free morphemes
Lexical
example
Functional
example
o Bound morphemes
Inflectional
example
Derivational
Class changing
Class maintaining
Finding the Structure of Documents
Segmentation is chunking the input text or speech into blocks
Types of segmentation
Sentence boundary detection
o Optical character recognition
o Automatic speech recognition
Topic boundary detection
Corpus
Documents/sentences
Word/tokens
Vocabulary
I met Dr. Xyz and he suggested some medicines.
What is the time now?
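The example sentences show why sentence boundary detection is not just splitting on every period: the period after "Dr" is not a boundary. A minimal rule-based sketch, assuming a small hand-written abbreviation list (hypothetical and incomplete):

```python
import re

# Hypothetical, incomplete abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}

def split_sentences(text):
    sentences, start = [], 0
    for mark in re.finditer(r"[.!?]", text):
        # Word immediately before the punctuation mark, if any.
        before = re.findall(r"\w+$", text[start:mark.start()])
        if mark.group() == "." and before and before[-1].lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence boundary
        sentences.append(text[start:mark.end()].strip())
        start = mark.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("I met Dr. Xyz and he suggested some medicines. What is the time now?"))
```

Rules like this break on unseen abbreviations, which is one motivation for the statistical methods discussed later in this unit.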
Topic boundary detection
Discourse segmentation / text segmentation
The process of dividing speech or text into homogeneous blocks
is called topic segmentation
Two ways (text segmentation)
1. By following headlines
2. By paragraph breaks
Two ways (speech segmentation)
1. Pause duration
2. Speaker changes
METHODS for sentence boundary and topic boundary
1. Generative sequence classification method
2. Discriminative local classification method
Generative sequence classification method
Observations: words & punctuations
Labels: sentence boundary, topic boundary
Hidden Markov Model (HMM) is a method of the generative approach
How it works
Learns from the data about the observations and the corresponding
hidden states (e.g., POS tags)
Predicts the label sequence
Classifies the new sequence
Ex:
I Love Coding
She sells apples
the quick brown fox jumps over the lazy dog
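The prediction step of an HMM is usually done with the Viterbi algorithm, which finds the most probable hidden label sequence for a sequence of observations. The sketch below labels each token as B (sentence boundary) or N (no boundary). All probabilities are made-up toy values, not learned from data, and reducing each token to "word" or "punct" is a simplification assumed for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy parameters (assumed for illustration only).
STATES = ["N", "B"]
START = {"N": 0.9, "B": 0.1}
TRANS = {"N": {"N": 0.8, "B": 0.2}, "B": {"N": 0.9, "B": 0.1}}
EMIT = {"N": {"word": 0.95, "punct": 0.05}, "B": {"word": 0.1, "punct": 0.9}}

tokens = "I love coding . She sells apples .".split()
observations = ["punct" if t in ".!?" else "word" for t in tokens]
labels = viterbi(observations, STATES, START, TRANS, EMIT)
print(list(zip(tokens, labels)))
```

In a trained model the start, transition, and emission tables would be estimated from labelled text.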
Discriminative local classification method
Local feature: word, prefixes, suffixes, nearby POS
Label: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model, SVM
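A discriminative classifier decides each position separately from local features. The feature extraction step can be sketched as below; the feature names are assumptions for the example, and POS features are left out because they would need a tagger. The resulting dictionary would be passed to a classifier such as a maximum entropy model or an SVM.

```python
def local_features(tokens, i):
    # Features describing token i and its immediate neighbours.
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "word": word.lower(),
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "prev_word": prev_word.lower(),
        "next_word": next_word.lower(),
        "next_is_capitalized": next_word[:1].isupper(),
    }

tokens = ["I", "met", "Dr", ".", "Xyz"]
print(local_features(tokens, 3))
```

For the period at position 3, the previous word "dr" is a strong hint that this is not a sentence boundary even though the next word is capitalized.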
Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition
Complexity of approaches
Quality
Quantity
Computational complexity
Structural complexity
Space
Time
Training
Prediction
Performance of the approaches
Precision
Recall
Accuracy
F1 measure/F1 score
Confusion matrix
Precision
When it predicts Yes, how often is it correct?
TP / predicted Yes
100/110 = 90.9%
True Negative Rate (Specificity):
When it is actually No, how often does it predict No?
TN / actual No = 50/60 = 83.3%
True Positive Rate (Recall/Sensitivity):
When it is actually Yes, how often does it predict Yes?
TP / actual Yes
100/105 = 95.2%
Accuracy
Overall, how often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 = 90.9%
Misclassification Rate:
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 = 9.1%
PRECISION = Total no. of correct positive predictions / Total no. of
positive predictions = TP / (TP + FP)
RECALL = Total no. of correct positive predictions / Total no. of actual positive
instances = TP / (TP + FN)
ACCURACY
How often the classifier is correct
(TN + TP) / (TN + FP + FN + TP)
F1 SCORE = 2 * precision * recall / (precision + recall)
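The measures can be checked with a few lines of Python using the confusion matrix counts from the worked example (TP = 100, FP = 10, FN = 5, TN = 50). The function name and dictionary keys are assumptions for this sketch.

```python
def metrics(tp, fp, fn, tn):
    # Compute the standard measures from confusion matrix counts.
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

for name, value in metrics(tp=100, fp=10, fn=5, tn=50).items():
    print(f"{name}: {value:.3f}")
```

With these counts the F1 score is 200/215, about 0.930, which lies between precision and recall.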
UNIT -II
Prerequisite: CFG
Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven
Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms,
Models for Ambiguity Resolution in Parsing, Multilingual Issues
Chart parser
RegEx parser
Shift reduce parser
Recursive parser
Syntax Analysis /syntactic Analysis
Syntax Vs grammar
Return_type Function_name(parameters);
Function_name(parameters) return_type;
Function_name(parameter);
Return_type Function_name(parameters)
Ramu eats apple
Eats ramu apple
Tree
Parsing
CFG
G = (N, T, P, S)
A → α
α ∈ (N ∪ T)*
A → B
S → NP VP
NP → {Article (Adj) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → D N
Ate ramu apple the
RAMU ATE THE APPLE
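A small recursive-descent parser shows how such rules accept the grammatical word order and reject the scrambled one. This sketch covers only a subset of the rules (S → NP VP, NP → D N | N, VP → V NP), and the word-to-category lexicon is an assumption for the example.

```python
# Hypothetical lexicon mapping words to categories.
LEXICON = {"ramu": "N", "apple": "N", "the": "D", "ate": "V"}

def parse_np(tags, i):
    # NP -> D N | N
    if tags[i:i + 2] == ["D", "N"]:
        return ("NP", "D", "N"), i + 2
    if tags[i:i + 1] == ["N"]:
        return ("NP", "N"), i + 1
    return None, i

def parse_vp(tags, i):
    # VP -> V NP
    if tags[i:i + 1] == ["V"]:
        np, j = parse_np(tags, i + 1)
        if np:
            return ("VP", "V", np), j
    return None, i

def parse_sentence(sentence):
    # S -> NP VP; succeed only if every word is consumed.
    tags = [LEXICON.get(word) for word in sentence.lower().split()]
    np, i = parse_np(tags, 0)
    if np:
        vp, j = parse_vp(tags, i)
        if vp and j == len(tags):
            return ("S", np, vp)
    return None

print(parse_sentence("Ramu ate the apple"))
print(parse_sentence("Ate ramu apple the"))
```

The first call returns a tree; the second returns None because no rule sequence derives that word order.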
Brown, Switchboard
At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj
At the same time reaction among anti-organization
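The word/tag format shown above is easy to read programmatically. A minimal sketch that splits each token on its last slash to recover (word, tag) pairs; the variable names are assumptions for the example.

```python
tagged = "At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj"

# Split on the last "/" so that words containing a slash would still be handled.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]

print(pairs)
print(" ".join(words))
```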
Penn treebank
Ate ramu apple the
Representation of Syntactic Structure
Two types of approaches
Phrase structure graph
o Example
Dependency graph
o Example
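The two representations can be compared on one sentence. The sketch below encodes "Ramu ate the apple" as a phrase structure tree (nested tuples) and as a dependency graph (head, relation, dependent triples). The category labels follow the grammar used earlier in this unit; the relation names nsubj, obj, and det are taken from the Universal Dependencies convention, which is an assumption here since the notes do not fix a label set.

```python
# Phrase structure: constituents nested inside constituents.
phrase_structure = (
    "S",
    ("NP", ("N", "Ramu")),
    ("VP", ("V", "ate"), ("NP", ("D", "the"), ("N", "apple"))),
)

# Dependency graph: (head, relation, dependent) arcs between words.
dependencies = [
    ("ate", "nsubj", "Ramu"),
    ("ate", "obj", "apple"),
    ("apple", "det", "the"),
]

def leaves(tree):
    # Read the words back off the tree, left to right.
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]

def root(arcs):
    # The root is the only head that is never a dependent.
    heads = {head for head, _, _ in arcs}
    dependents = {dep for _, _, dep in arcs}
    return (heads - dependents).pop()

print(leaves(phrase_structure))
print(root(dependencies))
```

The tree groups words into phrases, while the dependency graph links each word directly to its head, with the verb as the root.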