NLP Notes

The document discusses the key components of natural language processing including finding the structure of words and documents. It covers morphological models, tokenization, lemmatization, parts of speech tagging, and classification algorithms for sentence and topic boundary detection.

Uploaded by

rusma1786


UNIT - I

Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models

Finding the Structure of Documents: Introduction, Methods, Complexity of the
Approaches, Performances of the Approaches

NLP
Natural Language Processing
NLP stands for Natural Language Processing, a field at the intersection of
computer science, human language (linguistics), and artificial intelligence.
• It is the ability of a computer program to understand human language, also
referred to as natural language.
• It is a component of artificial intelligence.
• It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
• Question answering
• Spam detection
• Sentiment analysis
• Machine translation
• Spelling correction
• Speech recognition
• Chatbots
• Information extraction

Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)

NLU (Natural Language Understanding)

NLU has to resolve several kinds of ambiguity:

• Lexical Ambiguity

Lexical ambiguity exists when a single word has two or more possible meanings.

Example:

Manya is looking for a match.

Here "match" could mean a matchstick, a game, or a partner.

• Syntactic Ambiguity
Syntactic ambiguity exists when a sentence has two or more possible meanings
because it can be parsed in more than one way.

Example:

I saw the girl with the binoculars.

Either I used the binoculars to see the girl, or the girl had the binoculars.

• Referential Ambiguity

Referential ambiguity exists when it is unclear what a pronoun refers to.

Example: Kiran went to Suresh. He eats an apple.

In the above example, you do not know who eats the apple, Kiran or Suresh.

NLP Challenges

• Elongated words
• Shortcuts
• Emojis
• Mixed use of languages
• Ellipsis

Phases of NLP

LEXICAL ANALYSIS

• It is the fundamental stage.
• It identifies and analyses the structure of words.
• It is word-level processing.
• It divides the whole text into paragraphs, sentences, and words.
• It involves stemming and lemmatization.

SYNTACTIC ANALYSIS

• Requires syntactic knowledge
• Finds the roles played by words in a sentence
• Interprets the relationships between words
• Interprets the grammatical structure of sentences

SEMANTIC ANALYSIS

• Draws the exact meaning or dictionary meaning from the text
• Checks the text for meaningfulness

DISCOURSE ANALYSIS

• Requires discourse knowledge

PRAGMATIC ANALYSIS

• Deals with how people communicate with each other and the context in which they are talking
• Requires knowledge of the world

Finding the Structure of Words

Words and Their Components

Words are the basic building blocks of a language. The following concepts describe words and their components:

• Tokens
• Lexemes
• Morphemes
• Typology

Tokens

• Tokens are the smaller units (such as words) created by dividing the text.
• The process of identifying tokens in a given text is known as tokenization.
• Tokenization segments text into smaller units that are analysed individually.
• The input is text and the output is a list of tokens.

Types of Tokenization

• Character Tokenization
• Word Tokenization
• Sentence Tokenization
• Subword Tokenization
• Number Tokenization

Character Tokenization

Input: "Today is Monday"

Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]

Word Tokenization (whitespace-based, punctuation-based)
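
A minimal sketch of character and word tokenization in plain Python. The function names and the second example sentence are illustrative assumptions, not part of the original notes; the character tokenizer drops whitespace so that it reproduces the output shown for "Today is Monday".

```python
import re

def char_tokenize(text):
    # One token per character; whitespace is skipped, as in the example above.
    return [ch for ch in text if not ch.isspace()]

def whitespace_tokenize(text):
    # Whitespace-based: split on runs of spaces, tabs, and newlines.
    return text.split()

def punctuation_tokenize(text):
    # Punctuation-based: runs of word characters and single punctuation
    # marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(char_tokenize("Today is Monday"))
print(whitespace_tokenize("Today is Monday, isn't it?"))
print(punctuation_tokenize("Today is Monday, isn't it?"))
```

With whitespace-based splitting, punctuation stays attached to the neighbouring word ("Monday,"), whereas the punctuation-based variant separates it into its own token.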

Sentence Tokenization

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."

Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]

Subword Tokenization (frequently used words are kept whole, infrequently used
words are split into subwords)
Input: "unusual"

Output: ["un", "usual"]
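
One simple way to obtain such a split is greedy longest-match against a subword vocabulary. The sketch below assumes a small hand-written vocabulary (`vocab` is hypothetical); real subword tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation of a word into known subwords."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known subword.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches: fall back to one character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration only.
vocab = {"un", "usual", "token", "ization", "the"}
print(subword_tokenize("unusual", vocab))
print(subword_tokenize("tokenization", vocab))
```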

Number Tokenization

Input: "She had 100 pencils"

Here the number 100 is kept as a single token.

LEXEMES

• A lexeme is the base or canonical form of a word.

• The process of finding lexemes is known as lemmatization.
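
As a rough illustration of lemmatization, the sketch below combines a small table of irregular forms with two plural-stripping rules. Both the table (`IRREGULAR`) and the rules are toy assumptions; they will mis-handle many English words (for example "this"), and practical lemmatizers rely on a full lexicon and part-of-speech information.

```python
# Hypothetical table of irregular forms, for illustration only.
IRREGULAR = {"mice": "mouse", "went": "go", "ate": "eat"}

def lemmatize(word):
    """Return a base form using an exception table plus simple suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # cities -> city
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # pencils -> pencil
    return w

for form in ["pencils", "cities", "mice", "success"]:
    print(form, "->", lemmatize(form))
```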

MORPHEMES

• Words are formed by combining one or more morphemes.

• The process of finding the morphemes in text is known as morphological processing.
• There are two types of morphemes:
1. Free morphemes
2. Bound morphemes

TYPOLOGY

• Typology refers to the classification of languages based on their structural and grammatical features.

• The categories are:
1. Isolating (analytic) languages
2. Synthetic languages
3. Agglutinative languages

Issues and Challenges

• Irregularity
• Ambiguity
• Productivity
Irregularity
• If words or word forms follow regular patterns, that is regularity (for example, walk/walked).
• If words or word forms do not follow regular patterns, that is irregularity (for example, go/went).

Ambiguity
• A word or word form that can have more than one meaning when its context is ignored.
• Word forms that look the same but whose meaning is not unique.
• It arises during morphological processing.
• Ambiguity can be:
o Word sense ambiguity
   • The meaning depends on the context
o Part-of-speech ambiguity
   • The same form can belong to different parts of speech
o Structural ambiguity
   • Multiple valid syntactic structures
o Referential ambiguity
   • Unclear which person or noun is referred to

Productivity
• New words or word forms are created using productive rules.
• Examples: person names, location names, organization names.

Morphological Models
Morphological models are used to analyse the structure and formation of words.

There are five morphological models:

• Dictionary lookup
• Finite state morphology
• Unification-based morphology
• Functional morphology
• Morphological induction

Morphemes: a word consists of a root word plus affixes (prefix, infix, suffix).

Dictionary Lookup
Includes:

• Searching the dictionary for the word form or its base (canonical) form

• Retrieving the information stored for it

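Finite state morphology

Based on formal language theory

The machines used are FSTs (finite state transducers)

Word structure examples:

success = stem
unsuccess = prefix (un) + stem (success)
successful = stem (success) + suffix (ful)
unsuccessful = prefix (un) + stem (success) + suffix (ful)

STEM CHANGES

Some irregular words require stem changes.

[Figure: FST arcs for the regular noun "dog", followed by epsilon or "s", and for the irregular pair "mouse"/"mice", where the stem itself changes]

An FST has two tapes:

• Surface tape
• Lexical tape

Surface tape

c a t s

Lexical tape

c a t +N +Pl

An FST is formally defined as a 7-tuple. One common formulation is (Q, Σ, Δ, q0, F, δ, σ): states, input alphabet, output alphabet, start state, final states, transition function, and output function.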
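
The two-tape idea can be sketched as a transition table that reads the surface tape and writes the lexical tape. This is a hand-built toy that accepts only "cat" and "cats"; the state numbers, the table layout, and the `+Sg` tag for the singular are assumptions made for illustration.

```python
# (state, surface symbol) -> (next state, lexical output)
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+Pl"),   # surface "s" is rewritten as the plural tag
}
# Accepting states and the output emitted when the input ends there.
FINAL_OUTPUT = {3: "+N+Sg", 4: ""}

def analyse(surface):
    """Map a surface form to its lexical form, or None if it is rejected."""
    state, output = 0, []
    for symbol in surface:
        if (state, symbol) not in TRANSITIONS:
            return None
        state, lexical = TRANSITIONS[(state, symbol)]
        output.append(lexical)
    if state not in FINAL_OUTPUT:
        return None
    output.append(FINAL_OUTPUT[state])
    return "".join(output)

print(analyse("cats"))
print(analyse("cat"))
print(analyse("dog"))
```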

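MORPHEME TYPES

There are basically two types of morphemes:

o Free morphemes
   • Lexical (content words)
      Example: book, run
   • Functional (grammatical words)
      Example: and, the
o Bound morphemes
   • Inflectional
      Example: -s in "cats", -ed in "walked"
   • Derivational
      • Class changing, e.g. -ness (happy → happiness)
      • Class maintaining, e.g. un- (happy → unhappy)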

Finding the Structure of Documents

Segmentation is the chunking of input text or speech into blocks.

Types of segmentation

• Sentence boundary detection

o Optical character recognition (OCR)
o Automatic speech recognition (ASR)
• Topic boundary detection

Levels of text: corpus → documents/sentences → words/tokens → vocabulary

Examples for sentence boundary detection:

I met Dr.Xyz and he suggested some medicines.

What is the time now?
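
The first example shows why a period alone is not a reliable boundary: the one after "Dr" does not end the sentence. A minimal rule-based sketch, assuming a small hand-written abbreviation list (`ABBREVIATIONS` is illustrative and far from complete; numbers such as "3.14" are not handled):

```python
import re

# Assumed, illustrative list of abbreviations that take a period.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}

def split_sentences(text):
    """Split on . ? ! unless the period follows a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        # Letters immediately before the punctuation mark, if any.
        before = re.findall(r"[A-Za-z]+$", text[start:match.start()])
        if match.group() == "." and before and before[0].lower() in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "I met Dr.Xyz and he suggested some medicines. What is the time now?"
for sentence in split_sentences(text):
    print(sentence)
```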

Topic boundary detection

• Also called discourse segmentation or text segmentation
• The process of dividing speech or text into topically homogeneous blocks
is called topic segmentation
• Two cues for text segmentation
1. Headlines
2. Paragraph breaks
• Two cues for speech segmentation
1. Pause duration
2. Speaker changes

METHODS for sentence boundary and topic boundary detection

1. Generative sequence classification method
2. Discriminative local classification method

Generative sequence classification method

• Observations: words and punctuation
• Labels: sentence boundary, topic boundary
The Hidden Markov Model (HMM) is a generative method.
How it works
• Learn from the data how observations relate to the corresponding
hidden states (for example, POS tags)
• Predict the most likely label sequence
• Classify the new sequence
Ex:
I love coding
She sells apples
The quick brown fox jumps over the lazy dog
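
To make the HMM idea concrete, here is a small Viterbi decoder that labels each token as followed by a sentence boundary (B) or not (N). The probabilities are invented for illustration rather than learned from data, and the observations are reduced to two classes, "word" and "punct"; all names are assumptions.

```python
import math

STATES = ["N", "B"]  # N = no boundary after the token, B = boundary after it
START = {"N": 0.9, "B": 0.1}
TRANS = {"N": {"N": 0.7, "B": 0.3}, "B": {"N": 0.9, "B": 0.1}}
EMIT = {"N": {"word": 0.9, "punct": 0.1}, "B": {"word": 0.1, "punct": 0.9}}

def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    # Each column maps state -> (best log-probability, previous state).
    columns = [{s: (math.log(START[s] * EMIT[s][observations[0]]), None)
                for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            score, prev = max(
                (columns[-1][p][0] + math.log(TRANS[p][s]), p) for p in STATES
            )
            column[s] = (score + math.log(EMIT[s][obs]), prev)
        columns.append(column)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# "She sells apples ." reduced to observation classes.
print(viterbi(["word", "word", "word", "punct"]))
```

With these assumed parameters the decoder places the boundary after the final punctuation token.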

Discriminative local classification method
Local features: words, prefixes, suffixes, nearby POS tags
Labels: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model (MEMM), support vector machine (SVM)
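
A discriminative classifier decides each candidate boundary from local features such as these. The sketch below only extracts features for one position; the feature names are assumptions, POS tags are left out because they would need a tagger, and the resulting dictionaries would be fed to a classifier such as an SVM.

```python
def boundary_features(tokens, i):
    """Local features for deciding whether a boundary follows tokens[i]."""
    word = tokens[i]
    following = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "word": word.lower(),
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "is_punct": word in {".", "?", "!"},
        "next_is_capitalized": following[:1].isupper(),
    }

tokens = ["She", "sells", "apples", ".", "He", "eats", "apple"]
print(boundary_features(tokens, 3))
```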

Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition

Complexity of approaches
• Quality
• Quantity
• Computational complexity
• Structural complexity
• Space
• Time
• Training
• Prediction

Performance of the approaches

• Precision
• Recall
• Accuracy
• F1 measure / F1 score

Confusion matrix (values implied by the calculations below)

              Predicted No    Predicted Yes    Total
Actual No     TN = 50         FP = 10          60
Actual Yes    FN = 5          TP = 100         105
Total         55              110              165

Precision
When it predicts Yes, how often is it correct?
TP / Predicted Yes
100/110 = 90.9%

True Negative Rate:

When it is actually No, how often does it predict No?

TN / Actual No = 50/60 = 83.3%

True Positive Rate (Recall / Sensitivity):

When it is actually Yes, how often does it predict Yes?
TP / Actual Yes
100/105 = 95.2%

Accuracy
How often is the classifier correct?

(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 = 90.9%

Misclassification Rate:
Overall, how often is it wrong?

(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 = 9.1%

PRECISION = number of correct positive predictions / total number of positive predictions = TP / (TP + FP)
RECALL = number of correct positive predictions / total number of actual positive instances = TP / (TP + FN)
F1 SCORE = 2 * precision * recall / (precision + recall)
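
The figures above (TP = 100, TN = 50, FP = 10, FN = 5) can be checked with a few lines of Python. The function name and the dictionary keys are illustrative assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

For these counts the F1 score is 200/215, about 93.0%, which lies between the precision (90.9%) and the recall (95.2%).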

UNIT - II
Prerequisite: CFG (context-free grammar)

Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven
Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms,
Models for Ambiguity Resolution in Parsing, Multilingual Issues

Types of parsers:
Chart parser
RegEx parser
Shift-reduce parser
Recursive descent parser

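Syntax Analysis / Syntactic Analysis

Syntax vs grammar

Syntax fixes the order in which elements may appear. In a programming language, for example, a function declaration follows the pattern

Return_type Function_name(parameters)

so, read as declarations, the first line below is well formed and the other two are not:

Return_type Function_name(parameters);
Function_name(parameters) return_type;
Function_name(parameter);

In the same way, "Ramu eats apple" follows the expected word order, while "Eats Ramu apple" does not.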
Tree

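Parsing
CFG

A context-free grammar is a 4-tuple G = (N, T, P, S), where N is the set of non-terminals, T the set of terminals, P the set of production rules, and S the start symbol.

Each production has the form
A → α
with A ∈ N and α ∈ (N ∪ T)*, for example A → B.

Example grammar:

S → NP VP
NP → {Article (Adj) N, PRO, PN}
VP → V NP (PP) (Adv)
PP → Prep NP
NP → N
NP → D N

Ate ramu apple the (not derivable from the grammar)

Ramu ate the apple (derivable from the grammar)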
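
A small top-down parser makes the contrast between the two word orders concrete. It is a sketch that covers only the rules S → NP VP, NP → D N, NP → N, and VP → V NP; the lexicon (`LEXICON`) is an assumption added for the example.

```python
# Hypothetical lexicon mapping words to categories.
LEXICON = {"ramu": "N", "apple": "N", "the": "D", "ate": "V"}

def parse_np(tags, i):
    # NP -> D N | N
    if tags[i:i + 2] == ["D", "N"]:
        return ("NP", "D", "N"), i + 2
    if tags[i:i + 1] == ["N"]:
        return ("NP", "N"), i + 1
    return None, i

def parse_vp(tags, i):
    # VP -> V NP
    if tags[i:i + 1] != ["V"]:
        return None, i
    np, j = parse_np(tags, i + 1)
    if np is None:
        return None, i
    return ("VP", "V", np), j

def parse_sentence(sentence):
    # S -> NP VP; the whole input must be consumed.
    words = sentence.lower().split()
    if any(word not in LEXICON for word in words):
        return None
    tags = [LEXICON[word] for word in words]
    np, i = parse_np(tags, 0)
    if np is None:
        return None
    vp, j = parse_vp(tags, i)
    if vp is None or j != len(tags):
        return None
    return ("S", np, vp)

print(parse_sentence("Ramu ate the apple"))
print(parse_sentence("Ate ramu apple the"))
```

The first call returns a nested tuple representing the parse tree; the second returns None because no NP can be found at the start of the input.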

Treebanks and corpora: Brown, Switchboard, Penn Treebank

Tagged text (Brown corpus style, word/tag):

At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj

The same text without tags:

At the same time reaction among anti-organization
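
Text in the word/tag format shown above is easy to read into (word, tag) pairs. A minimal sketch; the function name is an assumption, and the split is made on the last "/" so that words containing a slash stay intact.

```python
def parse_tagged(text):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = "At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj"
pairs = parse_tagged(tagged)
print(pairs)
print(" ".join(word for word, _ in pairs))
```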
Representation of Syntactic Structure
There are two approaches:

• Phrase structure graph

o Example
• Dependency graph
o Example
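
The two representations can be written down as simple data structures for the sentence "Ramu ate the apple". This is an illustrative sketch: the category labels follow the grammar used in these notes, while the dependency relation names (nsubj, obj, det) are an assumed labelling scheme.

```python
# Phrase structure: nested (label, children...) tuples with words as leaves.
phrase_structure = (
    "S",
    ("NP", ("N", "Ramu")),
    ("VP", ("V", "ate"), ("NP", ("D", "the"), ("N", "apple"))),
)

# Dependency graph: (head, dependent, relation) arcs between words.
dependencies = [
    ("ate", "Ramu", "nsubj"),
    ("ate", "apple", "obj"),
    ("apple", "the", "det"),
]

def leaves(tree):
    """Collect the words of a phrase structure tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def root(arcs):
    """The root is the head that never appears as a dependent."""
    dependents = {dep for _, dep, _ in arcs}
    return next(head for head, _, _ in arcs if head not in dependents)

print(leaves(phrase_structure))
print(root(dependencies))
```

The phrase structure groups words into nested constituents, while the dependency graph links each word directly to its head, with the verb as the root.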
