Lecture 2 - NLP-I

The document provides an overview of Natural Language Processing (NLP), defining it as the interaction between computers and human language, and highlighting its applications such as information extraction and translation. It discusses the challenges of NLP, including ambiguity, word segmentation, and the complexity of grammar, as well as foundational concepts like tokenization and morphology. It also touches on the historical context of NLP, referencing early systems like Eliza and theories such as Chomsky's Universal Grammar.

Natural Language Processing-I

Murari Mandal
Assistant Professor
https://murarimandal.github.io/
https://respailab.github.io/
[Slide images: ambiguous phrases, e.g. "takes vehicle to highway" vs. "retires from", "people want him as cricket president" vs. "wanted by police", "not cheap" vs. "not a small coin", "formula-1 race lap" vs. "lap dance"]
Natural Language Processing (NLP)
Natural Language

● Any language we speak and/or write in our everyday lives to communicate with each other:
● Hindi, English, Spanish, Odia, Bengali…
● Evolves naturally in humans through use and repetition without conscious
design or prefabrication.
NLP
Natural Language Processing

● Interaction between computers and human language.


● NLP is a field of AI that allows computers to understand, interpret, and
generate human language
● Example: extracting information from texts, translating between languages,
answering questions, holding a conversation, taking instructions…

A field of computer science, artificial intelligence, and computational linguistics.

https://lumosbusiness.com/new-tech-nlp-natural-language-processing/
7,000+ languages across the world today

https://www.ethnologue.com/
Turing Test
“Computing Machinery and Intelligence”
Mind, Vol. 59, No. 236, pp. 433-460, 1950

Alan Turing

I propose to consider the question, "Can machines think?"

We can only see a short distance ahead, but we can see plenty there that needs to be done.

https://i.abcnewsfe.com/a/187ba44a-8c4a-4a87-96e7-140a1767243f/TuringTestInfographic_v01_DG_1689800738777_hpEmbed_16x9.jpg
Eliza
● An early NLP system capable of carrying on a limited form of
conversation with a user (1964-1967)

https://en.wikipedia.org/wiki/ELIZA
Universal Grammar
● Chomsky's theory of universal grammar revolutionized language understanding.

● All languages share an innate, common structure, influencing both linguistics and computational models in NLP.

Noam Chomsky
Chomsky Hierarchy
Universal Grammar Hypothesis
Generative Grammar
Transformational-Generative
Grammar (TGG)
NLP is Hard, Why?
Ambiguity

https://www.ifioque.com/linguistic/semantic_ambiguity
NLP is Hard, Why?
Ambiguity

● Ambiguity at multiple levels
● Word senses: bank (finance or river?)
● Part of speech: chair (noun or verb?)
● Syntactic structure: I can see a man with a telescope
● Multiple: I made her duck
NLP is Hard, Why?
मैं उसे देख लूंगा / देखे नेबो

I will see him/her. / I will take care of it.

One morning I shot an elephant in my pajamas.


NLP is Hard, Why?
Words

● Segmenting text into words


● Morphological variation
● Words with multiple meanings: bank, mean
● Domain-specific meanings: latex
● Multiword expressions: make a decision, take out, make up
NLP is Hard, Why?
A ship-shipping ship, shipping shipping-ships

Valid sentence? Yes!


NLP is Hard, Why? Slang, Colloquialisms, and Dialects

Context Sensitivity

World Knowledge and Reasoning

Complexity of Grammar and Syntax

Large Vocabulary
Multilingualism
NLP is Hard, Why?

Tricky Entity Names
"Bill paid the bill"
"Paris is a popular name"
"Apple just released a new iPhone"
"Amazon is known for fast shipping"
"The Amazon rainforest is vast"

Ambiguous Intent in Dialogue

Semantics & Syntax
Every fifteen minutes a woman in this country gives birth.
Our job is to find this woman, and stop her!

We saw the woman with the telescope wrapped in paper.
● Who has the telescope?
● Who or what is wrapped in paper?
● An event of perception, or an assault?

Idioms and Multi-Word Expressions
"Break the ice"
"Bite the bullet"
"A piece of cake"
"Hit the nail on the head"
"Under the weather"
NLP is Hard, Why?

Resource-Intensive Computation

Corpora (Bigger is better)
- A corpus is a collection of text
- Often annotated in some way
- Sometimes just lots of text
- Example: Penn Treebank, 1M words of parsed WSJ

Compute (Scale of the Model)
- Training large-scale NLP models like GPT requires huge datasets and high-performance hardware.
NLP: Building Blocks
NLP = Language understanding + Language generation

Preprocessing: Clean and prepare the text data
Morphological Analysis: Understand the structure of words
Syntactic Analysis: Understand the sentence structure. Parsing, segmentation, POS…
Semantic Analysis: Understand the meanings of words and sentences. NER, WSD…
Pragmatic Analysis: Understand the practical use & intent behind sentences
Discourse and Dialogue: Understand how different parts of a conversation connect/flow together.
Text Generation: Generate meaningful, contextually appropriate text.
Application: Apply the processed information for real-world applications
Phonetics, Phonology
Pronunciation Modeling
Phonetics: study of the physical sounds of human speech

[s] vs. [ʃ] (as in "sip" vs. "ship")


[tʰ] vs. [t] (as in "top" vs. "stop")

Phonology: how sounds function in a particular language

[kæt] ("cat") vs. [bæt] ("bat")


/biru/ ("building") vs. /biiru/ ("beer") (vowel length is contrastive in Japanese)
Words
Unit of a language consisting of one or more characters.

Sentence: "I have a dog."


Words: ["I", "have", "a", "dog"]

Words are not atoms


They are composed of morphemes
[*Morphemes are minimal meaningful components of words]
Prefixes: un-happy
Suffixes: determin-ize
Multiple affixes: mis-understand-ing-s
Tokens
Tokenization: splitting text into smaller units
A token can be a word, a part of a word, or even punctuation marks.

Text sequence                               After tokenization
"I can't go!"                               ["I", "can't", "go", "!"]
"I have 3 apples, and they cost $5!"        ["I", "have", "3", "apples", ",", "and", "they", "cost", "$", "5", "!"]
www.example.com                             ["www", ".", "example", ".", "com"]
"150 dollars, and the discount is 20%."     ["150", "dollars", ",", "and", "the", "discount", "is", "20", "%", "."]
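A rough Python sketch of tokenization along these lines; the regular expression and the simple_tokenize name are illustrative, not a full tokenizer:

    import re

    def simple_tokenize(text):
        # Words (with an optional internal apostrophe), or any single non-space symbol.
        return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

    print(simple_tokenize("I can't go!"))
    # ['I', "can't", 'go', '!']
    print(simple_tokenize("I have 3 apples, and they cost $5!"))
    # ['I', 'have', '3', 'apples', ',', 'and', 'they', 'cost', '$', '5', '!']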
Regular Expressions
A formal language for specifying text strings

How can we search for any of these?

woodchuck
woodchucks
Woodchuck
Woodchucks
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any digit

Ranges [A-Z]

Pattern     Meaning                  Example match
[A-Z]       an upper case letter     "Drenched Blossoms"
[a-z]       a lower case letter      "my beans were impatient"
[0-9]       a single digit           "Chapter 1: Down the Rabbit Hole"
Regular Expressions: Negation in Disjunctions

● Negations [^Ss]
○ Caret means negation only when first in []

Pattern     Meaning                      Example match
[^A-Z]      not an upper case letter     "Oyfn pripetchik"
[^Ss]       neither 'S' nor 's'          "I have no exquisite reason"
[^e^]       neither e nor ^              "Look here"
a^b         the pattern "a caret b"      "Look up a^b now"


Regular Expressions: More Disjunctions

● Woodchucks is another name for groundhog!


● The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c = [abc]                a, b, c
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
Regular Expressions: ? * + .

Pattern     Meaning                        Matches
colou?r     optional previous char         color, colour
oo*h!       0 or more of previous char     oh! ooh! oooh! ooooh!
o+h!        1 or more of previous char     oh! ooh! oooh! ooooh!
baa+        1 or more of previous char     baa baaa baaaa baaaaa
beg.n       . matches any single char      begin begun began beg3n

Kleene * and Kleene + are named after Stephen C. Kleene.
Regular Expressions: ^ $

Pattern Matches

^[A-Z] Palo Alto

^[^A-Za-z] 1 “Hello”

\.$ The end.

.$ The end? The end!

$ Matches the ending position of the string or the position just before a string-ending newline.
In line-based tools, it matches the ending position of any line.
Regular Expressions: Example

● Find me all instances of the word “the” in a text.

the Misses capitalized examples

[tT]he Incorrectly returns other or theology

[^a-zA-Z][tT]he[^a-zA-Z]
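The same progression can be tried with Python's re module; the sample sentence here is made up for illustration:

    import re

    text = "The other day the theology student read the paper."

    print(re.findall(r"the", text))
    # matches inside "other" and "theology", and misses the capitalized "The"
    print(re.findall(r"[tT]he", text))
    # catches "The", but still matches inside other words
    print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
    # requires non-letters around "the"; note it still misses a "The" at the very start of the string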
Regular Expressions: Errors

● The process we just went through was based on fixing two


kinds of errors
○ Matching strings that we should not have matched (there, then,
other)
■ False positives (Type I)
○ Not matching things that we should have matched (The)
■ False negatives (Type II)
Regular Expressions: Errors

● In NLP we are always dealing with these kinds of errors.

● Reducing the error rate for an application often involves two


antagonistic efforts:
○ Increasing accuracy or precision (minimizing false positives)
○ Increasing coverage or recall (minimizing false negatives).
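In the usual terms (TP = true positives, FP = false positives, FN = false negatives), these two quantities are:

    Precision = TP / (TP + FP)    (what fraction of the strings we matched are correct)
    Recall    = TP / (TP + FN)    (what fraction of the correct strings did we match)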
Regular Expressions: Summary

● Regular expressions play a surprisingly large role


○ Sophisticated sequences of regular expressions are often the first model
for any text processing task

● For many hard tasks, we use machine learning classifiers


○ But regular expressions are used as features in the classifiers
○ Can be very useful in capturing generalizations

Basic Text Processing

Every NLP task needs to do text normalization:

1. Segmenting/tokenizing words in running text

2. Normalizing word formats

3. Segmenting sentences in running text


Basic Text Processing: How many words?

I do uh main- mainly business data processing

Fragments, filled pauses

Seuss’s cat in the hat is different from other cats!

Lemma: same stem, part of speech, rough word sense


cat and cats = same lemma

Wordform: the full inflected surface form


cat and cats = different wordforms
Basic Text Processing: How many words?

they lay back on the San Francisco grass and looked at the stars
and their
Type: an element of the vocabulary.

Token: an instance of that type in running text.

How many?

15 tokens (or 14)

13 types (or 12) (or 11?)


Basic Text Processing: How many words?

N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

Church and Gale (1990): |V| > O(N^(1/2))

                                   Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million

Simple Tokenization in UNIX

● (Inspired by Ken Church’s UNIX for Poets.)


● Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < shakes.txt    Change all non-alpha to newlines
| sort                               Sort in alphabetical order
| uniq -c                            Merge and count each type

1945 A            25 Aaron
  72 AARON         6 Abate
  19 ABBESS        1 Abates
   5 ABBOT         5 Abbess
 ...               6 Abbey
                   3 Abbot
                  ...
The first step: Tokenizing

tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: Sorting

tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...
The second step: Sorting & Counting

● Merging upper and lower case


tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
● Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d          ← What happened here?
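A rough Python equivalent of this pipeline (assuming shakes.txt is a local copy of the text). The mystery "d" above comes from contractions such as "I'd", which the non-alpha split breaks apart:

    import re
    from collections import Counter

    with open("shakes.txt", encoding="utf-8") as f:
        text = f.read().lower()                # tr 'A-Z' 'a-z'

    tokens = re.findall(r"[a-z]+", text)       # tr -sc 'A-Za-z' '\n' (apostrophes split words, hence "d")
    counts = Counter(tokens)                   # sort | uniq -c

    for word, freq in counts.most_common(10):  # sort -n -r | head
        print(freq, word)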
Issues in Tokenization

● Finland’s capital → Finland Finlands Finland’s ?


● what’re, I’m, isn’t → What are, I am, is not
● Hewlett-Packard → Hewlett Packard ?
● state-of-the-art → state of the art ?
● Lowercase → lower-case lowercase lower case ?
● San Francisco → one token or two?
● m.p.h., PhD. → ??
Tokenization: language issues

● French
○ L'ensemble → one token or two?
■ L ? L’ ? Le ?
■ Want l’ensemble to match with un ensemble

● German noun compounds are not segmented


○ Lebensversicherungsgesellschaftsangestellter
○ ‘life insurance company employee’
○ German information retrieval needs compound splitter
Tokenization: language issues

● Chinese and Japanese have no spaces between words:


○ 莎拉波娃现在居住在美国东南部的佛罗里达。
○ 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
○ Sharapova now lives in US southeastern Florida

● Further complicated in Japanese, with multiple alphabets intermingled


○ Dates/amounts in multiple formats
Normalization

● Need to “normalize” terms


○ Information Retrieval: indexed text & query terms must have same form.
■We want to match U.S.A. and USA
● We implicitly define equivalence classes of terms
○ e.g., deleting periods in a term
● Alternative: asymmetric expansion:
○ Enter: window Search: window, windows
○ Enter: windows Search: Windows, windows, window
○ Enter: Windows Search: Windows
● Potentially more powerful, but less efficient
Morphology

● Morphemes:
○ The small meaningful units that make up words
○ Stems: The core meaning-bearing units
○ Affixes: Bits and pieces that adhere to stems
■ Often with grammatical functions
Stemming

● Reduce terms to their stems in information retrieval


● Stemming is crude chopping of affixes
○ language dependent
○ e.g., automate(s), automatic, automation all reduced
to automat.

Original: "for example, compressed and compression are both accepted as equivalent to compress."
Stemmed:  "for exampl compress and compress ar both accept as equival to compress"
Porter’s Algorithm: the most common English stemmer

Step 1a
  sses → ss       caresses → caress
  ies  → i        ponies → poni
  ss   → ss       caress → caress
  s    → ø        cats → cat

Step 1b
  (*v*)ing → ø    walking → walk   (but sing → sing)
  (*v*)ed  → ø    plastered → plaster
  …

Step 2 (for long stems)
  ational → ate   relational → relate
  izer    → ize   digitizer → digitize
  ator    → ate   operator → operate
  …

Step 3 (for longer stems)
  al   → ø        revival → reviv
  able → ø        adjustable → adjust
  ate  → ø        activate → activ
  …
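One way to experiment with these rules is NLTK's Porter stemmer (a sketch; the full algorithm applies every step, so some outputs go further than the single-step examples above):

    from nltk.stem import PorterStemmer   # pip install nltk

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "cats", "walking", "plastered",
                 "relational", "digitizer", "operator", "adjustable", "activate"]:
        print(word, "->", stemmer.stem(word))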
Language Model: N-Grams

Markov assumption
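In its standard form, the Markov assumption behind an N-gram model is that the last N-1 words stand in for the whole history:

    P(w_i | w_1 w_2 … w_{i-1}) ≈ P(w_i | w_{i-N+1} … w_{i-1})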
Language Model: N-Grams
Language Model: N-Grams

N-gram LM
Language Model: N-Grams

Unigram Probability
Language Model: N-Grams

Bigram Probability
Language Model: N-Grams

Bigram Probability

Example:
Language Model: N-Grams

Trigram Probability
Language Model: N-Grams

n-gram Probability
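For reference, the standard maximum-likelihood estimates, in the same c(·)/c(·) notation as the P_MLE formula in the smoothing section later:

    Unigram:  P(w_i) = c(w_i) / N
    Bigram:   P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    Trigram:  P(w_i | w_{i-2} w_{i-1}) = c(w_{i-2} w_{i-1} w_i) / c(w_{i-2} w_{i-1})
    n-gram:   P(w_i | w_{i-n+1} … w_{i-1}) = c(w_{i-n+1} … w_i) / c(w_{i-n+1} … w_{i-1})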
Language Model: N-Grams

Sampling from n-gram LMs
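A minimal Python sketch of training and sampling a bigram model; the three-sentence corpus is a stand-in for a real one such as the Berkeley Restaurant sentences on the next slide:

    import random
    from collections import defaultdict, Counter

    corpus = [
        "tell me about chez panisse",
        "i'm looking for a good place to eat breakfast",
        "when is caffe venezia open during the day",
    ]

    # Count bigrams, with <s> and </s> as sentence-boundary markers.
    bigrams = defaultdict(Counter)
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            bigrams[prev][cur] += 1

    def sample_sentence():
        word, out = "<s>", []
        while True:
            counts = bigrams[word]
            # Pick the next word in proportion to its bigram count.
            word = random.choices(list(counts), weights=list(counts.values()))[0]
            if word == "</s>":
                return " ".join(out)
            out.append(word)

    print(sample_sentence())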


Language Model: N-Grams Example
Berkeley Restaurant Project sentences

● can you tell me about any good cantonese restaurants close by
● mid priced thai food is what i’m looking for
● tell me about chez panisse
● can you give me a listing of the kinds of food that are available
● i’m looking for a good place to eat breakfast
● when is caffe venezia open during the day
Language Model: N-Grams Example
Raw bigram counts
● Out of 9222 sentences
Language Model: N-Grams Example
Raw bigram probabilities

● Normalize by unigrams:

● Result:
Language Model: N-Grams Example
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =


P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
Language Model: N-Grams Example
What kinds of knowledge?

● P(english|want) = .0011
● P(chinese|want) = .0065
● P(to|want) = .66
● P(eat | to) = .28
● P(food | to) = 0
● P(want | spend) = 0
● P (i | <s>) = .25
Language Model: N-Grams Example
Practical Issues
● We do everything in log space
○ Avoid underflow

○ (also adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
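The same trick in code; the probability values are illustrative:

    import math

    probs = [0.25, 0.33, 0.0011, 0.5, 0.68]          # illustrative bigram probabilities

    # Multiplying many small probabilities risks underflow; summing their logs does not.
    log_prob = sum(math.log(p) for p in probs)
    print(log_prob, math.exp(log_prob))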


Evaluation: How good is our model?

● Does our language model prefer good sentences to bad ones?


○ Assign higher probability to “real” or “frequently observed” sentences
■ Than “ungrammatical” or “rarely observed” sentences?

● We train parameters of our model on a training set.

● We test the model’s performance on data we haven’t seen.


○ A test set is an unseen dataset that is different from our training set,
totally unused.
○ An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models

● Best evaluation for comparing models A and B


○ Put each model in a task
■ spelling corrector, speech recognizer, MT system

○ Run the task, get an accuracy for A and for B


■ How many misspelled words corrected properly
■ How many words translated correctly

○ Compare accuracy for A and B


Extrinsic evaluation of N-gram models
Difficulty of extrinsic (in-vivo) evaluation of N-gram models

● Extrinsic evaluation
○ Time-consuming; can take days or weeks
● So
○ Sometimes use intrinsic evaluation: perplexity
○ Bad approximation
■ unless the test data looks just like the training data
■ So generally only useful in pilot experiments
○ But is helpful to think about.
Evaluation of LM: Perplexity

● The Shannon Game:
  ○ How well can we predict the next word?

    I always order pizza with cheese and ____
        mushrooms   0.1
        pepperoni   0.1
        anchovies   0.01
        ….
        fried rice  0.0001
        ….
        and         1e-100

    The 33rd President of the US was ____
    I saw a ____

  ○ Unigrams are terrible at this game. (Why?)
● A better model of a text
○ is one which assigns a higher probability to the word that actually occurs
Evaluation of LM: Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:

    PP(W) = P(w_1 w_2 … w_N)^(-1/N)
          = (1 / P(w_1 w_2 … w_N))^(1/N)

Chain rule:

    PP(W) = ( ∏_{i=1}^{N} 1 / P(w_i | w_1 … w_{i-1}) )^(1/N)

For bigrams:

    PP(W) = ( ∏_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)

Minimizing perplexity is the same as maximizing probability; see the sketch below.
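A minimal Python sketch of bigram perplexity on a test set; bigram_prob is a placeholder for whatever (smoothed) estimate the model provides:

    import math

    def perplexity(test_words, bigram_prob):
        """bigram_prob(prev, cur) should return P(cur | prev), e.g. an Add-1 estimate."""
        N = len(test_words)
        log_sum = 0.0
        for prev, cur in zip(["<s>"] + test_words, test_words):
            log_sum += math.log(bigram_prob(prev, cur))
        # PP(W) = P(w_1 … w_N)^(-1/N), computed in log space to avoid underflow.
        return math.exp(-log_sum / N)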


Evaluation of LM: Perplexity

Lower perplexity = better model

● Training 38 million words, test 1.5 million words, WSJ

N-gram Order Unigram Bigram Trigram

Perplexity 962 170 109


Language Model Evaluation

Zero Probabilities

Training set:                   Test set:
… denied the allegations        … denied the offer
… denied the reports            … denied the loan
… denied the claims
… denied the request

P("offer" | denied the) = 0


Language Model: Smoothing

Zero probability bigrams

● Bigrams with zero probability


○ mean that we will assign 0 probability to the test set!

● And hence we cannot compute perplexity (can’t divide by 0)!


Language Model: Smoothing
The intuition of smoothing
● When we have sparse statistics:

  P(w | denied the):
    3 allegations
    2 reports
    1 claims
    1 request
    7 total

● Steal probability mass to generalize better:

  P(w | denied the):
    2.5 allegations
    1.5 reports
    0.5 claims
    0.5 request
    2   other
    7   total
Language Model: Smoothing
Add-one estimation

● Also called Laplace smoothing


● Pretend we saw each word one more time than we did
● Just add one to all the counts!
● MLE estimate:   P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

● Add-1 estimate: P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
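A minimal Python sketch of the Add-1 estimate; the counts and vocabulary size below are made up for illustration:

    from collections import Counter

    def add_one_bigram_prob(prev, cur, bigram_counts, unigram_counts, vocab_size):
        # P_Add-1(cur | prev) = (c(prev, cur) + 1) / (c(prev) + V)
        return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + vocab_size)

    # Toy counts (illustrative, not from the Berkeley Restaurant corpus):
    bigrams = Counter({("denied", "the"): 5, ("the", "allegations"): 3})
    unigrams = Counter({"denied": 5, "the": 8})
    print(add_one_bigram_prob("the", "offer", bigrams, unigrams, vocab_size=1500))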
Language Model: Smoothing
Berkeley Restaurant Corpus: Laplace smoothed bigram counts
Language Model: Smoothing
Laplace-smoothed bigrams
Language Model: Smoothing
Reconstituted counts
Language Model: Smoothing

Compare with raw bigram counts
Evaluation and Metrics
