KEMBAR78
2.BasicTextProcessing NEW | PDF | Regular Expression | Bracket
0% found this document useful (0 votes)
20 views39 pages

2.BasicTextProcessing NEW

The document discusses basic text processing techniques in Natural Language Processing (NLP), focusing on regular expressions, text normalization, and the early chatbot ELIZA. It covers the use of regular expressions for pattern matching and text manipulation, as well as the importance of tokenization and normalization in preparing text for analysis. Additionally, it highlights the distinction between word types and tokens, and introduces lemmatization and stemming as methods for reducing words to their base forms.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views39 pages

2.BasicTextProcessing NEW

The document discusses basic text processing techniques in Natural Language Processing (NLP), focusing on regular expressions, text normalization, and the early chatbot ELIZA. It covers the use of regular expressions for pattern matching and text manipulation, as well as the importance of tokenization and normalization in preparing text for analysis. Additionally, it highlights the distinction between word types and tokens, and introduces lemmatization and stemming as methods for reducing words to their base forms.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

Natural Language Processing (NLP)

Basic Text Processing


Based on Jurafsky & Martin (2018), Chapter 2

Katrien Beuls
Artificial Intelligence Laboratory
Vrije Universiteit Brussel

1
TER
BASIC TEXT PROCESSING
Regular Expressions, Text
ELIZA

Normalization, Edit Distance


User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE
User: You are not very aggressive but I think you don’t want me to
notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
User: You don’t argue with me.
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
Weizenbaum (1966)
IZA The dialogue above is from ELIZA, an early natural language processing sys-
tem that could carry on a limited conversation with a user by imitating the responses
of a Rogerian psychotherapist (Weizenbaum, 1966). ELIZA is a surprisingly simple
program that uses pattern matching to recognize phrases like “You are X” and trans-
late them into suitable outputs like “What makes you think I am X?”. This simple
technique succeeds in this domain because ELIZA doesn’t actually need to know 2
BASIC TEXT PROCESSING
REGULAR EXPRESSIONS

I A formal language for specifying text strings

I How can we search for any of


these?
I woodchuck
I woodchucks
I Woodchuck
I Woodchucks

3
REGULAR EXPRESSIONS
DISJUNCTIONS

I Letters inside square brackets: []


Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any digit

I Ranges: [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down ...

4
REGULAR EXPRESSIONS
NEGATION IN DISJUNCTION

I Negations: [ˆSs]
Pattern Matches
[ˆA-Z] Not an upper case letter Oyfn pripetchik
[ˆSs] Neither ‘S’ nor ‘s’ I have no reason
[ˆ\.] Not a period our resident Djinn
aˆb The pattern ‘aˆb’ look up aˆb now

5
REGULAR EXPRESSIONS
MORE DISJUNCTION

I Woodchucks is another name for groundhog!


I The pipe | for disjunction

Pattern Matches
groundhog|woodchuck groundhog
woodchuck
gupp(y|ies) guppy
guppies
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck

6
REGULAR EXPRESSIONS
? *+.

Pattern Matches
colou?r Optional previous char color colour
oo*h! 0 or more of previous char oh! ooh! oooh! ooooh!
o+h! 1 or more of previous char oh! ooh! oooh! ooooh!
baa+ baa baaa baaaa baaaaa
beg.n begin begun began beg3n

7
REGULAR EXPRESSIONS
ANCHORS ˆ $

Pattern Matches
ˆ[A-Z] Palo Alto
ˆ[ˆA-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!

8
REGULAR EXPRESSIONS
EXERCISE

I Find all instances of the word “the” in a text.

I /the/
Misses capitalised examples

I /[tT]he/
Incorrectly returns other or theology

I /[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
Does not return “the” when it begins a line

I /(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/

9
REGULAR EXPRESSIONS
OPERATOR PRECEDENCE HIERARCHY

I From highest to lowest precedence:

Parenthesis ()
Counters * + ? {}
Sequences and anchors the ˆmy end$
Disjunction |

I Counters > Sequences: /the*/ matches theeeee but not


thethe
I Sequences > Disjunction: /the|any/ matches the or any
but not theny

10
REGULAR EXPRESSIONS
SUBSTITUTIONS

I Substitution operator s/regexp1/pattern/ used in Unix


commands like vim or sed allows a string characterized by
a regular expression to be replaced by another string:
s/colour/color
I Referring to a subpart of the string matching the first
pattern: e.g. put angle brackets around all integers in a text:
s/[0-9]+/<\1>/

11
REGULAR EXPRESSIONS
CAPTURE GROUPS

/the (.*)er they were, the \1er they will be/


I Parenthesis used for storing a pattern in memory
= a capture group
I Resulting match is stored in a numbered register
I If you match two different sets of parentheses, \2 means
whatever matched the second capture group
/the (.*)er they (.*), the \1er we \2 /
I Use a non-capturing group if you don’t want to capture the
resulting pattern in a register:
/((?:some|a few) (people|cats) like some \1/

12
REGULAR EXPRESSIONS
ELIZA OR A SIMPLE CHATBOT

I Works by having a cascade of (ranked) regular expression


substitutions
I Input lines are first uppercased
I First substitutions change all instances of MY to YOUR and
I’M to YOU ARE
I The next set of substitutions matches and replaces other
patterns in the input

s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/


s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

13
WORDS
WHAT COUNTS AS A WORD?

I How many words are in the following Brown sentence?


I He stepped out into the hall, was delighted to encounter a
water brother.
I 13 words (15 with punctuation marks)
I How many words are in the following utterance from the
Switchboard corpus?
I I do uh main- mainly business data processing
I Two disfluencies: a fragment (main-) and a filled pause (uh)

14
WORDS
WHAT COUNTS AS A WORD?

I How about inflected forms like cats versus cat?


I They have the same lemma cat but different wordforms
I A lemma is a set of lexical forms having the same stem, the
same major part-of-speech, and the same word sense
I The wordform is the fully inflected or derived form of the
word

15
WORDS
HOW MANY WORDS ARE THERE IN ENGLISH?

I Types are the number of distinct words in a corpus


I If the set of words in the vocabulary is V, the number of
types is the vocabulary size |V|
I Tokens are the total number N of running words
I If we ignore punctuation, the following Brown sentence has
16 tokens and 14 types:
I They picknicked by the pool, then lay back on the grass and
looked at the stars.

16
WORDS
POPULAR ENGLISH LANGUAGE CORPORA

Corpus Tokens = N Types = |V|


Shakespeare 884 thousand 31 thousand
Brown corpus 1 million 38 thousand
Switchboard telephone conversations 2.4 million 20 thousand
COCA 440 million 2 million
Google N-grams 1 trillion 13 million

The relationship between the number of types and number of tokens


is called Herdan’s Law (Herdan, 1960) or Heap’s Law (Heaps, 1978),
where k and β are positive constants and 0 < β < 1:

|V| = kNβ

17
WORDS
DICTIONARY ENTRIES

I Look at the number of lemmas instead of wordform types


I Dictionaries can help in giving lemma counts
I Dictionary entries or boldface forms are a very rough upper
bound on the number of lemmas
I The 1989 edition of the Oxford English Dictionary had
615,000 entries

18
TEXT NORMALIZATION
FIRST NLP TASK

At least three tasks are commonly applied as part of any


normalization process:
1. Segmenting/tokenizing words from running text
2. Normalizing word formats
3. Segmenting sentences in running text

19
TEXT NORMALIZATION
WORD TOKENIZATION

Main challenges:
I Break off punctuation as a separate token but preserve it
when it occurs word internally (Ph.D., AT&T, ...)
I Keep special characters and numbers in prices ($45.55)
and dates (15/02/2019)
I Expand clitic contractions that are marked by apostrophes
(we’re → we are)
I Tokenize multiword expressions like New York or rock ’n’
roll as a single token

20
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
clitic A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
TEXT NORMALIZATION
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
PENN TREEBANK TOKENIZATION
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity detection, the task of detecting names, dates, and
Pennorganizations
Treebank (Chapter 17).
tokenization standard separates out clitics
One commonly used tokenization standard is known as the Penn Treebank to-
(doesn’t
Penn Treebank becomes does plus n’t), keeps(treebanks)
kenization standard, used for the parsed corpora hyphenated releasedwords
by the Lin-
tokenization
together,
guistic and separates
Data Consortium (LDC),outtheall punctuation:
source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words to-
gether, and separates out all punctuation:
Input: “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ The San Francisco-based restaurant , ” they
said , “ does n’t charge $ 10 ” .

Tokens can also be normalized, in which a single normalized form is chosen for
words with multiple forms like USA and US or uh-huh and uhhuh. This standard-
ization may be valuable, despite the spelling information that is lost in the normal-
ization process. For information retrieval, we might want a query for US to match a
document that has USA; for information extraction we might want to extract coherent
information that is consistent across differently-spelled instances.
21
TEXT NORMALIZATION
NORMALIZING TOKENS

Choosing a single normalized form for words with multiple


forms such as USA and US.
I Valuable for information retrieval if you want to query for
US to match a document that has USA
I In information extraction we might want to extract
coherent information that is consistent across
differently-spelled instances

22
TEXT NORMALIZATION
CASE FOLDING

Case folding is another kind of normalization


I For tasks like speech recognition and information retrieval,
everything is mapped to lower case
I For sentiment analysis and other text classification tasks,
information extraction and machine translation, case is
quite helpful and case folding is generally not done

23
TEXT NORMALIZATION
EFFICIENCY

Because tokenization needs to be run before any other


language processing, it is important to be very fast.
I Deterministic algorithms based on regular expressions
compiled into very efficient finite state automata
I Carefully designed deterministic algorithms can deal with
the ambiguities that arise (e.g. the apostrophe)

24
TEXT NORMALIZATION
COLLAPSING WORDS

I Lemmatization is the task of determining that two words


have the same root, despite their surface differences
I How is lemmatization done?
I Most sophisticated methods for lemmatization involve
complete morphological parsing of the word
I Two broad classes of morphemes can be distinguished:
1. Stems - the central morpheme of the word, supplying the
main meaning
2. Affixes - adding “additional” meanings of various kinds

25
involve complete morphological parsing of the word. Morphology is the study of
morpheme the way words are built up from smaller meaning-bearing units called morphemes.
stem Two broad classes of morphemes can be distinguished: stems—the central mor-
affix pheme of the word, supplying the main meaning— and affixes—adding “additional”
meanings
TEXT of various kinds. So, for example, the word fox consists of one morpheme
NORMALIZATION
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. AWORDS
COLLAPSING morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or a Spanish word like amaren (‘if in the future they
would love’) into the morphemes amar ‘to love’, 3PL, and future subjunctive.

The Porter Stemmer


I Stemming is a simpler but cruder method, which mainly
Lemmatization algorithms can be complex. For this reason we sometimes make use
consists of chopping off word-final affixes
of a simpler but cruder method, which mainly consists of chopping off word-final
stemming Iaffixes.
OneThisof naive version of
the most morphological
widely used analysis
stemming stemming. Onefor
is calledalgorithms of
ter stemmer the most widely used stemming algorithms is the Porter (1980). The Porter stemmer
English is the Porter stemmer (1980)
applied to the following paragraph:
This was not the map we found in Billy Bones’s chest, but
C HAPTER 2 • anR EGULAR
accurate copy, complete
E XPRESSIONS , T EXT Nin all things-names
ORMALIZATION , E DIT Dand heights
ISTANCE
and soundings-with the single exception of the red crosses
and the written notes.
produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an
accur copi complet in all thing name and height and sound
with the singl except of the red cross and the written note
cascade The algorithm is based on series of rewrite rules run in series, as a cascade, in
which the output of each pass is fed as input to the next pass; here is a sampling of
the rules:
26
TEXT NORMALIZATION
COLLAPSING WORDS

I The Porter stemmer algorithm is based on a series of


rewrite rules in series, as a cascade

ATIONAL → ATE (e.g. relational → relate)


ING →  if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)

I Detailed rule lists for the Porter stemmer can be found on


Martin Porter’s homepage

27
TEXT NORMALIZATION
SENTENCE SEGMENTATION

Sentence segmentation is another important step in text


processing
I The most useful cues are punctuation (periods, question
marks, exclamation points)
I Periods can be ambiguous: Mr. or Inc.
In general, sentence tokenization methods work by building a
binary classifier that decides if a period is a part of the word or
is a sentence-boundary marker

28
MINIMUM EDIT DISTANCE
STRING SIMILARITY

I Calculating the similarity between two strings is useful in


many NLP tasks, such as spelling correction or
coreference resolution
I The minimum edit distance between two strings is defined
as the minimum number of editing operations (operations
like insertion, deletion, substitution) needed to transform
one string into another

29
s of symbols expressing an operation list for
om string: d for deletion, s for substitution, i fo
MINIMUM EDIT DISTANCE
EXAMPLE ALIGNMENT

INTE*NTION
| | | | | | | | | |
*EXECUTION
d s s i s

ing theI The


minimum edit
gap between distance
intention between
and execution is 5 two strings

peration list for converting the top string into the bot
30
MINIMUM EDIT DISTANCE
LEVENSHTEIN DISTANCE

I We can also assign a particular cost or weight to each of


these operations
I Levenshtein distance: Each insertion or deletion has a
cost of 1 and substitutions are not allowed
I (This is equivalent to allowing substitution, but giving each
substitution a cost of 2 since any substitution can be represented
by one insertion and one deletion)
I Using this metric, the Levenshtein distance between
intention and execution is 8

31
MINIMUM EDIT DISTANCE
ALGORITHM

I How do we find the minimum edit distance?


I We can think of this as a search task, in which we are
searching for the shortest path - a sequence of edits - from
one string to another
I The space of all possible edits is enormous, so we can’t
search naively
I However, lots of distinct edit paths will end up in the same
state (string), so rather than recomputing all those paths,
we could just remember the shortest path to a state each
time we saw it

32
MINIMUM EDIT DISTANCE
DYNAMIC PROGRAMMING

I Dynamic programming is the name for a class of


algorithms, first introduced by Bellman (1957), that apply a
table-driven method to solve problems by combining
solutions to sub-problems
I Some of the most commonly used algorithms in NLP make
use of dynamic programming, such as the Viterbi
algorithm and the CKY algorithm for parsing

33
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi algorithm (Chapter 8) and the
CKY algorithm for parsing (Chapter 11).
MINIMUM EDIT DISTANCE
The intuition of a dynamic programming problem is that a large problem can
be solved by properly
SHORTEST PATH combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.15.

i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.15 Path from intention to execution.

Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from 34
HAPTER 2 MINIMUM
• R EGULAR E EDIT DISTANCE
XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE

ALGORITHM

function M IN -E DIT-D ISTANCE(source, target) returns min-distance

n L ENGTH(source)
m L ENGTH(target)
Create a distance matrix distance[n+1,m+1]

# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
D[i,0] D[i-1,0] + del-cost(source[i])
for each column j from 1 to m do
D[0,j] D[0, j-1] + ins-cost(target[j])

# Recurrence relation:
for each row i from 1 to n do
for each column j from 1 to m do
D[i, j] M IN( D[i 1, j] + del-cost(source[i]),
D[i 1, j 1] + sub-cost(source[i], target[j]),
D[i, j 1] + ins-cost(target[j]))
# Termination
return D[n,m]

Figure 2.16 The minimum edit distance algorithm, an example of the class of dynamic
35
D[i 1, j 1] + sub-cost(source[i], target[j]),
D[i, j 1] + ins-cost(target[j]))
# Termination
return D[n,m]
MINIMUM EDIT DISTANCE
Figure 2.16 The minimum edit distance algorithm, an example of the class of dynamic
THEprogramming
EDIT DISTANCE MATRIX
algorithms. The various costs can either be fixed (e.g., 8x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).

Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.17 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.16, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.

minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment. 36
MINIMUM EDIT DISTANCE
PRODUCING AN ALIGNMENT

2.6 • S UMMARY 25

# e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i "1 - "2 - "3 - "4 - "5 - "6 - "7 -6 7 8
n "2 - "3 - "4 - "5 - "6 - "7 - "8 "7 - "8 -7
t "3 - "4 - "5 - "6 - "7 - "8 -7 "8 - "9 "8
e "4 -3 4 - 5 6 7 "8 - "9 - " 10 "9
n "5 "4 - "5 - "6 - "7 - "8 - "9 - " 10 - " 11 -" 10
t "6 "5 - "6 - "7 - "8 - "9 -8 9 10 " 11
i "7 "6 - "7 - "8 - "9 - " 10 "9 -8 9 10
o "8 "7 - "8 - "9 - " 10 - " 11 " 10 "9 -8 9
n "9 "8 - "9 - " 10 - " 11 - " 12 " 11 " 10 "9 -8
Figure 2.18 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).

Summary 37
MINIMUM EDIT DISTANCE
EXTENSIONS

I The algorithm allows arbitrary weights on the operations


I For spelling correction, substitutions are more likely to
happen between letters that are next to each other on the
keyboard
I The Viterbi algorithm is a probabilistic extension of
minimum edit distance
I Viterbi computes the “maximum probability alignment” of
one string with another (cf. Ch. 8 on POS tagging)

38
SUMMARY

I The regular expression language is a powerful tool for


pattern-matching.
I Basic operations in regular expressions include concatenation
of symbols, disjunction of symbols ([], |, and ),˙ counters (*, +, and
{n,m}), anchors (ˆ, $) and precedence operators ((,)).
I Word tokenization and normalization are generally done by
cascades of simple regular expressions substitutions or finite
automata.
I The Porter algorithm is a simple and efficient way to do
stemming, stripping off affixes. It does not have high accuracy
but may be useful for some tasks.
I The minimum edit distance between two strings is the minimum
number of operations it takes to edit one into the other. Minimum
edit distance can be computed by dynamic programming, which
also results in an alignment of the two strings.
39

You might also like