Introducing NLP with R 
Charlie Redmon | SupStat Analytics 
Copyright Supstat Inc. All Rights Reserved 
Outline 
· Introduction to NLP
· Foundational Frameworks
· Working with text in R
· Regular Expressions
  - As a pattern-matching device
  - Theoretical connection with finite state automata
  - Application in morphological analysis
· N-gram models
  - Recognizing language
  - Generating language
· Further reading
What is NLP?
· Natural Language Processing
  - Briefly: Building models to facilitate human-computer interaction through language
  - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
· Major sub-disciplines:
  - Speech Recognition/Synthesis
  - Computational Morphology (word structure)
  - Lexical Semantics (word meaning)
  - Computational Syntax (phrase/sentence structure)
  - Compositional Semantics (phrase/sentence meaning)
  - Information Retrieval
Why R?
· R has powerful text processing capabilities
· Many useful NLP-related packages
· Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
Foundational NLP Frameworks
· Turing
  - Turing Machine: Finite State Automaton, Finite State Transducer
· Kleene
  - Regular Expressions
· Chomsky
  - Regular Languages and their relation to natural languages
· Markov
  - N-gram models
  - HMMs
· Shannon
  - Information Theory
  - Noisy Channel, Entropy models
The Workflow
1. Import and manipulate text in R
2. Create data structures facilitating NLP operations
3. Model implementation:
   · Morphological parsing
   · N-gram parsing
   · N-gram language generation
   · ...
Importing text into R
· Primary importing functions: scan(), readLines() 
monty_text = scan('data/grail.txt', what="character", sep="", quote="") 
monty_text[1:6] 
[1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]" 
malayalam_text = scan('data/mathrubhumi_2014-10_full.txt',
                      what="character", sep="", quote="")
malayalam_text[15:20] 
[1] "#Date:" "01-10-2014" 
[3] "#----------------------------------------" "അേമരിkയിെലtിയ" 
[5] "+പധാനമ+nി" "നേര+nേമാദി" 
· Why might this data structure be a problem for many natural language structures? 
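The other import function mentioned above, readLines(), keeps one element per line of the file rather than one per whitespace-delimited token. A minimal sketch, assuming the same grail.txt file:

monty_lines = readLines('data/grail.txt')   # one element per line of the script
length(monty_lines)                         # number of lines, not number of tokens
monty_lines[1]                              # the full first line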
Condensing to a single text stream
monty_text = paste(monty_text, collapse=" ") 
malayalam_text = paste(malayalam_text, collapse=" ") 
length(monty_text); length(malayalam_text) 
[1] 1 
[1] 1 
substr(monty_text, 1, 70) 
[1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c" 
substr(malayalam_text, 304, 400) 
[1] "െത4ായി ഉcരിc് അേdഹെt അനാദരിcുെവn് െക.പി.സി.സി. +പസിഡn് വി.എം. സുധീരD. േമാഹDദ" 
Regular Expressions
SYMBOL  MEANING                               EXAMPLE
[]      Disjunction (set)                     / [Gg]oogle / = Google, google
?       0 or 1 of the preceding character     / savou?r / = savor, savour
*       0 or more of the preceding character  / hey!* / = hey, hey!, hey!!, ...
\       Escape character                      / hey\? / = hey?
+       1 or more of the preceding character  / a+h / = ah, aah, aaah, ...
{n,m}   n to m repetitions                    / a{1,4}h{1,3} / = aahh, ahhh, ...
.       Wildcard (any character)              / #.* / = #rstats, #uofl, ...
()      Grouping                              / (ha)+ / = ha, haha, hahaha, ...
[^ ]    Negation (negates bracketed chars)    / [^#.*] / = any character except #, ., *
Regular Expressions
SYMBOL  MEANING                            EXAMPLE
[x-y]   Match characters from 'x' to 'y'   / [A-Z][1-9] / = A1, Q8, X5, ...
\w      Word character (alphanumeric)      / \w's / = that's, Jerry's, ...
\W      Non-word character
\d      Digit character (0-9)              / \d{3} / = 137, 254, ...
\D      Non-digit character
\s      Whitespace                         / \w+\s+\w+ / = I am, I am, ...
\S      Non-whitespace
\b      Word boundary                      / \bthe\b / = the, not then
\B      Non-word boundary
^       Beginning of line                  / ^[a-z] / = non-capitalized beg.
$       End of line                        / #.*$ / = hashtags at end of line
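To try a few of these metacharacters from R, note that backslashes must be doubled inside string literals (e.g. "\\w" for \w). An illustrative sketch with base R pattern functions:

x = c("Google it", "google it", "savour", "savor", "hey!!", "that's", "A1")
grepl("[Gg]oogle", x)                               # character set: Google or google
grepl("savou?r", x)                                 # ? makes the 'u' optional
regmatches(x, regexpr("\\w+'\\w+", x, perl=TRUE))   # \w: word characters around an apostrophe
regmatches(x, regexpr("[A-Z]\\d", x, perl=TRUE))    # a range [A-Z] followed by a digit \d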
Manual segmentation
The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.
#sentence level
pattern = "(?<=[.?!])\\s+"
monty_sentences = strsplit(monty_text, split=pattern, perl=T)
monty_sentences = unlist(monty_sentences)
monty_sentences[5:8] 
[1] "King of the Britons, defeator of the Saxons, sovereign of all England!" 
[2] "SOLDIER #1: Pull the other one!" 
[3] "ARTHUR: I am, ..." 
[4] "and this is my trusty servant Patsy." 
Manual segmentation
Of course, depending on the language you're working with, you might have different definitions of sentence boundaries. For example, Hindi uses what's called a danda marker, । , in place of a period.
hindi_text = scan('data/hindustan_full.txt', what="character", sep="") 
hindi_text = paste(hindi_text, collapse=" ") 
pattern = "(?<=[।?!])\\s+"
hindi_sentences = strsplit(hindi_text, split=pattern, perl=T) 
hindi_sentences = unlist(hindi_sentences) 
hindi_sentences[5:8] 
[1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।" 
[2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।" 
[3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।" 
[4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।" 
Manual segmentation
We can also split the original text according to word boundaries. 
#word level 
pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
monty_words = strsplit(monty_text, split=pattern, perl=T) 
monty_words = unlist(monty_words) 
monty_words[5:30] 
[1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop" 
[9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR" 
[17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon" 
[25] "from" "the" 
Building a Lexicon
For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. 
Other researchers may have already built a text-formatted lexicon of the language you're using, but 
nevertheless it's useful to see how we might build one. 
#convert all words to lowercase 
monty_words = tolower(monty_words) 
monty_words[1:9] 
[1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa" 
#convert vector of tokens to set of unique words 
monty_lexicon = unique(monty_words) 
monty_lexicon[1:8] 
[1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there" 
Building a Lexicon
length(monty_words) 
[1] 11213 
length(monty_lexicon) 
[1] 1889 
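The gap between roughly 11,000 tokens and 1,900 types is typical of natural language text: a small set of very frequent function words accounts for much of the token count. A quick sketch over the lowercased monty_words vector:

head(sort(table(monty_words), decreasing=TRUE), 8)   # the most frequent word types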
Morphological Analysis
Now that we have our lexicon, we can start to model the internal structure of the words in our corpus.
Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky and Martin (2000).
Morphological Analysis
Since it has already been proven that every regular expression can be modeled as an FSA, and vice versa, we can use the grep utilities in R to handle this process. First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).
monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T) 
monty_agents[1:30] 
[1] "soldier" "uther" "other" "master" "together" "winter" 
[7] "plover" "warmer" "matter" "order" "creeper" "under" 
[13] "cart-master" "customer" "better" "over" "bother" "ever" 
[19] "officer" "her" "water" "power" "mer" "villager" 
[25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper" 
· This isn't exactly what we want. How can we improve our results? 
Morphological Analysis
Take advantage of the lexicon. 
monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T) 
new_monty_agents = character(0)
for (i in 1:length(monty_agents)) {
  word = monty_agents[i]
  stem_end = nchar(word) - 2          # position of the last character before "er"
  stem = substr(word, 1, stem_end)    # candidate stem with "er" removed
  if (is.element(stem, monty_lexicon)) {
    new_monty_agents[i] = word        # keep the word only if its stem is itself in the lexicon
  }
}
new_monty_agents = new_monty_agents[!is.na(new_monty_agents)] 
new_monty_agents 
[1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper" 
Malayalam FSA
N-gram Models
· Based on the Markov model
· At their heart, n-grams answer the question: "What is the likelihood of one word (or character, phrase, sentence...) following another word or sequence of words?"
· The kernel equation:

  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

  where N is the order of the N-gram (i.e. the number of words used to build the grammar)
· For example, given the string "We are the Knights who say, 'Ni!'", a trigram model (N = 3) moves along the string asking: P(Knights | are the), P(who | the Knights), ...
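To ground the equation, here is a minimal sketch estimating one bigram probability by relative frequency from the lowercased monty_words vector (the chosen words, "of" and "the", are purely illustrative):

bigrams = paste(head(monty_words, -1), tail(monty_words, -1))    # consecutive word pairs
sum(bigrams == "of the") / sum(head(monty_words, -1) == "of")    # MLE of P(the | of)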
N-gram Models
library(ngram) 
monty_bigram = ngram(monty_text, n=2) 
get.ngrams(monty_bigram)[1:10] 
[1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged," 
[5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until" 
[9] "trade. I" "down, fell" 
monty_trigram = ngram(monty_text, n=3) 
get.ngrams(monty_trigram)[1:10] 
[1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no" 
[5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!" 
[9] "Shut up! Um," "to a successful" 
N-gram Models
print(monty_bigram, full=TRUE) 
cannot tell, 
suffice {1} | 
away. Just 
ignore {1} | 
not 'is'. 
HEAD {1} | You {2} | Not {1} | 
bowels unplugged, 
And {1} | 
well, Arthur, 
for {1} | 
[twang] Wayy! 
[twang] {1} | 
N-gram Models
print(monty_trigram, full=TRUE) 
a good spanking! 
GIRLS: {1} | 
Oooh! GALAHAD: My 
God! {1} | 
is the capital 
of {1} | 
to you no 
more, {1} | 
Who's that then? 
CART-MASTER: {1} | 
you get back. 
GUARD {1} | 
N-gram Models
babble(monty_bigram, 8) 
[1] "must go too. OFFICER #1: Back. Right away. " 
babble(monty_bigram, 8) 
[1] "I'll do you up a treat mate! GALAHAD: " 
babble(monty_bigram, 8) 
[1] "from just stop him entering the room. GUARD " 
N-gram Models
babble(monty_trigram, 8) 
[1] "were still no nearer the Grail. Meanwhile, King " 
babble(monty_trigram, 8) 
[1] "the Britons. BEDEVERE: My liege! I would be " 
babble(monty_trigram, 8) 
[1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do " 
Further Reading
· Jurafsky and Martin (2008), Speech and Language Processing
· Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
· Gries (2009), Quantitative Corpus Linguistics with R