Introducing NLP with R 
Charlie Redmon | SupStat Analytics 
Copyright Supstat Inc. All Rights Reserved 
Outline 
· Introduction to NLP
· Foundational Frameworks
· Working with text in R
· Regular Expressions
  - As a pattern-matching device
  - Theoretical connection with finite state automata
  - Application in morphological analysis
· N-gram models
  - Recognizing language
  - Generating language
· Further reading
What is NLP?
· Natural Language Processing
  - Briefly: Building models to facilitate human-computer interaction through language
  - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
· Major sub-disciplines:
  - Speech Recognition/Synthesis
  - Computational Morphology (word structure)
  - Lexical Semantics (word meaning)
  - Computational Syntax (phrase/sentence structure)
  - Compositional Semantics (phrase/sentence meaning)
  - Information Retrieval
Why R?
· R has powerful text processing capabilities
· Many useful NLP-related packages
· Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
Foundational NLP Frameworks
· Turing
  - Turing Machine: Finite State Automaton, Finite State Transducer
· Kleene
  - Regular Expressions
· Chomsky
  - Regular Languages and their relation to natural languages
· Markov
  - N-gram models
  - HMMs
· Shannon
  - Information Theory
  - Noisy Channel, Entropy models
The Workflow
1. Import and manipulate text in R
2. Create data structures facilitating NLP operations
3. Model implementation:
   · Morphological parsing
   · N-gram parsing
   · N-gram language generation
   · ...
Importing text into R
· Primary importing functions: scan(), readLines() 
monty_text = scan('data/grail.txt', what="character", sep="", quote="") 
monty_text[1:6] 
[1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]" 
malayalam_text = scan('data/mathrubhumi_2014-10_full.txt',
                      what="character", sep="", quote="")
malayalam_text[15:20] 
[1] "#Date:" "01-10-2014" 
[3] "#----------------------------------------" "അേമരിkയിെലtിയ" 
[5] "+പധാനമ+nി" "നേര+nേമാദി" 
· Why might this data structure be a problem for many natural language structures? 
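The other import function mentioned above, readLines(), keeps one element per line of the file rather than one per whitespace-delimited token. A minimal sketch, assuming the same grail.txt file:

monty_lines = readLines('data/grail.txt')   # one element per line of the script
length(monty_lines)                         # number of lines, not number of tokens
monty_lines[1]                              # the full first line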
Condensing to a single text stream
monty_text = paste(monty_text, collapse=" ") 
malayalam_text = paste(malayalam_text, collapse=" ") 
length(monty_text); length(malayalam_text) 
[1] 1 
[1] 1 
substr(monty_text, 1, 70) 
[1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c" 
substr(malayalam_text, 304, 400) 
[1] "െത4ായി ഉcരിc് അേdഹെt അനാദരിcുെവn് െക.പി.സി.സി. +പസിഡn് വി.എം. സുധീരD. േമാഹDദ" 
Regular Expressions
SYMBOL  MEANING                               EXAMPLE
[]      Disjunction (set)                     / [Gg]oogle / = Google, google
?       0 or 1 of the preceding character     / savou?r / = savor, savour
*       0 or more of the preceding character  / hey!* / = hey, hey!, hey!!, ...
\       Escape character                      / hey\? / = hey?
+       1 or more of the preceding character  / a+h / = ah, aah, aaah, ...
{n,m}   n to m repetitions                    / a{1,4}h{1,3} / = aahh, ahhh, ...
.       Wildcard (any character)              / #.* / = #rstats, #uofl, ...
()      Grouping                              / (ha)+ / = ha, haha, hahaha, ...
[^ ]    Negation (negates bracketed chars)    / [^#.*] / = any character except #, ., *
Regular Expressions
SYMBOL  MEANING                            EXAMPLE
[x-y]   Match characters from 'x' to 'y'   / [A-Z][1-9] / = A1, Q8, X5, ...
\w      Word character (alphanumeric)      / \w's / = that's, Jerry's, ...
\W      Non-word character
\d      Digit character (0-9)              / \d{3} / = 137, 254, ...
\D      Non-digit character
\s      Whitespace                         / \w+\s+\w+ / = I am, I am, ...
\S      Non-whitespace
\b      Word boundary                      / \bthe\b / = the, not then
\B      Non-word boundary
^       Beginning of line                  / ^[a-z] / = non-capitalized beg.
$       End of line                        / #.*$ / = hashtags at end of line
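To try a few of these metacharacters from R, note that backslashes must be doubled inside string literals (e.g. "\\w" for \w). An illustrative sketch with base R pattern functions:

x = c("Google it", "google it", "savour", "savor", "hey!!", "that's", "A1")
grepl("[Gg]oogle", x)                               # character set: Google or google
grepl("savou?r", x)                                 # ? makes the 'u' optional
regmatches(x, regexpr("\\w+'\\w+", x, perl=TRUE))   # \w: word characters around an apostrophe
regmatches(x, regexpr("[A-Z]\\d", x, perl=TRUE))    # a range [A-Z] followed by a digit \d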
Manual segmentation
The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.
#sentence level
pattern = "(?<=[.?!])\\s+"
monty_sentences = strsplit(monty_text, split=pattern, perl=T)
monty_sentences = unlist(monty_sentences)
monty_sentences[5:8] 
[1] "King of the Britons, defeator of the Saxons, sovereign of all England!" 
[2] "SOLDIER #1: Pull the other one!" 
[3] "ARTHUR: I am, ..." 
[4] "and this is my trusty servant Patsy." 
Manual segmentation
Of course, depending on the language you're working with, you might have different definitions of sentence boundaries. For example, Hindi uses what's called a danda marker, । , in place of a period.
hindi_text = scan('data/hindustan_full.txt', what="character", sep="") 
hindi_text = paste(hindi_text, collapse=" ") 
pattern = "(?<=[।?!])\\s+"
hindi_sentences = strsplit(hindi_text, split=pattern, perl=T) 
hindi_sentences = unlist(hindi_sentences) 
hindi_sentences[5:8] 
[1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।" 
[2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।" 
[3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।" 
[4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।" 
Manual segmentation
We can also split the original text according to word boundaries. 
#word level 
pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
monty_words = strsplit(monty_text, split=pattern, perl=T) 
monty_words = unlist(monty_words) 
monty_words[5:30] 
[1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop" 
[9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR" 
[17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon" 
[25] "from" "the" 
Building a Lexicon
For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. 
Other researchers may have already built a text-formatted lexicon of the language you're using, but 
nevertheless it's useful to see how we might build one. 
#convert all words to lowercase 
monty_words = tolower(monty_words) 
monty_words[1:9] 
[1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa" 
#convert vector of tokens to set of unique words 
monty_lexicon = unique(monty_words) 
monty_lexicon[1:8] 
[1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there" 
Building a Lexicon
length(monty_words) 
[1] 11213 
length(monty_lexicon) 
[1] 1889 
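The gap between roughly 11,000 tokens and 1,900 types is typical of natural language text: a small set of very frequent function words accounts for much of the token count. A quick sketch over the lowercased monty_words vector:

head(sort(table(monty_words), decreasing=TRUE), 8)   # the most frequent word types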
Morphological Analysis
Now that we have our lexicon, we can start to model the internal structure of the words in our corpus.
Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky and Martin (2000).
Morphological Analysis
Since it has already been proven that every regular expression can be modeled as an FSA, and vice versa, we can use the grep utilities in R to handle this process. First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).
monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T) 
monty_agents[1:30] 
[1] "soldier" "uther" "other" "master" "together" "winter" 
[7] "plover" "warmer" "matter" "order" "creeper" "under" 
[13] "cart-master" "customer" "better" "over" "bother" "ever" 
[19] "officer" "her" "water" "power" "mer" "villager" 
[25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper" 
· This isn't exactly what we want. How can we improve our results? 
Morphological Analysis
Take advantage of the lexicon. 
monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T) 
new_monty_agents = character(0)
for (i in 1:length(monty_agents)) {
  word = monty_agents[i]
  stem_end = nchar(word) - 2          # position of the last character before "er"
  stem = substr(word, 1, stem_end)    # candidate stem with "er" removed
  if (is.element(stem, monty_lexicon)) {
    new_monty_agents[i] = word        # keep the word only if its stem is itself in the lexicon
  }
}
new_monty_agents = new_monty_agents[!is.na(new_monty_agents)] 
new_monty_agents 
[1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper" 
Malayalam FSA
N-gram Models
· Based on the Markov model
· At their heart, n-grams answer the question: "What is the likelihood of one word (or character, phrase, sentence...) following another word or sequence of words?"
· The kernel equation:

  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

  where N is the order of the N-gram (i.e. the number of words used to build the grammar)
· For example, given the string "We are the Knights who say, 'Ni!'", a trigram model (N = 3) moves along the string asking: P(Knights | are the), P(who | the Knights), ...
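To ground the equation, here is a minimal sketch estimating one bigram probability by relative frequency from the lowercased monty_words vector (the chosen words, "of" and "the", are purely illustrative):

bigrams = paste(head(monty_words, -1), tail(monty_words, -1))    # consecutive word pairs
sum(bigrams == "of the") / sum(head(monty_words, -1) == "of")    # MLE of P(the | of)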
N-gram Models
library(ngram) 
monty_bigram = ngram(monty_text, n=2) 
get.ngrams(monty_bigram)[1:10] 
[1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged," 
[5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until" 
[9] "trade. I" "down, fell" 
monty_trigram = ngram(monty_text, n=3) 
get.ngrams(monty_trigram)[1:10] 
[1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no" 
[5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!" 
[9] "Shut up! Um," "to a successful" 
N-gram Models
print(monty_bigram, full=TRUE) 
cannot tell, 
suffice {1} | 
away. Just 
ignore {1} | 
not 'is'. 
HEAD {1} | You {2} | Not {1} | 
bowels unplugged, 
And {1} | 
well, Arthur, 
for {1} | 
[twang] Wayy! 
[twang] {1} | 
N-gram Models
print(monty_trigram, full=TRUE) 
a good spanking! 
GIRLS: {1} | 
Oooh! GALAHAD: My 
God! {1} | 
is the capital 
of {1} | 
to you no 
more, {1} | 
Who's that then? 
CART-MASTER: {1} | 
you get back. 
GUARD {1} | 
N-gram Models
babble(monty_bigram, 8) 
[1] "must go too. OFFICER #1: Back. Right away. " 
babble(monty_bigram, 8) 
[1] "I'll do you up a treat mate! GALAHAD: " 
babble(monty_bigram, 8) 
[1] "from just stop him entering the room. GUARD " 
N-gram Models
babble(monty_trigram, 8) 
[1] "were still no nearer the Grail. Meanwhile, King " 
babble(monty_trigram, 8) 
[1] "the Britons. BEDEVERE: My liege! I would be " 
babble(monty_trigram, 8) 
[1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do " 
Further Reading
· Jurafsky and Martin (2008), Speech and Language Processing
· Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
· Gries (2009), Quantitative Corpus Linguistics with R