
NATURAL LANGUAGE PROCESSING

Certified Journal
Submitted in partial fulfilment of the
Requirements for the award of the Degree of

MASTER OF SCIENCE
(INFORMATION TECHNOLOGY)
By
Anjali Rameshwar Nimje

DEPARTMENT OF INFORMATION TECHNOLOGY

KERALEEYA SAMAJAM (REGD.) DOMBIVLI’S


MODEL COLLEGE (AUTONOMOUS)
Re-Accredited ‘A’ Grade by NAAC

(Affiliated to University of Mumbai)

FOR THE YEAR

(2023-24)
INDEX
Sr No Practical Name Date Signature
PRACTICAL 1A
Install NLTK
Python 3.9.2 Installation on Windows
Step 1) Go to https://www.python.org/downloads/ and select the latest version for Windows.

Note: If you don't want to download the latest version, you can visit the
download tab and see all releases.

Step 2) Click on the Windows installer (64 bit)

Step 3) Select Customize Installation


Step 4) Click NEXT

Step 5) On the next screen:


1. Select the advanced options.
2. Give a custom install location. Keep the default folder as C:\Program Files\Python39.
3. Click Install.

Step 6) Click the Close button once the installation is done.


Step 7) Open a command prompt window and run the following commands:
C:\Users\Beena Kapadia>pip install --upgrade pip
C:\Users\Beena Kapadia>pip install --user -U nltk
C:\Users\Beena Kapadia>pip install --user -U numpy
C:\Users\Beena Kapadia>python
>>> import nltk
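Once the interpreter opens, a short check can confirm that NLTK works (a minimal sketch; the 'punkt' download and the sample sentence are only examples, not part of the original steps):
>>> nltk.download('punkt')   # tokenizer models used by word_tokenize
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("NLTK is installed and working.")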

PRACTICAL 1B
Convert the given text to speech.

Code:
from playsound import playsound   # pip install playsound
from gtts import gTTS             # pip install gTTS
mytext = "Welcome to Natural Language Processing"
language = "en"                   # English voice
myobj = gTTS(text=mytext, lang=language, slow=False)
myobj.save("myfile.mp3")          # save the generated speech
playsound("myfile.mp3")           # play it back

Output:

PRACTICAL 1C
Convert audio file Speech to Text.

Code:
import speech_recognition as sr   # pip install SpeechRecognition
filename = "male.wav"             # input audio file
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)           # read the entire audio file
    text = r.recognize_google(audio_data)   # send the audio to Google's speech API
    print(text)

Output:

PRACTICAL 2A
Study of various corpora – Brown, Inaugural, Reuters, udhr – with methods like fileids, raw, words, sents, categories.

Code:
import nltk
from nltk.corpus import brown
print ('File ids of brown corpus\n',brown.fileids())
'''Pick out the first of these texts — ca01 — give it a short name,
and find out how many words it contains:'''
ca01 = brown.words('ca01')
# display first few words
print('\nca01 has following words:\n', ca01)
# total number of words in ca01
print('\nca01 has', len(ca01), 'words')

# categories or files
print('\n\nCategories or file in brown corpus:\n')
print(brown.categories())
'''Display other information about each text by looping over all the values of fileid
corresponding to the brown file identifiers listed earlier and then computing
statistics for each text.'''
print('\n\nStatistics for each text:\n')
print('AvgWordLen\tAvgSentenceLen\tno.ofTimesEachWordAppearsOnAvg\tFileName')
for fileid in brown.fileids():
    num_chars = len(brown.raw(fileid))
    num_words = len(brown.words(fileid))
    num_sents = len(brown.sents(fileid))
    num_vocab = len(set([w.lower() for w in brown.words(fileid)]))
    print(int(num_chars/num_words), '\t\t\t', int(num_words/num_sents), '\t\t\t',
          int(num_words/num_vocab), '\t\t\t', fileid)
Output:
PRACTICAL 2C
Study Conditional frequency distributions
Code:
#process a sequence of pairs
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
import nltk
from nltk.corpus import brown
fd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:])
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions())
print(cfd['news'])
print(cfd['romance'])
print(list(cfd['romance']))
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut',
             'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)

Output:
PRACTICAL 2D
Study of tagged corpora with methods like tagged_sents, tagged_words.

Code:
import nltk
from nltk import tokenize
nltk.download('punkt')
nltk.download('words')
para = "Hello! My name is Anjali Nimje. Today you'll be learning NLTK."
sents = tokenize.sent_tokenize(para)
print("\nsentence tokenization\n===================\n",sents)
# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)

Output:
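The listing above only tokenizes raw text. A minimal sketch of the tagged-corpus methods named in the title (tagged_words, tagged_sents), shown here on the Brown corpus as an assumed example, could look like this:

import nltk
nltk.download('brown')
from nltk.corpus import brown
# first few (word, tag) pairs from the 'news' category
print(brown.tagged_words(categories='news')[:10])
# first tagged sentence from the same category
print(brown.tagged_sents(categories='news')[0])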

PRACTICAL 2E
Write a program to find the most frequent noun tags.
Code:
import nltk
from collections import defaultdict
text = nltk.word_tokenize("Nick likes to play football. Nick does not like to play cricket.")
tagged = nltk.pos_tag(text)
print(tagged)
# keep only the words tagged as nouns
addNounWords = []
count = 0
for words in tagged:
    val = tagged[count][1]
    if val == 'NN' or val == 'NNS' or val == 'NNPS' or val == 'NNP':
        addNounWords.append(tagged[count][0])
    count += 1
print(addNounWords)
temp = defaultdict(int)
# memoizing count
for sub in addNounWords:
    for wrd in sub.split():
        temp[wrd] += 1
# getting max frequency
res = max(temp, key=temp.get)
# printing result
print("Word with maximum frequency : " + str(res))
Output:

PRACTICAL 2F
Map Words to Properties Using Python Dictionaries
Code:
# creating and printing a dictionary that maps a key to its properties
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)
print(thisdict["brand"])
print(len(thisdict))
print(type(thisdict))
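The car dictionary above only illustrates the syntax. A hedged sketch that actually maps words to linguistic properties, with property names and values that are purely illustrative, might look like this:

# illustrative mapping from words to linguistic properties
word_props = {
    "run":  {"pos": "verb", "lemma": "run"},
    "dogs": {"pos": "noun", "lemma": "dog"},
}
print(word_props["dogs"]["lemma"])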

PRACTICAL 2G
Study DefaultTagger, Regular expression tagger, UnigramTagger
i)Default Tagger
Code:
import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
from nltk.corpus import treebank
testsentences = treebank.tagged_sents()[1000:]
print(exptagger.evaluate(testsentences))
#Tagging a list of sentences
import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
print(exptagger.tag_sents([['Hi', ','], ['How', 'are', 'you', '?']]))

Output:
ii)Regular Expressions
Code:
from nltk.corpus import brown
from nltk.tag import RegexpTagger
test_sent = brown.sents(categories='news')[0]
regexp_tagger = RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),  # articles
     (r'.*able$', 'JJ'),               # adjectives
     (r'.*ness$', 'NN'),               # nouns formed from adjectives
     (r'.*ly$', 'RB'),                 # adverbs
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # past tense verbs
     (r'.*', 'NN')                     # nouns (default)
    ])
print(regexp_tagger)
print(regexp_tagger.tag(test_sent))

Output:
iii) Unigram Tagger
Code:
# Loading Libraries
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
# Training using first 10 tagged sentences of the treebank corpus as data.
# Using data
train_sents = treebank.tagged_sents()[:10]
# Initializing
tagger = UnigramTagger(train_sents)
# Lets see the first sentence
# (of the treebank corpus) as list
print(treebank.sents()[0])
print('\n',tagger.tag(treebank.sents()[0]))
#Finding the tagged results after training.
tagger.tag(treebank.sents()[0])
#Overriding the context model
tagger = UnigramTagger(model={'Pierre': 'NN'})
print('\n',tagger.tag(treebank.sents()[0]))

Output:
PRACTICAL 3A
Study of Wordnet Dictionary with methods as synsets, definitions,
examples,antonyms.
Code:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
print(wordnet.synsets("computer"))
print(wordnet.synset("computer.n.01").definition())
print("Examples:", wordnet.synset("computer.n.01").examples())
print(wordnet.lemma('buy.v.01.buy').antonyms())

Output:

PRACTICAL 3B
Study lemmas, hyponyms, hypernyms.
Code:
import nltk
from nltk.corpus import wordnet
print(wordnet.synsets("computer"))
print(wordnet.synset("computer.n.01").lemma_names())
for e in wordnet.synsets("computer"):
    print(f'{e} --> {e.lemma_names()}')
print(wordnet.synset("computer.n.01").lemmas())
print(wordnet.lemma('computer.n.01.computing_device').synset())
print(wordnet.lemma('computer.n.01.computing_device').name())
syn = wordnet.synset('computer.n.01')
print(syn.hyponyms())
print([lemma.name() for synset in syn.hyponyms() for lemma in synset.lemmas()])
vehicle = wordnet.synset('vehicle.n.01')
car = wordnet.synset('car.n.01')
print(car.lowest_common_hypernyms(vehicle))
Output:

PRACTICAL 3C
Write a program using python to find synonym and antonym of word
"active" using Wordnet.
Code:
from nltk.corpus import wordnet
print(wordnet.synsets("active"))
print(wordnet.lemma('active.a.01.active').antonyms())
Output:

PRACTICAL 3D
Compare two nouns.
Code:
import nltk
from nltk.corpus import wordnet
syn1 = wordnet.synsets('football')
syn2 = wordnet.synsets('soccer')
for s1 in syn1:
    for s2 in syn2:
        print("path similarity of:")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print("is", s1.path_similarity(s2))
        print()

Output:

PRACTICAL 3E
Handling stopword
i) Using nltk Adding or Removing Stop Words in NLTK's Default Stop
Word List
Code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
text = "Yashesh like to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in stopwords.words()]
print(tokens_without_sw)
all_stopwords = stopwords.words('english')
all_stopwords.append('play')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
all_stopwords.remove('not')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:

ii) Using Gensim Adding and Removing Stop Words in Default Gensim
Stop Words List
Code:
import gensim
from gensim.parsing.preprocessing import remove_stopwords
from nltk.tokenize import word_tokenize
text = "Yashesh likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)
from gensim.parsing.preprocessing import STOPWORDS
# add two extra stop words to gensim's default list
all_stopwords_gensim = STOPWORDS.union(set(['likes', 'play']))
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(tokens_without_sw)
print("========================================")
# remove 'not' from gensim's default list
sw_list = {"not"}
all_stopwords_gensim = STOPWORDS.difference(sw_list)
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(all_stopwords_gensim)
print(tokens_without_sw)
Output:
iii) Using Spacy Adding and Removing Stop Words in Default Spacy Stop
Words List.
Code:
import spacy
import nltk
from nltk.tokenize import word_tokenize
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
all_stopwords.add('play')
text = "Yashesh like to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
all_stopwords.remove('not')
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

Commands:
#pip install spacy
#python -m spacy download en_core_web_sm
#python -m spacy download en
Output:
PRACTICAL 4A
Tokenization using Python’s split() function.
Code:
text = "Monism is a thesis about oneness: that only one thing exists in a certain
sense. The denial of monism is pluralism, the thesis that, in a certain sense,
more than one thing exists.[7] There are manyforms of monism and pluralism,
but in relation to the world as a whole, two are of special interest:existence
monism/pluralism and priority monism/pluralism."
data = text.split('.')
for i in data:
print(i)
Output:
PRACTICAL 4B
Tokenization using Regular Expressions (RegEx)
Code:
import nltk
from nltk.tokenize import RegexpTokenizer
tk = RegexpTokenizer(r'\s+', gaps=True)
str = "I love to study Natural Language Processing in Python"
tokens = tk.tokenize(str)
print(tokens)
Output:

PRACTICAL 4C
Tokenization using NLTK
Code:
import nltk
from nltk.tokenize import word_tokenize
str = "I love to study Natrual Language processing in python"
print(word_tokenize(str))

PRACTICAL 4D
Tokenization using the spaCy library
Code:
import spacy
nlp = spacy.blank("en")
str = "I love to study Natrual Language processing in python"
doc = nlp(str)
words = [ word.text for word in doc]
print(words)

Output

PRACTICAL 4E
Tokenization using Keras
Code:
import tensorflow
import keras
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# Create a string input
str = "I love to study Natural Language Processing in Python"
# tokenizing the text
tokens = text_to_word_sequence(str)
print(tokens)

Output:

PRACTICAL 4F
Tokenization using Gensim
Code:
from gensim.utils import tokenize
str = "I love to study Natrual Language processing in python"
print(list(tokenize(str)))
Output:

PRACTICAL 6A
Illustrate part of speech tagging
Part of speech Tagging and chunking of user defined text.
Code:
import nltk
from nltk import tokenize
nltk.download('punkt')
from nltk import tag
from nltk import chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

para = "Hello! My name is anjali nimje . Today you'll be learning NLTK."


sents = tokenize.sent_tokenize(para)
print("\nsentence tokenization\n===================\n", sents)

# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)

# POS tagging (each sentence is tagged on its own tokens)
tagged_words = []
for index in range(len(sents)):
    tagged_words.append(tag.pos_tag(tokenize.word_tokenize(sents[index])))
print("\nPOS Tagging\n===========\n", tagged_words)

# chunking
tree = []
for index in range(len(sents)):
    tree.append(chunk.ne_chunk(tagged_words[index]))
print("\nchunking\n========\n")
print(tree)
Output:

PRACTICAL 6B
Named Entity recognition using user defined text.
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
text = ("when seabastian Thrun started working on self-driving cars at"
"Google in 2007, few people oustside of the company took him"
"Seroiusly,I can tell you very senior CEO's of major Amercan"
"car companies would shake my hand and turn away beacause I wasn't"
"worth talking to,said Thurn , in an interview with recorder earlier")doc =
nlp(text)
print("nouns:\n",[chunk.text for chunk in doc.noun_chunks])
print("verbs",[token.lemma_ for token in doc if token.pos_ =="VERB"])

Output:
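The listing prints noun chunks and verbs rather than the named entities themselves; the entities are available through doc.ents, as in this short sketch added for completeness:

# named entities with their labels
print("entities:\n", [(ent.text, ent.label_) for ent in doc.ents])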

PRACTICAL 6C
Named Entity recognition with diagram using NLTK corpus – treebank.
Code:
import nltk
nltk.download('treebank')
from nltk.corpus import treebank_chunk
treebank_chunk.tagged_sents()[0]
treebank_chunk.chunked_sents()[0]
treebank_chunk.chunked_sents()[0].draw()
Output:

PRACTICAL 7A

Finite state automata


Define grammar using nltk. Analyze a sentence using the same.
Code:
import nltk
from nltk import tokenize
grammar1 = nltk.CFG.fromstring("""
S -> VP
VP -> VP NP
NP -> Det NP
Det -> 'that'
NP -> singular Noun
NP -> 'flight'
VP -> 'Book'
""")
sentence = "Book that flight"
for index in range(len(sentence)):
all_tokens = tokenize.word_tokenize(sentence)
print(all_tokens)
parser = nltk.ChartParser(grammar1)
for tree in parser.parse(all_tokens):
print(tree)
tree.draw()

Output:
PRACTICAL 7B
Accept the input string with Regular expression of Finite Automaton:
101+.
Code:
def FA(s):
    # if the length is less than 3, the string can't be accepted, so end the process
    if len(s) < 3:
        return "Rejected"
    # the first three characters are fixed, so check them by index
    if s[0] == '1':
        if s[1] == '0':
            if s[2] == '1':
                # after index 2 only '1' may appear; reject if any other character is found
                for i in range(3, len(s)):
                    if s[i] != '1':
                        return "Rejected"
                return "Accepted"  # if all 4 nested ifs are true
            return "Rejected"      # else of 3rd if
        return "Rejected"          # else of 2nd if
    return "Rejected"              # else of 1st if

inputs = ['1', '10101', '101', '10111', '01010', '100', '', '10111101', '1011111']
for i in inputs:
    print(FA(i))

Output:
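As a cross-check (not part of the original exercise), the same language 101+ can be tested with Python's re module; a minimal sketch:

import re

# 101+ : '1', '0', then one or more '1's
pattern = re.compile(r'101+')
for s in ['1', '10101', '101', '10111', '01010', '100', '', '10111101', '1011111']:
    print('Accepted' if pattern.fullmatch(s) else 'Rejected')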
PRACTICAL 7C
Accept the input string with Regular expression of FA: (a+b)*bba.
Code:
def FA(s):
    size = 0

    # scan the complete string and make sure it contains only 'a' and 'b'
    for i in s:
        if i == 'a' or i == 'b':
            size += 1
        else:
            return "Rejected"

    # the string contains only 'a' and 'b'; its length must be at least 3
    if size >= 3:
        # check the last 3 characters
        if s[size-3] == 'b':
            if s[size-2] == 'b':
                if s[size-1] == 'a':
                    return "Accepted"  # if all 4 ifs are true
                return "Rejected"      # else of 4th if
            return "Rejected"          # else of 3rd if
        return "Rejected"              # else of 2nd if
    return "Rejected"                  # else of 1st if

inputs = ['bba', 'ababbba', 'abba', 'abb', 'baba', 'bbb', '']
for i in inputs:
    print(FA(i))

Output:
PRACTICAL 7D
Implementation of Deductive Chart Parsing using context free grammar
and a given sentence.
Code:
import nltk
from nltk import tokenize
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'a' | 'my'
N -> 'bird' | 'balcony'
V -> 'saw'
P -> 'in'
""")
sentence = "I saw a bird in my balcony"

for index in range(len(sentence)):


all_tokens = tokenize.word_tokenize(sentence)
print(all_tokens)

# all_tokens = ['I', 'saw', 'a', 'bird', 'in', 'my', 'balcony']


parser = nltk.ChartParser(grammar1)
for tree in parser.parse(all_tokens):
print(tree)
tree.draw()
Output:
PRACTICAL 8
Study PorterStemmer, LancasterStemmer, RegexpStemmer, SnowballStemmer, and WordNetLemmatizer
Code:
import nltk
from nltk.stem import PorterStemmer
word_stemm = PorterStemmer()
print(word_stemm.stem('writing'))
import nltk
from nltk.stem import LancasterStemmer
lanc_stemm = LancasterStemmer()
print(lanc_stemm.stem('writing'))
import nltk
from nltk.stem import RegexpStemmer
Reg_stemm = RegexpStemmer('ing$|s$|e$|able$', min=4)
print(Reg_stemm.stem('writing'))
import nltk
from nltk.stem import SnowballStemmer
english_stemm = SnowballStemmer('english')
print(english_stemm.stem('writing'))
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("word:\t lemma")
print("rocks:", lemmatizer.lemmatize("rocks"))
print("corpora:", lemmatizer.lemmatize("corpora"))

Output:
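WordNetLemmatizer treats every word as a noun unless a part of speech is passed. A short hedged comparison (the word list here is illustrative) shows how the stemmer and the lemmatizer with pos='v' differ:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in ['writing', 'studies', 'was']:
    # stem | lemma as noun (default) | lemma as verb
    print(w, '->', stemmer.stem(w), '|',
          lemmatizer.lemmatize(w), '|',
          lemmatizer.lemmatize(w, pos='v'))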
PRACTICAL 9
Implement Naive Bayes classifier
Code:
#pip install pandas
#pip install scikit-learn
import pandas as pd
import numpy as np
sms_data = pd.read_csv('C:\\Users\\Student\\Desktop\\Mayuri\\spam.csv', encoding='latin-1')
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemming = PorterStemmer()
corpus = []
for i in range(0, len(sms_data)):
    s1 = re.sub('[^a-zA-Z]', repl=' ', string=sms_data['v2'][i])  # keep letters only
    s1 = s1.lower()
    s1 = s1.split()
    s1 = [stemming.stem(word) for word in s1
          if word not in set(stopwords.words('english'))]
    s1 = ' '.join(s1)
    corpus.append(s1)

from sklearn.feature_extraction.text import CountVectorizer


countvectorizer =CountVectorizer()
x = countvectorizer.fit_transform(corpus).toarray()
print(x)
y = sms_data['v1'].values
print(y)

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=2)

#Multinomial Naïve Bayes.


from sklearn.naive_bayes import MultinomialNB
multinomialnb = MultinomialNB()
multinomialnb.fit(x_train,y_train)

# Predicting on test data:


y_pred = multinomialnb.predict(x_test)
print(y_pred)

#Results of our Models


from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
print(classification_report(y_test,y_pred))
print("accuracy_score: ",accuracy_score(y_test,y_pred))

Output:
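A short follow-up (not in the original listing) showing how the already-fitted CountVectorizer and model could classify a new message; the example message is made up and, for simplicity, is not stemmed the way the training data was:

# classify an unseen message with the fitted vectorizer and model
new_msg = ["Congratulations! You have won a free ticket, call now"]
new_x = countvectorizer.transform(new_msg).toarray()
print(multinomialnb.predict(new_x))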
PRACTICAL 10 A
i) Parts of speech tagging using spaCy
Code:
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I like to play football. I hated it in my childhood though")
print(sen.text)
print(sen[7].pos_)
print(sen[7].tag_)
print(spacy.explain(sen[7].tag_))
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

sen = sp(u'Can you google it?')
word = sen[2]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

sen = sp(u'Can you search it on google?')
word = sen[5]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

# Finding the number of POS tags
sen = sp(u"I like to play football. I hated it in my childhood though")
num_pos = sen.count_by(spacy.attrs.POS)
print(num_pos)
for k, v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

# Visualizing parts of speech tags
from spacy import displacy
sen = sp(u"I like to play football. I hated it in my childhood though")
displacy.serve(sen, style='dep', options={'distance': 120})

Output:
PRACTICAL 10 A
ii) Parts of speech tagging using NLTK
Code:
import nltk
nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

#create our training and testing data:


train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

#train the Punkt tokenizer like:


custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# tokenize:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[:2]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Output:
PRACTICAL 10 B
i)Usage of Give and Gave in the Penn Treebank sample
Code:
import nltk
import nltk.parse.viterbi
import nltk.parse.pchart
def give(t):
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
        and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
        and ('give' in t[0].leaves() or 'gave' in t[0].leaves())

def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    output = "%s %s: %s / %s: %s" %\
        (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)

for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)

Output:
PRACTICAL 10 B
ii)Probabilistic parser
Code:
import nltk
from nltk import PCFG
grammar = PCFG.fromstring('''
NP -> NNS [0.5] | JJ NNS [0.3] | NP CC NP [0.2]
NNS -> "men" [0.1] | "women" [0.2] | "children" [0.3] | NNS CC NNS [0.4]
JJ -> "old" [0.4] | "young" [0.6]
CC -> "and" [0.9] | "or" [0.1]
''')
print(grammar)
viterbi_parser = nltk.ViterbiParser(grammar)
token = "old men and women".split()
obj = viterbi_parser.parse(token)
print("Output: ")
for x in obj:
    print(x)
Output:
PRACTICAL 10C
Dependency parsing using MaltParser.
Code:
from nltk.parse import malt
mp = malt.MaltParser('maltparser-1.9.2', 'engmalt.linear-1.7.mco')  # parser directory and pre-trained model file
t = mp.parse_one('I saw a bird from my window.'.split()).tree()
print(t)
t.draw()
PRACTICAL 11A
Multiword Expressions in NLP
Code:
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize
s = '''Good cake cost Rs.1500\kg in Mumbai. Please buy me one of them.\n\nThanks.'''
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
for sent in sent_tokenize(s):
    print(mwe.tokenize(word_tokenize(sent)))
Output:
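Neither multiword expression occurs in the sample text, so the output matches plain word tokenization. A hedged variant with a sentence that does contain the expressions might look like this:

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
# 'New York' and 'Hong Kong' are merged into single tokens
print(mwe.tokenize(word_tokenize("I flew from New York to Hong Kong.")))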
PRACTICAL 11B
Normalized Web Distance and Word Similarity
Code:
import numpy as np
import re
import textdistance # pip install textdistance
# a recent scikit-learn is needed for the metric= parameter of AgglomerativeClustering
import sklearn  # pip install scikit-learn
from sklearn.cluster import AgglomerativeClustering
texts = [
'Reliance supermarket', 'Reliance hypermarket', 'Reliance', 'Reliance',
'Reliancedowntown', 'Relianc market',
'Mumbai', 'Mumbai Hyper', 'Mumbai dxb', 'mumbai airport',
'k.m trading', 'KM Trading', 'KM trade', 'K.M. Trading', 'KM.Trading'
]
def normalize(text):
    """ Keep only lower-cased text and numbers """
    return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4):
    """ Replace each text with the representative of its cluster """
    normalized_texts = np.array([normalize(text) for text in texts])
    distances = 1 - np.array([
        [textdistance.jaro_winkler(one, another) for one in normalized_texts]
        for another in normalized_texts
    ])
    clustering = AgglomerativeClustering(
        distance_threshold=threshold,  # this parameter needs to be tuned carefully
        metric="precomputed", linkage="complete", n_clusters=None
    ).fit(distances)
    centers = dict()
    for cluster_id in set(clustering.labels_):
        index = clustering.labels_ == cluster_id
        centrality = distances[:, index][index].sum(axis=1)
        centers[cluster_id] = normalized_texts[index][centrality.argmin()]
    return [centers[i] for i in clustering.labels_]

print(group_texts(texts))

Output:
PRACTICAL 11C
Word Sense Disambiguation
Code:
from nltk.corpus import wordnet as wn
def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))

Output:
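The listing implements the most-frequent-sense baseline, always returning the first synset regardless of context. NLTK also provides a simplified Lesk algorithm for context-dependent disambiguation; a minimal sketch:

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, 'bank', 'n')   # pick the sense whose gloss best overlaps the context
print(sense, ':', sense.definition())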
