
NATURAL LANGUAGE PROCESSING

Certified Journal
Submitted in partial fulfilment of the
Requirements for the award of the Degree of

MASTER OF SCIENCE
(INFORMATION TECHNOLOGY)
By
Anjali Rameshwar Nimje

DEPARTMENT OF INFORMATION TECHNOLOGY

KERALEEYA SAMAJAM (REGD.) DOMBIVLI’S


MODEL COLLEGE (AUTONOMOUS)
Re-Accredited ‘A’ Grade by NAAC

(Affiliated to University of Mumbai)

FOR THE YEAR

(2023-24)
INDEX
Sr No Practical Name Date Signature
PRACTICAL 1A
Install NLTK
Python 3.9.2 Installation on Windows
Step 1) Go to https://www.python.org/downloads/ and select the latest version for Windows.

Note: If you don't want to download the latest version, you can visit the
download tab and see all releases.

Step 2) Click on the Windows installer (64 bit)

Step 3) Select Customize Installation


Step 4) Click NEXT

Step 5) On the next screen:


1. Select the advanced options.
2. Give a custom install location. Keep the default folder as C:\Program Files\Python39.
3. Click Install.

Step 6) Click the Close button once the installation is done.


Step 7) Open a command prompt window and run the following commands:
C:\Users\Beena Kapadia>pip install --upgrade pip
C:\Users\Beena Kapadia>pip install --user -U nltk
C:\Users\Beena Kapadia>pip install --user -U numpy
C:\Users\Beena Kapadia>python
>>> import nltk
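Once the interpreter opens, a short check can confirm that NLTK works (a minimal sketch; the 'punkt' download and the sample sentence are only examples, not part of the original steps):
>>> nltk.download('punkt')   # tokenizer models used by word_tokenize
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("NLTK is installed and working.")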

PRACTICAL 1B
Convert the given text to speech.

Code:
from playsound import playsound   # pip install playsound
from gtts import gTTS             # pip install gTTS
mytext = "Welcome to Natural Language Processing"
language = "en"                   # English voice
myobj = gTTS(text=mytext, lang=language, slow=False)
myobj.save("myfile.mp3")          # save the generated speech
playsound("myfile.mp3")           # play it back

Output:

PRACTICAL 1C
Convert audio file Speech to Text.

Code:
import speech_recognition as sr   # pip install SpeechRecognition
filename = "male.wav"             # input audio file
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)           # read the entire audio file
    text = r.recognize_google(audio_data)   # send the audio to Google's speech API
    print(text)

Output:

PRACTICAL 2A
Study of various corpora – Brown, Inaugural, Reuters, udhr – with methods like fileids, raw, words, sents, categories.

Code:
import nltk
from nltk.corpus import brown
print ('File ids of brown corpus\n',brown.fileids())
'''Pick out the first of these texts — ca01 — give it a short name,
and find out how many words it contains:'''
ca01 = brown.words('ca01')
# display first few words
print('\nca01 has following words:\n', ca01)
# total number of words in ca01
print('\nca01 has', len(ca01), 'words')

# categories or files
print('\n\nCategories or file in brown corpus:\n')
print(brown.categories())
'''Display other information about each text by looping over all the values of fileid
corresponding to the brown file identifiers listed earlier and then computing
statistics for each text.'''
print('\n\nStatistics for each text:\n')
print('AvgWordLen\tAvgSentenceLen\tno.ofTimesEachWordAppearsOnAvg\tFileName')
for fileid in brown.fileids():
    num_chars = len(brown.raw(fileid))
    num_words = len(brown.words(fileid))
    num_sents = len(brown.sents(fileid))
    num_vocab = len(set([w.lower() for w in brown.words(fileid)]))
    print(int(num_chars/num_words), '\t\t\t', int(num_words/num_sents), '\t\t\t',
          int(num_words/num_vocab), '\t\t\t', fileid)
Output:
PRACTICAL 2C
Study Conditional frequency distributions
Code:
#process a sequence of pairs
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
import nltk
from nltk.corpus import brown
fd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:])
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions())
print(cfd['news'])
print(cfd['romance'])
print(list(cfd['romance']))
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut',
             'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)

Output:
PRACTICAL 2D
Study of tagged corpora with methods like tagged_sents, tagged_words.

Code:
import nltk
from nltk import tokenize
nltk.download('punkt')
nltk.download('words')
para = "Hello! My name is Anjali Nimje. Today you'll be learning NLTK."
sents = tokenize.sent_tokenize(para)
print("\nsentence tokenization\n===================\n",sents)
# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)

Output:
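The listing above only tokenizes raw text. A minimal sketch of the tagged-corpus methods named in the title (tagged_words, tagged_sents), shown here on the Brown corpus as an assumed example, could look like this:

import nltk
nltk.download('brown')
from nltk.corpus import brown
# first few (word, tag) pairs from the 'news' category
print(brown.tagged_words(categories='news')[:10])
# first tagged sentence from the same category
print(brown.tagged_sents(categories='news')[0])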

PRACTICAL 2E
Write a program to find the most frequent noun tags.
Code:
import nltk
from collections import defaultdict
text = nltk.word_tokenize("Nick likes to play football. Nick does not like to play cricket.")
tagged = nltk.pos_tag(text)
print(tagged)
# keep only the words tagged as nouns
addNounWords = []
count = 0
for words in tagged:
    val = tagged[count][1]
    if val == 'NN' or val == 'NNS' or val == 'NNPS' or val == 'NNP':
        addNounWords.append(tagged[count][0])
    count += 1
print(addNounWords)
temp = defaultdict(int)
# memoizing count
for sub in addNounWords:
    for wrd in sub.split():
        temp[wrd] += 1
# getting max frequency
res = max(temp, key=temp.get)
# printing result
print("Word with maximum frequency : " + str(res))
Output:

PRACTICAL 2F
Map Words to Properties Using Python Dictionaries
Code:
# creating and printing a dictionary that maps a key to its properties
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)
print(thisdict["brand"])
print(len(thisdict))
print(type(thisdict))
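The car dictionary above only illustrates the syntax. A hedged sketch that actually maps words to linguistic properties, with property names and values that are purely illustrative, might look like this:

# illustrative mapping from words to linguistic properties
word_props = {
    "run":  {"pos": "verb", "lemma": "run"},
    "dogs": {"pos": "noun", "lemma": "dog"},
}
print(word_props["dogs"]["lemma"])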

PRACTICAL 2G
Study DefaultTagger, Regular expression tagger, UnigramTagger
i)Default Tagger
Code:
import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
from nltk.corpus import treebank
testsentences = treebank.tagged_sents()[1000:]
print(exptagger.evaluate(testsentences))
#Tagging a list of sentences
import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
print(exptagger.tag_sents([['Hi', ','], ['How', 'are', 'you', '?']]))

Output:
ii)Regular Expressions
Code:
from nltk.corpus import brown
from nltk.tag import RegexpTagger
test_sent = brown.sents(categories='news')[0]
regexp_tagger = RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),  # articles
     (r'.*able$', 'JJ'),               # adjectives
     (r'.*ness$', 'NN'),               # nouns formed from adjectives
     (r'.*ly$', 'RB'),                 # adverbs
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # past tense verbs
     (r'.*', 'NN')                     # nouns (default)
    ])
print(regexp_tagger)
print(regexp_tagger.tag(test_sent))

Output:
iii) Unigram Tagger
Code:
# Loading Libraries
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
# Training using first 10 tagged sentences of the treebank corpus as data.
# Using data
train_sents = treebank.tagged_sents()[:10]
# Initializing
tagger = UnigramTagger(train_sents)
# Lets see the first sentence
# (of the treebank corpus) as list
print(treebank.sents()[0])
print('\n',tagger.tag(treebank.sents()[0]))
#Finding the tagged results after training.
tagger.tag(treebank.sents()[0])
#Overriding the context model
tagger = UnigramTagger(model={'Pierre': 'NN'})
print('\n',tagger.tag(treebank.sents()[0]))

Output:
PRACTICAL 3A
Study of Wordnet Dictionary with methods as synsets, definitions,
examples,antonyms.
Code:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
print(wordnet.synsets("computer"))
print(wordnet.synset("computer.n.01").definition())
print("Examples:", wordnet.synset("computer.n.01").examples())
print(wordnet.lemma('buy.v.01.buy').antonyms())

Output:

PRACTICAL 3B
Study lemmas, hyponyms, hypernyms.
Code:
import nltk
from nltk.corpus import wordnet
print(wordnet.synsets("computer"))
print(wordnet.synset("computer.n.01").lemma_names())
for e in wordnet.synsets("computer"):
    print(f'{e} --> {e.lemma_names()}')
print(wordnet.synset("computer.n.01").lemmas())
print(wordnet.lemma('computer.n.01.computing_device').synset())
print(wordnet.lemma('computer.n.01.computing_device').name())
syn = wordnet.synset('computer.n.01')
print(syn.hyponyms())
print([lemma.name() for synset in syn.hyponyms() for lemma in synset.lemmas()])
vehicle = wordnet.synset('vehicle.n.01')
car = wordnet.synset('car.n.01')
print(car.lowest_common_hypernyms(vehicle))
Output:

PRACTICAL 3C
Write a program using python to find synonym and antonym of word
"active" using Wordnet.
Code:
from nltk.corpus import wordnet
print(wordnet.synsets("active"))
print(wordnet.lemma('active.a.01.active').antonyms())
Output:

PRACTICAL 3D
Compare two nouns.
Code:
import nltk
from nltk.corpus import wordnet
syn1 = wordnet.synsets('football')
syn2 = wordnet.synsets('soccer')
for s1 in syn1:
    for s2 in syn2:
        print("path similarity of:")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print("is", s1.path_similarity(s2))
        print()

Output:

PRACTICAL 3E
Handling stopword
i) Using nltk Adding or Removing Stop Words in NLTK's Default Stop
Word List
Code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
text = "Yashesh like to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in stopwords.words()]
print(tokens_without_sw)
all_stopwords = stopwords.words('english')
all_stopwords.append('play')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
all_stopwords.remove('not')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:

ii) Using Gensim Adding and Removing Stop Words in Default Gensim
Stop Words List
Code:
import gensim
from gensim.parsing.preprocessing import remove_stopwords
from nltk.tokenize import word_tokenize
text = "Yashesh likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)
from gensim.parsing.preprocessing import STOPWORDS
# add two extra stop words to gensim's default list
all_stopwords_gensim = STOPWORDS.union(set(['likes', 'play']))
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(tokens_without_sw)
print("========================================")
# remove 'not' from gensim's default list
sw_list = {"not"}
all_stopwords_gensim = STOPWORDS.difference(sw_list)
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(all_stopwords_gensim)
print(tokens_without_sw)
Output:
iii) Using Spacy Adding and Removing Stop Words in Default Spacy Stop
Words List.
Code:
import spacy
import nltk
from nltk.tokenize import word_tokenize
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
all_stopwords.add('play')
text = "Yashesh like to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
all_stopwords.remove('not')
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

Commands:
#pip install spacy
#python -m spacy download en_core_web_sm
#python -m spacy download en
Output:
PRACTICAL 4A
Tokenization using Python’s split() function.
Code:
text = "Monism is a thesis about oneness: that only one thing exists in a certain
sense. The denial of monism is pluralism, the thesis that, in a certain sense,
more than one thing exists.[7] There are manyforms of monism and pluralism,
but in relation to the world as a whole, two are of special interest:existence
monism/pluralism and priority monism/pluralism."
data = text.split('.')
for i in data:
print(i)
Output:
PRACTICAL 4B
Tokenization using Regular Expressions (RegEx)
Code:
import nltk
from nltk.tokenize import RegexpTokenizer
tk = RegexpTokenizer(r'\s+', gaps=True)
str = "I love to study Natural Language Processing in Python"
tokens = tk.tokenize(str)
print(tokens)
Output:

PRACTICAL 4C
Tokenization using NLTK
Code:
import nltk
from nltk.tokenize import word_tokenize
str = "I love to study Natrual Language processing in python"
print(word_tokenize(str))

PRACTICAL 4D
Tokenization using the spaCy library
Code:
import spacy
nlp = spacy.blank("en")
str = "I love to study Natrual Language processing in python"
doc = nlp(str)
words = [ word.text for word in doc]
print(words)

Output

PRACTICAL 4E
Tokenization using Keras
Code:
import tensorflow
import keras
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# Create a string input
str = "I love to study Natural Language Processing in Python"
# tokenizing the text
tokens = text_to_word_sequence(str)
print(tokens)

Output:

PRACTICAL 4F
Tokenization using Gensim
Code:
from gensim.utils import tokenize
str = "I love to study Natrual Language processing in python"
print(list(tokenize(str)))
Output:

PRACTICAL 6A
Illustrate part of speech tagging
Part of speech Tagging and chunking of user defined text.
Code:
import nltk
from nltk import tokenize
nltk.download('punkt')
from nltk import tag
from nltk import chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

para = "Hello! My name is anjali nimje . Today you'll be learning NLTK."


sents = tokenize.sent_tokenize(para)
print("\nsentence tokenization\n===================\n", sents)

# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)

# POS tagging (each sentence is tagged on its own tokens)
tagged_words = []
for index in range(len(sents)):
    tagged_words.append(tag.pos_tag(tokenize.word_tokenize(sents[index])))
print("\nPOS Tagging\n===========\n", tagged_words)

# chunking
tree = []
for index in range(len(sents)):
    tree.append(chunk.ne_chunk(tagged_words[index]))
print("\nchunking\n========\n")
print(tree)
Output:

PRACTICAL 6B
Named Entity recognition using user defined text.
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
text = ("when seabastian Thrun started working on self-driving cars at"
"Google in 2007, few people oustside of the company took him"
"Seroiusly,I can tell you very senior CEO's of major Amercan"
"car companies would shake my hand and turn away beacause I wasn't"
"worth talking to,said Thurn , in an interview with recorder earlier")doc =
nlp(text)
print("nouns:\n",[chunk.text for chunk in doc.noun_chunks])
print("verbs",[token.lemma_ for token in doc if token.pos_ =="VERB"])

Output:
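The listing prints noun chunks and verbs rather than the named entities themselves; the entities are available through doc.ents, as in this short sketch added for completeness:

# named entities with their labels
print("entities:\n", [(ent.text, ent.label_) for ent in doc.ents])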

PRACTICAL 6C
Named Entity recognition with diagram using NLTK corpus – treebank.
Code:
import nltk
nltk.download('treebank')
from nltk.corpus import treebank_chunk
treebank_chunk.tagged_sents()[0]
treebank_chunk.chunked_sents()[0]
treebank_chunk.chunked_sents()[0].draw()
Output:

PRACTICAL 7A

Finite state automata


Define grammar using nltk. Analyze a sentence using the same.
Code:
import nltk
from nltk import tokenize
grammar1 = nltk.CFG.fromstring("""
S -> VP
VP -> VP NP
NP -> Det NP
Det -> 'that'
NP -> singular Noun
NP -> 'flight'
VP -> 'Book'
""")
sentence = "Book that flight"
for index in range(len(sentence)):
all_tokens = tokenize.word_tokenize(sentence)
print(all_tokens)
parser = nltk.ChartParser(grammar1)
for tree in parser.parse(all_tokens):
print(tree)
tree.draw()

Output:
PRACTICAL 7B
Accept the input string with Regular expression of Finite Automaton:
101+.
Code:
def FA(s):
    # if the length is less than 3, the string can't be accepted, so end the process
    if len(s) < 3:
        return "Rejected"
    # the first three characters are fixed, so check them by index
    if s[0] == '1':
        if s[1] == '0':
            if s[2] == '1':
                # after index 2 only '1' may appear; reject if any other character is found
                for i in range(3, len(s)):
                    if s[i] != '1':
                        return "Rejected"
                return "Accepted"  # if all 4 nested ifs are true
            return "Rejected"      # else of 3rd if
        return "Rejected"          # else of 2nd if
    return "Rejected"              # else of 1st if

inputs = ['1', '10101', '101', '10111', '01010', '100', '', '10111101', '1011111']
for i in inputs:
    print(FA(i))

Output:
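As a cross-check (not part of the original exercise), the same language 101+ can be tested with Python's re module; a minimal sketch:

import re

# 101+ : '1', '0', then one or more '1's
pattern = re.compile(r'101+')
for s in ['1', '10101', '101', '10111', '01010', '100', '', '10111101', '1011111']:
    print('Accepted' if pattern.fullmatch(s) else 'Rejected')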
PRACTICAL 7C
Accept the input string with Regular expression of FA: (a+b)*bba.
Code:
def FA(s):
    size = 0

    # scan the complete string and make sure it contains only 'a' and 'b'
    for i in s:
        if i == 'a' or i == 'b':
            size += 1
        else:
            return "Rejected"

    # the string contains only 'a' and 'b'; its length must be at least 3
    if size >= 3:
        # check the last 3 characters
        if s[size-3] == 'b':
            if s[size-2] == 'b':
                if s[size-1] == 'a':
                    return "Accepted"  # if all 4 ifs are true
                return "Rejected"      # else of 4th if
            return "Rejected"          # else of 3rd if
        return "Rejected"              # else of 2nd if
    return "Rejected"                  # else of 1st if

inputs = ['bba', 'ababbba', 'abba', 'abb', 'baba', 'bbb', '']
for i in inputs:
    print(FA(i))

Output:
PRACTICAL 7D
Implementation of Deductive Chart Parsing using context free grammar
and a given sentence.
Code:
import nltk
from nltk import tokenize
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'a' | 'my'
N -> 'bird' | 'balcony'
V -> 'saw'
P -> 'in'
""")
sentence = "I saw a bird in my balcony"

for index in range(len(sentence)):


all_tokens = tokenize.word_tokenize(sentence)
print(all_tokens)

# all_tokens = ['I', 'saw', 'a', 'bird', 'in', 'my', 'balcony']


parser = nltk.ChartParser(grammar1)
for tree in parser.parse(all_tokens):
print(tree)
tree.draw()
Output:
PRACTICAL 8
Study PorterStemmer, LancasterStemmer, RegexpStemmer, SnowballStemmer, and WordNetLemmatizer
Code:
import nltk
from nltk.stem import PorterStemmer
word_stemm = PorterStemmer()
print(word_stemm.stem('writing'))
import nltk
from nltk.stem import LancasterStemmer
lanc_stemm = LancasterStemmer()
print(lanc_stemm.stem('writing'))
import nltk
from nltk.stem import RegexpStemmer
Reg_stemm = RegexpStemmer('ing$|s$|e$|able$', min=4)
print(Reg_stemm.stem('writing'))
import nltk
from nltk.stem import SnowballStemmer
english_stemm = SnowballStemmer('english')
print(english_stemm.stem('writing'))
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("word:\t lemma")
print("rocks:", lemmatizer.lemmatize("rocks"))
print("corpora:", lemmatizer.lemmatize("corpora"))

Output:
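WordNetLemmatizer treats every word as a noun unless a part of speech is passed. A short hedged comparison (the word list here is illustrative) shows how the stemmer and the lemmatizer with pos='v' differ:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in ['writing', 'studies', 'was']:
    # stem | lemma as noun (default) | lemma as verb
    print(w, '->', stemmer.stem(w), '|',
          lemmatizer.lemmatize(w), '|',
          lemmatizer.lemmatize(w, pos='v'))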
PRACTICAL 9
Implement Naive Bayes classifier
Code:
#pip install pandas
#pip install scikit-learn
import pandas as pd
import numpy as np
sms_data = pd.read_csv('C:\\Users\\Student\\Desktop\\Mayuri\\spam.csv', encoding='latin-1')
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemming = PorterStemmer()
corpus = []
for i in range(0, len(sms_data)):
    s1 = re.sub('[^a-zA-Z]', repl=' ', string=sms_data['v2'][i])  # keep letters only
    s1 = s1.lower()
    s1 = s1.split()
    s1 = [stemming.stem(word) for word in s1
          if word not in set(stopwords.words('english'))]
    s1 = ' '.join(s1)
    corpus.append(s1)

from sklearn.feature_extraction.text import CountVectorizer


countvectorizer =CountVectorizer()
x = countvectorizer.fit_transform(corpus).toarray()
print(x)
y = sms_data['v1'].values
print(y)

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=2)

#Multinomial Naïve Bayes.


from sklearn.naive_bayes import MultinomialNB
multinomialnb = MultinomialNB()
multinomialnb.fit(x_train,y_train)

# Predicting on test data:


y_pred = multinomialnb.predict(x_test)
print(y_pred)

#Results of our Models


from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
print(classification_report(y_test,y_pred))
print("accuracy_score: ",accuracy_score(y_test,y_pred))

Output:
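A short follow-up (not in the original listing) showing how the already-fitted CountVectorizer and model could classify a new message; the example message is made up and, for simplicity, is not stemmed the way the training data was:

# classify an unseen message with the fitted vectorizer and model
new_msg = ["Congratulations! You have won a free ticket, call now"]
new_x = countvectorizer.transform(new_msg).toarray()
print(multinomialnb.predict(new_x))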
PRACTICAL 10 A
i) Parts of speech tagging using spaCy
Code:
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I like to play football. I hated it in my childhood though")
print(sen.text)
print(sen[7].pos_)
print(sen[7].tag_)
print(spacy.explain(sen[7].tag_))
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

sen = sp(u'Can you google it?')
word = sen[2]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

sen = sp(u'Can you search it on google?')
word = sen[5]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

# Finding the number of POS tags
sen = sp(u"I like to play football. I hated it in my childhood though")
num_pos = sen.count_by(spacy.attrs.POS)
print(num_pos)
for k, v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

# Visualizing parts of speech tags
from spacy import displacy
sen = sp(u"I like to play football. I hated it in my childhood though")
displacy.serve(sen, style='dep', options={'distance': 120})

Output:
PRACTICAL 10 A
ii) Parts of speech tagging using NLTK
Code:
import nltk
nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

#create our training and testing data:


train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

#train the Punkt tokenizer like:


custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# tokenize:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[:2]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Output:
PRACTICAL 10 B
i)Usage of Give and Gave in the Penn Treebank sample
Code:
import nltk
import nltk.parse.viterbi
import nltk.parse.pchart
def give(t):
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
        and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
        and ('give' in t[0].leaves() or 'gave' in t[0].leaves())

def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    output = "%s %s: %s / %s: %s" %\
        (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)

for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)

Output:
PRACTICAL 10 B
ii)Probabilistic parser
Code:
import nltk
from nltk import PCFG
grammar = PCFG.fromstring('''
NP -> NNS [0.5] | JJ NNS [0.3] | NP CC NP [0.2]
NNS -> "men" [0.1] | "women" [0.2] | "children" [0.3] | NNS CC NNS [0.4]
JJ -> "old" [0.4] | "young" [0.6]
CC -> "and" [0.9] | "or" [0.1]
''')
print(grammar)
viterbi_parser = nltk.ViterbiParser(grammar)
token = "old men and women".split()
obj = viterbi_parser.parse(token)
print("Output: ")
for x in obj:
    print(x)
Output:
PRACTICAL 10C
Dependency parsing using MaltParser.
Code:
from nltk.parse import malt
mp = malt.MaltParser('maltparser-1.9.2', 'engmalt.linear-1.7.mco')  # parser directory and pre-trained model file
t = mp.parse_one('I saw a bird from my window.'.split()).tree()
print(t)
t.draw()
PRACTICAL 11A
Multiword Expressions in NLP
Code:
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize
s = '''Good cake cost Rs.1500\kg in Mumbai. Please buy me one of them.\n\nThanks.'''
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
for sent in sent_tokenize(s):
    print(mwe.tokenize(word_tokenize(sent)))
Output:
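Neither multiword expression occurs in the sample text, so the output matches plain word tokenization. A hedged variant with a sentence that does contain the expressions might look like this:

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
# 'New York' and 'Hong Kong' are merged into single tokens
print(mwe.tokenize(word_tokenize("I flew from New York to Hong Kong.")))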
PRACTICAL 11B
Normalized Web Distance and Word Similarity
Code:
import numpy as np
import re
import textdistance # pip install textdistance
# a recent scikit-learn is needed for the metric= parameter of AgglomerativeClustering
import sklearn  # pip install scikit-learn
from sklearn.cluster import AgglomerativeClustering
texts = [
'Reliance supermarket', 'Reliance hypermarket', 'Reliance', 'Reliance',
'Reliancedowntown', 'Relianc market',
'Mumbai', 'Mumbai Hyper', 'Mumbai dxb', 'mumbai airport',
'k.m trading', 'KM Trading', 'KM trade', 'K.M. Trading', 'KM.Trading'
]
def normalize(text):
    """ Keep only lower-cased text and numbers """
    return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4):
    """ Replace each text with the representative of its cluster """
    normalized_texts = np.array([normalize(text) for text in texts])
    distances = 1 - np.array([
        [textdistance.jaro_winkler(one, another) for one in normalized_texts]
        for another in normalized_texts
    ])
    clustering = AgglomerativeClustering(
        distance_threshold=threshold,  # this parameter needs to be tuned carefully
        metric="precomputed", linkage="complete", n_clusters=None
    ).fit(distances)
    centers = dict()
    for cluster_id in set(clustering.labels_):
        index = clustering.labels_ == cluster_id
        centrality = distances[:, index][index].sum(axis=1)
        centers[cluster_id] = normalized_texts[index][centrality.argmin()]
    return [centers[i] for i in clustering.labels_]

print(group_texts(texts))

Output:
PRACTICAL 11C
Word Sense Disambiguation
Code:
from nltk.corpus import wordnet as wn
def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))

Output:
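The listing implements the most-frequent-sense baseline, always returning the first synset regardless of context. NLTK also provides a simplified Lesk algorithm for context-dependent disambiguation; a minimal sketch:

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, 'bank', 'n')   # pick the sense whose gloss best overlaps the context
print(sense, ':', sense.definition())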
