NLP Lab Manual

WEEK-1
AIM: Installation and exploring features of NLTK and spaCy tools. Download word cloud and a few
corpora.
Installation of NLTK:
1. Go to https://www.python.org/downloads/ and select the latest version for Windows.
2. Run the downloaded file.
3. Select Install Now.
4. After setup completes successfully, click Close.
5. In the Windows command prompt, navigate to the location of the pip folder.
6. Enter the command to install NLTK: pip install nltk
7. The installation should complete successfully.
Features of NLTK:
• Part of Speech tagging
• Summarization
• Named Entity Recognition
• Sentiment Analysis
• Emotion Detection
• Language Detection
• Data Ingestion and Wrangling
• Programming Language Support
• Drag and Drop
• Customizable Models
• Pre-Built Algorithms

Installation of spaCy:
1. Open Command Prompt and enter the following commands to install spaCy.
2. pip install -U pip setuptools wheel
3. pip install -U spacy
4. python -m spacy download en_core_web_sm
Features of spaCy:
• Parts of Speech tagging
• Morphology
• Lemmatization
• Dependency Parse
• Named Entities
• Tokenization
• Merging and Splitting
• Sentence Segmentation, etc.
Downloading of WordCloud:
• Using Command Prompt
WordCloud can be installed by running the following command in the command prompt.
$ pip install wordcloud
• Using Anaconda
WordCloud can also be installed through Anaconda by typing the following command in the Anaconda
Prompt.
$ conda install -c conda-forge wordcloud

Downloading of a few Corpora

We can install corpora from the Python interpreter.
Run the Python interpreter and type the commands:
import nltk
nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and select
Change Download Directory. Next, select the packages or collections you want to download. After a
successful installation, we can test that a corpus has been installed as follows:
from nltk.corpus import brown
brown.words()
Output:
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

WEEK-2
AIM: Write a program to implement word tokenizer, sentence and paragraph tokenizer.
DESCRIPTION:
Tokenization: Tokenization is the process by which a large quantity of text is divided into smaller
parts called tokens. These tokens are very useful for finding patterns and are considered a base step
for stemming and lemmatization. (In data security, tokenization has a separate meaning: substituting
sensitive data elements with non-sensitive ones.)
The Natural Language Toolkit provides the nltk.tokenize module, which includes the following
functions:
1. word tokenize
2. sentence tokenize
We use the method word_tokenize() to split a sentence into words and sent_tokenize() to split a text
into sentences.
PROGRAM:
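import nltk
# Call nltk.download with a package name. A bare nltk.download (without
# parentheses) downloads nothing and only displays the bound method object.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')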

from nltk import word_tokenize, sent_tokenize

# Adjacent string literals inside parentheses are joined into one string
text = ("Tokenization in NLP is the process by which a large quantity of text is divided into smaller "
        "parts called tokens. Natural language processing is used for building applications such as Text "
        "classification, intelligent chatbot, sentimental analysis, language translation, etc.Natural Language "
        "toolkit has very important module NLTK tokenize sentence which further comprises of sub-"
        "modulesWe use the method word_tokenize() to split a sentence into words. The output of word "
        "tokenizer in NLTK can be converted to Data Frame for better text understanding in machine learning "
        "applications.Sub-module available for the above is sent_tokenize. Sentence tokenizer in Python "
        "NLTK is an important feature for machine training.")
print(word_tokenize(text))
OUTPUT:
['Tokenization', 'in', 'NLP', 'is', 'the', 'process', 'by', 'which', 'a', 'large', 'quantity', 'of', 'text', 'is', 'divided',
'into', 'smaller', 'parts', 'called', 'tokens', '.', 'Natural', 'language', 'processing', 'is', 'used', 'for', 'building',
'applications', 'such', 'as', 'Text', 'classification', ',', 'intelligent', 'chatbot', ',', 'sentimental', 'analysis', ',',
'language', 'translation', ',', 'etc.Natural', 'Language', 'toolkit', 'has', 'very', 'important', 'module', 'NLTK',
'tokenize', 'sentence', 'which', 'further', 'comprises', 'of', 'sub-modulesWe', 'use', 'the', 'method',
'word_tokenize', '(', ')', 'to', 'split', 'a', 'sentence', 'into', 'words', '.', 'The', 'output', 'of', 'word', 'tokenizer',
'in', 'NLTK', 'can', 'be', 'converted', 'to', 'Data', 'Frame', 'for', 'better', 'text', 'understanding', 'in',
'machine', 'learning', 'applications.Sub-module', 'available', 'for', 'the', 'above', 'is', 'sent_tokenize', '.',
'Sentence', 'tokenizer', 'in', 'Python', 'NLTK', 'is', 'an', 'important', 'feature', 'for', 'machine', 'training', '.']

print(sent_tokenize(text))
OUTPUT: ['Tokenization in NLP is the process by which a large quantity of text is divided into
smaller parts called tokens.', 'Natural language processing is used for building applications such as
Text classification, intelligent chatbot, sentimental analysis, language translation, etc.Natural
Language toolkit has very important module NLTK tokenize sentence which further comprises of
sub-modulesWe use the method word_tokenize() to split a sentence into words.', 'The output of word
tokenizer in NLTK can be converted to Data Frame for better text understanding in machine learning
applications.Sub-module available for the above is sent_tokenize.', 'Sentence tokenizer in Python
NLTK is an important feature for machine training.']
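
The aim also mentions a paragraph tokenizer, which the program above does not show. The following is a minimal sketch using only the standard library, assuming that paragraphs are separated by one or more blank lines. The function name paragraph_tokenize and the sample text are illustrative and are not part of NLTK.

```python
import re

def paragraph_tokenize(text):
    """Split text into paragraphs at runs of blank lines."""
    # A paragraph break is a newline, optional whitespace, then another newline
    parts = re.split(r'\n\s*\n', text.strip())
    # Collapse line wraps inside each paragraph into single spaces
    return [' '.join(p.split()) for p in parts if p.strip()]

sample = ("First paragraph, line one.\nStill the first paragraph.\n\n"
          "Second paragraph.\n\n\nThird paragraph.")
for number, paragraph in enumerate(paragraph_tokenize(sample), start=1):
    print(number, paragraph)
```

Each paragraph returned this way can then be passed to sent_tokenize() and word_tokenize() as in the program above.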

AIM: Check how many words there are in a corpus. Also check how many distinct words there are.
PROGRAM:
from nltk.corpus import movie_reviews
w = movie_reviews.words()
s = movie_reviews.sents()
from nltk.probability import FreqDist
count = FreqDist(w)
count.N()
OUTPUT:
1583820

print("No of Words in movie_reviews corpus are :",len(w))
print("No of Distinct words in movie_reviews corpus are : ",len(set(w)))
OUTPUT:
No of Words in movie_reviews corpus are : 1583820
No of Distinct words in movie_reviews corpus are : 39768

WEEK-3
AIM: Write a program to implement pre-defined and user-defined functions to generate
a. Uni-grams
b. Bi-grams
c. Tri-grams
d. N-grams
DESCRIPTION:
N-grams are contiguous sequences of N elements taken from a given text. In Natural Language
Processing the elements are usually words, so n stands for the number of consecutive words in each
sequence. The following types of N-grams are usually distinguished:
• Unigram: an N-gram consisting of a single word. For example, computer and human are
unigrams of the sentence "Natural Language Processing is the ability of a computer program
to understand human language as it is spoken and written".
• 2-gram or Bigram: a sequence of two adjacent words in a document. In a corpus containing
the phrase short-form video format, short-form video and video format are bigrams, but
format video and video short-form are not, because word order is preserved.
• 3-gram or Trigram: an N-gram containing exactly three adjacent words (e.g. short-form
video format or new short-form video), etc.

PROGRAM:
def generate_unigrams(text):
    # Split text into words
    words = text.split()
    # Return list of words
    return words

text = (" NLP is used in a wide range of applications, including machine translation, sentiment "
        "analysis, text summarization, speech recognition, chatbots, and more. It involves several tasks, such as "
        "tokenization, part-of-speech tagging, named entity recognition, parsing, and semantic analysis.")
generate_unigrams(text)
OUTPUT:
['NLP', 'is', 'used', 'in', 'a', 'wide', 'range', 'of', 'applications,', 'including', 'machine', 'translation,',
'sentiment', 'analysis,', 'text', 'summarization,', 'speech', 'recognition,', 'chatbots,', 'and', 'more.', 'It',
'involves', 'several', 'tasks,', 'such', 'as', 'tokenization,', 'part-of-speech', 'tagging,', 'named', 'entity',
'recognition,', 'parsing,', 'and', 'semantic', 'analysis.']

def generate_bigrams(text):
    # Split text into words
    words = text.split()
    # Create list to hold bigrams
    bigrams = []
    # Loop through words to create bigrams
    for i in range(len(words) - 1):
        bigram = (words[i], words[i+1])
        bigrams.append(bigram)
    # Return list of bigrams
    return bigrams

text = (" NLP is used in a wide range of applications, including machine translation, sentiment "
        "analysis, text summarization, speech recognition, chatbots, and more. It involves several tasks, such as "
        "tokenization, part-of-speech tagging, named entity recognition, parsing, and semantic analysis.")
generate_bigrams(text)
OUTPUT:
[('NLP', 'is'), ('is', 'used'), ('used', 'in'), ('in', 'a'), ('a', 'wide'), ('wide', 'range'), ('range', 'of'), ('of',
'applications,'), ('applications,', 'including'), ('including', 'machine'), ('machine', 'translation,'),
('translation,', 'sentiment'), ('sentiment', 'analysis,'), ('analysis,', 'text'), ('text', 'summarization,'),
('summarization,', 'speech'), ('speech', 'recognition,'), ('recognition,', 'chatbots,'), ('chatbots,', 'and'),
('and', 'more.'), ('more.', 'It'), ('It', 'involves'), ('involves', 'several'), ('several', 'tasks,'), ('tasks,', 'such'),
('such', 'as'), ('as', 'tokenization,'), ('tokenization,', 'part-of-speech'), ('part-of-speech', 'tagging,'),
('tagging,', 'named'), ('named', 'entity'), ('entity', 'recognition,'), ('recognition,', 'parsing,'), ('parsing,',
'and'), ('and', 'semantic'), ('semantic', 'analysis.')]

def generate_trigrams(text):
    # Split text into words
    words = text.split()
    # Create list to hold trigrams
    trigrams = []
    # Loop through words to create trigrams
    for i in range(len(words) - 2):
        trigram = (words[i], words[i+1], words[i+2])
        trigrams.append(trigram)
    # Return list of trigrams
    return trigrams

text = (" NLP is used in a wide range of applications, including machine translation, sentiment "
        "analysis, text summarization, speech recognition, chatbots, and more. It involves several tasks, such as "
        "tokenization, part-of-speech tagging, named entity recognition, parsing, and semantic analysis.")
generate_trigrams(text)
OUTPUT:
[('NLP', 'is', 'used'), ('is', 'used', 'in'), ('used', 'in', 'a'), ('in', 'a', 'wide'), ('a', 'wide', 'range'), ('wide',
'range', 'of'), ('range', 'of', 'applications,'), ('of', 'applications,', 'including'), ('applications,', 'including',
'machine'), ('including', 'machine', 'translation,'), ('machine', 'translation,', 'sentiment'), ('translation,',
'sentiment', 'analysis,'), ('sentiment', 'analysis,', 'text'), ('analysis,', 'text', 'summarization,'), ('text',
'summarization,', 'speech'), ('summarization,', 'speech', 'recognition,'), ('speech', 'recognition,',
'chatbots,'), ('recognition,', 'chatbots,', 'and'), ('chatbots,', 'and', 'more.'), ('and', 'more.', 'It'), ('more.', 'It',
'involves'), ('It', 'involves', 'several'), ('involves', 'several', 'tasks,'), ('several', 'tasks,', 'such'), ('tasks,',
'such', 'as'), ('such', 'as', 'tokenization,'), ('as', 'tokenization,', 'part-of-speech'), ('tokenization,',
'part-of-speech', 'tagging,'), ('part-of-speech', 'tagging,', 'named'), ('tagging,', 'named', 'entity'), ('named', 'entity',
'recognition,'), ('entity', 'recognition,', 'parsing,'), ('recognition,', 'parsing,', 'and'), ('parsing,', 'and',
'semantic'), ('and', 'semantic', 'analysis.')]

def generate_ngrams(text, n):
    # Split text into words
    words = text.split()
    # Create list to hold n-grams
    ngrams = []
    # Loop through words to create n-grams
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i+n])
        ngrams.append(ngram)
    # Return list of n-grams
    return ngrams

text = (" NLP is used in a wide range of applications, including machine translation, sentiment "
        "analysis, text summarization, speech recognition, chatbots, and more. It involves several tasks, such as "
        "tokenization, part-of-speech tagging, named entity recognition, parsing, and semantic analysis.")
generate_ngrams(text,4)

OUTPUT:
[('NLP', 'is', 'used', 'in'), ('is', 'used', 'in', 'a'), ('used', 'in', 'a', 'wide'), ('in', 'a', 'wide', 'range'), ('a', 'wide',
'range', 'of'), ('wide', 'range', 'of', 'applications,'), ('range', 'of', 'applications,', 'including'), ('of',
'applications,', 'including', 'machine'), ('applications,', 'including', 'machine', 'translation,'), ('including',
'machine', 'translation,', 'sentiment'), ('machine', 'translation,', 'sentiment', 'analysis,'), ('translation,',
'sentiment', 'analysis,', 'text'), ('sentiment', 'analysis,', 'text', 'summarization,'), ('analysis,', 'text',
'summarization,', 'speech'), ('text', 'summarization,', 'speech', 'recognition,'), ('summarization,', 'speech',
'recognition,', 'chatbots,'), ('speech', 'recognition,', 'chatbots,', 'and'), ('recognition,', 'chatbots,', 'and',
'more.'), ('chatbots,', 'and', 'more.', 'It'), ('and', 'more.', 'It', 'involves'), ('more.', 'It', 'involves', 'several'),
('It', 'involves', 'several', 'tasks,'), ('involves', 'several', 'tasks,', 'such'), ('several', 'tasks,', 'such', 'as'),
('tasks,', 'such', 'as', 'tokenization,'), ('such', 'as', 'tokenization,', 'part-of-speech'), ('as', 'tokenization,',
'part-of-speech', 'tagging,'), ('tokenization,', 'part-of-speech', 'tagging,', 'named'), ('part-of-speech',
'tagging,', 'named', 'entity'), ('tagging,', 'named', 'entity', 'recognition,'), ('named', 'entity', 'recognition,',
'parsing,'), ('entity', 'recognition,', 'parsing,', 'and'), ('recognition,', 'parsing,', 'and', 'semantic'),
('parsing,', 'and', 'semantic', 'analysis.')]
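
For the pre-defined part of the aim, NLTK provides nltk.util.ngrams(sequence, n), which yields n-gram tuples from a list of words such as text.split(). The sketch below shows an equivalent standard-library idiom built on zip. The name zip_ngrams is illustrative, and note that unigrams come back as 1-tuples rather than plain strings.

```python
def zip_ngrams(words, n):
    """Return the list of n-grams (as tuples) over a list of words."""
    # Build n shifted views of the list and zip them position by position;
    # zip stops at the shortest view, so only complete n-grams are produced
    return list(zip(*(words[i:] for i in range(n))))

words = "NLP is used in a wide range of applications".split()
print(zip_ngrams(words, 1))
print(zip_ngrams(words, 2))
print(zip_ngrams(words, 3))
```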

AIM: Write a program to calculate the highest probability of a word (w2) occurring after another
word (w1).
DESCRIPTION:
The formula to calculate the probability of a word (w2) occurring after another word (w1) is:
P(w2|w1) = count(w1, w2) / count(w1)
where `count(w1, w2)` is the number of times the bigram (w1, w2) occurs in the corpus, and
`count(w1)` is the number of times the word w1 occurs in the corpus.
To calculate the highest probability of a word (w2) occurring after another word (w1), you need to
find the bigram that has the highest probability among all the bigrams that start with the word w1.
Here are the steps to calculate the highest probability:
1. Count the frequency of each bigram that starts with the word w1.
2. Calculate the probability of each bigram using the above formula.
3. Find the bigram with the highest probability.
Here's an example:
Suppose we have a corpus of text that contains the following sentences:
the cat sat on the mat
the dog chased the cat
the cat chased the mouse
We want to find the highest probability of a word occurring after the word "the".
1. Count the frequency of each bigram that starts with "the" (bigrams are counted within each sentence):
count("the", "cat") = 3
count("the", "dog") = 1
count("the", "mat") = 1
count("the", "mouse") = 1
Here count("the") = 6, because "the" occurs twice in each of the three sentences and is always
followed by another word.
2. Calculate the probability of each bigram:
P("cat"|"the") = count("the", "cat") / count("the") = 3 / 6 = 0.5
P("dog"|"the") = count("the", "dog") / count("the") = 1 / 6 ≈ 0.167
P("mat"|"the") = count("the", "mat") / count("the") = 1 / 6 ≈ 0.167
P("mouse"|"the") = count("the", "mouse") / count("the") = 1 / 6 ≈ 0.167
3. Find the bigram with the highest probability, which is ("the", "cat") in this case.
Therefore, the highest probability of a word occurring after "the" is 0.5, and the word that has this
probability is "cat".
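
The three steps can be sketched in plain Python on the toy corpus from the example. This is a minimal illustration, not the Reuters program that follows: the function name next_word_probabilities is made up for this sketch, and the denominator is taken as the number of times w1 is followed by another word, which equals count(w1) whenever w1 never ends a sentence.

```python
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat chased the mouse",
]

def next_word_probabilities(sentences, w1):
    """Return {w2: P(w2 | w1)} estimated from bigram counts."""
    followers = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Count bigrams inside each sentence only (no pairs across sentences)
        for first, second in zip(words, words[1:]):
            if first == w1:
                followers[second] += 1
    total = sum(followers.values())  # occurrences of w1 that have a successor
    return {w2: count / total for w2, count in followers.items()}

probabilities = next_word_probabilities(sentences, "the")
best_word = max(probabilities, key=probabilities.get)
print(best_word, probabilities[best_word])
```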

PROGRAM:
import nltk
nltk.download('reuters')
nltk.download('punkt')
OUTPUT:
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True

import nltk
from nltk.corpus import reuters
from nltk.util import ngrams
from nltk.probability import FreqDist
# Load the Reuters corpus
corpus = reuters.sents()
# Get all bigrams in the corpus
bigrams = list(ngrams([word.lower() for sentence in corpus for word in sentence], 2))
# Calculate the frequency of each bigram
freq_dist = FreqDist(bigrams)
# Define the starting word (w1)
w1 = 'the'
# Get all bigrams starting with w1
w1_bigrams = [(w1, b[1]) for b in bigrams if b[0] == w1]
# Get the bigram with the highest frequency
max_freq = 0
max_bigram = None
for bigram in w1_bigrams:
    freq = freq_dist[bigram]
    if freq > max_freq:
        max_freq = freq
        max_bigram = bigram
# Print the bigram with the highest frequency
print(corpus)
print("The bigram with the highest frequency after 'the' is:", max_bigram)
print("It occurs", max_freq, "times in the corpus.")
OUTPUT:
[['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting',
'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of',
'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic',
'damage', ',', 'businessmen', 'and', 'officials', 'said', '.'], ['They', 'told', 'Reuter', 'correspondents', 'in',
'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment',
'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.'], ...]
The bigram with the highest frequency after 'the' is: ('the', 'company')
It occurs 3126 times in the corpus.

WEEK-4
AIM: (i) Write a program to identify the word collocations.
(ii) Write a program to print all words beginning with a given sequence of letters.
(iii) Write a program to print all words longer than four characters.
DESCRIPTION:
Word collocations are pairs or groups of words that commonly occur together in a language. They are
statistically associated with one another because they co-occur more often than chance would predict.
Collocations are important in language learning and understanding because they help learners see
how words are used in context.
There are two types of collocations: grammatical collocations and lexical collocations. Grammatical
collocations pair a word with a specific grammatical element such as a preposition, as in "depend
on" or "interested in." Lexical collocations pair content words that commonly occur together, such as
"make a decision," "strong coffee," or "heavy rain."
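
To make "statistically associated" concrete, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), computed as log2(P(w1, w2) / (P(w1) P(w2))). This is a different association measure from the likelihood ratio used in the program for this week, chosen here only because it is short to write by hand. The function name pmi_collocations, the min_count threshold and the sample tokens are illustrative assumptions.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    # Highest PMI first
    return sorted(scored, key=lambda item: item[1], reverse=True)

tokens = ("strong coffee and heavy rain , strong coffee and the rain , "
          "the coffee and the heavy rain").split()
for pair, score in pmi_collocations(tokens):
    print(pair, round(score, 2))
```

A pair such as ("strong", "coffee") scores higher than ("and", "the") because its words rarely appear apart from each other. On a real corpus a frequency filter like min_count matters, since PMI tends to overrate pairs that occur only once or twice.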
PROGRAM:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.corpus import brown, stopwords
words = brown.words()
stopset = set(stopwords.words('english'))
# Ignore short words and stop words
filter_stops = lambda w: len(w) < 3 or w in stopset
# Bigram collocations: keep pairs seen at least 3 times, ranked by likelihood ratio
bigram_collocation = BigramCollocationFinder.from_words(words)
bigram_collocation.apply_word_filter(filter_stops)
bigram_collocation.apply_freq_filter(3)
print(bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15))
# Trigram collocations, filtered and ranked the same way
trigram_collocation = TrigramCollocationFinder.from_words(words)
trigram_collocation.apply_word_filter(filter_stops)
trigram_collocation.apply_freq_filter(3)
print(trigram_collocation.nbest(TrigramAssocMeasures.likelihood_ratio, 15))
# to print all words beginning with a given sequence of letters.
test_list = brown.words()
check = "len"
res = [idx for idx in test_list if idx.lower().startswith(check.lower())]
print("The list of words matching first sequence of letters : " + str(set(res)))
# to print all words longer than four characters (here, among the matches in res).
res2 = [wrd for wrd in res if len(wrd) > 4]
print("list of words of length greater than 4 : ", str(set(res2)))
OUTPUT:
The list of words matching first sequence of letters : {'Leninism-Marxism', 'Lena', 'Leningrad-Kirov',
'lending', 'length', 'Lenygon', "Lenygon's", 'lengthened', 'Lenin', 'Lenny', 'Lendrum', 'Leni',
"Lenobel's", 'lends', 'Lennie', 'lengthen', "Leningrad's", "Lenin's", 'lengthy', 'lens', 'lengthening', 'lent',
'lentils', 'Leningrad', 'lend', 'lengthwise', 'Lend', 'Len', 'lenient', 'lengthily', 'lenses', 'Length', 'lengths'}
list of words of length greater than 4 : {'Leninism-Marxism', 'Leningrad-Kirov', 'lending', 'length',
'Lenygon', "Lenygon's", 'lengthened', 'Lenin', 'Lenny', 'Lendrum', "Lenobel's", 'Lennie', 'lends',
'lengthen', "Leningrad's", "Lenin's", 'lengthy', 'lengthening', 'lentils', 'Leningrad', 'lengthwise', 'lenient',
'lengthily', 'lenses', 'Length', 'lengths'}

WEEK-5
AIM: (i) Write a program to identify the mathematical expression in a given sentence.
(ii) Write a program to identify different components of an email address.

DESCRIPTION:
Here are some general steps you can take to identify mathematical expressions in a sentence:
1. Tokenize the sentence: Use an NLP library like spaCy, NLTK, or TextBlob to break down the
sentence into individual words.
2. Identify mathematical operators: Look for keywords or symbols that indicate mathematical
operations, such as +, -, *, /, ^, =, <, >, etc. These are often used in combination with numerical or
variable values.
3. Identify numerical values: Look for words or symbols that indicate numerical values, such as
digits (0-9), decimal points, fractions, percentages, etc.
4. Identify variable names: Look for words or symbols that indicate variable names, such as x, y, z,
a, b, c, etc.
5. Combine the identified elements: Once you have identified the mathematical operators, numerical
values, and variable names, you can combine them into mathematical expressions.
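
Steps 2 to 4 can be sketched with a single regular expression that tags each token by type. This is a rough heuristic under stated assumptions, not a full parser: the names TOKEN_PATTERN and math_tokens are illustrative, and a variable is assumed to be a single letter standing on its own.

```python
import re

# One named group per token type described in the steps above
TOKEN_PATTERN = re.compile(
    r'(?P<NUMBER>\d*\.?\d+)'        # integers and decimals such as 2, 3.5, .25
    r'|(?P<VARIABLE>\b[a-zA-Z]\b)'  # a single letter standing alone
    r'|(?P<OPERATOR>[-+*/^=<>])'    # arithmetic and comparison symbols
)

def math_tokens(sentence):
    """Return (token_type, text) pairs for the math-like tokens in a sentence."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(sentence)]

print(math_tokens("Solve x = 2*y + 3.5 today"))
```

The heuristic has clear limits: single-letter words such as "a" or "I" are tagged as variables, and the hyphen in a word like "well-known" is tagged as an operator, so a real system would need extra context checks.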

In order to identify the different components of an email address using NLP, we need to first
understand the structure of an email address. An email address typically consists of two main parts:
the username and the domain name. The username is the name of the user who owns the email
address, while the domain name is the name of the email provider that hosts the user's email account.
Here are some steps you can take to identify these components using NLP:
1. Tokenize the email address: Use an NLP library like spaCy, NLTK, or TextBlob to break down the
email address into individual tokens.
2. Identify the "@" symbol: Look for the "@" symbol, which separates the username from the domain
name.
3. Identify the username: The username is the part of the email address before the "@" symbol. It
may contain additional characters, such as dots, underscores, or dashes. Use heuristics or regular
expressions to identify the username.
4. Identify the domain name: The domain name is the part of the email address after the "@"
symbol. It can also include additional subdomains, such as "mail" or "calendar". Use heuristics or
regular expressions to identify the domain name.
PROGRAM:
# to identify the mathematical expression in a given sentence.
import re
sentence = "The expression is 2+3+4-2/3 3+3+3 and this is -2 .33 2 +.2"
regex = r'[-+]?\d*\.?\d+(?:[-+*/]\d*\.?\d+)*'
expressions = re.findall(regex, sentence)
print(expressions)

OUTPUT:
['2+3+4-2/3', '3+3+3', '-2', '.33', '2', '+.2']

# to identify different components of an email address.
import re

def identify_email_components(email):
    # Define a regular expression to identify the different components of an email address.
    # The \w metacharacter matches word characters: a-z, A-Z, 0-9, or underscore.
    email_regex = r'([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)'
    # Match the regular expression against the email address
    match = re.match(email_regex, email)
    if match:
        username = match.group(1)
        domain = match.group(2)
        extension = match.group(3)
        return {"username": username, "domain": domain, "extension": extension}
    else:
        # If the email address doesn't match the regular expression, return None
        return None

# Example
email = "madeena4251@gmail.com"
components = identify_email_components(email)
print(components)
OUTPUT:
{'username': 'madeena4251', 'domain': 'gmail', 'extension': '.com'}
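
With the regular expression above, an address whose host has several labels is split at the last dot, so for a hypothetical address such as user@mail.example.co.uk the domain group would be "mail.example.co" and the extension ".uk". The sketch below is an alternative using plain string methods that keeps every label separately. The function name split_email, the returned keys and the sample address are illustrative, and the check is far looser than real email validation.

```python
def split_email(email):
    """Split an address into username, domain labels and top-level domain."""
    # rpartition splits at the last "@", so the username keeps any earlier text
    username, at_sign, domain = email.rpartition("@")
    if not at_sign or not username or "." not in domain:
        return None  # not in the simple user@host.tld form assumed here
    labels = domain.split(".")
    if any(label == "" for label in labels):
        return None  # empty label, e.g. "user@example..com"
    return {
        "username": username,
        "domain": domain,
        "labels": labels[:-1],
        "tld": labels[-1],
    }

print(split_email("user@mail.example.co.uk"))
```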

WEEK-6
AIM: (i) Write a program to identify all antonyms and synonyms of a word.
(ii) Write a program to find hyponymy, homonymy, polysemy for a given word.

DESCRIPTION:
Antonyms:
Antonyms are words with opposite meanings. For example, hot and cold, good and bad, love
and hate are pairs of antonyms.
Synonyms:
Synonyms are words with similar meanings. For example, big and large, happy and joyful, eat
and consume are pairs of synonyms. Antonyms and synonyms are important for improving
vocabulary and understanding the nuances of language.
Hyponymy:
Hyponymy is a relationship between words where one word is a subtype of another word. For
example, car is a hyponym of vehicle, as car is a type of vehicle. Similarly, rose is a hyponym of
flower, as rose is a type of flower. Hyponymy is important for organizing and categorizing
knowledge, and for understanding the hierarchical structure of language.
Homonymy:
Homonymy is a relationship between words where two or more words have the same
pronunciation and/or spelling but have different meanings. For example, bank can mean a financial
institution or the side of a river. Homonyms can be homophones, which have the same pronunciation
but different meanings (e.g. flower and flour), or homographs, which have the same spelling but
different meanings (e.g. bow can mean a knot or a weapon).
Polysemy:
 Polysemy is a relationship in which one word has multiple related meanings. For example, the
word bank can mean a financial institution, the building that houses it, or a place where something
is stored (as in blood bank). Polysemy is important for understanding the complexity and flexibility
of language, and for interpreting the meaning of words in different contexts.
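
Hyponymy is hierarchical: if taxi is a hyponym of car and car is a hyponym of vehicle, then taxi is also a kind of vehicle. The sketch below walks a small hand-made hierarchy to collect direct and indirect hyponyms. The HYPONYMS table and the function name all_hyponyms are hypothetical illustration data, not WordNet content; in WordNet, synset.hyponyms() returns only the direct hyponyms of one sense.

```python
# Hypothetical is-a hierarchy: each word maps to its direct hyponyms
HYPONYMS = {
    "vehicle": ["car", "bicycle"],
    "car": ["taxi", "convertible"],
    "flower": ["rose", "tulip"],
}

def all_hyponyms(word):
    """Collect direct and indirect hyponyms of a word, depth first."""
    found = []
    for child in HYPONYMS.get(word, []):
        found.append(child)
        found.extend(all_hyponyms(child))  # a hyponym of a hyponym also counts
    return found

print(all_hyponyms("vehicle"))
print(all_hyponyms("rose"))
```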
PROGRAM:
# to identify all antonyms and synonyms of a word.
import nltk
from nltk.corpus import wordnet
# Define the word to find synonyms and antonyms for
word = input("enter a word: ")
# Find all synonyms of the word
synonyms = []
for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        if lemma.name() != word:
            synonyms.append(lemma.name())
# Find all antonyms of the word
antonyms = []
for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        for antonym in lemma.antonyms():
            antonyms.append(antonym.name())
# Print the synonyms and antonyms
print("Synonyms of", word + ":", set(synonyms))
print("Antonyms of", word + ":", set(antonyms))
OUTPUT:
enter a word: happy
Synonyms of happy: {'well-chosen', 'felicitous', 'glad'}
Antonyms of happy: {'unhappy'}

# to find hyponymy, homonymy, polysemy for a given word.
from nltk.corpus import wordnet
# Define the word
word = input("enter a word:")
# Find the direct hyponyms of every sense of the word
hyponyms = set()
for synset in wordnet.synsets(word):
    for hyponym in synset.hyponyms():
        hyponyms.add(hyponym.lemmas()[0].name())
# WordNet has no explicit homonymy relation; as an approximation, collect the
# other lemma names that share a synset with the word
homonyms = set()
for synset in wordnet.synsets(word):
    for lemma in synset.lemmas():
        if lemma.name() != word:
            homonyms.add(lemma.name())
# Find the number of senses (polysemy) of the word
senses = len(wordnet.synsets(word))
# Print the results
print("Hyponyms of", word + ":", hyponyms)
print("Homonyms of", word + ":", homonyms)
print("Polysemy of", word + ":", senses)
OUTPUT:
enter a word:bank
Hyponyms of bank: {'food_bank', 'vertical_bank', 'sandbank', 'agent_bank', 'credit', 'state_bank',
'piggy_bank', 'thrift_institution', 'member_bank', 'lean', 'riverbank', 'credit_union', 'count', 'lead_bank',
'merchant_bank', 'soil_bank', 'blood_bank', 'bluff', 'Federal_Reserve_Bank', 'redeposit', 'waterside',
'acquirer', 'eye_bank', 'Home_Loan_Bank', 'commercial_bank'}
Homonyms of bank: {'camber', 'savings_bank', 'depository_financial_institution', 'trust', 'swear', 'cant',
'bank_building', 'rely', 'banking_concern', 'money_box', 'deposit', 'coin_bank', 'banking_company'}
Polysemy of bank: 18

WEEK-7
AIM: (i) Write a program to find all the stop words in any given text.
(ii) Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.

DESCRIPTION:
Stopwords are common words in a language that are typically removed from text data during natural
language processing to focus on the more meaningful words. These words include prepositions,
conjunctions, pronouns, and other commonly used words that do not carry significant meaning in the
context of the text. Examples of stopwords in English include "the", "a", "an", "and", "in", "on",
"with", "for", "to", "of", "that", "this", "is", "was", and "were".
Removing stopwords can help to reduce the size of the text data and improve the efficiency of natural
language processing tasks such as text classification and sentiment analysis. However, it is important
to note that the list of stopwords may vary depending on the context and domain of the text being
analyzed. For example, certain domain-specific terms may be considered stopwords in one domain
but not in another. Therefore, it is important to carefully select the list of stopwords based on the
specific task and domain of the text data.
PROGRAM:
# to find all the stop words in any given text
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.corpus import brown
# Get list of English stop words
stop_words = set(stopwords.words('english'))
# Load the tokenized Brown corpus (words keep their original case)
words = brown.words()
# Keep only the words that are stop words
stopWords = [word for word in words if word in stop_words]
# Print the distinct stop words found
print(set(stopWords))
OUTPUT:
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
{'a', 'haven', "weren't", 'am', 'you', 'no', 'are', 'just', 'most', 'too', 'during', "doesn't", 'who', 'o', 'should',
'that', 'why', 'to', 'about', 'then', 'there', 'him', 'below', 'than', 'before', 'we', 'so', "isn't", 'his', 'does',
'between', 'through', 'my', 'an', 'ourselves', 'her', 'further', "haven't", 'again', 'y', 'didn', 'because', 'other',
'been', 'once', 'be', 'off', 'don', "wasn't", 'our', 'but', 'your', 'was', 'them', 'were', 'into', 'now', 'theirs', 'and',
'have', 'hers', 'where', 'ours', "couldn't", "aren't", 'while', 'by', 'has', 'they', 'or', 'having', "won't",
"you've", 'herself', 'is', 'will', 'ma', 'after', "shouldn't", 't', 'won', 'how', 'had', 'all', 'this', 'some',
"wouldn't", "hasn't", 'she', 'down', 're', 'in', 'the', 'few', 'at', 'only', 'against', 'yours', 'here', "it's", 'own',
"don't", 'each', 'doing', 'both', 'themselves', "you're", 'yourselves', 'it', 'until', 'with', 'nor', 'above',
'himself', 'what', 'not', "mustn't", 'whom', "shan't", 'being', 'yourself', 'for', 'under', 'more', 'when', 'me',
'those', 'i', 'their', 'out', 'as', 'such', 'myself', 'same', 'these', 'any', "you'll", 'did', 'do', "didn't", 'over',
"you'd", 'up', 'very', 'from', 'can', 'if', "hadn't", 'on', 'its', 'itself', 'which', 'of', 'he', "she's"}

# to find the 50 most frequently occurring words of a text that are not stopwords.
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.corpus import brown
stop_words = set(stopwords.words('english'))
words = brown.words()
filtered_words = [word for word in words if word not in stop_words]
word_freq = Counter(filtered_words)
print(words)
print(word_freq.most_common(50))
OUTPUT:
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
[(',', 58334), ('.', 49346), ('``', 8837), ("''", 8789), ('The', 7258), (';', 5566), ('I', 5161), ('?', 4693), ('--',
3432), ('He', 2982), ('one', 2873), ('would', 2677), (')', 2466), ('(', 2435), ('It', 2037), ('said', 1943), ('In',
1801), (':', 1795), ('!', 1596), ('could', 1580), ('time', 1556), ('But', 1374), ('A', 1314), ('two', 1311),
('may', 1292), ('first', 1242), ('like', 1237), ('This', 1179), ('man', 1151), ('made', 1122), ('new', 1060),
('must', 1003), ('also', 999), ('Af', 995), ('even', 985), ('back', 950), ('years', 943), ('And', 938), ('many',
925), ('She', 911), ('much', 900), ('way', 892), ('There', 851), ('They', 847), ('Mr.', 844), ('people', 811),
('little', 788), ('make', 768), ('good', 767), ('well', 757)]
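Note that the output above still contains punctuation marks and capitalized stop words such as 'The' and 'He': the stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of one way to address this is shown below. It assumes a tiny hand-written stop list and token sample in place of the NLTK data (both hypothetical), lowercases each token, and keeps only alphabetic tokens.

```python
from collections import Counter

# Hypothetical mini stop list and token sample; the program above uses
# stopwords.words('english') and brown.words() instead.
stop_words = {"the", "of", "and", "a", "he", "it", "in"}
tokens = ["The", "jury", "said", ",", "the", "jury", "of",
          "Fulton", ".", "He", "said", "it", "``"]

def content_words(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in stop_words]

word_freq = Counter(content_words(tokens, stop_words))
print(word_freq.most_common(3))
```

The isalpha() test is a deliberate simplification: it also drops tokens such as 'Mr.' that contain a period.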

WEEK-8
AIM: Write a program to implement various stemming techniques and prepare a chart with the
performance of each method.
DESCRIPTION:
Stemming is a natural language processing technique that reduces inflected words to their
root forms, which helps normalize text, words, and documents during preprocessing.

Types of Stemmer in NLTK

1. Porter Stemmer: Martin Porter published the Porter Stemmer, or Porter algorithm, in 1980. The
method reduces a word in five steps, each with its own set of mapping rules. The Porter Stemmer is
one of the oldest and most widely used stemmers and is known for its simplicity and speed. The
resulting stem is usually a shorter string that shares the root meaning but need not be a dictionary word.

2. Snowball Stemmer: Martin Porter also created the Snowball Stemmer. The English version used in
this exercise is also referred to as the “English Stemmer” or “Porter2 Stemmer.” It refines the
original Porter algorithm and is regarded as somewhat faster and more consistent.

3. Lancaster Stemmer: The Lancaster Stemmer is straightforward, but it often stems too aggressively.
Over-stemming produces stems that are non-linguistic or meaningless.

4. Regexp Stemmer: The Regexp Stemmer identifies morphological affixes using regular expressions.
Substrings that match the regular expressions are discarded.
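The regular-expression approach can be sketched with Python's standard re module. This is only an illustration of the idea, not NLTK's implementation: the function name regexp_stem and its min_len parameter are assumptions made for this sketch, and the pattern is the one used in the Regexp Stemmer program of this exercise.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    """Remove a suffix that matches `pattern` (hypothetical helper)."""
    # Words shorter than min_len are returned unchanged.
    if len(word) < min_len:
        return word
    # Delete the part of the word that matches the pattern.
    return re.sub(pattern, "", word)

for word in ["mass", "was", "bee", "computer", "advisable"]:
    print(word, "--->", regexp_stem(word))
```

Because the rule is purely textual, "mass" loses its final "s" even though it is not a plural, which shows how easily such a stemmer over-stems.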

PROGRAM:
# PORTER STEMMER
from nltk.stem import PorterStemmer
porter = PorterStemmer()
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",porter.stem(word))
OUTPUT:
generous ---> gener
generate ---> gener
generously ---> gener
generation ---> gener

#SNOWBALL STEMMER
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",snowball.stem(word))
OUTPUT:
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat

# LANCASTER STEMMER
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
words = ['eating','eats','eaten','puts','putting']
for word in words:
    print(word,"--->",lancaster.stem(word))
OUTPUT:
eating ---> eat
eats ---> eat
eaten ---> eat
puts ---> put
putting ---> put

# REGEXP STEMMER
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass','was','bee','computer','advisable']
for word in words:
    print(word,"--->",regexp.stem(word))

OUTPUT:
mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis

# COMPARISON BETWEEN STEMMERS
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ["friend", "friendship", "friends", "friendships"]
# One fixed-width column per stemmer
row = "{0:20}{1:20}{2:20}{3:30}{4:40}"
print(row.format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print(row.format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))
OUTPUT:
Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer
friend              friend              friend              friend                        friend
friendship          friendship          friendship          friend                        friendship
friends             friend              friend              friend                        friend
friendships         friendship          friendship          friend                        friendship
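One simple way to turn the table above into a number for a chart is to count how many distinct stems each stemmer produces for the same word list: fewer distinct stems means more aggressive conflation. The sketch below is one possible measure, not the only definition of stemmer performance; the variable name distinct_stems is an assumption made for this illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}
word_list = ["friend", "friendship", "friends", "friendships"]

# Number of distinct stems each stemmer produces for the same words.
distinct_stems = {
    name: len({stemmer.stem(word) for word in word_list})
    for name, stemmer in stemmers.items()
}
print(distinct_stems)
```

The resulting counts can be passed to a bar chart in the same way as any other per-stemmer measurement.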

WEEK-9
AIM: Write a program to implement various lemmatization techniques and prepare a chart with the
performance of each method.
DESCRIPTION:
There are several lemmatization techniques used in natural language processing (NLP), including:
1. WordNetLemmatizer: This is a lemmatization technique provided by the NLTK library, which uses
WordNet, a lexical database of English, to find the base form of a word based on its part of speech.
2. spaCy: This is a popular library for NLP that provides its own lemmatization functionality. In its
English pipelines the lemmatizer applies rules and lookup tables to the part-of-speech tags predicted
by a statistical model, and the library supports a wide range of languages.
3. StanfordNLP (now released as Stanza): This is another NLP library that provides lemmatization
functionality. Its lemmatizer combines a dictionary lookup with a neural model trained on annotated text.
4. Pattern: This is a Python library for NLP that provides lemmatization functionality. Pattern uses a
set of rules to find the base form of a word based on its part of speech.
5. TextBlob: This is a library for NLP that provides a simple API for lemmatization. The
lemmatization in TextBlob is based on the WordNetLemmatizer provided by NLTK.
Each of these lemmatization techniques has its own strengths and weaknesses, and the best technique
to use will depend on the specific requirements of your NLP task.
PROGRAM:

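# Install the Pattern library first (run in a shell, or prefix with "!" in a notebook):
# pip install pattern
import time
import spacy
from textblob import Word
from pattern.en import lemma
words=['watching','eating','sleeping','series','scrolling','phone']
nlp=spacy.load('en_core_web_sm')
def run_lem(lemmatizer,words):
    # Apply the lemmatizer to every word and measure the elapsed time in seconds
    start=time.time()
    lemm=[lemmatizer(word) for word in words]
    end=time.time()
    tim=end-start
    return lemm,tim
print("comparing")
print()
print("wordnet:")
# TextBlob's Word.lemmatize() wraps NLTK's WordNetLemmatizer. No part of speech is
# passed, so every word is treated as a noun and forms such as 'watching' stay unchanged.
word_lem,word_time=run_lem(lambda word:Word(word).lemmatize(),words)
print(word_lem)
print(word_time)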
OUTPUT:
comparing
wordnet:
['watching', 'eating', 'sleeping', 'series', 'scrolling', 'phone']
0.00022935867309570312

print("spacy:")
nlp_lem,nlp_time=run_lem(lambda word:nlp(word)[0].lemma_,words)
print(nlp_lem)
print(nlp_time)
OUTPUT:
spacy:
['watch', 'eat', 'sleep', 'series', 'scroll', 'phone']
0.058435678482055664

print("textblob:")
blob_lem,blob_time=run_lem(lambda word:Word(word).lemmatize(),words)
print(blob_lem)
print(blob_time)
OUTPUT:
textblob:
['watching', 'eating', 'sleeping', 'series', 'scrolling', 'phone']
0.00014925003051757812

print("pattern:")
pat_lem,pat_tim=run_lem(lemma,words)
print(pat_lem)
print(pat_tim)
OUTPUT:
pattern:
['watch', 'eat', 'sleep', 'sery', 'scroll', 'phone']
4.863739013671875e-05

import matplotlib.pyplot as plt
# Named "times" so that the list does not shadow the time module imported earlier
times=[word_time,nlp_time,blob_time,pat_tim]
labels=['word_lem','nlp_lem','blob_lem','pat_lem']
plt.bar(labels,times)
OUTPUT:
<BarContainer object of 4 artists>
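The timings charted above come from a single pass over six words, so they are easily distorted by one-off costs such as loading a model or dictionary on the first call. A minimal sketch of a steadier measurement is given below. It repeats the pass many times and uses time.perf_counter; the function time_lemmatizer, the repeats parameter, and the lookup-table lemmatizer are hypothetical stand-ins so that the sketch runs without any NLP library.

```python
import time

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
TOY_LEMMAS = {"watching": "watch", "eating": "eat", "sleeping": "sleep"}

def toy_lemmatizer(word):
    """Return the table entry for `word`, or the word itself if absent."""
    return TOY_LEMMAS.get(word, word)

def time_lemmatizer(lemmatizer, words, repeats=1000):
    """Return the lemmas and the average seconds per pass over `words`."""
    start = time.perf_counter()
    for _ in range(repeats):
        lemmas = [lemmatizer(word) for word in words]
    elapsed = time.perf_counter() - start
    return lemmas, elapsed / repeats

words = ["watching", "eating", "sleeping", "series", "scrolling", "phone"]
lemmas, seconds = time_lemmatizer(toy_lemmatizer, words)
print(lemmas, seconds)
```

Any of the lemmatizer callables used in this exercise could be passed in place of toy_lemmatizer.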

WEEK-10
AIM: Write a program to implement TF-IDF for any corpus.
DESCRIPTION:
TF-IDF stands for "Term Frequency-Inverse Document Frequency" and is a statistical technique used
in natural language processing (NLP) to measure the importance of a word in a document or corpus.
TF-IDF assigns a weight to each word in a document based on two factors: term frequency (TF) and
inverse document frequency (IDF).
Term frequency (TF) measures the frequency of a term (word) in a document. It is commonly calculated
by dividing the number of times a term appears in a document by the total number of words in the
document; the program below uses the raw count instead. This calculation helps to identify the most
frequently occurring words in a document.
Inverse document frequency (IDF) measures how rare a word is across all documents in a corpus. It is
calculated by dividing the total number of documents in the corpus by the number of documents that
contain the word and taking the logarithm of the result. This calculation helps to identify the words
that are unique or important to a particular document, and not commonly used in other documents.
The TF-IDF score for a word is the product of its TF and IDF values. This score indicates the
importance of a word in a document, with higher scores indicating greater importance. TF-IDF is
commonly used in applications such as text classification, search engines, and information retrieval. It
helps to identify the most relevant documents or passages based on the keywords used in a query or
search.
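As a worked example of the arithmetic, consider the term "document" in the second sentence of the corpus used in the program that follows: it occurs twice among the seven tokens of that sentence and appears in three of the four sentences. The helper tf_idf and its normalize flag are assumptions made for this sketch; the raw-count variant corresponds to what the program computes.

```python
import math

def tf_idf(raw_count, doc_length, n_docs, doc_freq, normalize=False):
    """TF-IDF with a natural-log IDF; TF is a raw count unless normalize=True."""
    tf = raw_count / doc_length if normalize else raw_count
    idf = math.log(n_docs / doc_freq)
    return tf * idf

# "document": 2 occurrences in a 7-token sentence, present in 3 of 4 sentences.
raw_score = tf_idf(raw_count=2, doc_length=7, n_docs=4, doc_freq=3)
normalized_score = tf_idf(raw_count=2, doc_length=7, n_docs=4, doc_freq=3,
                          normalize=True)
print(raw_score, normalized_score)
```

A term that appears in every document gets an IDF of log(1) = 0, which is why words such as "this" and "the" score 0.0 in this corpus.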
PROGRAM:
import math
import nltk
corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]
# Tokenize the corpus
tokenized_corpus = [nltk.word_tokenize(doc.lower()) for doc in corpus]
# Calculate document frequency (df) for each term in the corpus
df = {}
for doc in tokenized_corpus:
    for term in set(doc):
        if term not in df:
            df[term] = 1
        else:
            df[term] += 1
# Calculate inverse document frequency (idf) for each term in the corpus
N = len(tokenized_corpus)
idf = {}
for term in df:
    idf[term] = math.log(N / df[term])
# Calculate term frequency (tf) for each term in each document
tf = {}
for i, doc in enumerate(tokenized_corpus):
    tf[i] = {}
    for term in doc:
        if term not in tf[i]:
            tf[i][term] = 1
        else:
            tf[i][term] += 1
# Calculate TF-IDF score for each term in each document
tfidf = {}
for i, doc in enumerate(tokenized_corpus):
    tfidf[i] = {}
    for term in doc:
        tfidf[i][term] = tf[i][term] * idf[term]
# Print the TF-IDF scores for each document
for i, doc in enumerate(tokenized_corpus):
    print(f"Document {i}:")
    for term in tfidf[i]:
        print(f" {term}: {tfidf[i][term]}")
OUTPUT:
Document 0:
this: 0.0
is: 0.0
the: 0.0
first: 0.6931471805599453
document: 0.28768207245178085
.: 0.28768207245178085
Document 1:
this: 0.0
document: 0.5753641449035617
is: 0.0
the: 0.0
second: 1.3862943611198906
.: 0.28768207245178085
Document 2:
and: 1.3862943611198906
this: 0.0
is: 0.0
the: 0.0
third: 1.3862943611198906
one: 1.3862943611198906
.: 0.28768207245178085
Document 3:
is: 0.0
this: 0.0
the: 0.0
first: 0.6931471805599453
document: 0.28768207245178085
?: 1.3862943611198906

WEEK-11
AIM: Write a program to implement chunking and chinking for any corpus.
DESCRIPTION: Chunking and chinking are two terms used in natural language processing (NLP)
for different purposes.
Chunking is a process of grouping or chunking together linguistic units such as words, phrases, or
other parts of speech based on specific patterns and rules. The goal of chunking is to identify
meaningful information in a sentence and to extract relevant information for further analysis. For
example, in the sentence "The cat chased the mouse," a chunker might identify "the cat" as a noun
phrase and "chased the mouse" as a verb phrase.
On the other hand, chinking is the opposite of chunking. It involves removing certain parts of a chunk
or a phrase that do not fit a particular pattern or are not relevant for further analysis. Chinking is often
used in conjunction with chunking, as it helps to refine the results obtained from the chunker. For
example, in the sentence "The cat chased the mouse," a chinker might remove the determiner "the"
from the noun phrase "the cat" if it is not necessary for further analysis.
In summary, chunking identifies relevant chunks or phrases in a sentence, while chinking
removes irrelevant or unwanted parts from those chunks. Both techniques are
used extensively in natural language processing for applications such as information
extraction, sentiment analysis, and machine translation.
PROGRAM:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
corpus = "The quick brown fox jumped over the lazy dog."
tokens = nltk.word_tokenize(corpus)
# Define the grammar for chunking
grammar = r"""
NP: {<DT>?<JJ>*<NN>} # chunking for noun phrases
VP: {<VB.*><NP|PP|CLAUSE>+$} # chunking for verb phrases"""
chinkgrammar = r"""
NP: {<.*>+} # chinking for everything
}<VBZ|VBP|VB|MD>{ # except verbs"""
chunk_parser = nltk.RegexpParser(grammar)
chink_parser = nltk.RegexpParser(chinkgrammar)
chunked = chunk_parser.parse(nltk.pos_tag(tokens))
chinked = chink_parser.parse(chunked)
print(chunked)
print(chinked)
OUTPUT:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
jumped/VBD
over/IN
(NP the/DT lazy/JJ dog/NN)
./.)
(S
(NP
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
jumped/VBD
over/IN
(NP the/DT lazy/JJ dog/NN)
./.))
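In the output above, the second tree is simply the first tree wrapped in one NP: the chink rule lists VBZ, VBP, VB and MD, but "jumped" is tagged VBD, so nothing is removed. The following is a minimal sketch of a chink rule that does fire. It assumes the tags shown in the output above and supplies them by hand (without the final period) so that no tagger download is needed; the chink pattern <VBD|IN>+ is chosen for this sentence only.

```python
import nltk

# Tags copied by hand from the output above; the final period is left out.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

grammar = r"""
NP:
    {<.*>+}        # chunk every token into one NP
    }<VBD|IN>+{    # chink (remove) past-tense verbs and prepositions
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the words of each NP that remains after chinking.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

Removing the verb and the preposition splits the single all-inclusive chunk into two noun phrases.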

WEEK-12
AIM: Write a program to implement all the NLP Pre-Processing Techniques required to perform
further NLP tasks.
DESCRIPTION:
Natural language processing (NLP) preprocessing is the process of cleaning and preparing text data
for further analysis. The following are the common preprocessing steps in NLP:
1. Text Cleaning: This step involves removing unwanted characters, punctuation marks, special
symbols, and other non-textual data from the text data.
2. Tokenization: This is the process of splitting the text data into individual words or tokens.
Tokenization can be done at the sentence level or the word level.
3. Stopword removal: Stopwords are common words that do not carry significant meaning in the
context of the text. These words can be removed to reduce the size of the text data and improve the
efficiency of NLP algorithms.
4. Stemming or Lemmatization: This step involves reducing words to their base or root form.
Stemming strips suffixes from words by rule, while lemmatization maps each word to its
dictionary form using its part of speech and context.
By applying these preprocessing steps, the text data can be transformed into a format that is suitable
for further NLP analysis and modeling. The preprocessing steps may vary depending on the specific
task and domain of the text data.

PROGRAM:
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Needs the NLTK resources 'punkt', 'stopwords' and 'wordnet' (fetch them once with nltk.download)
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    print("after lower casing:",text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    print("after removing punctuations:",text)
    # Tokenize the text
    tokens = word_tokenize(text)
    print("after tokenizing:",tokens)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    print("after removing stopwords:",tokens)
    # Lemmatize the text
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    print("after lemmatization:",tokens)
    # Join the tokens back into a string
    text = ' '.join(tokens)
    print("finally the joined text is:")
    return text
text = ("Natural language processing (NLP) is a field of study focused on the interactions between "
        "human language and computers. It involves the development of algorithms and models that can "
        "understand, analyze, and generate natural language. NLP is used in a variety of applications, "
        "including machine translation, sentiment analysis, text classification, chatbots, and speech "
        "recognition. The field of NLP is constantly evolving, and new techniques and models are being "
        "developed to improve the accuracy and efficiency of natural language processing tasks.")
print("original text is: ",text)
# Print the returned string so that it also appears when the program is run as a script
print(preprocess(text))
OUTPUT:
original text is: Natural language processing (NLP) is a field of study focused on the interactions
between human language and computers. It involves the development of algorithms and models that
can understand, analyze, and generate natural language. NLP is used in a variety of applications,
including machine translation, sentiment analysis, text classification, chatbots, and speech
recognition. The field of NLP is constantly evolving, and new techniques and models are being
developed to improve the accuracy and efficiency of natural language processing tasks.
after lower casing: natural language processing (nlp) is a field of study focused on the interactions
between human language and computers. it involves the development of algorithms and models that
can understand, analyze, and generate natural language. nlp is used in a variety of applications,
including machine translation, sentiment analysis, text classification, chatbots, and speech
recognition. the field of nlp is constantly evolving, and new techniques and models are being
developed to improve the accuracy and efficiency of natural language processing tasks.
after removing punctuations: natural language processing nlp is a field of study focused on the
interactions between human language and computers it involves the development of algorithms and
models that can understand analyze and generate natural language nlp is used in a variety of
applications including machine translation sentiment analysis text classification chatbots and speech
recognition the field of nlp is constantly evolving and new techniques and models are being developed
to improve the accuracy and efficiency of natural language processing tasks
after tokenizing: ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'study', 'focused', 'on', 'the',
'interactions', 'between', 'human', 'language', 'and', 'computers', 'it', 'involves', 'the', 'development', 'of',
'algorithms', 'and', 'models', 'that', 'can', 'understand', 'analyze', 'and', 'generate', 'natural', 'language',
'nlp', 'is', 'used', 'in', 'a', 'variety', 'of', 'applications', 'including', 'machine', 'translation', 'sentiment',
'analysis', 'text', 'classification', 'chatbots', 'and', 'speech', 'recognition', 'the', 'field', 'of', 'nlp', 'is',
'constantly', 'evolving', 'and', 'new', 'techniques', 'and', 'models', 'are', 'being', 'developed', 'to',
'improve', 'the', 'accuracy', 'and', 'efficiency', 'of', 'natural', 'language', 'processing', 'tasks']
after removing stopwords: ['natural', 'language', 'processing', 'nlp', 'field', 'study', 'focused',
'interactions', 'human', 'language', 'computers', 'involves', 'development', 'algorithms', 'models',
'understand', 'analyze', 'generate', 'natural', 'language', 'nlp', 'used', 'variety', 'applications', 'including',
'machine', 'translation', 'sentiment', 'analysis', 'text', 'classification', 'chatbots', 'speech', 'recognition',
'field', 'nlp', 'constantly', 'evolving', 'new', 'techniques', 'models', 'developed', 'improve', 'accuracy',
'efficiency', 'natural', 'language', 'processing', 'tasks']
after lemmatization: ['natural', 'language', 'processing', 'nlp', 'field', 'study', 'focused', 'interaction',
'human', 'language', 'computer', 'involves', 'development', 'algorithm', 'model', 'understand', 'analyze',
'generate', 'natural', 'language', 'nlp', 'used', 'variety', 'application', 'including', 'machine', 'translation',
'sentiment', 'analysis', 'text', 'classification', 'chatbots', 'speech', 'recognition', 'field', 'nlp', 'constantly',
'evolving', 'new', 'technique', 'model', 'developed', 'improve', 'accuracy', 'efficiency', 'natural',
'language', 'processing', 'task']
finally the joined text is: natural language processing nlp field study focused interaction human
language computer involves development algorithm model understand analyze generate natural
language nlp used variety application including machine translation sentiment analysis text
classification chatbots speech recognition field nlp constantly evolving new technique model
developed improve accuracy efficiency natural language processing task
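
One side effect of the order used in the program is worth noting: punctuation is stripped before stop words are removed, so a contraction such as "don't" loses its apostrophe and no longer matches its entry in the stop word list. The sketch below shows the effect with a hypothetical three-word stop list and a plain whitespace split in place of word_tokenize; both are simplifications for illustration.

```python
import string

stop_words = {"don't", "it", "is"}  # hypothetical excerpt of a stop list
text = "it is late, don't wait"

strip_punct = str.maketrans("", "", string.punctuation)

# Order used in the program: strip punctuation, then filter stop words.
stripped_first = [t for t in text.translate(strip_punct).split()
                  if t not in stop_words]

# Alternative order: filter stop words, then strip punctuation from the rest.
filtered_first = [t.translate(strip_punct) for t in text.split()
                  if t not in stop_words]

print(stripped_first)
print(filtered_first)
```

With a real tokenizer the details differ, since contractions may be split into several tokens, but the general point stands: the stop word list should be matched against tokens in the same form in which the list stores them.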
