Data Science: Natural Language Processing (SOC)
Experiment – 1
Demonstrate noise removal for any textual data and remove patterns such as hashtags from the text using regular expressions.
import re
def remove_noise(text):
    # Remove URLs first, while their punctuation is still intact
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Remove standalone numbers
    text = re.sub(r"\b\d+\b", "", text)
    # Remove remaining special characters (anything that is not a letter or whitespace)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()
# Example usage
text = "Hello! This is a #sample text with #hashtags and some special characters!! 123 @acet.ac.in"
clean_text = remove_noise(text)
print(clean_text)
Output:
Hello This is a text with and some special characters acetacin
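The residue "acetacin" in the cleaned output comes from the handle "@acet.ac.in": its punctuation is stripped but its letters survive. A minimal sketch of an extra rule that drops @-handles and email-like tokens before the general character cleanup (this pattern is an illustrative assumption, not part of the original program):
import re
def remove_mentions(text):
    # Drop @handles and simple email-like tokens such as "@acet.ac.in"
    text = re.sub(r'\S*@\S+', '', text)
    return re.sub(r'\s+', ' ', text).strip()
print(remove_mentions("Hello! 123 @acet.ac.in"))  # expected: "Hello! 123"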
Experiment – 2
Perform lemmatization and stemming using the Python library NLTK.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))
def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("Verb Form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun Form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb Form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective Form: " + lemmatizer.lemmatize(word, pos="a"))
lemmatize('skewing')
Output:
running
run
Verb Form: skew
Noun Form: skewing
Adverb Form: skewing
Adjective Form: skewing
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
porter_stemmer=PorterStemmer()
print(porter_stemmer.stem('running'))
print(porter_stemmer.stem('runs'))
print(porter_stemmer.stem('ran'))
Output:
run
run
ran
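LancasterStemmer is imported above but never used; a short sketch comparing it with the Porter stemmer (Lancaster is generally more aggressive, so some stems can differ):
from nltk.stem import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
# Print each word with its Porter and Lancaster stems side by side
for word in ['running', 'runs', 'ran', 'easily']:
    print(word, '->', porter_stemmer.stem(word), '|', lancaster_stemmer.stem(word))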
Experiment – 3
Demonstrate object standardization, such as replacing social media slang in text.
slang_dict = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "tbh": "to be honest",
    "imho": "in my humble opinion",
    "afaik": "as far as I know",
    "smh": "shaking my head",
    "jk": "just kidding"
}
def standardize_text(text):
    words = text.split()
    standardized_words = []
    for word in words:
        if word.lower() in slang_dict:
            standardized_words.append(slang_dict[word.lower()])
        else:
            standardized_words.append(word)
    return ' '.join(standardized_words)
text = "lol that's so tbh idk why imho they would do that"
standardized_text = standardize_text(text)
print("Standardized text:", standardized_text)
Output:
Standardized text: laughing out loud that's so to be honest I don't know why in my humble opinion
they would do that
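The dictionary lookup above misses slang followed by punctuation (for example "lol!" or "omg,"), because split() keeps the punctuation attached to the word. A hedged sketch of a regex-based variant that matches slang on word boundaries; this is a possible refinement, not part of the original experiment:
import re
def standardize_text_regex(text):
    # Replace each slang term on word boundaries, case-insensitively,
    # so tokens like "lol!" are still expanded
    for slang, expansion in slang_dict.items():
        text = re.sub(r'\b' + re.escape(slang) + r'\b', expansion, text, flags=re.IGNORECASE)
    return text
print(standardize_text_regex("omg, that was great lol!"))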
Experiment – 4
Perform part-of-speech tagging on any textual data.
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
def perform_pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens
# Example usage
text = "I love to explore new places and try different cuisines."
tagged_text = perform_pos_tagging(text)
print(tagged_text)
Output:
[('I', 'PRP'), ('love', 'VBP'), ('to', 'TO'), ('explore', 'VB'), ('new', 'JJ'), ('places', 'NNS'), ('and', 'CC'),
('try', 'VB'), ('different', 'JJ'), ('cuisines', 'NNS'), ('.', '.')]
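The tags above follow the Penn Treebank convention (PRP, VBP, JJ, and so on). If coarser labels are easier to read, NLTK can map them to the universal tagset; a small sketch, assuming the 'universal_tagset' resource is downloaded first:
nltk.download('universal_tagset')
# Coarse-grained tags such as NOUN, VERB, ADJ, ADV, PRON
universal_tags = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
print(universal_tags)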
Experiment – 5
Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
import gensim
from gensim import corpora
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
# Tokenize and preprocess the documents
tokenized_docs = [doc.lower().split() for doc in documents]
# Create a dictionary from the tokenized documents
dictionary = corpora.Dictionary(tokenized_docs)
# Create a corpus (term-document frequency)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# Build the LDA model
num_topics = 2 # Number of topics to extract
lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
# Print the extracted topics and their top words
for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx + 1}: {topic}")
# Get the topic distribution for a sample document
sample_doc = "Machine learning and data science go hand in hand."
sample_doc_bow = dictionary.doc2bow(sample_doc.lower().split())
sample_doc_topics = lda_model.get_document_topics(sample_doc_bow)
print(f"\nSample Document Topics: {sample_doc_topics}")
Output:
Topic 1: 0.056*"language" + 0.056*"is" + 0.055*"such" + 0.055*"applications" + 0.055*"many"
Topic 2: 0.080*"a" + 0.080*"is" + 0.057*"for" + 0.034*"data." + 0.034*"topics"
Sample Document Topics: [(0, 0.30881903), (1, 0.691181)]
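The topics above are dominated by function words such as "is", "a", and "for" because no stop-word filtering was applied. A minimal sketch of removing NLTK's English stop words before building the dictionary; this is an optional preprocessing step, and with it the extracted topics would differ from the output shown:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Keep only non-stop-word tokens before building the dictionary and corpus
filtered_docs = [[w for w in doc.lower().split() if w not in stop_words] for doc in documents]
dictionary = corpora.Dictionary(filtered_docs)
corpus = [dictionary.doc2bow(doc) for doc in filtered_docs]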
Experiment – 6
Demonstrate Term Frequency-Inverse Document Frequency (TF-IDF) using Python.
!pip install scikit-learn
import nltk
nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit the vectorizer to the documents and transform the documents into TF-IDF features
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (terms)
feature_name = vectorizer.get_feature_names_out()
# Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())
# Print the TF-IDF values for each term in each document
print("\nTF-IDF Values:")
for doc_index, doc in enumerate(documents):
    print(f"Document {doc_index + 1}:")
    for term_index, term in enumerate(feature_name):
        tfidf_value = tfidf_matrix[doc_index, term_index]
        if tfidf_value > 0:
            print(f"{term}: {tfidf_value:.4f}")
Output:
TF-IDF Values:
Document 1:
artificial: 0.3993
intelligence: 0.3993
is: 0.2084
learning: 0.3993
machine: 0.3993
of: 0.3993
subset: 0.3993
Document 2:
data: 0.3183
for: 0.3183
is: 0.2106
language: 0.3183
popular: 0.4037
programming: 0.4037
python: 0.4037
science: 0.4037
Document 3:
applications: 0.3179
as: 0.3179
chatbots: 0.3179
in: 0.3179
is: 0.1659
language: 0.2507
many: 0.3179
natural: 0.3179
processing: 0.3179
such: 0.3179
used: 0.3179
Document 4:
data: 0.2702
extracting: 0.3427
for: 0.2702
from: 0.3427
is: 0.1788
modeling: 0.3427
technique: 0.3427
text: 0.3427
topic: 0.3427
topics: 0.3427
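For easier inspection, the same matrix can be displayed as a table with one column per vocabulary term; a short sketch using pandas (assuming pandas is available in the environment):
import pandas as pd
# Rows are documents, columns are vocabulary terms
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_name)
print(tfidf_df.round(4))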
Experiment – 7
Demonstrate Word Embeddings using word2vec.
!pip install gensim
from gensim.models import Word2Vec
# Step 1: Prepare training data (list of tokenized sentences)
sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["natural", "language", "processing", "is", "a", "part", "of", "AI"],
    ["word2vec", "creates", "word", "embeddings"],
    ["AI", "is", "the", "future"],
]
# Step 2: Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)
# Step 3: Use the model
# Get embedding vector for a word
word_vector = model.wv["learning"]
print("Vector for 'learning':\n", word_vector)
# Find similar words
print("\nMost similar words to 'AI':")
similar = model.wv.most_similar("AI", topn=3)
for word, score in similar:
    print(f"{word}: {score:.4f}")
Experiment – 8
Implement text classification using a Naive Bayes classifier and the TextBlob library.
!pip install -U scikit-learn
!pip install -U textblob
import nltk
nltk.download('punkt')
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
# Sample training data
train_data = [
    ('I love this car.', 'positive'),
    ('This view is amazing.', 'positive'),
    ('I feel great!', 'positive'),
    ('I dislike this product.', 'negative'),
    ('This place is horrible.', 'negative'),
    ('I feel sad.', 'negative')
]
# Create the Naive Bayes classifier
classifier = NaiveBayesClassifier(train_data)
# Sample test data
test_data = [
    'I like this movie.',
    'This food is terrible.',
    'I am happy.'
]
# Classify the test data
for text in test_data:
    sentiment = classifier.classify(text)
    print(f'Text: {text}')
    print(f'Sentiment: {sentiment}\n')
Output:
Text: I like this movie.
Sentiment: positive
Text: This food is terrible.
Sentiment: positive
Text: I am happy.
Sentiment: positive
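Every test sentence above came out positive, which is plausible given only six training sentences. TextBlob's NaiveBayesClassifier also offers an accuracy check on a labelled test set and a listing of its most informative features; a small sketch (the labels in labeled_test are illustrative assumptions):
labeled_test = [
    ('I like this movie.', 'positive'),
    ('This food is terrible.', 'negative'),
    ('I am happy.', 'positive')
]
print("Accuracy:", classifier.accuracy(labeled_test))
classifier.show_informative_features(5)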
Experiment – 9
Apply a support vector machine (SVM) for text classification.
!pip install -U scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
    ("I love natural language processing.", "positive"),
    ("Machine learning is fascinating.", "positive"),
    ("Python is widely used in data science.", "positive"),
    ("I dislike noisy environments.", "negative"),
    ("This movie is terrible.", "negative"),
    ("I feel sad today.", "negative")
]
# Split the data into features and labels
texts, labels = zip(*documents)
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Transform the text data into TF-IDF features
features = vectorizer.fit_transform(texts)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Initialize the SVM classifier
svm_classifier = SVC()
# Train the classifier
svm_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)
# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(report)
Output:
Accuracy: 0.0
Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
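With only six labelled sentences, a 20% split leaves two test samples; in the run above both were positive (the report shows zero support for the negative class) and both were misclassified, hence the 0.0 accuracy. A hedged sketch of stratifying the split so both classes appear in the test set; results would still be unstable on a dataset this small:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, svm_classifier.predict(X_test)))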
Experiment – 10
Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
documents = [
    "I love natural language processing.",
    "Machine learning is fascinating.",
    "Python is widely used in data science."
]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents to obtain the term frequency (TF) vectors
tf_vectors = vectorizer.fit_transform(documents).toarray()
# Calculate the cosine similarity between two documents
doc1 = tf_vectors[0]
doc2 = tf_vectors[1]
similarity = cosine_similarity([doc1], [doc2])[0][0]
print(f"Text 1: {documents[0]}")
print(f"Text 2: {documents[1]}")
print(f"Cosine Similarity: {similarity:.4f}")
Output:
Text 1: I love natural language processing.
Text 2: Machine learning is fascinating.
Cosine Similarity: 0.0000
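The two sentences compared above share no terms, so their cosine similarity is 0. A short sketch extending the same idea to the full pairwise similarity matrix over all three documents:
# Pairwise cosine similarities between every pair of documents
similarity_matrix = cosine_similarity(tf_vectors)
for i in range(len(documents)):
    for j in range(len(documents)):
        print(f"doc{i + 1} vs doc{j + 1}: {similarity_matrix[i][j]:.4f}")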