Data Science: Natural Language Processing (SOC)
Experiment – 1
Demonstrate noise removal for any textual data and remove patterns such as hashtags from the text using regular expressions.
import re
def remove_noise(text):
    # Remove URLs first, while their punctuation is still intact
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Remove standalone numbers
    text = re.sub(r"\b\d+\b", "", text)
    # Remove remaining special characters (anything that is not a letter or whitespace)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()
# Example usage
text = "Hello! This is a #sample text with #hashtags and some special characters!! 123 @acet.ac.in"
clean_text = remove_noise(text)
print(clean_text)
Output:
Hello This is a text with and some special characters acetacin
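The residue "acetacin" in the cleaned output comes from the handle "@acet.ac.in": its punctuation is stripped but its letters survive. A minimal sketch of an extra rule that drops @-handles and email-like tokens before the general character cleanup (this pattern is an illustrative assumption, not part of the original program):
import re
def remove_mentions(text):
    # Drop @handles and simple email-like tokens such as "@acet.ac.in"
    text = re.sub(r'\S*@\S+', '', text)
    return re.sub(r'\s+', ' ', text).strip()
print(remove_mentions("Hello! 123 @acet.ac.in"))  # expected: "Hello! 123"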
Experiment – 2
Perform lemmatization and stemming using the Python library NLTK.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))
def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("Verb Form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun Form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb Form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective Form: " + lemmatizer.lemmatize(word, pos="a"))
lemmatize('skewing')
Output:
running
run
Verb Form: skew
Noun Form: skewing
Adverb Form: skewing
Adjective Form: skewing
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
porter_stemmer=PorterStemmer()
print(porter_stemmer.stem('running'))
print(porter_stemmer.stem('runs'))
print(porter_stemmer.stem('ran'))
Output:
run
run
ran
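LancasterStemmer is imported above but never used; a short sketch comparing it with the Porter stemmer (Lancaster is generally more aggressive, so some stems can differ):
from nltk.stem import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
# Print each word with its Porter and Lancaster stems side by side
for word in ['running', 'runs', 'ran', 'easily']:
    print(word, '->', porter_stemmer.stem(word), '|', lancaster_stemmer.stem(word))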
Experiment – 3
Demonstrate object standardization, such as replacing social media slang in text.
slang_dict = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "tbh": "to be honest",
    "imho": "in my humble opinion",
    "afaik": "as far as I know",
    "smh": "shaking my head",
    "jk": "just kidding"
}
def standardize_text(text):
    words = text.split()
    standardized_words = []
    for word in words:
        if word.lower() in slang_dict:
            standardized_words.append(slang_dict[word.lower()])
        else:
            standardized_words.append(word)
    return ' '.join(standardized_words)
text = "lol that's so tbh idk why imho they would do that"
standardized_text = standardize_text(text)
print("Standardized text:", standardized_text)
Output:
Standardized text: laughing out loud that's so to be honest I don't know why in my humble opinion
they would do that
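The dictionary lookup above misses slang followed by punctuation (for example "lol!" or "omg,"), because split() keeps the punctuation attached to the word. A hedged sketch of a regex-based variant that matches slang on word boundaries; this is a possible refinement, not part of the original experiment:
import re
def standardize_text_regex(text):
    # Replace each slang term on word boundaries, case-insensitively,
    # so tokens like "lol!" are still expanded
    for slang, expansion in slang_dict.items():
        text = re.sub(r'\b' + re.escape(slang) + r'\b', expansion, text, flags=re.IGNORECASE)
    return text
print(standardize_text_regex("omg, that was great lol!"))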
Experiment – 4
Perform part-of-speech tagging on any textual data.
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
def perform_pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens
# Example usage
text = "I love to explore new places and try different cuisines."
tagged_text = perform_pos_tagging(text)
print(tagged_text)
Output:
[('I', 'PRP'), ('love', 'VBP'), ('to', 'TO'), ('explore', 'VB'), ('new', 'JJ'), ('places', 'NNS'), ('and', 'CC'),
('try', 'VB'), ('different', 'JJ'), ('cuisines', 'NNS'), ('.', '.')]
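The tags above follow the Penn Treebank convention (PRP, VBP, JJ, and so on). If coarser labels are easier to read, NLTK can map them to the universal tagset; a small sketch, assuming the 'universal_tagset' resource is downloaded first:
nltk.download('universal_tagset')
# Coarse-grained tags such as NOUN, VERB, ADJ, ADV, PRON
universal_tags = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
print(universal_tags)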
Experiment – 5
Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
import gensim
from gensim import corpora
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
# Tokenize and preprocess the documents
tokenized_docs = [doc.lower().split() for doc in documents]
# Create a dictionary from the tokenized documents
dictionary = corpora.Dictionary(tokenized_docs)
# Create a corpus (term-document frequency)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# Build the LDA model
num_topics = 2 # Number of topics to extract
lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
# Print the extracted topics and their top words
for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx + 1}: {topic}")
# Get the topic distribution for a sample document
sample_doc = "Machine learning and data science go hand in hand."
sample_doc_bow = dictionary.doc2bow(sample_doc.lower().split())
sample_doc_topics = lda_model.get_document_topics(sample_doc_bow)
print(f"\nSample Document Topics: {sample_doc_topics}")
Output:
Topic 1: 0.056*"language" + 0.056*"is" + 0.055*"such" + 0.055*"applications" + 0.055*"many"
Topic 2: 0.080*"a" + 0.080*"is" + 0.057*"for" + 0.034*"data." + 0.034*"topics"
Sample Document Topics: [(0, 0.30881903), (1, 0.691181)]
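The topics above are dominated by function words such as "is", "a", and "for" because no stop-word filtering was applied. A minimal sketch of removing NLTK's English stop words before building the dictionary; this is an optional preprocessing step, and with it the extracted topics would differ from the output shown:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Keep only non-stop-word tokens before building the dictionary and corpus
filtered_docs = [[w for w in doc.lower().split() if w not in stop_words] for doc in documents]
dictionary = corpora.Dictionary(filtered_docs)
corpus = [dictionary.doc2bow(doc) for doc in filtered_docs]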
Experiment – 6
Demonstrate Term Frequency-Inverse Document Frequency (TF-IDF) using Python.
!pip install scikit-learn
import nltk
nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit the vectorizer to the documents and transform the documents into TF-IDF features
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (terms)
feature_name = vectorizer.get_feature_names_out()
# Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())
# Print the TF-IDF values for each term in each document
print("\nTF-IDF Values:")
for doc_index, doc in enumerate(documents):
    print(f"Document {doc_index + 1}:")
    for term_index, term in enumerate(feature_name):
        tfidf_value = tfidf_matrix[doc_index, term_index]
        if tfidf_value > 0:
            print(f"{term}: {tfidf_value:.4f}")
Output:
TF-IDF Values:
Document 1:
artificial: 0.3993
intelligence: 0.3993
is: 0.2084
learning: 0.3993
machine: 0.3993
of: 0.3993
subset: 0.3993
Document 2:
data: 0.3183
for: 0.3183
is: 0.2106
language: 0.3183
popular: 0.4037
programming: 0.4037
python: 0.4037
science: 0.4037
Document 3:
applications: 0.3179
as: 0.3179
chatbots: 0.3179
in: 0.3179
is: 0.1659
language: 0.2507
many: 0.3179
natural: 0.3179
processing: 0.3179
such: 0.3179
used: 0.3179
Document 4:
data: 0.2702
extracting: 0.3427
for: 0.2702
from: 0.3427
is: 0.1788
modeling: 0.3427
technique: 0.3427
text: 0.3427
topic: 0.3427
topics: 0.3427
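For easier inspection, the same matrix can be displayed as a table with one column per vocabulary term; a short sketch using pandas (assuming pandas is available in the environment):
import pandas as pd
# Rows are documents, columns are vocabulary terms
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_name)
print(tfidf_df.round(4))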
Experiment – 7
Demonstrate Word Embeddings using word2vec.
!pip install gensim
from gensim.models import Word2Vec
# Step 1: Prepare training data (list of tokenized sentences)
sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["natural", "language", "processing", "is", "a", "part", "of", "AI"],
    ["word2vec", "creates", "word", "embeddings"],
    ["AI", "is", "the", "future"],
]
# Step 2: Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)
# Step 3: Use the model
# Get embedding vector for a word
word_vector = model.wv["learning"]
print("Vector for 'learning':\n", word_vector)
# Find similar words
print("\nMost similar words to 'AI':")
similar = model.wv.most_similar("AI", topn=3)
for word, score in similar:
    print(f"{word}: {score:.4f}")
Experiment – 8
Implement text classification using a Naive Bayes classifier and the TextBlob library.
!pip install -U scikit-learn
!pip install -U textblob
import nltk
nltk.download('punkt')
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
# Sample training data
train_data = [
    ('I love this car.', 'positive'),
    ('This view is amazing.', 'positive'),
    ('I feel great!', 'positive'),
    ('I dislike this product.', 'negative'),
    ('This place is horrible.', 'negative'),
    ('I feel sad.', 'negative')
]
# Create the Naive Bayes classifier
classifier = NaiveBayesClassifier(train_data)
# Sample test data
test_data = [
    'I like this movie.',
    'This food is terrible.',
    'I am happy.'
]
# Classify the test data
for text in test_data:
    sentiment = classifier.classify(text)
    print(f'Text: {text}')
    print(f'Sentiment: {sentiment}\n')
Output:
Text: I like this movie.
Sentiment: positive
Text: This food is terrible.
Sentiment: positive
Text: I am happy.
Sentiment: positive
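Every test sentence above came out positive, which is plausible given only six training sentences. TextBlob's NaiveBayesClassifier also offers an accuracy check on a labelled test set and a listing of its most informative features; a small sketch (the labels in labeled_test are illustrative assumptions):
labeled_test = [
    ('I like this movie.', 'positive'),
    ('This food is terrible.', 'negative'),
    ('I am happy.', 'positive')
]
print("Accuracy:", classifier.accuracy(labeled_test))
classifier.show_informative_features(5)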
Experiment – 9
Apply a support vector machine (SVM) for text classification.
!pip install -U scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
    ("I love natural language processing.", "positive"),
    ("Machine learning is fascinating.", "positive"),
    ("Python is widely used in data science.", "positive"),
    ("I dislike noisy environments.", "negative"),
    ("This movie is terrible.", "negative"),
    ("I feel sad today.", "negative")
]
# Split the data into features and labels
texts, labels = zip(*documents)
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Transform the text data into TF-IDF features
features = vectorizer.fit_transform(texts)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Initialize the SVM classifier
svm_classifier = SVC()
# Train the classifier
svm_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)
# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(report)
Output:
Accuracy: 0.0
Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
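With only six labelled sentences, a 20% split leaves two test samples; in the run above both were positive (the report shows zero support for the negative class) and both were misclassified, hence the 0.0 accuracy. A hedged sketch of stratifying the split so both classes appear in the test set; results would still be unstable on a dataset this small:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, svm_classifier.predict(X_test)))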
Experiment – 10
Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
documents = [
    "I love natural language processing.",
    "Machine learning is fascinating.",
    "Python is widely used in data science."
]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents to obtain the term frequency (TF) vectors
tf_vectors = vectorizer.fit_transform(documents).toarray()
# Calculate the cosine similarity between two documents
doc1 = tf_vectors[0]
doc2 = tf_vectors[1]
similarity = cosine_similarity([doc1], [doc2])[0][0]
print(f"Text 1: {documents[0]}")
print(f"Text 2: {documents[1]}")
print(f"Cosine Similarity: {similarity:.4f}")
Output:
Text 1: I love natural language processing.
Text 2: Machine learning is fascinating.
Cosine Similarity: 0.0000
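The two sentences compared above share no terms, so their cosine similarity is 0. A short sketch extending the same idea to the full pairwise similarity matrix over all three documents:
# Pairwise cosine similarities between every pair of documents
similarity_matrix = cosine_similarity(tf_vectors)
for i in range(len(documents)):
    for j in range(len(documents)):
        print(f"doc{i + 1} vs doc{j + 1}: {similarity_matrix[i][j]:.4f}")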