A
MAJOR PROJECT REPORT
ON
“WhatsApp Chat Sentiment Analysis”
In partial fulfillment
For the award of degree of
“Bachelor of Technology”
Department of Computer Engineering & Information Technology
Submitted To:                    Submitted By:
Ms. Neny Pandel                  DEEPAK PANDEY
(Assistant Professor)            SID - 88720
Department of Computer Science and Engineering
Suresh Gyan Vihar University, Jaipur
NOVEMBER 2022
STUDENT DECLARATION
I declare that my 5th semester report entitled ‘WhatsApp Chat Sentiment Analysis’ is my own
work, conducted under the supervision of Ms. Neny Pandel.
I further declare that, to the best of my knowledge, this B.Tech 5th semester report does not
contain any part of work that has been submitted for the award of a B.Tech degree, either in this or any
other university, without proper citation.
Student’s sign                   Submitted to:
                                 Ms. Neny Pandel
                                 (Assistant Professor)
ACKNOWLEDGEMENT
A good working environment and motivation enhance the quality of work, and I received both from my
college through our CLNLP project.
I was given this golden opportunity under the expert guidance of Ms. Neny
Pandel from SGVU, Jaipur. I am heartily thankful to her for helping me complete my project successfully.
She shared her full experience and additional knowledge of the practical field.
I am also thankful to my head of department, Mr. Sohit Agarwal, and all the CEIT staff for their guidance.
Finally, I thank all the people who directly or indirectly helped me to complete this project.
Student name:
DEEPAK PANDEY
SID :- 88720
CERTIFICATE
This is to certify that the project report entitled ‘WHATSAPP CHAT SENTIMENT ANALYSIS’
is a bona fide report of the work carried out by Saurabh Kumar under guidance and supervision for the
partial fulfilment of the degree of B.Tech CSE at Suresh Gyan Vihar University, Jaipur.
To the best of our knowledge and belief, this work embodies the work of the candidate,
has duly been completed, fulfils the requirements of the ordinance relating to the bachelor's degree of
the university, and is up to the standard in respect of content, presentation and language for being
referred to the examiner.
Ms. Neny Pandel                  Mr. Sohit Agarwal
Assistant Professor              HOD, CEIT
ABSTRACT
Sentiment analysis, also known as opinion mining, refers to the use of natural language processing
and text analysis to systematically identify, extract, quantify, and study affective states and subjective
information. Sentiment analysis is widely applied to reviews and survey responses, online and
social media, and healthcare materials for applications that range from marketing to customer
service to clinical medicine. In this project, we aim to perform sentiment analysis of product-based
reviews. The data used in this project are online product reviews collected from “amazon.com”. We
expect to perform review-level categorization of the review data with promising outcomes.
INTRODUCTION
Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment analysis, which is
also known as opinion mining, studies people’s sentiments towards certain entities. From a user’s
perspective, people are able to post their own content through various social media, such as
forums, micro-blogs, or online social networking sites. From a researcher’s perspective, many
social media sites release their application programming interfaces (APIs), prompting data
collection and analysis by researchers and developers. However, those types of online data have
several flaws that potentially hinder the process of sentiment analysis. The first flaw is that since
people can freely post their own content, the quality of their opinions cannot be guaranteed. The
second flaw is that the ground truth of such online data is not always available. A ground truth is
a tag attached to a certain opinion, indicating whether the opinion is positive, negative, or neutral.
“It is a quite boring movie... but the scenes were good enough.”
The given line is a movie review stating that “it” (the movie) is quite boring but the scenes were
good. Understanding such sentiments requires multiple tasks.
Hence, sentiment analysis is a kind of text classification based on the sentimental orientation
(SO) of the opinions a text contains. Sentiment analysis of product reviews has recently become very
popular in text mining and computational linguistics research. It involves the following tasks:
• Firstly, evaluative terms expressing opinions must be extracted from the review.
• Secondly, the SO, or the polarity, of the opinions must be determined.
• Thirdly, the opinion strength, or the intensity, of an opinion should also be determined.
• Finally, the review is classified with respect to sentiment classes, such as Positive and Negative,
based on the SO of the opinions it contains.
REVIEW OF LITERATURE
The most fundamental problem in sentiment analysis is sentiment polarity categorization. It has been
studied using a dataset containing over 5.1 million product reviews from Amazon.com, with the
products belonging to four categories.
A max-entropy POS tagger is used to classify the words of each sentence, with an additional
Python program to speed up the process. Negation words like no, not, and more are included
in the adverbs, whereas negation-of-adjective and negation-of-verb patterns are specially used to identify
the phrases.
The following classification models are selected for categorization: Naïve
Bayesian, Random Forest, Logistic Regression and Support Vector Machine.
For feature selection, Pang and Lee suggested removing objective sentences by extracting
subjective ones. They proposed a text-categorization technique that is able to identify subjective
content using minimum cut. Gann et al. selected 6,799 tokens based on Twitter data, where each
token is assigned a sentiment score, namely TSI (Total Sentiment Index), marking it as a
positive token or a negative token. Specifically, the TSI for a certain token is computed as:
$$\mathrm{TSI} = \frac{p - \frac{tp}{tn}\, n}{p + \frac{tp}{tn}\, n}$$
where p is the number of times a token appears in positive tweets, n is the number of times the
token appears in negative tweets, and tp/tn is the ratio of the total number of positive tweets over the total number of
negative tweets.
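The TSI definition can be illustrated with a short Python sketch. The function name and the parameters total_pos and total_neg (standing for the total numbers of positive and negative tweets) are hypothetical names chosen for this illustration, not taken from the cited work.

```python
def total_sentiment_index(p: int, n: int, total_pos: int, total_neg: int) -> float:
    """Compute TSI = (p - r*n) / (p + r*n), where r = total_pos / total_neg.

    p, n: times the token appears in positive / negative tweets.
    total_pos, total_neg: total positive / negative tweets (hypothetical names).
    """
    if total_neg == 0:
        raise ValueError("total_neg must be greater than zero")
    ratio = total_pos / total_neg
    denominator = p + ratio * n
    if denominator == 0:
        raise ValueError("the token does not occur in the corpus")
    return (p - ratio * n) / denominator


# A token seen 30 times in positive and 10 times in negative tweets,
# with equally many positive and negative tweets overall.
print(total_sentiment_index(p=30, n=10, total_pos=1000, total_neg=1000))  # 0.5
```

A value close to 1 marks a token as positive and a value close to -1 marks it as negative.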
OBJECTIVE
Scraping product reviews from websites featuring various products, specifically
amazon.com.
Analyzing and categorizing the review data.
Analyzing sentiment in the dataset at the document level (review level).
Categorizing or classifying opinion sentiment into:
• Positive
• Negative
System Design
Hardware Requirements:
• Core i5/i7 processor
• At least 8 GB RAM
• At least 60 GB of Usable Hard Disk Space
Software Requirements:
• Python 3.x
• Anaconda Distribution
• Google Colab
• Jupyter Notebook
• NLTK Toolkit
• UNIX/LINUX Operating System
Data Information
➢ Firstly, export the WhatsApp group chat as a .txt file.
➢ Secondly, make a copy of this notebook.
➢ After this step, you will be prompted to enter the file path in 1.2. Load Whatsapp Group
Chat Data.
➢ Lastly, enter the path of your chat export.
WhatsApp-Analyzer is a statistical analysis tool for
WhatsApp chats. Working on the chat files that can be
exported from WhatsApp, it generates various plots
showing, for example, which other participant a user
responds to the most. We propose to employ dataset
manipulation techniques to gain a better understanding of
the WhatsApp chats present in our phones.
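As a minimal sketch of how an exported chat file might be turned into structured records, the following Python code parses lines of the form "12/25/21, 10:30 PM - Alice: Hello". This line layout is an assumption: the exact export format depends on the phone's platform and locale, so the regular expressions would need adjusting for other layouts. The names LINE_RE, STAMP_RE and parse_chat are hypothetical.

```python
import re
from typing import Dict, List

# Assumed layout of a message line: "12/25/21, 10:30 PM - Alice: Hello"
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{2,4}), "
    r"(?P<time>\d{1,2}:\d{2}(?:\s?[APap][Mm])?) - "
    r"(?P<sender>[^:]+): (?P<text>.*)$"
)
# A timestamp prefix without "sender: " is treated as a system notice.
STAMP_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[APap][Mm])? - ")


def parse_chat(lines: List[str]) -> List[Dict[str, str]]:
    """Return one dict (date, time, sender, text) per chat message."""
    messages: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            messages.append(match.groupdict())
        elif STAMP_RE.match(line):
            continue  # system notice such as the encryption banner
        elif messages:
            # A line without a timestamp continues the previous message.
            messages[-1]["text"] += "\n" + line
    return messages


sample = [
    "12/25/21, 10:30 PM - Messages and calls are end-to-end encrypted.",
    "12/25/21, 10:31 PM - Alice: Merry Christmas",
    "to everyone!",
    "12/25/21, 10:32 PM - Bob: Thanks",
]
for message in parse_chat(sample):
    print(message["sender"], "->", message["text"])
```

The text field of each record is what would later be passed to the sentiment classifier.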
Data Format:
The dataset we will use is a .json file. A sample of the dataset is given below.
{
"reviewSummary": "Surprisingly delightful",
"reviewText": "This is a first read filled with unexpected humor and profound insights into the art of politics and policy. In brief, it is sly, wry, and wise.",
"reviewRating": "4"
}
Methodology for Implementation
(Formulation/Algorithm)
DATA COLLECTION:
The data consist of product reviews collected from amazon.com from May
1996 to July 2014. Each review includes the following information: 1) reviewer ID; 2)
product ID; 3) rating; 4) time of the review; 5) helpfulness; 6) review text. Every rating is
based on a 5-star scale, so all ratings range from 1 star to 5 stars, with no
half-star or quarter-star ratings.
SENTIMENT SENTENCE EXTRACTION & POS TAGGING:
Tokenization of reviews, after removal of stop words that carry no
sentiment, is the basic requirement for POS tagging. After removal of
stop words like “am, is, are, the, but” and so on, the remaining sentences are converted
into tokens. These tokens take part in POS tagging.
In natural language processing, part-of-speech (POS) taggers have been
developed to classify words based on their parts of speech. For sentiment analysis, a
POS tagger is very useful for the following two reasons: 1) Words like nouns and
pronouns usually do not contain any sentiment, and a POS tagger makes it possible to filter out
such words; 2) A POS tagger can also be used to distinguish words that can be
used in different parts of speech.
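The stop-word removal and tokenization step can be sketched in plain Python as follows. The stop list here is a tiny illustrative subset, and the function names are hypothetical. For simplicity the sketch splits the text into words first and then filters them. In a full pipeline the resulting tokens would be passed to a POS tagger such as NLTK's `nltk.pos_tag`, which is not run here because it needs downloaded tagger data.

```python
import re
from typing import List

# Tiny illustrative stop list; a real run would use a fuller list.
STOP_WORDS = {"am", "is", "are", "the", "but", "a", "an", "it", "were"}


def tokenize(review: str) -> List[str]:
    """Lower-case the review and split it into word tokens."""
    return re.findall(r"[a-z']+", review.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


review = "It is a quite boring movie but the scenes were good enough."
tokens = remove_stop_words(tokenize(review))
print(tokens)  # ['quite', 'boring', 'movie', 'scenes', 'good', 'enough']
```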
NEGATIVE PHRASE IDENTIFICATION:
Words such as adjectives and verbs are able to convey the opposite sentiment
with the help of negative prefixes. For instance, consider the following sentence that was
found in an electronic device’s review: “The built in speaker also has its uses but so far
nothing revolutionary.” The word “revolutionary” is a positive word according to the sentiment word list.
However, the phrase “nothing revolutionary” gives more or less negative feelings.
Therefore, it is crucial to identify such phrases. In this work, two types of
phrases have been identified, namely negation-of-adjective (NOA) and negation-of-verb
(NOV).
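A minimal sketch of NOA and NOV detection is given below. It assumes the tokens have already been POS-tagged with Penn Treebank tags (adjective tags start with JJ, verb tags with VB) and that the negation word stands directly before the adjective or verb. The negation word set and the function name are illustrative assumptions.

```python
from typing import List, Tuple

# Illustrative subset of negation words.
NEGATION_WORDS = {"no", "not", "never", "nothing", "n't"}


def find_negation_phrases(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (label, phrase) pairs for negation word + adjective/verb bigrams."""
    phrases = []
    for (word, _tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if word.lower() not in NEGATION_WORDS:
            continue
        if next_tag.startswith("JJ"):      # negation-of-adjective
            phrases.append(("NOA", f"{word} {next_word}"))
        elif next_tag.startswith("VB"):    # negation-of-verb
            phrases.append(("NOV", f"{word} {next_word}"))
    return phrases


tagged_tokens = [("so", "RB"), ("far", "RB"), ("nothing", "NN"), ("revolutionary", "JJ")]
print(find_negation_phrases(tagged_tokens))  # [('NOA', 'nothing revolutionary')]
```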
SENTIMENT CLASSIFICATION ALGORITHMS:
Naïve Bayesian classifier:
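The Naïve Bayesian classifier works as follows: Suppose that there exists a set
of training data, D, in which each tuple is represented by an n-dimensional feature
vector, $X = (x_1, x_2, \ldots, x_n)$, indicating n measurements made on the tuple from n attributes
or features. Assume that there are m classes, $C_1, C_2, \ldots, C_m$. Given a tuple X, the
classifier will predict that X belongs to $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$,
where $i, j \in [1, m]$ and $i \neq j$. Under the naïve assumption that the features are
conditionally independent given the class, $P(C_i \mid X)$ is computed as:
$$P(C_i \mid X) = \frac{P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)}{P(X)}$$
Since $P(X)$ is the same for every class, only the numerator needs to be compared.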
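The decision rule above can be sketched in plain Python for word tokens. This is an illustrative multinomial variant with add-one (Laplace) smoothing, which is an assumption of this sketch and not necessarily the configuration used in the project. Scores are compared in log space to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_naive_bayes(docs: List[Tuple[List[str], int]]):
    """Collect class counts and per-class token counts from (tokens, label) pairs."""
    class_counts: Counter = Counter()
    token_counts: Dict[int, Counter] = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(tokens: List[str], model) -> int:
    """Return the class maximizing log P(C) + sum of log P(x_k | C)."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            likelihood = (token_counts[label][token] + 1) / (total_tokens + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (["good", "great", "love"], 1),
    (["excellent", "good"], 1),
    (["bad", "poor", "boring"], 0),
    (["terrible", "bad"], 0),
]
model = train_naive_bayes(training)
print(predict(["good", "love"], model))   # 1
print(predict(["boring", "bad"], model))  # 0
```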
Random forest
The random forest classifier was chosen due to its superior accuracy over a single
decision tree. It is essentially an ensemble method based on
bagging. The classifier works as follows: Given D, the classifier first creates k bootstrap
samples of D, with each sample denoted as $D_i$. Each $D_i$ has the same number of
tuples as D, sampled with replacement from D. Sampling with replacement
means that some of the original tuples of D may not be included in $D_i$, whereas others
may occur more than once. The classifier then constructs a decision tree based on each
$D_i$. As a result, a “forest” that consists of k decision trees is formed.
To classify an unknown tuple, X, each tree returns its class prediction, which counts as one
vote. The final decision on X’s class is the class that has the most votes.
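The two bagging ingredients described above, bootstrap sampling and majority voting, can be sketched as follows. Tree construction itself is left out, and the function names and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_samples(data: Sequence, k: int, seed: int = 0) -> List[list]:
    """Draw k samples, each with len(data) items, with replacement from data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(len(data))] for _ in range(k)]


def majority_vote(predictions: Sequence[int]) -> int:
    """Return the class predicted by the most trees (first seen wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]


samples = bootstrap_samples(list(range(10)), k=3)
print([len(sample) for sample in samples])  # [10, 10, 10]
print(majority_vote([1, 0, 1, 1, 0]))       # 1
```

Each bootstrap sample would be used to fit one decision tree, and majority_vote would combine the per-tree predictions for an unknown tuple.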
The decision tree algorithm implemented in scikit-learn is CART (Classification and
Regression Trees). CART uses the Gini index for its tree induction. For D, the Gini index
is computed as:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$
where $p_i$ is the probability that a tuple in D belongs to class $C_i$. The Gini index
measures the impurity of D. The lower the index value is, the better D was partitioned.
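As a small worked example of the Gini index, the sketch below estimates each class probability as the fraction of labels in that class. The function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def gini_index(labels: Sequence[int]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini_index([1, 1, 1, 1]))  # 0.0  (pure partition)
print(gini_index([1, 1, 0, 0]))  # 0.5  (evenly mixed, two classes)
```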
Support vector machine
Support vector machine (SVM) is a method for the classification of both linear and
nonlinear data. If the data is linearly separable, the SVM searches for the linear optimal
separating hyperplane (the linear kernel), which is a decision boundary that separates
data of one class from another. Mathematically, a separating hyperplane can be written
as $W \cdot X + b = 0$, where $W = (w_1, w_2, \ldots, w_n)$ is a weight vector, X is a training tuple, and b is a
scalar. In order to optimize the hyperplane, the problem essentially transforms to the
minimization of $\lVert W \rVert$, which is eventually computed as:
$$W = \sum_{i} \alpha_i y_i X_i$$
where $\alpha_i$ are numeric parameters, and $y_i$ are labels based on support
vectors, $X_i$.
That is: if $y_i = 1$ then $W \cdot X_i + b \geq 1$;
if $y_i = -1$ then $W \cdot X_i + b \leq -1$.
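Once W and b are known, classifying a tuple reduces to checking on which side of the hyperplane it lies. The sketch below illustrates this, together with the usual margin condition y (W·X + b) >= 1, which is stated here as an assumption of the standard linear SVM formulation. The weights are made-up values, not trained ones.

```python
from typing import Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Compute W.X + b for one tuple."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def satisfies_margin(w: Sequence[float], x: Sequence[float], b: float, y: int) -> bool:
    """Check the margin condition y * (W.X + b) >= 1."""
    return y * decision_value(w, x, b) >= 1


w, b = [1.0, -1.0], 0.0  # made-up hyperplane for illustration
print(classify(w, [3.0, 1.0], b))             # 1
print(classify(w, [0.0, 2.0], b))             # -1
print(satisfies_margin(w, [3.0, 1.0], b, 1))  # True
```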
Implementation Details
The training of the dataset consists of the following steps:
Unpacking of data: A small Python script reads
the dataset from the source files and dumps it into a pickle file for easier and
faster access and object serialization.
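The unpacking step might look like the following sketch. It assumes the source file holds one JSON review object per line, which is an assumption about the file layout, and the file names are placeholders created in a temporary directory.

```python
import json
import os
import pickle
import tempfile
from typing import Dict, List


def load_reviews(json_path: str) -> List[Dict[str, str]]:
    """Read one JSON review object per non-empty line (assumed layout)."""
    with open(json_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def dump_pickle(reviews: List[Dict[str, str]], pickle_path: str) -> None:
    """Serialize the parsed reviews for faster loading later."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(reviews, handle)


def load_pickle(pickle_path: str) -> List[Dict[str, str]]:
    """Load reviews back from the pickle file."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)


# Round trip on a throwaway file to show the flow.
with tempfile.TemporaryDirectory() as workdir:
    json_path = os.path.join(workdir, "reviews.json")
    pickle_path = os.path.join(workdir, "reviews.pkl")
    with open(json_path, "w", encoding="utf-8") as handle:
        handle.write('{"reviewText": "Sly, wry, and wise.", "reviewRating": "4"}\n')
    dump_pickle(load_reviews(json_path), pickle_path)
    restored = load_pickle(pickle_path)
print(restored)
```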
Preparing Data for Sentiment Analysis:
i) The pickle file is loaded in this step, and the data other than that
used for sentiment analysis is removed. As shown in the sample dataset under Data
Format, the data contains several columns, of which only the rating and the review text are
required. So the column “reviewSummary” is dropped from the data file.
ii) After that, the reviews rated 3 out of 5 are removed, as they
signify neutral reviews, and we are concerned only with positive and negative
reviews.
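The two preparation steps can be sketched as below, using the field names from the sample record. Mapping ratings above 3 to the label 1 and ratings below 3 to the label 0 is an assumption made for this illustration, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def prepare(reviews: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Keep only review text and a binary label, dropping neutral reviews."""
    prepared = []
    for review in reviews:
        rating = int(float(review["reviewRating"]))  # ratings are stored as strings
        if rating == 3:
            continue  # neutral reviews are removed
        label = 1 if rating > 3 else 0  # assumed mapping: 4-5 positive, 1-2 negative
        prepared.append((review["reviewText"], label))  # reviewSummary is not kept
    return prepared


raw_reviews = [
    {"reviewSummary": "Loved it", "reviewText": "great", "reviewRating": "5"},
    {"reviewSummary": "Okay", "reviewText": "average", "reviewRating": "3"},
    {"reviewSummary": "Avoid", "reviewText": "awful", "reviewRating": "1"},
]
print(prepare(raw_reviews))  # [('great', 1), ('awful', 0)]
```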
Preprocessing Data: This is a vital part of training on the dataset. Here, words
present in the file are accessed both as single words and as pairs of words.
For example, the word “bad” is negative, but when someone writes
“not bad” it is positive. In such cases, considering only single words for
training would give the wrong result. So words are checked in pairs to find
modifiers occurring before an adjective, which, if present, might
give it a different meaning.
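Treating words both singly and in pairs corresponds to extracting unigrams and bigrams. A minimal sketch, with a hypothetical function name, is:

```python
from typing import List


def unigrams_and_bigrams(tokens: List[str]) -> List[str]:
    """Return every single token followed by every adjacent pair of tokens."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


print(unigrams_and_bigrams(["not", "bad", "at", "all"]))
# ['not', 'bad', 'at', 'all', 'not bad', 'bad at', 'at all']
```

With the pair "not bad" available as its own feature, a classifier can learn that it behaves differently from "bad" alone.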
Training Data/Evaluation: The main chunk of code, which performs the whole
evaluation of sentiment analysis on the preprocessed data, is part
of this step.
i) The accuracy, precision, recall, and evaluation time are calculated and displayed.
ii) Naïve Bayes, Logistic Regression, Linear SVM and Random Forest
classifiers are applied on the dataset for evaluation of sentiments.
iii) Prediction on test data is done and the confusion matrix of the predictions is displayed.
iv) Total positive and negative reviews are counted.
v) A review-like sentence is taken as input on the console; the
console gives 1 as output for a positive input and 0 for a negative input.
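Some of the bookkeeping in these steps can be sketched without any classifier: a hold-out (train-test) split, an accuracy score, and a count of positive and negative labels. The 80/20 split ratio, the seed and the function names are assumptions for this illustration.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def hold_out_split(items: Sequence, test_fraction: float = 0.2, seed: int = 42) -> Tuple[list, list]:
    """Shuffle the items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
train, test = hold_out_split(labels)
print(len(train), len(test))                 # 8 2
print(Counter(labels))                       # Counter({1: 6, 0: 4})
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```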
Results and Sample Output
The ultimate outcome of training on the public reviews dataset is that the machine
is capable of judging whether an entered sentence bears a positive response or a negative
response.
Precision (also called positive predictive value) is the fraction of relevant
instances among the retrieved instances, while Recall (also known as sensitivity) is the
fraction of relevant instances that have been retrieved over the total amount of relevant
instances. Both precision and recall are therefore based on an understanding and
measure of relevance.
F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers
both the precision p and the recall r of the test to compute the score: p is the number of correct
positive results divided by the number of all positive results returned by the classifier, and r is
the number of correct positive results divided by the number of all relevant samples (all
samples that should have been identified as positive). The F1 score is the harmonic mean of
the precision and recall, $F_1 = 2pr / (p + r)$, where an F1 score reaches its best value at 1 (perfect precision and
recall) and worst at 0.
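These three measures can be computed directly from the counts of true positives (tp), false positives (fp) and false negatives (fn). The sketch below returns 0.0 when a denominator would be zero, which is a convention chosen for this illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


print(precision_recall_f1(tp=1, fp=1, fn=1))  # (0.5, 0.5, 0.5)
```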
In statistics, a receiver operating characteristic curve, i.e. ROC curve, is a graphical
plot that illustrates the diagnostic ability of a binary classifier system as its discrimination
threshold is varied. The Total Operating Characteristic (TOC) expands on the idea of ROC by
showing the total information in the two-by-two contingency table for each threshold. ROC gives
only two bits of relative information for each threshold, thus the TOC gives strictly more
information than the ROC.
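The points of a ROC curve can be obtained by sweeping the threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below uses made-up scores and assumes both classes are present in the labels.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


print(roc_points([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```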
The machine evaluates the accuracy of training on the data along with
precision, recall and F1.
The confusion matrix of the evaluation is calculated.
It is thus capable of judging an externally written review as positive or
negative.
A positive review will be marked as [1], and a negative review will be
marked as [0].
Results obtained using the hold-out strategy (train-test split) [values rounded
to 2 decimal places].
The Confusion Matrix Format is as follows:
                    Predicted Negative    Predicted Positive
Actual Negative     True Negative         False Positive
Actual Positive     False Negative        True Positive
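A confusion matrix in this layout can be built with a few lines of Python. The sketch assumes labels are 0 (negative) and 1 (positive), with rows indexed by the actual label and columns by the predicted label.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> List[List[int]]:
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


print(confusion_matrix([0, 0, 1, 1, 1], [0, 1, 0, 1, 1]))  # [[1, 1], [1, 2]]
```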
Output
Conclusion
Sentiment analysis deals with the classification of texts based on the sentiments they
contain. This report focuses on a typical sentiment analysis model consisting of three
core steps, namely data preparation, review analysis and sentiment classification,
and describes representative techniques involved in those steps.
Sentiment analysis is an emerging research area in text mining and computational
linguistics, and has attracted considerable research attention in the past few years.
Future research shall explore sophisticated methods for opinion and product feature
extraction, as well as new classification models that can address the ordered labels
property in rating inference. Applications that utilize results from sentiment analysis
are also expected to emerge in the near future.
Future Scope
Sentiment analysis is a uniquely powerful tool for businesses (such as WhatsApp) that are looking to
measure attitudes, feelings and emotions regarding their brand. To date, the majority of
sentiment analysis projects have been conducted almost exclusively by companies and brands
through the use of social media data, survey responses and other hubs of user-generated content.
By investigating and analyzing customer sentiments, these brands are able to get an inside look
at consumer behaviors and, ultimately, better serve their audiences with the products, services
and experiences they offer.
The future of sentiment analysis is going to continue to dig deeper, far past the surface of the
number of likes, comments and shares, and aim to reach, and truly understand, the significance
of social media interactions and what they tell us about the consumers behind the screens. This
forecast also predicts broader applications for sentiment analysis – brands will continue to
leverage this tool, but so will individuals in the public eye, governments, nonprofits, education
centers and many other organizations.