NLP Project Report

The document is a project report on sentiment analysis of WhatsApp chat. It discusses conducting sentiment analysis on WhatsApp group chat data exported to a text file. The report includes sections on the objectives, which are to analyze and categorize WhatsApp chat data to determine sentiment at the document level. It also discusses the required hardware, software and data format used, which is a JSON file containing review text, summary, rating. The methodology section describes collecting data from Amazon reviews from 1996-2014, extracting sentiment sentences and performing part-of-speech tagging.

A

MAJOR PROJECT REPORT

ON

“WhatsApp Chat Sentiment Analysis”

In partial fulfillment

For the award of degree of

“Bachelor of Technology”

Department of Computer Engineering & Information Technology

Submitted To: Submitted By:

Ms. Neny Pandel DEEPAK PANDEY


(Assistant Professor) SID - 88720

Department of Computer Science and Engineering

Suresh Gyan Vihar University, Jaipur

NOVEMBER 2022
STUDENT DECLARATION

I declare that my 5th semester report entitled ‘WhatsApp Chat Sentiment Analysis’ is my own work, conducted under the supervision of Ms. Neny Pandel.

I further declare that, to the best of my knowledge, the report for the B.Tech 5th semester does not contain any part of work that has been submitted for the award of a B.Tech degree, either in this or any other university, without proper citation.

Student’s sign Submitted to:

Ms. Neny Pandel

(Assistant professor)
ACKNOWLEDGEMENT

A good working environment and motivation enhance the quality of work, and I received both from my college through our CLNLP project.

I was given this golden opportunity under the expert guidance of Ms. Neny Pandel of SGVU, Jaipur. I am heartily thankful to her for helping me complete my project successfully. She shared her full experience and additional practical knowledge with us.

I am also thankful to my head of department, Mr. Sohit Agarwal, and to all the CEIT staff for guiding us.

Finally, I thank all the people who directly or indirectly helped me to complete this project.

Student name:

DEEPAK PANDEY

SID :- 88720
CERTIFICATE

This is to certify that the project report entitled ‘WhatsApp Chat Sentiment Analysis’ is a bona fide report of the work carried out by Saurabh Kumar under guidance and supervision, in partial fulfilment of the degree of B.Tech CSE at Suresh Gyan Vihar University, Jaipur.

To the best of our knowledge and belief, this work embodies the candidate's own work, has duly been completed, fulfils the requirements of the ordinance relating to the bachelor's degree of the university, and is up to the standard in respect of content, presentation and language for being referred to the examiner.

Ms. Neny Pandel Mr. Sohit Agarwal

Assistant Professor HOD, CEIT


ABSTRACT

Sentiment analysis, also known as opinion mining, refers to the use of natural language processing and text analysis to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine. In this project, we aim to perform sentiment analysis of product-based reviews. The data used in this project are online product reviews collected from “amazon.com”. We expect to do review-level categorization of the review data with promising outcomes.


INTRODUCTION

Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment analysis, which is also known as opinion mining, studies people’s sentiments towards certain entities. From a user’s perspective, people are able to post their own content through various social media, such as forums, micro-blogs, or online social networking sites. From a researcher’s perspective, many social media sites release their application programming interfaces (APIs), prompting data collection and analysis by researchers and developers. However, those types of online data have several flaws that potentially hinder the process of sentiment analysis. The first flaw is that, since people can freely post their own content, the quality of their opinions cannot be guaranteed. The second flaw is that the ground truth of such online data is not always available. A ground truth is more like a tag of a certain opinion, indicating whether the opinion is positive, negative, or neutral.

“It is a quite boring movie... but the scenes were good enough.”

The given line is a movie review stating that “it” (the movie) is quite boring but the scenes were good. Understanding such sentiments requires multiple tasks.


Hence, sentiment analysis is a kind of text classification based on the sentiment orientation (SO) of the opinions a text contains. Sentiment analysis of product reviews has recently become very popular in text mining and computational linguistics research.

• Firstly, evaluative terms expressing opinions must be extracted from the review.

• Secondly, the SO, or the polarity, of the opinions must be determined.

• Thirdly, the opinion strength, or the intensity, of an opinion should also be determined.

• Finally, the review is classified with respect to sentiment classes, such as Positive and Negative, based on the SO of the opinions it contains.


REVIEW OF LITERATURE

The most fundamental problem in sentiment analysis is sentiment polarity categorization. It has been studied on a dataset containing over 5.1 million product reviews from Amazon.com, with the products belonging to four categories. A max-entropy POS tagger is used to classify the words of each sentence, with an additional Python program to speed up the process. Negation words such as no, not, and more are included among the adverbs, whereas negation-of-adjective and negation-of-verb patterns are used specifically to identify phrases.

The following classification models are selected for categorization: Naïve Bayesian, Random Forest, Logistic Regression and Support Vector Machine.


For feature selection, Pang and Lee suggested removing objective sentences by extracting subjective ones. They proposed a text-categorization technique that identifies subjective content using minimum cut. Gann et al. selected 6,799 tokens from Twitter data, where each token is assigned a sentiment score, the TSI (Total Sentiment Index), marking it as a positive or a negative token. Specifically, the TSI for a given token is computed as:

$$TSI = \frac{p - \frac{tp}{tn} \times n}{p + \frac{tp}{tn} \times n}$$

where $p$ is the number of times the token appears in positive tweets, $n$ is the number of times it appears in negative tweets, and $\frac{tp}{tn}$ is the ratio of the total number of positive tweets to the total number of negative tweets.
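As a rough illustration, a TSI-style score can be computed directly from token counts. The sketch below assumes the score takes the form (p - r·n) / (p + r·n), where r is the ratio of positive to negative tweets; the function name and the sample counts are hypothetical and chosen only for demonstration.

```python
def total_sentiment_index(p, n, total_pos, total_neg):
    """Return a TSI-style score in [-1, 1] for one token.

    p, n: occurrences of the token in positive / negative tweets.
    total_pos, total_neg: corpus-wide counts of positive / negative tweets.
    Assumed form: (p - r*n) / (p + r*n), with r = total_pos / total_neg.
    """
    ratio = total_pos / total_neg
    denominator = p + ratio * n
    if denominator == 0:
        return 0.0  # token never seen; treated as neutral (an assumption)
    return (p - ratio * n) / denominator


# A token seen 30 times in positive and 10 times in negative tweets,
# in a corpus with equally many positive and negative tweets.
print(total_sentiment_index(30, 10, 1000, 1000))
```

A score close to 1 marks a token that appears mostly in positive tweets, and a score close to -1 one that appears mostly in negative tweets.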
OBJECTIVE

Scrape product reviews from websites featuring various products, specifically amazon.com.

Analyze and categorize review data.

Analyze sentiment on the dataset at the document level (review level).

Categorize or classify opinion sentiment as:

• Positive

• Negative
System Design

Hardware Requirements:

• Core i5/i7 processor

• At least 8 GB RAM

• At least 60 GB of Usable Hard Disk Space

Software Requirements:

• Python 3.x

• Anaconda Distribution

• Google Colab

• Jupyter Notebook

• NLTK Toolkit

• UNIX/LINUX Operating System


Data Information

➢ First, export the WhatsApp group chat as a .txt file.

➢ Second, make a copy of this notebook.

➢ You will then be prompted to enter a file path in step 1.2, Load WhatsApp Group Chat Data.

➢ Finally, enter the path of your chat export.

WhatsApp-Analyzer is a statistical analysis tool for WhatsApp chats. Working on the chat files that can be exported from WhatsApp, it generates various plots showing, for example, which other participant a user responds to the most. We propose to employ dataset manipulation techniques to gain a better understanding of the WhatsApp chats present on our phones.
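A minimal sketch of how an exported chat file might be turned into structured messages is given here. The line layout ("date, time - sender: message") is an assumption, since the export format differs between devices and locales; the pattern, the function name and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# Assumed line layout: "12/05/2022, 21:15 - Alice: Hello everyone"
LINE_PATTERN = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$"
)


def parse_chat(lines):
    """Turn exported chat lines into (date, time, sender, message) tuples."""
    messages = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            messages.append(match.groups())
        elif messages:
            # A line without a timestamp continues the previous message.
            date, time, sender, text = messages[-1]
            messages[-1] = (date, time, sender, text + "\n" + line.strip())
    return messages


sample = [
    "12/05/2022, 21:15 - Alice: Hello everyone",
    "12/05/2022, 21:16 - Bob: Hi Alice",
    "how are you?",
    "12/05/2022, 21:17 - Alice: Doing well",
]
messages = parse_chat(sample)
counts = Counter(sender for _, _, sender, _ in messages)
print(counts)
```

This sketch does not handle system notices (for example, a participant joining the group) or 12-hour timestamps; a real export would need the pattern adjusted to its actual layout.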

Data Format:
The dataset we will use is a .json file. A sample record is given below.
{
  "reviewSummary": "Surprisingly delightful",
  "reviewText": "This is a first read filled with unexpected humor and profound insights into the art of politics and policy. In brief, it is sly, wry, and wise.",
  "reviewRating": "4"
}
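A record of this shape can be read with Python's standard `json` module. The snippet below is only a sketch: it uses the three field names from the sample, with the review text abbreviated, and assumes straight quotes and no trailing comma, which valid JSON requires.

```python
import json

# One record with the field names from the sample (text abbreviated).
raw = """
{
  "reviewSummary": "Surprisingly delightful",
  "reviewText": "This is a first read filled with unexpected humor.",
  "reviewRating": "4"
}
"""

record = json.loads(raw)
rating = int(record["reviewRating"])  # the rating is stored as a string
print(record["reviewSummary"], rating)
```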
Methodology for Implementation
(Formulation/Algorithm)

DATA COLLECTION:

The data are product reviews collected from amazon.com between May 1996 and July 2014. Each review includes the following information: 1) reviewer ID; 2) product ID; 3) rating; 4) time of the review; 5) helpfulness; 6) review text. Every rating is based on a 5-star scale, so all ratings range from 1 star to 5 stars, with no half-star or quarter-star ratings.

SENTIMENT SENTENCE EXTRACTION & POS TAGGING:

Tokenization of reviews after the removal of stop words, which carry no meaning related to sentiment, is the basic requirement for POS tagging. After stop words such as “am, is, are, the, but” and so on are removed, the remaining sentences are converted into tokens. These tokens take part in POS tagging.

In natural language processing, part-of-speech (POS) taggers have been developed to classify words based on their parts of speech. For sentiment analysis, a POS tagger is very useful for two reasons: 1) words such as nouns and pronouns usually do not contain any sentiment, and they can be filtered out with the help of a POS tagger; 2) a POS tagger can also be used to distinguish words that can be used in different parts of speech.
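The stop-word removal and tokenization step can be sketched in a few lines of plain Python. The stop list below is a tiny hypothetical one that extends the examples given in the text, and the sketch tokenizes first and filters afterwards, which yields the same tokens.

```python
import re

# A tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"am", "is", "are", "the", "but", "a", "an", "it", "were"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


sentence = "It is a quite boring movie but the scenes were good enough."
tokens = remove_stop_words(tokenize(sentence))
print(tokens)
```

The remaining tokens could then be passed to a POS tagger such as NLTK's `nltk.pos_tag`, which needs its tagger data to be downloaded beforehand.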

NEGATIVE PHRASE IDENTIFICATION:

Words such as adjectives and verbs can convey the opposite sentiment with the help of negative prefixes. For instance, consider the following sentence, found in a review of an electronic device: “The built in speaker also has its uses but so far nothing revolutionary.” The word “revolutionary” is a positive word according to the sentiment word list. However, the phrase “nothing revolutionary” gives a more or less negative feeling. Therefore, it is crucial to identify such phrases. In this work, two types of phrases have been identified, namely negation-of-adjective (NOA) and negation-of-verb (NOV).
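One simple way to spot NOA and NOV phrases is to look for a negation word directly followed by an adjective or a verb. The sketch below assumes the input is already POS-tagged with Penn Treebank style tags; the negation list, the function name and the hand-tagged sample are hypothetical.

```python
NEGATION_WORDS = {"no", "not", "nothing", "never", "n't"}


def find_negation_phrases(tagged_tokens):
    """Return (kind, phrase) pairs, where kind is 'NOA' or 'NOV'.

    tagged_tokens: list of (word, tag) pairs using Penn Treebank tags,
    e.g. 'JJ' for adjectives and 'VB', 'VBD', ... for verbs.
    """
    phrases = []
    for i in range(len(tagged_tokens) - 1):
        word = tagged_tokens[i][0]
        if word.lower() not in NEGATION_WORDS:
            continue
        next_word, next_tag = tagged_tokens[i + 1]
        if next_tag.startswith("JJ"):
            phrases.append(("NOA", f"{word} {next_word}"))
        elif next_tag.startswith("VB"):
            phrases.append(("NOV", f"{word} {next_word}"))
    return phrases


# Tags assigned by hand for illustration.
tagged = [
    ("so", "RB"), ("far", "RB"), ("nothing", "NN"), ("revolutionary", "JJ"),
    ("it", "PRP"), ("does", "VBZ"), ("not", "RB"), ("work", "VB"),
]
print(find_negation_phrases(tagged))
```

Only the word immediately after the negation is inspected here; a wider window would be needed to catch phrases such as “not very good”.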
SENTIMENT CLASSIFICATION ALGORITHMS:

Naïve Bayesian classifier:

The Naïve Bayesian classifier works as follows. Suppose that there exists a set of training data, $D$, in which each tuple is represented by an $n$-dimensional feature vector, $X = (x_1, x_2, \ldots, x_n)$, indicating $n$ measurements made on the tuple from $n$ attributes or features. Assume that there are $m$ classes, $C_1, C_2, \ldots, C_m$. Given a tuple $X$, the classifier will predict that $X$ belongs to $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$, where $i, j \in [1, m]$ and $i \neq j$. $P(C_i \mid X)$ is computed as:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}, \qquad P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
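The decision rule can be illustrated with a small bag-of-words classifier written from scratch. This is a sketch under stated assumptions, not the project's code: it uses word counts as features, add-one (Laplace) smoothing, and log probabilities, and it drops P(X) because that term is the same for every class. The training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (tokens, label). Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the class with the largest log P(C_i) + sum_k log P(x_k | C_i)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (["great", "product"], 1),
    (["works", "great"], 1),
    (["poor", "quality"], 0),
    (["broke", "quickly", "poor"], 0),
]
model = train_naive_bayes(training)
print(predict(model, ["works", "great"]))
print(predict(model, ["poor"]))
```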


Random forest

The random forest classifier was chosen for its superior accuracy over a single decision tree. It is essentially an ensemble method based on bagging. The classifier works as follows: given $D$, the classifier first creates $k$ bootstrap samples of $D$, each denoted $D_i$. Each $D_i$ has the same number of tuples as $D$, sampled with replacement from $D$. Sampling with replacement means that some of the original tuples of $D$ may not be included in $D_i$, whereas others may occur more than once. The classifier then constructs a decision tree based on each $D_i$. As a result, a “forest” of $k$ decision trees is formed.

To classify an unknown tuple, $X$, each tree returns its class prediction, which counts as one vote. $X$ is assigned to the class that receives the most votes.
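The two ingredients described above, bootstrap sampling and majority voting, can be sketched on their own, leaving out the tree construction. The helper names, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the class label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(10))
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # k = 3
for sample in samples:
    # Some items typically repeat while others are missing.
    print(sorted(sample))

# Three trees voting on one tuple: two say positive, one says negative.
print(majority_vote([1, 0, 1]))
```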
The decision tree algorithm implemented in scikit-learn is CART (Classification and Regression Trees). CART uses the Gini index for its tree induction. For $D$, the Gini index is computed as:

$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where $p_i$ is the probability that a tuple in $D$ belongs to class $C_i$. The Gini index measures the impurity of $D$. The lower the index value, the better $D$ was partitioned.
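As a small worked example, the Gini index of a set of class labels can be computed as one minus the sum of squared class proportions. The function name is hypothetical and the label lists are toy data.

```python
from collections import Counter


def gini_index(labels):
    """Gini(D) = 1 - sum_i p_i^2, where p_i is the share of class i in D."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini_index([1, 1, 1, 1]))  # a pure partition
print(gini_index([1, 1, 0, 0]))  # an evenly mixed two-class partition
```

A pure partition scores 0, and an even split between two classes scores 0.5, the maximum for the two-class case.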

Support vector machine

Support vector machine (SVM) is a method for the classification of both linear and nonlinear data. If the data are linearly separable, the SVM searches for the linear optimal separating hyperplane (the linear kernel), which is a decision boundary that separates data of one class from another. Mathematically, a separating hyperplane can be written as $W \cdot X + b = 0$, where $W = (w_1, w_2, \ldots, w_n)$ is a weight vector, $X$ is a training tuple, and $b$ is a scalar. Optimizing the hyperplane essentially reduces to the minimization of $\lVert W \rVert$, whose solution is eventually computed as:

$$W = \sum_{i} \alpha_i y_i X_i$$

where $\alpha_i$ are numeric parameters and $y_i$ are the labels of the support vectors $X_i$. That is:

$$\text{if } y_i = 1 \text{ then } W \cdot X_i + b \geq 1; \qquad \text{if } y_i = -1 \text{ then } W \cdot X_i + b \leq -1$$
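To make the hyperplane notation concrete, the sketch below evaluates W·X + b for a tuple, classifies it by the sign of that value, and checks the usual hard-margin condition y(W·X + b) ≥ 1. The weights are picked by hand rather than learned, and all names are hypothetical.

```python
def decision_value(weights, bias, x):
    """Compute W·X + b for one tuple x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


def satisfies_margin(weights, bias, x, y):
    """Check the constraint y * (W·X + b) >= 1 for a training tuple."""
    return y * decision_value(weights, bias, x) >= 1


weights, bias = [1.0, -1.0], 0.0  # hand-picked, not learned
print(classify(weights, bias, [3.0, 1.0]))
print(satisfies_margin(weights, bias, [3.0, 1.0], 1))
print(satisfies_margin(weights, bias, [1.5, 1.0], 1))
```

The last tuple is classified correctly but lies inside the margin, so it fails the constraint.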
Implementation Details

The training on the dataset consists of the following steps:

Unpacking of data: A small Python script has been implemented to read the dataset from the dataset files and dump it into a pickle file for easier and faster access and object serialization.
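The unpacking step might look roughly like the sketch below. It assumes the raw data holds one JSON object per line, which the report does not state; the function names, the file name and the two sample records are hypothetical.

```python
import json
import os
import pickle
import tempfile


def unpack_reviews(json_lines, pickle_path):
    """Parse one JSON object per line and dump the list to a pickle file."""
    reviews = [json.loads(line) for line in json_lines if line.strip()]
    with open(pickle_path, "wb") as handle:
        pickle.dump(reviews, handle)
    return len(reviews)


def load_reviews(pickle_path):
    """Read the list of reviews back from the pickle file."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)


lines = [
    '{"reviewText": "Works well", "reviewRating": "5"}',
    '{"reviewText": "Stopped working", "reviewRating": "1"}',
]
path = os.path.join(tempfile.mkdtemp(), "reviews.pkl")
stored = unpack_reviews(lines, path)
print(stored, load_reviews(path)[0]["reviewText"])
```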

Preparing Data for Sentiment Analysis:

i) The pickle file is loaded in this step, and any data not used for sentiment analysis is removed. As shown in the sample record under Data Format, the data contain several columns, of which only the rating and the review text are required. So the column “reviewSummary” is dropped from the data.

ii) After that, reviews rated 3 out of 5 are removed, as they signify neutral reviews and we are concerned only with positive and negative reviews.
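The two preparation steps can be sketched on a list of dictionaries. The field names follow the sample record; mapping 4 and 5 stars to the positive label 1 and 1 and 2 stars to the negative label 0 is an assumption, as the report only says that 3-star reviews are treated as neutral. The function name and the sample reviews are hypothetical.

```python
def prepare_reviews(reviews):
    """Keep text and rating, drop neutral reviews, and add a 0/1 label."""
    prepared = []
    for review in reviews:
        rating = int(review["reviewRating"])
        if rating == 3:
            continue  # neutral reviews are discarded
        prepared.append({
            "reviewText": review["reviewText"],
            "reviewRating": rating,
            # Assumption: 4-5 stars are positive (1), 1-2 stars negative (0).
            "label": 1 if rating > 3 else 0,
        })
    return prepared


reviews = [
    {"reviewSummary": "Great", "reviewText": "Loved it", "reviewRating": "5"},
    {"reviewSummary": "Okay", "reviewText": "It is fine", "reviewRating": "3"},
    {"reviewSummary": "Bad", "reviewText": "Broke in a week", "reviewRating": "1"},
]
prepared = prepare_reviews(reviews)
for row in prepared:
    print(row)
```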
Preprocessing Data: This is a vital part of training on the dataset. Here, the words present in the file are considered both as single words and as pairs of words. For example, the word “bad” is negative, but when someone writes “not bad” it is positive. In such cases, considering single words alone would give the wrong result. So pairs of words are checked for modifiers occurring before an adjective, which, if present, may change its meaning.
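Treating words both singly and in pairs amounts to combining unigram and bigram features. A minimal sketch, with a hypothetical function name:

```python
def unigrams_and_bigrams(tokens):
    """Return the single words followed by all adjacent word pairs."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


features = unigrams_and_bigrams(["not", "bad", "at", "all"])
print(features)
```

With both kinds of feature present, a classifier can learn that “not bad” behaves differently from “bad” on its own.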

Training Data / Evaluation: The main chunk of code, which performs the whole sentiment analysis evaluation on the preprocessed data, belongs to this step.

i) The accuracy, precision, recall, and evaluation time are calculated and displayed.

ii) Naïve Bayes, Logistic Regression, Linear SVM and Random Forest classifiers are applied to the dataset for evaluation of sentiments.

iii) The test data are predicted and the confusion matrix of the predictions is displayed.

iv) The total numbers of positive and negative reviews are counted.

v) A review-like sentence is taken as input on the console; the console outputs 1 if it is positive and 0 if it is negative.


Results and Sample Output

The ultimate outcome of training on the public reviews dataset is that the machine is capable of judging whether an entered sentence bears a positive or a negative response.

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total number of relevant instances. Both precision and recall are therefore based on an understanding and measure of relevance.

The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
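These three measures follow directly from the counts of true positives, false positives and false negatives. The sketch below uses made-up counts and a hypothetical function name; returning 0 when a denominator is zero is a convention chosen here, not something the report specifies.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    true_positive=8, false_positive=2, false_negative=8
)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```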

In statistics, a receiver operating characteristic curve, i.e. ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The Total Operating Characteristic (TOC) expands on the idea of ROC by showing the total information in the two-by-two contingency table for each threshold. ROC gives only two bits of relative information for each threshold, thus the TOC gives strictly more information than the ROC.
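The points of an ROC curve can be obtained by sweeping the threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below assumes both classes occur in the labels; the function name, scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs: the origin, then one per distinct threshold.

    scores: classifier scores, higher meaning "more likely positive".
    labels: true classes, 1 for positive and 0 for negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


points = roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
print(points)
```

Plotting the second value of each pair against the first gives the curve; a classifier that ranks every positive above every negative passes through the point (0, 1).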
• The machine evaluates the accuracy of the trained model, along with precision, recall and F1.

• The confusion matrix of the evaluation is calculated.

• It is thus capable of judging an externally written review as positive or negative.

• A positive review is marked as [1], and a negative review as [0].

Results obtained using the hold-out strategy (train-test split) [values rounded to 2 decimal places].

The confusion matrix format is as follows:

True Negative     False Positive
False Negative    True Positive
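A matrix in this layout, with rows for the actual class and columns for the predicted class, can be built by counting label pairs. The sketch below uses a hypothetical function name and made-up labels, with 0 for negative and 1 for positive.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(true_labels, predicted_labels):
        matrix[actual][predicted] += 1
    return matrix


actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

Here one negative review is wrongly predicted as positive and one positive review as negative, so both off-diagonal cells hold 1.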
Output
Conclusion

Sentiment analysis deals with the classification of texts based on the sentiments they contain. This report focuses on a typical sentiment analysis model consisting of three core steps, namely data preparation, review analysis and sentiment classification, and describes representative techniques involved in those steps.

Sentiment analysis is an emerging research area in text mining and computational linguistics, and has attracted considerable research attention in the past few years. Future research should explore sophisticated methods for opinion and product feature extraction, as well as new classification models that can address the ordered-labels property in rating inference. Applications that utilize results from sentiment analysis are also expected to emerge in the near future.
Future Scope

Sentiment analysis is a uniquely powerful tool for businesses (WhatsApp) that are looking to
measure attitudes, feelings and emotions regarding their brand. To date, the majority of
sentiment analysis projects have been conducted almost exclusively by companies and brands
through the use of social media data, survey responses and other hubs of user-generated content.
By investigating and analyzing customer sentiments, these brands are able to get an inside look
at consumer behaviors and, ultimately, better serve their audiences with the products, services
and experiences they offer.

The future of sentiment analysis is going to continue to dig deeper, far past the surface of the
number of likes, comments and shares, and aim to reach, and truly understand, the significance
of social media interactions and what they tell us about the consumers behind the screens. This
forecast also predicts broader applications for sentiment analysis – brands will continue to
leverage this tool, but so will individuals in the public eye, governments, nonprofits, education
centers and many other organizations.
