ABSTARCT
Recently, social media is playing a vital role in social networking and sharing
of data. Social media is favored by many users as it is available to millions of
people without any limitations to share their opinions, educational learning
experience and concerns via their status. Twitter API is processed to search
for the tweets based on the geo-location. Someone posts on social network
gives us a better concern to Analyze about the particular systems process of
the system. Evaluating such data in social network is quite a challenging
process. In the proposed system, there will be a workflow to mine the data
which integrates both qualitative analysis and large-scale machine learning
technique. Based on the different prominent theme’s tweets will be
categorized into different groups. Machine learning classifier will be
implemented on mined data for qualitative analysis purpose to get the
deeper understanding of the data. It uses multi label classification technique
as each label falls into different categories and all the attributes are
independent to each other. Label based measures will be taken to analyze
the results and comparing them with the existing sentiment analysis
technique.
v
LIST OF FIGURES
FIGURE NO FIGURE NAME PAGE NO
7.1 SYSTEM ARCHITECTURE 30
5.3.2 ER DIAGRAM 25
5.3.1 WORKFLOW DIAGRAM 25
5.3.3 USECASE DIAGRAM 26
5.3.4 SEQUENCE DIAGRAM 27
5.3.5 COLLABRATION DIAGRAM 27
11 OUTPUTS 37
LIST OF TABLES
TABLE NO TABLE NAME PAGE NO
4.5 .NET FRAMEWORK 18
LIST OF GRAPHS
GRAPH NO GRAPH NAME PAGE NO
1.2 SVM 05
1.2.2 LINEAR SVM 07
1.2.3 NON-LINEAR SVM 08
vi
TABLE OF CONTENTS
CHAPTER NO TITLE PAGE NO
ABSTRACT V
LIST OF FIGURES VI
LIST OF TABLES VI
LIST OF GRAPHS VI
1. INTRODUCTION 01
1.1 OUTLINE OF PROJECT 01
1.2 PROPOSED ALGORITHM 01
1.3 DOMAIN INTRODUCTION 02
1.3.1 PYTHON IMPLEMENTATION 04
2. LITERATURE REVIEW 10
3. AIM AND SCOPE 14
3.1 PURPOSE
3.2 SCOPE 14
3.3 EXISTING SYSTEM 14
3.4 PROPOSED SYSTEM 14
4. SYSTEM REQURIMENTS 15
4.1 HARDWARE REQURIEMENTS 15
vii
4.2 SOFTWARE REQURIMENTS 15
4.3 LANGUAGE SPECIFICATION 15
4.4THE .NET FRAMEWORK 15
4.5 LANGUAGES SUPPORTED BY.NET 17
4.6 SQL SERVER 19
4.7 FEASIBILITY STUDY 22
4.7.1 ECONOMICAL FEASIBILITY 23
4.7.2 TECHNICAL FEASIBILITY 23
4.7.3 SOCIAL FEASIBILITY 23
5. SYSTEM DESIGN 23
5.1 INPUT DESIGN 23
5.2 OUTPUT DESIGN 24
5.3 DATA FLOW DIAGRAMS 24
6. MODULES 28
6.1 COLLECTION OF USER REVIEWS 28
6.2 PRE-PROCESSING 29
6.3 FEATURE SELECTION 29
6.4 SENTIMENT CLASSIFICATION 29
6.5 ANALYSIS OF REVIEWS 29
7. ARCHITECTURE DIAGRAM 29
7.1 SYSTEM ARCHITECTURE 30
viii
8. SYSTEM TESTING 30
8.1 TEST PLAN 30
8.2 VERIFICATION 30
8.3 VALIDATION 30
8.4 BASICS OF SOFTWARE TESTING 30
8.5 BLACK BOX TESTING 31
8.6 WHITE BOX TESTING 31
8.7 TYPES OF TESTING 31
9. CONCLUSION 35
10. REFERENCE 36
11. APPENDIX 37
A.OUTPUTS 37
ix
CHAPTER 1
INTRODUCTION
Social media website is defined as “a website that facilitates meeting people,
finding like minds, communicating and sharing content, and building
community”; this kind of website allows or encourages various types of
activities, such as commercial, social, or a combination of the two. Social
media categories include digital library, e-commerce, entertainment, forum,
geolocation, social bookmark, social review, social game, and social network.
Social network is the subcategory of social media, which is the social
structure of people who are joined by common interest. Social media are
social channels of communication using web-based technologies, desktop
computers, and mobile technologies. These technologies create highly
interactive platforms through which individuals, communities, and
organizations can share information, discuss, rate, comment, and modify
user-generated and online contents. These advancements enable
communication among businesses, organizations, communities, and
individuals. Social media technologies change the way individuals and large
organizations communicate, and they are increasingly being developed.
Wide range of applications in business and public policy uses sentiment
analysis. Sentimental analysis is now being used from specific product
marketing to antisocial behaviour recognition. Businesses and organizations
have always been concerned about how they are perceived by the public.
This concern results from a variety of motivations, including marketing and
public relations. Before the era of Internet, the only way for an organization to
track its reputation in the media was to hire someone for the specific task of
reading newspapers and manually compiling lists of positive, negative and
neutral references to the organization, it could undertake expensive surveys
of uncertain validity. Today, many newspapers are published online. Some of
them publish dedicated online editions, while others publish the pages of their
print edition in PDF. In addition to newspapers, there are a wide range of
opinionated articles posted online in blogs and other social media. This opens
up the possibility of automatically detecting positive or negative mentions of
an organization in articles published online, thereby dramatically reducing the
effort required to collect this type of information. To this end, Organizations
are becoming increasingly interested in acquiring fine sentiment analysis from
news articles. Fine-grained sentiment analysis is an extremely challenging
problem because of the variety of ways in which opinions can be expressed.
News articles present an even greater challenge, as they usually avoid overt
indicators of attitudes. However, despite their apparent neutrality, news
articles can still bear polarity if they describe events that are objectively
positive or negative. Many techniques used for sentiment analysis involve
naïve approaches based on spotting certain keywords which reveal the author
or speaker’s emotions. We use naïve performs fine-grained sentiment
analysis to classify sentences as positive, negative or neutral.
1.2 PROPOSED ALGORITHM
Naïve Bayes Classifier Algorithm
1
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes,
Which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features. Such
as if the fruit is identified on the bases of color, shape, and taste, then
red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identify that it is an apple without
depending on each other.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
2
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Suppose we have a dataset of weather conditions and corresponding target
variable "Play". So using this dataset we need to decide that whether we
should play or not on a particular day according to the weather conditions. So
to solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a
class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means if predictors take continuous values instead of
discrete, then the model assumes that these values are sampled from
the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the
data is multinomial distributed. It is primarily used for document
3
classification problems, it means a particular document belongs to
which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial
classifier, but the predictor variables are the independent Booleans
variables. Such as if a particular word is present or not in a document.
This model is also famous for document classification tasks.
Python Implementation of the Naïve Bayes algorithm:
Now we will implement a Naive Bayes Algorithm using Python. So for this, we
will use the "user_data" dataset, which we have used in our other
classification model. Therefore we can easily compare the Naive Bayes model
with the other models.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
SVM
Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in Machine
Learning.
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called as support vectors, and hence algorithm is
termed as Support Vector Machine. Consider the below diagram in which
there are two different categories that are classified using a decision boundary
or hyperplane:
4
Graph 1.2.1
Example: SVM can be understood with the example that we have used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, so if we want a model that can accurately identify whether it is a cat or
dog, so such a model can be created by using the SVM algorithm. We will first
train our model with lots of images of cats and dogs so that it can learn about
different features of cats and dogs, and then we test it with this strange
creature. So as support vector creates a decision boundary between these
two data (cat and dog) and choose extreme cases (support vectors), it will see
the extreme case of cat and dog. On the basis of the support vectors, it will
classify it as a cat. Consider the below diagram:
Fig 1.2.1
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
5
o Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated
data, which means if a dataset cannot be classified by using a straight
line, then such data is termed as non-linear data and classifier used is
called as Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate
the classes in n-dimensional space, but we need to find out the best decision
boundary that helps to classify the data points. This best boundary is known
as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the
dataset, which means if there are 2 features (as shown in image), then
hyperplane will be a straight line. And if there are 3 features, then hyperplane
will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means
the maximum distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which
affect the position of the hyperplane are termed as Support Vector. Since
these vectors support the hyperplane, hence called a Support vector.
How does SVM works?
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the
dataset has two features x1 and x2. We want a classifier that can classify the
pair(x1, x2) of coordinates in either green or blue. Consider the below image: