HPC Mini Project Report

This document describes using classification algorithms like logistic regression and random forest in SPSS Modeler to analyze gravitational wave strain data. It aims to predict gravitational wave events using attributes like strain value and type. The dataset will be split into 70% for training models and 30% for testing. Both algorithms will be used to classify the testing data and compare their accuracy, with the most accurate model being selected.

Uploaded by

Ketan Ingale

“CLASSIFICATION ALGORITHMS USING

SPSS MODELER”

A Mini Project

Submitted by

Rakshitha Shettigar (BC058)

Nishant Dalvi (BC051)

Ketan Ingale (BC045)

Farhan Ansari (BC007)

FOURTH YEAR COMPUTER ENGINEERING

Department of Computer Engineering

Hope Foundation's
International Institute of Information Technology

Hinjawadi, Pune – 411057

AY 2018-2019
Semester-1
Classification algorithms using SPSS Modeler

TABLE OF CONTENTS

1. PROBLEM STATEMENT
2. ABSTRACT
3. INTRODUCTION
4. OBJECTIVE
5. METHODOLOGY
6. MATHEMATICAL MODEL
7. ALGORITHM
8. FLOWCHART
9. RESULT
10. CONCLUSION
11. REFERENCES

2 Department of Computer Engineering I2IT, Pune


1. PROBLEM STATEMENT

Apply a Logistic Regression classifier and a Random Forest classifier to CBC (compact binary
coalescence) data using the SPSS Modeler tool.

Dataset used: gravitational wave strain for the H1 and L1 detectors.

2. ABSTRACT

Gravitational waves are disturbances in the curvature of space-time, generated by accelerated
masses, that propagate as waves outward from their source at the speed of light. As a
gravitational wave passes an observer, that observer will find space-time distorted by the
effects of strain.

The Laser Interferometer Gravitational-Wave Observatory (LIGO) and the Virgo detector are
large-scale physics experiments designed to directly detect gravitational waves. The LIGO
Scientific Collaboration (LSC) and the Virgo Collaboration pursue gravitational wave
science with these detectors, along with partner collaborations around the world. The
gravitational wave strain signals they record are represented in the form of events.

We use a supervised machine learning algorithm to predict an event from the strain type and
strain value. The model is trained on 70% of the data. Testing is done on the remaining 30%,
where strain value and strain type are taken as input and the model predicts the event.

3. INTRODUCTION

Data mining is a technique used in various domains to give meaning to the available data.
Classification is a data mining (machine learning) technique used to predict group
membership for data instances. The data is categorized into a given number of classes, and the
main goal of a classification problem is to identify the category/class into which a new data
instance will fall.
Classification assigns each data instance in a given dataset to a group, according to some
constraints. Several major kinds of classification algorithms, including C4.5, ID3, k-nearest
neighbor, Naive Bayes, SVM, and ANN, are used for classification. Generally, a
classification technique follows one of three approaches: statistical, machine learning, or
neural network.
Classification is a two-step process. In the first step, a model is created by applying a
classification algorithm to a training dataset. In the second step, the model is tested
against a predefined test dataset to measure its performance and accuracy.
Classification is therefore the process of assigning a class label to data whose class label is
unknown.


SPSS Modeler

IBM SPSS Modeler is a data mining and text analytics software application from IBM.
It is used to build predictive models and conduct other analytic tasks. It has a visual
interface which allows users to leverage statistical and data mining algorithms without
programming.
One of its main aims from the outset was to get rid of unnecessary complexity in data
transformations, and to make complex predictive models very easy to use. The first
version incorporated decision trees (ID3), and neural networks (backprop), which could
both be trained without underlying knowledge of how those techniques worked.
IBM SPSS Modeler was originally named Clementine by its creators, Integral
Solutions Limited. This name continued for a while after SPSS's acquisition of the
product. SPSS later changed the name to SPSS Clementine, and then later to PASW
Modeler.[1] Following IBM's 2009 acquisition of SPSS, the product was renamed IBM
SPSS Modeler.

Applications:

a. Customer analytics and customer relationship management (CRM)
b. Fraud detection and prevention
c. Optimizing insurance claims
d. Risk management
e. Manufacturing quality improvement
f. Healthcare quality improvement
g. Forecasting demand or sales
h. Law enforcement and border security
i. Education
j. Telecommunications
k. Entertainment: e.g., predicting movie box office receipts


Classification algorithms:
• Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is dichotomous (binary). Like all regression analyses, logistic regression is a
predictive analysis. It is used to describe data and to explain the relationship between one
dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.
Logistic regression output can be difficult to interpret; the Intellectus Statistics tool
allows you to conduct the analysis and then interprets the output in plain English.

• Random Forest Classifier

Random forest, as its name implies, consists of a large number of individual decision
trees that operate as an ensemble. Each individual tree in the random forest spits out a
class prediction and the class with the most votes becomes our model’s prediction (see
figure below).

Figure: Visualization of a Random Forest Model Making a Prediction

The fundamental concept behind random forest is a simple but powerful one: the
wisdom of crowds. In data science speak, the reason that the random forest model works so
well is:

A large number of relatively uncorrelated models (trees) operating as a committee will
outperform any of the individual constituent models.


The low correlation between models is the key. Just like how investments with low
correlations (like stocks and bonds) come together to form a portfolio that is greater than the
sum of its parts, uncorrelated models can produce ensemble predictions that are more
accurate than any of the individual predictions. The reason for this wonderful effect is that
the trees protect each other from their individual errors (as long as they do not
constantly all err in the same direction). While some trees may be wrong, many other trees
will be right, so as a group the trees are able to move in the correct direction. Therefore, the
prerequisites for random forest to perform well are:

1. There needs to be some actual signal in our features so that models built using those
features do better than random guessing.

2. The predictions (and therefore the errors) made by the individual trees need to have low
correlations with each other.
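
The effect of low correlation can be illustrated with a small simulation. The sketch below is not part of the SPSS Modeler workflow: it assumes hypothetical, fully independent "trees" that are each correct 60% of the time (both figures are chosen for illustration only), and compares a single tree with a majority vote over 101 such trees.

```python
import random

def majority_vote_accuracy(n_trees, p_correct, n_trials, seed=0):
    """Estimate how often a majority of independent voters is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each hypothetical tree is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_trees))
        if correct_votes > n_trees / 2:
            wins += 1
    return wins / n_trials

# Assumed values: 60% per-tree accuracy, 101 trees, 2000 simulated predictions.
single = majority_vote_accuracy(n_trees=1, p_correct=0.6, n_trials=2000)
forest = majority_vote_accuracy(n_trees=101, p_correct=0.6, n_trials=2000)
print(f"single tree: {single:.2f}, 101 trees: {forest:.2f}")
```

Real trees in a forest are never perfectly independent, so the gain in practice is smaller than in this idealized setting; the simulation only shows the direction of the effect.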

4. OBJECTIVE
• To perform supervised machine learning on the gravitational wave strain dataset.
• To apply multiple classification algorithms and evaluate their efficiency.
• To find out which classification algorithm has the highest accuracy in predicting
the event.

5. METHODOLOGY
• The gravitational wave strain data for H1 and L1 has 3 attributes: strain value,
strain type and event.
• The dataset is split into a training dataset (70%) and a testing dataset (30%).
• The training dataset is fed to the classification algorithm to train the model to
correctly predict the event.
• The model is tested on the testing dataset, where the predicted event is the final
output.
• The accuracy of every model on the testing dataset is compared, and the model with the
best accuracy is selected.
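
The 70%/30% split was done in SPSS Modeler; purely as an illustration, the same step could be sketched in Python as follows. The helper name `train_test_split`, the seed, and the sample records are assumptions made for this example and do not come from the project's dataset.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up records in the form (strain value, strain type, event).
records = [(0.1 * i, "H1" if i % 2 == 0 else "L1", f"event_{i % 3}") for i in range(10)]
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the order in which the records were stored.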


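6. MATHEMATICAL MODEL

• Logistic Regression:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$

b0 = Regression constant. b1 = Steepness of curve.

p = probability of a class. x = independent variable.

Logistic regression can handle any number of numerical and/or categorical variables.

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}}$$

b0 = Regression constant.

b1, b2, …, bn = Regression coefficients (steepness of the curve along each variable).

p = probability of a class.

x1, x2, …, xn = independent variables (numerical and/or categorical).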
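
As a small worked example, the standard logistic function turns the linear combination b0 + b1·x1 + … + bn·xn into a probability between 0 and 1. The sketch below assumes that standard form; the function name and the coefficient values are illustrative and are not fitted to the strain dataset.

```python
import math

def logistic_probability(x, b0, coefficients):
    """Return p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    z = b0 + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (assumed) coefficients: z = -1.0 + 0.5*2.0 + 0.0*1.0 = 0.
p = logistic_probability(x=[2.0, 1.0], b0=-1.0, coefficients=[0.5, 0.0])
print(p)
```

When the linear combination is zero the model is undecided (p = 0.5); larger positive values push p towards 1 and negative values towards 0.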


• Random Forest:

It is made up of multiple decision trees. In decision analysis, a decision tree can
be used to visually and explicitly represent decisions and decision making. In data mining, a
decision tree describes data (but the resulting classification tree can be an input for decision
making).

In a decision tree, the major challenge is identifying the attribute for the root node at
each level. This process is known as attribute selection. Three popular attribute
selection measures are:
1. Information Gain
2. Gini Index
3. Gain Ratio

Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets,
the entropy changes. Information gain is a measure of this change in entropy.

Entropy
Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of
an arbitrary collection of examples. The higher the entropy, the greater the information content.
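
To make these two definitions concrete, the sketch below computes Shannon entropy (in bits) for a list of class labels and the information gain of a split, taken here as the parent's entropy minus the size-weighted entropy of the subsets. The function names and the labels "event" / "no_event" are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical labels: an evenly mixed parent node and a split that separates it perfectly.
parent = ["event"] * 4 + ["no_event"] * 4
perfect_split = [["event"] * 4, ["no_event"] * 4]
print(entropy(parent), information_gain(parent, perfect_split))
```

An attribute whose split leaves each subset as mixed as the parent has zero gain, which is why the attribute with the highest gain is preferred for a node.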


7. ALGORITHM

1) Split the dataset into a training dataset (70%) and a testing dataset (30%).
2) Train a model on the training dataset using one of the classification
algorithms.
3) Compare the accuracy of every classification algorithm.

Random Forest Algorithm:

a. Take the test features and use the rules of each randomly created decision tree to
predict the outcome, and store each predicted outcome (target).
b. Calculate the votes for each predicted target.
c. Take the predicted target with the most votes as the final prediction of the random
forest algorithm.
d. To perform a prediction with the trained random forest, pass the test features
through the rules of each randomly created tree. Suppose we formed 100 random
decision trees to form the random forest.
e. Each decision tree may predict a different target (outcome) for the same test features.
The votes are then counted per predicted target. Suppose the 100 random decision
trees predict 3 unique targets x, y and z; then the votes for x are simply the number
of trees, out of 100, whose prediction is x.

The same holds for the other 2 targets (y, z). If x gets the most votes, say 60 of the 100
random decision trees predict x, then the random forest returns x as the predicted
target.

This concept of voting is known as majority voting.
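
The voting step can be sketched in a few lines of Python. The 60 votes for x follow the example above; the 25/15 division between y and z and the function name `majority_vote` are assumptions added for illustration.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequently predicted target together with all vote counts."""
    votes = Counter(tree_predictions)
    winner, _ = votes.most_common(1)[0]
    return winner, votes

# 100 hypothetical tree outputs: 60 predict x, 25 predict y, 15 predict z.
predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15
winner, votes = majority_vote(predictions)
print(winner, dict(votes))
```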


8. FLOWCHART

[Figure: flowchart of the procedure, terminating in "End"]


9. RESULT

Logistic Regression:

                                  Frequency Count    Percentage Accuracy
Correctly Classified Records            8,591,864                 60.44%
Incorrectly Classified Records          5,623,374                 39.56%
Total                                  14,215,238

Random Forest Classifier:

                                  Frequency Count    Percentage Accuracy
Correctly Classified Records           12,897,318                 90.68%
Incorrectly Classified Records          1,326,117                  9.32%
Total                                  14,223,435
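
The percentages follow directly from the record counts: accuracy is the number of correctly classified records divided by the total. The short check below recomputes them from the counts reported above; the function name is chosen for this example.

```python
def accuracy_percent(correct, incorrect):
    """Percentage of correctly classified records."""
    return 100.0 * correct / (correct + incorrect)

# Counts taken from the result tables: (correctly classified, incorrectly classified).
results = {
    "Logistic Regression": (8_591_864, 5_623_374),
    "Random Forest Classifier": (12_897_318, 1_326_117),
}
for name, (correct, incorrect) in results.items():
    total = correct + incorrect
    print(f"{name}: {accuracy_percent(correct, incorrect):.2f}% of {total:,} records")
```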

10. CONCLUSION

Thus we applied two different classification algorithms (Logistic Regression and Random
Forest Classifier) to the gravitational wave strain dataset. The accuracy of the Random Forest
Classifier (90.68%) is substantially higher than that of Logistic Regression (60.44%).

11. REFERENCES

• https://stackabuse.com/decision-trees-in-python-with-scikit-learn/
• https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
• https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/
