DWM Manual

This experiment aims to design dimensional models for a data warehouse including star and snowflake schemas. It discusses dimensional modeling and different schema types. As a case study, students are expected to write a problem statement and design dimensional models for a given case study including star and snowflake schemas.


Shivajirao S. Jondhale College of Engineering, Dombivli (E)


Department of Computer Engineering

Laboratory Manual
Year 2023-2024

DATA WAREHOUSE AND MINING LAB(CSL504)


Semester – V

Prepared by
Prof. Diksha D. Bhave
Prof. Reena Deshmukh

HOD
Dr. Uttara Gogate



VISION
To impart high quality technical education for creating competent and ethically strong
professionals with capabilities of accepting new challenges.

MISSION
• Our efforts are dedicated to impart high quality technical education based on a balanced
program of instructions and practical experiences.
• Our strength is to provide value based technical education to develop core competencies
and ethics for overall personality development.
• Our endeavor is to impart in depth knowledge and versatility to meet the global
challenges.

Department of Computer Engineering

VISION
To impart quality technical education in the department of Computer Engineering for
creating competent and ethically strong engineers with capabilities of accepting new
challenges.

MISSION

• Our efforts are dedicated to impart quality technical education to prepare
engineering graduates who excel in programming skills.
• Our strength is to serve society by producing globally competent professionals.
• Our endeavor is to provide all possible support to build strong teaching environment
to provide quality education in Computer Engineering.

PROGRAM EDUCATIONAL OBJECTIVES (PEOS)

• To prepare learners with a sound foundation in the mathematical, scientific and
engineering fundamentals
• To develop among learners ability to formulate, analyze and solve engineering
problems in real life
• To encourage, motivate and prepare learners to inculcate professional and ethical
attitude for lifelong learning
• To prepare learner to become generalist engineers and for pursuing higher studies

Program Specific Outcomes (PSOs)

• Ability to use software methodology and various software tools for developing
system programs, high quality web apps and solutions to complex real world
problems.
• Ability to identify and use suitable data structure and analyze the various algorithm
for given problem from different domains

Data warehousing and Mining Lab


Semester – V

Lab Objectives:

1. Learn how to build a data warehouse and query it.
2. Learn about the data sets and data preprocessing.
3. Demonstrate the working of algorithms for data mining tasks such as classification, clustering,
association rule mining and web mining.
4. Apply the data mining techniques with varied input values for different parameters.
5. Explore open source software (like WEKA) to perform data mining tasks.

Lab Outcomes:
Learner will be able to…
1. Design data warehouse and perform various OLAP operations.
2. Implement data mining algorithms like classification.
3. Implement clustering algorithms on a given set of data sample.
4. Implement Association rule mining & web mining algorithm.

List of Experiments

Sr. No. | Title of the Experiment
1 | One case study on building a Data Warehouse/Data Mart. Write a detailed problem statement and design the dimensional model (creation of star and snowflake schema).
2 | Implementation of all dimension tables and the fact table based on the Experiment No. 1 case study.
3 | To study various OLAP operations such as Slice, Dice, Roll-up, Drill-down (roll-down) and Pivot.
4 | Implement the Naïve Bayes classification algorithm using JAVA.
5 | Introduction to the WEKA tool.
6 | Implement the Decision tree classification algorithm using the WEKA tool.
7 | Implement the K-means clustering algorithm using JAVA.
8 | Implementation of the HITS algorithm.
9 | Implement the Apriori association algorithm using the WEKA tool.
10 | Implement the linear regression algorithm using Python.

Experiments to CO Mapping

Expt No. | Title (Subtopic Name) | Prior Concept (Chapter Name/Concept) | References (From Syllabus) | CO Mapping
1 | Star schema/snowflake schema | Dimension modelling | Reema Thareja, “Data Warehousing”, Oxford University Press | LO1
2 | Dimension table and fact table | Dimension modelling | Reema Thareja, “Data Warehousing”, Oxford University Press | LO1
3 | OLAP Operations | Analytical Operations | Reema Thareja, “Data Warehousing”, Oxford University Press | LO1
4 | Naïve Bayes classification algorithm using JAVA | Classification | Han, Kamber, “Data Mining Concepts and Techniques” | LO2
5 | Introduction to WEKA tool | Introduction to data mining | Han, Kamber, “Data Mining Concepts and Techniques” | LO1
6 | Decision tree classification | Classification | Han, Kamber, “Data Mining Concepts and Techniques” | LO2
7 | K-means clustering algorithm | Clustering | Han, Kamber, “Data Mining Concepts and Techniques” | LO3
8 | HITS Algorithm | Web Mining Pattern | Han, Kamber, “Data Mining Concepts and Techniques” | LO4
9 | Apriori association algorithm | Mining Frequent Pattern and Association Rule | Daniel Larose, “Data Mining Methods and Models” | LO4
10 | Linear regression | Mining Frequent Pattern and Association Rule | Han, Kamber, “Data Mining Concepts and Techniques” | LO4

Experiment No - 01
Title: Dimensional modeling
Aim: -
One case study on building a Data Warehouse/Data Mart. Write a detailed
problem statement and design the dimensional model (creation of star
and snowflake schema).
Theory:
“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision-making process.”

A schema is a logical description of the entire database. It includes the name and
description of all record types, including their associated data items and
aggregates. Much like a database, a data warehouse also needs to maintain a
schema. A database uses the relational model, while a data warehouse uses the star,
snowflake, or fact constellation schema.

Figure 1.1: Star Schema


1. Star Schema:-
• Each dimension in a star schema is represented by only one dimension
table.
• This dimension table contains the set of attributes.
• Figure 1.1 shows the sales data of a company with respect to
four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four
dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.

Because each dimension has only one dimension table, that table holds the dimension's whole attribute set.
For example, the location dimension table contains the attribute set {location_key,
street, city, province or state, country}. This constraint may cause data redundancy.
For example, "Vancouver" and "Victoria" are both cities in the Canadian province
of British Columbia, so the rows for these cities repeat the same values for the
attributes province or state and country.
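
The structure described above can be sketched in Java as a minimal illustration. The class names, keys, street names, and sample rows below are assumptions made for this sketch (only the attribute names and the cities Vancouver and Victoria come from the text); in the lab the same structure would be created as database tables.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: class names, keys, and sample rows are assumptions.
class StarSchemaSketch {

    // Location dimension row: one denormalized table holds the whole attribute set.
    static final class Location {
        final String street, city, provinceOrState, country;

        Location(String street, String city, String provinceOrState, String country) {
            this.street = street;
            this.city = city;
            this.provinceOrState = provinceOrState;
            this.country = country;
        }
    }

    // Fact row: one key per dimension plus the two measures.
    static final class SalesFact {
        final int timeKey, itemKey, branchKey, locationKey;
        final double dollarsSold;
        final int unitsSold;

        SalesFact(int timeKey, int itemKey, int branchKey, int locationKey,
                  double dollarsSold, int unitsSold) {
            this.timeKey = timeKey;
            this.itemKey = itemKey;
            this.branchKey = branchKey;
            this.locationKey = locationKey;
            this.dollarsSold = dollarsSold;
            this.unitsSold = unitsSold;
        }
    }

    // Dimension table keyed by location_key. Note the repeated province and country values.
    static Map<Integer, Location> sampleLocations() {
        Map<Integer, Location> dim = new HashMap<>();
        dim.put(1, new Location("1 Main St", "Vancouver", "British Columbia", "Canada"));
        dim.put(2, new Location("2 Bay St", "Victoria", "British Columbia", "Canada"));
        dim.put(3, new Location("3 Lake St", "Chicago", "Illinois", "USA"));
        return dim;
    }

    static SalesFact[] sampleFacts() {
        return new SalesFact[] {
            new SalesFact(1, 1, 1, 1, 100.0, 2),
            new SalesFact(1, 2, 1, 2, 250.0, 5),
            new SalesFact(2, 1, 2, 3, 400.0, 8)
        };
    }

    // Join each fact row to the location dimension and sum dollars sold for one country.
    static double dollarsSoldIn(String country, SalesFact[] facts,
                                Map<Integer, Location> locations) {
        double total = 0.0;
        for (SalesFact fact : facts) {
            Location loc = locations.get(fact.locationKey);
            if (loc != null && loc.country.equals(country)) {
                total += fact.dollarsSold;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(dollarsSoldIn("Canada", sampleFacts(), sampleLocations()));
    }
}
```

The two Canadian rows both carry "British Columbia" and "Canada", which is the redundancy the paragraph describes; a snowflake design would move those values into separate tables.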

2. Snowflake schema:

A snowflake schema is an extension of a star schema in which the dimension tables are
normalized, which splits their data into additional tables. It is called a snowflake
schema because its diagram resembles a snowflake. In the
following example, Country is further normalized into an individual table.

Characteristics of Snowflake Schema:

• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced because more tables have to be joined.
• The primary challenge of the snowflake schema is the extra maintenance effort
caused by the larger number of lookup tables.

Result :

Figure 1.2 : Star Schema for Student Database Management system

Conclusion: - Thus we have implemented a star schema for a student management
system using MySQL successfully.


Experiment No - 02

Title: Fact Constellation Schema

Aim: Implementation of all dimension tables and the fact table based on
the Experiment No. 1 case study
Theory:
A fact constellation is a schema for online analytical processing in which
multiple fact tables share dimension tables, so it can be viewed as a collection of
stars. It is also known as a galaxy schema. It extends the star schema and is widely
used, but it is more complex than the star schema and the snowflake schema. A fact
constellation schema can be created by splitting an original star schema into several star schemas. It has
many fact tables and some common dimension tables.

Figure 2.1 – General structure of Fact Constellation


Figure 2.2: Fact table of Sales management system

• Placement is a fact table having attributes: (Stud_roll, Company_id, TPO_id) with
facts: (Number of students eligible, Number of students placed).
• Workshop is a fact table having attributes: (Stud_roll, Institute_id, TPO_id) with
facts: (Number of students selected, Number of students attended the workshop).
• Company is a dimension table having attributes: (Company_id, Name,
Offer_package).
• Student is a dimension table having attributes: (Student_roll, Name, CGPA).
• TPO is a dimension table having attributes: (TPO_id, Name, Age).
• Training Institute is a dimension table having attributes: (Institute_id, Name,
Full_course_fee).


DIMENSION TABLE:

A dimension table keeps the records of a dimension. Each dimension may have a table
associated with it, called its dimension table. A dimension table consists of the textual description
of the dimension. For example, a data warehouse for a class may
contain information such as subjects offered, student names, student roll numbers, student marks, etc.
These dimensions allow us to keep track of student performance.

Properties of dimension table:

1) A dimension table is related to the fact table through a simple primary key.

2) Dimension tables contain the constraints used to link them to the fact table.

Steps to Create Dimensional Data Modelling:

• Step-1: Identifying the business objective –
The first step is to identify the business objective. Sales, HR, Marketing, etc. are some examples,
depending on the needs of the organization. This is the most important step of data modelling, and the
selection of the business objective also depends on the quality of data available for that process.

• Step-2: Identifying Granularity –
Granularity is the lowest level of information stored in the table. The grain describes the level of detail
of the business problem and its solution.

• Step-3: Identifying Dimensions and their Attributes –
Dimensions are objects or things. Dimensions categorize and describe data warehouse facts and
measures in a way that supports meaningful answers to business questions. A data warehouse
organizes descriptive attributes as columns in dimension tables. For example, the date
dimension may contain data like year, month and weekday.

• Step-4: Identifying the Fact –
The measurable data is held by the fact table. Most of the values in fact table rows are numerical,
like price or cost per unit, etc.

• Step-5: Building of Schema –
We implement the dimensional model in this step. A schema is a database structure.

Conclusion: - Thus we have implemented all dimension tables and the fact table based on
the Experiment No. 1 case study.


Experiment No - 03

Title: OLAP operations

Aim:
To study various OLAP operations such as Slice, Dice, Roll-up, Drill-down (roll-down)
and Pivot.
Theory:
An Online Analytical Processing (OLAP) server is based on the
multidimensional data model. It allows managers and analysts to gain insight into
the information through fast, consistent, and interactive access to it. This
experiment covers the types of OLAP operations.

OLAP Operations:

Since OLAP servers are based on a multidimensional view of data, we will
discuss OLAP operations on multidimensional data.

Here is the list of OLAP operations:

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up Operation:
Roll-up performs aggregation on a data cube in either of the following ways:

• By climbing up a concept hierarchy for a dimension
• By dimension reduction
The following diagram illustrates how roll-up works.


Figure 3.1: Roll up operation

• Roll-up is performed by climbing up a concept hierarchy for the dimension
location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy
from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed by dimension reduction, one or more dimensions are
removed from the data cube.
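
As a minimal sketch of roll-up by climbing the location hierarchy from city to country, the Java example below aggregates per-city totals into per-country totals. The class name, the city-to-country table, and the sales figures are assumptions chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the cell values and the city-to-country table are made up.
class RollUpSketch {

    // Climb one step of the location hierarchy: city totals become country totals.
    static Map<String, Double> rollUp(Map<String, Double> salesByCity,
                                      Map<String, String> cityToCountry) {
        Map<String, Double> salesByCountry = new LinkedHashMap<>();
        for (Map.Entry<String, Double> cell : salesByCity.entrySet()) {
            String country = cityToCountry.get(cell.getKey());
            // merge() adds the city total to whatever the country has accumulated so far.
            salesByCountry.merge(country, cell.getValue(), Double::sum);
        }
        return salesByCountry;
    }

    static Map<String, String> sampleHierarchy() {
        Map<String, String> cityToCountry = new LinkedHashMap<>();
        cityToCountry.put("Toronto", "Canada");
        cityToCountry.put("Vancouver", "Canada");
        cityToCountry.put("Chicago", "USA");
        cityToCountry.put("New York", "USA");
        return cityToCountry;
    }

    static Map<String, Double> sampleSales() {
        Map<String, Double> salesByCity = new LinkedHashMap<>();
        salesByCity.put("Toronto", 605.0);
        salesByCity.put("Vancouver", 825.0);
        salesByCity.put("Chicago", 440.0);
        salesByCity.put("New York", 1560.0);
        return salesByCity;
    }

    public static void main(String[] args) {
        System.out.println(rollUp(sampleSales(), sampleHierarchy()));
    }
}
```

Drill-down is the reverse direction: it would need the finer-grained city (or street) cells to be kept, because the country totals alone cannot be split back into cities.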

Drill-down Operation:
Drill-down is the reverse operation of roll-up. It is performed in either of the
following ways:

• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.

The following diagram illustrates how drill-down works −

Figure 3.2: Drill Down Operation

• Drill-down is performed by stepping down a concept hierarchy for the
dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter
to the level of month.
• When drill-down is performed by introducing a new dimension, one or more dimensions
are added to the data cube.
• It navigates from less detailed data to highly detailed data.


Slice Operation:
The slice operation performs a selection on one particular dimension of a given cube and provides
a new sub-cube. Consider the following diagram that shows how slice works.

Figure 3.3: Slice Operation

• Here slice is performed for the dimension "time" using the criterion
time = "Q1".
• It forms a new sub-cube by fixing a single dimension to one value.

Dice Operation:
Dice performs a selection on two or more dimensions of a given cube and provides a new
sub-cube. Consider the following diagram that shows the dice operation.


Figure 3.4: Dice Operation

The dice operation on the cube, based on the following selection criteria, involves
three dimensions:

• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
Pivot Operation:
The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of data. Consider the following diagram
that shows the pivot operation.
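
The slice, dice, and pivot operations described above can be sketched in Java as follows. This is only an illustration: the cube is stored as a flat list of cells, and the class name, cell values, and the extra cities and items in the sample data are assumptions. The selection criteria reuse the ones given in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative sketch: the cube is a flat list of cells and the values are made up.
class SliceDiceSketch {

    static final class Cell {
        final String location, time, item;
        final int unitsSold;

        Cell(String location, String time, String item, int unitsSold) {
            this.location = location;
            this.time = time;
            this.item = item;
            this.unitsSold = unitsSold;
        }
    }

    // Slice: fix one dimension (here time) to a single value.
    static List<Cell> slice(List<Cell> cube, String time) {
        List<Cell> subCube = new ArrayList<>();
        for (Cell c : cube) {
            if (c.time.equals(time)) {
                subCube.add(c);
            }
        }
        return subCube;
    }

    // Dice: keep cells whose values fall inside the allowed sets on three dimensions.
    static List<Cell> dice(List<Cell> cube, Set<String> locations,
                           Set<String> times, Set<String> items) {
        List<Cell> subCube = new ArrayList<>();
        for (Cell c : cube) {
            if (locations.contains(c.location) && times.contains(c.time)
                    && items.contains(c.item)) {
                subCube.add(c);
            }
        }
        return subCube;
    }

    // Pivot: rotate a 2-D view so that rows become columns and columns become rows.
    static int[][] pivot(int[][] view) {
        int rows = view.length;
        int cols = view[0].length;
        int[][] rotated = new int[cols][rows];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rotated[c][r] = view[r][c];
            }
        }
        return rotated;
    }

    static List<Cell> sampleCube() {
        return Arrays.asList(
            new Cell("Toronto", "Q1", "Mobile", 605),
            new Cell("Toronto", "Q2", "Modem", 680),
            new Cell("Vancouver", "Q1", "Phone", 825),
            new Cell("Chicago", "Q1", "Mobile", 440),
            new Cell("Vancouver", "Q3", "Modem", 512));
    }

    public static void main(String[] args) {
        System.out.println("Cells in slice time=Q1: " + slice(sampleCube(), "Q1").size());
    }
}
```

Slice and dice only filter cells, while pivot leaves the data unchanged and swaps the axes of the view.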


Figure 3.5: Pivot Operation


Conclusion:
In this way we have studied different OLAP operations.


Experiment No - 4
Title: Naïve Bayes Algorithm
Aim:-
Implementation of the Naïve Bayes classifier using Java.

Theory: -
The naive Bayesian classifier is a statistical classifier based on
Bayes’ theorem and the maximum a posteriori hypothesis. Bayesian classifiers can predict class
membership probabilities, such as the probability that a given sample belongs to a
particular class. Naive Bayesian
classifiers assume that the effect of an attribute value on a given class is independent
of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computation involved and reduce its cost and, in this sense, is
considered “naive”.

Bayes’ Theorem: - Let X = {x1, x2, . . . , xn} be a sample whose components represent
values measured on a set of n attributes. In Bayesian terms, X is considered “evidence”.
Let H be some hypothesis, such as that the data X belongs to a specific class C. For
classification problems, our goal is to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” (i.e. the observed data sample X). In other
words, we are looking for the probability that sample X belongs to class C, given
that we know the attribute description of X. P(H|X) is the a posteriori probability of
H conditioned on X. For example, suppose our data samples have the attributes age and
income, and that sample X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer. Then
P(H|X) is the probability that customer X will buy a computer given that we know
the customer’s age and income. In contrast, P(H) is the a priori probability of H. For
our example, this is the probability that any given customer will buy a computer,
regardless of age, income, or any other information. The a posteriori probability
P(H|X) is based on more information (about the customer) than the a priori
probability, P(H), which is independent of X. Similarly, P(X|H) is the a posteriori
probability of X conditioned on H. That is, it is the probability that a customer X is
35 years old and earns $40,000, given that we know the customer will buy a
computer. P(X) is the a priori probability of X. In our example, it is the probability
that a person from our set of customers is 35 years old and earns $40,000.
According to Bayes’ theorem, the probability that we want to compute,
P(H|X), can be expressed in terms of the probabilities P(H), P(X|H), and P(X) as
P(H|X) = P(X|H) P(H) / P(X),
and these probabilities may be estimated from the given data.
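
For reference, the same relation is written below in display form. The expansion of P(X) over H and its complement is the standard law of total probability; it is added here as a reminder and is not part of the original text.

```latex
\[
P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)},
\qquad
P(X) = P(X \mid H)\,P(H) + P(X \mid \lnot H)\,P(\lnot H)
\]
```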

Naive Bayesian Classifier:-

The naive Bayesian classifier algorithm works as follows:
I. Let T be a training set of samples, each with its class label. There are k classes,
C1, C2, . . . , Ck. Each sample is represented by an n-dimensional vector, X = {x1, x2, . . . ,
xn}, depicting n measured values of the n attributes, A1, A2, . . . , An, respectively.
II. Given a sample X, the classifier will predict that X belongs to the class having the
highest a posteriori probability, conditioned on X. That is, X is predicted to belong to
the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ k, j ≠ i.

Thus we find the class that maximizes P(Ci|X). The class Ci for which P(Ci|X) is
maximized is called the maximum a posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
III. As P(X) is the same for all classes, only P(X|Ci) P(Ci) need be maximized. If the
class a priori probabilities, P(Ci), are not known, then it is commonly assumed that
the classes are equally likely, that is, P(C1) = P(C2) = . . . = P(Ck), and we would
therefore maximize P(X|Ci). Otherwise we maximize P(X|Ci) P(Ci). Note that the
class a priori probabilities may be estimated by P(Ci) = freq(Ci, T)/|T|.


IV. Given data sets with many attributes, it would be computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci) P(Ci), the
naive assumption of class conditional independence is made. This presumes that the
values of the attributes are conditionally independent of one another, given the class
label of the sample. Mathematically this means that

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × . . . × P(xn|Ci),

where each P(xk|Ci) can be estimated from the training samples.

Program and output:

// Implementation of the Naive Bayes classification algorithm

import java.io.*;

class NaiveBayes
{
    public static void main(String args[]) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter the no. of columns:");
        int n = Integer.parseInt(br.readLine().trim());
        System.out.println("Enter the no. of rows:");
        int m = Integer.parseInt(br.readLine().trim());
        String a[][] = new String[m][n];   // training table; the last column is the class
        String col[] = new String[n];      // attribute names
        System.out.println("Enter the names of the attributes and keep the classification "
                + "column last and its value as yes or no:");
        for (int i = 0; i < n; i++)
        {
            System.out.print("Column" + i + ":");
            col[i] = br.readLine().trim();
        }
        for (int i = 0; i < m; i++)
        {
            System.out.println("Enter the values of row" + (i + 1) + ":");
            for (int j = 0; j < n; j++)
            {
                System.out.print(col[j] + ":");
                a[i][j] = br.readLine().trim();
            }
        }

        // Echo the table and count the rows that belong to each class.
        double count_yes = 0.0;
        double count_no = 0.0;
        System.out.println("The Table that you have entered is:");
        for (int i = 0; i < n; i++)
        {
            System.out.print(col[i] + "\t\t");
        }
        System.out.println();
        for (int i = 0; i < m; i++)
        {
            for (int j = 0; j < n; j++)
            {
                System.out.print(a[i][j] + "\t\t");
            }
            if (a[i][n - 1].equalsIgnoreCase("yes"))
            {
                count_yes++;
            }
            else if (a[i][n - 1].equalsIgnoreCase("no"))
            {
                count_no++;
            }
            System.out.println();
        }
        System.out.println("p(yes):" + count_yes + "/" + m);
        System.out.println("p(no):" + count_no + "/" + m);

        // Read the unseen tuple: one value for each of the n-1 non-class attributes.
        String decision[] = new String[n - 1];
        double decisiony_ctr[] = new double[n - 1];
        double decisionn_ctr[] = new double[n - 1];
        System.out.println("Enter the unseen tuple for classification:");
        for (int i = 0; i < n - 1; i++)
        {
            System.out.print(col[i] + ":");
            decision[i] = br.readLine().trim();
        }

        // For each attribute, count the training rows that match the unseen value, per class.
        for (int i = 0; i < m; i++)
        {
            for (int j = 0; j < n - 1; j++)
            {
                if (a[i][j].equals(decision[j]))
                {
                    if (a[i][n - 1].equalsIgnoreCase("yes"))
                    {
                        decisiony_ctr[j]++;
                    }
                    else if (a[i][n - 1].equalsIgnoreCase("no"))
                    {
                        decisionn_ctr[j]++;
                    }
                }
            }
        }
        for (int j = 0; j < n - 1; j++)
        {
            System.out.println(decision[j] + "/Yes:" + decisiony_ctr[j]);
            System.out.println(decision[j] + "/No:" + decisionn_ctr[j]);
        }

        // Multiply the conditional probabilities of all n-1 attributes (naive assumption).
        double yprobability = 1.0;
        double nprobability = 1.0;
        for (int j = 0; j < n - 1; j++)
        {
            yprobability *= (decisiony_ctr[j] / count_yes);
            nprobability *= (decisionn_ctr[j] / count_no);
        }

        // Weight by the class priors P(yes) and P(no) and pick the larger product.
        yprobability = (count_yes / m) * yprobability;
        nprobability = (count_no / m) * nprobability;
        if (yprobability > nprobability)
        {
            System.out.println("The Decision is Yes");
        }
        else
        {
            System.out.println("The Decision is No");
        }
    }
}

******Output******
C:\Program Files\Java\jdk1.8.0_31\bin>javac NaiveBayes.java
C:\Program Files\Java\jdk1.8.0_31\bin>java NaiveBayes
Enter the no. of columns: 5
Enter the no. of rows: 14
Enter the names of the attributes and keep the classification column last and its value
as yes or no:
Column0:Age
Column1:Income
Column2:Student
Column3:Credit
Column4:Buys
Enter the values of row1:
Age:<=30
Income:High
Student:No
Credit:Fair
Buys:No
Enter the values of row2:
Age:<=30
Income:High
Student:No
Credit:Excellent
Buys:No
Enter the values of row3:
Age:31-40
Income:High
Student:No
Credit:Fair
Buys:Yes
Enter the values of row4:
Age:>40
Income:Medium
Student:No
Credit:Fair
Buys:Yes
Enter the values of row5:
Age:>40
Income:Low
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row6:
Age:>40
Income:Low
Student:Yes
Credit:Excellent
Buys:No
Enter the values of row7:
Age:31-40
Income:Low
Student:Yes
Credit:Excellent
Buys:Yes
Enter the values of row8:
Age:<=30
Income:Medium
Student:NO
Credit:Fair
Buys:No
Enter the values of row9:
Age:<=30
Income:Low
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row10:
Age:>40
Income:Medium
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row11:
Age:<=30
Income:Medium
Student:Yes
Credit:Excellent
Buys:Yes
Enter the values of row12:
Age:31-40
Income:Medium
Student:No
Credit:Excellent
Buys:Yes
Enter the values of row13:
Age:31-40
Income:High
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row14:
Age:>40
Income:Medium
Student:No
Credit:Excellent
Buys:No
The Table that you have entered is:
Age Income Student Credit Buys
<=30 High No Fair No
<=30 High No Excellent No
31-40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31-40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31-40 Medium No Excellent Yes
31-40 High Yes Fair Yes
>40 Medium No Excellent No
p(yes):9.0/14
p(no):5.0/14
Enter the unseen tuple for classification:
Age:<=30
Income:Medium
Student:Yes
Credit:Fair
<=30/Yes:2.0
<=30/No:3.0
Medium/Yes:4.0
Medium/No:2.0
Yes/Yes:6.0
Yes/No:1.0
Fair/Yes:6.0
Fair/No:2.0
The Decision is Yes
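
The decision can be checked by hand from the counts printed above. Using P(yes) = 9/14, P(no) = 5/14 and the per-attribute counts for the unseen tuple (Age <=30, Income Medium, Student Yes, Credit Fair), the two products compared by the classifier work out as follows; the rounding to four decimal places is ours.

```latex
\begin{align*}
P(X \mid \text{yes})\,P(\text{yes})
  &= \tfrac{2}{9}\cdot\tfrac{4}{9}\cdot\tfrac{6}{9}\cdot\tfrac{6}{9}\cdot\tfrac{9}{14}
  \approx 0.0282 \\
P(X \mid \text{no})\,P(\text{no})
  &= \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5}\cdot\tfrac{5}{14}
  \approx 0.0069
\end{align*}
```

Since 0.0282 > 0.0069, the tuple is assigned to the class Yes, which agrees with the program output.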

Conclusion: - Thus we have implemented the Naïve Bayes classifier using Java.


Experiment no - 05

Title: Weka Tool

Aim – Introduction to Weka Tool
Theory -
WEKA is a data mining system developed by the University of Waikato in
New Zealand that implements data mining algorithms. WEKA is a state-of-the-art
facility for developing machine learning (ML) techniques and applying them to
real-world data mining problems. It is a collection of machine learning algorithms for
data mining tasks. The algorithms are applied directly to a dataset. WEKA
implements algorithms for data preprocessing, classification, regression, clustering,
and association rules; it also includes visualization tools. New machine learning
schemes can also be developed with this package. WEKA is open source
software issued under the GNU General Public License.
Launching WEKA Explorer
You can launch Weka from the C:\Program Files directory, from your desktop
by selecting its icon, or from the Windows task bar ‘Start’ → ‘Programs’ → ‘Weka 3-4’.
When the ‘WEKA GUI Chooser’ window appears on the screen, you can select one of
the four options at the bottom of the window.


Figure 5.1: Weka GUI

1. Simple CLI provides a simple command-line interface and allows direct
execution of Weka commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting
statistical tests between learning schemes.
4. Knowledge Flow is a Java-Beans-based interface for setting up and running
machine learning experiments.
For the exercises in this tutorial you will use ‘Explorer’. Click on the ‘Explorer’ button
in the ‘WEKA GUI Chooser’ window.

The ‘WEKA Explorer’ window appears on the screen.


Figure 5.2: Weka Explorer

Preprocessing Data:
At the very top of the window, just below the title bar, there is a row of tabs. Only the
first tab, ‘Preprocess’, is active at the moment because there is no dataset open. The
first three buttons at the top of the preprocess section enable you to load data into
WEKA. Data can be imported from a file in various formats: ARFF, CSV, C4.5, or
binary. It can also be read from a URL or from an SQL database (using JDBC). The
easiest and most common way of getting data into WEKA is to store it as an
Attribute-Relation File Format (ARFF) file.
File Conversion
We assume that all your data is stored in a Microsoft Excel spreadsheet “weather.xls”.


Figure 5.3: Excel Data Sheet

WEKA expects the data file to be in Attribute-Relation File Format (ARFF).
Before you apply an algorithm to your data, you need to convert your data
into a comma-separated file and then into ARFF format (a file with the .arff extension). To
save your data in comma-separated format, select the ‘Save As…’ menu item from the
Excel ‘File’ pull-down menu. In the ensuing dialog box select ‘CSV (Comma
Delimited)’ from the file type pop-up menu, enter a name for the file, and click the
‘Save’ button. Ignore all messages that appear by clicking ‘OK’. Open this file with
Microsoft Word. Your screen will look like the screen below.


Figure 5.4: File conversion excel to .CSV

The rows of the original spreadsheet are converted into lines of text where the
elements are separated from each other by commas. In this file you need to change
the first line, which holds the attribute names, into the header structure that makes up
the beginning of an ARFF file. Add a @relation tag with the dataset’s name, an
@attribute tag with the attribute information, and a @data tag as shown below.


Figure 5.5: Data Sheet after File conversion
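
The manual header edit described above can also be sketched in code. The Java example below is a minimal illustration, not part of WEKA: the class name `CsvToArffSketch`, the method names, and the sample rows are assumptions. It treats a column as numeric when every value parses as a number and as nominal otherwise, and it does not handle quoted fields or missing values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: builds an ARFF header and data section from simple CSV lines.
class CsvToArffSketch {

    static String toArff(String relation, List<String> csvLines) {
        // The first CSV line holds the attribute names.
        String[] names = csvLines.get(0).split(",");

        // Collect the distinct values of every column, in order of first appearance.
        List<Set<String>> values = new ArrayList<>();
        for (int c = 0; c < names.length; c++) {
            values.add(new LinkedHashSet<String>());
        }
        for (int r = 1; r < csvLines.size(); r++) {
            String[] cells = csvLines.get(r).split(",");
            for (int c = 0; c < names.length; c++) {
                values.get(c).add(cells[c].trim());
            }
        }

        StringBuilder arff = new StringBuilder();
        arff.append("@relation ").append(relation).append("\n\n");
        for (int c = 0; c < names.length; c++) {
            arff.append("@attribute ").append(names[c].trim()).append(' ');
            if (allNumeric(values.get(c))) {
                arff.append("numeric");
            } else {
                // Nominal attribute: list the allowed values inside braces.
                arff.append('{').append(String.join(",", values.get(c))).append('}');
            }
            arff.append('\n');
        }
        arff.append("\n@data\n");
        for (int r = 1; r < csvLines.size(); r++) {
            arff.append(csvLines.get(r).trim()).append('\n');
        }
        return arff.toString();
    }

    // A column counts as numeric only if every value parses as a number.
    static boolean allNumeric(Set<String> columnValues) {
        for (String v : columnValues) {
            try {
                Double.parseDouble(v);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    static List<String> sampleCsv() {
        return Arrays.asList(
            "outlook,temperature,humidity,windy,play",
            "sunny,85,85,FALSE,no",
            "overcast,83,86,FALSE,yes",
            "rainy,70,96,FALSE,yes");
    }

    public static void main(String[] args) {
        System.out.print(toArff("weather", sampleCsv()));
    }
}
```

A nominal attribute inferred this way lists only the values that occur in the file, so a value that is legal but absent from the data (for example TRUE for windy in the sample rows) would have to be added to the header by hand.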

Choose ‘Save As…’ from the ‘File’ menu and specify ‘Text Only with Line Breaks’
as the file type. Enter a file name and click the ‘Save’ button. Rename the file so that it has
the extension .arff to indicate that it is in ARFF format.
Opening file from a local file system
Click on the ‘Open file…’ button. It brings up a dialog box allowing you to browse for
the data file on the local file system; choose the “weather.arff” file. Some databases have
the ability to save data in CSV format. In this case, you can select the CSV file from the
local file system. If you would like to convert this file into ARFF format, you can
click on the ‘Save’ button. WEKA automatically creates an ARFF file from your CSV file.


Figure 5.6: ARFF viewer to open a CSV file

Figure 5.7: Arff viewer to select CSV file

Figure 5.8: Main Window

Conclusion: In this way we have studied the WEKA tool.

Experiment no – 06

Title: Decision Tree


Aim: -
Implementation of Decision tree using WEKA.
Theory: -

Decision tree learning uses a decision tree (as a predictive model) to go from
observations about an item (represented in the branches) to conclusions about the
item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where
the target variable can take a discrete set of values are called classification trees; in
these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision trees
where the target variable can take continuous values (typically real numbers) are
called regression trees. In decision analysis, a decision tree can be used to visually
and explicitly represent decisions and decision making. In data mining, a decision
tree describes data (but the resulting classification tree can be an input for decision
making). This experiment deals with decision trees in data mining.

Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables. Each interior node corresponds to one of the input variables; there are
edges to children for each of the possible values of that input variable. Each leaf
represents a value of the target variable given the values of the input variables
represented by the path from the root to the leaf.

A decision tree is a simple representation for classifying examples. For this section,
assume that all of the input features have finite discrete domains, and there is a
single target feature called the "classification". Each element of the domain of the
classification is called a class. A decision tree or a classification tree is a tree in
which each internal (non-leaf) node is labeled with an input feature. The arcs coming
from a node labeled with an input feature are labeled with each of the possible values
of that feature, and each arc leads either to a leaf labeled with a class or to a
subordinate decision node on a different input feature.

A decision tree is a set of conditions organized hierarchically in such a way that the
final decision can be determined by following the conditions that are fulfilled from
the root of the tree to one of its leaves.
• They are easily understandable: they build a model (made up of rules) that is easy
for the user to understand.
• They only work over a single table, and over a single attribute at a time.
• They are one of the most used data mining techniques.
• Decision trees are based on the "divide and conquer" strategy.
• There are two possible types of divisions or partitions:
  • Nominal partitions: a nominal attribute may lead to a split with as many
  branches as there are values for the attribute.
  • Numerical partitions: typically, they allow partitions like "X > a" and "X < a".
  Partitions relating two different attributes are not permitted.
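
The two partition types listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample records, and the threshold 80 are hypothetical, and the numerical split uses "<=" on one side so that every record lands in exactly one branch.

```python
from collections import defaultdict

def nominal_partition(rows, attr):
    """Split rows into one branch per distinct value of a nominal attribute."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr]].append(row)
    return dict(branches)

def numerical_partition(rows, attr, a):
    """Split rows into two branches by comparing a numeric attribute with a."""
    left = [row for row in rows if row[attr] <= a]
    right = [row for row in rows if row[attr] > a]
    return {"<= %s" % a: left, "> %s" % a: right}

# Hypothetical weather-style records, used only for illustration
rows = [
    {"outlook": "sunny", "humidity": 85, "play": "no"},
    {"outlook": "sunny", "humidity": 70, "play": "yes"},
    {"outlook": "overcast", "humidity": 78, "play": "yes"},
    {"outlook": "rainy", "humidity": 96, "play": "no"},
]

by_outlook = nominal_partition(rows, "outlook")
by_humidity = numerical_partition(rows, "humidity", 80)
print({k: len(v) for k, v in by_outlook.items()})
print({k: len(v) for k, v in by_humidity.items()})
```

A tree learner applies such a split at a node and then repeats the process on each branch, which is the "divide and conquer" strategy mentioned in the list.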
Result:

Conclusion: Thus we have implemented a decision tree using WEKA.


Experiment no – 07

Title: K-means Algorithm


Aim:
Implementation of the K-means clustering algorithm using Java.

Theory: -
Clustering is the process of partitioning or grouping a given set of patterns
into disjoint clusters. This is done such that patterns in the same cluster are alike and
patterns belonging to two different clusters are different. Clustering has been a
widely studied problem in a variety of application domains including neural
networks, AI, and statistics.
K-Means clustering intends to partition n objects into k clusters in which each
object belongs to the cluster with the nearest mean. This method produces
exactly k different clusters of greatest possible distinction. The best number of
clusters k leading to the greatest separation (distance) is not known a priori and
must be computed from the data. The objective of K-Means clustering is to
minimize the total intra-cluster variance, that is, the squared error function (the sum,
over all clusters, of the squared distances between each object and its cluster mean).
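
In a commonly used notation (assumed here: $C_j$ is the j-th cluster, $m_j$ is its mean, and $x_i$ ranges over the objects assigned to that cluster), the squared error function can be written as:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^{2}
```

Each assignment step and each mean update can only keep $J$ the same or lower it, which is why the iteration settles, although possibly at a local rather than a global minimum.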

Algorithm
1. Choose the number of clusters k in advance; the data will be clustered into k groups.
2. Select k points at random as the initial cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in
consecutive rounds.
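
The steps above can be sketched compactly in Python for one-dimensional data. This is a minimal sketch, not the program of this experiment (which is written in Java): the function name kmeans_1d and the max_rounds safety limit are assumptions, and the initial means are passed in instead of being chosen at random so that the run is repeatable.

```python
def kmeans_1d(points, means, max_rounds=100):
    """K-means on one-dimensional data with given initial means."""
    means = list(means)
    assignment = None
    for _ in range(max_rounds):
        # Assignment step: index of the nearest mean for every point
        new_assignment = [
            min(range(len(means)), key=lambda j: abs(p - means[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # same points in each cluster as in the previous round
        assignment = new_assignment
        # Update step: each mean becomes the average of its cluster
        for j in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = sum(members) / len(members)
    clusters = [
        [p for p, a in zip(points, assignment) if a == j]
        for j in range(len(means))
    ]
    return means, clusters

means, clusters = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4])
print(means)
print(clusters)
```

The call uses the same nine values and initial means (3 and 4) as the sample run of the Java program in this experiment, so the final means and clusters can be compared with its last iteration.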

Figure 7.1 : Clustering


K-Means is a relatively efficient method. However, the number of clusters must be
specified in advance, the final results are sensitive to initialization, and the
algorithm often terminates at a local optimum. Unfortunately, there is no general
theoretical method to find the optimal number of clusters. A practical approach is
to compare the outcomes of multiple runs with different k and choose the best one
based on a predefined criterion. In general, a large k probably decreases the error
but increases the risk of overfitting.

Program and output:


// Implementation of the K-means clustering algorithm for one-dimensional data
import java.io.*;

class Kmeans
{
    public static void main(String args[]) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter the no. of elements:");
        int n = Integer.parseInt(br.readLine());
        int count, sum;
        // elements[i][0] is the value, elements[i][1] the index of its cluster
        int elements[][] = new int[n][2];
        System.out.println("Enter the elements:");
        // each value is read with readLine(), so enter one value per line
        for (int i = 0; i < n; i++)
            elements[i][0] = Integer.parseInt(br.readLine());
        System.out.println("No. of clusters:");
        int noc = Integer.parseInt(br.readLine());
        double m[] = new double[noc];
        System.out.println("Enter initial means:");
        for (int i = 0; i < noc; i++)
            m[i] = Double.parseDouble(br.readLine());
        // temp[i][1] remembers the cluster of element i from the previous iteration
        int temp[][] = new int[n][2];
        int itn = 1;
        while (true)
        {
            // assignment step: move each element to the cluster with the nearest mean
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < noc; j++)
                {
                    if (Math.abs(elements[i][0] - m[elements[i][1]]) > Math.abs(elements[i][0] - m[j]))
                    {
                        elements[i][1] = j;
                    }
                }
            }
            // update step: recompute the mean of every cluster
            for (int j = 0; j < noc; j++)
            {
                sum = 0;
                count = 0;
                for (int i = 0; i < n; i++)
                {
                    if (elements[i][1] == j)
                    {
                        sum = sum + elements[i][0];
                        count++;
                    }
                }
                if (count > 0)   // keep the old mean if the cluster is empty
                    m[j] = (double) sum / count;
            }
            // stop when no element changed its cluster since the previous iteration
            int c = 0;
            for (int i = 0; i < n; i++)
                if (elements[i][1] == temp[i][1])
                    c++;
            if (c == n)
                break;
            for (int i = 0; i < n; i++)
                temp[i][1] = elements[i][1];
            // print the means and the members of each cluster for this iteration
            System.out.println("\nIteration" + itn);
            for (int j = 0; j < noc; j++)
                System.out.println("Mean" + (j + 1) + ":" + m[j]);
            for (int i = 0; i < noc; i++)
            {
                System.out.println("Cluster" + (i + 1) + ":");
                for (int j = 0; j < n; j++)
                {
                    if (elements[j][1] == i)
                    {
                        System.out.println("" + elements[j][0] + "");
                    }
                }
                System.out.println();
            }
            itn++;
        }
    }
}

******Output******
C:\Program Files\Java\jdk1.8.0_31\bin>javac Kmeans.java
C:\Program Files\Java\jdk1.8.0_31\bin>java Kmeans
Enter the no. of elements: 9
Enter the elements:2 4 10 12 3 20 30 11 25
No. of clusters: 2
Enter initial means: 3 4
Iteration1
Mean1:2.5
Mean2:16.0
Cluster1: 2 3
Cluster2: 4 10 12 20 30 11 25
Iteration2
Mean1:3.0
Mean2:18.0
Cluster1: 2 4 3
Cluster2: 10 12 20 30 11 25
Iteration3
Mean1:4.75
Mean2:19.6
Cluster1: 2 4 10 3
Cluster2: 12 20 30 11 25
Iteration4
Mean1:7.0
Mean2:25.0
Cluster1: 2 4 10 12 3 11
Cluster2: 20 30 25

Conclusion: Thus we have implemented the K-means clustering algorithm successfully.

Experiment no - 08

Title: HITS algorithm


Aim:
Implementation of HITS algorithm
Theory:
The HITS algorithm identifies two different kinds of Web pages, called hubs and
authorities. Authorities are pages having important contents. Hubs are pages that act
as resource lists, guiding users to authorities. Thus, a good hub page for a subject
points to many authoritative pages on that subject, and a good authority page is
pointed to by many good hub pages on the same subject. Hubs and authorities are
shown in Figure 8.1. A page may be a good hub and a good authority at the
same time. This circular relationship leads to the definition of an iterative algorithm
called HITS (Hyperlink-Induced Topic Search). The HITS algorithm ranks
web pages by using their inlinks and outlinks. If a web page is pointed to
by many hyperlinks, it is called an authority, and if the page points to many other pages,
it is called a hub. HITS is a link-based algorithm. The web pages to be ranked are
first collected by analyzing their textual contents against a given query. After
collection of the web pages, the HITS algorithm concentrates on the structure of the
web only, neglecting their textual contents.

Figure 8.1: Hubs and Authorities

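Algorithm:

In the HITS algorithm, the hub and authority weights are calculated as follows, where
$h(p)$ and $a(p)$ denote the hub and authority weights of page p, and $p \to q$ means
that page p links to page q.
HITS Algorithm

1. Initialize all weights to 1

2. Repeat until the weights converge:

3. For every hub p∈H

4. $h(p) = \sum_{q:\, p \to q} a(q)$

5. For every authority p∈A

6. $a(p) = \sum_{q:\, q \to p} h(q)$

7. Normalize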

The algorithm performs a series of iterations, each consisting of two basic steps:

• Authority update: Update each node's authority score to be equal to the sum of the hub
scores of each node that points to it. That is, a node is given a high authority score by being
linked from pages that are recognized as Hubs for information.
• Hub update: Update each node's hub score to be equal to the sum of the authority scores of
each node that it points to. That is, a node is given a high hub score by linking to nodes that are
considered to be authorities on the subject.
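
The two update steps above can be sketched without any graph library. The following is a minimal sketch, assuming a fixed number of iterations instead of a convergence test and normalization by the sum of the scores; the function name hits and these choices are illustrative assumptions. The edge list is the same graph that is used in the program of this experiment.

```python
def hits(edges, iterations=50):
    """Iterate the authority and hub updates on a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes that point to each node
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # Hub update: sum of authority scores of the nodes each node points to
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        # Normalize so that each set of scores sums to 1
        a_total, h_total = sum(auth.values()), sum(hub.values())
        auth = {n: v / a_total for n, v in auth.items()}
        hub = {n: v / h_total for n, v in hub.items()}
    return hub, auth

edges = [('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'), ('D', 'C'),
         ('E', 'D'), ('E', 'B'), ('E', 'F'), ('E', 'C'), ('F', 'C'),
         ('F', 'H'), ('G', 'A'), ('G', 'C'), ('H', 'A')]
hub, auth = hits(edges)
print("Best authority:", max(auth, key=auth.get))
print("Best hub:", max(hub, key=hub.get))
```

Node G has no incoming links, so its authority score stays at zero, while a node with many incoming links from strong hubs accumulates a high authority score.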
Program:
# importing modules
import networkx as nx
import matplotlib.pyplot as plt

# build the directed graph of pages (nodes) and hyperlinks (edges)
G = nx.DiGraph()
G.add_edges_from([('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'),
                  ('D', 'C'), ('E', 'D'), ('E', 'B'), ('E', 'F'),
                  ('E', 'C'), ('F', 'C'), ('F', 'H'), ('G', 'A'),
                  ('G', 'C'), ('H', 'A')])

# draw the graph and display it
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, with_labels=True)
plt.show()

# The in-built hits function returns two dictionaries keyed by nodes
# containing hub scores and authority scores respectively.
hubs, authorities = nx.hits(G, max_iter=50, normalized=True)

print("Hub Scores: ", hubs)
print("Authority Scores: ", authorities)
Output:

Hub Scores: {'A': 0.04642540386472174, 'D': 0.133660375232863,
'B': 0.15763599440595596, 'C': 0.037389132480584515,
'E': 0.2588144594158868, 'F': 0.15763599440595596,
'H': 0.037389132480584515, 'G': 0.17104950771344754}

Authority Scores: {'A': 0.10864044085687284, 'D': 0.13489685393050574,
'B': 0.11437974045401585, 'C': 0.3883728005172019,
'E': 0.06966521189369385, 'F': 0.11437974045401585,
'H': 0.06966521189369385, 'G': 0.0}

Conclusion: Thus we have implemented the HITS algorithm successfully.

Experiment no - 09

Title: Apriori Algorithm

Aim:
Implementation of the Apriori association algorithm using the WEKA tool.
Theory:
Association rule generation is usually split up into two separate steps:
1. First, minimum support is applied to find all frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used
to form rules.
While the second step is straightforward, the first step needs more attention. Finding
all frequent itemsets in a database is difficult since it involves searching all possible
itemsets (item combinations). The set of possible itemsets is the power set over I,
the set of all items, and has size 2^n − 1 (excluding the empty set, which is not a valid
itemset). Although the size of the power set grows exponentially in the number of items
n in I, efficient search is possible using the downward-closure property of support
(also called anti-monotonicity), which guarantees that for a frequent itemset, all its
subsets are also frequent and thus for an infrequent itemset, all its supersets must
also be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and
Eclat) can find all frequent itemsets.
Association Rule Mining:

Association rule mining finds interesting association relationships among a
large set of data items. It is increasingly useful because massive amounts of data are
continuously being collected and stored.

Given a set of transactions, association rule mining aims to find the rules
which enable us to predict the occurrence of a specific item based on the occurrence
of other items.

Association Rule:

For any transaction database, an association rule is written as x=>y, where x and y
are subsets of the set of items A. The rule x=>y is specified with two factors: a
support factor and a confidence factor.

The rule x=>y has a confidence factor z, which means that z% of the transactions
in the database that support x also support y.

Similarly, the rule x=>y has a support s, which means that s% of the transactions
in the database contain x U y. An association rule thus expresses that transactions
which contain x tend to contain y as well.
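
Support and confidence as defined above can be computed directly from a list of transactions. The following minimal Python sketch uses a hypothetical four-transaction database; the function names and the item names are assumptions made for illustration, and the values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical market-basket transactions, for illustration only
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule bread=>milk is supported by two of the four transactions, and two of the three transactions that contain bread also contain milk.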

Apriori Algorithm:

Apriori is an algorithm for frequent itemset mining and association rule learning over
transactional databases, proposed by Agrawal and Srikant in 1994.

• It is designed to operate on databases containing transactions.
• It uses a bottom-up approach, breadth-first search, and a hash tree structure
to count candidate itemsets efficiently.
• It generates candidate itemsets of length k from itemsets of length k-1.
• The set of frequent 1-itemsets is denoted by L1. L1 is used to find L2, L2 is used
to find L3, and so on.

Apriori Property:

The property is based on the following observations:

• If an itemset I doesn’t satisfy the minimum support threshold min_sup, then
I is not frequent, i.e. P(I) < min_sup.
• If an item A is added to the itemset I, then the resulting itemset cannot occur more
frequently than I. Therefore I U A is not frequent either, i.e. P(I U A) < min_sup.

This property belongs to a special category of properties called antimonotonicity, in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as well.

1) The Join step:

To find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 with
itself. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li
(for example, l1[k-2] refers to the second to the last item in l1). Apriori assumes that
items within a transaction or itemset are sorted in lexicographic order. Two itemsets of
Lk-1 are joined if their first k-2 items are the same.

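2) The Prune step:

Ck is a superset of Lk; that is, all of the frequent k-itemsets are included in Ck, but
its members may or may not be frequent. Ck can be huge, so counting the support of
every candidate could involve heavy computation. To reduce the size of Ck,
the Apriori property is used: any candidate that has a (k-1)-subset which is not in Lk-1
cannot be frequent and is removed from Ck. This subset testing can be done quickly by
maintaining a hash tree of all frequent itemsets.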
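
The join and prune steps can be sketched together as one candidate-generation function. This is a minimal Python sketch under stated assumptions: the function name apriori_gen, the single-letter item names, and the sample set of frequent 2-itemsets are hypothetical, and a plain Python set stands in for the hash tree mentioned above.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets C_k from the frequent (k-1)-itemsets L_{k-1}."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            l1, l2 = prev[i], prev[j]
            # Join step: merge two itemsets that agree on their first k-2 items
            if l1[:k - 2] == l2[:k - 2]:
                candidate = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset must itself be frequent
                if all(sub in prev_set
                       for sub in combinations(candidate, k - 1)):
                    candidates.append(candidate)
    return candidates

# Hypothetical frequent 2-itemsets, for illustration only
L2 = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "D"), ("B", "E")]
C3 = apriori_gen(L2, 3)
print(C3)
```

In this example the join produces six 3-itemsets, and four of them are pruned because one of their 2-item subsets (such as C, E) is not in L2.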

Steps In Apriori

The Apriori algorithm is a sequence of steps to be followed to find the frequent
itemsets in the given database. This data mining technique follows the join and the
prune steps iteratively until no further frequent itemsets can be found. A minimum
support threshold is given in the problem or it is assumed by the user.

1. In the first iteration of the algorithm, each item is taken as a candidate 1-itemset.
The algorithm counts the occurrences of each item.
2. Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets
whose occurrence satisfies min_sup is determined. Only those
candidates whose count is greater than or equal to min_sup are taken ahead for the
next iteration; the others are pruned.
3. Next, the frequent 2-itemsets with min_sup are discovered. For this, in the join
step, the candidate 2-itemsets are generated by combining the frequent 1-itemsets
with themselves in groups of 2.
4. The 2-itemset candidates are pruned using the min_sup threshold value. Now the
table will contain only the 2-itemsets that satisfy min_sup.
5. The next iteration forms 3-itemsets using the join and prune steps. This iteration
uses the antimonotone property: for each candidate 3-itemset, its 2-itemset
subsets are checked against min_sup. If all 2-itemset subsets are
frequent, the candidate is kept and its support is counted; otherwise it is pruned.
6. The next step forms 4-itemsets by joining the 3-itemsets with themselves and
pruning any candidate whose subsets do not meet the min_sup criterion. The algorithm
stops when no further frequent itemsets can be generated.
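
The iteration described in the steps above can be sketched as a short level-wise loop. This is a minimal Python sketch, not WEKA's implementation: the function name apriori and the nine sample transactions are hypothetical, min_sup is an absolute count, and the join is simplified to taking unions of two frequent (k-1)-itemsets that contain exactly k items.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: count} for all itemsets with count >= min_sup."""
    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}

    # Iteration 1: every single item is a candidate 1-itemset
    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical transactions, for illustration only
transactions = [frozenset(t) for t in (
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
)]
frequent = apriori(transactions, min_sup=2)
print(sorted(sorted(s) for s in frequent if len(s) == 3))
```

With min_sup = 2 the loop stops after the third iteration, because the only possible 4-itemset has an infrequent 3-item subset and is pruned before any counting takes place.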

Advantages:

• Uses the large itemset property
• Easily parallelized
• Easy to implement

Disadvantages:

• Assumes the transaction database is memory resident.
• Requires many database scans.

Applications Of Apriori Algorithm

1. In the education field: extracting association rules in data mining of admitted students
through characteristics and specialties.
2. In the medical field: for example, analysis of the patients’ database.
3. In forestry: analysis of the probability and intensity of forest fires with forest fire data.
4. Association rule mining of this kind is reported to be used by companies such as
Amazon in recommender systems and by Google for the auto-complete feature.

Figure 9.1: Weka Explorer

Figure 9.2: selection of program

Figure 9.3: Output

Conclusion: In this way we have implemented the Apriori algorithm using the WEKA tool.

Experiment No - 10
Title: Linear Regression

Aim: Implementation of the linear regression algorithm using Python


Theory:
Regression
Regression analysis is one of the most important fields in statistics and machine
learning. There are many regression methods available. Linear regression is one of
them.

What Is Regression?

Regression searches for relationships among variables.

For example, you can observe several employees of some company and try to
understand how their salaries depend on the features, such as experience, level of
education, role, city they work in, and so on.

This is a regression problem where data related to each employee represent
one observation. The presumption is that the experience, education, role, and city are
the independent features, while the salary depends on them.

Similarly, you can try to establish a mathematical dependence of the prices of houses
on their areas, numbers of bedrooms, distances to the city center, and so on.

Generally, in regression analysis, you consider some phenomenon of interest
and have a number of observations. Each observation has two or more features.
Following the assumption that (at least) one of the features depends on the others,
you try to establish a relation among them.

In other words, you need to find a function that maps some features or variables to
others sufficiently well.

- The dependent features are called the dependent variables, outputs, or responses.
- The independent features are called the independent variables, inputs,
or predictors.

Regression problems usually have one continuous and unbounded dependent
variable. The inputs, however, can be continuous, discrete, or even categorical data
such as gender, nationality, brand, and so on.

It is a common practice to denote the outputs with 𝑦 and inputs with 𝑥. If there are
two or more independent variables, they can be represented as the vector 𝐱 = (𝑥₁,
…, 𝑥ᵣ), where 𝑟 is the number of inputs.

When Do You Need Regression?

Typically, you need regression to answer whether and how some phenomenon
influences the other or how several variables are related. For example, you can use it
to determine if and to what extent the experience or gender impact salaries.

Regression is also useful when you want to forecast a response using a new set of
predictors. For example, you could try to predict electricity consumption of a
household for the next hour given the outdoor temperature, time of day, and number
of residents in that household.

Regression is used in many different fields: economics, computer science, social
sciences, and so on. Its importance rises every day with the availability of large
amounts of data and increased awareness of the practical value of data.

Linear Regression
Linear regression is probably one of the most important and widely used regression
techniques. It’s among the simplest regression methods. One of its main advantages
is the ease of interpreting results.

Problem Formulation

When implementing linear regression of some dependent variable 𝑦 on the set of
independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume
a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is
the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is
the random error.

Linear regression calculates the estimators of the regression coefficients or simply
the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the estimated regression
function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies
between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be
as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ)
for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about
determining the best predicted weights, that is the weights corresponding to the
smallest residuals.

To get the best weights, you usually minimize the sum of squared residuals (SSR)
for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called
the method of ordinary least squares.
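
For more than one predictor, the weights that minimize SSR can be obtained with a least-squares solver. The following is a minimal sketch using NumPy's np.linalg.lstsq; the data are hypothetical and were generated exactly from 𝑦 = 1 + 2𝑥₁ + 3𝑥₂ without noise, so the recovered weights should be close to 1, 2 and 3.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the first weight plays the role of b0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ b = y, i.e. the weights that minimize SSR
b, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ b
ssr = float(np.sum(residuals ** 2))
print(b, ssr)
```

Because the sample data contain no random error, the residuals are zero up to floating-point rounding; with real measurements SSR would be positive.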

Regression Performance

The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence
on the predictors 𝐱ᵢ. However, there is also an additional inherent variance of the
output.

The coefficient of determination, denoted as 𝑅², tells you how much of the variation
in 𝑦 can be explained by the dependence on 𝐱 using the particular regression model.

Larger 𝑅² indicates a better fit and means that the model can better explain the
variation of the output with different inputs.

The value 𝑅² = 1 corresponds to SSR = 0, that is to the perfect fit since the values of
predicted and actual responses fit completely to each other.
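
As an illustration, 𝑅² can be computed with the usual definition 𝑅² = 1 − SSR/SST, where SST (a name assumed here) is the total sum of squared deviations of 𝑦 from its mean. The sketch below uses the ten sample points from this experiment's program and NumPy's np.polyfit to fit the line; it is one possible way to do the calculation, not the only one.

```python
import numpy as np

# The ten sample points used in this experiment's program
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Fit a straight line; np.polyfit returns the slope first, then the intercept
b1, b0 = np.polyfit(x, y, 1)
y_pred = b0 + b1 * x

ssr = np.sum((y - y_pred) ** 2)        # sum of squared residuals
sst = np.sum((y - np.mean(y)) ** 2)    # total variation of y around its mean
r2 = 1.0 - ssr / sst
print(round(r2, 4))
```

A value close to 1 means that most of the variation in 𝑦 is explained by the fitted line.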

Simple Linear Regression

Simple or single-variate linear regression is the simplest case of linear regression
with a single independent variable, 𝐱 = 𝑥.

The following figure illustrates simple linear regression:

Figure 10.1: Linear regression


Conclusion: In this way we have successfully implemented the linear regression algorithm
using Python.

Program and output:


Note: Apply this code to your own dataset; alternatively, you may use your own code.

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output of the above piece of code (values rounded to four decimal places):

Estimated coefficients:
b_0 = 1.2364
b_1 = 1.1697
The graph obtained looks like this:
