Data Mining Lab Manual

Data mining

Uploaded by

sri.03.saiv

INDEX

S.No  NAME OF THE EXPERIMENT
1. Study of Creation of a Data Warehouse.
2. Creation of DataSet
3. Apriori Algorithm.
4. FP-Growth Algorithm.
5. K-means clustering.
6. One Hierarchical clustering algorithm.
7. Bayesian Classification using WEKA
8. Decision Tree.
9. Support Vector Machines.
10. Case study on Banking Application
11. Case study on Text Mining

Additional Experiments
12. Data Discretization
13. Bayesian Classification using Java API


IT6711- Data Mining Laboratory Department of IT 2017-2018

IT6711 DATA MINING LABORATORY          L T P C
                                       0 0 3 2
OBJECTIVES:
The student should be made to:
• Be familiar with the algorithms of data mining.
• Be acquainted with the tools and techniques used for Knowledge Discovery in Databases.
• Be exposed to web mining and text mining.
LIST OF EXPERIMENTS:
1. Creation of a Data Warehouse.
2. Apriori Algorithm.
3. FP-Growth Algorithm.
4. K-means clustering.
5. One Hierarchical clustering algorithm.
6. Bayesian Classification.
7. Decision Tree.
8. Support Vector Machines.
9. Applications of classification for web mining.
10. Case Study on Text Mining or any commercial application.
TOTAL : 45 PERIODS
OUTCOMES:
After completing this course, the student will be able to:
• Apply data mining techniques and methods to large data sets.
• Use data mining tools.
• Compare and contrast the various classifiers.
LAB EQUIPMENT FOR A BATCH OF 30 STUDENTS:
SOFTWARE:
WEKA, RapidMiner, DB Miner or Equivalent
HARDWARE
Standalone desktops 30 Nos

IT6711 Data Mining


Area / Domain: Data Analysis
Corresponding Theory, with code (If any) : Data Warehousing and Data mining
Course Outcomes

On successful completion of this course, the student will be able to

IT6711.1 Apply data mining techniques and methods to large data sets.
IT6711.2 Use data mining tools.
IT6711.3 Compare and contrast the various classifiers

MAPPING BETWEEN CO AND PO, PSO WITH CORRELATION LEVEL 1/2/3

IT6711    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3  PSO4
IT6711.1  1    1    2    2    1    -    -    1    -    1     -     2     1     2     2     1
IT6711.2  1    1    2    2    3    1    1    1    2    1     -     2     1     2     2     1
IT6711.3  1    1    2    2    3    1    1    1    2    1     -     2     1     1     2     1

RELATION BETWEEN COURSE CONTENT / EXPERIMENTS WITH CO

S. No  Knowledge level  Experiment                              Course Outcomes
CYCLE-I
1      L2 & L5          Study of Creation of a Data Warehouse.  IT6711.1
2      L2 & L5          Creation of DataSet                     IT6711.1
3      L2 & L3          Apriori Algorithm.                      IT6711.1
4      L2 & L3          FP-Growth Algorithm.                    IT6711.1
CYCLE-II
5      L5, L3, L4       K-means clustering.                     IT6711.2
6      L5, L3, L4       One Hierarchical clustering algorithm.  IT6711.2
7      L5, L3, L4       Bayesian Classification using WEKA      IT6711.2
8      L5, L3, L4       Decision Tree.                          IT6711.2
9      L5, L3, L4       Support Vector Machines.                IT6711.2
CYCLE-III
10     L5, L3, L4       Case study on Banking Application       IT6711.3
11     L5, L3, L4       Case study on Text Mining               IT6711.3

L1 – Remember; L2 – Understand; L3 – Apply; L4 – Analyze; L5 – Create

(A) PROGRAM OUTCOMES (POs)


Engineering graduates will be able to:
1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for public health
and safety, and for cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods,
including design of experiments, analysis and interpretation of data, and synthesis of the information, to
provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools, including prediction and modeling, to complex engineering activities with an understanding of
the limitations.
6. The engineer and society: Apply reasoning informed by contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to professional
engineering practice.
7. Environment and sustainability: Understand the impact of professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent
and life-long learning in the broadest context of technological change.

(B) PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

1. By acquiring a strong foundation in basic and advanced engineering concepts, students gain expertise in
formulating, analyzing and solving engineering problems.
2. By enhancing their logical reasoning skills, students are made capable of designing optimized technological
solutions in industry as well as academia.
3. By being moulded into active team players, students possess strong interpersonal skills and leadership
qualities with entrepreneurial ability.
4. By encouraging continuous self-learning, students are trained to meet the current demands of
industry and to carry out research in cutting-edge technologies.
5. By infusing a professional, ethical approach to solving critical engineering problems, students are encouraged
to derive solutions that consider economic, environmental, ethical, and societal issues.

(C) PROGRAM SPECIFIC OBJECTIVES (PSOs)

1. Be able to use and apply mathematical foundations, algorithmic principles and computer science theory in
the modeling and design of computer-based systems for providing competent technological solutions.
2. Be able to identify and analyze user needs and take them into account when selecting, creating and evaluating,
and thereby effectively integrating, IT-based solutions using intelligent information tools for society.
3. Be able to apply design, development and management ideologies in the creation of an effective information
system with varying complexity.
4. Understand best practices and ethical standards and replicate the same in the design and development of IT
solutions.

Ex.No. 1 CREATION OF DATAWAREHOUSE


AIM
To design and study a data warehouse for auto sales analysis.

PROCEDURE
Design the data warehouse using star, snowflake and galaxy schemas.
Design a data cube that contains one fact table, and design item, time, supplier, location and
customer dimension tables. Also identify the measures for sales. Insert a minimum of 4 items such as bikes,
small cars, mid-segment cars and car consumables. Also enter a minimum of 10-12 records.
For region/location, enter a minimum of 2 states and a minimum of 2 cities from each state. Keep track
of sales quarter-wise.
Implement the above fact and dimension tables in Oracle 10g in the same way as ordinary relational
tables, and analyze them with the help of an SQL tool.
Use OLAP operations such as slice, dice, roll-up and drill-down.

Dimensional modeling (DM) is the name of a logical design technique often used for data
warehouses. Dimensional modeling always uses the concepts of facts, measures, and dimensions.
Facts are typically (but not always) numeric values that can be aggregated. Dimensions are
groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact;
timestamp, product, register#, store#, etc. are elements of dimensions.
Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc.

Fact table
The fact table is not a typical relational database table, as it is de-normalized on purpose
to improve query response times. The fact table typically contains records that are ready to
explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due
to the time-variant nature of a data warehouse environment. The primary key for the fact table is
a composite of all the columns except the numeric values / scores (like QUANTITY, TURNOVER,
exact invoice date and time).
Typical fact tables in a global enterprise data warehouse are listed below (there are usually
additional company-specific or business-specific fact tables):

Sales fact table - contains all details regarding sales.
Orders fact table - in some cases the table can be split into open orders and historical orders.
Sometimes the values for historical orders are stored in a sales fact table.
Budget fact table - usually grouped by month and loaded once at the end of a year.
Forecast fact table - usually grouped by month and loaded daily, weekly or monthly.
Inventory fact table - reports stocks, usually refreshed daily.

Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining dimension tables is to allow browsing the
categories quickly and easily. The primary keys of the dimension tables are combined
to form the composite primary key of the fact table. In a star schema design, there is
only one de-normalized table for a given dimension.
Typical dimension tables in a data warehouse are:
Time dimension table
Customers dimension table
Products dimension table
Key account managers (KAM) dimension table
Sales office dimension table

Star schema architecture


Star schema architecture is the simplest data warehouse design. Its main feature is a
table at the center, called the fact table, surrounded by dimension tables that allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in third normal form, while
dimension tables are de-normalized (second normal form).
Although the star schema is the simplest data warehouse architecture, it is the one most
commonly used in data warehouse implementations across the world today (about 90-95% of
cases).
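
As a rough illustration of the star layout and of two OLAP operations, the following sketch builds a tiny sales star schema in an in-memory SQLite database instead of Oracle 10g. The table names (dim_item, dim_time, dim_location, fact_sales), the column names and all the sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle schema (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one de-normalized table per dimension.
cur.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT)")
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT)")
cur.execute("CREATE TABLE dim_location (loc_id INTEGER PRIMARY KEY, city TEXT, state TEXT)")

# Fact table: foreign keys to every dimension plus the numeric measures.
# Its primary key is the composite of the dimension keys.
cur.execute("""
    CREATE TABLE fact_sales (
        item_id    INTEGER REFERENCES dim_item(item_id),
        time_id    INTEGER REFERENCES dim_time(time_id),
        loc_id     INTEGER REFERENCES dim_location(loc_id),
        units_sold INTEGER,
        amount     REAL,
        PRIMARY KEY (item_id, time_id, loc_id)
    )
""")

# Hypothetical sample rows.
cur.executemany("INSERT INTO dim_item VALUES (?, ?)", [(1, "Bike"), (2, "Small car")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [(1, 2017, "Q1"), (2, 2017, "Q2")])
cur.executemany("INSERT INTO dim_location VALUES (?, ?, ?)", [
    (1, "Chennai", "Tamil Nadu"), (2, "Madurai", "Tamil Nadu"), (3, "Kochi", "Kerala"),
])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 500.0), (1, 1, 2, 5, 250.0), (2, 1, 3, 2, 800.0),
    (1, 2, 1, 7, 350.0), (2, 2, 1, 1, 400.0),
])

# Roll-up: aggregate city-level facts up to (state, quarter).
rollup = cur.execute("""
    SELECT l.state, t.quarter, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_location l ON f.loc_id = l.loc_id
    JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY l.state, t.quarter
    ORDER BY l.state, t.quarter
""").fetchall()

# Slice: fix one value of the time dimension (quarter Q1) and total units per item.
slice_q1 = cur.execute("""
    SELECT i.item_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_item i ON f.item_id = i.item_id
    JOIN dim_time t ON f.time_id = t.time_id
    WHERE t.quarter = 'Q1'
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()

print(rollup)
print(slice_q1)
```

Drill-down is the reverse of the roll-up query (group by city instead of state), and a dice fixes values on two or more dimensions in the WHERE clause.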


Snowflake Schema architecture


Snowflake schema architecture is a more complex variation of a star schema design. The
main difference is that dimension tables in a snowflake schema are normalized, so they have a
typical relational database design.
Snowflake schemas are generally used when a dimension table becomes very big and
when a star schema can’t represent the complexity of a data structure. For example, if a
PRODUCT dimension table contains millions of rows, the use of a snowflake schema should
significantly improve performance by moving some data out to another table (BRANDS, for
instance).
The problem is that the more normalized a dimension table is, the more complicated the SQL joins
needed to query it become. This is because, in order for a query to be answered, many tables
need to be joined and aggregates generated.
Fact constellation/Galaxy schema Architecture
For each star schema or snowflake schema it is possible to construct a fact constellation schema.
This schema is more complex than the star or snowflake architecture because it contains
multiple fact tables. This allows dimension tables to be shared amongst many fact tables.
In a fact constellation schema, the different fact tables are explicitly assigned to the dimensions
that are relevant to them. This may be useful in cases where some facts are associated
with a given dimension level and other facts with a deeper dimension level.
This model is reasonable when, for example, there is a sales fact table (with details
down to the exact date and invoice header id) and a sales forecast fact table that is
calculated by month, client id and product id.


These dimensions allow us to answer questions such as:

In what regions of the country are pleated pants most popular? (fact table joined with the
product and ship-to dimensions)
What percentage of pants were bought with coupons, and how has that varied from
quarter to quarter? (fact table joined with the promotion and time dimensions)
How many pants were sold on holidays versus non-holidays? (fact table joined with the
time dimension)

RESULT


Ex.No.2 Creation of Dataset

AIM
To create a simple data set that can be opened in WEKA.
PROCEDURE
1. Data can be imported from various file formats, such as CSV and ARFF.
2. To create the CSV file, create a table of student mark details in Excel.
3. Save the table in CSV format.
4. Open the CSV file in WEKA.
5. Save it as an ARFF file by clicking the Save button in the Preprocess tab.
An ARFF file contains two sections: the header and the data.
@relation marks

@attribute Regno numeric
@attribute Name {'ABHISHEKRAM M','AISWARYA PL','AKASH G','AKSHAY KUMAR S','AKSHAY RAMANUJAM RANGANATHAN','ANISHA JULIET E','ANUGRAHA S','ARAVIND BHARATHY S','ARAVIND KUMARAN R','ARCHANA V K','AROKIA JOYCE A','ASHWIN SHANMUGAM I'}
@attribute DAA numeric
@attribute SE {97.0,100.0,A,57.0,82.0,42.0,60.0,30.0}
@attribute OS {57.0,27.0,88.0,68.0,52.0,9.0,78.0,87.0,83.0,59.0,49.0,A}
@attribute probability numeric
@attribute microprocessor numeric
@attribute Result {pass,fail}

@data
100,'ABHISHEKRAM M',97,97.0,57.0,48,48,pass
102,'AISWARYA PL',100,100.0,27.0,90,90,fail
104,'AKASH G',100,100.0,88.0,100,100,pass
106,'AKSHAY KUMAR S',62,A,68.0,100,100,fail
108,'AKSHAY RAMANUJAM RANGANATHAN',57,57.0,52.0,89,89,pass
110,'ANISHA JULIET E',82,82.0,9.0,34,34,fail
112,'ANUGRAHA S',100,100.0,78.0,100,100,pass
114,'ARAVIND BHARATHY S',100,100.0,87.0,100,100,pass
116,'ARAVIND KUMARAN R',100,100.0,83.0,100,100,pass
118,'ARCHANA V K',42,42.0,59.0,100,100,fail
120,'AROKIA JOYCE A',60,60.0,49.0,78,78,pass
122,'ASHWIN SHANMUGAM I',25,30.0,A,21,30,fail
?,?,?,?,?,?,?,?
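
The two ARFF sections can also be produced without a spreadsheet. The sketch below is a minimal hand-written writer, not a WEKA utility; the function name to_arff and the shortened two-row table are assumptions made for illustration. It emits a header of @attribute lines followed by the @data rows, quoting values that contain spaces and writing ? for missing values.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text. `attributes` is a list of (name, kind) pairs, where
    kind is 'numeric' or a list of nominal values. None is written as '?'."""
    def fmt(value):
        if value is None:
            return "?"                      # ARFF marker for a missing value
        text = str(value)
        # Values containing spaces or commas must be quoted.
        return "'" + text + "'" if (" " in text or "," in text) else text

    lines = ["@relation " + relation, ""]
    # Header section: one @attribute line per column.
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(fmt(v) for v in kind) + "}")
    # Data section: one comma-separated line per instance.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(fmt(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical cut-down version of the marks table.
attributes = [
    ("Regno", "numeric"),
    ("Name", ["ABHISHEKRAM M", "AISWARYA PL"]),
    ("DAA", "numeric"),
    ("Result", ["pass", "fail"]),
]
rows = [
    (100, "ABHISHEKRAM M", 97, "pass"),
    (102, "AISWARYA PL", None, "fail"),
]
arff_text = to_arff("marks", attributes, rows)
print(arff_text)
```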


Result:


Ex.No.3 APRIORI ALGORITHM


AIM
This experiment illustrates some of the basic elements of association rule mining using WEKA.
The sample dataset used for this example is the FRUIT_1 transaction dataset listed below.
APRIORI
APRIORI is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database. The frequent item sets determined by Apriori can be used to determine
association rules which highlight general trends in the database. This has applications in domains
such as market basket analysis.
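
The level-wise search described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not WEKA's code; the function names apriori and rules are made up for this example, and the transactions are the ten rows of the FRUIT_1 data written as item sets (item spellings as in the WEKA output). The support threshold is given as an absolute count (3 of 10 instances, i.e. 0.3).

```python
from itertools import combinations

transactions = [
    {"straw", "litchy", "orange"},
    {"straw", "butter"},
    {"butter", "vannila"},
    {"straw", "litchy", "orange"},
    {"orange", "bannana"},
    {"bannana"},
    {"butter", "bannana"},
    {"straw", "litchy", "orange", "apple"},
    {"vannila", "apple"},
    {"straw", "litchy"},
]

def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset found in at least
    min_count transactions."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1 candidates: every individual item.
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        kept = {s: c for s, c in ((s, count(s)) for s in level) if c >= min_count}
        frequent.update(kept)
        # Join frequent k-itemsets into (k+1)-candidates and prune every
        # candidate that has an infrequent k-subset.
        k += 1
        level = set()
        for a, b in combinations(kept, 2):
            cand = a | b
            if len(cand) == k and all(frozenset(sub) in kept
                                      for sub in combinations(cand, k - 1)):
                level.add(cand)
    return frequent

def rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for each rule with
    confidence >= min_conf."""
    found = []
    for itemset, cnt in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = cnt / frequent[lhs]   # support(itemset) / support(lhs)
                if conf >= min_conf:
                    found.append((set(lhs), set(itemset - lhs), conf))
    return found

frequent = apriori(transactions, min_count=3)
found = rules(frequent, min_conf=0.9)
for lhs, rhs, conf in found:
    print(sorted(lhs), "==>", sorted(rhs), "conf:", conf)
```

With these settings the sketch finds 5, 3 and 1 frequent itemsets of sizes 1, 2 and 3, and three rules with confidence 1.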
PROCEDURE
1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized. In this example all attributes are already nominal.
2. Click on the Associate tab to bring up the interface for the association rule algorithms.
3. We will use the Apriori algorithm. This is the default algorithm.
4. In order to change the parameters for the run (for example support, confidence etc.), click on the
text box immediately to the right of the Choose button.
DATASET FRUIT_1
@relation FRUIT_1
@attribute 'straw' {T}
@attribute 'litchy' {T}
@attribute 'orange' {T}
@attribute 'butter' {T}
@attribute 'vannila' {T}
@attribute 'bannana' {T}
@attribute 'apple' {T}
@data
T,T,T,?,?,?,?
T,?,?,T,?,?,?
?,?,?,T,T,?,?
T,T,T,?,?,?,?
?,?,T,?,?,T,?
?,?,?,?,?,T,?
?,?,?,T,?,T,?
T,T,T,?,?,?,T
?,?,?,?,T,?,T
T,T,?,?,?,?,?
Load the Fruit dataset

Set Lower bound support value to 0.3


OUTPUT
=== Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.3 -S -1.0 -c -1


Relation: FRUIT_1
Instances: 10
Attributes: 7
‘straw’
‘litchy’
‘orange’
‘butter’
‘vannila’
‘bannana’


‘apple’
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.3 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 5

Size of set of large itemsets L(2): 3

Size of set of large itemsets L(3): 1

Best rules found:

1. ‘litchy’=T 4 ==> ‘straw’=T 4 conf:(1)


2. ‘litchy’=T ‘orange’=T 3 ==> ‘straw’=T 3 conf:(1)
3. ‘straw’=T ‘orange’=T 3 ==> ‘litchy’=T 3 conf:(1)

RESULT


Ex.No.4 FP-Growth ALGORITHM


AIM
This experiment illustrates the use of the FP-Growth associator in Weka. The sample data set used in
this experiment is the FRUIT_1 data set in ARFF format. This document assumes that appropriate
data preprocessing has been performed.
FP-Growth:
The FP-Growth algorithm is an efficient and scalable method for mining the complete set of
frequent patterns by pattern fragment growth. It uses an extended prefix-tree structure, named the
frequent-pattern tree (FP-tree), to store compressed and crucial information about frequent patterns.
Frequent item set mining is possible without candidate generation, and only two scans of the
database are needed.
In step 1, build a compact data structure called the FP-tree.
In step 2, extract frequent item sets directly from the FP-tree.
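
The two steps can be sketched as follows. This is a simplified teaching version written for this manual, not WEKA's FPGrowth class; the names Node, build_tree and fp_growth are hypothetical. The first scan counts items, the second scan inserts each transaction into the prefix tree, and mining then proceeds recursively over conditional pattern bases. The transactions are the FRUIT_1 rows written as item sets, with a minimum support count of 3.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, a count and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Step 1. Scan 1 counts the items; scan 2 inserts every transaction,
    with its frequent items sorted by descending frequency, into the tree."""
    freq = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in items:
            freq[item] += weight
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)              # item -> all tree nodes for that item
    for items, weight in weighted_transactions:
        ordered = sorted((i for i in items if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return header, freq

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Step 2. Return {frozenset: count} of frequent itemsets, grown from
    conditional pattern bases rather than from generated candidates."""
    header, freq = build_tree(weighted_transactions, min_count)
    result = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        result[itemset] = freq[item]
        # Conditional pattern base: the prefix path above each node of `item`,
        # weighted by that node's count.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        result.update(fp_growth(base, min_count, itemset))
    return result

transactions = [
    {"straw", "litchy", "orange"}, {"straw", "butter"}, {"butter", "vannila"},
    {"straw", "litchy", "orange"}, {"orange", "bannana"}, {"bannana"},
    {"butter", "bannana"}, {"straw", "litchy", "orange", "apple"},
    {"vannila", "apple"}, {"straw", "litchy"},
]
patterns = fp_growth([(t, 1) for t in transactions], min_count=3)
for itemset, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The sketch returns the same nine frequent itemsets that a level-wise Apriori search finds at this threshold, but without ever enumerating candidate itemsets.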
PROCEDURE:
1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized.
2. Click on the Associate tab to bring up the interface for the association rule algorithms.
3. We will use the FP-Growth algorithm.
4. In order to change the parameters for the run (for example support, confidence etc.), click on the
text box immediately to the right of the Choose button.
DATASET:
@relation FRUIT_1
@attribute 'straw' {T}
@attribute 'litchy' {T}
@attribute 'orange' {T}
@attribute 'butter' {T}
@attribute 'vannila' {T}
@attribute 'bannana' {T}
@attribute 'apple' {T}
@data
T,T,T,?,?,?,?
T,?,?,T,?,?,?
?,?,?,T,T,?,?
T,T,T,?,?,?,?
?,?,T,?,?,T,?
?,?,?,?,?,T,?
?,?,?,T,?,T,?
T,T,T,?,?,?,T
?,?,?,?,T,?,T
T,T,?,?,?,?,?
Load the Fruit dataset


Number of Instances: 10
Number of Attributes: 7
The following output shows the association rules that were generated when the FP-Growth
algorithm is applied to the given dataset.


OUTPUT:
=== Run information ===

Scheme: weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.3


Relation: FRUIT_1
Instances: 10
Attributes: 7
‘straw’
‘litchy’
‘orange’
‘butter’
‘vannila’
‘bannana’
‘apple’
=== Associator model (full training set) ===

FPGrowth found 3 rules (displaying top 3)

1. [‘litchy’=T]: 4 ==> [‘straw’=T]: 4 <conf:(1)> lift:(2) lev:(0.2) conv:(2)


2. [‘straw’=T, ‘orange’=T]: 3 ==> [‘litchy’=T]: 3 <conf:(1)> lift:(2.5) lev:(0.18) conv:(1.8)
3. [‘orange’=T, ‘litchy’=T]: 3 ==> [‘straw’=T]: 3 <conf:(1)> lift:(2) lev:(0.15) conv:(1.5)
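
The conf, lift and lev figures reported for each rule can be checked by hand from the transaction counts. The sketch below uses the standard definitions of confidence, lift and leverage; the helper name rule_metrics is made up for this example, and conviction is left out because the way WEKA computes it is not stated in this manual.

```python
def rule_metrics(n, count_lhs, count_rhs, count_both):
    """Confidence, lift and leverage of the rule LHS ==> RHS.

    n          -- total number of transactions
    count_lhs  -- transactions containing the antecedent
    count_rhs  -- transactions containing the consequent
    count_both -- transactions containing both
    """
    confidence = count_both / count_lhs
    lift = confidence / (count_rhs / n)          # confidence relative to P(RHS)
    leverage = count_both / n - (count_lhs / n) * (count_rhs / n)
    return confidence, lift, leverage

# Rule 2 above: straw and orange occur together in 3 of the 10 transactions,
# litchy occurs in 4, and all three items occur together in 3.
conf, lift, lev = rule_metrics(n=10, count_lhs=3, count_rhs=4, count_both=3)
print(conf, lift, round(lev, 2))
```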

RESULT:


Ex.No.5 K-MEANS CLUSTERING
AIM
This experiment illustrates the use of simple k-means clustering with Weka Explorer. The
sample data set used for this example is the iris.arff data set. This document assumes
that appropriate pre-processing has been performed.
K-MEANS CLUSTERING:
K-means is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The procedure follows a simple and easy way to partition a given data
set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to
define k centers, one for each cluster. These centers should be placed carefully, because
different locations cause different results. A good choice is to place them as far away
from each other as possible. The next step is to take each point belonging to the given data
set and associate it with the nearest center. When no point is pending, the first step is completed
and an early grouping is done. The k centers are then recomputed as the means of the points
assigned to them, and the assignment and update steps are repeated until the centers no longer move.
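
The assignment and update steps can be sketched in plain Python as below. This is a minimal version of the standard (Lloyd's) k-means iteration, not WEKA's SimpleKMeans; seeding the centers with the first k points is an assumption made here to keep the run deterministic, and the six sample points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster `points` (tuples of floats) into k groups.
    The first k points are used as initial centers (an assumption of this sketch)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break                            # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0), (2.0, 1.0), (6.0, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Different initial centers can lead to different final clusters, which is why WEKA exposes a seed parameter for the initialization.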
PROCEDURE:
1. Run the Weka Explorer and load the data file iris.arff in the Preprocess interface.

2. In order to perform clustering, select the ‘Cluster’ tab in the Explorer and click on the Choose
button. This step results in a drop-down list of available clustering algorithms.

3. In this case we select ‘SimpleKMeans’.

4. Next, click on the text box to the right of the Choose button to get the pop-up window shown in the
screenshots. In this window we enter six as the number of clusters and leave the value of the
seed as it is. The seed value is used to generate a random number, which is used for making
the initial assignments of instances to clusters.

5. Once the options have been specified, we run the clustering algorithm. We must make
sure that, in the ‘Cluster mode’ panel, the ‘Use training set’ option is selected, and then
we click the ‘Start’ button. This process and the resulting window are shown in the following
screenshots.

6. The result window shows the centroid of each cluster as well as statistics on the number
and the percentage of instances assigned to the different clusters. Here the cluster centroids are the mean
vectors of each cluster. These centroids can be used to characterize the clusters. For example, the
centroid of cluster 1 shows the class Iris-versicolor, with a mean sepal length of 5.4706, sepal
width of 2.4765, petal width of 1.1294 and petal length of 3.7941.

7. Another way of understanding the characteristics of each cluster is through visualization. To do
this, right-click the result set in the ‘Result list’ panel and select ‘Visualize
cluster assignments’.

Dataset iris.arff
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
Load the iris.arff dataset

Number of Instances: 150


Number of Attributes: 4
Classes: 3
The following output shows the clusters that were generated when the simple k-means
algorithm is applied to the given dataset.


OUTPUT:
kMeans
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Missing values globally replaced with mean/mode
Cluster centroids:
                                 Cluster#
Attribute        Full Data               0               1
                     (150)           (100)            (50)
==================================================================
sepallength         5.8433           6.262           5.006
sepalwidth           3.054           2.872           3.418
petallength         3.7587           4.906           1.464
petalwidth          1.1987           1.676           0.244
class          Iris-setosa Iris-versicolor     Iris-setosa
Clustered Instances
0      100 ( 67%)
1       50 ( 33%)


RESULT:

Ex.No.6 HIERARCHICAL CLUSTERING
AIM
This experiment illustrates the use of a hierarchical clustering algorithm with Weka Explorer. The
sample data set used for this example is the weather.arff data set. This document
assumes that appropriate pre-processing has been performed.
HIERARCHICAL CLUSTERING

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of
clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative is a
"bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged
as one moves up the hierarchy. Divisive is a "top down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the hierarchy.
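
The agglomerative ("bottom up") strategy can be sketched as below. Single linkage (the distance between the closest pair of members) and Euclidean distance are assumptions chosen for this illustration, and the five points are hypothetical; this is not WEKA's HierarchicalClusterer.

```python
import math

def agglomerative(points, num_clusters):
    """Single-linkage agglomerative clustering. Starts with one cluster per
    point and repeatedly merges the two closest clusters. Clusters are lists
    of point indices; `merges` records the merge order and distances."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)              # closest pair of clusters
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (9.0, 0.0)]
clusters, merges = agglomerative(points, num_clusters=3)
print(sorted(clusters))
print(merges)
```

The recorded merge distances are the heights at which branches join in the dendrogram; stopping at a different number of clusters corresponds to cutting the tree at a different height.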

PROCEDURE:

1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized.

2. Click on the Cluster tab to bring up the interface for the clustering algorithms.

3. We will use the hierarchical clustering algorithm.

4. In order to change the parameters for the run (Euclidean, Manhattan or Minkowski distance), we
click on the text box immediately to the right of the Choose button.

5. Visualize the resulting tree.

Dataset weather.arff
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

Load the weather.arff dataset


Number of Instances: 14
Number of Attributes: 5
Number of Class: 2
The following output shows the clusters that were generated when the hierarchical
clustering algorithm is applied to the given dataset.


Visualize the tree by right clicking


OUTPUT:
Model and evaluation on training set

Cluster 0
((1.0:1,1.0:1):0,1.0:1)

Cluster 1
(((((0.0:1,0.0:1):0.41421,((((0.0:1,0.0:1):0,(0.0:1,0.0:1):0):0.41421,1.0:1.41421):0,0.0:1.41421):0):0,0.0:1.41421):0,0.0:1.41421):0,1.0:1.41421)
Clustered Instances

0 3 ( 21%)
1 11 ( 79%)

RESULT:

Ex.No.7 BAYESIAN CLASSIFICATION
AIM
This experiment illustrates the use of a Bayesian classifier with Weka Explorer. The sample
data set used for this example is the soyabean.arff data set. This document assumes that
appropriate pre-processing has been performed.
BAYESIAN CLASSIFICATION

Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical
classifiers. They can predict class membership probabilities, such as the probability
that a given tuple belongs to a particular class.
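
How such class membership probabilities are obtained can be sketched with a small naive Bayes classifier for nominal attributes. The sketch uses the 14-instance weather data from Ex.No.6 rather than the soybean data, assumes that attributes are conditionally independent given the class, and applies add-one (Laplace) smoothing. It is an illustration of the idea, not WEKA's NaiveBayes implementation, and the function names train and posterior are made up.

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: tuples of nominal values whose last element is the class label."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)     # (attribute index, class) -> value counts
    domains = defaultdict(set)              # attribute index -> distinct values seen
    for row in rows:
        for idx, value in enumerate(row[:-1]):
            value_counts[(idx, row[-1])][value] += 1
            domains[idx].add(value)
    return class_counts, value_counts, domains

def posterior(model, instance):
    """Return P(class | instance) for every class, normalized to sum to 1."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                   # prior P(class)
        for idx, value in enumerate(instance):
            # P(value | class) with add-one smoothing
            score *= (value_counts[(idx, label)][value] + 1) / (n + len(domains[idx]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

weather = [
    ("sunny", "hot", "high", "FALSE", "no"), ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"), ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"), ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"), ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"), ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"), ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"), ("rainy", "mild", "high", "TRUE", "no"),
]
model = train(weather)
probs = posterior(model, ("sunny", "cool", "high", "TRUE"))
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

For the instance (sunny, cool, high, TRUE) the sketch predicts the class no, since the smoothed likelihoods of those attribute values are higher among the no instances.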

PROCEDURE:

1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized.

2. Next, select the “Classify” tab and click the Choose button to select “NaiveBayes” as the
classifier.

3. Now specify the various parameters. These can be set by clicking in the text box to
the right of the Choose button. In this example, we accept the default values.

4. Select 10-fold cross-validation as the evaluation approach. Since we don’t have a
separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated
model.

5. Now click Start to generate the model. The text description of the model, as well as the evaluation
statistics, will appear in the right panel when the model construction is complete.

6. Note that the classification accuracy of the model is about 93%. This indicates that there may still be
more work to do (either in preprocessing or in selecting the parameters for the classification).

7. Weka also lets us view the results graphically, by right-clicking the entry in the result list.

8. We will use our model to classify new instances.
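
The idea behind k-fold cross-validation in step 4 can be sketched as below: the data is split into k folds, each fold is held out once for testing while the model is trained on the rest, and the accuracy is the fraction of held-out instances predicted correctly. The function names are hypothetical, the folds are taken in simple round-robin order, and the stand-in classifier always predicts the majority class; WEKA's own fold construction (for example any shuffling or stratification) is not reproduced here.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k folds whose sizes differ by at most one."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(rows, k, train_fn, predict_fn):
    """Return the overall accuracy over k train/test rounds.
    Each row is a tuple whose last element is the class label."""
    correct = 0
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        training = [r for i, r in enumerate(rows) if i not in held_out]
        model = train_fn(training)
        correct += sum(1 for i in fold
                       if predict_fn(model, rows[i][:-1]) == rows[i][-1])
    return correct / len(rows)

# Stand-in classifier for the demonstration: ignores the attributes and
# always predicts the most frequent class of the training part.
def train_majority(training):
    return Counter(r[-1] for r in training).most_common(1)[0][0]

def predict_majority(model, attributes):
    return model

# Class labels of the 14 weather instances from Ex.No.6, in file order.
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
rows = [(i, label) for i, label in enumerate(labels)]
accuracy = cross_validate(rows, 7, train_majority, predict_majority)
print(round(accuracy, 4))
```

Any classifier can be plugged in through train_fn and predict_fn; the majority baseline gives a floor against which a real model's cross-validated accuracy can be compared.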

Dataset soyabean.arff

@RELATION soybean

@ATTRIBUTE date {april,may,june,july,august,september,october}


@ATTRIBUTE plant-stand {normal,lt-normal}
@ATTRIBUTE precip {lt-norm,norm,gt-norm}
@ATTRIBUTE temp {lt-norm,norm,gt-norm}
@ATTRIBUTE hail {yes,no}
@ATTRIBUTE crop-hist {diff-lst-year,same-lst-yr,same-lst-two-yrs,same-lst-sev-yrs}
@ATTRIBUTE area-damaged {scattered,low-areas,upper-areas,whole-field}
@ATTRIBUTE severity {minor,pot-severe,severe}
@ATTRIBUTE seed-tmt {none,fungicide,other}
@ATTRIBUTE germination {90-100,80-89,lt-80}
@ATTRIBUTE plant-growth {norm,abnorm}
@ATTRIBUTE leaves {norm,abnorm}
@ATTRIBUTE leafspots-halo {absent,yellow-halos,no-yellow-halos}
@ATTRIBUTE leafspots-marg {w-s-marg,no-w-s-marg,dna}
@ATTRIBUTE leafspot-size {lt-1/8,gt-1/8,dna}
@ATTRIBUTE leaf-shread {absent,present}
@ATTRIBUTE leaf-malf {absent,present}
@ATTRIBUTE leaf-mild {absent,upper-surf,lower-surf}
@ATTRIBUTE stem {norm,abnorm}
@ATTRIBUTE lodging {yes,no}
@ATTRIBUTE stem-cankers {absent,below-soil,above-soil,above-sec-nde}
@ATTRIBUTE canker-lesion {dna,brown,dk-brown-blk,tan}
@ATTRIBUTE fruiting-bodies {absent,present}
@ATTRIBUTE external-decay {absent,firm-and-dry,watery}
@ATTRIBUTE mycelium {absent,present}

@ATTRIBUTE int-discolor {none,brown,black}


@ATTRIBUTE sclerotia {absent,present}
@ATTRIBUTE fruit-pods {norm,diseased,few-present,dna}
@ATTRIBUTE fruit-spots {absent,colored,brown-w/blk-specks,distort,dna}
@ATTRIBUTE seed {norm,abnorm}
@ATTRIBUTE mold-growth {absent,present}
@ATTRIBUTE seed-discolor {absent,present}
@ATTRIBUTE seed-size {norm,lt-norm}
@ATTRIBUTE shriveling {absent,present}
@ATTRIBUTE roots {norm,rotted,galls-cysts}
@ATTRIBUTE class {diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot, phytophthora-rot, brown-stem-rot, powdery-mildew, downy-mildew, brown-spot, bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, phyllosticta-leaf-spot, alternarialeaf-spot, frog-eye-leaf-spot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury}

@DATA

october, normal, gt-norm, norm, yes, same-lst-yr, low-areas, pot-severe, none, 90-100, abnorm, abnorm, absent, dna, dna, absent, absent, absent, abnorm, no, above-sec-nde, brown, present, firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm, diaporthe-stem-canker
august, normal, gt-norm, norm, yes, same-lst-two-yrs, scattered, severe, fungicide, 80-89, abnorm, abnorm, absent, dna, dna, absent, absent, absent, abnorm, yes, above-sec-nde, brown, present, firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm, diaporthe-stem-canker
july, normal, gt-norm, norm, yes, same-lst-yr, scattered, severe, fungicide, lt-80, abnorm, abnorm, absent, dna, dna, absent, absent, absent, abnorm, yes, above-sec-nde, dna, present, firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm, diaporthe-stem-canker

Load the Soyabean.arff dataset

Number of Instances: 683


Number of Attributes: 35
Number of Classes: 19
The following screenshot shows the classifier output generated when the Naïve Bayes algorithm is
applied to the given dataset.

OUTPUT:
Correctly Classified Instances 635 92.9722 %
Incorrectly Classified Instances 48 7.0278 %
Kappa statistic 0.923
Mean absolute error 0.0096
Root mean squared error 0.0817
Relative absolute error 9.9344 %
Root relative squared error 37.2742 %
Coverage of cases (0.95 level) 95.1684 %
Mean rel. region size (0.95 level) 6.5501 %
Total Number of Instances 683
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 diaporthe-stem-canker
1 0 1 1 1 1 charcoal-rot
1 0 1 1 1 1 rhizoctonia-root-rot
1 0.003 0.978 1 0.989 1 phytophthora-rot
1 0 1 1 1 1 brown-stem-rot
1 0 1 1 1 1 powdery-mildew
1 0 1 1 1 1 downy-mildew
0.837 0.008 0.939 0.837 0.885 0.989 brown-spot
1 0.003 0.909 1 0.952 1 bacterial-blight
0.9 0 1 0.9 0.947 1 bacterial-pustule
1 0 1 1 1 1 purple-seed-stain
1 0 1 1 1 1 anthracnose
0.85 0.008 0.773 0.85 0.81 0.994 phyllosticta-leaf-spot
1 0.049 0.758 1 0.863 0.991 alternarialeaf-spot
0.714 0.007 0.942 0.714 0.813 0.98 frog-eye-leaf-spot
1 0.001 0.938 1 0.968 1 diaporthe-pod-&-stem-blight
1 0 1 1 1 1 cyst-nematode

0.875 0 1 0.875 0.933 1 2-4-d-injury


1 0 1 1 1 1 herbicide-injury
Weighted Avg. 0.93 0.009 0.938 0.93 0.929 0.994

=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 77 0 0 0 0 5 6 4 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 2 18 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 2 0 0 0 0 17 1 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 0 22 65 1 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

RESULT

Ex.No.8 DECISION TREE
AIM
This experiment illustrates the use of the J48 classifier in WEKA. The sample data set used in
this experiment is the weather dataset, available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.

PROCEDURE:

1. We begin the experiment by loading the data (weather.arff) into WEKA.

2. Next, we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.

3. Now we specify the various parameters. These can be set by clicking in the text box to
the right of the “Choose” button. In this example, we accept the default values. The default
version does perform some pruning but does not perform reduced-error pruning.

4. Under “Test options” in the main panel, we select 10-fold cross-validation as our
evaluation approach. Since we don’t have a separate evaluation data set, this is necessary to get a
reasonable idea of the accuracy of the generated model.

5. We now click “Start” to generate the model. The ASCII version of the tree, as well as the
evaluation statistics, will appear in the right panel when model construction is complete.

6. Note that the classification accuracy of the model is about 64%. This indicates that more work
may be needed, either in preprocessing or in selecting the parameters for the classification.

7. WEKA also lets us view a graphical version of the classification tree. This can be done
by right-clicking the last result set and selecting “Visualize tree” from the pop-up menu.

8. We will use our model to classify new instances.

9. In the main panel, under “Test options”, click the “Supplied test set” radio button and then
click the “Set” button. This will pop up a window that allows you to open the file containing
the test instances.
Dataset weather.arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
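
To see why a decision tree learner would split on a particular attribute first, the following minimal sketch computes the information gain of each attribute on the 14 weather instances listed above. It is plain Java written for illustration (the class name InfoGainSketch and its methods are our own, not part of WEKA). J48 itself uses the gain ratio and pruning, which this sketch leaves out.

```java
import java.util.HashMap;
import java.util.Map;

final class InfoGainSketch {
    // The 14 weather instances: outlook, temperature, humidity, windy, play.
    static final String[][] WEATHER = {
        {"sunny", "hot", "high", "FALSE", "no"},
        {"sunny", "hot", "high", "TRUE", "no"},
        {"overcast", "hot", "high", "FALSE", "yes"},
        {"rainy", "mild", "high", "FALSE", "yes"},
        {"rainy", "cool", "normal", "FALSE", "yes"},
        {"rainy", "cool", "normal", "TRUE", "no"},
        {"overcast", "cool", "normal", "TRUE", "yes"},
        {"sunny", "mild", "high", "FALSE", "no"},
        {"sunny", "cool", "normal", "FALSE", "yes"},
        {"rainy", "mild", "normal", "FALSE", "yes"},
        {"sunny", "mild", "normal", "TRUE", "yes"},
        {"overcast", "mild", "high", "TRUE", "yes"},
        {"overcast", "hot", "normal", "FALSE", "yes"},
        {"rainy", "mild", "high", "TRUE", "no"}
    };

    // Shannon entropy (in bits) of a vector of class counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of splitting on column attr; the class (yes/no) is the last column.
    static double infoGain(String[][] rows, int attr) {
        int classCol = rows[0].length - 1;
        int[] all = new int[2];
        Map<String, int[]> byValue = new HashMap<>();
        for (String[] r : rows) {
            int k = r[classCol].equals("yes") ? 0 : 1;
            all[k]++;
            byValue.computeIfAbsent(r[attr], v -> new int[2])[k]++;
        }
        double remainder = 0.0;
        for (int[] c : byValue.values()) {
            remainder += (double) (c[0] + c[1]) / rows.length * entropy(c);
        }
        return entropy(all) - remainder;
    }

    public static void main(String[] args) {
        String[] names = {"outlook", "temperature", "humidity", "windy"};
        for (int a = 0; a < names.length; a++) {
            System.out.printf("%-12s gain = %.3f%n", names[a], infoGain(WEATHER, a));
        }
    }
}
```

On this data the outlook attribute has the largest gain, which is why it is a natural candidate for the root of the tree.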

Load the weather.arff dataset

Number of Instances: 14
Number of Attributes: 5
Number of Classes: 2
The following screenshot shows the classification rules that were generated when the J48 algorithm
is applied to the given dataset.

Visualize the tree by right clicking

OUTPUT:
Correctly Classified Instances 9 64.2857 %
Incorrectly Classified Instances 5 35.7143 %
Kappa statistic 0.186
Mean absolute error 0.2857
Root mean squared error 0.4818
Relative absolute error 60 %
Root relative squared error 97.6586 %
Coverage of cases (0.95 level) 92.8571 %
Mean rel. region size (0.95 level) 64.2857 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.778 0.6 0.7 0.778 0.737 0.789 yes
0.4 0.222 0.5 0.4 0.444 0.789 no
Weighted Avg. 0.643 0.465 0.629 0.643 0.632 0.789

=== Confusion Matrix ===

a b <-- classified as
7 2 | a = yes
3 2 | b = no
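
The summary figures above can be recomputed from the confusion matrix alone. The sketch below is plain Java for illustration (ConfusionMetrics is our own helper, not a WEKA class); it assumes rows are actual classes and columns are predicted classes, as in the “classified as” layout.

```java
final class ConfusionMetrics {
    // Total number of instances in the matrix.
    static int total(int[][] m) {
        int n = 0;
        for (int[] row : m) for (int v : row) n += v;
        return n;
    }

    // Fraction of instances on the diagonal (correctly classified).
    static double accuracy(int[][] m) {
        int correct = 0;
        for (int i = 0; i < m.length; i++) correct += m[i][i];
        return (double) correct / total(m);
    }

    // Cohen's kappa: agreement beyond what the row and column totals give by chance.
    static double kappa(int[][] m) {
        double n = total(m);
        double expected = 0.0;
        for (int k = 0; k < m.length; k++) {
            int rowSum = 0, colSum = 0;
            for (int j = 0; j < m.length; j++) {
                rowSum += m[k][j];
                colSum += m[j][k];
            }
            expected += (rowSum / n) * (colSum / n);
        }
        return (accuracy(m) - expected) / (1.0 - expected);
    }

    // Precision of class k: correct predictions of k divided by all predictions of k.
    static double precision(int[][] m, int k) {
        int colSum = 0;
        for (int[] row : m) colSum += row[k];
        return (double) m[k][k] / colSum;
    }

    // Recall (TP rate) of class k: correct predictions of k divided by all actual k.
    static double recall(int[][] m, int k) {
        int rowSum = 0;
        for (int v : m[k]) rowSum += v;
        return (double) m[k][k] / rowSum;
    }

    public static void main(String[] args) {
        int[][] weather = {{7, 2}, {3, 2}}; // a = yes, b = no
        System.out.printf("accuracy = %.4f, kappa = %.3f%n", accuracy(weather), kappa(weather));
        System.out.printf("yes: precision = %.3f, recall = %.3f%n",
                precision(weather, 0), recall(weather, 0));
    }
}
```

For the matrix above this gives 9/14 correct and a kappa of about 0.186, matching the reported summary.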

RESULT:

Ex.No.9 SUPPORT VECTOR MACHINES

AIM
This experiment illustrates the use of the support vector classifier in WEKA. The sample data
set used in this experiment is the vote dataset, available in ARFF format. This document assumes
that appropriate data preprocessing has been performed.

PROCEDURE:

1. We begin the experiment by loading the data (vote.arff) into WEKA.

2. Next, we select the “Classify” tab, click the “Choose” button, and select the support vector
machine classifier from the “functions” group.

3. Now we specify the various parameters. These can be set by clicking in the text box to
the right of the “Choose” button.

4. Under “Test options” in the main panel, we select 10-fold cross-validation as our
evaluation approach. Since we don’t have a separate evaluation data set, this is necessary to get a
reasonable idea of the accuracy of the generated model.

5. We now click “Start” to generate the model. The classifier model, as well as the evaluation
statistics, will appear in the right panel when model construction is complete.

6. Note that the classification accuracy of the model is about 96%. A low accuracy would indicate
that more work is needed, either in preprocessing or in selecting the parameters for the classification.

7. The run information of the support vector classifier will be displayed with the correctly and
incorrectly classified instances.
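
The 10-fold cross-validation chosen in step 4 cuts the data into ten test folds and trains on the remaining nine each time. The sketch below shows only this partitioning idea in plain Java. KFoldSketch and its methods are illustrative names, and the contiguous, unshuffled folds are a simplifying assumption (WEKA's Explorer also randomizes and stratifies the data first).

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {
    // Returns the [start, end) index range of test fold `fold` when n instances
    // are cut into k contiguous folds whose sizes differ by at most one.
    static int[] testRange(int n, int k, int fold) {
        int base = n / k;
        int extra = n % k; // the first `extra` folds get one more instance
        int start = fold * base + Math.min(fold, extra);
        int size = base + (fold < extra ? 1 : 0);
        return new int[] {start, start + size};
    }

    // Indices used for training in the given fold: everything outside the test range.
    static List<Integer> trainIndices(int n, int k, int fold) {
        int[] test = testRange(n, k, fold);
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i < test[0] || i >= test[1]) train.add(i);
        }
        return train;
    }

    public static void main(String[] args) {
        int n = 435; // number of instances in vote.arff
        for (int fold = 0; fold < 10; fold++) {
            int[] r = testRange(n, 10, fold);
            System.out.printf("fold %d: test [%d, %d), train size %d%n",
                    fold, r[0], r[1], trainIndices(n, 10, fold).size());
        }
    }
}
```

Every instance is tested exactly once, so the accuracy reported over all folds is based on all 435 instances.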
Dataset vote.arff
@relation vote
@attribute 'handicapped-infants' { 'n', 'y'}
@attribute 'water-project-cost-sharing' { 'n', 'y'}
@attribute 'adoption-of-the-budget-resolution' { 'n', 'y'}
@attribute 'physician-fee-freeze' { 'n', 'y'}
@attribute 'el-salvador-aid' { 'n', 'y'}
@attribute 'religious-groups-in-schools' { 'n', 'y'}
@attribute 'anti-satellite-test-ban' { 'n', 'y'}
@attribute 'aid-to-nicaraguan-contras' { 'n', 'y'}
@attribute 'mx-missile' { 'n', 'y'}
@attribute 'immigration' { 'n', 'y'}
@attribute 'synfuels-corporation-cutback' { 'n', 'y'}
@attribute 'education-spending' { 'n', 'y'}
@attribute 'superfund-right-to-sue' { 'n', 'y'}
@attribute 'crime' { 'n', 'y'}
@attribute 'duty-free-exports' { 'n', 'y'}

@attribute 'export-administration-act-south-africa' { 'n', 'y'}


@attribute 'Class' { 'democrat', 'republican'}
@data
'n','y','n','y','y','y','n','n','n','y',?,'y','y','y','n','y','republican'
'n','y','n','y','y','y','n','n','n','n','n','y','y','y','n',?,'republican'
?,'y','y',?,'y','y','n','n','n','n','y','n','y','y','n','n','democrat'
'n','y','y','n',?,'y','n','n','n','n','y','n','y','n','n','y','democrat'
'y','y','y','n','y','y','n','n','n','n','y',?,'y','y','y','y','democrat'
'n','y','y','n','y','y','n','n','n','n','n','n','y','y','y','y','democrat'
'n','y','n','y','y','y','n','n','n','n','n','n',?,'y','y','y','democrat'
'n','y','n','y','y','y','n','n','n','n','n','n','y','y',?,'y','republican'
'n','y','n','y','y','y','n','n','n','n','n','y','y','y','n','y','republican'
'y','y','y','n','n','n','y','y','y','n','n','n','n','n',?,?,'democrat'
'n','y','n','y','y','n','n','n','n','n',?,?,'y','y','n','n','republican'
Load the vote.arff dataset
Number of Instances: 435
Number of Attributes: 17
Number of Classes: 2
The following screenshot shows the classifier output generated when the support vector
classifier is applied to the given dataset.

OUTPUT:
Correctly Classified Instances 418 96.092 %
Incorrectly Classified Instances 17 3.908 %
Kappa statistic 0.9178
Mean absolute error 0.0391
Root mean squared error 0.1977
Relative absolute error 8.2405 %
Root relative squared error 40.6018 %
Coverage of cases (0.95 level) 96.092 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 435
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.963 0.042 0.973 0.963 0.968 0.96 democrat
0.958 0.037 0.942 0.958 0.95 0.96 republican
Weighted Avg. 0.961 0.04 0.961 0.961 0.961 0.96
=== Confusion Matrix ===
a b <-- classified as
257 10 | a = democrat
7 161 | b = republican
RESULT:
Ex.No.10 BANK APPLICATION
AIM
To analyze a banking application using the Naïve Bayes classification method.

PROCEDURE
Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. As can be seen in the sample data file, the first row
contains the attribute names (separated by commas) followed by each data row with attribute
values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the
data set can be saved into ARFF format.
Load the data set into WEKA and perform a series of operations using WEKA's
preprocessing filters. First, in the Preprocess tab, click "Open file" and navigate to the
directory containing the data file (.csv or .arff).
bank_data.csv:
The data contains the following fields :
id -- a unique identification number
age -- age of customer in years (numeric)
sex -- MALE / FEMALE
region -- INNER_CITY / RURAL / SUBURBAN / TOWN
income -- income of customer (numeric)
married -- is the customer married (YES/NO)
children -- number of children (numeric)
car -- does the customer own a car (YES/NO)
save_act -- does the customer have a savings account (YES/NO)
current_act -- does the customer have a current account (YES/NO)
mortgage -- does the customer have a mortgage (YES/NO)
Eligibility -- whether the customer is eligible for a loan (YES/NO)
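
To make the relationship between the two file formats concrete, here is a minimal sketch that turns CSV text with a header row into ARFF text. It is plain Java for illustration only: CsvToArffSketch is our own name, the sample rows in main are hypothetical, and the sketch assumes no quoted commas and no missing values. In practice WEKA performs this conversion itself when the file is loaded and saved.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CsvToArffSketch {
    static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Builds ARFF text from CSV text whose first row holds the attribute names.
    // A column is declared numeric if every value parses as a number,
    // otherwise nominal with the distinct values in order of first appearance.
    static String toArff(String relation, String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] names = lines[0].split(",");
        StringBuilder out = new StringBuilder("@relation " + relation + "\n");
        for (int col = 0; col < names.length; col++) {
            boolean numeric = true;
            Set<String> values = new LinkedHashSet<>();
            for (int row = 1; row < lines.length; row++) {
                String v = lines[row].split(",")[col].trim();
                values.add(v);
                numeric &= isNumeric(v);
            }
            out.append("@attribute ").append(names[col].trim()).append(' ')
               .append(numeric ? "numeric" : "{" + String.join(",", values) + "}")
               .append('\n');
        }
        out.append("@data\n");
        for (int row = 1; row < lines.length; row++) {
            out.append(lines[row].trim()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical rows in the layout described above (id column already removed).
        String csv = "age,sex,income,married\n48,FEMALE,17546.0,NO\n40,MALE,30085.1,YES\n";
        System.out.print(toArff("bank-data", csv));
    }
}
```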

Feature selection: Remove attribute : id

=== Run information ===

Evaluator: weka.attributeSelection.CfsSubsetEval
Search:weka.attributeSelection.LinearForwardSelection -D 0 -N 5 -I -K 50 -T 0
Relation: bank-data-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
Eligibility
Evaluation mode:evaluate on all training data
=== Attribute Selection on all input data ===

Search Method:
Linear Forward Selection.
Start set: no attributes
Forward selection method: forward selection
Stale search after 5 node expansions
Linear Forward Selection Type: fixed-set
Number of top-ranked attributes that are used: 11
Total number of subsets evaluated: 63
Merit of best subset found: 0.099

Attribute Subset Evaluator (supervised, Class (nominal): 11 Eligibility):


CFS Subset Evaluator
Including locally predictive attributes

Selected attributes: 4,5,6 : 3


income
married
children

Classification Output : Naïve Bayes


=== Run information ===

Scheme:weka.classifiers.bayes.NaiveBayes
Relation: bank-data-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
Eligibility
Test mode:evaluate on training data

=== Classifier model (full training set) ===


Naive Bayes Classifier

Class
Attribute YES NO

(0.46) (0.54)
=====================================
age
mean 45.1277 40.0982
std. dev. 14.3018 14.1018
weight sum 274 326
precision 1 1
sex
FEMALE 131.0 171.0
MALE 145.0 157.0
[total] 276.0 328.0

region
INNER_CITY 124.0 147.0
TOWN 72.0 103.0
RURAL 47.0 51.0
SUBURBAN 35.0 29.0
[total] 278.0 330.0

income
mean 30644.8069 24902.2958
std. dev. 13585.1095 11640.5073
weight sum 274 326
precision 97.1838 97.1838

married
NO 121.0 85.0
YES 155.0 243.0
[total] 276.0 328.0

children
mean 0.9453 1.0675

std. dev. 0.859 1.1937


weight sum 274 326
precision 1 1

car
NO 137.0 169.0
YES 139.0 159.0
[total] 276.0 328.0

save_act
NO 96.0 92.0
YES 180.0 236.0
[total] 276.0 328.0
current_act
NO 64.0 83.0
YES 212.0 245.0
[total] 276.0 328.0
mortgage
NO 183.0 210.0
YES 93.0 118.0
[total] 276.0 328.0
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 393 65.5 %
Incorrectly Classified Instances 207 34.5 %
Kappa statistic 0.2956
Mean absolute error 0.4154
Root mean squared error 0.4613
Relative absolute error 83.7093 %
Root relative squared error 92.6161 %
Total Number of Instances 600

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.54 0.248 0.646 0.54 0.588 0.725 YES
0.752 0.46 0.66 0.752 0.703 0.725 NO
Weighted Avg. 0.655 0.363 0.654 0.655 0.651 0.725
=== Confusion Matrix ===
a b <-- classified as
148 126 | a = YES
81 245 | b = NO
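
The model printed above can be read as follows: each class has a prior, each numeric attribute has a per-class mean and standard deviation, and each nominal attribute has per-class counts. The sketch below combines a few of these numbers by hand to estimate P(Eligibility = YES). It is an illustration under simplifying assumptions: only age, income and married are used (the WEKA model uses all ten attributes), the priors are the rounded values shown, and NaiveBayesSketch is our own class name.

```java
final class NaiveBayesSketch {
    // Normal density, used for numeric attributes.
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    // P(Eligibility = YES | age, income, married) under the naive independence
    // assumption, using the priors, means, deviations and counts from the model above.
    static double posteriorYes(double age, double income, boolean married) {
        double yes = 0.46
                * gaussian(age, 45.1277, 14.3018)
                * gaussian(income, 30644.8069, 13585.1095)
                * (married ? 155.0 : 121.0) / 276.0;
        double no = 0.54
                * gaussian(age, 40.0982, 14.1018)
                * gaussian(income, 24902.2958, 11640.5073)
                * (married ? 243.0 : 85.0) / 328.0;
        return yes / (yes + no); // normalize so the two class scores sum to 1
    }

    public static void main(String[] args) {
        // Hypothetical customers, chosen only to show the effect of each attribute.
        System.out.printf("age 40, income 10000, single:  %.3f%n", posteriorYes(40, 10000, false));
        System.out.printf("age 40, income 60000, single:  %.3f%n", posteriorYes(40, 60000, false));
        System.out.printf("age 40, income 60000, married: %.3f%n", posteriorYes(40, 60000, true));
    }
}
```

Because the YES class has the higher mean income and the larger share of unmarried customers, raising income or switching married to NO moves the estimate towards YES.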
RESULT

Ex.No.11 TEXT CLASSIFICATION
AIM
To classify text documents containing movie reviews.

Algorithm
1. Create the text documents and store them in a folder.
2. Open the text documents in WEKA using the TextDirectoryLoader option.
3. Assign the attribute values for converting the text documents into ARFF format.
4. Select information gain feature selection and the Naïve Bayes (NB) classifier.
5. Measure the accuracy after running the classifier with information gain on the ARFF data
converted from the text documents.

Procedure

To retrieve the data set:

http://www.searchforum.org.cn/tansongbo/corpus/
Click “Open file”.
Click on the dataset.
Choose TextDirectoryLoader.
In the directory field, click “Choose” and select the main directory of the dataset (textsen).

Choose Filter - unsupervised - attribute - StringToWordVector, click on the filter name, and fill in the fields:

IDFTransform = True
TFTransform = True
lowerCaseTokens = True
Stemmer = IteratedLovinsStemmer
useStoplist = True
Tokenizer = unigram tokenizer (WordTokenizer)
Click OK, then Apply.

Attribute selection: choose Filter - supervised - attribute - AttributeSelection, with information gain as the evaluator and Ranker as the search method.
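
The word-vector settings above turn each document into weighted word counts. The following plain Java sketch illustrates the idea for lowercased unigrams. It assumes that the TF transform is log(1 + count) and the IDF transform multiplies by log(N / number of documents containing the word); treat these formulas as an assumption about the filter rather than a statement of its exact behaviour. Stemming and the stoplist are left out, TfIdfSketch is our own class name, and the three documents in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfSketch {
    // Lowercases the text and counts unigram tokens, splitting on non-letters.
    static Map<String, Integer> counts(String doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : doc.toLowerCase().split("[^a-z]+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    // Weight of `word` in document d: log(1 + count) * log(N / document frequency).
    static double weight(String[] docs, int d, String word) {
        int df = 0;
        for (String doc : docs) {
            if (counts(doc).containsKey(word)) df++;
        }
        int f = counts(docs[d]).getOrDefault(word, 0);
        if (f == 0 || df == 0) return 0.0;
        return Math.log(1 + f) * Math.log((double) docs.length / df);
    }

    public static void main(String[] args) {
        // Hypothetical mini-corpus for illustration.
        String[] docs = {"A bad, bad movie", "A perfect movie", "Bad plot"};
        for (String w : new String[] {"bad", "movie", "perfect"}) {
            System.out.printf("%-8s weight in doc 0 = %.4f%n", w, weight(docs, 0, w));
        }
    }
}
```

A word that occurs in every document gets weight 0, which is why very common words contribute little to the classification.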

Preprocess - Filter - supervised - attribute - AttributeSelection - OK - OK - Apply

=== Run information ===

Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search:weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: C__Users_INTEL_Desktop_txt_sentoken-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-T-I-N0-L-S-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 2000
Attributes: 1172
[list of attributes omitted]
Evaluation mode:evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 1 @@class@@):


Information Gain Ranking Filter

Instances: 2000
Attributes: 10
bad
wast
worst
stupid
bor
perfect
ridicl
portr
outstand
@@class@@
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

Naive Bayes Classifier

Class
Attribute neg pos
(0.5) (0.5)
===============================
bad
mean 0.3387 0.1693

std. dev. 0.3164 0.2806


weight sum 1000 1000
precision 0.6343 0.6343

wast
mean 0.2992 0.0581
std. dev. 0.5875 0.2846
weight sum 1000 1000
precision 1.4525 1.4525

worst
mean 0.2868 0.0636
std. dev. 0.5846 0.2999
weight sum 1000 1000
precision 1.4784 1.4784

stupid
mean 0.2788 0.0594
std. dev. 0.5892 0.295
weight sum 1000 1000
precision 1.5237 1.5237

bor
mean 0.2988 0.1089
std. dev. 0.5375 0.3549
weight sum 1000 1000
precision 1.2659 1.2659

perfect
mean 0.1379 0.3071
std. dev. 0.3681 0.4999
weight sum 1000 1000
precision 1.1208 1.1208

ridicl
mean 0.2296 0.0629
std. dev. 0.5811 0.321
weight sum 1000 1000
precision 1.7006 1.7006

portr
mean 0.0896 0.2568
std. dev. 0.3546 0.5635
weight sum 1000 1000
precision 1.4932 1.4932

outstand
mean 0.0139 0.1504

std. dev. 0.3856 0.5704


weight sum 1000 1000
precision 2.3139 2.3139

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 1406 70.3 %


Incorrectly Classified Instances 594 29.7 %
Kappa statistic 0.406
Mean absolute error 0.3289
Root mean squared error 0.4468
Relative absolute error 65.7795 %
Root relative squared error 89.3569 %
Total Number of Instances 2000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.583 0.177 0.767 0.583 0.663 0.784 neg
0.823 0.417 0.664 0.823 0.735 0.784 pos
Weighted Avg. 0.703 0.297 0.715 0.703 0.699 0.784

=== Confusion Matrix ===

RESULT

Ex.No.12 DISCRETIZATION
Aim:
To perform the task of data discretization for the student dataset.
Procedure
Association rule mining can only be performed on categorical data. This requires
discretizing numeric or continuous attributes.
1. Let us divide the values of the age attribute into three bins (intervals).
2. First, load the dataset (student.arff) into WEKA.
3. Select the age attribute, open the filter dialog box, and select
“weka.filters.unsupervised.attribute.Discretize” from the list.
4. Enter the index of the attribute to be discretized. In this case the attribute is age, so we
must enter ‘1’, corresponding to the age attribute.
5. Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
6. Click the OK button.
7. Click Apply in the filter panel. This results in a new working relation with the selected
attribute partitioned into 3 bins. Save the new working relation in a file called
student-data-discretized.arff
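
The steps above can be sketched in code as equal-width binning: the range between the smallest and largest value is cut into intervals of the same width, and each value is replaced by the index of its interval. This is a plain Java illustration (EqualWidthSketch is our own name and the ages in main are hypothetical). The handling of values that fall exactly on a boundary may differ from WEKA's filter.

```java
final class EqualWidthSketch {
    // Returns the bin index (0 .. bins-1) of each value, using intervals of
    // equal width between the minimum and maximum of the data.
    static int[] discretize(double[] values, int bins) {
        double min = values[0], max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = width == 0 ? 0 : (int) ((values[i] - min) / width);
            out[i] = Math.min(b, bins - 1); // the maximum value goes into the last bin
        }
        return out;
    }

    public static void main(String[] args) {
        double[] ages = {20, 25, 35, 40, 50}; // hypothetical numeric ages
        int[] bin = discretize(ages, 3);
        for (int i = 0; i < ages.length; i++) {
            System.out.printf("age %.0f -> bin %d%n", ages[i], bin[i]);
        }
    }
}
```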

Dataset student.arff


@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes

>40, low, yes, excellent, no


30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes


30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

Output:

Result

Ex.No.13 BAYESIAN CLASSIFICATION USING JAVA API

Aim: To calculate classification accuracy with a Bayesian classifier using the WEKA Java API

Algorithm:
1. Load the dataset vote.arff in WEKA.
2. Set the class label of the dataset.
3. Set the number of cross-validation folds to 10.
4. Evaluate the dataset with the Naïve Bayes classifier.
5. Store the classifications and prediction values.
6. Display the accuracy of the NB classifier on the dataset.

Using JAVA API:


// Add weka.jar to the classpath before compiling and running.
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.NominalPrediction;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.FastVector;
import weka.core.Instances;

public class Imple {
    // Train the model on the training set and evaluate it on the testing set.
    public static Evaluation classify(Classifier model,
            Instances trainingSet, Instances testingSet) throws Exception {
        Evaluation evaluation = new Evaluation(trainingSet);
        model.buildClassifier(trainingSet);
        evaluation.evaluateModel(model, testingSet);
        return evaluation;
    }

    // Percentage of predictions whose predicted class equals the actual class.
    public static double calculateAccuracy(FastVector predictions) {
        double correct = 0;
        for (int i = 0; i < predictions.size(); i++) {
            NominalPrediction np = (NominalPrediction) predictions.elementAt(i);
            if (np.predicted() == np.actual()) {
                correct++;
            }
        }
        return 100 * correct / predictions.size();
    }

    // Build the folds: split[0][i] is the training set and split[1][i] the testing set of fold i.
    public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
        Instances[][] split = new Instances[2][numberOfFolds];
        for (int i = 0; i < numberOfFolds; i++) {
            split[0][i] = data.trainCV(numberOfFolds, i);
            split[1][i] = data.testCV(numberOfFolds, i);
        }
        return split;
    }

    public static void main(String[] args) throws Exception {
        // The algorithm above uses vote.arff; adjust the path to your own copy.
        BufferedReader datafile = new BufferedReader(new FileReader("D:\\data\\vote.arff"));
        Instances data = new Instances(datafile);
        datafile.close();
        // The class label is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);
        // Do 10-split cross validation
        Instances[][] split = crossValidationSplit(data, 10);
        // Separate split into training and testing arrays
        Instances[] trainingSplits = split[0];
        Instances[] testingSplits = split[1];
        // Use a set of classifiers; uncomment the others to compare them.
        Classifier[] models = {
            new NaiveBayes()
            // , new J48()            // a decision tree
            // , new PART()
            // , new DecisionTable()  // decision table majority classifier
            // , new DecisionStump()  // one-level decision tree
        };
        // Run for each model
        for (Classifier model : models) {
            // Collect every group of predictions for the current model in a FastVector
            FastVector predictions = new FastVector();
            // For each training-testing split pair, train and test the classifier
            for (int i = 0; i < trainingSplits.length; i++) {
                Evaluation validation = classify(model, trainingSplits[i], testingSplits[i]);
                predictions.appendElements(validation.predictions());
                // Uncomment to see the model built for each training-testing pair.
                // System.out.println(model.toString());
            }
            // Calculate overall accuracy of the current classifier on all splits
            double accuracy = calculateAccuracy(predictions);
            // Print the current classifier's name and accuracy.
            System.out.println("Accuracy of " + model.getClass().getSimpleName() + ": "
                    + String.format("%.2f%%", accuracy) + "\n---------------------------------");
        }
    }
}
KNOWLEDGE FLOW

Output:
=== Evaluation result ===

Scheme: NaiveBayes
Relation: vote

Correctly Classified Instances 392 90.1149 %


Incorrectly Classified Instances 43 9.8851 %
Kappa statistic 0.7949
Mean absolute error 0.0995
Root mean squared error 0.2977
Relative absolute error 20.9815 %
Root relative squared error 61.1406 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.891 0.083 0.944 0.891 0.917 0.797 0.973 0.984 democrat
0.917 0.109 0.842 0.917 0.877 0.797 0.973 0.957 republican
Weighted Avg. 0.901 0.093 0.905 0.901 0.902 0.797 0.973 0.973

=== Confusion Matrix ===

a b <-- classified as
238 29 | a = democrat
14 154 | b = republican

RESULT

Viva Questions and Answers

1. What is data warehouse?


A data warehouse is an electronic store of an organization's historical data for the purpose
of reporting, analysis, and data mining or knowledge discovery.
2. What are the benefits of a data warehouse?
A data warehouse helps to integrate data and store it historically so that we can analyze
different aspects of the business, including performance analysis, trends, prediction, etc.
3. What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the reporting
and analysis system on that data.
4. What is a data mart?
Data marts are generally designed for a single subject area.
5. What is a dimension?
A dimension is something that qualifies a quantity (measure).
6. What is a fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always)
numerical values that can be aggregated.
7. Briefly state the difference between a data warehouse and a data mart.
A data warehouse is made up of many data marts and contains many subject areas, but a
data mart generally focuses on one subject area.
8. What is the difference between a dependent data mart and an independent data mart?
A dependent data mart takes its source data from a central data warehouse, whereas an
independent data mart takes it directly from operational systems or external files. There is a
third type of data mart called hybrid, which takes source data from operational systems or
external files as well as from the central data warehouse.
9. What are the storage models of OLAP?
ROLAP, MOLAP and HOLAP
10. What are cubes?
A data cube stores data in a summarized version, which helps in faster analysis of the data.
The data is stored in such a way that it allows easy reporting.
11. What is a model in the data mining world?
Models in data mining help the different algorithms in decision making or pattern matching.
The second stage of data mining involves considering various models and choosing
the best one based on their predictive performance.
12. Explain how to mine an OLAP cube.
A data mining extension can be used to slice the data in the source cube in the order
discovered by data mining. When a cube is mined, the case table is a dimension.
13. Explain how to use DMX, the data mining query language.
DMX (Data Mining Extensions) is based on the syntax of SQL. It is based on relational
concepts and is mainly used to create and manage data mining models.
14. Define rollup and cube.
Custom rollup operators provide a simple way of controlling the process of rolling up a
member to its parent's values. The rollup uses the contents of the column as the custom rollup
operator for each member and is used to evaluate the value of the member's parents.
15. Differentiate between data mining and data warehousing.
Data warehousing is merely extracting data from different sources, cleaning the data, and
storing it in the warehouse, whereas data mining aims to examine or explore the data using
queries. These queries can be run against the data warehouse.
16. What is discrete and continuous data in the data mining world?
Discrete data can be considered as defined or finite data, e.g. mobile numbers, gender.
Continuous data can be considered as data that changes continuously and in an ordered
fashion, e.g. age.
17. What is a decision tree algorithm?
A decision tree is a tree in which every node is either a leaf node or a decision node. The
tree takes an object as input and outputs a decision. All paths from the root node to a
leaf node are reached using AND, OR, or both. The tree is constructed using
the regularities of the data.
18. What is the Naïve Bayes algorithm?
The Naïve Bayes algorithm is used to generate mining models. These models help to identify
relationships between the input columns and the predictable columns.
19. Explain the clustering algorithm.

Clustering algorithm is used to group sets of data with similar characteristics also called
as clusters. These clusters help in making faster decisions, and exploring data.
20. Explain Association algorithm in Data mining?
Association algorithm is used for recommendation engine that is based on a market based
analysis. This engine suggests products to customers based on what they bought earlier.
The model is built on a dataset containing identifiers.
21. What are the goals of data mining?
Prediction, identification, classification and optimization
22. Is data mining independent subject?
No, it is interdisciplinary subject. includes, database technology, visualization, machine
learning, pattern recognition, algorithm etc.
23. What are different types of database?
Relational database, data warehouse and transactional database.
24. What are data mining functionality?
Mining frequent pattern, association rules, classification and prediction, clustering, evolu-
tion analysis and outlier Analise
25. What are issues in data mining?
Issues in mining methodology, performance issues, user interactive issues, different
source of data types issues etc.
26. List some applications of data mining.
Agriculture, biological data analysis, call record analysis, DSS, business intelligence
systems, etc.
27. What do you mean by interesting pattern?
A pattern is said to be interesting if it is 1. easily understood by humans 2. valid
3. potentially useful 4. novel
28. Why do we pre-process the data?
To ensure data quality (accuracy, completeness, consistency, timeliness, believability,
interpretability).
29. What are the steps involved in data pre-processing?
Data cleaning, data integration, data reduction, data transformation.
30. What is a distributed data warehouse?
A distributed data warehouse shares data across multiple data repositories for the purpose
of OLAP operations.
31. Define virtual data warehouse.
A virtual data warehouse provides a compact view of the data inventory. It contains
metadata and uses middleware to establish connections between different data sources.
32. What are the different data warehouse models?
Enterprise data warehouse, data marts and virtual data warehouse.
33. List a few roles of the data warehouse manager.
Creation of data marts, handling users, concurrency control, updates, etc.
34. What are different types of cuboids?
The 0-D cuboid is called the apex cuboid.
The n-D cuboid is called the base cuboid.
The cuboids in between are the middle cuboids.
35. What are the forms of multidimensional model?
Star schema
Snowflake schema
Fact constellation Schema
36. What is a frequent pattern?
A set of items that frequently appear together in a transaction data set, e.g. milk,
bread, sugar.
37. What are the issues regarding classification and prediction?
Preparing the data for classification and prediction, and comparing classification and
prediction methods.
38. Define model overfitting.
A model that fits the training data very well can still have a high generalization error
on unseen data. This situation is called model overfitting.
39. What are the methods to remove model overfitting?
Pruning (pre-pruning and post-pruning)
Constraining the size of the decision tree
Adjusting the stopping criteria so that tree growth halts earlier
40. What is regression?
Regression can be used to model the relationship between one or more independent
variables and a dependent variable. Types: linear regression and non-linear regression.
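A minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The function name fit_line and the sample data are hypothetical; the data were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]   # independent variable
ys = [3, 5, 7, 9]   # dependent variable, generated as y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```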
41. Compare the k-means and k-medoids algorithms.
K-medoids is more robust than k-means in the presence of noise and outliers, but
k-medoids can be computationally costly.
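The robustness claim can be seen on a single cluster. In this sketch (hypothetical data, one dimension) the mean is pulled towards the outlier, while the medoid, the actual data point with the smallest total distance to all others, stays inside the bulk of the data. Finding the medoid compares every candidate with every point, which hints at the extra cost.

```python
points = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier

# k-means style representative: the arithmetic mean of the cluster.
mean = sum(points) / len(points)


def total_distance(candidate):
    """Sum of distances from one candidate point to every point in the cluster."""
    return sum(abs(candidate - p) for p in points)


# k-medoids style representative: the member with the smallest total distance.
medoid = min(points, key=total_distance)

print(mean, medoid)
```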
42. What is K-nearest neighbor algorithm?
It is one of the lazy learner algorithms used in classification. It finds the k nearest
neighbors of the point of interest and assigns the point the majority class among them.
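A minimal sketch of k-nearest-neighbor classification with Euclidean distance and majority voting. The training points, the labels "A" and "B" and the function name knn_predict are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """training is a list of (point, label) pairs; returns the majority label."""
    # Lazy learning: no model is built, the training data is searched at query time.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"),
]

print(knn_predict(training, (2, 2)))
print(knn_predict(training, (8, 9)))
```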

43. What is Bayes' Theorem?
P(H|X) = P(X|H) * P(H) / P(X)
where H is a hypothesis and X is the observed data (evidence).
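A short worked example of the theorem. All probabilities here are hypothetical numbers chosen for illustration, and P(X) is obtained from the law of total probability.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x


p_h = 0.3              # P(H): hypothetical prior, e.g. "customer buys a computer"
p_x_given_h = 0.5      # P(X|H): evidence X observed among buyers
p_x_given_not_h = 0.2  # P(X|not H): evidence X observed among non-buyers

# Law of total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 3))
```

Here P(X) = 0.15 + 0.14 = 0.29, so P(H|X) = 0.15 / 0.29, which is a little over one half.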
44. What is concept Hierarchy?
It defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts.
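As an illustration, a location hierarchy city -> state -> country can be written as two mappings. The place names and the function name generalize are hypothetical examples, not part of any dataset used in this manual.

```python
# Hypothetical location hierarchy: city -> state -> country.
CITY_TO_STATE = {
    "Chennai": "Tamil Nadu",
    "Madurai": "Tamil Nadu",
    "Pune": "Maharashtra",
}
STATE_TO_COUNTRY = {"Tamil Nadu": "India", "Maharashtra": "India"}


def generalize(city, level):
    """Map a low-level concept (a city) to the requested higher level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")


print(generalize("Madurai", "state"), generalize("Madurai", "country"))
```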
45. What are the causes of model over fitting?
Due to presence of noise
Due to lack of representative samples
Due to multiple comparison procedure
46. What is a decision tree classifier?
A decision tree is a hierarchical classifier that compares the data against a sequence of
properly selected features.
47. If there are n dimensions, how many cuboids are there?
There would be 2^n cuboids.
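The count follows because each cuboid corresponds to one subset of the dimensions (a group-by on that subset). The sketch below enumerates the subsets for three hypothetical dimensions; it assumes each dimension has a single level, i.e. no concept hierarchies.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every subset of the dimensions, from the apex to the base cuboid."""
    result = []
    for size in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, size))
    return result


all_cuboids = cuboids(["time", "item", "location"])
print(len(all_cuboids))
for c in all_cuboids:
    print(c)
```

The empty tuple is the apex (0-D) cuboid and the full tuple is the base (n-D) cuboid.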
48. What is spatial data mining?
Spatial data mining is the process of discovering interesting, useful, non-trivial patterns
from large spatial datasets. Spatial Data Mining = Mining Spatial Data Sets (i.e. Data
Mining + Geographic Information Systems)
49. What is multimedia data mining?
Multimedia data mining is a subfield of data mining that deals with the extraction of
implicit knowledge, multimedia data relationships, or other patterns not explicitly stored
in multimedia databases.
50. What are the different types of multimedia data?
Image, video and audio.
51. What is text mining?
Text mining is the procedure of synthesizing information by analyzing relations, patterns,
and rules among textual data. These procedures include text summarization, text
categorization, and text clustering.
52. List some applications of text mining.
Customer profile analysis
Patent analysis
Information dissemination
Company resource planning

53. What do you mean by web content mining?
Web content mining refers to the discovery of useful information from Web contents,
including text, images, audio, video, etc.
54. Define web structure mining and web usage mining.
Web structure mining studies the model underlying the link structures of the Web. It
has been used for search engine result ranking and other Web applications. Web usage
mining focuses on using data mining techniques to analyze search logs to find interesting
patterns. One of the main applications of Web usage mining is its use to learn user
profiles.
55. What is data warehouse?
A data warehouse is an electronic store of an organization's historical data for the
purpose of reporting, analysis and data mining or knowledge discovery.
56. What are frequent patterns?
These are the patterns that appear frequently in a data set, e.g. itemsets, subsequences, etc.
57. What is data characterization?
Data characterization is a summarization of the general features of a target class of data.
For example, summarizing the characteristics of software products whose sales increased
by 10%.
58. What is data discrimination?
Data discrimination is the comparison of the general features of the target class objects
against those of objects from one or more contrasting classes.
59. What can business analysts gain from having a data warehouse?
Having a data warehouse may provide a competitive advantage by presenting relevant
information from which to measure performance and make critical adjustments in order
to help win over competitors.
60. Why is association rule necessary?
In data mining, association rule learning is a popular and well researched method for
discovering interesting relations between variables in large databases.
61. What are two types of data mining tasks?
Descriptive task and Predictive task
62. Define classification.
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.

63. What are outliers?
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are called outliers.
64. What do you mean by evolution analysis?
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
65. Define KDD.
KDD (Knowledge Discovery in Databases) is the process of finding useful information and
patterns in data.
66. What are the components of data mining?
Database, Data Warehouse, World Wide Web, or other information repository
67. Define metadata.
A database that describes various aspects of data in the warehouse is called metadata.
68. What are the uses of metadata?
Metadata is used to map source system data to data warehouse tables and to generate data
extract, transform, and load procedures for import jobs.
69. List the demerits of distributed data warehouse.
There is no metadata, no summary data, and no individual DSS (Decision Support System)
integration or history. All queries must be repeated, causing additional burden on the
system.

70. Define HOLAP.
The hybrid OLAP approach combines ROLAP and MOLAP technology.
71. What are data mining techniques?
Association rules, classification and prediction, clustering, deviation detection,
similarity search and sequence mining.
72. List different data mining tools.
Traditional data mining tools, dashboards and text mining tools.
73. Define subsequence.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
is a (frequent) sequential pattern if it occurs frequently in a shopping history database.
74. What is the main goal of data mining?
Prediction
75. List the typical OLAP operations.
Roll-up, drill-down, rotate (pivot), slice and dice, drill-through and drill-across.
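Two of these operations can be sketched on a tiny in-memory fact list. The rows, the column order (year, quarter, city, sales) and the function names are hypothetical; a real OLAP server would run these operations over a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, city, sales)
facts = [
    (2017, "Q1", "Chennai", 100),
    (2017, "Q2", "Chennai", 150),
    (2017, "Q1", "Madurai", 80),
    (2018, "Q1", "Chennai", 120),
]


def roll_up_to_year(rows):
    """Roll-up: climb the time hierarchy from quarter to year by summing sales."""
    totals = defaultdict(int)
    for year, _quarter, city, sales in rows:
        totals[(year, city)] += sales
    return dict(totals)


def slice_by_city(rows, city):
    """Slice: fix one dimension (city) to a single value."""
    return [row for row in rows if row[2] == city]


print(roll_up_to_year(facts))
print(slice_by_city(facts, "Madurai"))
```

Drill-down is the reverse of the roll-up shown here: going back from the yearly totals to the quarterly rows.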

76. If there are 3 dimensions, how many cuboids are there in cube?
2^3 = 8 cuboids
77. Differentiate between star schema and snowflake schema.
A star schema is a multidimensional model in which each of its disjoint dimensions is
represented by a single table. A snowflake schema is a normalized multidimensional schema
in which each disjoint dimension is represented by multiple tables.
78. List the advantages of star schema.
A star schema is very easy to understand, even for non-technical business managers, and
it provides better performance and shorter query times.
79. What are the characteristics of data warehouse?
Integrated, non-volatile, subject-oriented and time-variant.
80. Define support and confidence.
The support of a rule A => B is the fraction of all transactions that contain both A and
B. The confidence of the rule is the fraction of the transactions containing A that also
contain B.
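A minimal sketch that computes both measures for a rule over a handful of transactions. The transactions and item names are hypothetical; support is taken as the fraction of transactions containing every item of the rule, and confidence as the fraction of antecedent transactions that also contain the consequent.

```python
transactions = [
    {"milk", "bread", "sugar"},
    {"milk", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction with the consequent."""
    return count(antecedent | consequent) / count(antecedent)


# Rule: milk => bread
print(support({"milk", "bread"}))       # 3 of 5 transactions
print(confidence({"milk"}, {"bread"}))  # 3 of the 4 milk transactions
```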
81. What are the criteria on the basis of which classification and prediction can be compared?
Speed, accuracy, robustness, scalability, goodness of rules, interpretability.
82. What is Data purging?
The process of cleaning out junk data is termed data purging. Purging means getting rid of
junk such as rows with unnecessary NULL column values. It is usually done when the
database grows too large.
83. What is Business Intelligence?
Business Intelligence, also known as DSS (Decision Support System), refers to the
technologies, applications and practices for the collection, integration and analysis of
business-related information or data. It also helps users to see the data behind the
information itself.
84. What is Fact Table?
Fact table contains the measurement of business processes, and it contains foreign keys
for the dimension tables.
85. What are the stages of data warehousing?
There are four stages of data warehousing:
• Offline operational database
• Offline data warehouse
• Real-time data warehouse
• Integrated data warehouse
86. What is OLTP?
OLTP stands for On-Line Transaction Processing. It is a class of applications that modify
data as soon as it is received and that serve a large number of simultaneous users.
87. What is OLAP?
OLAP stands for Online Analytical Processing. It is a system that collects, manages and
processes multi-dimensional data for analysis and management purposes.
88. What is the difference between View and Materialized View?
A view is a virtual table defined by a query; it stores no data of its own, is evaluated
each time it is used, and can be used in place of a table. A materialized view gives
indirect access to the table data by physically storing the results of a query, so it has
to be refreshed when the underlying tables change.
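The difference can be sketched with Python's built-in sqlite3 module. SQLite has no CREATE MATERIALIZED VIEW statement, so the stored result is emulated here with CREATE TABLE ... AS SELECT; the table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("north", 10), ("south", 20)]
)

# Ordinary view: only the query is stored, it is re-run on every access.
conn.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales")

# Emulated materialized view: the query result itself is stored as a table.
conn.execute("CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM sales")

# New data arrives after both objects were created.
conn.execute("INSERT INTO sales VALUES ('east', 5)")

view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
stored_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]
print(view_total, stored_total)  # the view sees the new row, the stored copy does not
```

Until the stored copy is refreshed it keeps answering with the old total, which is the price paid for not re-running the query.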
89. What is ETL?
ETL is abbreviated as Extract, Transform and Load.
90. What is VLDB?
VLDB stands for Very Large Database, a database whose size is said to be more than one
terabyte. Such databases are decision support systems that serve a large number of users.
91. What is real-time data warehousing?
Real-time data warehousing captures business data as it occurs. As soon as a business
activity is completed, its data enters the flow and becomes available for use instantly.
92. What are Aggregate tables?
Aggregate tables are tables that contain existing warehouse data which has been grouped
to a certain level of the dimensions.
93. What is a factless fact table?
A factless fact table is a fact table that does not contain any numeric fact column.
94. How can we load the time dimension?
The time dimension is usually loaded by a program that generates all possible dates in a
year. At one row per day, 100 years can be represented with roughly 36,500 rows.
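A minimal sketch of such a loading program, generating one row per day. The column names (date_key, year, quarter, month, iso_weekday) and the function name time_dimension are assumptions; a real warehouse would insert these rows into its time dimension table.

```python
from datetime import date, timedelta


def time_dimension(start, end):
    """Generate one dimension row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # surrogate key such as 20170101
            "date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "iso_weekday": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows


rows = time_dimension(date(2017, 1, 1), date(2017, 12, 31))
print(len(rows), rows[0]["date_key"], rows[-1]["date_key"])
```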
95. What are Non-additive facts?
Non-additive facts are facts that cannot be summed up for any of the dimensions present
in the fact table, for example ratios and percentages. They can still be useful when the
dimensions change, provided they are recomputed rather than summed.
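A ratio such as profit margin is a typical case. In this sketch (hypothetical stores and amounts) summing the per-store margins gives a meaningless number, whereas the overall margin has to be recomputed from the additive facts, revenue and profit.

```python
rows = [
    {"store": "A", "revenue": 100, "profit": 10},  # margin 10%
    {"store": "B", "revenue": 300, "profit": 90},  # margin 30%
]

# Wrong: adding the non-additive fact (margin) across the store dimension.
summed_margins = sum(r["profit"] / r["revenue"] for r in rows)

# Right: aggregate the additive facts first, then derive the ratio.
overall_margin = sum(r["profit"] for r in rows) / sum(r["revenue"] for r in rows)

print(summed_margins, overall_margin)
```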

96. What is a conformed fact?
A conformed fact is a fact that has the same definition wherever it appears, so it can be
used across multiple data marts in combination with multiple fact tables.
97. What is Datamart?
A data mart is a specialized version of a data warehouse. It contains a snapshot of
operational data that helps business people, and it emphasizes easy access to relevant
information.
98. What is Active Datawarehousing?
An active datawarehouse is a datawarehouse that enables decision makers within a company or
organization to manage customer relationships effectively and efficiently.
99. What is the difference between a data warehouse and OLAP?
A data warehouse is the place where all the data is stored for analysis, whereas OLAP is
used for analyzing the data, managing aggregations, and partitioning information into
finer levels of detail.
100. What are the key columns in Fact and dimension tables?
Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables
are the primary keys of the dimension tables.
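The key relationship between a fact table and its dimension tables can be sketched as a very small star schema using Python's built-in sqlite3 module. The table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);

-- The fact table holds the measure (amount) plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     INTEGER
);

INSERT INTO dim_product VALUES (1, 'milk'), (2, 'bread');
INSERT INTO dim_date    VALUES (20170101, 2017);
INSERT INTO fact_sales  VALUES (1, 20170101, 40), (2, 20170101, 25), (1, 20170101, 10);
""")

# Join the fact table to a dimension through the foreign key and aggregate the measure.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```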
