DWDM Lab Manual 2022-2023
Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of visualization
tools and algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to these functions. The original non-Java version of Weka was a Tcl/Tk
front-end to (mostly third-party) modeling algorithms implemented in other programming
languages, plus data preprocessing utilities in C, and a Makefile-based system for running machine
learning experiments. This original version was primarily designed as a tool for analyzing data
from agricultural domains, but the more recent fully Java-based version (Weka 3), for which
development started in 1997, is now used in many different application areas, in particular for
educational purposes and research. Advantages of Weka include:
Description:
Open the program. Once the program has been installed on the user's machine, it is opened by
navigating to the program's start option, which depends on the user's operating system. Figure
1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen:
Fig: 1.1 Weka GUI
1. Explorer - the graphical interface used to conduct experimentation on raw data. After clicking
the Explorer button, the Weka Explorer interface appears.
3. Cluster - used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences
within the data set and produce information for the user to analyze.
4. Association - used to apply different rules to the data file that identify associations within the
data. The Associate tab opens a window to select the options for associations within the dataset.
5. Select attributes - used to apply different rules to reveal changes based on the inclusion or
exclusion of selected attributes from the experiment.
6. Visualize - used to see what the various manipulations produced on the data set in a 2D format,
in scatter plot and bar graph output.
2. Experimenter - this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the other schemes.
Click the “Open file…” button to open a data set and double click on the “data” directory.
Weka provides a number of small common machine learning datasets that you can use to practice on.
Select the “iris.arff” file to load the Iris dataset.
References:
[1] Witten, I.H. and Frank, E. (2005) Data Mining: Practical machine learning tools
and techniques. 2nd edition, Morgan Kaufmann, San Francisco.
[2] Ross Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
San Mateo, CA.
[3] CVS–http://weka.sourceforge.net/wiki/index.php/CVS
[4] Weka Doc–http://weka.sourceforge.net/wekadoc/
Exercise:
1. Normalize the data using min-max normalization
Record Notes
Experiment 2: Creating new ARFF file
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances
sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the
Department of Computer Science of The University of Waikato for use with the Weka machine
learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance,
and each instance consists of a number of attributes. For loading datasets, WEKA can load ARFF files.
Attribute Relation File Format has two sections:
1. The Header section defines relation (dataset) name, attribute name, and type.
2. The Data section lists the data instances.
The figure above, taken from the textbook, shows an ARFF file for the weather data. Lines
beginning with a % sign are comments. There are three basic keywords: @relation, @attribute and @data.
The external representation of an Instances class consists of:
A header: describes the attribute types
Data section: comma-separated list of data
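As a rough illustration of the header and data sections described above, the following Python sketch assembles a small ARFF file as text. The helper name build_arff, the relation name, the attribute names and the two data rows are illustrative assumptions; this is not the textbook's full weather file.

```python
def build_arff(relation, attributes, rows):
    """Return ARFF text: a header (@relation, @attribute) followed by @data."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # A nominal attribute lists its allowed values in braces.
        if isinstance(kind, (list, tuple)):
            kind = "{" + ", ".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # Each instance is one comma-separated line.
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
rows = [("sunny", 85, "no"), ("overcast", 83, "yes")]
arff_text = build_arff("weather", attributes, rows)
print(arff_text)
```

Saving the printed text with an .arff extension should give a file that can be opened from the Explorer's "Open file..." button.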
References:
https://www.cs.auckland.ac.nz/courses/compsci367s1c/tutorials/IntroductionToWeka.pdf
Exercise:
To search through all possible combinations of attributes in the data and find which subset of
attributes works best for prediction, make sure that you set the attribute evaluator to 'CfsSubsetEval'
and the search method to 'BestFirst'. The evaluator determines what method is used to assign
a worth to each subset of attributes. The search method determines what style of search is
performed. The options that you can set in 'Attribute Selection Mode' (fig no: 3.2) are:
1. Use full training set. The worth of the attribute subset is determined using the full set of
training data.
Specify which attribute to treat as the class in the drop-down box below the test options. Once all
the test options are set, you can start the attribute selection process by clicking on the 'Start' button.
2. Visualizing Results
Process: Replacing Missing Attribute Values by the Attribute Mean. This method is used for
data sets with numerical attributes. An example of such a data set is presented in fig no: 3.4
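A minimal sketch of the idea, assuming missing values are represented as None and using made-up temperature readings (both are assumptions for illustration, not the data set of fig no: 3.4):

```python
def replace_missing_with_mean(column):
    """Replace None entries in a numeric column with the mean of the known values."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to average; leave the column unchanged
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


temperature = [85.0, None, 83.0, 70.0, None, 64.0]  # hypothetical values
filled = replace_missing_with_mean(temperature)
print(filled)
```

In Weka itself the same effect is usually obtained with the ReplaceMissingValues filter in the Preprocess tab, which substitutes means for numeric attributes and modes for nominal ones.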
An OLAP cube is a term that typically refers to a multi-dimensional array of data. OLAP is an
acronym for online analytical processing,[1] which is a computer-based technique of analyzing
data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also
sometimes called a hypercube if the number of dimensions is greater than 3.
Operations:
1. Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of
its dimensions, creating a new cube with one fewer dimension.[4] The picture shows a slicing
operation: the sales figures of all sales regions and all product categories of the company in the
years 2005 and 2006 are "sliced" out of the data cube.
2. Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of
multiple dimensions.[5] The picture shows a dicing operation: the new cube shows the sales
figures of a limited number of product categories, while the time and region dimensions cover the
same range as before.
3. Drill Down/Up allows the user to navigate among levels of data ranging from the most
summarized (up) to the most detailed (down).[4] The picture shows a drill-down operation: the
analyst moves from the summary category "Outdoor-Schutzausrüstung" (outdoor protective
equipment) to see the sales figures for the individual products.
4. Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule
might be computing totals along a hierarchy or applying a set of formulas such as "profit = sales
- expenses".
5. Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities
could be arranged vertically and products horizontally while viewing data for a particular
quarter. Pivoting could replace products with time periods to see data across time for a single
product.
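The operations above can be sketched on a tiny in-memory cube. The sales figures, dimension names and helper functions below are hypothetical and only meant to show what each operation does to the cells; drill-down is simply the reverse of roll-up, returning from the totals to the detailed cells.

```python
from collections import defaultdict

# Hypothetical cube: (year, region, product) -> sales
cube = {
    (2005, "North", "Tents"): 10, (2005, "North", "Boots"): 4,
    (2005, "South", "Tents"): 7, (2005, "South", "Boots"): 3,
    (2006, "North", "Tents"): 12, (2006, "North", "Boots"): 5,
    (2006, "South", "Tents"): 8, (2006, "South", "Boots"): 6,
}
DIMS = ("year", "region", "product")


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and drop it from the keys."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, allowed):
    """Keep cells whose coordinates fall in the allowed values per dimension."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cube, dim):
    """Summarize (here: total) over one dimension, removing it."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for k, v in cube.items():
        out[k[:i] + k[i + 1:]] += v
    return dict(out)


def pivot(table2d):
    """Swap the two axes of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table2d.items()}


sales_2005 = slice_cube(cube, "year", 2005)           # region x product
tents_only = dice_cube(cube, {"product": {"Tents"}})  # still three dimensions
by_year_region = roll_up(cube, "product")             # totals per year and region
product_by_region = pivot(sales_2005)                 # product x region
```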
Description:
The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of candidates are tested against the data.
Problem:
TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5
Find the frequent item sets for the above transactions with a minimum support of 2 and a minimum
confidence of 70% (i.e., 0.7).
Procedure:
Step 1:
Count the number of transactions in which each item occurs
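ITEM NO. OF TRANSACTIONS
1 2
2 3
3 3
4 1
5 3
Step 2:
Eliminate the items that occur in fewer transactions than the minimum support of 2 (here, item 4).
ITEM NO. OF TRANSACTIONS
1 2
2 3
3 3
5 3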
These are the single items that are bought frequently. Now let's say we want to find pairs of items
that are bought frequently. We continue from the above table (the table in Step 2).
Step 3:
Form all possible pairs of the frequent items.
ITEM PAIRS
1,2
1,3
1,5
2,3
2,5
3,5
Step 4:
Now, we count how many times each pair is bought together.
ITEM PAIRS NO. OF TRANSACTIONS
1,2 1
1,3 2
1,5 1
2,3 2
2,5 3
3,5 2
Step 5:
Eliminate the pairs that occur in fewer than 2 transactions, i.e. (1,2) and (1,5).
ITEM PAIRS NO. OF TRANSACTIONS
1,3 2
2,3 2
2,5 3
3,5 2
These pairs of items are bought together frequently. Now, let's say we want to find a set of three
items that are bought together. We use the above table (of Step 5) and make sets of three items.
Step 6:
To make the sets of three items we need one more rule (termed the self-join). It simply means that,
from the item pairs in the above table, we find two pairs with the same first item. So we get (2,3)
and (2,5), which gives (2,3,5). Then we find how many times (2, 3, 5) are bought together in the
original table and we get the following:
ITEM SET NO. OF TRANSACTIONS
(2,3,5) 2
Thus, the set of three items that are bought together frequently in this data is (2, 3, 5).
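The level-by-level procedure above can be sketched in Python as follows. This is a minimal illustration, not Weka's implementation: the join step here merges any two frequent k-item sets whose union has k+1 items and then prunes candidates that contain an infrequent subset, which is a slightly more general formulation than the "same first item" self-join described above. The function names are assumptions.

```python
from itertools import combinations

transactions = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}
MIN_SUPPORT = 2  # absolute number of transactions


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def apriori():
    """Return {frozenset: support} for all frequent itemsets."""
    items = sorted({i for t in transactions.values() for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {c: support(c) for c in current if support(c) >= MIN_SUPPORT}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates,
        # then prune any candidate that has an infrequent k-item subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori()
for itemset, count in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```

On the four transactions of the problem this reproduces the tables of Steps 2, 5 and 6: four frequent single items, four frequent pairs and the single frequent triple (2, 3, 5).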
Confidence:
We can take our frequent item set knowledge even further by finding association rules using the
frequent item set. In simple words, we know (2, 3, 5) are bought together frequently, but what is
the association between them? To answer this, we create a list of all non-empty proper subsets of
the frequently bought items. For (2, 3, 5) we get the following subsets:
{2}
{3}
{5}
{2,3}
{3,5}
{2,5}
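Each subset S of a frequent item set I gives a candidate rule S => (I - S) with confidence support(I) / support(S). The sketch below applies this to the frequent item sets with two or more items found in the steps above and keeps the rules that reach the 70% minimum confidence; the helper names are assumptions for illustration.

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_CONFIDENCE = 0.7
# Frequent itemsets with two or more items, as found in the steps above.
frequent_itemsets = [{1, 3}, {2, 3}, {2, 5}, {3, 5}, {2, 3, 5}]


def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from(itemset):
    """Yield (antecedent, consequent, confidence) for every split of itemset."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            confidence = support(items) / support(antecedent)
            yield antecedent, consequent, confidence


strong_rules = [
    (a, c, conf)
    for itemset in frequent_itemsets
    for a, c, conf in rules_from(itemset)
    if conf >= MIN_CONFIDENCE
]
for a, c, conf in strong_rules:
    print(f"{set(a)} => {set(c)}  confidence = {conf:.2f}")
```

For this data the rules that survive are {1}=>{3}, {2}=>{5}, {5}=>{2}, {2,3}=>{5} and {3,5}=>{2}; every other split has confidence 2/3, which is below 70%.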
So first create a CSV file for the above problem. The CSV file will look like the rows and columns
in the above figure. This file is written in an Excel sheet.
Running Apriori on the above CSV file generates 5 rules, as shown in the figure:
Exercise:
1. Apply the Apriori algorithm on the Airport noise monitoring dataset and on the voice recording
dataset discriminating between patients with Parkinson's and neurological diseases.
[https://archive.ics.uci.edu/ml/machine-learning-databases/00000/ refer this link for datasets]
PROBLEM:
Find all frequent item sets in the following dataset using the FP-growth algorithm. Minimum
support = 2 and confidence = 70%.
TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5
Solution:
Similar to the Apriori algorithm, find the frequency of occurrence of each item in the dataset and
then prioritize the items in descending order of their frequency of occurrence.
Eliminating the items whose frequency is less than the minimum support and assigning the
priorities, we obtain the following table.
ITEM FREQUENCY PRIORITY
1 2 4
2 3 1
3 3 2
5 3 3
Ordering the items of each transaction by priority gives:
TID ITEMS
100 3,1
200 2,3,5
300 2,3,5,1
400 2,5
Prefixes:
1->3:1 2,3,5:1
5->2,3:2 2:1
3->2:2
1-> 3:2 /*2 and 5 are eliminated because they're less than minimum support, and the
occurrence of 3 is obtained by adding the occurrences in both the instances*/
Similarly, 5->2,3:2 ; 2:3 ; 3:2
3->2:2
Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3},{3,5}
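The priority table, the reordered transactions and the prefixes listed above can be reproduced with a short Python sketch. It does not build the FP-tree itself; it counts, for each item, the prefixes that precede it in the priority-ordered transactions, which gives the same prefix paths the tree would. Breaking frequency ties by the smaller item number is an assumption chosen to match the priorities in the table.

```python
from collections import Counter

transactions = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}
MIN_SUPPORT = 2

# Count items and keep the frequent ones.
counts = Counter(i for items in transactions.values() for i in items)
frequent = {i: c for i, c in counts.items() if c >= MIN_SUPPORT}

# Priority: higher frequency first; ties broken by smaller item number (assumption).
order = sorted(frequent, key=lambda i: (-frequent[i], i))
priority = {item: rank for rank, item in enumerate(order, start=1)}

# Drop infrequent items and reorder each transaction by priority.
ordered = {
    tid: sorted((i for i in items if i in frequent), key=priority.get)
    for tid, items in transactions.items()
}

# Prefixes (conditional pattern base): for each item, what precedes it and how often.
pattern_base = {item: Counter() for item in order}
for items in ordered.values():
    for pos, item in enumerate(items):
        if pos > 0:
            pattern_base[item][tuple(items[:pos])] += 1

for item in order:
    print(item, dict(pattern_base[item]))
```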
The tree is constructed as below:
Generating the association rules for the following tree and calculating the
confidence measures we get-
{3}=>{1}=2/3=67%
{1}=>{3}=2/2=100%
{2}=>{3,5}=2/3=67%
Thus eliminating all the sets having confidence less than 70%, we obtain the following
conclusions:
{1}=>{3} , {3,5}=>{2} , {2,3}=>{5} , {2}=>{5}, {5}=>{2}.
As we see, 5 rules are generated manually, and these are to be checked against the results in
WEKA. In order to check the results in the tool, we follow a procedure similar to the one used for
Apriori.
So first create a CSV file for the above problem. The CSV file will look like the rows and columns
in the above figure. This file is written in an Excel sheet.
Conclusion:
As we have seen, the rules generated manually and the rules generated by Weka match; in both
cases 5 rules are generated.
Exercise
DESCRIPTION:
Decision tree learning is one of the most widely used and practical methods for inductive inference
over supervised data. It represents a procedure for classifying categorical data based on their
attributes. This representation of acquired knowledge in tree form is intuitive and easy for humans
to assimilate.
ILLUSTRATION:
Build a decision tree for the following data
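$\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$
where p_i is the proportion of tuples in D that belong to class i and m is the number of classes.
Information gain is used as the attribute selection measure: the attribute having the highest
information gain is picked. The gain is calculated by:
$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, \mathrm{Entropy}(D_j)$
where
D: a given data partition
A: an attribute
v: the number of distinct values of A. If we partition the tuples in D on attribute A, D is
split into v partitions or subsets (D_1, D_2, ..., D_v), where D_j contains those tuples in D that have
outcome a_j of A.
Class P: buys_computer="yes"
Class N: buys_computer="no"
$\mathrm{Gain}(D, age) = \mathrm{Entropy}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, \mathrm{Entropy}(D_j)$
= Entropy(D) - 5/14 Entropy(D_youth) - 4/14 Entropy(D_middle-aged) - 5/14 Entropy(D_senior)
= 0.940 - 0.694
= 0.246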
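The arithmetic behind these figures can be checked with a short Python sketch. It assumes the class counts per age group that correspond to the 5/14, 4/14 and 5/14 weights above (youth: 2 yes and 3 no, middle-aged: 4 yes, senior: 3 yes and 2 no); the function names are illustrative.

```python
from collections import Counter
from math import log2

# (age, buys_computer) pairs matching the assumed class counts:
# youth 2 yes / 3 no, middle-aged 4 yes / 0 no, senior 3 yes / 2 no.
data = (
    [("youth", "yes")] * 2 + [("youth", "no")] * 3
    + [("middle-aged", "yes")] * 4
    + [("senior", "yes")] * 3 + [("senior", "no")] * 2
)


def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows):
    """Information gain of splitting (value, label) rows on the value."""
    labels = [label for _, label in rows]
    partitions = {}
    for value, label in rows:
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the partitions D_j, with weights |D_j| / |D|.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


print(f"Entropy(D) = {entropy([label for _, label in data]):.3f}")
print(f"Gain(age)  = {gain(data):.3f}")
```

The unrounded gain is slightly above 0.2467; the value 0.246 in the derivation comes from subtracting the already rounded figures 0.940 and 0.694.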
High No Fair No
High No Excellent No
Medium No Fair No
Similarly, Gain(student) = 0.971
Gain(credit) = 0.0208
Gain(student) is the highest, so student is selected as the splitting attribute.
A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics
is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute.
Each leaf node represents a class (either buys_computer="yes" or buys_computer="no").
First create a CSV file for the above problem. The CSV file will look like the
rows and columns in the above figure. This file is written in an Excel sheet.
Check the result obtained manually against the result in Weka by right-clicking on the
result and visualizing the tree.
Conclusion:
The solution obtained manually and the solution produced by Weka are the same.
Exercise:
1. Apply the decision tree algorithm to book a table in a hotel / book a train ticket / book a movie ticket.
Steps:
Calculate the information gain on the weather data set (for each attribute separately).
Description:
In machine learning, Naïve Bayes classifiers are a family of simple probabilistic classifiers based
on applying Bayes' Theorem with strong (naïve) independence assumptions between the features.
Example:
AGE INCOME STUDENT CREDIT_RATING BUYS_COMPUTER
<=30 High No Fair No
<=30 High No Excellent No
31-40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31-40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31-40 Medium No Excellent Yes
31-40 High Yes Fair Yes
>40 Medium No Excellent No
CLASS:
C1: buys_computer = "yes"
C2: buys_computer = "no"
DATA TO BE CLASSIFIED:
X = (age<=30, income=Medium, student=Yes, credit_rating=Fair)
P(C1): P(buys_computer="yes") = 9/14 = 0.643
P(C2): P(buys_computer="no") = 5/14 = 0.357
1. P(age="<=30" | buys_computer="yes") = 2/9
2. P(age="<=30" | buys_computer="no") = 3/5
3. P(income="medium" | buys_computer="yes") = 4/9
4. P(income="medium" | buys_computer="no") = 2/5
5. P(student="yes" | buys_computer="yes") = 6/9
6. P(student="yes" | buys_computer="no") = 1/5 = 0.2
7. P(credit_rating="fair" | buys_computer="yes") = 6/9
8. P(credit_rating="fair" | buys_computer="no") = 2/5
P(C1|X) is proportional to P(X|C1)*P(C1):
P(X | buys_computer="yes") = (2/9)*(4/9)*(6/9)*(6/9) = 32/729 = 0.044
P(X | buys_computer="yes")*P(buys_computer="yes") = (32/729)*(9/14) = 0.028
P(C2|X) is proportional to P(X|C2)*P(C2):
P(X | buys_computer="no") = (3/5)*(2/5)*(1/5)*(2/5) = 12/625 = 0.019
P(X | buys_computer="no")*P(buys_computer="no") = (12/625)*(5/14) = 0.007
Since 0.028 > 0.007, the Naïve Bayes classifier predicts buys_computer="yes" for X.
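The calculation can be cross-checked with a small Python sketch that counts the probabilities directly from the 14 training tuples. The rows below are an assumption: they follow the standard AllElectronics table, whose counts agree with the conditional probabilities listed above. Exact fractions are used to avoid rounding, and no smoothing is applied.

```python
from fractions import Fraction

# (age, income, student, credit_rating, buys_computer)
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")  # tuple to classify


def score(label):
    """P(X | label) * P(label) under the naive independence assumption."""
    subset = [r for r in rows if r[-1] == label]
    result = Fraction(len(subset), len(rows))  # prior P(label)
    for i, value in enumerate(x):
        matches = sum(1 for r in subset if r[i] == value)
        result *= Fraction(matches, len(subset))  # P(x_i | label)
    return result


scores = {label: score(label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
for label, s in scores.items():
    print(label, s, f"{float(s):.3f}")
print("predicted class:", prediction)
```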
Exercise
1. Classify data (lung cancer/ diabetes /liver disorder) using Bayesian approach .
DESCRIPTION:
The k-means algorithm aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results
in a partitioning of the data into Voronoi cells.
ILLUSTRATION:
As a simple illustration of the k-means algorithm, consider the following data set consisting of the
scores of two variables on each of five individuals.
I X1 X2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
This data set is to be grouped into two clusters. As a first step in finding a sensible partition, let
the values of the two individuals A and C define the initial cluster means, giving:
Cluster1 A (1,1)
Cluster2 C (0,2)
The Euclidean distance of each individual to the two initial means is:
I A C
A 0 1.4
B 1 2.2
C 1.4 0
D 3.2 2.8
E 4.5 4.2
Each individual is assigned to the nearer mean, so the initial partition has changed, and the two
clusters at this stage have the following characteristics:
Cluster1 A, B mean (1, 0.5)
Cluster2 C, D, E mean (1.67, 3.67)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite cluster, and
we find:
I CLUSTER1 CLUSTER2
A 0.5 2.7
B 0.5 3.7
C 1.8 2.4
D 3.6 0.5
E 4.9 1.9
Individual C is nearer to the mean of Cluster 1 (1.8) than to the mean of its own Cluster 2 (2.4).
Thus, C is relocated to Cluster 1, resulting in the new partition: Cluster 1 = {A, B, C}, Cluster 2 = {D, E}.
The iterative relocation would now continue from this new partition until no more relocation
occurs. However, in this example each individual is now nearer its own cluster mean than that of
the other cluster and the iteration stops, choosing the latest partitioning as the final cluster solution.
Also, it is possible that the k-means algorithm won't find a final solution. In this case, it would be
a better idea to consider stopping the algorithm after a pre-chosen maximum number of iterations.
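The iterative relocation described above can be sketched in Python for the five individuals, starting from A and C as the initial means. This is a minimal illustration rather than Weka's SimpleKMeans: the function names and the max_iterations default are assumptions, and the sketch does not handle a cluster becoming empty.

```python
from math import dist

points = {"A": (1, 1), "B": (1, 0), "C": (0, 2), "D": (2, 4), "E": (3, 5)}


def mean(names):
    """Centroid (coordinate-wise mean) of the named points."""
    xs, ys = zip(*(points[n] for n in names))
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def k_means(centroids, max_iterations=100):
    """Alternate assignment and mean update until no point changes cluster."""
    assignment = None
    for _ in range(max_iterations):
        # Assign every individual to the cluster with the nearest mean.
        new_assignment = {
            name: min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # converged: no relocation occurred
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        centroids = [
            mean([n for n, c in assignment.items() if c == k])
            for k in range(len(centroids))
        ]
    return assignment, centroids


assignment, centroids = k_means([points["A"], points["C"]])
print(assignment)  # cluster index 0 is Cluster 1, index 1 is Cluster 2
print(centroids)
```

The max_iterations cap plays the role of the pre-chosen maximum number of iterations mentioned above.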
Checking the solution in weka:
In order to check the result in the tool we need to follow a procedure.
Step 1:
Create a CSV file with the above table considered in the example. The CSV file will look as shown
below:
1. Create Placement.arff file to identify the students who are eligible for
placements using KNN