ITDSIU21030 Nguyễn Duy Phúc
Introduction to Data Mining
Lab 5: More Classifiers
5.1. Classification boundaries
In the fifth class, we look at some machine learning methods used to classify datasets in Weka (see the lecture of class 4 by Ian H. Witten [1]). We are going to learn about linear regression, classification by regression, and support vector machines.
In this section, we start by looking at the classification boundaries produced by different machine learning methods, using Weka's Boundary Visualizer and 2-dimensional datasets. Follow the instructions in [1] to do some experiments, and then fill in the following table with the classifier models.
Dataset: iris.2D.arff

Rules → OneR

=== Classifier model (full training set) ===

petalwidth:
    < 0.8   -> Iris-setosa
    < 1.75  -> Iris-versicolor
    >= 1.75 -> Iris-virginica
(144/150 instances correct)

Lazy → IBk (K = 5)

=== Classifier model (full training set) ===

IB1 instance-based classifier using 5 nearest neighbor(s) for classification
Time taken to build model: 0 seconds

Lazy → IBk (K = 20)

=== Classifier model (full training set) ===

IB1 instance-based classifier using 20 nearest neighbor(s) for classification
Time taken to build model: 0 seconds
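The IBk boundaries above come from k-nearest-neighbour voting. As a rough illustration of the idea (a minimal sketch, not Weka's implementation; the toy points stand in for petallength/petalwidth pairs and are invented for the example):

```python
from collections import Counter
import math

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest
    neighbours (Euclidean distance), the core idea behind IBk."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data standing in for (petallength, petalwidth) pairs
train = [((1.4, 0.2), "Iris-setosa"), ((1.5, 0.3), "Iris-setosa"),
         ((4.5, 1.5), "Iris-versicolor"), ((4.2, 1.3), "Iris-versicolor"),
         ((5.8, 2.2), "Iris-virginica"), ((6.1, 2.4), "Iris-virginica")]
print(knn_predict(train, (1.6, 0.25), k=3))  # -> Iris-setosa
```

Larger K (as in the K = 20 column) averages over more neighbours, which smooths the decision boundary.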
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Try other learning methods, e.g. NaiveBayes with supervised discretization, i.e., taking the classes into account when discretizing numeric attributes into ranges. [Refer to Text [2], Chapter 7 for the discretization part.]
Dataset: iris.2D.arff

Bayes → NaiveBayes

=== Classifier model (full training set) ===

Naive Bayes Classifier

                          Class
Attribute     Iris-setosa  Iris-versicolor  Iris-virginica
                   (0.33)           (0.33)          (0.33)
==========================================================
petallength
  mean             1.4694           4.2452          5.5516
  std. dev.        0.1782           0.4712          0.5529
  weight sum           50               50              50
  precision        0.1405           0.1405          0.1405

petalwidth
  mean             0.2743           1.3097          2.0343
  std. dev.        0.1096           0.1915          0.2646
  weight sum           50               50              50
  precision        0.1143           0.1143          0.1143

Trees → J48 (minNumObj = 5)

=== Classifier model (full training set) ===

J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9: Iris-virginica (6.0/2.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  : 4
Size of the tree  : 7

Trees → J48 (minNumObj = 10)

=== Classifier model (full training set) ===

J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7: Iris-versicolor (54.0/5.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  : 3
Size of the tree  : 5
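Once a numeric attribute has been discretized into ranges, Naive Bayes reduces to counting. A hand-rolled sketch of that counting step for a single attribute (Laplace-smoothed; the two ranges and the class balance below are invented for illustration, not taken from Weka's output):

```python
from collections import defaultdict

def train_nb(rows):
    """rows: (discretized_value, class_label) pairs. Returns a
    predictor that maximizes P(class) * P(value | class), with
    Laplace smoothing on the conditional counts."""
    class_counts = defaultdict(int)
    joint_counts = defaultdict(int)
    values = set()
    for value, cls in rows:
        class_counts[cls] += 1
        joint_counts[(value, cls)] += 1
        values.add(value)
    total = len(rows)

    def predict(value):
        def score(cls):
            prior = class_counts[cls] / total
            likelihood = (joint_counts[(value, cls)] + 1) / \
                         (class_counts[cls] + len(values))
            return prior * likelihood
        return max(class_counts, key=score)

    return predict

# petalwidth cut into two ranges, as supervised discretization might do
rows = [("<=0.6", "Iris-setosa")] * 50 + [(">0.6", "Iris-virginica")] * 50
predict = train_nb(rows)
print(predict("<=0.6"))  # -> Iris-setosa
```

With supervised discretization the cut points are chosen using the class labels, which is why the resulting ranges line up so closely with the J48 split points above.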
5.2. Linear regression
In this section, we are going to deal with numeric classes using a classical statistical method.
Follow the linear regression lecture in [1] to learn how to calculate attribute weights from training data and make predictions. [Refer to Text [2], Chapter 4.6 for the linear regression part.]
Follow the instructions in [1] to examine the model of linear regression on the cpu dataset.
Write down the results in the following table:
Dataset: cpu
Correlation coefficient        0.9012
Mean absolute error           41.0886
Root mean squared error       69.556
Relative absolute error       42.6943 %
Root relative squared error   43.2421 %

Linear Regression Model

class =
      0.0491 * MYCT +
      0.0152 * MMIN +
      0.0056 * MMAX +
      0.6298 * CACH +
      1.4599 * CHMAX +
    -56.075

Time taken to build model: 0.08 seconds
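The weights above are found by least squares. For a single attribute the closed form is simple; a stdlib-only sketch of that one-attribute case (the data points are made up, not from the cpu dataset):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + b: the one-attribute
    case of the multi-attribute fit Weka's LinearRegression performs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return w, mean_y - w * mean_x

w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # -> 2.0 1.0
```

With several attributes the same idea generalizes to solving the normal equations for the full weight vector, which produces the MYCT...CHMAX coefficients shown above.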
Repeat the experiment with M5P on the cpu dataset, and then write down the results in the following table:

Dataset: cpu
Correlation coefficient        0.9274
Mean absolute error           29.8309
Root mean squared error       60.7112
Relative absolute error       30.9967 %
Root relative squared error   37.7434 %
Classifier model:

M5 pruned model tree
(using smoothed linear models)
CHMIN <= 7.5 : LM1 (165/12.903%)
CHMIN > 7.5 :
| MMAX <= 28000 :
| | MMAX <= 13240 :
| | | CACH <= 81.5 : LM2 (6/18.551%)
| | | CACH > 81.5 : LM3 (4/30.824%)
| | MMAX > 13240 : LM4 (11/24.185%)
| MMAX > 28000 : LM5 (23/48.302%)
LM num: 1
class = -0.0055 * MYCT
+ 0.0013 * MMIN
+ 0.0029 * MMAX
+ 0.8007 * CACH
+ 0.4015 * CHMAX
+ 11.0971
LM num: 2
class = -1.0307 * MYCT
+ 0.0086 * MMIN
+ 0.0031 * MMAX
+ 0.7866 * CACH
- 2.4503 * CHMIN
+ 1.1597 * CHMAX
+ 70.8672
LM num: 3
class = -1.1057 * MYCT
+ 0.0086 * MMIN
+ 0.0031 * MMAX
+ 0.7995 * CACH
- 2.4503 * CHMIN
+ 1.1597 * CHMAX
+ 83.0016
LM num: 4
class = -0.8813 * MYCT
+ 0.0086 * MMIN
+ 0.0031 * MMAX
+ 0.6547 * CACH
- 2.3561 * CHMIN
+ 1.1597 * CHMAX
+ 82.5725
LM num: 5
class = -0.4882 * MYCT
+ 0.0218 * MMIN
+ 0.003 * MMAX
+ 0.3865 * CACH
- 1.3252 * CHMIN
+ 3.3671 * CHMAX
- 51.8474
Number of Rules : 5
Time taken to build model: 0.04 seconds
For comparison, the linear regression model:
=== Classifier model (full training set) ===
Linear Regression Model
class =
0.0491 * MYCT +
0.0152 * MMIN +
0.0056 * MMAX +
0.6298 * CACH +
1.4599 * CHMAX +
-56.075
Is M5P non-linear regression?
- Yes, in effect. M5P ("M5 prime") is a reimplementation and extension of Quinlan's M5 model tree algorithm, designed to handle both linear and non-linear relationships in the data.
M5P combines decision trees with linear regression models. The algorithm first constructs a decision tree in which each internal node tests an input attribute, and each leaf contains a linear regression model. By partitioning the data space and fitting linear models within each partition, M5P can capture complex, non-linear relationships in the data.
In summary, while the models at the leaves are linear, the overall tree structure and its ability to split the data on different conditions make M5P a non-linear regression method.
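The two-stage structure just described, splits routing each instance to a per-leaf linear model, can be sketched directly. The tree below is a hypothetical two-leaf cut-down of the CHMIN split in the output above; the leaf weights are invented, not Weka's LM coefficients:

```python
def m5_predict(instance, node):
    """Walk an M5-style model tree: internal nodes split on an
    attribute; leaves apply their own linear model to the instance."""
    if node[0] == "leaf":
        _, weights, bias = node
        return bias + sum(w * instance[a] for a, w in weights.items())
    _, attr, threshold, left, right = node
    branch = left if instance[attr] <= threshold else right
    return m5_predict(instance, branch)

# Hypothetical tree: one split on CHMIN, a linear model per leaf
tree = ("split", "CHMIN", 7.5,
        ("leaf", {"CACH": 0.8}, 11.0),                  # low-CHMIN machines
        ("leaf", {"CACH": 0.4, "MMAX": 0.02}, -50.0))   # high-CHMIN machines

print(m5_predict({"CHMIN": 2, "CACH": 30}, tree))  # -> 35.0
```

The overall function is piecewise linear, which is why a tree of linear leaves can fit curves a single linear model cannot.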
5.3. Classification by regression
Follow the instructions in [1] to investigate two‐class classification by regression, using the
diabetes dataset.
We are going to convert the nominal class to the numeric class so that the linear regression
model is applicable.
Write down the results in the following table:
Classifier model:

Linear Regression Model

class=tested_positive =
      0.0209 * preg +
      0.0057 * plas +
     -0.0024 * pres +
      0.0131 * mass +
      0.1403 * pedi +
      0.0028 * age +
     -0.8363

Evaluation:

Correlation coefficient        0.5322
Mean absolute error            0.3366
Root mean squared error        0.4036
Relative absolute error       74.0119 %
Root relative squared error   84.6013 %
Total Number of Instances    768
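The conversion step can be sketched end to end: map the two class labels to 0/1, fit a regression, and threshold the numeric prediction at 0.5. A one-attribute stdlib sketch (the glucose-like values below are invented, not the diabetes data):

```python
def classify_by_regression(xs, labels, x_new, positive="tested_positive"):
    """Two-class classification by regression: encode the class as
    0/1, fit y = w*x + b by least squares, threshold the prediction."""
    ys = [1.0 if lab == positive else 0.0 for lab in labels]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return positive if w * x_new + b >= 0.5 else "tested_negative"

# Invented attribute values: low ones negative, high ones positive
xs = [90, 95, 100, 160, 165, 170]
labels = ["tested_negative"] * 3 + ["tested_positive"] * 3
print(classify_by_regression(xs, labels, 98))   # -> tested_negative
print(classify_by_regression(xs, labels, 162))  # -> tested_positive
```

Weka fits the multi-attribute analogue, which is exactly the preg...age model shown above; the evaluation metrics stay regression-style because the internal target is numeric.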
5.4. Support vector machines
Learn about logistic regression in [2], Chapter 4.6.
Follow the lecture of support vector machines (SVMs) in [1], …
Support vector machines (SVMs, also called support vector networks) are supervised learning models with
associated learning algorithms that analyze data and recognize patterns, used for classification and
regression analysis. Given a set of training examples, each marked as belonging to one of two
categories, an SVM training algorithm builds a model that assigns new examples into one category or
the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall on.
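For a linear kernel, the learned model reduces to a vector of attribute weights, and prediction is a signed dot product: which side of the gap an example falls on is the sign of f(x) = <w, x> + b. A sketch with made-up weights (the sign convention, positive score meaning the second class, is an assumption for this example):

```python
def svm_decision(instance, weights, bias):
    """Linear-SVM decision value f(x) = <w, x> + b. The sign of
    f(x) picks the class; its magnitude grows with distance from
    the separating hyperplane."""
    return sum(w * instance.get(attr, 0.0)
               for attr, w in weights.items()) + bias

# Hypothetical weights over already-normalized attributes
weights = {"plas": 4.9, "mass": 3.1, "pres": -0.8}
bias = -5.2
score = svm_decision({"plas": 0.9, "mass": 0.8, "pres": 0.5}, weights, bias)
print("tested_positive" if score > 0 else "tested_negative")
```

This is the form SMO prints in the table below when the kernel is linear: one weight per (normalized) attribute plus a bias, rather than a list of support vectors.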
Follow the instructions in [1] to examine SMO and LibSVM, and fill in the following table:
Dataset SMO’s classifier model and performance LibSVM’s classifier model and
performance
diabetes Kernel used: LibSVM wrapper, original code by
Linear Kernel: K(x,y) = <x,y> Yasser EL-Manzalawy (= WLSVM)
Classifier for classes: tested_negative, Time taken to build model: 0.07
tested_positive seconds
6
ITDSIU21030 Nguyễn Duy Phúc
BinarySMO ==============================
Correctly Classified Instances: 500
Machine linear: showing attribute weights, not 65.1042 %
support vectors. Incorrectly Classified Instances: 268
34.8958 %
1.3614 * (normalized) preg Kappa statistic: 0
+ 4.8764 * (normalized) plas Mean absolute error: 0.349
+ -0.8118 * (normalized) pres Root mean squared error: 0.5907
+ -0.1158 * (normalized) skin Relative absolute error: 76.7774 %
+ -0.1776 * (normalized) insu Root relative squared error: 23.9347 %
+ 3.0745 * (normalized) mass Total Number of Instances: 768
+ 1.4242 * (normalized) pedi
+ 0.2601 * (normalized) age
- 5.1761
Number of kernel evaluations: 19131 (69.279%
cached)
Note: LibSVM is a wrapper class for the libsvm tools (the libsvm classes, typically the jar file, need to be on the classpath to use this classifier); see http://weka.wikispaces.com/LibSVM