Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
1
Supervised vs. Unsupervised Learning (1)
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the classes to which they belong
New data is classified based on the models built from the training set
Training data with class labels (Outlook, Temp, Humidity, Windy → Play Golf), e.g.:
  Rainy  Hot  High  False | No
  Rainy  Hot  High  True  | No
[Figure: training instances → learning algorithm → model]
3
Prediction Problems: Classification vs. Numeric Prediction
Classification
Predict categorical class labels (discrete or nominal)
Construct a model based on the training set and the class labels (the values in a
classifying attribute) and use it in classifying new data
Numeric prediction
Model continuous-valued functions (i.e., predict unknown or missing values)
4
Classification—Model Construction, Validation and Testing
Model Construction and Training
Model: Represented as decision trees, rules, mathematical formulas, or other forms
Assumption: Each sample belongs to a predefined class, given by its class label attribute
Training Set: The set of samples used for model construction
5
Classification—Model Construction, Validation and Testing
Model Validation and Testing:
Test: Estimate accuracy of the model
Compare the known label of each test sample with the class predicted by the model
Accuracy: % of test set samples that are correctly classified by the model
Test set is independent of training set
Validation: If a held-out set is used to select or refine models, it is called a validation (or
development) set
Model Deployment: If the accuracy is acceptable, use the model to classify new data
6
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
7
Decision Tree Induction: An Example
Decision tree construction: a top-down, recursive, divide-and-conquer process
Training data set: Play Golf? (Outlook, Temp, Humidity, Windy → Play Golf), e.g.:
  Rainy     Hot   High  False | No
  Rainy     Hot   High  True  | No
  Overcast  Hot   High  False | Yes
  Sunny     Mild  High  False | Yes
  …
Resulting tree: [Figure: decision tree with root split "outlook?"]
9
Decision Tree Induction: Algorithm
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
There are no samples left
Prediction
Majority voting is employed for classifying the leaf
10
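As a concrete illustration of the induction loop above, here is a minimal Python sketch (the toy `rows`/`labels` data and the `info_gain` helper are illustrative assumptions; the information-gain splitting measure itself is defined on later slides):

```python
from collections import Counter
import math

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Stop: all samples for this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting at the leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       [a for a in attrs if a != best])
    return tree

# Tiny illustrative subset of the Play Golf data
rows = [{"Outlook": "Rainy", "Windy": "False"}, {"Outlook": "Rainy", "Windy": "True"},
        {"Outlook": "Overcast", "Windy": "False"}, {"Outlook": "Sunny", "Windy": "False"}]
labels = ["No", "No", "Yes", "Yes"]
print(build_tree(rows, labels, ["Outlook", "Windy"]))
```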
How to Handle Continuous-Valued Attributes?
Method 1: Discretize continuous values and treat them as categorical values
E.g., age: < 20, 20..30, 30..40, 40..50, > 50
Method 2: Determine the best split point for continuous-valued attribute A
Sort the values, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
Possible split points: the midpoints (a_i + a_{i+1}) / 2 of consecutive sorted values
e.g., (15+18)/2 = 16.5, 19.5, 21.5, 23, 24.5, 27, 30, …
The point with the maximum information gain for A is selected as the split-
point for A
Split: Based on split point P
The set of tuples in D satisfying A ≤ P vs. those with A > P
11
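A small Python sketch of Method 2, assuming a toy numeric attribute with binary class labels (the ages and labels below are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try midpoints (a_i + a_{i+1}) / 2 of the sorted values; return the one
    that maximizes information gain for the split A <= P vs. A > P."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_p, best_gain = None, -1.0
    for i in range(n - 1):
        p = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= p]
        right = [y for v, y in pairs if v > p]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_p, best_gain = p, gain
    return best_p, best_gain

# Toy ages and class labels (illustrative values only)
ages = [15, 18, 21, 22, 24, 25, 29, 31]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]
print(best_split_point(ages, labels))   # (21.5, ~0.55) for this toy data
```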
Pros and Cons
Pros
Easy to explain (even to non-experts)
Easy to implement (many software packages available)
Efficient
Can tolerate missing data
White box
No need to normalize data
Non-parametric: No assumption on data distribution, no assumption on
attribute independency
Can work on various attribute types
12
Cons
Cons
Unstable: sensitive to noise
Accuracy may not be good enough (depending on your data)
Finding the optimal splitting is NP-hard, so greedy algorithms are used
Overfitting
13
Splitting Measures: Information Gain
Entropy (Information Theory)
A measure of the uncertainty associated with a random variable
Calculation: For a discrete random variable Y taking m distinct values {y1, y2, …, ym} with probabilities p_i = P(Y = y_i):
  H(Y) = − Σ_{i=1}^{m} p_i log2(p_i)
Interpretation
Higher entropy → higher uncertainty
Lower entropy → lower uncertainty
Conditional entropy: H(Y|X) = Σ_x p(x) H(Y | X = x)
[Figure: entropy of a binary variable (m = 2) as a function of p]
14
Information Gain: An Attribute Selection Measure
Select the attribute with the highest information gain (used in typical
decision tree induction algorithm: ID3/C4.5)
Let pi be the probability that an arbitrary tuple in D belongs to class C i,
estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using attribute A to split D into v partitions D_1, …, D_v) to classify D:
  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A: Gain(A) = Info(D) − Info_A(D)
16
Example: Attribute Selection with Information Gain
Training data (Play Golf?):
  Outlook   Temp  Humidity  Windy | Play Golf
  Rainy     Hot   High      False | No
  Rainy     Hot   High      True  | No
  Overcast  Hot   High      False | Yes
  Sunny     Mild  High      False | Yes
  Sunny     Cool  Normal    False | Yes
  Sunny     Cool  Normal    True  | No
  Overcast  Cool  Normal    True  | Yes
  Rainy     Mild  High      False | No
  Rainy     Cool  Normal    False | Yes
  Sunny     Mild  Normal    False | Yes
  Rainy     Mild  Normal    True  | Yes
  Overcast  Mild  High      True  | Yes
  Overcast  Hot   Normal    False | Yes
  Sunny     Mild  High      True  | No
Counts per attribute value (Yes / No) and per-partition information I(Yes, No):
  Temp:     Hot 2/2 (?),  Mild 4/2 (?),  Cool 3/1 (?)
  Humidity: Normal 6/1 (?),  High 3/4 (?)
  Windy:    True ?/? (?),  False ?/? (?)
Similarly, we can get Gain(Temp) = 0.029, Gain(Humidity) = 0.151, Gain(Windy) = 0.048
17
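The gains quoted above can be reproduced with a short script over the 14-tuple Play Golf table (a sketch; rounding of intermediate values explains why the slide reports 0.151 for Humidity while direct computation gives ≈0.152):

```python
import math
from collections import Counter

# The 14-tuple Play Golf table: (Outlook, Temp, Humidity, Windy, Play Golf)
data = [
    ("Rainy", "Hot", "High", "False", "No"),     ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),  ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Sunny", "Mild", "High", "True", "No"),
]
attrs = ["Outlook", "Temp", "Humidity", "Windy"]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    parts = {}
    for row in data:
        parts.setdefault(row[col], []).append(row[-1])
    info_a = sum(len(p) / len(data) * info(p) for p in parts.values())
    return info(labels) - info_a

for i, name in enumerate(attrs):
    print(f"Gain({name}) = {gain(i):.3f}")
# Gain(Outlook) = 0.247, Gain(Temp) = 0.029, Gain(Humidity) = 0.152, Gain(Windy) = 0.048
```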
Gain Ratio: A Refined Measure for Attribute Selection
Information gain measure is biased towards attributes with a large number of
values (e.g. ID)
Gain ratio: Overcomes the problem (as a normalization to information gain)
  SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A)/SplitInfo(A)
The attribute with the maximum gain ratio is selected as the splitting attribute
Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R. Quinlan
Example
  SplitInfo_Temp(D) = − (4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
GainRatio(temp) = 0.029/1.557 = 0.019
18
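A quick numeric check of the example (a sketch; the partition sizes 4/6/4 for Hot/Mild/Cool follow from the count table two slides back):

```python
import math

# Temp partitions the 14 tuples into Hot (4), Mild (6), Cool (4)
sizes, n = [4, 6, 4], 14
split_info = -sum(s / n * math.log2(s / n) for s in sizes)
gain_temp = 0.029                        # Gain(Temp) from the previous slide
print(round(split_info, 3))              # 1.557
print(round(gain_temp / split_info, 3))  # 0.019
```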
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
19
Bayes’ Theorem: Basics
Total probability theorem:
  P(B) = Σ_i P(B | A_i) P(A_i)
Bayes' theorem:
  P(H | X) = P(X | H) P(H) / P(X)  ∝  P(X | H) P(H)
21
Naïve Bayes Classifier: Making a Naïve Bayes Assumption
Based on the Bayes’ Theorem, we can derive a Bayes Classifier to compute the
posterior probability of classifying an object X to a class C
P(C | X) ∝ P(X | C) P(C) = P(x_1 | C) P(x_2 | x_1, C) … P(x_n | x_1, …, x_{n−1}, C) P(C)
Naïve assumption (class-conditional independence): P(X | C) = P(x_1 | C) P(x_2 | C) … P(x_n | C)
Super efficient: each feature is conditioned only on the class, so training boils down to simple
sample counting
Achieves surprisingly comparable performance
22
Naïve Bayes Classifier: Categorical vs. Continuous Valued Features
If feature x_k is categorical, p(x_k = v_k | C_i) is the # of tuples in C_i with x_k = v_k,
divided by |C_i,D| (# of tuples of C_i in D)
If feature x_k is continuous-valued, p(x_k | C_i) is usually estimated with a Gaussian (normal) density fitted to the x_k values of class C_i
  p(X | C_i) = Π_k p(x_k | C_i) = p(x_1 | C_i) · p(x_2 | C_i) ··· p(x_n | C_i)
23
Naïve Bayes Classifier Example 1: Training Dataset
Class:
play golf = ‘yes’
play golf = ‘no’
24
Naïve Bayes Classifier Example: P(Yes | Sunny)
25
Naïve Bayes Classifier Example: P(No | Sunny)
26
Naïve Bayes Classifier Example: Likelihood Tables
27
Naïve Bayes Classifier Example: Likelihood Tables
  P(Yes | Sunny) = P(Sunny | Yes) P(Yes) / P(Sunny)
  P(No | Sunny) = P(Sunny | No) P(No) / P(Sunny)
28
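A counting sketch that reproduces the likelihood-table calculation, assuming the same 14-tuple Play Golf table used in the decision-tree example (the list literals below encode that table):

```python
from collections import Counter

# Outlook column and class labels of the 14-tuple Play Golf table
outlook = ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
           "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

n = len(play)
prior = {c: cnt / n for c, cnt in Counter(play).items()}            # P(Yes), P(No)
p_sunny = outlook.count("Sunny") / n                                 # P(Sunny)
for c in ("Yes", "No"):
    likelihood = sum(1 for o, y in zip(outlook, play) if o == "Sunny" and y == c) \
                 / play.count(c)                                     # P(Sunny | c)
    posterior = likelihood * prior[c] / p_sunny                      # Bayes' theorem
    print(f"P({c} | Sunny) = {posterior:.2f}")
# P(Yes | Sunny) = 0.60, P(No | Sunny) = 0.40
```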
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero
Otherwise, the predicted probability will be zero
p X|𝐶𝑖 = ς𝑘 𝑝 𝑥𝑘 𝐶𝑖 ) = 𝑝 𝑥1 𝐶𝑖 ) ∙ 𝑝 𝑥2 𝐶𝑖 ) ∙∙∙∙∙ 𝑝 𝑥𝑛 𝐶𝑖 )
Example. Suppose a dataset with 1,000 tuples:
income = low (0), income= medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 (or a small integer) to each case
Prob(income = low) = 1/(1000 + 3)
Prob(income = medium) = (990 + 1)/(1000 + 3)
Prob(income = high) = (10 + 1)/(1000 + 3)
The “corrected” probability estimates are close to their “uncorrected”
counterparts
29
Naïve Bayes Classifier: Strength vs. Weakness
Strength
Performance: A naïve Bayesian classifier has performance comparable to decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct—prior knowledge can be combined with
observed data
30
Naïve Bayes Classifier: Strength vs. Weakness
Weakness
Assumption: class-conditional independence of attributes, which can cause a loss of accuracy
E.g., Patient’s Profile: (age, family history),
Patient’s Symptoms: (fever, cough),
Patient’s Disease: (lung cancer, diabetes).
Dependencies among these cannot be modeled by Naïve Bayes Classifier
How to deal with these dependencies?
Use Bayesian Belief Networks (chapter 7)
31
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
32
Lazy vs. Eager Learning
Lazy vs. eager learning
Lazy learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
Eager learning (the above discussed methods): Given a set of training tuples,
constructs a classification model before receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form an implicit global approximation to the target function
Eager: must commit to a single hypothesis that covers the entire instance space
33
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”) until a
new instance must be classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based inference
34
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples
[Figure: query point x_q surrounded by + and − training examples]
35
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to their distance
to the query x_q
Give greater weight to closer neighbors, e.g., w = 1 / d(x_q, x_i)^2
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes
To overcome it, axes stretch or elimination of the least relevant attributes
36
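A minimal NumPy sketch of distance-weighted k-NN with w = 1/d^2, using made-up 2-D points (the small epsilon added to the squared distance is an implementation convenience, not part of the slide's formula):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors votes with
    weight w = 1 / dist(x_q, x_i)^2 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + 1e-12)  # avoid division by zero
    return max(votes, key=votes.get)

# Toy 2-D data (illustrative values only)
X = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [0.9, 1.1]])
y = ["+", "+", "-", "-", "+"]
print(weighted_knn_predict(X, y, np.array([1.1, 1.0]), k=3))   # "+"
```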
Selection of k for kNN
The number of neighbors k
Small k: overfitting (high var., low bias)
Big k: bringing too many irrelevant points (high bias, low var.)
37 http://scott.fortmann-roe.com/docs/BiasVariance.html
Case-Based Reasoning (CBR)
CBR: Uses a database of problem solutions to solve new problems
Store symbolic description (tuples or cases)—not points in a Euclidean space
Applications: Customer-service (product-related diagnosis), legal ruling
Methodology
Instances represented by rich symbolic descriptions (e.g., function graphs)
Search for similar cases, multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
Challenges
Find a good similarity metric
Indexing based on a syntactic similarity measure; when retrieval fails, backtracking
and adapting to additional cases
38
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
39
Linear Regression Problem: Example
Mapping from independent attributes to continuous value: x => y
{living area} => Price of the house
{college; major; GPA} => Future Income
[Figure: scatter plot of house price vs. living area]
40
Linear Regression Problem: Model
Linear regression
Data: n independent objects
Observed Value: 𝑦𝑖 , 𝑖 = 1,2,3, ⋯ , 𝑛
p-dimensional attributes: x_i = (x_{i1}, x_{i2}, …, x_{ip})^T, i = 1, 2, 3, …, n
Model:
Weight vector: w = (w_1, w_2, …, w_p)
  y_i = w^T x_i + b
The weight vector w and bias b are the model parameters learnt from the data
41
Linear Regression Model: Solution
Least Square Method
  Cost / loss function: L(w, b) = Σ_{i=1}^{n} (y_i − w x_i − b)^2
  Optimization goal: (w, b) = argmin_{(w,b)} L(w, b)
Closed-form solution (single attribute, with ȳ = (1/n) Σ_{i=1}^{n} y_i):
  w = Σ_{i=1}^{n} x_i (y_i − ȳ) / ( Σ_{i=1}^{n} x_i^2 − (1/n) (Σ_{i=1}^{n} x_i)^2 )
  b = (1/n) Σ_{i=1}^{n} (y_i − w x_i)
42
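A NumPy sketch of the closed-form solution above on made-up living-area/price numbers, cross-checked against `np.polyfit`:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares solution for y = w * x + b (one attribute)."""
    n = len(x)
    y_bar = y.mean()
    w = np.sum(x * (y - y_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    b = np.mean(y - w * x)
    return w, b

# Toy data: living area (hundreds of sq. ft.) vs. price (illustrative numbers)
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([200.0, 290.0, 410.0, 500.0, 610.0])
w, b = fit_simple_linear_regression(x, y)
print(w, b)                       # slope and intercept
print(np.polyfit(x, y, 1))        # cross-check with numpy's least-squares fit
```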
Logistic Regression: General Ideas
How to solve “classification” problems by regression?
Key idea of Logistic Regression
We need to transform the real value Y into a probability value ∈ [0,1]
Sigmoid function (a differentiable function):
  σ(z) = 1 / (1 + e^{−z}) = e^z / (e^z + 1)
Projects (−∞, +∞) to [0, 1]
Not only logistic regression uses this function; it is also used in neural networks and deep learning
The projected value changes sharply around the zero point
[Figure: the sigmoid function]
Notice that ln( y / (1 − y) ) = w^T x + b, i.e., the log-odds is a linear function of x
43
Logistic Regression: An Example
Suppose we only consider the year as feature
Data points are converted by the sigmoid function (the "activation" function)
[Figure: tenured (0/1) vs. year, with the fitted sigmoid curve]
44
Logistic Regression: Model
The prediction function to learn
Probability that Y = 1:
  p(Y = 1 | X = x; w) = Sigmoid( w_0 + Σ_{i=1}^{n} w_i · x_i )
  w = (w_0, w_1, w_2, …, w_n) are the parameters
Log likelihood of the training data (to be maximized over w):
  l(w) = Σ_{i=1}^{N} [ y_i log p(Y = 1 | X = x_i; w) + (1 − y_i) log( 1 − p(Y = 1 | X = x_i; w) ) ]
47
Gradient Descent
[Footnote: it is gradient ascent for maximizing the log likelihood]
Note: we have switched the index i of the previous slides to l, and use i to refer to the components of the weight vector w
Update rule (gradient ascent with learning rate η): w_i ← w_i + η Σ_l x_{l,i} ( y_l − p(Y = 1 | X = x_l; w) )
48
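Since the full derivation is not reproduced here, the following is a hedged NumPy sketch of gradient ascent on the log likelihood l(w) (toy tenure data; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.01, n_iters=20000):
    """Maximize the log likelihood l(w) by gradient ascent.
    Gradient w.r.t. w_i: sum_l x_{l,i} * (y_l - p(Y=1 | x_l; w))."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 for the bias w_0
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)                         # p(Y=1 | x_l; w) for every tuple
        w += lr * X.T @ (y - p)                    # gradient ascent step
    return w

# Toy 1-D example: years of service vs. tenured (illustrative data)
years = np.array([[1.0], [2.0], [3.0], [5.0], [6.0], [7.0], [8.0], [9.0]])
tenured = np.array([0, 0, 0, 1, 0, 1, 1, 1])
w = fit_logistic_regression(years, tenured)
print(sigmoid(np.array([1.0, 6.0]) @ w))           # P(tenured | 6 years) under the fitted model
```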
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
51
Model Evaluation and Selection
Evaluation metrics
How can we measure accuracy?
Other metrics to consider?
Use a validation or test set of class-labeled tuples, instead of the training set, when assessing
accuracy
Methods for estimating a classifier’s accuracy
Holdout method
Cross-validation
Bootstrap (not covered)
Comparing classifiers:
ROC Curves
52
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
In a confusion matrix with m classes, CM_{i,j} indicates the # of tuples of class i that
were labeled by the classifier as class j
May have extra rows/columns to provide totals
Example of Confusion Matrix: [table omitted; see the cancer example on the following slide]
56
Classifier Evaluation Metrics: Example
Use the same confusion matrix, calculate the measure just introduced
Actual Class\Predicted class cancer = yes cancer = no Total
cancer = yes 90 210 300
cancer = no 140 9560 9700
Total 230 9770 10000
Sensitivity = TP / P = 90 / 300 = 30.00%
Specificity = TN / N = 9560 / 9700 = 98.56%
Accuracy = (TP + TN) / All = 9650 / 10000 = 96.50%
Error rate = 1 − accuracy = 350 / 10000 = 3.50%
Precision = TP / (TP + FP) = 90 / 230 = 39.13%
Recall = TP / (TP + FN) = 90 / 300 = 30.00%
F1 = 2 × Precision × Recall / (Precision + Recall) = 33.96%
57
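The same numbers can be obtained programmatically; a short sketch using the TP/FN/FP/TN counts from the table above:

```python
# Metrics for the cancer confusion matrix above: TP=90, FN=210, FP=140, TN=9560
TP, FN, FP, TN = 90, 210, 140, 9560
total = TP + FN + FP + TN

sensitivity = TP / (TP + FN)          # recall of the positive (cancer = yes) class
specificity = TN / (TN + FP)
accuracy = (TP + TN) / total
error_rate = 1 - accuracy
precision = TP / (TP + FP)
recall = sensitivity
f1 = 2 * precision * recall / (precision + recall)
print(f"sens={sensitivity:.4f} spec={specificity:.4f} acc={accuracy:.4f} "
      f"err={error_rate:.4f} prec={precision:.4f} rec={recall:.4f} F1={f1:.4f}")
# sens=0.3000 spec=0.9856 acc=0.9650 err=0.0350 prec=0.3913 rec=0.3000 F1=0.3396
```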
[Figure: underfitting vs. overfitting]
59
Classifier Evaluation: Holdout
Holdout method: randomly partition the given data into two independent sets, a training set (typically about 2/3 of the data) and a test set (the remaining 1/3); build the model on the training set and estimate its accuracy on the test set
60
Classifier Evaluation: Cross-Validation
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D_1, …, D_k, each of
approximately equal size
At the i-th iteration, use D_i as the test set and the remaining subsets as the training set
Leave-one-out: k folds where k = # of tuples, for small sized data
*Stratified cross-validation*: folds are stratified so that the class
distribution in each fold is approximately the same as that in the
initial data
61
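A sketch of both evaluation protocols, assuming scikit-learn is available and using the Iris data purely as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Holdout: keep an independent test set (here 1/3 of the data)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0, stratify=y)
print("holdout accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Stratified 10-fold cross-validation: class distribution preserved in each fold
scores = cross_val_score(model, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold CV accuracy:", scores.mean())
```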
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curve: plots the true positive rate against the false positive rate as the decision threshold varies; the larger the area under the curve, the more accurate the model
Chapter 6. Classification: Basic Concepts
Classification: Basic Concepts
Linear Classifiers
Summary
63
Techniques to Improve Classification Accuracy
Introducing Ensemble Methods
Bagging
Boosting
Random Forests
Imbalanced Classification
64
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
65
Ensemble Methods: Increasing the Accuracy
What are the requirements to generate an improved model? The base models should be reasonably accurate (better than random guessing) and should make different (diverse) errors
Example: majority voting over three base models M1, M2, M3 on three test tuples x1, x2, x3 (✓ = correct, ✗ = wrong)
  Diverse errors:   M1: ✓ ✓ ✗   M2: ✗ ✓ ✓   M3: ✓ ✗ ✓   → majority vote: ✓ ✓ ✓ (improved)
  Identical models: M1: ✓ ✓ ✗   M2: ✓ ✓ ✗   M3: ✓ ✓ ✗   → majority vote: ✓ ✓ ✗ (no improvement)
  Weak base models: M1: ✓ ✗ ✗   M2: ✗ ✓ ✗   M3: ✗ ✗ ✓   → majority vote: ✗ ✗ ✗ (worse)
66
Ensemble Methods: Increasing the Accuracy
Popular ensemble methods
Bagging: Trains each model using a subset of the training set, and models
learned in parallel
Boosting: Trains each new model instance to emphasize the training instances
that previous models mis-classified, and models learned in order
[Figure: bagging (parallel training) vs. boosting (sequential training)]
67
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
For i = 1 to k
create bootstrap sample, Di, by sampling D with replacement;
use Di and the learning scheme to derive a model, Mi ;
Prediction:
To predict continuous variables, use average prediction instead of vote
68
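A bagging sketch along these lines, assuming scikit-learn decision trees as the base learner and the breast-cancer dataset as a stand-in (the helper names `bagging_fit`/`bagging_predict` are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def bagging_fit(X, y, k=25, random_state=0):
    """Train k base models in parallel, each on a bootstrap sample D_i of D."""
    rng = np.random.default_rng(random_state)
    models = []
    n = len(X)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # sample D with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote of the base models (use the average for numeric prediction)."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = bagging_fit(X_tr, y_tr)
print("bagging accuracy:", (bagging_predict(models, X_te) == y_te).mean())
```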
Boosting
Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
How does boosting work?
A series of k classifiers are iteratively learned
After a classifier Mi is learned, set the subsequent classifier, Mi+1, to pay more
attention to the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
Boosting algorithm can be extended for numeric prediction
69
Adaboost (Freund and Schapire, 1997)
1. Assign initial weights {w_n^(1)} to each training tuple
2. Train base classifier M_1 on the weighted dataset
3. Update the weights based on the current model to get {w_n^(2)}, …, {w_n^(k)}, training M_2, …, M_k in turn
4. After the base classifiers are trained, they are combined to give the final classifier:
   M*(x) = sign( Σ_{i=1}^{k} α_i M_i(x) )
Two 'weighting' strategies:
1. Assign weights to each training example
2. Sample the dataset based on the weight distribution
70
Adaboost (Freund and Schapire, 1997)
Input: training set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
Initialize all weights {w_n^(1)} to 1/N
For round i = 1 to k:
  Fit a classifier M_i based on the weighted error function
    J_i = Σ_{n=1}^{N} w_n^(i) I( M_i(x_n) ≠ y_n )
  Evaluate the error rate ε_i = J_i / Σ_n w_n^(i)   (stop iterating if ε_i < threshold)
  and the base model M_i's vote α_i = (1/2) ln( (1 − ε_i) / ε_i )
  Update the weights (previous model → new weak learner):
    w_n^(i+1) = w_n^(i) exp{ α_i · I( M_i(x_n) ≠ y_n ) }
72
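A hedged NumPy/scikit-learn sketch that follows the steps above, using decision stumps as the weak learners and synthetic two-class data (labels are in {−1, +1} so that the sign-based combination applies; weight normalization is an implementation convenience):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=20):
    """AdaBoost sketch following the steps above; labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initialize all weights to 1/N
    models, alphas = [], []
    for _ in range(k):
        m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (m.predict(X) != y)             # indicator I(M_i(x_n) != y_n)
        eps = np.sum(w * miss) / np.sum(w)     # weighted error rate
        if eps <= 0 or eps >= 0.5:             # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # base model's vote
        w = w * np.exp(alpha * miss)           # re-weight misclassified tuples
        w /= w.sum()
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """M*(x) = sign( sum_i alpha_i * M_i(x) )."""
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)

# Toy usage (illustrative data): two overlapping blobs labelled -1 / +1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [1.5, 1.5]], 100, axis=0)
y = np.repeat([-1, 1], 100)
models, alphas = adaboost_fit(X, y)
print("training accuracy:", (adaboost_predict(models, alphas, X) == y).mean())
```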
Random Forest: Basic Concepts
Random Forest (first proposed by L. Breiman in 2001)
Bagging with decision trees as base models
Data bagging: use a subset of the training data, obtained by sampling D with replacement, for each tree
Feature bagging: use a random subset of the attributes as split candidates at each node (adds more diversity among the trees)
During classification, each tree votes and the most popular class is returned
73
Random Forest
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
Comparable in accuracy to Adaboost, but more robust to errors and outliers
Insensitive to the number of attributes selected for consideration at each split, and
faster than typical bagging or boosting
74
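For practical use, a random forest in the Forest-RI spirit is typically built with an off-the-shelf implementation; a scikit-learn sketch (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Forest-RI style: each tree is grown on a bootstrap sample (data bagging) and
# considers a random subset of features at each split (feature bagging)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("random forest accuracy:", rf.score(X_te, y_te))
```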
Ensemble Methods Recap
Random forest and XGBoost are the most commonly used algorithms for tabular data
Pros
Good performance for tabular data, requires no data scaling
Can scale to large datasets
Can handle missing data to some extent
Cons
Can overfit to training data if not tuned properly
Lack of interpretability (compared to decision trees)
75
Classification of Class-Imbalanced Data Sets
Traditional methods assume a balanced distribution of classes and equal error
costs. But in real-world situations we may face imbalanced data sets, which have
rare positive examples but numerous negative ones.
77
Classification of Class-Imbalanced Data Sets
Typical methods on imbalanced data (At the algorithm level)
Threshold-moving: Move the decision threshold, t, so that the rare class tuples
are easier to classify, and hence, less chance of costly false negative errors
Class weight adjusting: Since false negative costs more than false positive, we
can give larger weight to false negative
Ensemble techniques: combine multiple classifiers, as introduced earlier in this chapter
[Figure: threshold-moving on the classifier's output score]
78
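A scikit-learn sketch of class-weight adjusting and threshold-moving on synthetic imbalanced data (the 95/5 class ratio and the threshold values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced data: ~5% positive examples
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Class-weight adjusting: penalize errors on the rare class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold-moving: lower the decision threshold t so rare-class tuples are easier to predict
probs = clf.predict_proba(X_te)[:, 1]
for t in (0.5, 0.3):
    preds = (probs >= t).astype(int)
    print(f"t={t}: recall on rare class = {recall_score(y_te, preds):.2f}")
```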
Evaluate imbalanced data classifier
Can we use Accuracy to evaluate imbalanced data classifier?
Accuracy simply counts the number of errors. If a data set has 2% positive
labels and 98% negative labels, a classifier that maps all inputs to the negative
class will get an accuracy of 98%!
ROC Curve
79
Summary
Classification: Model construction from a set of training data
Effective and scalable methods
Decision tree induction, Bayes classification methods, lazy learners, linear classifiers
No single method has been found to be superior over all others for all data sets
80
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001
J. Pan and D. Manocha. Bi-level locality sensitive hashing for k-nearest neighbor
computation. IEEE ICDE 2012
81
References (2)
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and
training time of thirty-three old and new classification algorithms. Machine Learning, 2000
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
Blackwell Business, 1994
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,
Data Mining and Knowledge Discovery 2(4): 345-389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
J. R. Quinlan. Bagging, boosting, and c4.5. AAAI’96
E. Alpaydin. Introduction to Machine Learning (2nd ed.). MIT Press, 2011
82
References (3)
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufman, 1991
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques,
2ed. Morgan Kaufmann, 2005
T. Chen and C. Guestrin. Xgboost: A Scalable Tree Boosting System. ACM SIGKDD 2016
C. Aggarwal. Data Classification. Springer, 2015
83