Lecture 4 – Classification: Alternative Techniques
Underfitting – when model is too simple, both training and test errors are large
Insufficient Examples:
o Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
o Insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant
to the classification task
Overfitting – decision trees that are more complex than necessary
The learning algorithm has access only to the training set during model building. It
has no knowledge of the test set
Solution 1: Pre-Pruning
o Stop the algorithm before it becomes a fully-grown tree
o Typical stop conditions:
All instances belong to the same class
Attribute values are the same
o Restrictive conditions:
Number of instances is less than some user-specified threshold
Class distribution of instances is independent of the available features
Expanding the current node does not improve impurity measures
Solution 2: Post-pruning
o Grow decision tree to its entirety
o Trim the nodes of decision tree in a bottom-up fashion
o If generalization error improves after trimming, replace sub-tree by a leaf
node
o Class label of a leaf node is determined from majority class of instances in the
sub-tree
o Can use Minimum Description Length (MDL) for post pruning
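A rough sketch of the post-pruning rule above, using a hypothetical dict-based tree and a validation set; error_on and the tree layout are illustrative assumptions, not a specific library or the course implementation.
```python
from collections import Counter

# Toy tree: a leaf is {'label': c}; an internal node is
# {'attr': a, 'children': {value: subtree}, 'instances': [(record, label), ...]}

def error_on(tree, data):
    """Fraction of (record, label) pairs misclassified by the tree."""
    def predict(node, x):
        while 'label' not in node:
            node = node['children'][x[node['attr']]]
        return node['label']
    return sum(predict(tree, x) != y for x, y in data) / len(data)

def post_prune(node, validation):
    if 'label' in node:                      # already a leaf
        return node
    # Recurse first so pruning proceeds bottom-up
    node['children'] = {v: post_prune(c, validation)
                        for v, c in node['children'].items()}
    # Candidate leaf: majority class of instances in the sub-tree
    majority = Counter(y for _, y in node['instances']).most_common(1)[0][0]
    leaf = {'label': majority}
    # Replace the sub-tree by a leaf if the estimated error does not get worse
    return leaf if error_on(leaf, validation) <= error_on(node, validation) else node
```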
Estimating Generalization Errors
Occam’s Razor: Given two models of similar generalization errors, one should prefer the
simpler model over the more complex model
For complex models, there is a greater chance that the model was fitted accidentally to errors
in the data
Minimum Description Length:
o Cost (Model, Data) = Cost (Data|Model) + Cost (Model)
Cost: number of bits needed for encoding
Search for the least costly model
o Cost (Data|Model): misclassification errors
o Cost (Model): node encoding (number of children) + splitting condition
encoding
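A hypothetical worked example of the MDL cost above, assuming one common encoding: log2(number of records) bits to identify each misclassified record, log2(number of attributes) bits per internal node for its splitting condition, and log2(number of classes) bits per leaf. The encoding choices are illustrative assumptions.
```python
import math

def mdl_cost(errors, n_records, internal_nodes, leaves, n_attributes, n_classes):
    # Cost(Data|Model): identify each misclassified record among the n_records
    cost_data_given_model = errors * math.log2(n_records)
    # Cost(Model): splitting condition per internal node + class label per leaf
    cost_model = (internal_nodes * math.log2(n_attributes)
                  + leaves * math.log2(n_classes))
    return cost_data_given_model + cost_model

# Compare a small tree with more errors against a larger tree with fewer errors
print(mdl_cost(errors=7, n_records=100, internal_nodes=3, leaves=4,
               n_attributes=16, n_classes=2))
print(mdl_cost(errors=4, n_records=100, internal_nodes=7, leaves=8,
               n_attributes=16, n_classes=2))
```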
Instance Based Classifiers:
Store the training records (without training explicit models)
Use training records directly to predict the class label of unseen cases
Rote Learner: memorizes entire training data and performs classification only if
attributes of record match one of the training examples exactly
K-Nearest Neighbours (k-NN): Uses k “closest” points (nearest neighbours) for
performing classification
Nearest-Neighbour Classifier:
Things Required:
o Set of stored training records
o Distance metric to compute distance between records
o Value of k, the number of nearest neighbours to retrieve
If k is too small, sensitive to noise points
If k is too large, neighbourhood may include points from other classes
To classify an unknown record:
o Compute distance to other training records
o Identify k nearest neighbours
o Use the class labels of the nearest neighbours to determine the class label of the
unknown record (e.g. by taking a majority vote)
Computation:
o Euclidean distance: d(p, q) = √(Σi (pi − qi)²)
o Determine the class from nearest neighbour list:
Take majority vote of class labels among the k-nearest neighbours
Weigh the vote according to distance (weight factor w = 1/d²)
1 nearest-neighbour:
o Voronoi Diagram: boundaries denote 1-neighborhoods
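A minimal distance-weighted k-NN sketch in Python/NumPy following the steps above (Euclidean distance, find the k nearest neighbours, weighted majority vote with w = 1/d²); the function name and the zero-distance guard are illustrative assumptions.
```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a distance-weighted vote among its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)

    # 1. Compute Euclidean distance to every training record
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))

    # 2. Identify the k nearest neighbours
    nearest = np.argsort(dists)[:k]

    # 3. Weighted vote: w = 1/d^2 (small epsilon guards against d = 0)
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + 1e-12)
    return max(votes, key=votes.get)

# Toy usage
X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = ['A', 'A', 'B', 'B']
print(knn_predict(X, y, [1.5, 1.5], k=3))   # expected 'A'
```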
Bayesian Classifier: A probabilistic framework for classification problems
Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression becomes
zero
Estimate Probabilities from Data:
For continuous attributes:
o Discretize the range into bins
One ordinal attribute per bin
Violates independence assumption
o Two-way split: (A<v) or (A>v)
Choose only one of the two splits as new attribute
o Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution (e.g. mean and
standard deviation)
Once probability distribution is known, can use it to estimate the
conditional probability P(Ai|c)
Normal Distribution: P(Ai | cj) = (1 / √(2πσij²)) · exp(−(Ai − μij)² / (2σij²)),
where μij and σij² are the mean and variance of attribute Ai for class cj
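A small sketch of estimating P(Ai | c) for a continuous attribute via the normal-density approach above: estimate the class-conditional mean and variance from the training data, then plug them into the density. The sample income values are illustrative only.
```python
import math

def gaussian_prob(x, mean, var):
    """P(Ai = x | c) under a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Estimate parameters of 'Income' for one class from (illustrative) training values
incomes = [125, 100, 70, 120, 60, 220, 75]
mean = sum(incomes) / len(incomes)
var = sum((v - mean) ** 2 for v in incomes) / (len(incomes) - 1)  # sample variance

print(gaussian_prob(120, mean, var))  # conditional density of Income = 120 for this class
```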
Support Vector Machines
To decide which separating line is better, find the hyperplane that maximizes the
margin between the two classes
If the problem is not linearly separable, introduce slack variables that allow some
misclassified training examples
Nonlinear Support Vector Machines
Transform data into higher dimensional space
Using “Kernel Trick”, actual transformation need not be known
Just compute similarity between two vectors in original space
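A small illustration of the kernel-trick idea, assuming an RBF (Gaussian) kernel: the similarity K(x, y) is computed directly in the original space, standing in for a dot product in a higher-dimensional feature space that is never constructed explicitly.
```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2): similarity computed in the original
    space, equivalent to a dot product in an implicit high-dimensional space."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Example: two points in the original 2-D space
print(rbf_kernel([1.0, 2.0], [2.0, 0.0]))   # exp(-0.5 * 5) ≈ 0.082
```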
Artificial Neural Networks (ANN)
Perceptron Model is an assembly of inter-connected nodes and weighted links
Output node sums up its input values, weighted by its links
Compare output node against some threshold t
Algorithm for learning ANN:
o Initialize the weights (w0, w1, w2, …, wk)
o Adjust the weights in such a way that the output of ANN is consistent with
class labels of training examples
Objective function: E = Σi [yi − f(w, xi)]²
Find the weights wi that minimize the above objective function
e.g. backpropagation algorithm
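A minimal perceptron weight-update sketch for the learning loop described above (initialize the weights, then adjust them so the outputs agree with the training labels); the learning rate, epoch count and threshold at zero are illustrative choices, and backpropagation for multi-layer networks is not shown.
```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Learn weights w (including bias w0) so that sign(w·x) matches labels y in {-1, +1}."""
    X = np.c_[np.ones(len(X)), np.asarray(X, dtype=float)]  # prepend bias input x0 = 1
    w = np.zeros(X.shape[1])                                  # initialize weights w0..wk
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            output = 1 if xi @ w > 0 else -1                  # compare against threshold 0
            w += lr * (yi - output) * xi                      # adjust weights toward consistency
    return w

# Toy linearly separable data (AND-like labels)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
print(train_perceptron(X, y))
```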
Ensemble Methods:
Construct a set of classifiers from the training data
Predict the class label of previously unseen records by aggregating the predictions made by
multiple classifiers
Assumption: Individual classifiers (voters) could be lousy (stupid), but the aggregate
(electorate) can usually classify (decide) correctly
Bagging:
o Sampling with replacement
o Build classifier on each bootstrap sample
o Each training record has probability 1 − (1 − 1/N)^N of being selected in a bootstrap sample
Approaches 1 − 1/e ≈ 0.632 for large N
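A quick numerical check of the bootstrap-selection probability stated above.
```python
# Probability that a given record appears in a bootstrap sample of size N
for N in (10, 100, 1000, 10000):
    p = 1 - (1 - 1 / N) ** N
    print(N, round(p, 4))   # approaches 1 - 1/e ≈ 0.6321 as N grows
```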
Boosting: An iterative procedure to adaptively change distribution of training data by
focusing more on previously misclassified records
o Initially, all N records are assigned equal weights
o Unlike bagging, sampling weights may change at the end of each boosting round
o Records that are wrongly classified will have their weights increased
o Records that are correctly classified will have their weights decreased
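An AdaBoost-style weight-update sketch showing the reweighting described above (misclassified records up-weighted, correctly classified records down-weighted). The exact update rule is the standard AdaBoost formula and is an assumption about which boosting variant is intended.
```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """One boosting round of weight updates (AdaBoost-style).

    w      : current record weights (summing to 1)
    y_true : true labels in {-1, +1}
    y_pred : labels predicted by the current base classifier
    """
    w, y_true, y_pred = map(np.asarray, (w, y_true, y_pred))
    miss = y_true != y_pred
    eps = np.sum(w[miss])                           # weighted error of this round
    alpha = 0.5 * np.log((1 - eps) / eps)           # importance of this classifier
    w = w * np.exp(np.where(miss, alpha, -alpha))   # raise wrong, lower right
    return w / w.sum()                              # renormalize

w = np.full(4, 0.25)
print(update_weights(w, [1, 1, -1, -1], [1, -1, -1, -1]))
```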
Lecture 6 – Cluster Analysis
Cluster Analysis: Finding groups of objects such that the objects in a group will be similar to
one another and different from the objects in other groups.
Types of Clustering:
o Partitional Clustering: A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
o Hierarchical Clustering: A set of nested clusters organized as a hierarchical
tree
o Contiguous Cluster (Nearest neighbour or Transitive): A cluster is a set of
points such that a point in a cluster is closer to one or more other points in
the cluster than to any point not in the cluster
o Density-based: A cluster is a dense region of points, which is separated by
low-density regions from other regions of high density
Used when clusters are irregular or intertwined, and when noise and
outliers are present
o Conceptual Cluster: Finds clusters that share some common property or
represent a particular concept
Other Distinctions:
o Exclusive versus non-exclusive
o Fuzzy vs non-fuzzy
o Partial versus complete
o Heterogeneous versus homogeneous
[Figure: examples of partitional clustering vs. hierarchical clustering]
K-means Clustering:
o Partitional clustering approach
o Each cluster is associated with a centroid
o Each point is assigned to the cluster with the closest centroid (Euclidean
distance, cosine similarity, etc.)
o Number of clusters, K, must be specified
o Complexity: O(n * K * I * d), where n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
o Choice of initial centroids is extremely important
o Sum of Squared Error (SSE): For every point, the error is the distance to the
nearest cluster
o One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
o Problem with Selecting Initial Points:
If there are K ‘real’ clusters then the chance of selecting one centroid
from each cluster is small
Chance is relatively small when K is large
If clusters are the same size, n, then the probability of selecting one centroid
from each cluster is K! n^K / (Kn)^K = K!/K^K (e.g. for K = 10, 10!/10^10 ≈ 0.00036)
o Solution:
Multiple runs
Sample and use hierarchical clustering to determine initial centroids
Select more than K initial centroids and then choose the most widely separated
ones among these initial centroids
Postprocessing
Bisecting K-means: Not as susceptible to initialization issues
o Handling Empty Clusters:
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest SSE
If there are several empty clusters, the above can be repeated several
times
o Updating Centres Incrementally after each assignment:
Each assignment updates zero or two centroids
More expensive
Introduces an order dependency
Never get an empty cluster
Can use “weights” to change the impact
o Pre-Processing:
Normalize the data
Eliminate outliers
o Post-processing:
Eliminate small clusters that may represent outliers
Split ‘loose’ clusters, i.e. clusters with relatively high SSE
Merge clusters that are ‘close’ and that have relatively low SSE
Can use these steps during the clustering process: ISODATA (Iterative
Self-Organizing Data Analysis)
Bisecting K-means algorithm: Variant of K-means that can produce a partitional or a
hierarchical clustering
Limitations:
o Problems when clusters are of differing sizes, densities, non-globular shapes
o Data contains outliers
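A compact K-means sketch in NumPy following the steps above (assign each point to the closest centroid, recompute centroids, repeat) together with the SSE measure; the random initialization illustrates why results depend on the initial centroids.
```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assign each point to the cluster with the closest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and Sum of Squared Error
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, sse

X = [[1, 1], [1.5, 2], [5, 7], [6, 8], [1, 0.5], [6, 7]]
labels, centroids, sse = kmeans(X, k=2)
print(labels, sse)
```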
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Space: O(N²), where N is the number of points; Time: O(N³), since there are N steps and
at each step the N²-sized proximity matrix must be updated and searched
Can be visualized as a dendrogram
o A tree-like diagram that records the sequences of merges or splits
Strengths:
o Do not have to assume any particular number of clusters: Any desired
number of clusters can be obtained by ‘cutting’ the dendrogram at the
proper level
o Correspond to meaningful taxonomies
Types of hierarchical clustering:
o Agglomerative:
Start with points as individual clusters
At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
o Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Inter-Cluster Similarity
MIN (single link)
- Strength: Can handle non-elliptical shapes
- Limitation: Sensitive to noise and outliers
MAX (complete link)
- Strength: Less susceptible to noise and outliers
- Limitations: Tends to break large clusters; biased towards globular clusters
Group Average
- Strength: Less susceptible to noise and outliers
- Limitation: Biased towards globular clusters
Distance Between Centroids
Cluster Similarity: Ward’s Method
o Based on the increase in squared error when two clusters are merged:
o Less susceptible to noise and outliers
o Biased towards globular clusters
o Hierarchical analogue of K-means
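A short sketch using SciPy's hierarchical clustering to compare the inter-cluster similarity definitions above (single = MIN, complete = MAX, average = group average, plus Ward's method); the toy data points are made up.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])

for method in ("single", "complete", "average", "ward"):   # MIN, MAX, group average, Ward
    Z = linkage(X, method=method)                     # sequence of merges (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")   # 'cut' the dendrogram into 3 clusters
    print(method, labels)
```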
Minimum Spanning Tree (MST): Divisive Hierarchical Clustering
Build MST
o Start with a tree that consists of any point
o In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
o Add q to the tree and put an edge between p and q
Use MST for constructing hierarchy of clusters
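A small Prim-style sketch of the MST construction described above: start from an arbitrary point, then repeatedly attach the closest point not yet in the tree. Distances are Euclidean and the edge-list representation is an illustrative choice; for divisive clustering the hierarchy is then obtained by repeatedly breaking the longest remaining MST edge.
```python
import numpy as np

def build_mst(X):
    """Return MST edges (p, q, distance) over points X, grown Prim-style."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    in_tree = {0}                           # start the tree with an arbitrary point
    edges = []
    while len(in_tree) < n:
        best = None
        for p in in_tree:                   # closest pair (p in the tree, q outside)
            for q in range(n):
                if q not in in_tree:
                    d = np.linalg.norm(X[p] - X[q])
                    if best is None or d < best[2]:
                        best = (p, q, d)
        edges.append(best)                  # add q to the tree with edge (p, q)
        in_tree.add(best[1])
    return edges

print(build_mst([[0, 0], [0, 1], [5, 5], [5, 6]]))
```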
Lecture 7 – Cluster Analysis: Alternative Methods
DBSCAN: density-based algorithm
Density = number of points within a specified radius (Eps)
A point is a core point if it has more than a specified number of points (MinPts)
within Eps (including itself)
o These are points in the interior of a cluster
A border point is not a core point but is in the neighbourhood of a core point
A noise point is any point that is not a core point or a border point
DBSCAN Algorithm
o Eliminate noise points
o Perform clustering on the remaining points:
Two core points within Eps of each other are put into the same cluster
Border points are assigned to the cluster of an associated core point
To determine Eps and MinPts:
o For points in a cluster, their kth nearest neighbours are at roughly the same
distance
o Noise points have their kth nearest neighbour at a farther distance
o So, plot the sorted distance of every point to its kth nearest neighbour
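A sketch of the k-distance heuristic above for choosing Eps: sort every point's distance to its kth nearest neighbour and look for the "knee" where the curve rises sharply. The choice k = 4 is a common illustrative default, not a rule from the lecture.
```python
import numpy as np

def k_distance(X, k=4):
    """Sorted distance of every point to its kth nearest neighbour."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    d.sort(axis=1)
    return np.sort(d[:, k])   # column k skips the zero self-distance at column 0

X = np.random.default_rng(1).normal(size=(50, 2))
kd = k_distance(X, k=4)
print(kd[:5], kd[-5:])   # noise points show up as the sharply larger values at the end
```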
Clustering Using REpresentatives (CURE)
Representative points are found by selecting a constant number of points from a
cluster and then “shrinking” them toward the centre of the cluster
Cluster similarity is the similarity of the closest pair of representative points from
different clusters
Shrinking representative points toward the centre helps avoid problems with noise
and outliers
CURE is better able to handle clusters of arbitrary shapes and sizes
Graph-Based Clustering
Graph-Based clustering uses the proximity graph
o Start with the proximity matrix
o Consider each point as a node in a graph
o Each edge between two nodes has a weight which is the proximity between
the two points
A graph, G = (V, E) representing a set of objects (V) and their relations
(E)
Initially the proximity graph is fully connected
MIN (single-link) and MAX (complete-link) can be viewed as starting
with this graph
o Clusters are connected components in the graph
Clustering may work better after sparsifying the proximity graph
o Sparsification techniques keep the connections to the most similar (nearest)
neighbour of a point while breaking the connections to less similar points
o The nearest neighbours of a point tend to belong to the same class as the
point itself
o This reduces the impact of noise and outliers and sharpens the distinction
between clusters
Sparsification facilitates the use of graph partitioning algorithms (Chameleon &
Hypergraph-based Clustering)
o Chameleon: Clustering using Dynamic Modelling
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
Relative closeness & relative inter-connectivity of the clusters
Two clusters are combined if the resulting cluster shares
certain properties with the constituent clusters
Merging scheme preserves self-similarity
Used in spatial data:
Clusters are defined as densely populated regions of
space
Clusters have arbitrary shapes, orientation, and non-
uniform sizes
Difference in densities across clusters and variation in
density within clusters
Existence of special artefacts and noise
Steps:
Pre-processing Step: Represent the Data by a Graph
Given a set of points, construct the k-nearest-
neighbour graph to capture the relationship between a
point and its k nearest neighbours
Concept of neighbourhood is captured dynamically
Phase 1: Use a multilevel graph partitioning algorithm on the
graph to find a large number of clusters of well-connected
vertices
Each cluster should contain mostly points from one
“true” cluster
Phase 2: Use Hierarchical Agglomerative Clustering to merge
sub-clusters: Two clusters are combined if the resulting cluster
shares certain properties with the constituent clusters
Relative Interconnectivity: Absolute interconnectivity
of two clusters normalized by the internal connectivity
of the clusters
Relative Closeness: Absolute closeness of two clusters
normalized by the internal closeness of the clusters
Shared Nearest Neighbour (SNN) graph: the weight of an edge is the number of shared
neighbours between the vertices, given that the vertices are connected
Jarvis Patrick Clustering:
o K-nearest neighbours of all points are found (breaking all but the k strongest
links from a point to other points in the proximity graph)
o Pair of points is put in the same cluster if:
Two points share more than T neighbours
Two points are in each other’s k nearest neighbour list
SNN Density-based Clustering Algorithm:
o Compute the similarity matrix
o Sparsify the similarity matrix by keeping only the k most similar neighbours
o Construct the shared nearest neighbour graph from the sparsified similarity
matrix
o Find the core points
o Form clusters from the core points
o Discard all the noise points
o Assign all non-noise, non-core points to clusters
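A small sketch of building the SNN graph used by Jarvis-Patrick and SNN density-based clustering: find each point's k nearest neighbours, then weight an edge by the number of shared neighbours, keeping it only when the two points are in each other's k-nearest-neighbour lists. The value of k and the data are illustrative; Jarvis-Patrick would then place two points in the same cluster whenever their edge weight exceeds a threshold T.
```python
import numpy as np

def snn_graph(X, k=3):
    """Shared-nearest-neighbour weights for pairs that are in each other's k-NN lists."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbours of each point (excluding the point itself)
    knn = [set(np.argsort(d[i])[1:k + 1]) for i in range(len(X))]
    weights = {}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if j in knn[i] and i in knn[j]:              # mutual k-NN requirement
                weights[(i, j)] = len(knn[i] & knn[j])   # number of shared neighbours
    return weights

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10]]
print(snn_graph(X, k=3))
```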
Cluster Validation:
Determining the clustering tendency of a set of data, i.e. distinguishing whether non-
random structure actually exists in the data
Compare the results of a cluster analysis to externally known results
Evaluate how well the results of a cluster analysis fit the data without reference to
external information
Comparing the results of two different sets of cluster analyses to determine which is
better
Determining the ‘correct’ number of clusters
Measures of Cluster Validity:
o External Index: extent that cluster labels match externally supplied class
labels (Entropy)
o Internal Index: goodness of a clustering structure without respect to external
information (Sum of Squared Error)
Cluster Cohesion: Measures how closely related are objects in a
cluster (e.g. SSE) – sum of weight of all links within a cluster
Cluster Separation: Measures how distinct or well-separated a cluster
is from other clusters (e.g. Squared Error) – sum of weights between
nodes in the cluster and nodes outside the cluster
Silhouette Coefficient: combine ideas of both cohesion and separation
o Relative Index: Used to compare two different clusterings or clusters
o Correlation:
Proximity Matrix
Incidence Matrix: entry = 1 if associated pair of points belong to the
same cluster; else entry = 0
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster
are close to each other
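A small sketch of the correlation measure above: build the incidence matrix from cluster labels, then correlate its entries with the proximity matrix; the data and labels are toy values.
```python
import numpy as np

def clustering_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the incidence matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)     # proximity matrix (distances)
    incidence = (labels[:, None] == labels[None, :]).astype(float)   # 1 if same cluster, else 0
    iu = np.triu_indices(len(X), k=1)                                # use each pair of points once
    return np.corrcoef(prox[iu], incidence[iu])[0, 1]

X = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(clustering_correlation(X, [0, 0, 1, 1]))
```
Because proximity is measured here as distance, a good clustering gives a strongly negative correlation (same-cluster pairs have small distances); with a similarity matrix the correlation would instead be large and positive.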
Lecture 8 – Association Rule Mining
Association Rule Mining: Given a set of transactions, find rules that will predict the occurrence
of an item based on the occurrences of other items in the transaction
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold
Itemset: A collection of one or more items (e.g. {Milk, Bread, Diaper})
K-itemset: An itemset that contains k items
Support count: Frequency of occurrence of an itemset
Support: Fraction of transactions that contain an itemset
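A small sketch of the support definitions above: count the transactions containing an itemset (support count) and divide by the total number of transactions (support); the toy transactions echo the Milk/Bread/Diaper example.
```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    count = sum(itemset <= set(t) for t in transactions)   # support count
    return count / len(transactions)                       # support

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(support(transactions, {"Milk", "Bread", "Diaper"}))   # 2/5 = 0.4
```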