Machine learning clustering

Mauritius JEDI
Machine Learning
&
Big Data
Clustering Algorithms
Nadeem Oozeer

Machine learning:
• Supervised vs Unsupervised.
– Supervised learning - the presence of the
outcome variable is available to guide the learning
process.
• there must be a training data set in which the solution
is already known.
– Unsupervised learning - the outcomes are
unknown.
• cluster the data to reveal meaningful partitions and
hierarchies

Clustering:
• Clustering is the task of gathering samples into groups of similar samples
according to some predefined similarity or dissimilarity measure
sample Cluster/group

• In this case clustering is carried out using the Euclidean distance as a
measure.

Clustering:
• What is clustering good for
– Market segmentation - group customers into
different market segments
– Social network analysis - Facebook "smartlists"
– Organizing computer clusters and data centers for
network layout and location
– Astronomical data analysis - Understanding
galaxy formation

Galaxy Clustering:
• Multi-wavelength data obtained for galaxy clusters
– Aim: determine robust criteria for the inclusion of a galaxy into
a cluster galaxy
– Note: physical parameters of the galaxy cluster can be heavily
influenced by wrong candidate
Credit:
HST

Clustering Algorithms :
• Hierarchy methods
– statistical method used to build a cluster by
arranging elements at various levels

Dendogram:
• Each level will then represent a possible
cluster.
• The height of the dendrogram shows the level
of similarity that any two clusters are joined
• The closer to the bottom they are the more
similar the clusters are
• Finding of groups from a dendrogram is not
simple and is very often subjective

• Partitioning methods
– make an initial division of the database and then use an
iterative strategy to further divide it into sections
– here each object belongs to exactly one cluster
Credit:
Legodi,
2014

K-means algorithm:
1. Given n objects, initialize k cluster centers
2. Assign each object to its closest cluster centre
3. Update the center for each cluster
4. Repeat 2 and 3 until no change in each cluster center
• Experiment: Pack of cards, dominoes
• Apply the K-means algorithm to the Shapley data
– Change the number of potential cluster and find how the
clustering differ

K Nearest Neighbors (k-NN):
• One of the simplest of all machine learning
classifiers
• Differs from other machine learning techniques,
in that it doesn't produce a model.
• It does however require a distance measure and
the selection of K.
• First the K nearest training data points to the new
observation are investigated.
• These K points determine the class of the new
observation.

1-NN
• Simple idea: label a new point the same as
the closest known point
Label it red.

1-NN Aspects of an
Instance-Based Learner
1. A distance metric
– Euclidian
2. How many nearby neighbors to look at?
– One
3. A weighting function (optional)
– Unused
4. How to fit with the local points?
– Just predict the same output as the nearest
neighbor.

k-NN
• Generalizes 1-NN to smooth away noise in the labels
• A new point is now assigned the most frequent label of its k
nearest neighbors
Label it red, when k = 3
Label it blue, when k = 7

Machine learning clustering

In this document