UNIT - 5
UNSUPERVISED LEARNING:
CLUSTERING AND PATTERN
DETECTION
• Have you ever spent time watching a large
crowd? If so, you are likely to have seen some
recurring personalities. Perhaps a certain type of
person, identified by a freshly pressed suit and a
briefcase, comes to typify the "fat cat" business
executive. A twenty-something wearing skinny
jeans, a flannel shirt, and sunglasses might be
dubbed a "hipster," while a woman unloading
children from a minivan may be labeled a "soccer
mom.“
• Of course, these types of stereotypes are
dangerous to apply to individuals, as no two
people are exactly alike. Yet understood as a
way to describe a collective, the labels capture
some underlying aspect of similarity among
the individuals within the group.
Clustering
• Clustering is an unsupervised machine
learning task that automatically divides the
data into clusters, or groups of similar items.
It does this without having been told how the
groups should look ahead of time. As we may
not even know what we're looking for,
clustering is used for knowledge discovery
rather than prediction. It provides an insight
into the natural groupings found within data.
• Without advance knowledge of what comprises a
cluster, how can a computer possibly know where
one group ends and another begins? The answer
is simple. Clustering is guided by the principle
that items inside a cluster should be very similar
to each other, but very different from those
outside. The definition of similarity might vary
across applications, but the basic idea is always
the same— group the data so that the related
elements are placed together.
The resulting clusters can then be used for action. For
instance, you might find clustering methods employed
in the following applications:
• Segmenting customers into groups with similar
demographics or buying patterns for targeted
marketing campaigns
• Detecting anomalous behavior, such as unauthorized
network intrusions, by identifying patterns of use
falling outside the known clusters
• Simplifying extremely large datasets by grouping
features with similar values into a smaller number of
homogeneous categories
Clustering as a machine Learning Task
• Suppose you were organizing a conference on the
topic of data science. To facilitate professional
networking and collaboration, you planned to
seat people in groups according to one of three
research specialties: computer and/or database
science, math and statistics, and machine
learning. Unfortunately, after sending out the
conference invitations, you realize that you had
forgotten to include a survey asking which
discipline the attendee would prefer to be seated
with.
• As expected, there seems to be a pattern. We
might guess that the upper-left corner, which
represents people with many computer
science publications but few articles on math,
could be a cluster of computer scientists.
Following this logic, the lower-right corner
might be a group of mathematicians. Similarly,
the upper-right corner, those with both math
and computer science experience, may be
machine learning experts
K-Means Clustering Algorithm
• The k-means algorithm assigns each of the n examples
to one of the k clusters, where k is a number that has
been determined ahead of time. The goal is to
minimize the differences within each cluster and
maximize the differences between the clusters.
• The algorithm essentially involves two phases. First, it
assigns examples to an initial set of k clusters. Then, it
updates the assignments by adjusting the cluster
boundaries according to the examples that currently
fall into the cluster. The process of updating and
assigning occurs several times until changes no longer
improve the cluster fit. At this point, the process stops
and the clusters are finalized.
Using distance to assign and update
cluster
• The k-means algorithm begins by choosing k
points in the feature space to serve as the
cluster centers. These centers are the catalyst
that spurs the remaining examples to fall into
place. As we hope to identify three clusters,
according to this method, k = 3 points will be
selected at random. These points are indicated
by the star, triangle, and diamond in the
following diagram:
• After choosing the initial cluster centers, the other
examples are assigned to the cluster center that is nearest
according to the distance function.
• As shown in the following diagram, the three cluster
centers partition the examples into three segments labeled
Cluster A, Cluster B, and Cluster C. The dashed lines
indicate the boundaries for the Voronoi diagram created by
the cluster centers. The Voronoi diagram indicates the
areas that are closer to one cluster center than any other;
the vertex where all the three boundaries meet is the
maximal distance from all three cluster centers. Using these
boundaries, we can easily see the regions claimed by each
of the initial k-means seeds:
• Now that the initial assignment phase has been
completed, the k-means algorithm proceeds to
the update phase. The first step of updating the
clusters involves shifting the initial centers to a
new location, known as the centroid, which is
calculated as the average position of the points
currently assigned to that cluster. The following
diagram illustrates how as the cluster centers
shift to the new centroids, the boundaries in the
Voronoi diagram also shift and a point that was
once in Cluster B (indicated by an arrow) is added
to Cluster A:
• As a result of this reassignment, the k-means
algorithm will continue through another
update phase. After shifting the cluster
centroids, updating the cluster boundaries,
and reassigning points into new clusters (as
indicated by arrows), the figure looks like this:
• Because two more points were reassigned,
another update must occur, which moves the
centroids and updates the cluster boundaries.
However, because these changes result in no
reassignments, the k-means algorithm stops.
The cluster assignments are now final: