13: Clustering
Unsupervised learning - introduction
- Talk about clustering
  - Learning from unlabeled data
- Unsupervised learning
  - Useful to contrast with supervised learning
- Compare and contrast
  - Supervised learning
    - Given a set of labels, fit a hypothesis to it
  - Unsupervised learning
    - Try to determine structure in the data
    - A clustering algorithm groups data together based on data features
- What is clustering good for?
  - Market segmentation - group customers into different market segments
  - Social network analysis - e.g. Facebook "smart lists"
  - Organizing computer clusters and data centers for network layout and location
  - Astronomical data analysis - understanding galaxy formation
K-means algorithm
- Want an algorithm to automatically group the data into coherent clusters
- K-means is by far the most widely used clustering algorithm
Overview
- Take unlabeled data and group it into two clusters
- Algorithm overview
  - 1) Randomly allocate two points as the cluster centroids
    - Have as many cluster centroids as clusters you want (K cluster centroids, in fact)
    - In our example we just have two clusters
  - 2) Cluster assignment step
    - Go through each example and, depending on whether it's closer to the red or blue centroid, assign it to one of the two clusters
    - To demonstrate this, we've gone through the data and "coloured" each point red or blue
  - 3) Move centroid step
    - Take each centroid and move it to the average of the correspondingly assigned data points
  - Repeat 2) and 3) until convergence
- More formal definition
  - Input:
    - K (number of clusters in the data)
    - Training set {x^(1), x^(2), x^(3), ..., x^(m)}
  - Algorithm:

    Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K
    Repeat {
        for i = 1 to m
            c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
        for k = 1 to K
            μ_k := average (mean) of points assigned to cluster k
    }
- Loop 1
  - This inner loop repeatedly sets the c^(i) variable to be the index of the cluster centroid closest to x^(i)
    - i.e. take example x^(i), measure the squared distance to each cluster centroid, and assign c^(i) to the closest cluster:
      $c^{(i)} := \arg\min_k \|x^{(i)} - \mu_k\|^2$
- Loop 2
  - Loops over each centroid and sets μ_k to the mean of all the points assigned to cluster k in loop 1
  - What if there's a centroid with no data assigned to it?
    - Most commonly, remove that centroid, so you end up with K-1 clusters
    - Or, randomly reinitialize it
      - Not sure when that is preferable, though
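As a minimal sketch of the algorithm above (hypothetical code, assuming NumPy and an (m, n) data matrix X; the helper name k_means is ours, not from the lecture):

    import numpy as np

    def k_means(X, K, n_iters=100):
        # Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters
        m, n = X.shape
        # Random initialization: pick K distinct training examples as the starting centroids
        mu = X[np.random.choice(m, K, replace=False)].astype(float)
        for _ in range(n_iters):
            # Cluster assignment step: c[i] = index of the centroid closest to x(i)
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K) squared distances
            c = dists.argmin(axis=1)
            # Move centroid step: mu_k = mean of the points assigned to cluster k
            for k in range(K):
                members = X[c == k]
                if len(members) > 0:
                    mu[k] = members.mean(axis=0)
                else:
                    mu[k] = X[np.random.randint(m)]  # empty cluster: reinitialize it (or drop it)
        return c, mu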
K-means for non-separated clusters
- So far we've looked at K-means where we have well-defined clusters
- But often K-means is applied to datasets where there aren't well-defined clusters
  - e.g. T-shirt sizing
[Figure: T-shirt sizing - scatter plot of customer weight vs. height]
  - There are no obvious discrete groups
  - Say you want to have three sizes (S, M, L): how big do you make each of these?
  - One way would be to run K-means on this data
  - It may do the following
  [Figure: T-shirt sizing data with three K-means clusters over the weight vs. height plot]
  - So K-means creates three clusters, even though they aren't really there
  - Look at the first population of people
    - Try to design a small T-shirt which fits the 1st population
    - And so on for the other two
  - This is an example of market segmentation
    - Build products which suit the needs of your subpopulations
K-means optimization objective
- Supervised learning algorithms have an optimization objective (cost function)
  - K-means does too
- K-means has an optimization objective like the supervised learning functions we've seen
  - Why is this good?
    - Knowing this is useful because it helps with debugging
    - It helps find better clusters
- While K-means is running we keep track of two sets of variables
  - c^(i) is the index of the cluster {1, 2, ..., K} to which example x^(i) is currently assigned
    - i.e. there are m c^(i) values, as each example has a c^(i) value, and that value is one of the clusters (i.e. it can only be one of K different values)
  - μ_k is the cluster centroid associated with cluster k
    - The location of cluster centroid k
    - So there are K of these
    - These are the centroids which exist in the training data space
  - μ_c^(i) is the cluster centroid of the cluster to which example x^(i) has been assigned
    - This is more for convenience than anything else
      - You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
      - Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μ_j)
      - But instead, for easy description, we have this variable, which gets exactly the same value
    - Let's say x^(i) has been assigned to cluster 5
      - That means
        - c^(i) = 5
        - μ_c^(i) = μ_5
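In NumPy terms (a sketch reusing the c and mu arrays from the k_means sketch above), μ_c^(i) is just the centroid array indexed by the assignment vector:

    # c is an (m,) vector of cluster indices, mu is a (K, n) array of centroids.
    # mu[c] is an (m, n) array whose i-th row is mu_{c(i)}, the centroid assigned
    # to x(i) -- e.g. if c[i] == 5, then mu[c][i] is exactly mu[5].
    mu_assigned = mu[c]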
- Using this notation we can write the optimization objective:

  $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$

  - i.e. the average of the squared distances between each training example x^(i) and the cluster centroid to which x^(i) has been assigned
- This is just what we've been doing, as the visual description below shows:

  [Figure: clustered data with red lines from each example to its assigned centroid]

  - The red lines here show the distances between each example x^(i) and the cluster centroid to which that example has been assigned
    - This means that when the example is very close to its cluster centroid, this value is small
    - When the cluster centroid is very far away from the example, the value is large
- This is sometimes called the distortion (or distortion cost function)
- So we are finding the values which minimize this function:

  $\min_{c^{(1)}, \ldots, c^{(m)}, \, \mu_1, \ldots, \mu_K} J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
- If we consider the K-means algorithm
  - The cluster assignment step is minimizing J(...) with respect to c^(1), c^(2), ..., c^(m)
    - i.e. find the centroid closest to each example
    - It doesn't change the centroids themselves
  - The move centroid step
    - We can show this step is choosing the values of μ which minimize J(...) with respect to μ_1, ..., μ_K
  - So, we're partitioning the algorithm into two parts
    - The first part minimizes the c variables
    - The second part minimizes the μ variables
- We can use this knowledge to help debug our K-means algorithm
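A sketch of the distortion computation (reusing X, c, and mu from the earlier sketch); because each step minimizes J, this value should never increase between iterations, which is exactly the debugging check described above:

    def distortion(X, c, mu):
        # J = (1/m) * sum over i of ||x(i) - mu_{c(i)}||^2
        return ((X - mu[c]) ** 2).sum() / len(X)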
Random initialization
- How do we initialize K-means?
  - And how do we avoid local optima?
- Consider the clustering algorithm
  - We never spoke about how to initialize the centroids
    - There are a few ways - one method is most recommended
- Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)
  - Randomly pick K training examples
  - Set μ_1 up to μ_K to these examples' values (a one-line sketch follows below)
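In NumPy the recommended initialization is one line (a sketch; this mirrors the initialization line in the k_means sketch above):

    # Pick K distinct training examples and use their values as the initial centroids
    mu = X[np.random.choice(len(X), K, replace=False)].astype(float)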
- K-means can converge to different solutions depending on the initialization setup
  - Risk of local optima

  [Figure: the same dataset converged to the global optimum vs. two different local optima]

  - The local optima are valid convergence points, but they are local, not global, ones
- If this is a concern
  - We can do multiple random initializations
    - See if we get the same result - many identical results are likely to indicate a global optimum
- Algorithmically we can do this as follows:

    For i = 1 to 100 {
        Randomly initialize K-means
        Run K-means; get c^(1), ..., c^(m), μ_1, ..., μ_K
        Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
    }
  - A typical number of times to initialize K-means is 50-1000
  - Randomly initialize K-means
  - For each of the 100 random initializations, run K-means
  - Then compute the distortion on the set of cluster assignments and centroids at convergence
  - End up with 100 ways of clustering the data
  - Pick the clustering which gave the lowest distortion
- If you're running K-means with 2-10 clusters, this can help find a better global optimum
  - If K is larger than 10, multiple random initializations are less likely to be necessary
  - The first solution is probably good enough (better granularity of clustering)
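A sketch of multiple random initializations, reusing the hypothetical k_means and distortion helpers from above:

    best_J, best = float('inf'), None
    for _ in range(100):               # typically 50-1000 initializations
        c, mu = k_means(X, K)          # each call randomly reinitializes the centroids
        J = distortion(X, c, mu)
        if J < best_J:                 # keep the clustering with the lowest distortion
            best_J, best = J, (c, mu)
    c, mu = best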
How do we choose the number of clusters?
- Choosing K?
  - There's not a great way to do this automatically
  - Normally you use visualizations to do it manually
- What are the intuitions regarding the data?
- Why is this hard?
  - Sometimes it's very ambiguous
    - e.g. two clusters or four clusters
    - There's not necessarily a correct answer
  - This is why doing it automatically is hard
Elbow method
- Vary K and compute the cost function over a range of K values
- As K increases, the minimum value of J(...) should decrease (i.e. the granularity becomes finer, so the centroids can better fit the data)
- Plot this (K vs J())
- Look for the "elbow" on the graph

  [Figure: cost function J vs. K (no. of clusters), showing an "elbow" in the curve]

- Choose the "elbow" number of clusters
- If you get a nice plot this is a reasonable way of choosing K
- Risks
  - Normally you don't get a nice line -> no clear elbow on the curve
  - So it's not really that helpful
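A sketch of the elbow method (assuming matplotlib and the helpers above); taking the best of several runs per K keeps local optima from making the curve bumpy:

    import matplotlib.pyplot as plt

    Ks = range(1, 9)
    # For each K, run K-means a few times and keep the lowest distortion
    Js = [min(distortion(X, *k_means(X, K)) for _ in range(10)) for K in Ks]

    plt.plot(Ks, Js, marker='o')
    plt.xlabel('K (no. of clusters)')
    plt.ylabel('Cost function J')
    plt.show()  # look for the "elbow" where J stops dropping sharply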
Another method for choosing K
- Using K-means for market segmentation
- Running K-means for a later/downstream purpose
  - See how well different numbers of clusters serve your later needs
- e.g. the T-shirt size example
  - You could have three sizes (S, M, L)
  - Or five sizes (XS, S, M, L, XL)
  - Run K-means with K = 3 and with K = 5
  - How does this look?
  [Figure: T-shirt sizing data (weight vs. height) clustered with K = 3 and with K = 5]
  - This gives a way to choose the number of clusters
    - Could consider the cost of making extra sizes vs. how well distributed the products are
    - How important are those extra sizes, though? (e.g. more sizes might make the customers happier)
    - So the applied problem may help guide the number of clusters