13: Clustering
Unsupervised learning - introduction
- Talk about clustering
  - Learning from unlabeled data
- Unsupervised learning
  - Useful to contrast with supervised learning
- Compare and contrast
  - Supervised learning
    - Given a set of labels, fit a hypothesis to it
  - Unsupervised learning
    - Try to determine structure in the data
    - A clustering algorithm groups data together based on data features
- What is clustering good for?
  - Market segmentation - group customers into different market segments
  - Social network analysis - e.g. Facebook "smart lists"
  - Organizing computer clusters and data centers for network layout and location
  - Astronomical data analysis - understanding galaxy formation
K-means algorithm
- Want an algorithm to automatically group the data into coherent clusters
- K-means is by far the most widely used clustering algorithm
Overview
- Take unlabeled data and group it into two clusters
- Algorithm overview
  - 1) Randomly allocate two points as the cluster centroids
    - Have as many cluster centroids as clusters you want (K cluster centroids, in fact)
    - In our example we just have two clusters
  - 2) Cluster assignment step
    - Go through each example and, depending on whether it's closer to the red or blue centroid, assign it to one of the two clusters
    - To demonstrate this, we've gone through the data and "coloured" each point red or blue
  - 3) Move centroid step
    - Take each centroid and move it to the average of the correspondingly assigned data points
  - Repeat 2) and 3) until convergence
- More formal definition
  - Input:
    - K (number of clusters in the data)
    - Training set {x^(1), x^(2), x^(3), ..., x^(m)}
  - Algorithm:

    Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K
    Repeat {
        for i = 1 to m
            c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
        for k = 1 to K
            μ_k := average (mean) of points assigned to cluster k
    }
- Loop 1
  - This inner loop repeatedly sets the c^(i) variable to be the index of the cluster centroid closest to x^(i)
    - i.e. take example x^(i), measure the squared distance to each cluster centroid, and assign c^(i) to the closest cluster:
      $c^{(i)} := \arg\min_k \|x^{(i)} - \mu_k\|^2$
- Loop 2
  - Loops over each centroid and sets μ_k to the mean of all the points assigned to cluster k in loop 1
  - What if there's a centroid with no data assigned to it?
    - Most commonly, remove that centroid, so you end up with K-1 clusters
    - Or, randomly reinitialize it
      - Not sure when that is preferable, though
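As a minimal sketch of the algorithm above (hypothetical code, assuming NumPy and an (m, n) data matrix X; the helper name k_means is ours, not from the lecture):

    import numpy as np

    def k_means(X, K, n_iters=100):
        # Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters
        m, n = X.shape
        # Random initialization: pick K distinct training examples as the starting centroids
        mu = X[np.random.choice(m, K, replace=False)].astype(float)
        for _ in range(n_iters):
            # Cluster assignment step: c[i] = index of the centroid closest to x(i)
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K) squared distances
            c = dists.argmin(axis=1)
            # Move centroid step: mu_k = mean of the points assigned to cluster k
            for k in range(K):
                members = X[c == k]
                if len(members) > 0:
                    mu[k] = members.mean(axis=0)
                else:
                    mu[k] = X[np.random.randint(m)]  # empty cluster: reinitialize it (or drop it)
        return c, mu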
K-means for non-separated clusters
- So far we've looked at K-means where we have well-defined clusters
- But often K-means is applied to datasets where there aren't well-defined clusters
  - e.g. T-shirt sizing
[Figure: T-shirt sizing - scatter plot of customer weight vs. height]
  - There are no obvious discrete groups
  - Say you want to have three sizes (S, M, L): how big do you make each of these?
  - One way would be to run K-means on this data
  - It may do the following
  [Figure: T-shirt sizing data with three K-means clusters over the weight vs. height plot]
  - So K-means creates three clusters, even though they aren't really there
  - Look at the first population of people
    - Try to design a small T-shirt which fits the 1st population
    - And so on for the other two
  - This is an example of market segmentation
    - Build products which suit the needs of your subpopulations
K-means optimization objective
- Supervised learning algorithms have an optimization objective (cost function)
  - K-means does too
- K-means has an optimization objective like the supervised learning functions we've seen
  - Why is this good?
    - Knowing this is useful because it helps with debugging
    - It helps find better clusters
- While K-means is running we keep track of two sets of variables
  - c^(i) is the index of the cluster {1, 2, ..., K} to which example x^(i) is currently assigned
    - i.e. there are m c^(i) values, as each example has a c^(i) value, and that value is one of the clusters (i.e. it can only be one of K different values)
  - μ_k is the cluster centroid associated with cluster k
    - The location of cluster centroid k
    - So there are K of these
    - These are the centroids which exist in the training data space
  - μ_c^(i) is the cluster centroid of the cluster to which example x^(i) has been assigned
    - This is more for convenience than anything else
      - You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
      - Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μ_j)
      - But instead, for easy description, we have this variable, which gets exactly the same value
    - Let's say x^(i) has been assigned to cluster 5
      - That means
        - c^(i) = 5
        - μ_c^(i) = μ_5
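In NumPy terms (a sketch reusing the c and mu arrays from the k_means sketch above), μ_c^(i) is just the centroid array indexed by the assignment vector:

    # c is an (m,) vector of cluster indices, mu is a (K, n) array of centroids.
    # mu[c] is an (m, n) array whose i-th row is mu_{c(i)}, the centroid assigned
    # to x(i) -- e.g. if c[i] == 5, then mu[c][i] is exactly mu[5].
    mu_assigned = mu[c]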
- Using this notation we can write the optimization objective:

  $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$

  - i.e. the average of the squared distances between each training example x^(i) and the cluster centroid to which x^(i) has been assigned
- This is just what we've been doing, as the visual description below shows:

  [Figure: clustered data with red lines from each example to its assigned centroid]

  - The red lines here show the distances between each example x^(i) and the cluster centroid to which that example has been assigned
    - This means that when the example is very close to its cluster centroid, this value is small
    - When the cluster centroid is very far away from the example, the value is large
- This is sometimes called the distortion (or distortion cost function)
- So we are finding the values which minimize this function:

  $\min_{c^{(1)}, \ldots, c^{(m)}, \, \mu_1, \ldots, \mu_K} J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
- If we consider the K-means algorithm
  - The cluster assignment step is minimizing J(...) with respect to c^(1), c^(2), ..., c^(m)
    - i.e. find the centroid closest to each example
    - It doesn't change the centroids themselves
  - The move centroid step
    - We can show this step is choosing the values of μ which minimize J(...) with respect to μ_1, ..., μ_K
  - So, we're partitioning the algorithm into two parts
    - The first part minimizes the c variables
    - The second part minimizes the μ variables
- We can use this knowledge to help debug our K-means algorithm
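A sketch of the distortion computation (reusing X, c, and mu from the earlier sketch); because each step minimizes J, this value should never increase between iterations, which is exactly the debugging check described above:

    def distortion(X, c, mu):
        # J = (1/m) * sum over i of ||x(i) - mu_{c(i)}||^2
        return ((X - mu[c]) ** 2).sum() / len(X)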
Random initialization
- How do we initialize K-means?
  - And how do we avoid local optima?
- Consider the clustering algorithm
  - We never spoke about how to initialize the centroids
    - There are a few ways - one method is most recommended
- Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)
  - Randomly pick K training examples
  - Set μ_1 up to μ_K to these examples' values (a one-line sketch follows below)
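In NumPy the recommended initialization is one line (a sketch; this mirrors the initialization line in the k_means sketch above):

    # Pick K distinct training examples and use their values as the initial centroids
    mu = X[np.random.choice(len(X), K, replace=False)].astype(float)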
- K-means can converge to different solutions depending on the initialization setup
  - Risk of local optima

  [Figure: the same dataset converged to the global optimum vs. two different local optima]

  - The local optima are valid convergence points, but they are local, not global, ones
- If this is a concern
  - We can do multiple random initializations
    - See if we get the same result - many identical results are likely to indicate a global optimum
- Algorithmically we can do this as follows:

    For i = 1 to 100 {
        Randomly initialize K-means
        Run K-means; get c^(1), ..., c^(m), μ_1, ..., μ_K
        Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
    }
  - A typical number of times to initialize K-means is 50-1000
  - Randomly initialize K-means
  - For each of the 100 random initializations, run K-means
  - Then compute the distortion on the set of cluster assignments and centroids at convergence
  - End up with 100 ways of clustering the data
  - Pick the clustering which gave the lowest distortion
- If you're running K-means with 2-10 clusters, this can help find a better global optimum
  - If K is larger than 10, multiple random initializations are less likely to be necessary
  - The first solution is probably good enough (better granularity of clustering)
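A sketch of multiple random initializations, reusing the hypothetical k_means and distortion helpers from above:

    best_J, best = float('inf'), None
    for _ in range(100):               # typically 50-1000 initializations
        c, mu = k_means(X, K)          # each call randomly reinitializes the centroids
        J = distortion(X, c, mu)
        if J < best_J:                 # keep the clustering with the lowest distortion
            best_J, best = J, (c, mu)
    c, mu = best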
How do we choose the number of clusters?
- Choosing K?
  - There's not a great way to do this automatically
  - Normally you use visualizations to do it manually
- What are the intuitions regarding the data?
- Why is this hard?
  - Sometimes it's very ambiguous
    - e.g. two clusters or four clusters
    - There's not necessarily a correct answer
  - This is why doing it automatically is hard
Elbow method
- Vary K and compute the cost function over a range of K values
- As K increases, the minimum value of J(...) should decrease (i.e. the granularity becomes finer, so the centroids can better fit the data)
- Plot this (K vs J())
- Look for the "elbow" on the graph

  [Figure: cost function J vs. K (no. of clusters), showing an "elbow" in the curve]

- Choose the "elbow" number of clusters
- If you get a nice plot this is a reasonable way of choosing K
- Risks
  - Normally you don't get a nice line -> no clear elbow on the curve
  - So it's not really that helpful
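A sketch of the elbow method (assuming matplotlib and the helpers above); taking the best of several runs per K keeps local optima from making the curve bumpy:

    import matplotlib.pyplot as plt

    Ks = range(1, 9)
    # For each K, run K-means a few times and keep the lowest distortion
    Js = [min(distortion(X, *k_means(X, K)) for _ in range(10)) for K in Ks]

    plt.plot(Ks, Js, marker='o')
    plt.xlabel('K (no. of clusters)')
    plt.ylabel('Cost function J')
    plt.show()  # look for the "elbow" where J stops dropping sharply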
Another method for choosing K
- Using K-means for market segmentation
- Running K-means for a later/downstream purpose
  - See how well different numbers of clusters serve your later needs
- e.g. the T-shirt size example
  - You could have three sizes (S, M, L)
  - Or five sizes (XS, S, M, L, XL)
  - Run K-means with K = 3 and with K = 5
  - How does this look?
  [Figure: T-shirt sizing data (weight vs. height) clustered with K = 3 and with K = 5]
  - This gives a way to choose the number of clusters
    - Could consider the cost of making extra sizes vs. how well distributed the products are
    - How important are those extra sizes, though? (e.g. more sizes might make the customers happier)
    - So the applied problem may help guide the number of clusters