
Unsupervised Learning

Agenda
• Clustering methods

• K-means clustering

• Hierarchical clustering

• Dimensionality reduction

• PCA

Machine Learning Methods
Do you have labeled data?

• Yes → Supervised learning. What do you want to predict?
  - A category → Classification: logistic regression, KNN, SVM, CART
  - A quantity → Regression: linear regression, ridge regression, lasso

• No → Unsupervised learning. Do you want to group the data?
  - Yes → Clustering: K-means, hierarchical clustering
  - No → Dimensionality reduction: PCA
Unsupervised Learning
Recall: A set of statistical tools for data that only has features/input
available, but no response.

In other words, we have X’s but no labels y.

Goal: Discover interesting patterns/properties of the data.

• E.g. for visualizing or interpreting high-dimensional data.


Challenges of Unsupervised Learning
Why is unsupervised learning challenging?

• Exploratory data analysis — goal is not always clearly defined

• Difficult to assess performance — “right answer” unknown

• Working with high-dimensional data


Types of Unsupervised Learning
Two approaches:

• Cluster analysis

- For identifying homogeneous subgroups of samples

• Dimensionality reduction

- For finding a low-dimensional representation to characterize and visualize the data
Cluster Analysis

Clustering
A set of methods for finding subgroups within the dataset.

• Observations should share common characteristics within the same group, but differ across groups.

• Groupings are determined from attributes of the data itself — this differs from classification.

Image: https://medium.com/square-corner-blog/so-you-have-some-clusters-now-what-abfd297a575b
Clustering vs. Classification

[Figure: classification assigns a new observation ("?") to one of the known classes A or B; clustering partitions an unlabeled dataset into discovered groups, Clusters A–D.]
Types of Clustering
• Centroid-based clustering

• Hierarchical clustering

• Model-based clustering
- Each cluster is represented by a parametric distribution
- Dataset is a mixture of distributions

• Hard vs. soft/fuzzy clustering


- Hard: observations divided into distinct clusters
- Soft: observations may belong to more than one cluster (see the sketch below)
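Model-based and soft clustering can be illustrated with a Gaussian mixture model. Below is a minimal sketch using scikit-learn's GaussianMixture on hypothetical two-blob data: predict returns hard labels, while predict_proba returns soft (fuzzy) membership probabilities.

```python
# Sketch: model-based (mixture-of-Gaussians) clustering with soft assignments.
# The two-blob toy data is purely illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # hard clustering: one cluster per observation
soft_probs = gmm.predict_proba(X)   # soft clustering: membership probabilities
print(soft_probs[:3].round(3))
```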
K-means Clustering
Groups data into K clusters that satisfy two properties:

1. Each observation belongs to at least one of the K clusters.

2. Clusters are non-overlapping. No observation belongs to more than one cluster.
K-means Clustering
A good clustering is one for which the within-cluster variation is as small as possible.

Denote each cluster by $C_k$, and let $W(C_k)$ be a measure of the within-cluster variation.

K-means aims to solve

$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} W(C_k)$

where the standard choice of within-cluster variation is the squared Euclidean distance

$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$
K-means Clustering Algorithm
1. Initialize each observation to a cluster by randomly assigning a cluster, from 1 to K, to each observation.

2. Iterate until the cluster assignments stop changing:

a. For each of the K clusters, compute the cluster centroid. The k-th cluster centroid is the vector of the p feature means for the observations in the k-th cluster.

b. Assign each observation to the cluster whose centroid is closest (using Euclidean distance as the metric).
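A minimal NumPy sketch of the algorithm above (Lloyd's algorithm). It is illustrative only: it ignores the possibility of a cluster becoming empty and is not an optimized implementation.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the K clusters.
    labels = rng.integers(K, size=len(X))
    for _ in range(max_iter):
        # Step 2a: compute each cluster centroid (vector of feature means).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: assign each observation to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```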
K-means Clustering Iterations

[Figures: snapshots of successive K-means iterations, followed by an animation of the algorithm converging.]
K-means Clustering Properties
It can be shown that the value of the objective function never increases from one iteration of K-means to the next.

However, since the algorithm only finds local minima, it can produce different clusterings under different initializations.
K-means Pros and Cons
Pros:

• Easy to implement and understand

Cons:

• Not robust to data perturbations and different initializations

• Treats each feature equally, so it is not robust to noise features or features on different scales — it looks for spherical clusters in feature space

• Need to define K before running the algorithm (see the usage sketch below)

Hierarchical Clustering
Clusters observations based on the distances between them.

The result is represented as a tree hierarchy (dendrogram) rather than a partition of the data.

Does not require committing to a choice of K.

Image: Sørlie, Therese, et al. (2003) "Repeated observation of breast tumor subtypes in independent gene expression data sets," PNAS.
Hierarchical Clustering Algorithm
1. Initialize each observation to its own cluster.

2. For i = n, n-1, …, 2:

a. Examine all pairwise inter-cluster similarities among the i clusters and identify the pair of clusters that are most similar. Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion occurs.

b. Compute the new pairwise inter-cluster similarities among the i-1 remaining clusters.
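A minimal sketch of this agglomerative procedure with SciPy, on illustrative random data. linkage returns the merge history (which pair was fused and at what dissimilarity), and dendrogram draws the tree with fusion heights on the y-axis.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                           # illustrative data

Z = linkage(X, method="complete", metric="euclidean")  # (n-1) x 4 merge history
dendrogram(Z)                                          # fusion heights on the y-axis
plt.show()
```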
Distance Between Groups
It's easy to compute Euclidean distance between two observations. What is the distance or similarity between two groups or clusters of observations?

Linkage: defines the dissimilarity between two groups of observations. Most common types are complete, average, single, and centroid.
Types of Linkage

• Complete linkage
• Single linkage
• Average linkage
• Centroid linkage
Hierarchical Clustering Example
Illustration of the first few steps of
the hierarchical clustering
algorithm, with complete linkage
and Euclidean distance.
Different Linkage, Different Dendrogram
Hierarchical Clustering Pros and Cons
Pros:

• Don't have to choose a value of K (number of clusters) before running the algorithm

Cons:

• Do have to pick where to cut the dendrogram to obtain clusters (see the sketch below)

• Sensitive to similarity measure and type of linkage used
Dimensionality Reduction
Recall the curse of dimensionality when working in high dimensions.

Dimensionality reduction is the process of reducing the number of features under consideration.

We already saw some examples of this in the lasso and forward/backward selection algorithms. These methods reduce dimensionality by selecting a subset of features. However, they do so using supervision — knowing a response y that is of interest.
Principal Component Analysis
Look for a low-dimensional representation of the dataset that captures as much of the variation in the data as possible.

E.g., if we can obtain a 2D representation of the data, we can plot the observations in this low-dimensional space to gain intuition.

Note that you want to center the data and make the scales of the features comparable before performing PCA. E.g., if one feature is in kilometers and another in meters, the one in kilometers may appear to have lower variance when in fact this is due to scaling.
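A minimal sketch of this preprocessing with scikit-learn (on illustrative data): StandardScaler centers each feature and puts them on a comparable scale, and PCA then extracts a 2D representation for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # illustrative data: n = 100, p = 5

X_std = StandardScaler().fit_transform(X)      # center to mean 0, scale to unit variance
pca = PCA(n_components=2).fit(X_std)
Z = pca.transform(X_std)                       # 2D scores, one row per observation
print(pca.explained_variance_ratio_)           # share of variance captured by each PC
```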
Principal Components
The first principal component of a set of features $X_1, X_2, \dots, X_p$ is the normalized linear combination of the features

$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$

that has the largest variance. "Normalized" refers to $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

We refer to $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first principal component.

Together, the loadings make up the first principal component loading vector $\phi_1 = (\phi_{11}, \dots, \phi_{p1})^T$.
Principal Components
The first principal component loading vector solves the optimization problem

$\max_{\phi_{11}, \dots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1,$

where the $x_{ij}$ are assumed to have been centered to have mean zero.
Principal Components
The second principal component $Z_2$ is the linear combination of features that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$.

Constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction of $\phi_2$ to be orthogonal to $\phi_1$.
Principal Component Analysis

[Figure: the first two principal axes of a Gaussian dataset.]
Principal Components
Equivalently, find the eigenvectors with the largest eigenvalues of the sample covariance matrix.

By the singular value decomposition (SVD) of the centered data matrix, $X = U D V^T$.

The right singular vectors (the columns of V) are the loadings, or principal axes, of the data.

UD is the full principal components decomposition of X, aka the Z's on the previous slides.
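A sketch of the SVD route in NumPy on illustrative data: after centering, the columns of V are the loadings (principal axes), UD gives the scores Z, and the squared singular values divided by n-1 are the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # illustrative data
Xc = X - X.mean(axis=0)                       # center the columns

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                               # columns = principal axes (right singular vectors)
scores = U * d                                # U @ diag(d): the principal component scores Z
eigvals = d**2 / (len(X) - 1)                 # eigenvalues of the sample covariance matrix
```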
How many principal components?
Scree plot: plot the proportion of variance explained by each principal component and look for an "elbow" where the marginal gain drops off.