Lecture 4: Clustering and KNN
Instructor: Ari Smith
October 10, 2023
Learning objectives
• k-means clustering
• Hierarchical / agglomerative clustering
• Distance metrics and linkage criteria
• K-nearest neighbors
Netflix
The Netflix Prize
Predict user ratings for films using only previous ratings
Beat Cinematch by 10% and win $1,000,000
• Progress Prize of $50,000 for each year whose best result improved on the previous year's by at least 1%
Competition began on October 6, 2006
• 100,000,000 observations in the training set
• 1,500,000 in the validation set
• 1,500,000 in the test set
The Netflix Prize
2006
• WXYZConsulting beat Cinematch on Oct 8
• UofT (led by Prof. Hinton) emerged as an early leader
2007
• 40,000 teams from 186 countries
• BellKor beat Cinematch by 8.43%
2008
• An ensemble of BellKor and BigChaos beat Cinematch by 9.54%
The Netflix Prize
The Winner
• BellKor’s Pragmatic Chaos beat Cinematch by 10.06%
• Declared the winner on September 18, 2009
• Ensemble of three teams
User groups
In 2016, Netflix stopped segmenting users by geography
Users are now clustered into 1300 “taste-communities”
Cluster 290:
• Movies like Black Mirror, Lost, and Groundhog Day
The basics of clustering
Types of clustering
Clustering is an unsupervised learning method
• Partition data into groups / clusters such that the observations within a cluster are similar
Two popular types:
1. k-means clustering
2. Hierarchical / agglomerative clustering
How do we define “similar”?
We use distance to determine if two observations are similar
Define an observation with F features:
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T$$
A distance metric $d$ must satisfy:
Non-negativity: $d(\mathbf{x}_1, \mathbf{x}_2) \ge 0$, and $d(\mathbf{x}_1, \mathbf{x}_2) = 0$ iff $\mathbf{x}_1 = \mathbf{x}_2$
Symmetry: $d(\mathbf{x}_1, \mathbf{x}_2) = d(\mathbf{x}_2, \mathbf{x}_1)$
Triangle inequality: $d(\mathbf{x}_1, \mathbf{x}_2) + d(\mathbf{x}_2, \mathbf{x}_3) \ge d(\mathbf{x}_1, \mathbf{x}_3)$
Distance metrics
Euclidean:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \left( \sum_{f=1}^{F} (x_{1f} - x_{2f})^2 \right)^{1/2}$$
Manhattan:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|$$
Chebyshev:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|$$
Distance metrics
Minkowski:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}$$
Hamming:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{f=1}^{F} \mathbb{I}(x_{1f} \ne x_{2f})$$
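As a quick illustration (not from the slides), the sketch below evaluates each of these distances for two made-up feature vectors using NumPy:

```python
import numpy as np

# Two made-up observations with F = 4 features.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 0.0, 3.0, 8.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))           # L2 norm of the difference
manhattan = np.sum(np.abs(x1 - x2))                   # L1 norm
chebyshev = np.max(np.abs(x1 - x2))                   # L-infinity norm
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1 / p)   # general Lp norm
hamming = np.sum(x1 != x2)                            # number of differing features

print(euclidean, manhattan, chebyshev, minkowski, hamming)
```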
Index sets and centroids
Index set: includes the IDs of all observations in a cluster
$S_k = \{1, 3, 7, 21, 44\}$
Centroid: the “center” or “representative point” of each cluster
$$\mathbf{s}_k = \frac{1}{|S_k|} \sum_{i \in S_k} \mathbf{x}_i$$
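For example, a minimal sketch (with made-up data) of computing a centroid from a cluster's index set:

```python
import numpy as np

# Made-up dataset with 4 observations and 2 features.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [1.2, 2.2]])

S_k = [0, 1, 3]              # index set of cluster k
s_k = X[S_k].mean(axis=0)    # centroid: mean of the observations in the cluster
print(s_k)                   # [1.2333..., 2.0]
```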
Cluster distances
Intra-cluster distance: distance between two points in the same cluster
Inter-cluster distance: distance between two points in different clusters
k-means clustering
Basics
Partition observations into k clusters such that the total distance between each observation and its assigned cluster centroid is minimized
Hyperparameters
• k – number of clusters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
Can be written as an integer programming problem – NP-hard!
• Use heuristic algorithms
Lloyd’s Algorithm
1. Randomly initialize k centroids
2. Assign each observation to its closest centroid using the distance metric
3. Recompute the centroid of each cluster
4. Stop if there is no change in the centroids. Otherwise, return to step 2.
Repeat process with many different initializations!
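A minimal NumPy sketch of Lloyd's algorithm with Euclidean distance is shown below; the function name lloyd_kmeans and the single-initialization setup are illustrative assumptions, so in practice you would rerun it from many random starts and keep the best result:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by picking k distinct observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each observation to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned observations.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop if the centroids did not change; otherwise repeat from step 2.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```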
Fisher’s Iris dataset
Overview
Collected in the 1930s by Sir Ronald Fisher
• Professor of Eugenics at University College London
50 observations from each of three species of Iris flowers
• Setosa
• Virginica
• Versicolor
4 features
• Petal: length and width
• Sepal: length and width
Visualization – no labels
Visualization – true labels
Visualization – k-means labels
How do we determine the number of clusters?
Create an elbow plot
• Number of clusters vs total intra-cluster distance
Choose the number of clusters corresponding to the “elbow”
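One way to produce such an elbow plot is sketched below, assuming scikit-learn's KMeans on the Iris features; its inertia_ attribute is the total within-cluster sum of squared distances:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit k-means for a range of k and record the total within-cluster distance.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total intra-cluster distance (inertia)")
plt.title("Elbow plot")
plt.show()
```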
Hierarchical / agglomerative clustering
Basics
Build a hierarchy of clusters in which the closest pair of clusters is repeatedly merged until only one cluster remains
Hyperparameters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• $d(S_1, S_2)$ – linkage criterion
Algorithm
1. Initialize each observation as its own cluster
2. Merge the two closest clusters according to the chosen distance metric / linkage criterion
3. Continue until there is only one cluster (or a stopping criterion is met)
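As an illustrative sketch (not the lecture's code), scikit-learn's AgglomerativeClustering runs this procedure; here it is assumed to use Ward linkage on the Iris features and to stop at three clusters:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X = load_iris().data

# Repeatedly merge the closest pair of clusters, stopping when 3 clusters remain.
# Ward (minimum-variance) linkage with Euclidean distance is assumed here.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels[:10])
```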
Linkage criteria
Centroid:
$$d(S_1, S_2) = d(\mathbf{s}_1, \mathbf{s}_2)$$
Minimum:
$$d(S_1, S_2) = \min_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Maximum:
$$d(S_1, S_2) = \max_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Linkage criteria
Average:
$$d(S_1, S_2) = \frac{1}{|S_1||S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Minimum variance:
$$d(S_1, S_2) = \frac{|S_1||S_2|}{|S_1| + |S_2|} \, \|\mathbf{s}_1 - \mathbf{s}_2\|_2^2$$
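For reference, a sketch of how these criteria are commonly named in SciPy's hierarchical clustering routines (minimum and maximum linkage are usually called "single" and "complete"); the Iris features are used as an assumed example:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# SciPy method names corresponding to the linkage criteria above.
Z_centroid = linkage(X, method="centroid")  # centroid linkage
Z_single = linkage(X, method="single")      # minimum linkage
Z_complete = linkage(X, method="complete")  # maximum linkage
Z_average = linkage(X, method="average")    # average linkage
Z_ward = linkage(X, method="ward")          # minimum-variance (Ward) linkage

# dendrogram(Z_ward) would draw the merge hierarchy shown on the next slides.
```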
Dendrogram – Iris dataset
DailyKos
Overview
Internet blog, forum, and news site devoted to the Democratic Party and
liberal politics
Obtained 3430 articles with 1545 features from Fall 2004
• Each feature is a binary variable corresponding to a word
What were the hot topics on DailyKos at the time?
Hierarchical clustering dendrogram
Articles per cluster (hierarchical vs. k-means)
Top 5 words in each cluster (hierarchical vs. k-means)
K-nearest neighbors
Overview
Simple, intuitive, and widely used method that can capture complex non-linear relationships
Two types:
1. Classification: majority vote of the K-nearest neighbors
2. Regression: weighted average of the K-nearest neighbors
Hyperparameters
K – the number of nearest neighbors
• Can range from 1 to n (all observations)
$d(\mathbf{x}_i, \mathbf{x}_j)$ – the distance metric
• Chosen from the same metrics used for clustering
$w_i$ – the weighting used for each neighbor
• Equal: each neighbor is weighted equally
• Distance: each neighbor is weighted by the inverse of its distance, so closer neighbors count more
Algorithm
Given n observations with features $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and targets $(y_1, \ldots, y_n)$
• Predict for a new observation $\mathbf{x}_p$
1. Compute $d(\mathbf{x}_i, \mathbf{x}_p)$ for $i = 1, \ldots, n$ and let $N_p$ index the K nearest neighbors
2. Compute the prediction:
$$\hat{y}_p = \sum_{i \in N_p} w_i y_i$$
where
$$w_i = \frac{1 / d(\mathbf{x}_i, \mathbf{x}_p)}{\sum_{j \in N_p} 1 / d(\mathbf{x}_j, \mathbf{x}_p)} \quad \text{(distance weighting)} \qquad \text{or} \qquad w_i = \frac{1}{K} \quad \text{(uniform weighting)}$$
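A minimal NumPy sketch of this prediction rule for KNN regression follows; the function name knn_predict and the toy data are made up, and the distance-weighted option uses the inverse-distance weights above:

```python
import numpy as np

def knn_predict(X, y, x_p, K=3, weighting="uniform"):
    # 1. Distance from the query point to every training observation (Euclidean).
    d = np.linalg.norm(X - x_p, axis=1)
    N_p = np.argsort(d)[:K]                    # indices of the K nearest neighbors
    # 2. Neighbor weights: uniform (1/K) or normalized inverse distance.
    if weighting == "uniform":
        w = np.full(K, 1.0 / K)
    else:
        inv = 1.0 / np.maximum(d[N_p], 1e-12)  # guard against zero distance
        w = inv / inv.sum()
    # 3. Prediction: weighted average of the neighbors' targets.
    return np.dot(w, y[N_p])

# Made-up toy data: 4 observations with 1 feature each.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X, y, np.array([1.5]), K=2))   # averages the targets of x=1 and x=2
```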
Applied to the Iris dataset