Lecture 4: Clustering and KNN
Instructor: Ari Smith
October 10, 2023
Learning objectives
• k-means clustering
• Hierarchical / agglomerative clustering
• Distance metrics and linkage criteria
• K-nearest neighbors
Netflix
The Netflix Prize
Predict user ratings for films using only previous ratings
Beat Cinematch by 10% and win $1,000,000
• Progress Prize of $50,000 for each year whose best result improved on the previous year's by at least 1%
Competition began on October 6, 2006
• 100,000,000 observations in the training set
• 1,500,000 in the validation set
• 1,500,000 in the test set
The Netflix Prize
2006
• WXYZConsulting beat Cinematch on Oct 8
• UofT (led by Prof. Hinton) emerged as an early leader
2007
• 40,000 teams from 186 countries
• BellKor beat Cinematch by 8.43%
2008
• An ensemble of BellKor and BigChaos beat Cinematch by 9.54%
The Netflix Prize
The Winner
• BellKor’s Pragmatic Chaos beat Cinematch by 10.06%
• Declared the winner on September 18, 2009
• Ensemble of three teams
User groups
In 2016, Netflix stopped segmenting users by geography
Users are now clustered into 1300 “taste-communities”
Cluster 290:
• Movies like Black Mirror, Lost, and Groundhog Day
The basics of clustering
Types of clustering
Clustering is an unsupervised learning method
• Partition data into groups / clusters such that the observations within a cluster are similar
Two popular types:
1. k-means clustering
2. Hierarchical / agglomerative clustering
How do we define “similar”?
We use distance to determine if two observations are similar
Define an observation with F features:
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T$$
A distance metric $d$ must satisfy:
Non-negativity: $d(\mathbf{x}_1, \mathbf{x}_2) \ge 0$, and $d(\mathbf{x}_1, \mathbf{x}_2) = 0$ iff $\mathbf{x}_1 = \mathbf{x}_2$
Symmetry: $d(\mathbf{x}_1, \mathbf{x}_2) = d(\mathbf{x}_2, \mathbf{x}_1)$
Triangle inequality: $d(\mathbf{x}_1, \mathbf{x}_2) + d(\mathbf{x}_2, \mathbf{x}_3) \ge d(\mathbf{x}_1, \mathbf{x}_3)$
Distance metrics
Euclidean:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \left( \sum_{f=1}^{F} (x_{1f} - x_{2f})^2 \right)^{1/2}$$
Manhattan:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|$$
Chebyshev:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|$$
Distance metrics
Minkowski:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}$$
Hamming:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{f=1}^{F} \mathbb{I}(x_{1f} \ne x_{2f})$$
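As a quick illustration (not from the slides), the sketch below evaluates each of these distances for two made-up feature vectors using NumPy:

```python
import numpy as np

# Two made-up observations with F = 4 features.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 0.0, 3.0, 8.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))           # L2 norm of the difference
manhattan = np.sum(np.abs(x1 - x2))                   # L1 norm
chebyshev = np.max(np.abs(x1 - x2))                   # L-infinity norm
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1 / p)   # general Lp norm
hamming = np.sum(x1 != x2)                            # number of differing features

print(euclidean, manhattan, chebyshev, minkowski, hamming)
```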
Index sets and centroids
Index set: includes the IDs of all observations in a cluster
$S_k = \{1, 3, 7, 21, 44\}$
Centroid: the “center” or “representative point” of each cluster
$$\mathbf{s}_k = \frac{1}{|S_k|} \sum_{i \in S_k} \mathbf{x}_i$$
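For example, a minimal sketch (with made-up data) of computing a centroid from a cluster's index set:

```python
import numpy as np

# Made-up dataset with 4 observations and 2 features.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [1.2, 2.2]])

S_k = [0, 1, 3]              # index set of cluster k
s_k = X[S_k].mean(axis=0)    # centroid: mean of the observations in the cluster
print(s_k)                   # [1.2333..., 2.0]
```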
Cluster distances
Intra-cluster distance: distance between two points in the same cluster
Inter-cluster distance: distance between two points in different clusters
k-means clustering
Basics
Partition observations into k clusters such that the total distance between each observation and its assigned cluster centroid is minimized
Hyperparameters
• k – number of clusters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
Can be written as an integer programming problem – NP-hard!
• Use heuristic algorithms
Lloyd’s Algorithm
1. Randomly initialize k centroids
2. Assign each observation to its closest centroid using the distance metric
3. Recompute the centroid of each cluster
4. Stop if there is no change in the centroids. Otherwise, return to step 2.
Repeat process with many different initializations!
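A minimal NumPy sketch of Lloyd's algorithm with Euclidean distance is shown below; the function name lloyd_kmeans and the single-initialization setup are illustrative assumptions, so in practice you would rerun it from many random starts and keep the best result:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by picking k distinct observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each observation to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned observations.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop if the centroids did not change; otherwise repeat from step 2.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```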
Fisher’s Iris dataset
Overview
Collected in the 1930s by Sir Ronald Fisher
• Professor of Eugenics at University College London
50 observations from each of three species of Iris flowers
• Setosa
• Virginica
• Versicolor
4 features
• Petal: length and width
• Sepal: length and width
Visualization – no labels
Visualization – true labels
Visualization – k-means labels
How do we determine the number of clusters?
Create an elbow plot
• Number of clusters vs total intra-cluster distance
Choose the number of clusters corresponding to the “elbow”
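One way to produce such an elbow plot is sketched below, assuming scikit-learn's KMeans on the Iris features; its inertia_ attribute is the total within-cluster sum of squared distances:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit k-means for a range of k and record the total within-cluster distance.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total intra-cluster distance (inertia)")
plt.title("Elbow plot")
plt.show()
```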
Hierarchical / agglomerative clustering
Basics
Build a hierarchy of clusters in which the closest pair of clusters is repeatedly merged until only one cluster remains
Hyperparameters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• $d(S_1, S_2)$ – linkage criterion
Algorithm
1. Initialize each observation as its own cluster
2. Merge the two closest clusters according to the chosen distance metric / linkage criterion
3. Continue until there is only one cluster (or a stopping criterion is met)
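As an illustrative sketch (not the lecture's code), scikit-learn's AgglomerativeClustering runs this procedure; here it is assumed to use Ward linkage on the Iris features and to stop at three clusters:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X = load_iris().data

# Repeatedly merge the closest pair of clusters, stopping when 3 clusters remain.
# Ward (minimum-variance) linkage with Euclidean distance is assumed here.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels[:10])
```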
Linkage criteria
Centroid:
$$d(S_1, S_2) = d(\mathbf{s}_1, \mathbf{s}_2)$$
Minimum:
$$d(S_1, S_2) = \min_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Maximum:
$$d(S_1, S_2) = \max_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Linkage criteria
Average:
$$d(S_1, S_2) = \frac{1}{|S_1||S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$
Minimum variance:
$$d(S_1, S_2) = \frac{|S_1||S_2|}{|S_1| + |S_2|} \, \|\mathbf{s}_1 - \mathbf{s}_2\|_2^2$$
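For reference, a sketch of how these criteria are commonly named in SciPy's hierarchical clustering routines (minimum and maximum linkage are usually called "single" and "complete"); the Iris features are used as an assumed example:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# SciPy method names corresponding to the linkage criteria above.
Z_centroid = linkage(X, method="centroid")  # centroid linkage
Z_single = linkage(X, method="single")      # minimum linkage
Z_complete = linkage(X, method="complete")  # maximum linkage
Z_average = linkage(X, method="average")    # average linkage
Z_ward = linkage(X, method="ward")          # minimum-variance (Ward) linkage

# dendrogram(Z_ward) would draw the merge hierarchy shown on the next slides.
```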
Dendrogram – Iris dataset
DailyKos
Overview
Internet blog, forum, and news site devoted to the Democratic Party and
liberal politics
Obtained 3430 articles with 1545 features from Fall 2004
• Each feature is a binary variable corresponding to a word
What were the hot topics on DailyKos at the time?
Hierarchical clustering dendrogram
Articles per cluster (hierarchical vs. k-means)
Top 5 words in each cluster (hierarchical vs. k-means)
K-nearest neighbors
Overview
Simple, intuitive, and widely used method that can capture complex non-linear relationships
Two types:
1. Classification: majority vote of the K-nearest neighbors
2. Regression: weighted average of the K-nearest neighbors
Hyperparameters
K – the number of nearest neighbors
• Can range from 1 to n (all observations)
$d(\mathbf{x}_i, \mathbf{x}_j)$ – the distance metric
• Chosen from the same metrics used for clustering
$w_i$ – the weighting used for each neighbor
• Equal: each neighbor is weighted equally
• Distance: each neighbor is weighted by the inverse of its distance, so closer neighbors count more
Algorithm
Given n observations with features $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and targets $(y_1, \ldots, y_n)$
• Predict for a new observation $\mathbf{x}_p$
1. Compute $d(\mathbf{x}_i, \mathbf{x}_p)$ for $i = 1, \ldots, n$ and let $N_p$ index the K nearest neighbors
2. Compute the prediction:
$$\hat{y}_p = \sum_{i \in N_p} w_i y_i$$
where
$$w_i = \frac{1 / d(\mathbf{x}_i, \mathbf{x}_p)}{\sum_{j \in N_p} 1 / d(\mathbf{x}_j, \mathbf{x}_p)} \quad \text{(distance weighting)} \qquad \text{or} \qquad w_i = \frac{1}{K} \quad \text{(uniform weighting)}$$
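A minimal NumPy sketch of this prediction rule for KNN regression follows; the function name knn_predict and the toy data are made up, and the distance-weighted option uses the inverse-distance weights above:

```python
import numpy as np

def knn_predict(X, y, x_p, K=3, weighting="uniform"):
    # 1. Distance from the query point to every training observation (Euclidean).
    d = np.linalg.norm(X - x_p, axis=1)
    N_p = np.argsort(d)[:K]                    # indices of the K nearest neighbors
    # 2. Neighbor weights: uniform (1/K) or normalized inverse distance.
    if weighting == "uniform":
        w = np.full(K, 1.0 / K)
    else:
        inv = 1.0 / np.maximum(d[N_p], 1e-12)  # guard against zero distance
        w = inv / inv.sum()
    # 3. Prediction: weighted average of the neighbors' targets.
    return np.dot(w, y[N_p])

# Made-up toy data: 4 observations with 1 feature each.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X, y, np.array([1.5]), K=2))   # averages the targets of x=1 and x=2
```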
Applied to the Iris dataset