L13. Cluster Analysis
Cluster Analysis
Poul Petersen
BigML
Trees vs Clusters
Trees (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity
Trees vs Clusters
Trees: labeled data (inputs “X” plus a label “Y”)

sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …

Learning Task: find a function “f” such that f(X) ≈ Y

Clusters: unlabeled data (inputs “X” only)

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: find “k” clusters such that the data in each cluster is self-similar
Use Cases
• Customer segmentation
• Item discovery
• Association
• Recommender
• Active learning
Customer Segmentation
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV based on in-game purchases.
• Assumption: usage correlates to LTV.
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.
[Figure: user clusters annotated with the percentage of high-LTV users in each: 0%, 3%, 7%, 1%]
Item Discovery
• Dataset of 86 whiskies.
• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
[Figure: whiskies plotted along flavor dimensions such as Smoky, Fruity, and Body]
Association
• Dataset of Lending Club loans.
• Mark any loan that is currently late, or has ever been late, as “trouble”.
GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster.
[Figure: loan clusters annotated with the percentage of trouble loans in each: 0%, 3%, 7%, 1%]
Active Learning
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data (a sketch follows below).
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud, or a million images which need to be labeled as cat/not-cat.
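One illustrative way to implement this (a sketch only, not BigML's workflow) is with scikit-learn's KMeans, assuming the patients' measurements are already in a NumPy array X; the function name and k are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def pick_instances_to_label(X, k=8, seed=0):
    # Cluster the unlabeled instances, then pick one representative per cluster
    # (the member closest to its cluster center) to send for expensive labeling.
    km = KMeans(n_clusters=k, random_state=seed).fit(X)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dist)])
    return picks  # row indices of the instances to test and label

Labeling one representative per cluster covers the variety in the data better than the same number of random draws when the clusters really do differ.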
Human Example
[Figure: a collection of everyday objects to be sorted into K=3 groups; a human groups them as “Round”, “Skinny”, and “Corners”]
Human Example
• The human used prior knowledge to select possible features that separate the objects: “round”, “skinny”, “edges”, “hard”, etc.
• Items were then separated based on the chosen features.
• Separation quality was then tested to ensure:
  • the criterion of K=3 was met
  • groups were sufficiently “distant”
  • there was no crossover
Learning from Humans
Create features that capture these object differences:
• Length/Width
  • greater than 1 => “skinny”
  • equal to 1 => “round”
  • less than 1 => invert the ratio
• Number of surfaces
  • distinct surfaces require “edges”, which have corners
  • easier to count
Feature Engineering
Object     Length/Width   Num Surfaces
penny      1              3
dime       1              3
knob       1              4
eraser     2.75           6
box        1              6
block      1.6            6
screw      8              3
battery    5              3
key        4.25           3
bead       1              2
Now we can Plot
[Figure: scatter plot of Num Surfaces vs. Length/Width, K=3, with the objects falling into three groups: box/block/eraser, knob/penny/dime/bead, and key/battery/screw]
K-Means key insight: we can find clusters using distances in n-dimensional feature space.
K-Means: find the “best” (minimum-distance) circles that include all points.
K-Means Algorithm
[Figure: animation frames of the K-Means algorithm running with K=3]
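A minimal NumPy sketch of the iteration the frames above illustrate (Lloyd's algorithm); the function name and defaults are illustrative, not BigML's implementation:

import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k random instances as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # stop when the centroids no longer move
            break
        centroids = new
    return centroids, labels

Run on the Length/Width and Num Surfaces features above with k=3, this should recover the three groups shown in the plot.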
Features Matter!
[Figure: the same objects grouped instead by material (Metal, Wood, Other), showing that a different choice of features gives different clusters]
Convergence
[Figure: animation frames of the centroids converging]
• Convergence is guaranteed, but not necessarily unique.
• Starting points are important (k++).
Starting Points
• Random points or instances in n-dimensional space.
• Choose points “farthest” away from each other
  • but this is sensitive to outliers.
• k++ (k-means++)
  • the first center is chosen randomly from the instances
  • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the instance's closest existing cluster center (a sketch follows below)
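A sketch of the k++ selection rule described above, assuming NumPy; the function name is illustrative:

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # first center: a random instance
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each instance to its closest existing center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # next center: drawn with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)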
Scaling
[Figure: houses compared on price vs. number of bedrooms; the distance along the price axis is d = 160,000 while the distance along the bedrooms axis is d = 1, so without scaling, price dominates the Euclidean distance]
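A common fix is to standardize each feature before clustering so that price and bedrooms contribute on comparable scales; a sketch with illustrative numbers, assuming NumPy:

import numpy as np

# Without scaling, price differences (~160,000) swamp bedroom differences (~1),
# so Euclidean distance is effectively just "distance in price".
houses = np.array([[250_000, 3],
                   [410_000, 4],
                   [180_000, 2]], dtype=float)  # columns: price, bedrooms

# Standardize: subtract each column's mean and divide by its standard deviation.
scaled = (houses - houses.mean(axis=0)) / houses.std(axis=0)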
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing Value?
• Nonsense! Try replacing missing values with one of the following (a sketch follows below):
  • Maximum
  • Mean
  • Median
  • Minimum
  • Zero
• Or ignore instances with missing values.
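A sketch of the replacement options, assuming NumPy and numeric features (a library imputer such as scikit-learn's SimpleImputer does the same job):

import numpy as np

def impute(X, strategy="mean"):
    # Replace NaNs in each column before computing distances.
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        if strategy == "mean":
            fill = np.nanmean(col)
        elif strategy == "median":
            fill = np.nanmedian(col)
        elif strategy == "max":
            fill = np.nanmax(col)
        elif strategy == "min":
            fill = np.nanmin(col)
        else:  # "zero"
            fill = 0.0
        col[np.isnan(col)] = fill
    return X

# Or drop the instances with missing values instead:
# X_complete = X[~np.isnan(X).any(axis=1)]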
Distance to Categorical?
One approach, similar to “k-prototypes”:
• Special distance function:
  if valA == valB then
      distance = 0 (or a scaling value)
  else
      distance = 1
• Assign the centroid the most common category of the member instances.
• Compute the Euclidean distance as normal, using these per-feature distances (see the example and sketch below).
Distance to Categorical?
             feature_1   feature_2   feature_3
instance_1   red         cat         ball
instance_2   red         cat         ball
instance_3   red         cat         box
instance_4   blue        dog         fridge
D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
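A sketch of this distance function, reproducing the example above (each mismatched feature contributes 1 to the sum of squares, each match contributes 0):

import math

def categorical_distance(a, b):
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

instance_1 = ("red", "cat", "ball")
instance_2 = ("red", "cat", "ball")
instance_3 = ("red", "cat", "box")
instance_4 = ("blue", "dog", "fridge")

categorical_distance(instance_1, instance_2)  # 0.0
categorical_distance(instance_1, instance_3)  # 1.0
categorical_distance(instance_1, instance_4)  # sqrt(3) ≈ 1.732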
Only Euclidean?
• Cosine similarity: ranges from 1 (same direction) through 0 to -1 (opposite direction).
• Mahalanobis distance.
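A sketch of both alternatives, assuming NumPy; cosine similarity compares directions rather than magnitudes, and Mahalanobis distance rescales Euclidean distance by the data's covariance:

import numpy as np

def cosine_similarity(a, b):
    # 1 = same direction, 0 = orthogonal, -1 = opposite direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mahalanobis(a, b, cov):
    # cov is the covariance matrix of the data the points come from
    diff = a - b
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)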
Finding K: G-means
[Figure: animation frames of the G-means algorithm]
• Let K=2 → keep 1 cluster, split 1 → new K=3
• Let K=3 → keep 1 cluster, split 2 → new K=5
• Let K=5 → K=5 (no further splits; a sketch of the loop follows below)
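A rough sketch of the loop the frames illustrate, assuming scikit-learn and SciPy: each cluster is kept if its points look Gaussian along the axis of a trial two-way split (Anderson-Darling test), and split otherwise. The thresholds and helper names are illustrative, not BigML's implementation:

import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_gaussian(points):
    # Trial-split the cluster in two, project its points onto the axis joining
    # the two child centers, and test that projection for normality.
    if len(points) < 8:
        return True
    children = KMeans(n_clusters=2, random_state=0).fit(points)
    axis = children.cluster_centers_[0] - children.cluster_centers_[1]
    proj = points @ axis / np.linalg.norm(axis)
    result = anderson(proj, dist="norm")
    return result.statistic < result.critical_values[2]  # 5% significance level

def g_means(X, k=2, max_k=20):
    while k < max_k:
        labels = KMeans(n_clusters=k, random_state=0).fit(X).labels_
        # keep Gaussian-looking clusters, split the rest
        new_k = sum(1 if looks_gaussian(X[labels == c]) else 2 for c in range(k))
        if new_k == k:
            return k  # every cluster kept: K has been found
        k = new_k
    return k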
