Unsupervised Learning
Harsha Vardhan Reddy Burri
Unsupervised Learning
• There is no output, response, or target
variable; only the input variables (X) are available
• The main goal is to identify hidden
patterns and relationships in the data
• Typical tasks: forming clusters and estimating how the data
are distributed in the feature space (density estimation)
• Example: grouping fruits by physical characteristics
such as shape, color, and odor
Grouping:
• Green color: bananas and grapes
– Green color and big size: bananas
– Green color and small size: grapes
• Red color: apples and cherries
– Red color and big size: apples
– Red color and small size: cherries
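A minimal Python sketch of the color-and-size grouping described above. The measurements and the 5 cm cut-off between "small" and "big" are hypothetical values chosen for illustration. The grouping rule here is written by hand, whereas a clustering algorithm would have to discover such groups from the data.

```python
from collections import defaultdict

# Hypothetical observations: (color, approximate length in cm).
# No fruit names are given, only measurable features.
observations = [
    ("green", 18.0), ("green", 2.0), ("red", 8.0), ("red", 2.5),
    ("green", 20.0), ("red", 7.5), ("green", 1.8), ("red", 2.2),
]

SIZE_THRESHOLD_CM = 5.0  # assumed cut-off between "small" and "big"


def group_by_color_and_size(items, threshold=SIZE_THRESHOLD_CM):
    """Group observations by color first, then by size category."""
    groups = defaultdict(list)
    for color, size in items:
        size_label = "big" if size >= threshold else "small"
        groups[(color, size_label)].append(size)
    return dict(groups)


groups = group_by_color_and_size(observations)
for key in sorted(groups):
    print(key, groups[key])
```

Each of the four resulting groups corresponds to one fruit in the slide (green and big for bananas, green and small for grapes, and so on), even though no fruit label was ever supplied.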
Real-life examples
• You meet strangers at a party and need to
group them without any prior knowledge. How?
On the basis of gender, age, habits, and other
behavioural traits
• You find a new instance that differs from the
others. How do you detect or classify it?
Challenges
• Harder than supervised learning tasks.
• Dealing with a large number of dimensions and a large number of
data items can be problematic because of time complexity.
• The effectiveness of the method depends on the definition of
“distance” (for distance‐based clustering).
• The result of the clustering algorithm (which in many cases can
itself be arbitrary) can be interpreted in different ways.
• How do we know whether the results are meaningful when no answer
labels are available?
– Let an expert look at the results (external evaluation)
– Define an objective function on the clustering (internal
evaluation)
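As a rough illustration of internal evaluation, the sketch below scores a clustering with the within-cluster sum of squares (WCSS), one common objective function and not necessarily the one intended here. The function name `wcss` and the two example clusterings are hypothetical.

```python
def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster mean (1-D data)."""
    total = 0.0
    for points in clusters:
        mean = sum(points) / len(points)
        total += sum((x - mean) ** 2 for x in points)
    return total


# Two hypothetical ways of splitting the same six points into two clusters.
tight = [[1, 2, 3], [10, 11, 12]]
loose = [[1, 2, 10], [3, 11, 12]]

print(wcss(tight))  # lower score: points sit close to their cluster mean
print(wcss(loose))  # higher score: each cluster mixes distant points
```

A lower score means more compact clusters, so two candidate clusterings of the same data can be compared without any labels.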
Applications
• Can be applied in many fields
• Market analysis:
– Grouping customers
• Biology:
– Classification of plants and animals given their features
– Analysis of genes and genomes
• Insurance:
– Identifying groups of motor insurance policyholders
with a high average claim cost; identifying fraud
• Earthquake studies:
– Clustering observed earthquake epicenters to identify
dangerous zones
• World Wide Web:
– Document classification; clustering weblog data to discover
groups of similar access patterns
Types of Unsupervised algorithms
• K‐means clustering
• Hierarchical clustering
• Principal Component Analysis (PCA)
K‐means Clustering
• Unsupervised learning algorithm
• Works on unlabelled data (no target label)
• Goal is to find patterns and form clusters
Steps in K‐means:
• 1: Pick k random points as cluster centers (also called
centroids): c1, c2, c3, …, ck
• 2: Assign each data point to the nearest cluster by calculating
its distance to each centroid
• 3: Find each new cluster center by taking the average of
its assigned points
• 4: Repeat steps 2 and 3 until none of the cluster
assignments change
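The four steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming 2-D points, Euclidean distance, and initial centroids drawn from the data itself; the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are all hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k of the data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once no assignment has changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical sample: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster receives index 0 depends on the random initial pick, so only the grouping itself is meaningful, not the label numbers.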
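Dataset = [2, 3, 4, 10, 11, 12, 20, 25, 30]  # monthly expenditure (in 1000s) of customers
With k = 2, the clusters and their means (rounded to whole numbers) at each step:
• Step 1: {2, 3, 4}, mean = 3; {10, 11, 12, 20, 25, 30}, mean = 18
• Step 2: {2, 3, 4, 10}, mean ≈ 5; {11, 12, 20, 25, 30}, mean ≈ 20
• Step 3: {2, 3, 4, 10, 11}, mean = 6; {12, 20, 25, 30}, mean ≈ 22
• Step 4: {2, 3, 4, 10, 11, 12}, mean = 7; {20, 25, 30}, mean = 25
The trace above moves one point per step. Under the strict nearest-centroid rule, 12 also joins the first cluster at step 3 (it is 7 away from mean 5 and 8 away from mean 20), which gives the final clusters one step earlier.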
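The expenditure example can be checked with a short 1-D run. This sketch assumes the starting centroids are 3 and 18 (the first pair of means listed), and the helper name `kmeans_1d_from` is hypothetical. Because every point is reassigned at once, the run may take fewer steps than a hand-worked trace, but it ends at the same final clusters.

```python
def kmeans_1d_from(data, start_centroids):
    """Run k-means on 1-D data from the given starting centroids."""
    centroids = list(start_centroids)
    previous = None
    while True:
        # Assign every value to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            for x in data
        ]
        if labels == previous:  # no assignment changed: converged
            return centroids, labels
        previous = labels
        # Move each centroid to the mean of its assigned values.
        for j in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)


dataset = [2, 3, 4, 10, 11, 12, 20, 25, 30]  # monthly expenditure (in 1000s)
centroids, labels = kmeans_1d_from(dataset, [3.0, 18.0])  # assumed start
clusters = [[x for x, lab in zip(dataset, labels) if lab == j] for j in range(2)]
print(centroids)  # [7.0, 25.0]
print(clusters)   # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
```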
Applications:
1. Image segmentation
2. Clustering genome data – gene segments
3. Data mining segmentation
4. Anomaly detection
5. Instance classification
6. Customer classification