CHAPTER IV: CLUSTERING
In this chapter we introduce a model based on unsupervised learning: clustering.
4.1 INTRODUCTION TO CLUSTERING:
In general, clustering is the task of grouping, in an unsupervised way (i.e. without the
prior help of an expert), a set of objects or, more broadly, data, in such a way that the objects in
the same group, called a cluster, are closer (in the sense of a chosen (dis)similarity criterion)
to each other than to those in other groups. It is a key task in exploratory data mining and a
statistical data analysis technique widely used in many fields, including machine learning,
pattern recognition, signal and image processing, and information retrieval.
4.2 DEFINITION OF CLUSTERING:
Let D = {Xi ∈ R^d, i = 1, ..., N} be a set of observations described by d attributes. R^d is the
representation space: each observation Xi is represented by a vector of dimension d.
Definition : Clustering.
Clustering, also known as unsupervised automatic classification, is an exploratory data
analysis technique aimed at structuring data into homogeneous classes (clusters): the aim is
to group the data in the set D into clusters (classes) such that the data within a cluster are as
similar as possible.
Figure 4.1 illustrates the purpose of clustering: exploring unlabelled data in order to form
clusters.
Fig 4.1 Aim of clustering: grouping data into clusters (left: unlabelled data; right: clustered data)
4.3 WHAT DATA?
The data processed in automatic classification can be images, signals, texts, etc. Often,
the data are multidimensional: each datum is made up of several variables (descriptors), as in
the representation of a colour. In the case of standard multidimensional data, each datum is of
dimension d (i.e. a point in the space R^d) and may be labelled if its class is known. The n data
items can therefore be modelled as an n × d matrix. In the case of a colour image with n rows
and m columns, there are n × m colour pixels, each pixel being made up of three RGB
components.
Another example of data that can be used for clustering is Quinlan's "weather" dataset
(see section 1.8). In this case, the data are numeric (Temperature and Humidity), nominal
(Sky) or Boolean (Wind).
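Before such mixed data can be clustered, each record usually has to be converted into a numeric vector so that a distance can be computed. Here is a minimal sketch of one common encoding (the attribute names and the list of Sky values are illustrative assumptions, not taken from the original dataset): the nominal attribute is one-hot encoded and the Boolean one is mapped to 0/1.

# Minimal sketch: encoding one "weather"-style record as a numeric vector.
# The attribute names and the SKY_VALUES list below are illustrative assumptions.
SKY_VALUES = ["sunny", "overcast", "rainy"]

def encode(record):
    """Turn a mixed-type record (dict) into a list of numbers usable by a distance."""
    vec = [record["temperature"], record["humidity"]]                 # numeric attributes kept as-is
    vec += [1.0 if record["sky"] == v else 0.0 for v in SKY_VALUES]   # nominal -> one-hot
    vec.append(1.0 if record["wind"] else 0.0)                        # Boolean -> 0/1
    return vec

print(encode({"temperature": 21, "humidity": 65, "sky": "sunny", "wind": False}))
# [21, 65, 1.0, 0.0, 0.0, 0.0]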
4.4 WHAT IS A CLUSTER?
Here is a definition of a cluster.
Definition: Cluster.
A cluster (group or class) is a set of homogeneous data, in other words data between which
there is a certain similarity.
Clusters can be constructed using a similarity function or a dissimilarity function. In the first
case, clusters are formed by elements between which there is greater similarity. In the second
case, the elements in a cluster are less dissimilar.
Often, dissimilarity measures are used to create clusters. These measures can be distances,
densities or probabilities.
4.5 CLUSTERING APPLICATIONS
Table 4.1 gives examples of clustering applications. For each field, it specifies the data
processed and the meaning of the targeted clusters.
Table 4.1 Clustering applications

Field              | Data                                     | Targeted clusters
Text Mining        | Texts, mails                             | Related texts; automatic document classification
Web Mining         | Texts or images                          | Related web pages
Bioinformatics     | Genes                                    | Similar genes
Marketing          | Customer information, products purchased | Customer segmentation
Image segmentation | Images                                   | Homogeneous zones in the image
4.6 HOW MANY CLUSTERS ?
The number of clusters (k) can be assumed to be fixed (given by the user). This is the
case, for example, if we are interested in classifying images of handwritten numbers. We
know that number of classes is 10 (numbers: 0, 1, ..., 9), or handwritten letters (number of
classes = number of characters in the alphabet), etc. However, there are criteria, known as
model selection criteria, which can be used to choose the optimum number of classes in a
dataset [11].
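As an illustration of the idea (not of a specific criterion from [11]), a simple heuristic is the so-called elbow method: run the clustering for several values of k and keep the value beyond which the intra-cluster inertia stops decreasing sharply. Here is a minimal sketch, assuming scikit-learn is available; the three-group synthetic data are made up only for the example.

# Minimal sketch of the "elbow" heuristic for choosing k (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])          # three synthetic groups

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # inertia_ = total squared distance to the centres
# The inertia typically drops sharply up to the true number of groups (3 here),
# then flattens; the "elbow" of this curve is the chosen k.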
4.7 CONCEPTS OF SIMILARITY, DISSIMILARITY AND DISTANCE:
Two pieces of data are said to be similar if a certain measure can be found that assesses their
closeness (similarity).
Let SM denote the similarity measure: the similarity between the data increases as the SM
value increases.
Data can also be compared using a dissimilarity measure (denoted DM): the similarity
between the data increases as the DM value decreases. In practice, DM is usually a distance
measure.
Before presenting the main distance measures used in research, recall that each datum
Xi ∈ R^d is represented by a vector of dimension d.
Several distance measures can be found in the literature to calculate the distance between
two data points Xi and Xj. We will limit ourselves to the Minkowski, Euclidean and
Manhattan distances.
Minkowski distance: it is defined by the following expression, where q is a
parameter. It is a generalisation of the Euclidean, Manhattan and Chebyshev distances.

$d_q(X_i, X_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^q \right)^{1/q}$   (1)

Euclidean distance: it is obtained from the Minkowski distance by taking q = 2.

$d_2(X_i, X_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$   (2)

Manhattan distance: it is obtained from the Minkowski distance by taking q = 1.

$d_1(X_i, X_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|$   (3)
The Euclidean distance is by far the most widely used in the literature.
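To make expressions (1) to (3) concrete, here is a minimal sketch in plain Python (the function names are ours); the example reuses two points, A and C, from Table 4.2 further below.

# Minimal sketch of the Minkowski, Euclidean and Manhattan distances (expressions 1-3).
def minkowski(x, y, q):
    """Minkowski distance of order q between two d-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)   # special case q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # special case q = 1

A, C = (1, 1), (4, 3)           # two points of R^2 (products A and C of Table 4.2)
print(euclidean(A, C))          # 3.605... (the value of about 3.61 used in section 4.9.2)
print(manhattan(A, C))          # 5.0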
4.8 CLUSTERING QUALITY EVALUATION:
Several measures have been proposed to evaluate the quality of a clustering. These include
intra-cluster inertia and inter-cluster inertia.
4.8.1 Intra-cluster inertia:
Intra-cluster inertia measures the concentration of the points of each cluster around the
cluster's centre of gravity. It is desirable to minimise intra-cluster inertia.
Each cluster Ck is characterised by its centre of gravity µk and its inertia Ik.
Centre of gravity:

$\mu_k = \frac{1}{n_k} \sum_{X_i \in C_k} X_i$   (4)

with $n_k = \mathrm{cardinality}(C_k)$.
Inertia Ik:

$I_k = \frac{1}{n_k} \sum_{X_i \in C_k} d^2(X_i, \mu_k)$   (5)

The inertia of a cluster is therefore equal to the variance of the points in that cluster.
Intra-cluster inertia is expressed as:

$I_{\mathrm{intra}} = \sum_{k=1}^{K} I_k$   (6)

Intra-cluster inertia is therefore equal to the sum of the inertias of all the clusters.
4.8.2 Inter-cluster inertia:
Inter-cluster inertia measures how far apart the cluster centres are from one another. It is
desirable to maximise inter-cluster inertia.
Let µ be the centre of gravity of the entire dataset:

$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$   (7)

where N is the total number of data points.
The centres of gravity of the clusters also form a cloud of points, characterised by the inter-
cluster inertia defined by the following expression:

$I_{\mathrm{inter}} = \sum_{k=1}^{K} \frac{n_k}{N}\, d^2(\mu_k, \mu)$   (8)
According to what we have just presented, in order to have good quality clustering, we need
to minimise intra-cluster inertia and maximise inter-cluster inertia.
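To make expressions (4) to (8) concrete, here is a minimal NumPy sketch (our own code, using the normalised definitions given above) that computes the intra- and inter-cluster inertia of a given partition; it is run on the small dataset of Table 4.2 used later in this chapter.

# Minimal sketch: intra- and inter-cluster inertia (expressions 4 to 8) with NumPy.
import numpy as np

def inertias(X, labels):
    """X: (N, d) array of data; labels: cluster index of each point."""
    mu = X.mean(axis=0)                                    # global centre of gravity (7)
    intra, inter, N = 0.0, 0.0, len(X)
    for k in np.unique(labels):
        Ck = X[labels == k]
        mu_k = Ck.mean(axis=0)                             # cluster centre of gravity (4)
        intra += ((Ck - mu_k) ** 2).sum() / len(Ck)        # cluster inertia (5), summed (6)
        inter += len(Ck) / N * ((mu_k - mu) ** 2).sum()    # inter-cluster contribution (8)
    return intra, inter

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # the data of Table 4.2
print(inertias(X, np.array([0, 0, 1, 1])))   # (0.75, 3.8125): low intra, high inter
print(inertias(X, np.array([0, 1, 0, 1])))   # (7.75, 0.3125): high intra, low inter

The first partition ({A, B} and {C, D}) has low intra-cluster inertia and high inter-cluster inertia; the mixed partition behaves the other way round, which is exactly the quality criterion stated above.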
Example: Figure 4.2 below shows two clusterings obtained on the same dataset. The
clustering in panel B is better than that in panel A: it has lower intra-cluster inertia and
higher inter-cluster inertia.
Fig 4.2 Intra-cluster inertia and inter-cluster inertia
4.9 CLUSTERING ALGORITHMS:
There are several clustering algorithms, using different approaches. The most important
categories include hierarchical clustering, density-based clustering and partition-based
clustering.
Hierarchical clustering: builds a multi-level hierarchy of clusters by creating a tree of
clusters, called a dendrogram.
Density-based clustering: groups points that lie close together in areas of high density.
Algorithms based on this approach include DBSCAN (Density-Based Spatial Clustering
of Applications with Noise).
Partition-based clustering: the data are partitioned into k distinct clusters according to
their distance to the cluster centroids. Algorithms based on this approach include
Kmeans, the algorithm presented in the next section.
4.9.1 Kmeans algorithm
Kmeans (also known as the moving centres algorithm) is one of the clustering algorithms
based on partitioning.
This algorithm was designed in 1957 at Bell Laboratories by Stuart P. Lloyd, but was not
presented to the general public until 1982. In 1965, Edward W. Forgy had already published
an essentially similar algorithm, which is why Kmeans is often referred to as the Lloyd-Forgy
algorithm.
Its principle is shown in the pseudocode below.
Algorithm 4.1 Kmeans
Algorithm Kmeans
Input: data set D; k: desired number of clusters.
Output: k formed clusters.
Begin
  Randomly initialise the cluster centres µ1, µ2, ..., µk.
  Repeat
    Assign each point to its nearest cluster:
      Cl ← Xi where l = arg min_j distance(Xi, µj)
    Recalculate the centre µk of each cluster:
      µk = (1/nk) Σ_{Xi ∈ Ck} Xi, with nk = cardinality(Ck)
  Until the centres of gravity (µ1, µ2, ..., µk) become stable
End.
The algorithm takes as input the data to be processed and the number of clusters k. The output
is k clusters.
Initially, the algorithm randomly initialises the centres of the k clusters. It then iteratively
calculates the distance between each data item in the dataset and each of the k centres of
gravity, and assigns the data item to the cluster whose centre is closest. At the end of each
iteration, all the centres of gravity are recalculated.
The algorithm ends when the centres of gravity are stable.
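As an illustration, here is a minimal NumPy sketch of Algorithm 4.1 (our own implementation of the scheme above, using the Euclidean distance); the optional init argument lets us force the initial centres, which will be convenient for checking the worked example of section 4.9.2.

# Minimal sketch of the Kmeans algorithm (Algorithm 4.1) with NumPy.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0, init=None):
    """X: (N, d) data array; k: number of clusters. Returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    if init is None:
        centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
    else:
        centres = np.array(init, dtype=float)                    # user-supplied initial centres
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of the points assigned to it.
        # (This simple sketch assumes no cluster ever becomes empty.)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                    # centres stable: stop
            break
        centres = new_centres
    return centres, labels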
4.9.2 Example of clustering with Kmeans
We will apply the Kmeans algorithm to the following example. The table below shows data
on 4 products (A, B, C and D) described by 2 attributes, X and Y.
Table 4.2 Kmeans application example data

Article | X | Y
A       | 1 | 1
B       | 2 | 1
C       | 4 | 3
D       | 5 | 4
Task: Plot the data and apply Kmeans with k=2.
µ1 and µ2 are the centres of clusters C1 and C2.
Let's randomly take µ1=A and µ2=B as the initial centres. Here is the graph representing the
initial state.
Fig 4.3. Application of Kmeans: random selection of the initial cluster centres of gravity
The first iteration:
Calculation of the Euclidean distance of each datum (point) from the centres of gravity.
Dist(A, µ1) = √((1−1)² + (1−1)²) = 0
Dist(A, µ2) = √((1−2)² + (1−1)²) = 1
Dist(B, µ1) = √((2−1)² + (1−1)²) = 1
Dist(B, µ2) = √((2−2)² + (1−1)²) = 0
Dist(C, µ1) = √((4−1)² + (3−1)²) = √13 ≈ 3.61
Dist(C, µ2) = √((4−2)² + (3−1)²) = √8 ≈ 2.83
Dist(D, µ1) = √((5−1)² + (4−1)²) = √25 = 5
Dist(D, µ2) = √((5−2)² + (4−1)²) = √18 ≈ 4.24
Based on these calculations, we assign:
A to cluster C1,
B to cluster C2,
C to cluster C2,
D to cluster C2.
Here is the composition of the two clusters:
C1 = {A} and C2 = {B, C, D}
The following figure shows the temporary clusters obtained after this first iteration.
Fig 4.4. Kmeans application: after the first iteration.
Recalculating the cluster centres of gravity:
µ1 = (1, 1)
µ2 = ((2+4+5)/3, (1+3+4)/3) ≈ (3.67, 2.67)
The second iteration:
Calculate the Euclidean distance of each datum (point) from the centres of gravity.
Dist(A, µ1) = √((1−1)² + (1−1)²) = 0
Dist(A, µ2) = √((1−3.67)² + (1−2.67)²) ≈ 3.14
Dist(B, µ1) = √((2−1)² + (1−1)²) = 1
Dist(B, µ2) = √((2−3.67)² + (1−2.67)²) ≈ 2.36
Dist(C, µ1) = √((4−1)² + (3−1)²) ≈ 3.61
Dist(C, µ2) = √((4−3.67)² + (3−2.67)²) ≈ 0.47
Dist(D, µ1) = √((5−1)² + (4−1)²) = 5
Dist(D, µ2) = √((5−3.67)² + (4−2.67)²) ≈ 1.89
According to these calculations:
A remains in cluster C1,
B changes cluster and is assigned to C1,
C and D remain in cluster C2.
This gives the following configuration:
C1 = {A, B} and C2 = {C, D}
Recalculating the cluster centres of gravity:
µ1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
µ2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
Fig 4.5. Kmeans application: after the second iteration.
Third iteration:
Calculating the distances:
Dist(A, µ1) = √((1−1.5)² + (1−1)²) = 0.50
Dist(A, µ2) = √((1−4.5)² + (1−3.5)²) ≈ 4.30
Dist(B, µ1) = √((2−1.5)² + (1−1)²) = 0.50
Dist(B, µ2) = √((2−4.5)² + (1−3.5)²) ≈ 3.54
Dist(C, µ1) = √((4−1.5)² + (3−1)²) ≈ 3.20
Dist(C, µ2) = √((4−4.5)² + (3−3.5)²) ≈ 0.71
Dist(D, µ1) = √((5−1.5)² + (4−1)²) ≈ 4.61
Dist(D, µ2) = √((5−4.5)² + (4−3.5)²) ≈ 0.71
According to these calculations, there is no change in the clusters:
C1 = {A, B} and C2 = {C, D}
The centres of gravity are stable (they remain unchanged). The algorithm stops.
Fig 4.6. Kmeans application: final situation.
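As a check, the kmeans sketch given in section 4.9.1 reproduces this result when it is forced to start from the same initial centres A and B.

# Re-running the sketch of section 4.9.1 on the data of Table 4.2, starting from A and B.
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)    # A, B, C, D
centres, labels = kmeans(X, k=2, init=[[1, 1], [2, 1]])        # initial centres = A and B
print(labels)     # [0 0 1 1]  ->  C1 = {A, B}, C2 = {C, D}
print(centres)    # final centres of gravity: (1.5, 1) and (4.5, 3.5)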