Cluster Analysis Data Types

The document discusses the types of data used in cluster analysis: interval-valued (numeric), binary, categorical, ordinal, and mixed-type variables. It shows how to calculate the dissimilarity between objects for each type of variable, using Euclidean distance for interval data and contingency-table-based symmetric and asymmetric measures for binary data.

Uploaded by

rupali wadwale

Data Mining

5. Cluster Analysis

5.2 Types of Data in Cluster Analysis

Fall 2009

Instructor: Dr. Masoud Yaghini

Types of Data in Cluster Analysis

Outline

Data Structures
Interval-Valued (Numeric) Variables
Binary Variables
Categorical Variables
Ordinal Variables
Variables of Mixed Types
References

Data Structures

Clustering algorithms typically operate on either of the following two data structures:
– Data matrix
– Dissimilarity matrix

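Data matrix

This represents n objects, such as persons, with p variables (measurements or attributes), such as age, height, weight, gender, and so on.
The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$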

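Dissimilarity matrix

It is often represented by an n-by-n table, where d(i, j) is the measured difference or dissimilarity between objects i and j:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

In general, d(i, j) is a nonnegative number that
– is close to 0 when objects i and j are highly similar or “near” each other
– becomes larger the more they differ
Because d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle needs to be written out.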
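The relationship between the two data structures can be sketched in a few lines of Python. This is only an illustration: the function names, the sample rows, and the placeholder distance (a sum of absolute differences) are assumptions made here, not part of the slides.

```python
def dissimilarity_matrix(data, dist):
    """Build the full n-by-n dissimilarity matrix from an n-by-p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]      # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i):                 # compute the lower triangle only
            d[i][j] = d[j][i] = dist(data[i], data[j])   # d(i, j) = d(j, i)
    return d


def abs_diff_sum(a, b):
    """Placeholder distance: sum of absolute differences over the p variables."""
    return sum(abs(u - v) for u, v in zip(a, b))


# Hypothetical data matrix: 3 objects, 2 variables (age, height in meters).
data = [[30.0, 1.70], [30.0, 1.70], [45.0, 1.60]]
D = dissimilarity_matrix(data, abs_diff_sum)
for row in D:
    print(row)
```

Objects 1 and 2 are identical, so their dissimilarity is 0, while object 3 is far from both.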

Types of data in cluster analysis

Dissimilarity can be computed for
– Interval-scaled (numeric) variables
– Binary variables
– Categorical (nominal) variables
– Ordinal variables
– Ratio-scaled variables
– Variables of mixed types

Interval-Valued (Numeric) Variables

Interval-valued variables

Interval-scaled (numeric) variables are continuous measurements on a roughly linear scale.
Examples
– weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis
– For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure.

Data Standardization

Expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should be standardized.
Standardizing measurements attempts to give all variables an equal weight.
To standardize measurements, one choice is to convert the original measurements to unitless variables.

Standardize data
– Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
– where $x_{1f}, \ldots, x_{nf}$ are the n measurements of variable f and $m_f$ is their mean:
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
– Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
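A minimal Python sketch of this standardization follows. The function name and the height values are hypothetical; the point is only that the z-scores come out the same whether the heights are given in meters or in inches.

```python
def standardize(values):
    """Return z-scores using the mean absolute deviation s_f (not the std. dev.)."""
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    # Note: s is 0 when all values are equal; that case is not handled here.
    return [(x - m) / s for x in values]


# Hypothetical heights of five people, in meters and in inches.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_in = [h * 39.37 for h in heights_m]

z_m = standardize(heights_m)
z_in = standardize(heights_in)
print([round(z, 3) for z in z_m])
print([round(z, 3) for z in z_in])
```

Both printed lists agree, which is the unit independence that standardization is meant to provide.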

Using the mean absolute deviation is more robust to outliers than using the standard deviation.
When computing the mean absolute deviation, the deviations from the mean are not squared; hence, the effect of outliers is somewhat reduced.
Standardization may or may not be useful in a particular application.
– Thus the choice of whether and how to perform standardization should be left to the user.
Methods of standardization are also discussed under normalization techniques for data preprocessing.

Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects described by interval-scaled variables.

Euclidean distance: the most popular distance measure
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
– where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects
Manhattan (city block) distance: another well-known metric
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Example: Let x1 = (1, 2) and x2 = (3, 5) represent two objects.
– The Euclidean distance between them is $\sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.61$.
– The Manhattan distance between them is 2 + 3 = 5.
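The two distances can be checked for these two objects with a short Python sketch; the function names are chosen here for illustration only.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(u - v) for u, v in zip(a, b))


x1, x2 = (1, 2), (3, 5)
print(euclidean(x1, x2))   # sqrt(13), about 3.61
print(manhattan(x1, x2))   # 5
```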

Properties of Euclidean and Manhattan distances:
– d(i, j) ≥ 0: Distance is a nonnegative number.
– d(i, i) = 0: The distance of an object to itself is 0.
– d(i, j) = d(j, i): Distance is a symmetric function.
– d(i, j) ≤ d(i, k) + d(k, j): Going directly from object i to object j in space is no more than making a detour over any other object k (triangle inequality).

Minkowski distance: a generalization of both Euclidean distance and Manhattan distance
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
– where q is a positive integer
– It represents the Manhattan distance when q = 1 and Euclidean distance when q = 2
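As a rough sketch (the function name and sample points are assumptions made for illustration), the Minkowski distance can be written once and specialized by the choice of q:

```python
def minkowski(a, b, q):
    """Minkowski distance of order q between two equal-length sequences."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(u - v) ** q for u, v in zip(a, b)) ** (1.0 / q)


x1, x2 = (1, 2), (3, 5)
d1 = minkowski(x1, x2, 1)   # q = 1: Manhattan distance
d2 = minkowski(x1, x2, 2)   # q = 2: Euclidean distance
d3 = minkowski(x1, x2, 3)   # a higher order, for comparison
print(d1, d2, d3)
```

For fixed points the value does not increase as q grows, so d1 ≥ d2 ≥ d3 here.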

Binary Variables

A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present.
Given the variable smoker describing a patient,
– 1 indicates that the patient smokes
– 0 indicates that the patient does not.
Treating binary variables as if they are interval-scaled can lead to misleading clustering results.
Therefore, methods specific to binary data are necessary for computing dissimilarities.

One approach involves computing a dissimilarity matrix from the given binary data.
If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table

Contingency Table

                     object j
                  1       0       sum
object i    1     q       r       q + r
            0     s       t       s + t
           sum    q + s   r + t   p

where
– q is the number of variables that equal 1 for both objects i and j,
– r is the number of variables that equal 1 for object i but that are 0 for object j,
– s is the number of variables that equal 0 for object i but equal 1 for object j, and
– t is the number of variables that equal 0 for both objects i and j.
– p is the total number of variables, p = q + r + s + t.
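The four counts can be tallied directly from two binary vectors. The sketch below assumes each object is a list of 0/1 values of equal length; the function name and the sample vectors are hypothetical.

```python
def contingency_counts(obj_i, obj_j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if (a, b) == (1, 1):
            q += 1          # 1 for both objects
        elif (a, b) == (1, 0):
            r += 1          # 1 for i, 0 for j
        elif (a, b) == (0, 1):
            s += 1          # 0 for i, 1 for j
        elif (a, b) == (0, 0):
            t += 1          # 0 for both objects
        else:
            raise ValueError("values must be 0 or 1")
    return q, r, s, t


# Hypothetical objects described by six binary variables.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
q, r, s, t = contingency_counts(obj_i, obj_j)
print(q, r, s, t)   # 2 0 1 3
```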
Symmetric Binary Dissimilarity

A binary variable is symmetric if both of its states are equally valuable and carry the same weight
– Example: the attribute gender having the states male and female.
Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
The dissimilarity between objects i and j:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$

Asymmetric Binary Dissimilarity

A binary variable is asymmetric if the outcomes of the states are not equally important.
– Example: the positive and negative outcomes of an HIV test.
– By convention, we code the most important outcome, which is usually the rarest one, by 1 (HIV positive) and the other by 0 (HIV negative).
Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match).
Therefore, such binary variables are often considered “monary” (as if having one state).
The dissimilarity based on such variables is called asymmetric binary dissimilarity.

In asymmetric binary dissimilarity the number of negative matches, t, is considered unimportant and thus is ignored in the computation:
$$d(i, j) = \frac{r + s}{q + r + s}$$

Asymmetric Binary Similarity

The asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as
$$sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$$
The coefficient sim(i, j) is called the Jaccard coefficient.
When both symmetric and asymmetric binary variables occur in the same data set, the mixed variables approach can be applied (described later).
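The three binary measures differ only in how the counts q, r, s, and t are combined. The following Python sketch takes the counts as given; the function names and the example counts are assumptions for illustration, and raising an error when q + r + s = 0 is a choice made here, since the slides do not say how to treat that case.

```python
def symmetric_dissimilarity(q, r, s, t):
    """Mismatches over all variables; negative matches (t) count."""
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(q, r, s, t):
    """Mismatches over all variables except negative matches (t is ignored)."""
    if q + r + s == 0:
        raise ValueError("undefined when both objects are 0 on every variable")
    return (r + s) / (q + r + s)


def jaccard_similarity(q, r, s, t):
    """Positive matches over all variables except negative matches."""
    return 1.0 - asymmetric_dissimilarity(q, r, s, t)


# Hypothetical counts: 2 positive matches, 1 mismatch, 3 negative matches.
q, r, s, t = 2, 0, 1, 3
print(symmetric_dissimilarity(q, r, s, t))    # 1/6
print(asymmetric_dissimilarity(q, r, s, t))   # 1/3
print(jaccard_similarity(q, r, s, t))         # 2/3
```

Ignoring the three negative matches doubles the dissimilarity in this case, which shows how much the choice between the symmetric and asymmetric measures can matter.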

Example: Dissimilarity Between Binary Variables

Suppose that a patient record table contains the attributes:
– name: an object identifier
– gender: a symmetric attribute
– fever, cough, test-1, test-2, test-3, test-4: the asymmetric attributes

For asymmetric attribute values
– let the values Y (yes) and P (positive) be set to 1, and
– the value N (no or negative) be set to 0.
Suppose that the distance between objects (patients) is computed based only on the asymmetric variables.
The distance between each pair of the three patients, Jack, Mary, and Jim, is then given by the asymmetric binary dissimilarity, d(i, j) = (r + s)/(q + r + s).
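The pairwise computation can be sketched in Python as follows. The patient records used here are an assumption: they follow the version of this example in the cited textbook (Han and Kamber), with one character per attribute in the order fever, cough, test-1, test-2, test-3, test-4. The helper names are hypothetical.

```python
from itertools import combinations

# Assumed records (fever, cough, test-1, test-2, test-3, test-4).
records = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim": "YYNNNN",
}


def encode(symbols):
    """Map Y (yes) and P (positive) to 1, and N (no or negative) to 0."""
    return [1 if c in "YP" else 0 for c in symbols]


def asymmetric_dissimilarity(a, b):
    """(r + s) / (q + r + s); assumes at least one variable is 1 in a or b."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)
    return (r + s) / (q + r + s)


encoded = {name: encode(symbols) for name, symbols in records.items()}
distances = {
    (a, b): asymmetric_dissimilarity(encoded[a], encoded[b])
    for a, b in combinations(encoded, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```

Under these assumed records the pair Jack and Mary has the smallest value and the pair Mary and Jim the largest.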

These measurements suggest that
– Mary and Jim are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs.
– Of the three patients, Jack and Mary are the most likely to have a similar disease.

Categorical Variables

A categorical (nominal) variable is a generalization of the binary variable in that it can take on more than two states.
– Example: map_color is a categorical variable that may have five states: red, yellow, green, pink, and blue.
The states can be denoted by letters, symbols, or a set of integers.

Dissimilarity between categorical variables

Method 1: Simple matching
– The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
$$d(i, j) = \frac{p - m}{p}$$
– m is the number of matches (i.e., the number of variables for which i and j are in the same state)
– p is the total number of variables.
Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.
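Simple matching translates almost literally into code. In this unweighted Python sketch the function name and the two sample objects are hypothetical.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Ratio of mismatches (p - m) / p between two categorical objects."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    p = len(obj_i)                                        # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # number of matches
    return (p - m) / p


# Hypothetical objects described by three categorical variables.
a = ("red", "round", "small")
b = ("red", "square", "small")
print(simple_matching_dissimilarity(a, b))   # one mismatch out of three
```

With a single categorical variable (p = 1) the function returns 0 for a match and 1 for a mismatch.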

Example: Dissimilarity between categorical variables

Suppose that we have sample data for four objects, where the variable test-1 is categorical.

Let’s compute the dissimilarity matrix.
Since here we have one categorical variable, test-1, we set p = 1, so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. In test-1, objects 1 and 4 match and every other pair differs. Thus,

$$\begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

Method 2: Use a large number of binary variables
– Create a new asymmetric binary variable for each of the nominal states.
– For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0.
– For example, to encode the categorical variable map_color, a binary variable can be created for each of the five colors listed above.
– For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.
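A short Python sketch of this encoding for map_color is given below. The function name is hypothetical and the order of the states is an arbitrary choice.

```python
MAP_COLOR_STATES = ["red", "yellow", "green", "pink", "blue"]


def to_binary_variables(value, states):
    """Encode one categorical value as one 0/1 variable per state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == state else 0 for state in states]


yellow = to_binary_variables("yellow", MAP_COLOR_STATES)
blue = to_binary_variables("blue", MAP_COLOR_STATES)
print(yellow)   # [0, 1, 0, 0, 0]
print(blue)     # [0, 0, 0, 0, 1]
```

Exactly one of the five binary variables is 1 for any object, so two objects with different colors share no positive match.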

Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence.
– Example: professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors.
Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes.
The values of an ordinal variable can be mapped to ranks.
– Example: suppose that an ordinal variable f has $M_f$ states.
– These ordered states define the ranking 1, …, $M_f$.

Suppose that f is a variable from a set of ordinal variables describing n objects.
The dissimilarity computation with respect to f involves the following steps:
Step 1:
– The value of f for the ith object is $x_{if}$, and f has $M_f$ ordered states, representing the ranking 1, …, $M_f$.
– Replace each $x_{if}$ by its corresponding rank:
$$r_{if} \in \{1, \ldots, M_f\}$$
Step 2:
– Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight.
– This can be achieved by replacing the rank $r_{if}$ of the ith object in the fth variable by:
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Step 3:
– Dissimilarity can then be computed using any of the distance measures described for interval-scaled variables, using $z_{if}$ to represent the value of f for the ith object.

Example: Dissimilarity between ordinal variables

Suppose that we have sample data for four objects, where the variable test-2 is ordinal.
There are three states for test-2, namely fair, good, and excellent; that is, $M_f = 3$.
Step 1: If we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
Step 2: Normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
Step 3: We can use, say, the Euclidean distance, which results in the following dissimilarity matrix:

$$\begin{bmatrix} 0 & & & \\ 1.0 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1.0 & 0.5 & 0 \end{bmatrix}$$
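The three steps can be traced in Python. The test-2 values below are inferred from the stated ranks 3, 1, 2, and 3 under the order fair < good < excellent; the variable names are chosen here for illustration.

```python
ORDER = ["fair", "good", "excellent"]                 # M_f = 3 ordered states
test_2 = ["excellent", "fair", "good", "excellent"]   # objects 1 to 4

# Step 1: replace each value by its rank r_if in 1..M_f.
ranks = [ORDER.index(value) + 1 for value in test_2]

# Step 2: map each rank onto [0.0, 1.0] with z_if = (r_if - 1) / (M_f - 1).
z = [(r - 1) / (len(ORDER) - 1) for r in ranks]

# Step 3: with a single variable, the Euclidean distance reduces to |z_i - z_j|.
D = [[abs(a - b) for b in z] for a in z]

print(ranks)   # [3, 1, 2, 3]
print(z)       # [1.0, 0.0, 0.5, 1.0]
for row in D:
    print(row)
```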

Variables of Mixed Types

A database may contain different types of variables
– interval-scaled, symmetric binary, asymmetric binary, nominal, and ordinal
We can combine the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].

Suppose that the data set contains p variables of mixed type.
The dissimilarity d(i, j) between objects i and j is defined as
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where the indicator $\delta_{ij}^{(f)} = 0$
– if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of variable f for object i or object j),
– or (2) $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary;
otherwise, $\delta_{ij}^{(f)} = 1$.

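The contribution of variable f to the dissimilarity between i and j, that is, $d_{ij}^{(f)}$, depends on its type.
If f is interval-based:
– use the normalized distance so that the values map to the interval [0.0, 1.0]:
$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$$
– where h runs over all objects with a nonmissing value for variable f
If f is binary or categorical:
– $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
If f is ordinal:
– compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.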
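A possible Python sketch of the mixed-type formula is shown below. The function name, the type labels, and the conventions (None marks a missing value, ordinal values are passed already normalized to [0.0, 1.0], interval ranges are supplied by the caller) are assumptions made for this illustration. The sample objects use hypothetical category codes A, B, and C for a categorical variable, together with normalized ordinal values 1.0, 0.0, 0.5, and 1.0.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges=None):
    """Weighted average of per-variable contributions d_ij(f).

    kinds[f] is one of "interval", "ordinal", "nominal", "symmetric",
    or "asymmetric"; ranges[f] is max - min of interval variable f.
    """
    ranges = ranges or {}
    numerator = denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue                        # indicator 0: missing value
        if kind == "asymmetric" and a == 0 and b == 0:
            continue                        # indicator 0: negative match
        if kind == "interval":
            d = abs(a - b) / ranges[f] if ranges[f] else 0.0
        elif kind == "ordinal":
            d = abs(a - b)                  # values already normalized to [0, 1]
        else:
            d = 0.0 if a == b else 1.0      # binary or categorical
        numerator += d
        denominator += 1.0                  # indicator 1
    if denominator == 0:
        raise ValueError("no variable contributes to this pair")
    return numerator / denominator


# Four hypothetical objects: (categorical code, normalized ordinal value).
objects = [("A", 1.0), ("B", 0.0), ("C", 0.5), ("A", 1.0)]
kinds = ["nominal", "ordinal"]
D = [[mixed_dissimilarity(x, y, kinds) for y in objects] for x in objects]
for row in D:
    print(row)
```

Because both variables contribute to every pair here, each entry is simply the average of the categorical and ordinal contributions.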

Example: Dissimilarity between variables of mixed type

The sample data are the four objects used above, described by test-1 (categorical) and test-2 (ordinal).
The dissimilarity matrix for test-1 is computed as outlined above for categorical variables, and the one for test-2 as outlined above for ordinal variables.
We can now combine the dissimilarity matrices for the two variables. Both variables contribute to every pair, so each entry is the average of the two contributions:

$$\begin{bmatrix} 0.00 & & & \\ 1.00 & 0.00 & & \\ 0.75 & 0.75 & 0.00 & \\ 0.00 & 1.00 & 0.75 & 0.00 \end{bmatrix}$$

If we go back and look at the data, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2.
This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects.
Similarly, the matrix indicates that objects 1 and 2 and objects 2 and 4 are the least similar.

References

J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc. (2006). (Chapter 7)

The end
