Cluster analysis is a class of statistical techniques that can be applied to data that exhibit “natural”
groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a
group of relatively homogeneous cases or observations. Objects in a cluster are similar to each
other and dissimilar to objects outside the cluster, particularly objects in other clusters.
The diagram below illustrates the results of a survey that studied drinkers’ perceptions of spirits
(alcohol). Each point represents the results from one respondent. The research indicates there are
four clusters in this market.
Figure: Illustration of clusters
Another example is the vacation travel market. Recent research has identified three clusters or
market segments. They are: 1) the demanders - they want exceptional service and expect to be
pampered; 2) the escapists - they want to get away and just relax; 3) the educationalists - they want
to see new things, go to museums, go on a safari, or experience new cultures.
Cluster analysis, like factor analysis and multidimensional scaling, is an interdependence
technique: it makes no distinction between dependent and independent variables. The entire set of
interdependent relationships is examined. It is similar to multidimensional scaling in that both
examine inter-object similarity by examining the complete set of interdependent relationships. The
difference is that multidimensional scaling identifies underlying dimensions, while cluster analysis
identifies clusters. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces
the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the
number of observations or cases by grouping them into a smaller set of clusters.
In marketing, cluster analysis is used for:
• Segmenting the market and determining target markets
• Product positioning and new product development
• Selecting test markets (see: experimental techniques)
Basic procedure
1. Formulate the problem - select the variables that you wish to apply the clustering technique
to
2. Select a distance measure - there are various ways of computing distance:
• Squared Euclidean distance - the sum of the squared differences in value for each
variable (its square root is the ordinary Euclidean distance)
• Manhattan distance - the sum of the absolute differences in value across all variables
• Chebyshev distance - the maximum absolute difference in value for any variable
• Mahalanobis distance - this measure weights the differences between observations by the
inverse of the covariance matrix of the variables, so it takes the correlations between
variables into account. This is an important measure since it is unit invariant (it can
literally compare apples to oranges)
3. Select a clustering procedure (see below)
4. Decide on the number of clusters
5. Map and interpret clusters - draw conclusions - illustrative techniques like perceptual maps,
icicle plots, and dendrograms are useful
6. Assess reliability and validity - various methods:
• repeat the analysis using a different distance measure
• repeat the analysis using a different clustering technique
• split the data randomly into two halves and analyze each half separately
• repeat the analysis several times, deleting one variable each time
• repeat the analysis several times, presenting the cases in a different order each time
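The distance measures listed in step 2 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names and the two example respondents are assumptions made up for this sketch. Mahalanobis distance is left out because it needs the inverse covariance matrix of the variables, which is easier to compute with a numerical library.

```python
import math

def squared_euclidean(a, b):
    # Sum of the squared differences in value for each variable.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of the absolute differences across all variables.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute difference on any single variable.
    return max(abs(x - y) for x, y in zip(a, b))

# Two hypothetical respondents rated on three variables (made-up values).
a = [1.0, 5.0, 3.0]
b = [4.0, 1.0, 3.0]

print(squared_euclidean(a, b))  # 9 + 16 + 0 = 25.0
print(euclidean(a, b))          # 5.0
print(manhattan(a, b))          # 3 + 4 + 0 = 7.0
print(chebyshev(a, b))          # max(3, 4, 0) = 4.0
```

The four measures rank pairs of observations differently when variables are on different scales, which is one reason step 6 suggests repeating the analysis with a different distance measure.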
Clustering procedures
There are several types of clustering methods:
• Non-hierarchical clustering (often called k-means clustering, after its best-known method)
• first determine a cluster center, then group all objects that are within a certain
distance of it
• examples:
• Sequential Threshold method - first determine a cluster center, then group
all objects that are within a predetermined threshold from the center - one
cluster is created at a time
• Parallel Threshold method - simultaneously several cluster centers are
determined, then objects that are within a predetermined threshold from the
centers are grouped
• Optimizing Partitioning method - first a non-hierarchical procedure is run,
then objects are reassigned to other clusters so as to optimize an overall criterion,
such as the average within-cluster distance.
• Hierarchical clustering
• objects are organized into a hierarchical structure as part of the procedure
• examples:
• Divisive clustering - start by treating all objects as if they are part of a single
large cluster, then divide the cluster into smaller and smaller clusters
• Agglomerative clustering - start by treating each object as a separate cluster,
then group them into bigger and bigger clusters
• examples:
• Centroid methods - the distance between two clusters is the
distance between their centers, and the two clusters whose centers
are closest are merged at each step (a centroid is the mean
value for all the objects in the cluster)
• Variance methods - clusters are generated that minimize the
within-cluster variance
• example:
• Ward’s Procedure - at each step, the two clusters are merged
whose combination gives the smallest increase in the total
squared Euclidean distance of objects to their cluster means
• Linkage methods - cluster objects based on the distance
between them
• examples:
• Single Linkage method - cluster objects based
on the minimum distance between them (also
called the nearest neighbour rule)
• Complete Linkage method - cluster objects
based on the maximum distance between them
(also called the furthest neighbour rule)
• Average Linkage method - cluster objects
based on the average distance between all pairs
of objects, where the two members of each pair
come from different clusters
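The agglomerative linkage methods described above can be sketched as follows. This is a minimal, unoptimized illustration in plain Python. The function names (`cluster_distance`, `agglomerate`), the choice of Euclidean distance, and the six sample points are all assumptions made for this sketch, not part of any particular statistical package.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points, linkage):
    # All pairwise distances with one member taken from each cluster.
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest neighbour rule
        return min(dists)
    if linkage == "complete":    # furthest neighbour rule
        return max(dists)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="single"):
    # Start by treating each object as a separate cluster (lists of indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], points, linkage
            ),
        )
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Six made-up observations forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 8.0), (8.0, 8.5)]

for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(points, 2, linkage))
```

On data this well separated, all three linkage rules recover the same two groups. They tend to disagree on elongated or overlapping clusters, where single linkage can chain objects together and complete linkage favours compact clusters.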