Data Mining – 5

Requirements for Cluster Analysis

Cluster analysis is a crucial task in data mining that helps in grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar to each other than to those
in other groups (clusters). Below are the key requirements for effective clustering:

1. Scalability:
o Issue: Many clustering algorithms perform well on small datasets but struggle
with large datasets containing millions or billions of objects.
o Requirement: Algorithms need to be highly scalable to handle large databases
effectively.
2. Ability to Deal with Different Types of Attributes:
o Issue: Traditional algorithms are often designed for numeric data.
o Requirement: Algorithms should be capable of clustering different data types
such as binary, nominal (categorical), ordinal data, and complex data types like
graphs, sequences, images, and documents.
3. Discovery of Clusters with Arbitrary Shape:
o Issue: Many algorithms assume clusters are spherical and of similar size and
density.
o Requirement: Algorithms should detect clusters of arbitrary shapes, useful in
applications like environmental surveillance where phenomena may have
irregular boundaries.
4. Requirements for Domain Knowledge to Determine Input Parameters:
o Issue: Many algorithms need predefined parameters like the number of clusters,
which can be hard to determine.
o Requirement: Algorithms should minimize the need for user-specified input
parameters to reduce user burden and improve clustering quality.
5. Ability to Deal with Noisy Data:
o Issue: Real-world data often contains noise and errors, which can degrade
clustering quality.
o Requirement: Robust clustering methods that can handle noisy data and outliers
are needed.
6. Incremental Clustering and Insensitivity to Input Order:
o Issue: Some algorithms can't update clusters incrementally and are sensitive to the
order of data input.
o Requirement: Algorithms should support incremental updates and be insensitive
to the order in which data is presented.
7. Capability of Clustering High-Dimensional Data:
o Issue: High-dimensional data, such as documents with thousands of keywords, is
challenging to cluster.
o Requirement: Algorithms should effectively handle high-dimensional data,
considering the sparsity and skewness of such data.
8. Constraint-Based Clustering:
o Issue: Real-world applications may have specific constraints, such as
geographical barriers or customer types.
o Requirement: Algorithms should perform clustering under various constraints to
meet real-world needs.
9. Interpretability and Usability:
o Issue: Users need clustering results that are understandable and practical for their
specific applications.
o Requirement: Clustering results should be interpretable and usable, with clear
semantic meaning and relevance to the application goals.

Overview of Basic Clustering Methods

Clustering methods are essential tools in data mining, used to group a set of objects into clusters,
such that objects in the same cluster are more similar to each other than to those in other clusters.
The main categories of clustering methods are:

1. Partitioning Methods:
o Concept: Partitioning methods divide a dataset into k groups (partitions), where
each group represents a cluster, and each cluster must contain at least one object.
o Process: These methods typically use an iterative relocation technique to improve
partitioning by moving objects between groups.
o Criteria: The quality of partitioning is judged based on how close objects in the
same cluster are to each other and how far apart objects in different clusters are.
o Techniques: Common techniques include k-means and k-medoids, which are
heuristic methods aimed at finding local optima for clustering.
o Extensions: These methods can be extended for subspace clustering to handle
sparse data in high-dimensional spaces.
o Limitations: Achieving global optimality is often computationally prohibitive,
and these methods generally find spherical-shaped clusters.
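The closeness criterion mentioned above is commonly formalized as the within-cluster sum of squared errors (SSE): the total squared distance of every object to its cluster's mean. A minimal sketch in Python, using small made-up 2-D clusters for illustration:

```python
# Within-cluster sum of squared errors (SSE): for each cluster,
# sum the squared distance of every point to that cluster's mean.
# Lower SSE means tighter (higher-quality) partitions.
def sse(clusters):
    total = 0.0
    for points in clusters:
        # Component-wise mean of the points in this cluster.
        mean = [sum(c) / len(points) for c in zip(*points)]
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, mean))
    return total

# Two toy 2-D clusters: a tight one and a looser one.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],    # mean (0, 1), contributes 1 + 1 = 2
    [(10.0, 0.0), (14.0, 0.0)],  # mean (12, 0), contributes 4 + 4 = 8
]
print(sse(clusters))  # 10.0
```

k-means can be seen as a heuristic that iteratively tries to reduce exactly this quantity, which is why it tends to settle on a local (not global) optimum.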
2. Hierarchical Methods:
o Concept: Hierarchical methods create a tree-like (hierarchical) decomposition of
the dataset, which can be either agglomerative (bottom-up) or divisive (top-
down).
o Agglomerative Approach: Starts with each object as a separate cluster and
merges the closest clusters iteratively until all objects are in one cluster or a
termination condition is met.
o Divisive Approach: Starts with all objects in one cluster and splits them
iteratively until each object is in its own cluster or a termination condition is met.
o Techniques: These methods can be based on distance, density, or continuity.
o Limitations: Once a merge or split is done, it cannot be undone, leading to
potential errors that cannot be corrected. However, methods to improve the
quality of hierarchical clustering exist.
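The agglomerative (bottom-up) approach described above can be sketched in a few lines of Python. This is a toy single-linkage version on hypothetical 2-D points, not a production implementation (libraries such as SciPy provide efficient versions):

```python
# Minimal agglomerative clustering with single linkage: start with each
# point as its own cluster, then repeatedly merge the two closest
# clusters until the requested number of clusters remains.
def agglomerative(points, num_clusters):
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: distance between the closest pair of points
        # drawn from the two clusters.
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge: this cannot be undone
    return clusters

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
print(agglomerative(pts, 2))  # two clusters: the two left points, the two right
```

Note how the `extend`/`pop` merge is irreversible, which is exactly the limitation noted above: an early bad merge propagates to the final result.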
3. Density-Based Methods:
o Concept: These methods form clusters based on the density of data points in a
region, continuing to grow a cluster as long as the density in the neighborhood
exceeds a certain threshold.
o Process: For each data point in a cluster, the neighborhood within a given radius
must contain at least a minimum number of points.
o Techniques: These methods are effective in filtering out noise and discovering
clusters of arbitrary shape.
o Applications: Density-based methods can create multiple exclusive clusters or a
hierarchy of clusters, and they can be extended to subspace clustering.
o Limitations: Typically, these methods consider only exclusive clusters and may
not handle fuzzy clusters well.
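The neighborhood rule above (at least a minimum number of points within a given radius) is the core of DBSCAN-style clustering. A simplified sketch on made-up data, with `eps` as the radius and `min_pts` as the density threshold:

```python
# Simplified density-based clustering (DBSCAN-style sketch): a point
# whose eps-neighborhood holds at least min_pts points is a core point;
# clusters grow outward from core points, and points reachable from no
# core point are labeled noise (-1).
def dbscan(points, eps, min_pts):
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not dense enough: tentatively noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:     # former noise can join as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(js)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (50, 50) ends up labeled -1 (noise), illustrating why density-based methods are effective at filtering outliers, and each point receives exactly one label, illustrating the exclusive-cluster limitation.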

Partitioning Methods

Partitioning methods are a fundamental approach to clustering in data mining. They organize a
set of objects into k clusters, where each cluster is represented by a single object or a summary of
the objects (such as the mean). The key idea is to partition the data in such a way that objects
within the same cluster are more similar to each other than to those in other clusters.

Key Characteristics

 Exclusive Clusters: Each object belongs to exactly one cluster.
 Fixed Number of Clusters: The number of clusters (k) is predefined.
 Iterative Optimization: These methods typically involve iterative refinement to improve
clustering quality.

Algorithm: k-means

Input:

k: the number of clusters

D: a dataset containing n objects

Output:

A set of k clusters

Method:
1. Arbitrarily choose k objects from D as the initial cluster centers.

2. Repeat

a. (Re)assign each object to the cluster to which the object is the most similar, based on the mean
value of the objects in the cluster.

b. Update the cluster means, that is, calculate the mean value of the objects for each cluster.

3. Until no change in cluster assignments.
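The pseudocode above translates directly into Python. This is a toy sketch on invented 2-D data (real applications would typically use a library implementation such as scikit-learn's `KMeans`):

```python
import random

# Toy k-means following the pseudocode above: choose k initial centers,
# then alternate (a) reassigning each object to the nearest cluster mean
# and (b) recomputing each cluster's mean, until assignments stop changing.
def k_means(D, k, seed=0):
    random.seed(seed)
    means = random.sample(D, k)           # step 1: arbitrary initial centers
    assignment = None
    while True:
        # Step 2a: (re)assign each object to the most similar cluster,
        # i.e. the one whose mean is closest in squared distance.
        new_assignment = [
            min(range(k), key=lambda c: sum((x - m) ** 2
                                            for x, m in zip(p, means[c])))
            for p in D
        ]
        if new_assignment == assignment:  # step 3: no change -> converged
            return means, assignment
        assignment = new_assignment
        # Step 2b: update each cluster mean from its assigned objects.
        for c in range(k):
            members = [p for p, a in zip(D, assignment) if a == c]
            if members:                   # guard against an empty cluster
                means[c] = tuple(sum(xs) / len(members)
                                 for xs in zip(*members))

D = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
means, labels = k_means(D, k=2)
print(means, labels)
```

Because the initial centers are chosen arbitrarily, different seeds can converge to different local optima, which is the local-optimality limitation of partitioning methods noted earlier.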
