Unit 4 - Part 2

Unit-4

CLIQUE (Clustering In QUEst)


• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
• It partitions each dimension into the same number of equal-length intervals
• It partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter (the density threshold)
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle.
• Identify clusters:
• Determine dense units in all subspaces of interest.
• Determine connected dense units in all subspaces of interest.
• Generate a minimal description for the clusters:
• Determine maximal regions that cover a cluster of connected dense units, for each cluster.
• Determine a minimal cover for each cluster.
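To make these steps concrete, here is a minimal sketch in Python, restricted to 2-D subspaces, of the first step and the dense-unit test: each dimension is partitioned into xi equal-length intervals, and a unit is kept when its fraction of points exceeds the density threshold tau (the τ in the figure below). The function and parameter names are illustrative, not from the original paper.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def dense_units_2d(X, xi=10, tau=0.03):
    """Partition every pair of dimensions into a xi-by-xi grid and return
    the dense units, i.e. units holding more than a tau fraction of points."""
    n, d = X.shape
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / xi
    widths[widths == 0] = 1  # guard against constant dimensions
    # Interval index of every point along every dimension.
    idx = np.clip(((X - mins) / widths).astype(int), 0, xi - 1)

    dense = {}
    for i, j in combinations(range(d), 2):
        counts = Counter(zip(idx[:, i], idx[:, j]))
        units = {u for u, c in counts.items() if c / n > tau}
        if units:
            dense[(i, j)] = units  # dense units in subspace (i, j)
    return dense

X = np.random.rand(500, 3)
print(dense_units_2d(X, xi=5, tau=0.05))
```

Connected dense units within each subspace would then be merged into clusters (e.g., by a breadth-first search over neighboring units), following the remaining steps above.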
[Figure: CLIQUE example with density threshold τ = 3. The data space is partitioned into equal-length intervals on two planes: Salary (×10,000; units 0-7) versus age (20-60), and Vacation (weeks; units 0-7) versus age (20-60). Dense units found in these 2-D subspaces are combined to locate a candidate dense region in the 3-D (age, salary, vacation) subspace.]
Strength and Weakness of CLIQUE

• Strength
• It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
• It is insensitive to the order of records in the input and does not presume any canonical data distribution
• It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
• The accuracy of the clustering result may be degraded for the sake of the method's simplicity (the fixed grid and single density threshold can miss or blur cluster boundaries)
ProCLUS (Projected Clustering) in Data
Analytics
• ProCLUS (Projected Clustering) is a clustering algorithm designed for
high-dimensional data.
• It identifies clusters by focusing on relevant subsets of dimensions for
each cluster rather than attempting to cluster data across all
dimensions.
• This makes ProCLUS well-suited for subspace clustering in
high-dimensional datasets, where only specific dimensions contribute
to meaningful clusters.
Key Features of ProCLUS
• Projection-Based Clustering: Unlike traditional clustering methods
like K-Means or DBSCAN, ProCLUS clusters points by identifying a
subset of relevant dimensions for each cluster, reducing the impact of
irrelevant dimensions.
• Scalability: ProCLUS is designed to handle large datasets and
efficiently process high-dimensional spaces.
• High-Dimensional Data Suitability: It is particularly useful in domains
like bioinformatics, finance, or text analysis, where data points are
often described by a large number of attributes, but not all attributes
are relevant for clustering.
Steps in the ProCLUS Algorithm
• Initialization:
• ProCLUS requires the number of clusters k as an input parameter.
• The algorithm selects a random subset of k medoids (central points of clusters) from the data.
• Dimension Selection:
• For each cluster, ProCLUS selects a subset of dimensions (subspaces) that are
most relevant to the cluster.
• Dimension relevance is determined using a scoring function that measures
how well the points in a cluster are grouped in a particular dimension.
• Assignment of Points:
• Data points are assigned to the cluster whose medoid and subspace best
describe them.
• ProCLUS minimizes a cost function based on the distances between points
and their cluster medoids within the selected dimensions.
• Iteration:
• The algorithm iteratively refines the medoids and subspaces for each cluster
to improve clustering quality.
• This process continues until convergence (e.g., no significant changes in
clusters).
• Output:
• ProCLUS produces k clusters along with the relevant subspaces (dimensions) for each cluster.
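Below is a minimal sketch of the medoid-plus-subspace iteration described above, assuming Manhattan-style distances and a fixed number l of relevant dimensions per cluster. It is illustrative, not the full algorithm of Aggarwal et al.; all names and choices are hypothetical.

```python
import numpy as np

def proclus_sketch(X, k=3, l=2, n_iter=10, seed=0):
    """Toy projected clustering: k medoids, each with l relevant dimensions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    medoids = X[rng.choice(n, size=k, replace=False)]

    for _ in range(n_iter):
        # Dimension selection: for each medoid, keep the l dimensions with
        # the smallest average deviation among its nearest points.
        dims = []
        for m in medoids:
            near = X[np.argsort(np.abs(X - m).sum(axis=1))[: max(n // k, 2)]]
            spread = np.abs(near - m).mean(axis=0)
            dims.append(np.argsort(spread)[:l])

        # Assignment: distance measured only in each cluster's own subspace.
        dists = np.stack(
            [np.abs(X[:, dims[c]] - medoids[c][dims[c]]).mean(axis=1)
             for c in range(k)], axis=1)
        labels = dists.argmin(axis=1)

        # Refinement: move each medoid to the point nearest its cluster mean.
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centroid = pts.mean(axis=0)
                medoids[c] = pts[np.abs(pts - centroid).sum(axis=1).argmin()]
    return labels, medoids, dims

X = np.random.rand(300, 6)
labels, medoids, dims = proclus_sketch(X, k=3, l=2)
print(dims)  # the l relevant dimensions chosen for each cluster
```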
Advantages of ProCLUS
1. Dimensionality Reduction: By clustering in subspaces, ProCLUS reduces the curse of dimensionality, which affects traditional clustering methods in high-dimensional data.
2. Interpretability: It provides insight into which dimensions are relevant for each cluster, aiding interpretability in data analytics.
3. Efficiency: ProCLUS avoids analyzing all dimensions simultaneously, improving computational efficiency.
4. Versatility: It can handle overlapping clusters in different subspaces.
Challenges of ProCLUS
• Dependency on Parameters:
• ProCLUS requires the number of clusters (k) and the number of relevant dimensions per cluster as input, both of which can be hard to determine.
• Random Initialization:
• The choice of initial medoids can significantly affect the quality of clustering.
• Scalability to Extremely High Dimensions:
• While effective in high-dimensional data, performance may degrade in
ultra-high-dimensional spaces where most dimensions are irrelevant.
Comparison of ProCLUS and CLIQUE
• Approach: ProCLUS is a top-down, medoid-based projected clustering method; CLIQUE is a bottom-up, grid- and density-based subspace clustering method.
• Inputs: ProCLUS needs the number of clusters k and the number of relevant dimensions per cluster; CLIQUE needs a grid resolution and a density threshold.
• Output: ProCLUS returns exactly k clusters, each with its own relevant subspace; CLIQUE returns all dense clusters in all interesting subspaces, possibly overlapping.
• Sensitivity: ProCLUS is sensitive to the random choice of initial medoids; CLIQUE is insensitive to the order of input records.
When to Use ProCLUS vs. CLIQUE
• Use ProCLUS when the number of clusters is known in advance and each cluster is expected to lie in its own small set of relevant dimensions.
• Use CLIQUE when the number of clusters is unknown and the goal is to enumerate all dense regions across subspaces.
Frequent Pattern Mining in Data Mining
• Frequent pattern mining in data mining is the process of
identifying patterns or associations within a dataset that occur
frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
• Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in a given dataset. It involves identifying sets of items that frequently occur together in a transactional or relational database. This process can offer valuable insight into the relationships and associations among different items or attributes within the data.
• Transactional and Relational Databases:
• Frequent pattern mining can be applied to transactional databases, where each transaction consists of a set of items.
• For instance, in a retail dataset, each transaction may represent a customer's purchase, with items like bread, milk, and eggs.
• It can also be used with relational databases, where data is organized into multiple related tables.
• In this case, frequent patterns can represent relationships among different attributes or columns.
• Support and Frequent Itemsets:
• The support of an itemset is defined as the proportion of transactions in the database that contain that itemset: support(X) = |{T ∈ D : X ⊆ T}| / |D|.
• It represents the frequency of occurrence of the itemset in the dataset.
• Frequent itemsets are sets of items whose support meets or exceeds a specified minimum support threshold.
• These itemsets are considered interesting and are the primary focus of frequent pattern mining.
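As a quick illustration of this definition, assuming a small hypothetical transaction list (all names are illustrative):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(itemset <= t for t in transactions)
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))  # 2 of 4 transactions -> 0.5
```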
• Apriori Algorithm:
• The Apriori algorithm is one of the most well-known and widely used algorithms for frequent pattern mining.
• It uses a breadth-first search strategy to discover frequent itemsets efficiently. The algorithm works in multiple iterations.
• It starts by finding frequent individual items, scanning the database once and counting the occurrences of each item.
• It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1.
• The process continues iteratively, generating candidate itemsets of size k and calculating their support, until no more frequent itemsets can be found.
• Support-Based Pruning:
• During the Apriori algorithm's execution, support-based pruning is used to reduce the search space and enhance efficiency.
• If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its supersets are guaranteed to be infrequent as well.
• Therefore, these supersets are pruned from further consideration. This pruning step significantly decreases the number of candidate itemsets that need to be evaluated in subsequent iterations.
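Here is a compact sketch of the level-wise Apriori loop just described, reusing the hypothetical transactions and support function from the earlier snippet. It filters candidates by support only and omits the subset-based candidate pruning of a full implementation.

```python
def apriori(transactions, min_support=0.5):
    """Level-wise search: frequent k-itemsets are grown from (k-1)-itemsets."""
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items)
             if support({i}, transactions) >= min_support]
    frequent = list(level)

    while level:
        k = len(level[0]) + 1
        # Candidate generation: unions of frequent (k-1)-itemsets that
        # grow the size by exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Support-based pruning: keep only candidates meeting the threshold.
        level = [c for c in candidates if support(c, transactions) >= min_support]
        frequent.extend(level)
    return frequent

print(apriori(transactions, min_support=0.5))
# On the toy data: three frequent 1-itemsets plus pairs such as {milk, bread}.
```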
• Association Rule Mining:
• Frequent itemsets can be further examined to discover association rules, which represent relationships between different items.
• An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side), both of which are itemsets.
• For instance, {milk, bread} => {eggs} is an association rule. Association rules are produced from frequent itemsets by considering different combinations of items and calculating measures such as support, confidence, and lift.
• Support measures the frequency with which the antecedent and the consequent appear together, while confidence measures the conditional probability of the consequent given the antecedent.
• Lift indicates the strength of the association between the antecedent and the consequent relative to their individual supports.
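Continuing with the same hypothetical transactions and support function, the three measures for a rule such as {milk, bread} => {eggs} can be computed directly:

```python
def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift of the rule antecedent => consequent."""
    sup = support(antecedent | consequent, transactions)
    conf = sup / support(antecedent, transactions)
    lift = conf / support(consequent, transactions)
    return sup, conf, lift

print(rule_metrics({"milk", "bread"}, {"eggs"}, transactions))
# (0.25, 0.5, 0.666...): the rule holds in half the cases where the
# antecedent appears, and lift < 1 suggests a weak (negative) association.
```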
Clustering for streams and
parallelism
Data Stream Clustering
• The analysis of large-scale datasets that evolve over time has gained
considerable attention, with a particular focus on stream processing
methods.
• Among the vital tasks in data stream analysis is the clustering of data
streams.
• This task involves partitioning data in streams into clusters such that similar
data are grouped together while dissimilar data are separated into distinct
clusters.
• Unlike traditional clustering algorithms that work on the entire dataset, data stream clustering (DSC) algorithms have to analyze each data point as it arrives, in sequential order, and perform the necessary processing or learning steps in an online fashion.
• In particular, DSC algorithms commonly maintain temporal clusters that
temporarily hold the current computed clustering results.
Design Aspects of DSC Algorithms
• Summarizing Data Structure stores the intermediate clustering
information. Since data streams are typically infinite, it is impractical
to store the entire input data stream for clustering.
• Window Model is used to determine the most recent input data for
processing. In most cases, more recent information from the stream
better reflects the evolving activities in clusters.
• Outlier Detection Mechanism identifies the incoming data points that
seem to be different from the historical stream.
• Existing DSC algorithms identify new data points that are far from the temporal clusters as outliers.
• However, when outlier evolution occurs in a data stream, some previous outliers may become part of clusters and some clustered points may become outliers.
• Therefore, the detection of outliers over data streams is always a challenging
task for all of the DSC algorithms.
• Offline Refinement Strategy refers to the process of applying offline
clustering algorithms to refine the clustering results from online
clustering.
• Unlike the other three design aspects, which aim to keep execution continuous in real time, the offline refinement strategy is applied only once, before the final clustering result is produced.
• Thus, it usually does not significantly affect efficiency but can improve accuracy.
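To ground these design aspects, here is a minimal sketch assuming a damped-window model: micro-clusters serve as the summarizing data structure, per-step decay implements the window, far-away points enter an outlier buffer, and buffered outliers can evolve into clusters. All class names, thresholds, and parameters are illustrative.

```python
import numpy as np

class MicroCluster:
    """Summarizing data structure: linear sum, point count, decaying weight."""
    def __init__(self, point):
        self.lsum = np.array(point, dtype=float)
        self.n = 1
        self.weight = 1.0

    def center(self):
        return self.lsum / self.n

    def absorb(self, point):
        self.lsum += point
        self.n += 1
        self.weight += 1.0

def nearest(point, mcs):
    dists = [np.linalg.norm(point - m.center()) for m in mcs]
    i = int(np.argmin(dists))
    return i, dists[i]

def stream_cluster(stream, radius=0.5, decay=0.99, min_weight=0.5, promote_at=3):
    clusters, outliers = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Damped window model: older information fades at every step.
        for m in clusters + outliers:
            m.weight *= decay
        clusters = [m for m in clusters if m.weight >= min_weight]
        outliers = [m for m in outliers if m.weight >= min_weight]

        # Try to absorb the point into an existing temporal cluster.
        if clusters:
            i, d = nearest(x, clusters)
            if d <= radius:
                clusters[i].absorb(x)
                continue
        # Outlier handling: a buffered outlier may evolve into a cluster.
        if outliers:
            i, d = nearest(x, outliers)
            if d <= radius:
                outliers[i].absorb(x)
                if outliers[i].n >= promote_at:
                    clusters.append(outliers.pop(i))
                continue
        outliers.append(MicroCluster(x))
    return clusters

rng = np.random.default_rng(0)
stream = rng.normal(loc=[0, 0], scale=0.1, size=(200, 2))
print(len(stream_cluster(stream)))  # points near (0, 0) form one cluster
```

An offline refinement step, as described above, could then run a conventional clustering algorithm once over the surviving micro-cluster centers to produce the final result.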
