Unit 4 - Part 2

Unit-4

CLIQUE (Clustering In QUEst)


• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
• It partitions each dimension into the same number of equal-length intervals
• It partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter (the density threshold)
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle.
• Identify clusters:
• Determine dense units in all subspaces of interest.
• Determine connected dense units in all subspaces of interest.
• Generate a minimal description for the clusters:
• Determine maximal regions that cover a cluster of connected dense units, for each cluster.
• Determine a minimal cover for each cluster.
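To make these steps concrete, here is a minimal sketch in Python, restricted to 2-D subspaces, of the first step and the dense-unit test: each dimension is partitioned into xi equal-length intervals, and a unit is kept when its fraction of points exceeds the density threshold tau (the τ in the figure below). The function and parameter names are illustrative, not from the original paper.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def dense_units_2d(X, xi=10, tau=0.03):
    """Partition every pair of dimensions into a xi-by-xi grid and return
    the dense units, i.e. units holding more than a tau fraction of points."""
    n, d = X.shape
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / xi
    widths[widths == 0] = 1  # guard against constant dimensions
    # Interval index of every point along every dimension.
    idx = np.clip(((X - mins) / widths).astype(int), 0, xi - 1)

    dense = {}
    for i, j in combinations(range(d), 2):
        counts = Counter(zip(idx[:, i], idx[:, j]))
        units = {u for u, c in counts.items() if c / n > tau}
        if units:
            dense[(i, j)] = units  # dense units in subspace (i, j)
    return dense

X = np.random.rand(500, 3)
print(dense_units_2d(X, xi=5, tau=0.05))
```

Connected dense units within each subspace would then be merged into clusters (e.g., by a breadth-first search over neighboring units), following the remaining steps above.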
[Figure: CLIQUE example with density threshold τ = 3. The data space is partitioned into equal-length intervals on two planes: Salary (×10,000; units 0-7) versus age (20-60), and Vacation (weeks; units 0-7) versus age (20-60). Dense units found in these 2-D subspaces are combined to locate a candidate dense region in the 3-D (age, salary, vacation) subspace.]
Strength and Weakness of CLIQUE

• Strength
• It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
• It is insensitive to the order of records in the input and does not presume any canonical data distribution
• It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
• The accuracy of the clustering result may be degraded for the sake of the method's simplicity (the fixed grid and single density threshold can miss or blur cluster boundaries)
ProCLUS (Projected Clustering) in Data
Analytics
• ProCLUS (Projected Clustering) is a clustering algorithm designed for
high-dimensional data.
• It identifies clusters by focusing on relevant subsets of dimensions for
each cluster rather than attempting to cluster data across all
dimensions.
• This makes ProCLUS well-suited for subspace clustering in
high-dimensional datasets, where only specific dimensions contribute
to meaningful clusters.
Key Features of ProCLUS
• Projection-Based Clustering: Unlike traditional clustering methods
like K-Means or DBSCAN, ProCLUS clusters points by identifying a
subset of relevant dimensions for each cluster, reducing the impact of
irrelevant dimensions.
• Scalability: ProCLUS is designed to handle large datasets and
efficiently process high-dimensional spaces.
• High-Dimensional Data Suitability: It is particularly useful in domains
like bioinformatics, finance, or text analysis, where data points are
often described by a large number of attributes, but not all attributes
are relevant for clustering.
Steps in the ProCLUS Algorithm
• Initialization:
• ProCLUS requires the number of clusters k as an input parameter.
• The algorithm selects a random subset of k medoids (central points of clusters) from the data.
• Dimension Selection:
• For each cluster, ProCLUS selects a subset of dimensions (subspaces) that are
most relevant to the cluster.
• Dimension relevance is determined using a scoring function that measures
how well the points in a cluster are grouped in a particular dimension.
• Assignment of Points:
• Data points are assigned to the cluster whose medoid and subspace best
describe them.
• ProCLUS minimizes a cost function based on the distances between points
and their cluster medoids within the selected dimensions.
• Iteration:
• The algorithm iteratively refines the medoids and subspaces for each cluster
to improve clustering quality.
• This process continues until convergence (e.g., no significant changes in
clusters).
• Output:
• ProCLUS produces k clusters along with the relevant subspaces (dimensions) for each cluster.
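Below is a minimal sketch of the medoid-plus-subspace iteration described above, assuming Manhattan-style distances and a fixed number l of relevant dimensions per cluster. It is illustrative, not the full algorithm of Aggarwal et al.; all names and choices are hypothetical.

```python
import numpy as np

def proclus_sketch(X, k=3, l=2, n_iter=10, seed=0):
    """Toy projected clustering: k medoids, each with l relevant dimensions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    medoids = X[rng.choice(n, size=k, replace=False)]

    for _ in range(n_iter):
        # Dimension selection: for each medoid, keep the l dimensions with
        # the smallest average deviation among its nearest points.
        dims = []
        for m in medoids:
            near = X[np.argsort(np.abs(X - m).sum(axis=1))[: max(n // k, 2)]]
            spread = np.abs(near - m).mean(axis=0)
            dims.append(np.argsort(spread)[:l])

        # Assignment: distance measured only in each cluster's own subspace.
        dists = np.stack(
            [np.abs(X[:, dims[c]] - medoids[c][dims[c]]).mean(axis=1)
             for c in range(k)], axis=1)
        labels = dists.argmin(axis=1)

        # Refinement: move each medoid to the point nearest its cluster mean.
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centroid = pts.mean(axis=0)
                medoids[c] = pts[np.abs(pts - centroid).sum(axis=1).argmin()]
    return labels, medoids, dims

X = np.random.rand(300, 6)
labels, medoids, dims = proclus_sketch(X, k=3, l=2)
print(dims)  # the l relevant dimensions chosen for each cluster
```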
Advantages of ProCLUS
1. Dimensionality Reduction: By clustering in subspaces, ProCLUS reduces the curse of dimensionality, which affects traditional clustering methods in high-dimensional data.
2. Interpretability: It provides insight into which dimensions are relevant for each cluster, aiding interpretability in data analytics.
3. Efficiency: ProCLUS avoids analyzing all dimensions simultaneously, improving computational efficiency.
4. Versatility: It can handle overlapping clusters in different subspaces.
Challenges of ProCLUS
• Dependency on Parameters:
• ProCLUS requires the number of clusters (k) and the number of relevant dimensions per cluster as input, both of which can be hard to determine.
• Random Initialization:
• The choice of initial medoids can significantly affect the quality of clustering.
• Scalability to Extremely High Dimensions:
• While effective in high-dimensional data, performance may degrade in
ultra-high-dimensional spaces where most dimensions are irrelevant.
Comparison of ProCLUS and CLIQUE
• Approach: ProCLUS is a top-down, medoid-based projected clustering method; CLIQUE is a bottom-up, grid- and density-based subspace clustering method.
• Inputs: ProCLUS needs the number of clusters k and the number of relevant dimensions per cluster; CLIQUE needs a grid resolution and a density threshold.
• Output: ProCLUS returns exactly k clusters, each with its own relevant subspace; CLIQUE returns all dense clusters in all interesting subspaces, possibly overlapping.
• Sensitivity: ProCLUS is sensitive to the random choice of initial medoids; CLIQUE is insensitive to the order of input records.
When to Use ProCLUS vs. CLIQUE
• Use ProCLUS when the number of clusters is known in advance and each cluster is expected to lie in its own small set of relevant dimensions.
• Use CLIQUE when the number of clusters is unknown and the goal is to enumerate all dense regions across subspaces.
Frequent Pattern Mining in Data Mining
• Frequent pattern mining in data mining is the process of
identifying patterns or associations within a dataset that occur
frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
• Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in a given dataset. It involves identifying sets of items that frequently occur together in a transactional or relational database. This process can offer valuable insight into the relationships and associations among different items or attributes within the data.
• Transactional and Relational Databases:
• Frequent pattern mining can be applied to transactional databases, where each transaction consists of a set of items.
• For instance, in a retail dataset, each transaction may represent a customer's purchase, with items like bread, milk, and eggs.
• It can also be used with relational databases, where data is organized into multiple related tables.
• In this case, frequent patterns can represent relationships among different attributes or columns.
• Support and Frequent Itemsets:
• The support of an itemset is defined as the proportion of transactions in the database that contain that itemset: support(X) = |{T ∈ D : X ⊆ T}| / |D|.
• It represents the frequency of occurrence of the itemset in the dataset.
• Frequent itemsets are sets of items whose support meets or exceeds a specified minimum support threshold.
• These itemsets are considered interesting and are the primary focus of frequent pattern mining.
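As a quick illustration of this definition, assuming a small hypothetical transaction list (all names are illustrative):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(itemset <= t for t in transactions)
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))  # 2 of 4 transactions -> 0.5
```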
• Apriori Algorithm:
• The Apriori algorithm is one of the most well-known and widely used algorithms for frequent pattern mining.
• It uses a breadth-first search strategy to discover frequent itemsets efficiently. The algorithm works in multiple iterations.
• It starts by finding frequent individual items, scanning the database once and counting the occurrences of each item.
• It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1.
• The process continues iteratively, generating candidate itemsets of size k and calculating their support, until no more frequent itemsets can be found.
• Support-Based Pruning:
• During the Apriori algorithm's execution, support-based pruning is used to reduce the search space and enhance efficiency.
• If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its supersets are guaranteed to be infrequent as well.
• Therefore, these supersets are pruned from further consideration. This pruning step significantly decreases the number of candidate itemsets that need to be evaluated in subsequent iterations.
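Here is a compact sketch of the level-wise Apriori loop just described, reusing the hypothetical transactions and support function from the earlier snippet. It filters candidates by support only and omits the subset-based candidate pruning of a full implementation.

```python
def apriori(transactions, min_support=0.5):
    """Level-wise search: frequent k-itemsets are grown from (k-1)-itemsets."""
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items)
             if support({i}, transactions) >= min_support]
    frequent = list(level)

    while level:
        k = len(level[0]) + 1
        # Candidate generation: unions of frequent (k-1)-itemsets that
        # grow the size by exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Support-based pruning: keep only candidates meeting the threshold.
        level = [c for c in candidates if support(c, transactions) >= min_support]
        frequent.extend(level)
    return frequent

print(apriori(transactions, min_support=0.5))
# On the toy data: three frequent 1-itemsets plus pairs such as {milk, bread}.
```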
• Association Rule Mining:
• Frequent itemsets can be further examined to discover association rules, which represent relationships between different items.
• An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side), both of which are itemsets.
• For instance, {milk, bread} => {eggs} is an association rule. Association rules are produced from frequent itemsets by considering different combinations of items and calculating measures such as support, confidence, and lift.
• Support measures the frequency with which the antecedent and the consequent appear together, while confidence measures the conditional probability of the consequent given the antecedent.
• Lift indicates the strength of the association between the antecedent and the consequent relative to their individual supports.
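Continuing with the same hypothetical transactions and support function, the three measures for a rule such as {milk, bread} => {eggs} can be computed directly:

```python
def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift of the rule antecedent => consequent."""
    sup = support(antecedent | consequent, transactions)
    conf = sup / support(antecedent, transactions)
    lift = conf / support(consequent, transactions)
    return sup, conf, lift

print(rule_metrics({"milk", "bread"}, {"eggs"}, transactions))
# (0.25, 0.5, 0.666...): the rule holds in half the cases where the
# antecedent appears, and lift < 1 suggests a weak (negative) association.
```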
Clustering for streams and
parallelism
Data Stream Clustering
• The analysis of large-scale datasets that evolve over time has gained
considerable attention, with a particular focus on stream processing
methods.
• Among the vital tasks in data stream analysis is the clustering of data
streams.
• This task involves partitioning data in streams into clusters such that similar
data are grouped together while dissimilar data are separated into distinct
clusters.
• Unlike traditional clustering algorithms that work on the entire dataset, data stream clustering (DSC) algorithms have to analyze each data point as it arrives, in sequential order, and perform the necessary processing or learning steps in an online fashion.
• In particular, DSC algorithms commonly maintain temporal clusters that
temporarily hold the current computed clustering results.
Design Aspects of DSC Algorithms
• Summarizing Data Structure stores the intermediate clustering
information. Since data streams are typically infinite, it is impractical
to store the entire input data stream for clustering.
• Window Model is used to determine the most recent input data for
processing. In most cases, more recent information from the stream
better reflects the evolving activities in clusters.
• Outlier Detection Mechanism identifies the incoming data points that
seem to be different from the historical stream.
• Existing DSC algorithms identify new data points that are far from the temporal clusters as outliers.
• However, when outlier evolution occurs in a data stream, some previous outliers may become part of clusters and some clustered points may become outliers.
• Therefore, the detection of outliers over data streams is always a challenging
task for all of the DSC algorithms.
• Offline Refinement Strategy refers to the process of applying offline
clustering algorithms to refine the clustering results from online
clustering.
• Unlike the other three design aspects, which aim to keep execution continuous in real time, the offline refinement strategy is applied only once, before the final clustering result is produced.
• Thus, it usually does not significantly affect efficiency but can improve accuracy.
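To ground these design aspects, here is a minimal sketch assuming a damped-window model: micro-clusters serve as the summarizing data structure, per-step decay implements the window, far-away points enter an outlier buffer, and buffered outliers can evolve into clusters. All class names, thresholds, and parameters are illustrative.

```python
import numpy as np

class MicroCluster:
    """Summarizing data structure: linear sum, point count, decaying weight."""
    def __init__(self, point):
        self.lsum = np.array(point, dtype=float)
        self.n = 1
        self.weight = 1.0

    def center(self):
        return self.lsum / self.n

    def absorb(self, point):
        self.lsum += point
        self.n += 1
        self.weight += 1.0

def nearest(point, mcs):
    dists = [np.linalg.norm(point - m.center()) for m in mcs]
    i = int(np.argmin(dists))
    return i, dists[i]

def stream_cluster(stream, radius=0.5, decay=0.99, min_weight=0.5, promote_at=3):
    clusters, outliers = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Damped window model: older information fades at every step.
        for m in clusters + outliers:
            m.weight *= decay
        clusters = [m for m in clusters if m.weight >= min_weight]
        outliers = [m for m in outliers if m.weight >= min_weight]

        # Try to absorb the point into an existing temporal cluster.
        if clusters:
            i, d = nearest(x, clusters)
            if d <= radius:
                clusters[i].absorb(x)
                continue
        # Outlier handling: a buffered outlier may evolve into a cluster.
        if outliers:
            i, d = nearest(x, outliers)
            if d <= radius:
                outliers[i].absorb(x)
                if outliers[i].n >= promote_at:
                    clusters.append(outliers.pop(i))
                continue
        outliers.append(MicroCluster(x))
    return clusters

rng = np.random.default_rng(0)
stream = rng.normal(loc=[0, 0], scale=0.1, size=(200, 2))
print(len(stream_cluster(stream)))  # points near (0, 0) form one cluster
```

An offline refinement step, as described above, could then run a conventional clustering algorithm once over the surviving micro-cluster centers to produce the final result.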
