CLUSTERING: MODEL-BASED TECHNIQUES
AND HANDLING HIGH-DIMENSIONAL DATA
Model-Based Clustering Methods
Attempt to optimize the fit between the data and
some mathematical model
Assumption: Data are generated by a mixture of
underlying probability distributions
Techniques
Expectation-Maximization
Conceptual Clustering
Neural Network Approach
Expectation-Maximization (EM)
Each cluster is represented mathematically by a parametric probability distribution (the component distribution)
The data set is a mixture of these distributions (a mixture density model)
Problem: estimate the parameters of the probability distributions so that the mixture fits the data best
An iterative refinement algorithm is used to find the parameter estimates
Extension of k-means
Assigns an object to a cluster according to a weight representing its probability of membership
Initial guess for parameters: randomly select k objects to represent the cluster means or centers
Iteratively refine the parameters / clusters in two steps:
Expectation Step
Assign each object xi to cluster Ck with probability
P(xi ∈ Ck) = P(Ck | xi) = P(Ck) p(xi | Ck) / p(xi)  (Bayes' rule; p(xi | Ck) is the component density, typically Gaussian)
Maximization Step
Re-estimate model parameters from the membership weights, e.g. the new mean of Ck is the weighted average of all objects: mk = Σi P(xi ∈ Ck) xi / Σi P(xi ∈ Ck)
Simple and easy to implement
Complexity is linear in the number of features, objects, and iterations
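A minimal numpy sketch of these two steps for a one-dimensional, two-component Gaussian mixture. The synthetic data, constants, and variable names are illustrative assumptions, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

k = 2
mu = rng.choice(x, size=k, replace=False)   # initial guess: k objects as means
sigma = np.ones(k)                          # initial standard deviations
pi = np.full(k, 1.0 / k)                    # mixing weights P(Ck)

for _ in range(50):
    # E-step: membership weight of each object in each cluster (Bayes' rule)
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
           / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, spreads, and weights from the weighted objects
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sigma, pi)   # estimated parameters of the two components

In practice, library implementations such as scikit-learn's GaussianMixture run the same loop with better initialization and numerical safeguards.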
Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of
unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB
A popular and simple method of incremental
conceptual learning
Creates a hierarchical clustering in the form
of a classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
COBWEB Clustering Method
Classification tree
Each node: a concept and a probabilistic description of that concept (a summary of the objects under that node)
Description: conditional probabilities P(Ai = vij | Ck)
Sibling nodes at a given level form a partition
Category Utility
Increase in the expected number of attribute values that can be correctly guessed given a partition, over the expected number of correct guesses without that knowledge
Category utility rewards:
Intra-class similarity P(Ai = vij | Ck)
A high value indicates that many class members share this attribute-value pair
Inter-class dissimilarity P(Ck | Ai = vij)
A high value indicates that few objects in other classes share this attribute-value pair
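For reference, the standard formulation of category utility (due to Gluck and Corter) from which these two terms arise; the formula does not appear on the slides, so take it as the standard definition rather than the deck's own:

CU = \frac{1}{k} \sum_{l=1}^{k} P(C_l) \Big[ \sum_i \sum_j P(A_i = v_{ij} \mid C_l)^2 - \sum_i \sum_j P(A_i = v_{ij})^2 \Big]

The bracketed difference measures how many more attribute values can be guessed correctly knowing the class C_l than without it; the outer sum averages this over the k classes of the partition.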
Placement of new objects
Descend tree
Identify best host
Temporarily place object in each node and
compute CU of resulting partition
Placement with highest CU is chosen
COBWEB may also form a new node if the object does not fit into the existing tree
COBWEB is sensitive to the order of the records
Additional operations
Merging and splitting
The two best hosts are considered for merging
The best host is considered for splitting
Limitations
The assumption that the attributes are independent of each other is often too strong, because correlations may exist
Not suitable for clustering large databases
CLASSIT - an extension of COBWEB for
incremental clustering of continuous data
Neural Network Approach
Represent each cluster as an exemplar, acting as a
prototype of the cluster
New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
Self-Organizing Map (SOM)
Competitive learning
Involves a hierarchical architecture of several
units (neurons)
Neurons compete in a winner-takes-all fashion
for the object currently being presented
Organization of units forms a feature map
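A minimal numpy sketch of this competitive, winner-takes-all scheme on a 1-D map of units; the neighborhood function, decay schedules, and all names are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
data = rng.random((500, 3))        # 500 synthetic objects with 3 features
m = 10                             # number of map units (neurons)
w = rng.random((m, 3))             # one weight vector (exemplar) per unit
grid = np.arange(m)                # unit positions on the 1-D map

for t, x in enumerate(data):
    lr = 0.5 * (1 - t / len(data))                  # decaying learning rate
    radius = max(1.0, (m / 2) * (1 - t / len(data)))
    bmu = np.argmin(((w - x) ** 2).sum(axis=1))     # winner-takes-all
    # Units close to the winner on the map also move toward x,
    # which is what organizes the units into a feature map
    h = np.exp(-((grid - bmu) ** 2) / (2 * radius ** 2))
    w += lr * h[:, None] * (x - w)

After training, each unit's weight vector acts as an exemplar, and nearby units hold similar exemplars.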
Application: web document clustering with the Kohonen SOM
Clustering High-Dimensional Data
As dimensionality increases
the number of irrelevant dimensions may produce noise and mask the real clusters
the data becomes sparse
distance measures become increasingly meaningless
Feature transformation methods
PCA, SVD: summarize the data by creating linear combinations of the attributes
But they do not remove any attributes, and the transformed attributes can be complex to interpret
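A small numpy sketch of PCA via SVD, showing that each transformed attribute is a linear combination of all original attributes (data and names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
X = rng.random((100, 5))            # 100 synthetic objects, 5 attributes
Xc = X - X.mean(axis=0)             # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                   # project onto the first 2 principal components
# Each new attribute Z[:, j] mixes all 5 originals with weights Vt[j];
# nothing is removed, which is why the result can be hard to interpret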
Feature selection methods
Find the most relevant subset of attributes with respect to the class labels
e.g. entropy analysis
Subspace Clustering searches for groups of
clusters within different subspaces of the same
data set
CLIQUE: CLustering In QUest
Dimension growth subspace clustering
Starts at 1-D and grows upward to higher dimensions
Partitions each dimension into a grid and determines whether each cell is dense
CLIQUE
Determines sparse and crowded units
Dense unit: the fraction of data points in the unit exceeds a threshold
Cluster: a maximal set of connected dense units
First partitions the d-dimensional space into non-overlapping rectangular units
Dense-unit identification is performed first in 1-D
Based on the Apriori property: if a k-dimensional unit is dense, so are its projections in (k-1)-dimensional space
Hence candidate dense k-dimensional units are generated only from dense (k-1)-dimensional units, which reduces the search space
Determines the maximal dense regions and generates a minimal description for them
Finds the subspace of highest dimensionality
Insensitive to order of inputs
Performance depends on grid size and density
threshold
These are difficult to determine across all dimensions
Several lower-dimensional subspaces will have to be processed
An adaptive strategy can be used
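A toy sketch of CLIQUE's bottom-up pass under the assumptions above: equal-width grids, a global density threshold, and Apriori-style pruning where candidate 2-D units combine only dense 1-D units. Data, constants, and names are illustrative:

import numpy as np
from collections import Counter
from itertools import combinations

rng = np.random.default_rng(3)
# Two synthetic blobs in 3-D, clipped into the unit cube
X = np.clip(np.vstack([rng.normal(0.2, 0.05, (500, 3)),
                       rng.normal(0.7, 0.05, (500, 3))]), 0, 0.999)
xi, tau = 10, 0.02                     # cells per dimension, density threshold

cells = np.floor(X * xi).astype(int)   # grid cell of each point, per dimension

# Dense 1-D units: (dimension, cell) pairs holding > tau of the points
c1 = Counter((d, c) for row in cells for d, c in enumerate(row))
dense1 = {u for u, n in c1.items() if n / len(X) > tau}

# Candidate 2-D units combine only dense 1-D units (Apriori pruning)
c2 = Counter()
for row in cells:
    for d1, d2 in combinations(range(3), 2):
        u1, u2 = (d1, row[d1]), (d2, row[d2])
        if u1 in dense1 and u2 in dense1:
            c2[(u1, u2)] += 1
dense2 = {u for u, n in c2.items() if n / len(X) > tau}
print(len(dense1), len(dense2))

A full implementation would continue to higher dimensions, connect adjacent dense units into clusters, and emit a minimal description of each region.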
PROCLUS: PROjected CLUStering
Dimension-reduction Subspace Clustering technique
Finds initial approximation of clusters in high
dimensional space
Avoids generation of large number of overlapped
clusters of lower dimensionality
Finds best set of medoids by hill-climbing process
(Similar to CLARANS)
Uses the Manhattan segmental distance measure (a small sketch follows the phases below)
Initialization Phase
Greedy algorithm to select a set of initial medoids that are far apart
Iteration Phase
Selects a random set of k-medoids
Replaces bad medoids
For each medoid a set of dimensions is chosen
whose average distances are small
Refinement Phase
Computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers
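A small sketch of the Manhattan segmental distance mentioned above: the Manhattan distance restricted to a medoid's chosen dimension set, averaged over that set (the function name and example are illustrative):

def manhattan_segmental(x, y, dims):
    # Average per-dimension Manhattan distance over the subspace `dims`
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Distance between two points in the subspace {0, 2}
print(manhattan_segmental([1.0, 9.0, 4.0], [2.0, 0.0, 6.0], [0, 2]))  # 1.5

Averaging by the number of dimensions keeps distances comparable between medoids whose dimension sets have different sizes.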
Frequent Pattern based Clustering
Frequent patterns may also form clusters
Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined
Two common techniques
Frequent term-based text clustering
Clustering by pattern similarity
Frequent-term based text clustering
Text documents are clustered based on frequent terms
they contain
Documents are represented by the terms they contain, so dimensionality is very high
Frequent-term-based analysis
A well-selected subset of the set of all frequent terms must be discovered
Fi: a frequent term set; cov(Fi): the set of documents covered by Fi
Select F1, ..., Fk such that ∪_{i=1}^{k} cov(Fi) = D, and the overlap between cov(Fi) and cov(Fj) must be minimized
Description of clusters: their frequent term sets
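A toy greedy selection in the spirit of this criterion: repeatedly pick the frequent term set that covers the most still-uncovered documents until the whole collection D is covered. The greedy rule, data, and names are illustrative assumptions, not the slides' algorithm:

# docs: document id -> its set of terms; term_sets: candidate frequent term sets
docs = {1: {"data", "mining"}, 2: {"data", "cluster"}, 3: {"web", "cluster"}}
term_sets = [{"data"}, {"cluster"}, {"web"}]

def cover(F):
    # cov(F): documents containing every term in F
    return {d for d, terms in docs.items() if F <= terms}

uncovered, chosen = set(docs), []
while uncovered:
    # Greedy: the term set covering the most uncovered documents;
    # favoring new documents keeps the overlap between selections low
    F = max(term_sets, key=lambda F: len(cover(F) & uncovered))
    chosen.append(F)
    uncovered -= cover(F)
print(chosen)   # each chosen frequent term set describes one cluster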
Clustering by Pattern Similarity
pCluster, applied to DNA micro-array data analysis
In micro-array data, the expression levels of two genes may rise and fall synchronously in response to a set of stimuli
Two objects are similar if they exhibit a coherent
pattern on a subset of dimensions
pCluster
Shift-pattern discovery
Euclidean distance is not suitable; derive new attributes instead
Bi-clustering based on the mean squared residue score
pCluster
Objects x, y; attributes a, b: pScore = |(dxa - dxb) - (dya - dyb)|
A pair (O, T) forms a δ-pCluster if, for any 2 x 2 submatrix X of (O, T), pScore(X) ≤ δ
Each pair of objects, on each pair of attributes, must satisfy the threshold (a small check is sketched at the end of this section)
Scaling patterns: taking logarithms turns a scaling pattern into a shift pattern, so pCluster handles these as well
pCluster can be used in other applications also
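A direct check of the δ-pCluster condition above, testing every 2 x 2 submatrix of a candidate (O, T); all names and the example data are illustrative:

from itertools import combinations

def pscore(dxa, dxb, dya, dyb):
    # |(dxa - dxb) - (dya - dyb)|: deviation from a pure shift pattern
    return abs((dxa - dxb) - (dya - dyb))

def is_delta_pcluster(m, objects, attrs, delta):
    # m[o][a] is the value of object o on attribute a
    for x, y in combinations(objects, 2):
        for a, b in combinations(attrs, 2):
            if pscore(m[x][a], m[x][b], m[y][a], m[y][b]) > delta:
                return False
    return True

# Rows shifted by a constant form a perfect shift pattern (pScore = 0)
m = {"g1": {"t1": 1.0, "t2": 4.0}, "g2": {"t1": 3.0, "t2": 6.0}}
print(is_delta_pcluster(m, ["g1", "g2"], ["t1", "t2"], delta=0.5))  # True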