1
1
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
3
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific
simulation, …

Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
4
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science

Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics, or
linguistics.)
 Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online

The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes. Data mining is a major new
challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
5
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information
systems
6
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
7
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
8
Knowledge Discovery (KDD) Process
 This is a view from typical
database systems and data
warehousing communities
 Data mining plays an essential
role in the knowledge discovery
process
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection of
task-relevant data → Data Mining → Pattern Evaluation
9
Example: A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into
knowledge-base
10
Data Mining in Business Intelligence
(Pyramid, bottom to top, with increasing potential to support business decisions)
 Data Sources: paper, files, Web documents, scientific experiments, database systems
 Data Preprocessing/Integration, Data Warehouses — DBA
 Data Exploration: statistical summary, querying, and reporting — Data Analyst
 Data Mining: information discovery — Data Analyst
 Data Presentation: visualization techniques — Business Analyst
 Decision Making — End User
11
Example: Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much
mining
 Business objects vs. data mining tools
 Supply chain example: tools
 Data presentation
 Exploration
12
KDD Process: A Typical View from ML and
Statistics
 This is a view from typical machine learning and statistics communities
Input Data → Data Pre-Processing → Data Mining → Post-Processing
 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
13
Example: Medical Data Mining
 Health care & medical data mining often adopt this view
from statistics and machine learning
 Preprocessing of the data (including feature
extraction and dimension reduction)
 Classification or/and clustering processes
 Post-processing for presentation
14
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
15
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web,
multi-media, graphs & social and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance,
etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
16
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
17
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
18
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
19
Data Mining Function: (1) Generalization
 Information integration and data warehouse
construction
 Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology
 Scalable methods for computing (i.e., materializing)
multidimensional aggregates
 OLAP (online analytical processing)
 Multidimensional concept description:
Characterization and discrimination
 Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
20
Data Mining Function: (2) Association and
Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in
your Walmart?
 Association, correlation vs. causality
 A typical association rule

Diaper → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly
correlated?
 How to mine such patterns and rules efficiently in
large datasets?
 How to use such patterns for classification, clustering, and other applications? (A rough support/confidence sketch follows below.)
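As a rough illustration (not from the slides; the toy transactions below are hypothetical), a minimal Python sketch of how support and confidence for a rule such as Diaper → Beer can be computed:

```python
# Minimal sketch: support and confidence of one candidate rule over
# a hypothetical set of market-basket transactions.

transactions = [
    {"diaper", "beer", "bread"},
    {"diaper", "beer", "milk"},
    {"diaper", "coke"},
    {"beer", "bread"},
    {"diaper", "beer", "milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diaper", "beer"}, transactions))               # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))          # 0.75
```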
21
Data Mining Function: (3) Classification
 Classification and label prediction
 Construct models (functions) based on some training
examples
 Describe and distinguish classes or concepts for future
prediction

E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
22
Data Mining Function: (4) Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing
interclass similarity
 Many methods and applications
23
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? ― One person’s garbage could be another
person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
24
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
 Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
 Sequential pattern mining

e.g., first buy digital camera, then buy large SD
memory cards
 Periodicity analysis
 Motifs and biological sequence analysis

Approximate and consecutive motifs
 Similarity-based analysis
 Mining data streams
 Ordered, time-varying, potentially infinite data
streams
25
Structure and Network Analysis
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
 Information network analysis
 Social networks: actors (objects, nodes) and relationships
(edges)

e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks

A person could be in multiple information networks: friends,
family, classmates, …
 Links carry a lot of semantic information: Link mining
 Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks

Web community discovery, opinion mining, usage mining,
…
26
Evaluation of Knowledge
 Are all mined knowledge interesting?
 One can mine a tremendous amount of “patterns” and
knowledge
 Some may fit only certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only
interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
27
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
28
Data Mining: Confluence of Multiple Disciplines
(Figure: data mining at the confluence of machine learning, statistics, pattern
recognition, database technology, visualization, high-performance computing,
algorithms, and applications)
29
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
Algorithms must be highly scalable to handle terabytes
of data
 High-dimensionality of data
 Microarray data may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
30
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
31
Applications of Data Mining
 Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
 Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
32
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
33
Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
34
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
35
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
36
A Brief History of Data Mining Society
 1989 IJCAI Workshop on Knowledge Discovery in Databases
 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
 ACM Transactions on KDD starting in 2007
37
Conferences and Journals on Data Mining
 KDD Conferences
 ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining
(KDD)
 SIAM Data Mining Conf. (SDM)
 (IEEE) Int. Conf. on Data Mining
(ICDM)
 European Conf. on Machine
Learning and Principles and
practices of Knowledge
Discovery and Data Mining
(ECML-PKDD)
 Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
 Int. Conf. on Web Search and
Data Mining (WSDM)
 Other related conferences
 DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, …
 Web and IR conferences: WWW,
SIGIR, WSDM
 ML conferences: ICML, NIPS
 PR conferences: CVPR,
 Journals
 Data Mining and Knowledge
Discovery (DAMI or DMKD)
 IEEE Trans. On Knowledge and
Data Eng. (TKDE)
 KDD Explorations
 ACM Trans. on KDD
38
Where to Find References? DBLP, CiteSeer, Google
 Data mining and KDD (SIGKDD: CDROM)
 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
 Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS,
etc.
 Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
 Web and IR
 Conferences: SIGIR, WWW, CIKM, etc.
 Journals: WWW: Internet and Web Information Systems,
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.
39
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
40
Summary
 Data mining: Discovering interesting patterns and knowledge
from massive amount of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
 Data mining technologies and applications
 Major issues in data mining
41
Recommended Reference Books
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery,
Morgan Kaufmann, 2001
 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd
ed., 2011
 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, 2nd
ed., Springer-Verlag, 2009
 B. Liu, Web Data Mining, Springer 2006.
 T. M. Mitchell, Machine Learning, McGraw Hill, 1997
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd
ed. 2005
42
Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
43
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
44
Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix,
crosstabs
 Document data: text documents: term-
frequency vector
 Transaction data
 Graph and network
 World Wide Web
 Social or information networks
 Molecular Structures
 Ordered
 Video data: sequence of images
 Temporal data: time-series
 Sequential Data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia:
 Spatial data: maps
 Image data:
 Video data:
Term-frequency vectors (documents × terms):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0        2
Document 2    0     7      0     2     1      0     0     3      0        0
Document 3    0     1      0     0     1      2     2     0      3        0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
45
Important Characteristics of Structured Data
 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution

Patterns depend on the scale
 Distribution
 Centrality and dispersion
46
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points,
objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
47
Attributes
 Attribute (or dimensions, features, variables):
a data field, representing a characteristic or
feature of a data object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Numeric: quantitative

Interval-scaled

Ratio-scaled
48
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important

e.g., gender
 Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome (e.g.,
HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
49
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval

Measured on a scale of equal-sized units

Values have order
 E.g., temperature in C˚or F˚, calendar dates

No true zero-point
 Ratio

Inherent zero-point

We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
 e.g., temperature in Kelvin, length, counts,
monetary quantities
50
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values

E.g., zip codes, profession, or the set of words in a
collection of documents
 Sometimes, represented as integer variables
 Note: Binary attributes are a special case of discrete
attributes
 Continuous Attribute
 Has real numbers as attribute values

E.g., temperature, height, or weight
 Practically, real values can only be measured and
represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
51
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
52
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple
granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed
cube
53
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
   x̄ = (1/n) Σ_{i=1..n} x_i      μ = Σx / N
   Note: n is sample size and N is population size.
 Weighted arithmetic mean:
   x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i
 Trimmed mean: chopping extreme values
 Median:
 Middle value if odd number of values, or average of
the middle two values otherwise
 Estimated by interpolation (for grouped data):
   median ≈ L_1 + ( (n/2 − Σ freq_l) / freq_median ) × width
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula:
   mean − mode ≈ 3 × (mean − median)
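A minimal Python sketch of the measures above on a small hypothetical salary sample (values are illustrative only):

```python
# Hypothetical sample; illustrates mean, weighted mean, median, mode, trimmed mean.
from statistics import mean, median, mode

x = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
w = [1] * len(x)                                        # equal weights for the weighted mean

print(sum(x) / len(x))                                  # mean: 58.0
print(sum(wi * xi for wi, xi in zip(w, x)) / sum(w))    # weighted mean (same here, equal weights)
print(median(x))                                        # 54.0 (average of the two middle values)
print(mode(x))                                          # 52 (data is bimodal: 52 and 70; mode() returns the first)
print(mean(sorted(x)[1:-1]))                            # trimmed mean after chopping the two extremes: 55.6
```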
54
Symmetric vs. Skewed Data
 Median, mean and mode of
symmetric, positively and
negatively skewed data
(Figure: symmetric, positively skewed, and negatively skewed distributions)
55
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th
percentile), Q3 (75th
percentile)
 Inter-quartile range: IQR = Q3 –Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
 Outlier: usually, a value higher/lower than 1.5 x IQR
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
 Standard deviation s (or σ) is the square root of the variance s² (or σ²)
   Sample:     s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²  =  (1/(n−1)) [ Σ_{i=1..n} x_i² − (1/n)(Σ_{i=1..n} x_i)² ]
   Population: σ² = (1/N) Σ_{i=1..n} (x_i − μ)²  =  (1/N) Σ_{i=1..n} x_i² − μ²
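A minimal numpy sketch (hypothetical sample) of the dispersion measures above:

```python
# Quartiles, IQR, outlier fences, variance and standard deviation.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", x.min(), q1, med, q3, x.max())
print("IQR =", iqr, "-> outlier fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print("sample variance  s^2    =", x.var(ddof=1))   # divides by n - 1
print("population var   sigma^2=", x.var(ddof=0))   # divides by N
print("sample std dev   s      =", x.std(ddof=1))
```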
56
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and
third quartiles, i.e., the height of the box is
IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box
extended to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
57
Visualization of Data Dispersion: 3-D Boxplots
58
Properties of Normal Distribution Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it
59
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary
 Histogram: the x-axis shows values, the y-axis represents
frequencies
 Quantile plot: each value x_i is paired with f_i indicating
that approximately 100·f_i % of data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
60
Histogram Analysis
 Histogram: Graph display of
tabulated frequencies, shown as
bars
 It shows what proportion of cases
fall into each of several categories
 Differs from a bar chart in that it is
the area of the bar that denotes
the value, not the height as in bar
charts, a crucial distinction when
the categories are not of uniform
width
 The categories are usually
specified as non-overlapping
intervals of some variable. The
categories (bars) must be adjacent
61
Histograms Often Tell More than Boxplots
 The two histograms
shown in the left may
have the same
boxplot
representation
 The same values
for: min, Q1,
median, Q3, max
 But they have rather
different data
distributions
62
Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data x_i sorted in increasing order, f_i
indicates that approximately 100·f_i % of the data are
below or equal to the value x_i
63
Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2.
64
Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates
and plotted as points in the plane
65
Positively and Negatively Correlated Data
 The left half fragment is positively
correlated
 The right half is negatively
correlated
66
Uncorrelated Data
67
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
68
Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto graphical
primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships
among data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
69
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen,
one for each dimension
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
 The colors of the pixels reflect the corresponding values
(a) Income   (b) Credit limit   (c) Transaction volume   (d) Age
70
Laying Out Pixels in Circle Segments
 To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment
(a) Representing a data record in a circle segment   (b) Laying out pixels in circle segments
71
Geometric Projection Visualization Techniques
 Visualization of geometric transformations and
projections of the data
 Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: Help users find
meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates
72
Direct Data Visualization
(Figure: ribbons with twists based on vorticity)
73
Scatterplot Matrices
Matrix of pairwise scatterplots (x-y diagrams) of the k-dimensional data [a total of k(k−1)/2 distinct scatterplots]
(Used by permission of M. Ward, Worcester Polytechnic Institute)
74
Landscapes
 Visualization of the data as a perspective landscape
 The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the
data
(Figure: news articles visualized as a landscape; used by permission of B. Wright, Visible Decisions Inc.)
75
Parallel Coordinates
(Figure: parallel axes for Attr. 1, Attr. 2, Attr. 3, …, Attr. k)
 n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
 The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
 Every data item corresponds to a polygonal line which intersects
each of the axes at the point which corresponds to the value for
the attribute
76
Parallel Coordinates of a Data Set
77
Icon-Based Visualization Techniques
 Visualization of the data values as features of icons
 Typical visualization methods
 Chernoff Faces
 Stick Figures
 General techniques
 Shape coding: Use shape to represent certain
information encoding
 Color icons: Use color icons to encode more
information
 Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
78
Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x
be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening): each is assigned one of 10 possible values, generated
using Mathematica (S. Dickson)
 REFERENCE: Gonick, L. and Smith, W.
The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
 Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
79
Stick Figure
 A census data figure showing age, income, gender, education, etc.
(used by permission of G. Grinstein, University of Massachusetts at Lowell)
 A 5-piece stick figure (1 body and 4 limbs with different angle/length)
 Two attributes mapped to axes, remaining attributes mapped to angle or
length of limbs; look at the texture pattern
80
Hierarchical Visualization Techniques
 Visualization of the data using a hierarchical
partitioning into subspaces
 Methods
 Dimensional Stacking
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube
81
Dimensional Stacking
(Figure: 2-D subspaces of attribute 1 … attribute 4 stacked into each other)
 Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
 Partitioning of the attribute value ranges into classes.
The important attributes should be used on the outer
levels.
 Adequate for data with ordinal attributes of low
cardinality
 But, difficult to display more than nine dimensions
82
Dimensional Stacking
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
(used by permission of M. Ward, Worcester Polytechnic Institute)
83
Worlds-within-Worlds
 Assign the function and two most important parameters to
innermost world
 Fix all other parameters at constant values; draw other (1-, 2-, or
3-dimensional) worlds choosing these as the axes
 Software that uses this paradigm
 N–vision: Dynamic
interaction through
data glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
 Auto Visual: Static
interaction by means of
queries
84
Tree-Map
 Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on the
attribute values
 The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
(Figure: MSR Netscan tree-map image)
85
Tree-Map of a File System (Shneiderman)
86
InfoCube
 A 3-D visualization technique where hierarchical
information is displayed as nested semi-
transparent cubes
 The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
87
Three-D Cone Trees
 3D cone tree visualization technique
works well for up to a thousand nodes
or so
 First build a 2D circle tree that arranges
its nodes in concentric circles centered
on the root node
 Cannot avoid overlaps when projected
to 2D
 G. Robertson, J. Mackinlay, S. Card.
“Cone Trees: Animated 3D Visualizations
of Hierarchical Information”, ACM
SIGCHI'91
 Graph from Nadeau Software Consulting
website: Visualize a social network data
set that models the way an infection
spreads from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags
 The importance of a tag is represented
by font size/color
 Besides text data,
there are also
methods to visualize
relationships, such
as visualizing social
networks
(Figure: Newsmap visualization of Google News stories)
89
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
90
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data
objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
91
Data Matrix and Dissimilarity Matrix
 Data matrix
 n data points with p
dimensions
 Two modes
 Dissimilarity matrix
 n data points, but
registers only the
distance
 A triangular matrix
 Single mode
Data matrix (n objects × p attributes):

    [ x_11  …  x_1f  …  x_1p ]
    [  ⋮        ⋮         ⋮  ]
    [ x_i1  …  x_if  …  x_ip ]
    [  ⋮        ⋮         ⋮  ]
    [ x_n1  …  x_nf  …  x_np ]

Dissimilarity matrix (n × n, lower triangular, single mode):

    [ 0                            ]
    [ d(2,1)   0                   ]
    [ d(3,1)   d(3,2)   0          ]
    [  ⋮         ⋮       ⋮         ]
    [ d(n,1)   d(n,2)   …       0  ]
92
Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., red, yellow,
blue, green (generalization of a binary
attribute)
 Method 1: Simple matching
 m: # of matches, p: total # of variables
   d(i, j) = (p − m) / p
 Method 2: Use a large number of binary
attributes
 creating a new binary attribute for each of the
M nominal states
93
Proximity Measure for Binary Attributes
 A contingency table for binary data (object i vs. object j):
q = # of attributes where both i and j are 1, r = # where i is 1 and j is 0,
s = # where i is 0 and j is 1, t = # where both are 0
 Distance measure for symmetric
binary variables:
   d(i, j) = (r + s) / (q + r + s + t)
 Distance measure for asymmetric
binary variables:
   d(i, j) = (r + s) / (q + r + s)
 Jaccard coefficient (similarity
measure for asymmetric binary
variables):
   sim_Jaccard(i, j) = q / (q + r + s)
 Note: Jaccard coefficient is the same as “coherence”
94
Dissimilarity between Binary Variables
 Example
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N 0
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
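A minimal Python sketch reproducing the three distances above (asymmetric binary dissimilarity; the symmetric attribute gender is left out):

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s) for Jack/Mary/Jim.

patients = {
    # fever, cough, test-1, test-2, test-3, test-4  (Y/P -> 1, N -> 0)
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def d_asym(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 in a, 1 in b
    return (r + s) / (q + r + s)

print(round(d_asym(patients["jack"], patients["mary"]), 2))  # 0.33
print(round(d_asym(patients["jack"], patients["jim"]), 2))   # 0.67
print(round(d_asym(patients["jim"], patients["mary"]), 2))   # 0.75
```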
95
Standardizing Numeric Data
 Z-score:
 X: raw score to be standardized, μ: mean of the population, σ:
standard deviation
 the distance between the raw score and the population mean
in units of the standard deviation
 negative when the raw score is below the mean, “+” when
above
 An alternative way: Calculate the mean absolute deviation
where
 standardized measure (z-score):
 Using mean absolute deviation is more robust than using
standard deviation
   z = (x − μ) / σ
   Mean absolute deviation: s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f| ),
   where m_f = (1/n) (x_1f + x_2f + … + x_nf)
   Standardized measure (z-score): z_if = (x_if − m_f) / s_f
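A minimal numpy sketch (hypothetical column with one extreme value) contrasting the standard-deviation-based and mean-absolute-deviation-based z-scores:

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 160.0])   # hypothetical attribute with one outlier

z_std = (x - x.mean()) / x.std()                 # classic z-score: (x - mean) / std
mad = np.mean(np.abs(x - x.mean()))              # mean absolute deviation s_f
z_mad = (x - x.mean()) / mad                     # z-score using s_f instead of sigma

print(np.round(z_std, 2))   # [-0.78 -0.59 -0.39 -0.2   1.96]
print(np.round(z_mad, 2))   # [-1.   -0.75 -0.5  -0.25  2.5 ]  -> the outlier stays more visible,
                            # since deviations are not squared when computing s_f
```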
96
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix:

point  attribute1  attribute2
x1        1           2
x2        3           5
x3        2           0
x4        4           5

Dissimilarity Matrix (with Euclidean distance):

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

(Figure: scatter plot of the four points)
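A minimal scipy sketch that reproduces the Euclidean dissimilarity matrix above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2],   # x1
              [3, 5],   # x2
              [2, 0],   # x3
              [4, 5]])  # x4

D = squareform(pdist(X, metric="euclidean"))   # pairwise distances as a square matrix
print(np.round(D, 2))
# [[0.   3.61 2.24 4.24]
#  [3.61 0.   5.1  1.  ]
#  [2.24 5.1  0.   5.39]
#  [4.24 1.   5.39 0.  ]]
```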
97
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
   d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h )^(1/h)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-
dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
98
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
   d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
 E.g., the Hamming distance: the number of bits that are
different between two binary vectors
 h = 2: (L2 norm) Euclidean distance
   d(i, j) = ( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )^(1/2)
 h → ∞: “supremum” (L_max norm, L_∞ norm) distance
 This is the maximum difference between any component
(attribute) of the vectors
   d(i, j) = max_f |x_if − x_jf|
99
Example: Minkowski Distance
Dissimilarity Matrices

point  attribute 1  attribute 2
x1         1            2
x2         3            5
x3         2            0
x4         4            5

Manhattan (L1):
      x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0

Euclidean (L2):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

Supremum (L∞):
      x1   x2   x3   x4
x1    0
x2    3    0
x3    2    5    0
x4    3    1    5    0

(Figure: scatter plot of the four points)
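A minimal scipy sketch reproducing the three dissimilarity matrices above (the metric names cityblock, euclidean, and chebyshev correspond to L1, L2, and L∞):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1 .. x4

for name, metric in [("L1", "cityblock"), ("L2", "euclidean"), ("Lmax", "chebyshev")]:
    print(name)
    print(np.round(squareform(pdist(X, metric=metric)), 2))
```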
100
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_if by their rank r_if ∈ {1, …, M_f}
 map the range of each variable onto [0, 1] by
replacing the i-th object in the f-th variable by
   z_if = (r_if − 1) / (M_f − 1)
 compute the dissimilarity using methods for
interval-scaled variables
101
Attributes of Mixed Type
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
 One may use a weighted formula to combine their
effects:
   d(i, j) = ( Σ_{f=1..p} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
 f is binary or nominal:
   d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is numeric: use the normalized distance
 f is ordinal
 Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1),
and treat z_if as interval-scaled
102
Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
   cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
103
Example: Cosine Similarity
 cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
 Ex: Find the similarity between documents 1 and 2.
   d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
   d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
   ||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = 42^0.5 = 6.481
   ||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = 17^0.5 = 4.123
   cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
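A minimal numpy sketch of the same computation:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
```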
104
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of
research.
105
References
 W. Cleveland, Visualizing Data, Hobart Press, 1993
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
 U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
 H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech.
Committee on Data Eng., 20(4), Dec. 1997
 D. A. Keim. Information visualization and visual data mining, IEEE trans. on Visualization
and Computer Graphics, 8(1), 2002
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 S. Santini and R. Jain, “Similarity measures”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, 21(9), 1999
 E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press,
2001
 C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies,
Information Visualization, 8(1), 2009
106
107
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
108
108
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
109
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not,
dangling, …
 Timeliness: timely update?
 Believability: how much are the data trusted to be correct?
 Interpretability: how easily the data can be
understood?
110
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
111
111
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
112
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission
error
 incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data

e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers

e.g., Salary = “−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,

Age=“42”, Birthday=“03/07/2010”

Was rating “1, 2, 3”, now rating “A, B, C”

discrepancy between duplicate records
 Intentional (e.g., disguised missing data)

Jan. 1 as everyone’s birthday?
113
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at
the time of entry
 history or changes of the data were not registered
114
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective when
the % of missing values per attribute varies
considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the
same class: smarter
 the most probable value: inference-based methods such as a Bayesian formula or a decision tree
115
Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
116
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency)
bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g.,
deal with possible outliers)
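A minimal Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the price list is a hypothetical example):

```python
from statistics import mean

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]  # equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(mean(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of its bin's two edges.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```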
117
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools

Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections

Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
118
118
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
119
119
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
120
120
Handling Redundancy in Data Integration
 Redundant data occur often when integrating multiple
databases
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
121
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:
   χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ² value, the more likely the variables
are related
 The cells that contribute the most to the Χ² value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car thefts in a city are correlated
 Both are causally linked to the third variable: population
122
Chi-Square Calculation: An Example
 Χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data
distribution in the two categories):

                            Play chess   Not play chess   Sum (row)
 Like science fiction        250 (90)       200 (360)        450
 Not like science fiction     50 (210)     1000 (840)       1050
 Sum (col.)                  300           1200             1500

   χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

 It shows that like_science_fiction and play_chess are
correlated in the group
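A minimal Python sketch of this computation, both by hand and with scipy (chi2_contingency is called without the Yates correction so it matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction:     play chess / not
                     [ 50, 1000]])  # not like science fiction: play chess / not

# Expected counts under independence: row_total * col_total / grand_total
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_manual, 2))        # 507.93

chi2, p, dof, exp = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)          # 507.93 with 1 degree of freedom
```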
123
Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s product-moment
coefficient):
   r_A,B = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
         = ( Σ_{i=1..n} a_i b_i − n Ā B̄ ) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σ_A and σ_B are the respective standard
deviations of A and B, and Σ a_i b_i is the sum of the AB cross-
product.
 If r_A,B > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
 r_A,B = 0: independent; r_A,B < 0: negatively correlated
124
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
125
Correlation (viewed as linear relationship)
 Correlation measures the linear relationship
between objects
 To compute correlation, we standardize data
objects, A and B, and then take their dot
product
   a′_k = (a_k − mean(A)) / std(A)
   b′_k = (b_k − mean(B)) / std(B)
   correlation(A, B) = A′ • B′
126
Covariance (Numeric Data)
 Covariance is similar to correlation:
   Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ_{i=1..n} (a_i − Ā)(b_i − B̄)
   Correlation coefficient: r_A,B = Cov(A, B) / (σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective mean
or expected values of A and B, and σ_A and σ_B are the respective
standard deviations of A and B.
 Positive covariance: If Cov_A,B > 0, then A and B both tend to be larger
than their expected values.
 Negative covariance: If Cov_A,B < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: if A and B are independent, then Cov_A,B = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data
follow multivariate normal distributions) does a covariance of 0 imply
independence.
Co-Variance: An Example
 It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
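A minimal numpy sketch of the stock example, using the shortcut above and numpy's covariance for comparison:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_shortcut = (A * B).mean() - A.mean() * B.mean()   # E(A*B) - mean(A)*mean(B)
print(cov_shortcut)                                   # 4.0 -> positive, so A and B rise together

print(np.cov(A, B, bias=True)[0, 1])                  # same value; bias=True divides by n, not n-1
```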
128
128
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
129
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or
almost the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

Wavelet transforms

Principal Components Analysis (PCA)

Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)

Regression and Log-Linear Models

Histograms, clustering, sampling

Data cube aggregation
 Data compression
130
Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering,
outlier analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
131
Mapping Data to a New Space
(Figure: two sine waves; two sine waves + noise; and their frequency-domain representation)
 Fourier transform
 Wavelet transform
132
What Is Wavelet Transform?
 Decomposes a signal into
different frequency
subbands
 Applicable to n-
dimensional signals
 Data are transformed to
preserve relative distance
between objects at different
levels of resolution
 Allow natural clusters to
become more
distinguishable
 Used for image compression
133
Wavelet Transformation
 Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
 Compressed approximation: store only a small fraction
of the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s,
when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length
L/2
 Applies the two functions recursively, until it reaches the desired length
(Figure: Haar-2 and Daubechies-4 wavelet basis functions)
134
Wavelet Decomposition
 Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
 Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients
are retained
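A minimal Python sketch of this averaging-and-differencing decomposition (a plain recursive Haar transform, not a production DWT implementation):

```python
def haar_decompose(s):
    """Return [overall average, detail coefficients ...] for a length-2^k list."""
    coeffs = []
    while len(s) > 1:
        averages = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]  # smoothing
        details  = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]  # difference
        coeffs = details + coeffs   # finer-level details go to the right
        s = averages
    return s + coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```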
135
Haar Wavelet Coefficients
(Figure: hierarchical decomposition structure, a.k.a. “error tree”, for the
original frequency distribution 2 2 0 2 3 5 4 4, showing the Haar coefficients
2.75, −1.25, 0.5, 0, 0, −1, −1, 0 and the “supports” of each coefficient)
136
Why Wavelet Transform?
 Use hat-shape filters
 Emphasize region where points cluster
 Suppress weaker information in their boundaries
 Effective removal of outliers
 Insensitive to noise, insensitive to input order
 Multi-resolution
 Detect arbitrary shaped clusters at different scales
 Efficient
 Complexity O(N)
 Only applicable to low dimensional data
137
[Figure: data points in the (x1, x2) plane with the principal direction e of largest variation]
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in
data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors of
the covariance matrix, and these eigenvectors define the new
space
138
 Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent
data
 Normalize input data: Each attribute falls within the same
range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k
principal component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
Principal Component Analysis (Steps)
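A minimal sketch of the PCA steps above using NumPy (eigenvectors of the covariance matrix); real applications would usually rely on a library implementation such as scikit-learn's PCA. The data here is random and only for illustration.

import numpy as np

def pca(X, k):
    """Project the N x n data matrix X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # normalize: center each attribute
    cov = np.cov(X_centered, rowvar=False)     # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]          # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]         # keep the k strongest components
    return X_centered @ components             # reduced N x k representation

X = np.random.rand(100, 5)
print(pca(X, k=2).shape)   # (100, 2)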
139
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained
in one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
140
Heuristic Search in Attribute Selection
 There are 2^d possible attribute combinations of d attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute
independence assumption: choose by significance
tests
 Best step-wise feature selection:

The best single-attribute is picked first

Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:

Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound: use attribute elimination and backtracking
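A minimal sketch (plain Python) of the best step-wise (greedy forward) selection heuristic listed above; score is a placeholder for any evaluation function, e.g., cross-validated accuracy of a model trained on the chosen attributes. The toy score below is invented for illustration.

def forward_selection(all_features, score, max_features):
    selected = []
    while len(selected) < max_features:
        remaining = [f for f in all_features if f not in selected]
        # pick the attribute that helps most, conditioned on those already chosen
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                    # no further improvement: stop early
        selected.append(best)
    return selected

useful = {"income", "age"}            # pretend only these two attributes matter
def toy_score(feats):
    return len(useful & set(feats))

print(forward_selection(["income", "age", "student_id", "zip"], toy_score, 3))
# -> ['income', 'age']; 'student_id' and 'zip' add nothing and are never picked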
141
Attribute Creation (Feature Generation)
 Create new attributes (features) that can capture the
important information in a data set more effectively
than the original ones
 Three general methodologies
 Attribute extraction

Domain-specific
 Mapping data to new space (see: data reduction)

E.g., Fourier transformation, wavelet
transformation, manifold approaches (not
covered)
 Attribute construction

Combining features (see: discriminative frequent
patterns in Chapter 7)

142
Data Reduction 2: Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and
discard the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in
m-D space as the product of appropriate marginal
subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …
143
Parametric Data Reduction: Regression
and Log-Linear Models
 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
 Log-linear model
 Approximates discrete multidimensional
probability distributions
144
Regression Analysis
 Regression analysis: A collective name
for techniques for the modeling and
analysis of numerical data consisting of
values of a dependent variable (also
called response variable or
measurement) and of one or more
independent variables (aka. explanatory
variables or predictors)
 The parameters are estimated so as to
give a "best fit" of the data
 Most commonly the best fit is evaluated
by using the least squares method, but
other criteria have also been used
 Used for prediction
(including forecasting of
time-series data),
inference, hypothesis
testing, and modeling of
causal relationships
[Figure: data points with the fitted regression line y = x + 1; for a given x value X1, the line's prediction Y1' is compared with the observed value Y1]
145
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to
be estimated by using the data at hand
 Using the least squares criterion on the known values of Y1, Y2, …,
X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability
distributions
 Estimate the probability of each point (tuple) in a multi-
dimensional space for a set of discretized attributes, based on a
smaller subset of dimensional combinations
Regression Analysis and Log-Linear Models
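A minimal sketch (plain Python, made-up data) of estimating the coefficients w and b of Y = w X + b with the closed-form least-squares formulas; in practice a library such as NumPy or scikit-learn would be used.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# w = sum((x_i - mean_x)(y_i - mean_y)) / sum((x_i - mean_x)^2), b = mean_y - w * mean_x
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(w, b)    # roughly 2.0 and 0.0 for this made-up data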
146
Histogram Analysis
 Divide data into buckets and
store average (sum) for each
bucket
 Partitioning rules:
 Equal-width: equal bucket
range
 Equal-frequency (or
equal-depth)
[Figure: example equal-width histogram; the x-axis shows values from 10,000 to 100,000 and the y-axis shows bucket counts from 0 to 40]
147
Clustering
 Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
 Can be very effective if data is clustered but not if
data is “smeared”
 Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 10
148
Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the
data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling:
 Note: Sampling may not reduce database I/Os (page at a time)
149
Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any
particular item
 Sampling without replacement
 Once an object is selected, it is removed from the
population
 Sampling with replacement
 A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the
same percentage of the data)
 Used in conjunction with skewed data
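A minimal sketch (Python standard library only) of the sampling variants above; the customer records and the age_group stratum label are invented for illustration.

import random
from collections import defaultdict

def srswor(data, n):                 # simple random sample without replacement
    return random.sample(data, n)

def srswr(data, n):                  # simple random sample with replacement
    return [random.choice(data) for _ in range(n)]

def stratified_sample(records, stratum_of, fraction):
    """Draw approximately the same percentage of records from every stratum."""
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

customers = [{"id": i, "age_group": "young" if i % 3 else "senior"} for i in range(100)]
print(len(stratified_sample(customers, lambda r: r["age_group"], 0.1)))   # about 10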
150
Sampling: With or without Replacement
SRSWOR
(simple random
sample without
replacement)
SRSWR
Raw Data
151
Sampling: Cluster or Stratified Sampling
Raw Data Cluster/Stratified Sample
152
Data Cube Aggregation
 The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of
interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
153
Data Reduction 3: Data Compression
 String compression
 There are extensive theories and well-tuned
algorithms
 Typically lossless, but only limited manipulation is
possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive
refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
 Time sequence is not audio
 Typically short and varies slowly with time
 Dimensionality and numerosity reduction may also be considered as forms of data compression
154
Data Compression
[Figure: lossless compression maps the original data to compressed data and back without loss; lossy compression recovers only an approximation of the original data]
155
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
156
Data Transformation
 A function that maps the entire set of values of a given attribute
to a new set of replacement values s.t. each old value can be
identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction

New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range

min-max normalization

z-score normalization

normalization by decimal scaling
 Discretization: Concept hierarchy climbing
157
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
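A minimal sketch (plain Python) of the three normalization methods above, reproducing the income example; the example values for decimal scaling are invented.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    j = 0                                        # smallest j such that max(|v'|) < 1
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))      # 0.716...
print(z_score(73600, 54000, 16000))      # 1.225
print(decimal_scaling([-986, 917]))      # [-0.986, 0.917]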
158
Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—numeric values, e.g., integers or real numbers
 Discretization: Divide the range of a continuous attribute into
intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
159
Data Discretization Methods
 Typical methods: All the methods can be applied
recursively
 Binning

Top-down split, unsupervised
 Histogram analysis

Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., 2
) analysis (unsupervised, bottom-
up merge)
160
Simple Discretization: Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
161
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
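A minimal sketch (plain Python) of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the price example above.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]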
162
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-width binning, by equal-frequency binning, and by K-means clustering; clustering yields the most natural intervals]
163
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details to be covered in Chapter 7
 Correlation analysis (e.g., Chi-merge: χ²-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ² values) to merge
 Merge performed recursively, until a predefined stopping condition is met
164
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a
data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data at multiple levels of granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric
and nominal data. For numeric data, use discretization methods
shown.
165
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by
explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
166
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
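A minimal sketch (plain Python) of the heuristic above: order the attributes by their number of distinct values, placing the most distinct at the lowest level; the counts are the ones shown on the slide.

distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# most specific (most distinct values) first, most general last
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country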
167
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
168
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem
 Remove redundancies
 Detect inconsistencies
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
169
References
 D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
 A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
 J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
 H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
 M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
 H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
 H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
 J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB’2001
 T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
 R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
170
170
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 4 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
171
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
172
What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately
from the organization’s operational database
 Support information processing by providing a solid platform
of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses
173
Data Warehouse—Subject-Oriented
 Organized around major subjects, such as customer,
product, sales
 Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
 Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
174
Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous
data sources
 relational databases, flat files, on-line transaction
records
 Data cleaning and data integration techniques are
applied.
 Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources

E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is
converted.
175
Data Warehouse—Time Variant
 The time horizon for the data warehouse is
significantly longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”
176
Data Warehouse—Nonvolatile
 A physically separate store of data transformed from
the operational environment
 Operational update of data does not occur in the
data warehouse environment
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing:

initial loading of data and access of data
177
OLTP vs. OLAP
                    OLTP                                   OLAP
users               clerk, IT professional                 knowledge worker
function            day-to-day operations                  decision support
DB design           application-oriented                   subject-oriented
data                current, up-to-date, detailed,         historical, summarized, multidimensional,
                    flat relational, isolated              integrated, consolidated
usage               repetitive                             ad-hoc
access              read/write, index/hash on prim. key    lots of scans
unit of work        short, simple transaction              complex query
# records accessed  tens                                   millions
# users             thousands                              hundreds
DB size             100 MB–GB                              100 GB–TB
metric              transaction throughput                 query throughput, response time
178
Why a Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data which
operational DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
 Note: There are more and more systems which perform OLAP
analysis directly on relational databases
179
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture — data sources (operational DBs, other sources) feed an ETL layer (extract, transform, load, refresh) with a monitor & integrator and a metadata repository; the data storage tier holds the data warehouse and data marts, served by an OLAP server/engine to front-end tools for analysis, querying, reporting, and data mining]
180
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects
spanning the entire organization
 Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as marketing data
mart

Independent vs. dependent (directly from warehouse) data
mart
 Virtual warehouse
 A set of views over operational databases
 Only some of the possible summary views may be materialized
181
Extraction, Transformation, and Loading (ETL)
 Data extraction
 get data from multiple, heterogeneous, and external
sources
 Data cleaning
 detect errors in the data and rectify them when
possible
 Data transformation
 convert data from legacy or host format to
warehouse format
 Load
 sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
 Refresh
 propagate the updates from the data sources to the
warehouse
182
Metadata Repository
 Meta data is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
 Operational meta-data
 data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit
trails)
 The algorithms used for summarization
 The mapping from operational environment to the data
warehouse
 Data related to system performance
 warehouse schema, view and derived data definitions
 Business data
183
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
184
From Tables and Spreadsheets to
Data Cubes
 A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level
of summarization, is called the apex cuboid. The lattice of
cuboids forms a data cube.
185
Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D (apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D (base) cuboid
186
Conceptual Modeling of Data Warehouses
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to
a set of dimension tables
 Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake
 Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
187
Example of Star Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
188
Example of Snowflake Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
189
Example of Fact Constellation
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; Measures: dollars_cost, units_shipped
190
A Concept Hierarchy:
Dimension (location)
[Figure: concept hierarchy for dimension location — all > region (Europe, North_America) > country (Germany, Spain, Canada, Mexico, ...) > city (Frankfurt, ..., Vancouver, Toronto, ...) > office (L. Chan, M. Wind, ...)]
191
Data Cube Measures: Three Categories
 Distributive: if the result derived by applying the
function to n aggregate values is the same as that
derived by applying the function on all the data without
partitioning

E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function

E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage
size needed to describe a subaggregate.

E.g., median(), mode(), rank()
192
View of Warehouses and Hierarchies
Specification of hierarchies
 Schema hierarchy
day < {month <
quarter; week} < year
 Set_grouping hierarchy
{1..10} < inexpensive
193
Multidimensional Data
 Sales volume as a function of product, month,
and region
[Figure: 3-D view of sales volume with axes Product, Region, and Month]
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month / Week > Day
194
A Sample Data Cube
[Figure: a sample 3-D data cube of sales with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), plus sum totals along each dimension; e.g., one aggregate cell holds the total annual sales of TVs in U.S.A.]
195
Cuboids Corresponding to the Cube
all
product date country
product,date product,country date, country
product, date, country
0-D (apex) cuboid
1-D cuboids
2-D cuboids
3-D (base) cuboid
196
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions

Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
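A minimal sketch of roll-up, slice, and dice on a toy sales table using the pandas library; the table and column names are invented for illustration and are not the book's data set.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2003, 2003, 2004, 2004],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Chicago", "Chicago", "Toronto", "Toronto"],
    "dollars": [400, 350, 500, 450],
})

# Roll up: climb the time hierarchy from quarter to year
rollup = sales.groupby("year")["dollars"].sum()

# Slice: select on one dimension (quarter = "Q1")
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions, then project
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Toronto")][["city", "dollars"]]

print(rollup, q1_slice, dice, sep="\n")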
197
Fig. 3.10 Typical
OLAP Operations
198
A Star-Net Query Model
[Figure: star-net query model with radial lines for Shipping Method (AIR-EXPRESS, TRUCK), Customer Orders (ORDER, CONTRACTS), Customer, Product (PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM), Organization (SALES PERSON, DISTRICT, DIVISION), Promotion, Location (CITY, COUNTRY, REGION), and Time (DAILY, QTRLY, ANNUALLY); each circle is called a footprint]
199
Browsing a Data Cube
 Visualization
 OLAP capabilities
 Interactive
manipulation
200
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
201
Design of Data Warehouse: A Business
Analysis Framework
 Four views regarding the design of a data warehouse
 Top-down view

allows selection of the relevant information necessary for
the data warehouse
 Data source view

exposes the information being captured, stored, and
managed by operational systems
 Data warehouse view

consists of fact tables and dimension tables
 Business query view

sees the perspectives of data in the warehouse from the
view of end-user
202
Data Warehouse Design Process
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step
before proceeding to the next
 Spiral: rapid generation of increasingly functional systems,
short turn around time, quick turn around
 Typical data warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record
203
Data Warehouse Development:
A Recommended Approach
[Figure: recommended approach — define a high-level corporate data model, refine it into data marts and an enterprise data warehouse in parallel, build distributed data marts, and combine them into a multi-tier data warehouse]
204
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing

supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
 Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling,
pivoting
 Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools
205
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses

DW contains integrated, consistent, cleaned data
 Available information processing structure
surrounding data warehouses

ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
 OLAP-based exploratory data analysis

Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions

Integration and swapping of multiple mining
functions, algorithms, and tasks
206
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
207
Efficient Data Cube Computation
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids in an n-dimensional cube with L
levels?
 Materialization of data cube
 Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
 Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.
Total number of cuboids: T = ∏_{i=1}^{n} (L_i + 1), where L_i is the number of levels associated with dimension i
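A tiny worked example of the formula above (plain Python, made-up level counts): a cube with 3 dimensions whose hierarchies have 3, 4, and 2 levels.

levels = [3, 4, 2]
T = 1
for L_i in levels:
    T *= (L_i + 1)
print(T)   # 4 * 5 * 3 = 60 cuboids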
208
The “Compute Cube” Operator
 Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
 Need to compute the following Group-Bys
(date, product, customer),
(date,product),(date, customer), (product, customer),
(date), (product), (customer)
()
(item)
(city)
()
(year)
(city, item) (city, year) (item, year)
(city, item, year)
209
Indexing OLAP Data: Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector: bit-op is fast
 The length of the bit vector: # of records in the base table
 The i-th bit is set if the i-th row of the base table has the value for
the indexed column
 not suitable for high cardinality domains
 A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al.
TODS’06]
Base table:
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer
Index on Region:
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0
Index on Type:
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
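A minimal sketch (plain Python) of a bitmap index over the Region column of the base table above; each distinct value gets one bit vector (stored here as a Python integer) whose i-th bit is set when the i-th row carries that value.

regions = ["Asia", "Europe", "Asia", "America", "Europe"]   # column "Region"

def build_bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i            # set bit i for this value's bit vector
    return index

idx = build_bitmap_index(regions)
# Fast bit-op: rows where Region is Asia OR Europe
asia_or_europe = idx["Asia"] | idx["Europe"]
print([i + 1 for i in range(len(regions)) if asia_or_europe >> i & 1])   # [1, 2, 3, 5]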
210
Indexing OLAP Data: Join Indices
 Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
 Traditional indices map the values to a list of
record ids
 It materializes relational join in JI file and
speeds up relational join
 In data warehouses, join index relates the
values of the dimensions of a star schema to
rows in the fact table.
 E.g. fact table: Sales and two dimensions
city and product

A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
 Join indices can span multiple dimensions
211
Efficient Processing OLAP Queries
 Determine which operations should be performed on the available
cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP op.
 Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
 Explore indexing structures and compressed vs. dense array structures in MOLAP
212
OLAP Server Architectures
 Relational OLAP (ROLAP)
 Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middle ware
 Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and
services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)
 Specialized support for SQL queries over star/snowflake schemas
213
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
214
Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular
measures
 How is it done?
 Collect the task-relevant data (initial relation) using a
relational database query
 Perform generalization by attribute removal or
attribute generalization
 Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
 Interaction with users for knowledge presentation
215
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate
students in the University database
 Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
 Step 2. Perform attribute-oriented induction
 Step 3. Present results in generalized relation, cross-tab,
or rule forms
216
Class Characterization: An Example
Initial relation (task-relevant data):
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …
Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to Country; Birth_date to Age range; Residence to City; Phone # removed; GPA to {Excl, VG, …}
Prime generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …
Crosstab (Gender × Birth_Region):
        Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
217
Basic Principles of Attribute-Oriented Induction
 Data focusing: task-relevant data, including
dimensions, and the result is the initial relation
 Attribute-removal: remove attribute A if there is a large
set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
 Attribute-generalization: If there is a large set of
distinct values for A, and there exists a set of
generalization operators on A, then select an operator
and generalize A
 Attribute-threshold control: typical 2-8,
specified/default
218
Attribute-Oriented Induction: Basic
Algorithm
 InitialRel: Query processing of task-relevant data,
deriving the initial relation.
 PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan
for each attribute: removal? or how high to generalize?
 PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
 Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
219
Presentation of Generalized Results
 Generalized relation:
 Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to
contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules with
quantitative information associated with it, e.g.,
∀x, grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
220
Mining Class Comparisons
 Comparison: Comparing two or more classes
 Method:
 Partition the set of relevant data into the target class and the
contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different
classes
221
Concept Description vs. Cube-Based OLAP
 Similarity:
 Data generalization
 Presentation of data summarization at multiple levels
of abstraction
 Interactive drilling, pivoting, slicing and dicing
 Differences:
 OLAP has systematic preprocessing, query
independent, and can drill down to rather low level
 AOI has automated desired level allocation, and may
perform dimension relevance analysis/ranking when
there are many relevant dimensions
 AOI works on the data which are not in relational
forms
222
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
223
Summary
 Data warehousing: A multi-dimensional model of a data warehouse
 A data cube consists of dimensions & measures
 Star schema, snowflake schema, fact constellations
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 Data Warehouse Architecture, Design, and Usage
 Multi-tiered architecture
 Business analysis design framework
 Information processing, analytical processing, data mining, OLAM
(Online Analytical Mining)
 Implementation: Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Indexing OLAP data: Bitmap index and join index
 OLAP query processing
 OLAP servers: ROLAP, MOLAP, HOLAP
 Data generalization: Attribute-oriented induction
224
References (I)
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
 E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.
 J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and
sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
 A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
 J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
 V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
 J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
225
References (II)
 C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
 W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
 R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
 P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–
11, Sept. 1995.
 P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
 Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
 S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
 A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
 D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
 J. Widom. Research problems in data warehousing. CIKM’95
 K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
226
Surplus Slides
227
Compression of Bitmap Indices
 Bitmap indexes must be compressed to reduce I/O
costs and minimize CPU usage—majority of the bits
are 0’s
 Two compression schemes:
 Byte-aligned Bitmap Code (BBC)
 Word-Aligned Hybrid (WAH) code
 Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
 Optimal on attributes of low cardinality as well as
those of high cardinality.
 WAH outperforms BBC by about a factor of two
228
228
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 5 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
229
229
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
230
230
Data Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
231
Data Cube: A Lattice of Cuboids
 Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child
cells
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Urbana, *)
5. (*, milk, Chicago, *)
6. (*, milk, *, *)
all
time,item
time,item,location
time, item, location, supplier
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
232
232
Cube Materialization:
Full Cube vs. Iceberg Cube
 Full cube vs. iceberg cube
compute cube sales iceberg as
select month, city, customer group, count(*)
from salesInfo
cube by month, city, customer group
having count(*) >= min support
 Computing only the cuboid cells whose measure satisfies the
iceberg condition
 Only a small portion of cells may be “above the water’’ in a
sparse cube
 Avoid explosive growth: A cube with 100 dimensions
 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)

How many aggregate cells if “having count >= 1”?

What about “having count >= 2”?
iceberg
condition
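A minimal sketch of the iceberg-cube idea above using the pandas library: compute every group-by of the three dimensions but keep only the cells whose count satisfies the iceberg condition. The toy table and column names are invented for illustration.

from itertools import combinations
import pandas as pd

salesInfo = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "city": ["Chicago", "Chicago", "Urbana", "Chicago"],
    "customer_group": ["Edu", "Edu", "Edu", "Biz"],
})
min_sup = 2

dims = ["month", "city", "customer_group"]
iceberg = {}
for k in range(1, len(dims) + 1):
    for combo in combinations(dims, k):              # every (non-apex) cuboid
        counts = salesInfo.groupby(list(combo)).size()
        iceberg[combo] = counts[counts >= min_sup]   # keep cells above the water
print(iceberg[("month", "city")])
# (Jan, Chicago) is the only cell of that cuboid with count >= 2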
233
Iceberg Cube, Closed Cube & Cube Shell
 Is iceberg cube good enough?
 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
 How many cells will the iceberg cube have if having count(*) >=
10? Hint: A huge but tricky number!
 Closed cube:
 Closed cell c: if there exists no cell d, s.t. d is a descendant of c,
and d has the same measure value as c.
 Closed cube: a cube consisting of only closed cells
 What is the closed cube of the above base cuboid? Hint: only 3
cells
 Cube Shell
 Precompute only the cuboids involving a small # of
dimensions, e.g., 3
 More dimension combinations will need to be computed on
the fly
For (A1, A2, … A10), how many combinations to
compute?
234
234
Roadmap for Efficient Computation
 General cube computation heuristics (Agarwal et al.’96)
 Computing full/iceberg cubes: 3 methodologies
 Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande &
Naughton, SIGMOD’97)
 Top-down:

BUC (Beyer & Ramakrishnan, SIGMOD’99)

H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
 Integrating Top-Down and Bottom-Up:

Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)
 High-dimensional OLAP: A Minimal Cubing Approach (Li, et al.
VLDB’04)
 Computing alternative kinds of cubes:
 Partial cube, closed cube, approximate cube, etc.
235
235
General Heuristics (Agarwal et al. VLDB’96)
 Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related tuples
 Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
 Smallest-child: computing a cuboid from the smallest,
previously computed cuboid
 Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
 Amortize-scans: computing as many as possible cuboids at the
same time to amortize disk reads
 Share-sorts: sharing sorting costs cross multiple cuboids when
sort-based method is used
 Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used
236
236
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
237
237
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
238
238
Multi-Way Array Aggregation
 Array-based “bottom-up” algorithm
 Using multi-dimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on
multiple dimensions
 Intermediate aggregate values are
re-used for computing ancestor
cuboids
 Cannot do Apriori pruning: No
iceberg optimization
ABC
AB
A
All
B
AC BC
C
239
239
Multi-way Array Aggregation for Cube
Computation (MOLAP)
 Partition arrays into chunks (a small subcube which fits in
memory).
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in “multiway” by visiting cube cells in the
order which minimizes the # of times to visit each cell, and
reduces memory access and storage cost.
What is the best
traversing order
to do multi-way
aggregation?
[Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64]
240
Multi-way Array Aggregation for Cube
Computation (3-D to 2-D)
[Figure: the cuboid lattice — ABC at the base; AB, AC, BC; A, B, C; all at the apex — annotated with the 3-D to 2-D aggregation step]
 The best order is the one that minimizes the memory requirement and reduces I/Os
241
Multi-way Array Aggregation for Cube
Computation (2-D to 1-D)
[Figure: the same cuboid lattice annotated with the 2-D to 1-D aggregation step]
242
242
Multi-Way Array Aggregation for Cube
Computation (Method Summary)
 Method: the planes should be sorted and computed
according to their size in ascending order
 Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
 Limitation of the method: computing well only for a
small number of dimensions
 If there are a large number of dimensions, “top-
down” computation and iceberg cube computation
methods can be explored
243
243
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
244
244
Bottom-Up Computation (BUC)
 BUC (Beyer & Ramakrishnan,
SIGMOD’99)
 Bottom-up cube computation
(Note: top-down in our view!)
 Divides dimensions into
partitions and facilitates iceberg
pruning
 If a partition does not satisfy
min_sup, its descendants can
be pruned
 If minsup = 1 ⇒ compute full CUBE!
 No simultaneous aggregation
[Figure: the cuboid lattice of (A, B, C, D) and the BUC processing tree, whose nodes are numbered 1–16 (all, A, AB, ABC, ABCD, ABD, AC, ACD, AD, B, BC, BCD, BD, C, CD, D) in the order BUC visits them]
245
245
BUC: Partitioning
 Usually, entire data set can’t
fit in main memory
 Sort distinct values
 partition into blocks that fit
 Continue processing
 Optimizations
 Partitioning

External Sorting, Hashing, Counting Sort
 Ordering dimensions to encourage pruning

Cardinality, Skew, Correlation
 Collapsing duplicates

Can’t do holistic aggregates anymore!
246
246
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
247
247
Star-Cubing: An Integrating Method
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration, VLDB'03
 Explore shared dimensions
 E.g., dimension A is the shared dimension of ACD and AD
 ABD/AB means cuboid ABD has shared dimensions AB
 Allows for shared computations
 e.g., cuboid AB is computed simultaneously as ABD
[Figure: Star-Cubing lattice with shared dimensions — ABCD/all; ACD/A, ABD/AB, ABC/ABC, BCD; AD/A, BD/B, CD, AC/AC, BC/BC; C/C, D/D]
 Aggregate in a top-down
manner but with the bottom-
up sub-layer underneath
which will allow Apriori
pruning
 Shared dimensions grow in
bottom-up fashion
248
248
Iceberg Pruning in Shared Dimensions
 Anti-monotonic property of shared dimensions
 If the measure is anti-monotonic, and if the
aggregate value on a shared dimension does
not satisfy the iceberg condition, then all the
cells extended from this shared dimension
cannot satisfy the condition either
 Intuition: if we can compute the shared
dimensions before the actual cuboid, we can use
them to do Apriori pruning
 Problem: how to prune while still aggregate
simultaneously on multiple dimensions?
249
249
Cell Trees
 Use a tree structure similar
to H-tree to represent
cuboids
 Collapses common prefixes
to save memory
 Keep count at node
 Traverse the tree to
retrieve a particular tuple
250
250
Star Attributes and Star Nodes
 Intuition: If a single-dimensional
aggregate on an attribute value p
does not satisfy the iceberg
condition, it is useless to
distinguish them during the
iceberg computation
 E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3
 Solution: Replace such attributes
by a *. Such attributes are star
attributes, and the corresponding
nodes in the cell tree are star
nodes
A B C D Count
a1 b1 c1 d1 1
a1 b1 c4 d3 1
a1 b2 c2 d2 1
a2 b3 c3 d4 1
a2 b4 c3 d4 1
251
251
Example: Star Reduction
 Suppose minsup = 2
 Perform one-dimensional
aggregation. Replace attribute
values whose count < 2 with *.
And collapse all *’s together
 Resulting table has all such
attributes replaced with the star-
attribute
 With regards to the iceberg
computation, this new table is a
lossless compression of the original
table
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2
A B C D Count
a1 b1 * * 1
a1 b1 * * 1
a1 * * * 1
a2 * c3 d4 1
a2 * c3 d4 1
252
252
Star Tree
 Given the new compressed
table, it is possible to
construct the
corresponding cell tree—
called star tree
 Keep a star table at the side
for easy lookup of star
attributes
 The star tree is a lossless
compression of the original
cell tree
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2
253
253
Star-Cubing Algorithm—DFS on Lattice Tree
[Figure: DFS on the lattice tree (ABCD/A with its ACD/A, ABD/AB, ABC/ABC, and BCD descendants) and the corresponding base star tree: root: 5 with children a1: 3 and a2: 2 and their b, c, d descendants; the BCD tree is shown alongside with its b*, b1, c*, c3, d*, d4 nodes and counts]
254
254
Multi-Way Aggregation
[Figure: multi-way aggregation across the ABCD, ACD/A, ABD/AB, ABC/ABC, and BCD trees]
255
255
Star-Cubing Algorithm—DFS on Star-Tree
[Figure: DFS on the star tree, aggregating simultaneously into the ABCD, ACD/A, ABD/AB, ABC/ABC, and BCD trees]
256
256
Multi-Way Star-Tree Aggregation
 Start depth-first search at the root of the base star tree
 At each new node in the DFS, create corresponding star tree that are descendants
of the current tree according to the integrated traversal ordering
 E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created
 When DFS reaches b*, the ABD/AD tree is created
 The counts in the base tree are carried over to the new trees
 When DFS reaches a leaf node (e.g., d*), start backtracking
 On every backtracking branch, the count in the corresponding trees are output,
the tree is destroyed, and the node in the base tree is destroyed
 Example
 When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and
destroyed
 When traversing from c* back to b*, the a1b*D/a1b* tree is output and
destroyed
 When at b*, jump to b1 and repeat similar process
257
257
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
258
258
The Curse of Dimensionality
 None of the previous cubing method can handle high
dimensionality!
 A database of 600k tuples. Each dimension has
cardinality of 100 and zipf of 2.
259
259
Motivation of High-D OLAP
 X. Li, J. Han, and H. Gonzalez, High-Dimensional
OLAP: A Minimal Cubing Approach, VLDB'04
 Challenge to current cubing methods:
 The “curse of dimensionality’’ problem
 Iceberg cube and compressed cubes: only delay
the inevitable explosion
 Full materialization: still significant overhead in
accessing results on disk
 High-D OLAP is needed in applications
 Science and engineering analysis
 Bio-data analysis: thousands of genes
 Statistical surveys: hundreds of variables
260
260
Fast High-D OLAP with Minimal Cubing
 Observation: OLAP occurs only on a small subset of
dimensions at a time
 Semi-Online Computational Model
1. Partition the set of dimensions into shell
fragments
2. Compute data cubes for each shell fragment
while retaining inverted indices or value-list
indices
3. Given the pre-computed fragment cubes,
dynamically compute cube cells of the high-dimensional data cube online
261
261
Properties of Proposed Method
 Partitions the data vertically
 Reduces high-dimensional cube into a set of lower
dimensional cubes
 Online re-construction of original high-dimensional
space
 Lossless reduction
 Offers tradeoffs between the amount of pre-
processing and the speed of online computation
262
262
Example Computation
 Let the cube aggregation function be count
 Divide the 5 dimensions into 2 shell fragments:
 (A, B, C) and (D, E)
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
263
263
1-D Inverted Indices
 Build a traditional inverted index or RID list
Attribute Value   TID List        List Size
a1                1, 2, 3         3
a2                4, 5            2
b1                1, 4, 5         3
b2                2, 3            2
c1                1, 2, 3, 4, 5   5
d1                1, 3, 4, 5      4
d2                2               1
e1                1, 2            2
e2                3, 4            2
e3                5               1
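A minimal sketch (plain Python) that builds the 1-D inverted (TID-list) indices for the 5-tuple example table above and then intersects them to obtain a 2-D cell of the ABC shell-fragment cube.

rows = [
    {"tid": 1, "A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"},
    {"tid": 2, "A": "a1", "B": "b2", "C": "c1", "D": "d2", "E": "e1"},
    {"tid": 3, "A": "a1", "B": "b2", "C": "c1", "D": "d1", "E": "e2"},
    {"tid": 4, "A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e2"},
    {"tid": 5, "A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e3"},
]

inverted = {}                               # (attribute, value) -> set of TIDs
for r in rows:
    for attr in ("A", "B", "C", "D", "E"):
        inverted.setdefault((attr, r[attr]), set()).add(r["tid"])

print(sorted(inverted[("A", "a1")]))                            # [1, 2, 3]
# Cell (a1, b2) of fragment cube ABC = intersection of the 1-D TID lists
print(sorted(inverted[("A", "a1")] & inverted[("B", "b2")]))    # [2, 3]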
264
264
Shell Fragment Cubes: Ideas
 Generalize the 1-D inverted indices to multi-dimensional
ones in the data cube sense
 Compute all cuboids for data cubes ABC and DE while
retaining the inverted indices
 For example, shell
fragment cube ABC
contains 7 cuboids:
 A, B, C
 AB, AC, BC
 ABC
 This completes the offline
computation stage
Cell    Intersection             TID List  List Size
a1 b1   {1, 2, 3} ∩ {1, 4, 5}    {1}       1
a1 b2   {1, 2, 3} ∩ {2, 3}       {2, 3}    2
a2 b1   {4, 5} ∩ {1, 4, 5}       {4, 5}    2
a2 b2   {4, 5} ∩ {2, 3}          {}        0
265
265
Shell Fragment Cubes: Size and Design
 Given a database of T tuples, D dimensions, and F shell
fragment size, the fragment cubes’ space requirement is:
 For F < 5, the growth is sub-linear
 Shell fragments do not have to be disjoint
 Fragment groupings can be arbitrary to allow for
maximum online performance
 Known common combinations (e.g.,<city, state>)
should be grouped together.
 Shell fragment sizes can be adjusted for optimal balance
between offline and online computation
O( ⌈D/F⌉ × (2^F − 1) × T )
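A quick numeric reading of this bound (a sketch with illustrative parameters; T = 10^6 and D = 60 are assumptions, not values from the slides):

import math

def fragment_cube_space(T, D, F):
    # ceil(D/F) fragments, each materializing (2^F - 1) cuboids over T tuples
    return math.ceil(D / F) * (2**F - 1) * T

for F in (1, 2, 3, 4, 5):          # growth in F is sub-linear for small F
    print(F, fragment_cube_space(T=10**6, D=60, F=F))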
266
266
ID_Measure Table
 If measures other than count are present, store in
ID_measure table separate from the shell fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30
267
267
The Frag-Shells Algorithm
1. Partition set of dimension (A1,…,An) into a set of k fragments (P1,
…,Pk).
2. Scan base table once and do the following
3. insert <tid, measure> into ID_measure table.
4. for each attribute value ai of each dimension Ai
5. build inverted index entry <ai, tidlist>
6. For each fragment partition Pi
7. build local fragment cube Si by intersecting tid-lists in
bottom-up fashion.
268
268
Frag-Shells (2)
A B C D E F …
ABC
Cube
DEF
Cube
D Cuboid
EF Cuboid
DE Cuboid
Cell Tuple-ID List
d1 e1 {1, 3, 8, 9}
d1 e2 {2, 4, 6, 7}
d2 e1 {5, 10}
… …
Dimensions
269
269
Online Query Computation: Query
 A query has the general form ⟨a1, a2, …, an⟩ : M
 Each ai has 3 possible values
1. Instantiated value
2. Aggregate * function
3. Inquire ? function
 For example, ⟨3, ?, ?, *, 1⟩ : count returns a 2-D data
cube.
270
270
Online Query Computation: Method
 Given the fragment cubes, process a query as
follows
1. Divide the query into fragment, same as the
shell
2. Fetch the corresponding TID list for each
fragment from the fragment cube
3. Intersect the TID lists from each fragment to
construct instantiated base table
4. Compute the data cube using the base table
with any cubing algorithm
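A minimal sketch of steps 1-3 for the instantiated part of a query, reusing rows and inv from the earlier sketch (in the real method the TID lists come from the precomputed fragment cubes, and any inquired ? dimensions would then be cubed over the retrieved tuples):

# Toy online evaluation: intersect the TID lists of the instantiated values,
# then aggregate the measure over the surviving tuple ids.
def answer_query(instantiated, inv, id_measure):
    """instantiated: dict like {"D": "d1", "E": "e2"};
    inv: the 1-D inverted indices built above;
    id_measure: tid -> measure value (1 for count)."""
    tid_sets = [inv[(dim, val)] for dim, val in instantiated.items()]
    tids = set.intersection(*tid_sets) if tid_sets else set(id_measure)
    return sum(id_measure[t] for t in tids)

id_measure = {tid: 1 for tid in rows}                          # count measure
print(answer_query({"D": "d1", "E": "e2"}, inv, id_measure))   # -> 2 (tids 3 and 4)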
271
271
Online Query Computation: Sketch
A B C D E F G H I J K L M N …
Online
Cube
Instantiated
Base Table
272
272
Experiment: Size vs. Dimensionality (50
and 100 cardinality)
 (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3.
 (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2.
273
273
Experiments on Real World Data
 UCI Forest CoverType data set
 54 dimensions, 581K tuples
 Shell fragments of size 2 took 33 seconds and
325MB to compute
 3-D subquery with 1 instantiated dimension: 85 ms ~ 1.4 sec.
 Longitudinal Study of Vocational Rehab. Data
 24 dimensions, 8818 tuples
 Shell fragments of size 3 took 0.9 seconds and
60MB to compute
 5-D query with 0 instantiated dimensions: 227 ms ~ 2.6 sec.
274
274
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube
 Ranking Cube
 Multidimensional Data Analysis in Cube Space
 Summary
275
275
Processing Advanced Queries by
Exploring Data Cube Technology
 Sampling Cube
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling
Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
 Ranking Cube
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k
queries with multi-dimensional selections: The
ranking cube approach. VLDB’06
 Other advanced cubes for processing data and
queries
 Stream cube, spatial cube, multimedia cube, text
cube, RFID cube, etc. — to be studied in volume 2
276
276
Statistical Surveys and OLAP
 Statistical survey: A popular tool to collect information
about a population based on a sample
 Ex.: TV ratings, US Census, election polls
 A common tool in politics, health, market research,
science, and many more
 An efficient way of collecting information (Data
collection is expensive)
 Many statistical tools available, to determine validity
 Confidence intervals
 Hypothesis tests
 OLAP (multidimensional analysis) on survey data
 highly desirable but can it be done well?
277
277
Surveys: Sample vs. Whole Population
Age/Education High-school College Graduate
18
19
20
…
Data is only a sample of population
278
278
Problems for Drilling in Multidim. Space
Age/Education High-school College Graduate
18
19
20
…
Data is only a sample of population but samples could be small
when drilling to certain multidimensional space
279
279
OLAP on Survey (i.e., Sampling) Data
Age/Education High-school College Graduate
18
19
20
…
 Semantics of query is unchanged
 Input data has changed
280
280
Challenges for OLAP on Sampling Data
 Computing confidence intervals in OLAP
context
 No data?
 Not exactly. No data in subspaces in cube
 Sparse data
 Causes include sampling bias and query
selection bias
 Curse of dimensionality
 Survey data can be high dimensional
 Over 600 dimensions in real world
example
 Impossible to fully materialize
281
281
Example 1: Confidence Interval
Age/Education High-school College Graduate
18
19
20
…
What is the average income of 19-year-old high-school students?
Return not only query result but also confidence interval
282
282
Confidence Interval
 Confidence interval at confidence level (1 − α): x̄ ± t_c · σ_x̄
 x is a sample of the data set; x̄ is the mean of the sample
 t_c is the critical t-value, obtained by a table look-up
 σ_x̄ = s / √l is the estimated standard error of the mean
(s: sample standard deviation, l: sample size)
 Example: $50,000 ± $3,000 with 95% confidence
 Treat the points in a cube cell as a sample
 Compute the confidence interval as for a traditional sample set
 Return the answer in the form of a confidence interval
 Indicates the quality of the query answer
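A minimal Python sketch of the cell-level computation (the income values are made up for illustration; scipy's t distribution plays the role of the t-table look-up):

import math
from scipy import stats

def mean_confidence_interval(sample, alpha=0.05):
    """Mean and half-width of the (1 - alpha) confidence interval for a cell."""
    l = len(sample)
    mean = sum(sample) / l
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (l - 1))  # sample std dev
    se = s / math.sqrt(l)                        # estimated standard error of the mean
    t_c = stats.t.ppf(1 - alpha / 2, df=l - 1)   # critical t-value (table look-up)
    return mean, t_c * se

incomes = [46_000, 52_000, 49_500, 55_000, 47_500]   # made-up cell contents
m, half = mean_confidence_interval(incomes)
print(f"${m:,.0f} ± ${half:,.0f} with 95% confidence")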
283
283
Efficient Computing Confidence Interval Measures
 Efficient computation in all cells in data cube

Both mean and confidence interval are algebraic

Why is the confidence interval measure algebraic? Because
σ_x̄ = s / √l is algebraic, where both s (the standard deviation)
and l (the count) are algebraic
 Thus one can calculate cells efficiently at more general
cuboids without having to start at the base cuboid each
time
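A sketch of why this matters computationally: if each base cell keeps (count, sum, sum of squares), higher-level cells can be combined from their children and the confidence interval recomputed without touching the base data (the summary layout here is an assumption for illustration, not the exact storage of the paper):

import math
from scipy import stats

def combine(cells):
    # component-wise sum of (l, sum, sum of squares) summaries of child cells
    l = sum(c[0] for c in cells)
    s1 = sum(c[1] for c in cells)
    s2 = sum(c[2] for c in cells)
    return l, s1, s2

def ci_from_summary(l, s1, s2, alpha=0.05):
    mean = s1 / l
    var = (s2 - l * mean * mean) / (l - 1)       # sample variance from the sums
    t_c = stats.t.ppf(1 - alpha / 2, df=l - 1)
    return mean, t_c * math.sqrt(var / l)

children = [(3, 150_000.0, 7.6e9), (2, 98_000.0, 4.9e9)]   # illustrative summaries
print(ci_from_summary(*combine(children)))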
284
284
Example 2: Query Expansion
Age/Education High-school College Graduate
18
19
20
…
What is the average income of 19-year-old college students?
285
285
Boosting Confidence by Query Expansion
 From the example: The queried cell “19-year-old
college students” contains only 2 samples
 Confidence interval is large (i.e., low confidence). Why?
 Small sample size
 High standard deviation with samples
 Small sample sizes can occur at relatively low
dimensional selections
 Collect more data?― expensive!
 Use data in other cells? Maybe, but have to be
careful
286
286
Intra-Cuboid Expansion: Choice 1
Age/Education High-school College Graduate
18
19
20
…
Expand query to include 18 and 20 year olds?
287
287
Intra-Cuboid Expansion: Choice 2
Age/Education High-school College Graduate
18
19
20
…
Expand query to include high-school and graduate students?
288
288
Query Expansion
289
Intra-Cuboid Expansion
 Combine other cells’ data into the queried cell’s own to “boost”
confidence
 Only if the cells share semantic and cube-value similarity
 Use only if necessary
 Bigger sample size will decrease confidence
interval
 Cell segment similarity
 Some dimensions are clear: Age
 Some are fuzzy: Occupation
 May need domain knowledge
 Cell value similarity
 How to determine if two cells’ samples come
from the same population?
 Two-sample t-test (confidence-based)
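A sketch of the cell-value similarity test using a two-sample t-test (the sample values are illustrative, not from the slides):

from scipy import stats

def similar_enough(cell_a, cell_b, alpha=0.05):
    """Decide whether two cells' samples plausibly come from the same population.
    If the two-sample t-test does not reject equality of means, the candidate
    cell may be merged into the query cell to boost the sample size."""
    t_stat, p_value = stats.ttest_ind(cell_a, cell_b, equal_var=False)
    return p_value > alpha

query_cell = [31_000, 29_500]                    # the 2-sample "19, college" cell
candidate = [30_000, 28_000, 33_500, 29_000]     # e.g. the "20, college" cell
print(similar_enough(query_cell, candidate))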
290
290
Inter-Cuboid Expansion
 If a query dimension is

Not correlated with cube value

But is causing small sample size by drilling down
too much
 Remove dimension (i.e., generalize to *) and move to
a more general cuboid
 Can use two-sample t-test to determine similarity
between two cells across cuboids
 Can also use a different method to be shown later
291
291
Query Expansion Experiments
 Real world sample data: 600 dimensions and
750,000 tuples
 A 0.05% subset of the data is used to simulate the “sample”
(allows error checking against the full data)
292
292
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube
 Ranking Cube
 Multidimensional Data Analysis in Cube Space
 Summary
293
Ranking Cubes – Efficient Computation of
Ranking queries
 Data cube helps not only OLAP but also ranked search
 (top-k) ranking query: only returns the best k results
according to a user-specified preference, consisting of
(1) a selection condition and (2) a ranking function
 Ex.: Search for apartments with expected price 1000
and expected square feet 800

Select top 1 from Apartment

where City = “LA” and Num_Bedroom = 2

order by [price – 1000]^2 + [sq feet - 800]^2 asc
 Efficiency question: Can we only search what we need?
 Build a ranking cube on both selection dimensions
and ranking dimensions
294
Sliced Partition
for city=“LA”
Sliced Partition
for BR=2
Ranking Cube: Partition Data on Both
Selection and Ranking Dimensions
One single data
partition as the template
Slice the data partition
by selection conditions
Partition for
all data
295
Materialize Ranking-Cube
tid City BR Price Sq feet Block ID
t1 SEA 1 500 600 5
t2 CLE 2 700 800 5
t3 SEA 1 800 900 2
t4 CLE 3 1000 1000 6
t5 LA 1 1100 200 15
t6 LA 2 1200 500 11
t7 LA 2 1200 560 11
t8 CLE 3 1350 1120 4
Step 1: Partition Data on
Ranking Dimensions
Step 2: Group data by
Selection Dimensions
(Selection-dimension cuboids: City, BR, and City & BR;
City ∈ {CLE, LA, SEA}, BR ∈ {1, 2, 3, 4})
Step 3: Compute Measures for each group
For the cell (LA)
(The ranking dimensions are partitioned into a 4 × 4 grid of
blocks with IDs 1–16.)
Block-level: {11, 15}
Data-level: {11: t6, t7; 15: t5}
296
Search with Ranking-Cube:
Simultaneously Push Selection and Ranking
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
(Figure: the 2-D price × sq-feet space with the query point at
price = 1000, sq feet = 800; blocks 11 and 15 hold the LA tuples.)
Without ranking-cube: start search from the whole data
With ranking-cube: start search from the LA blocks
Measure for LA: block-level {11, 15}; data-level {11: t6, t7; 15: t5}
Given the bin boundaries,
locate the block with top score
Bin boundary for price [500, 600, 800, 1100,1350]
Bin boundary for sq feet [200, 400, 600, 800, 1120]
297
Processing Ranking Query: Execution Trace
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
(Same figure: query point at price = 1000, sq feet = 800; with the
ranking cube, the search starts from the LA blocks 11 and 15.)
Measure for LA: block-level {11, 15}; data-level {11: t6, t7; 15: t5}
f=[price-1000]^2 + [sq feet – 800]^2
Bin boundary for price [500, 600, 800, 1100,1350]
Bin boundary for sq feet [200, 400, 600, 800, 1120]
Execution Trace:
1. Retrieve High-level measure for LA {11, 15}
2. Estimate lower bound score for block 11, 15
f(block 11) = 40,000, f(block 15) = 160,000
3. Retrieve block 11
4. Retrieve low-level measure for block 11
5. f(t6) = 130,000, f(t7) = 97,600
Output t7, done!
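A minimal sketch of the block-at-a-time search, using the LA measures and the block lower bounds quoted in the trace (the derivation of the block bounds from the bin boundaries is omitted):

import heapq

def f(price, sqft):                                  # ranking function of the query
    return (price - 1000) ** 2 + (sqft - 800) ** 2

# Block-level and data-level measures for the cell (LA), as on the slide;
# the block lower bounds are the values from step 2 of the trace.
la_blocks = {11: {"bound": 40_000, "tuples": [("t6", 1200, 500), ("t7", 1200, 560)]},
             15: {"bound": 160_000, "tuples": [("t5", 1100, 200)]}}

def top1(blocks):
    # Visit blocks in increasing order of their lower bound; stop once the best
    # tuple seen so far scores no worse than the next block's lower bound.
    heap = [(b["bound"], bid) for bid, b in blocks.items()]
    heapq.heapify(heap)
    best = None                                      # (score, tid)
    while heap:
        bound, bid = heapq.heappop(heap)
        if best is not None and best[0] <= bound:
            break
        for tid, price, sqft in blocks[bid]["tuples"]:
            score = f(price, sqft)
            if best is None or score < best[0]:
                best = (score, tid)
    return best

print(top1(la_blocks))    # (97600, 't7'), matching the trace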
298
Ranking Cube: Methodology and Extension
 Ranking cube methodology
 Push selection and ranking simultaneously
 It works for many sophisticated ranking functions
 How to support high-dimensional data?
 Materialize only those atomic cuboids that contain
single selection dimensions

Uses the idea similar to high-dimensional OLAP

Achieves low space overhead and high
performance in answering ranking queries with
a high number of selection dimensions
299
299
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
300
300
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
301
Data Mining in Cube Space
 Data cube greatly increases the analysis bandwidth
 Four ways in which OLAP-style analysis and data
mining can interact
 Using cube space to define data space for mining
 Using OLAP queries to generate features and targets
for mining, e.g., multi-feature cube
 Using data-mining models as building blocks in a
multi-step mining process, e.g., prediction cube
 Using data-cube computation techniques to speed
up repeated model construction

Cube-space data mining may require building a
model for each candidate data space

Sharing computation across model construction for
different candidates may lead to efficient processing
Prediction Cubes
 Prediction cube: A cube structure that stores
prediction models in multidimensional data space and
supports prediction in OLAP manner
 Prediction models are used as building blocks to
define the interestingness of subsets of data, i.e., to
answer which subsets of data indicate better
prediction
303
How to Determine the Prediction Power
of an Attribute?
 Ex. A customer table D:
 Two dimensions Z: Time (Month, Year ) and Location
(State, Country)
 Two features X: Gender and Salary
 One class-label attribute Y: Valued Customer
 Q: “Are there times and locations in which the value of
a customer depended greatly on the customer’s
gender (i.e., Gender: predictiveness attribute V)?”
 Idea:
 Compute the difference between the model built
using X to predict Y and the model built using X − V
to predict Y
 If the difference is large, V must play an important
role at predicting Y
304
Efficient Computation of Prediction Cubes
 Naïve method: Fully materialize the prediction
cube, i.e., exhaustively build models and
evaluate them for each cell and for each
granularity
 Better approach: Explore score function
decomposition that reduces prediction cube
computation to data cube computation
305
305
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
306
306
Complex Aggregation at Multiple
Granularities: Multi-Feature Cubes
 Multi-feature cubes (Ross, et al. 1998): Compute complex
queries involving multiple dependent aggregates at multiple
granularities
 Ex. Grouping by all subsets of {item, region, month}, find the
maximum price in 2010 for each group, and the total sales
among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
 Continuing the last example, among the max-price tuples, find
the min and max shelf life, and find the fraction of the total
sales due to tuples that have the min shelf life within the set of
all max-price tuples
307
307
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
308
308
Discovery-Driven Exploration of Data Cubes
 Hypothesis-driven
 exploration by user, huge search space
 Discovery-driven (Sarawagi, et al.’98)
 Effective navigation of large OLAP data cubes
 pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
 Exception: significantly different from the value
anticipated, based on a statistical model
 Visual cues such as background color are used to
reflect the degree of exception of each cell
309
309
Kinds of Exceptions and their Computation
 Parameters
 SelfExp: surprise of cell relative to other cells at
same level of aggregation
 InExp: surprise beneath the cell
 PathExp: surprise beneath cell for each drill-down
path
 Computation of the exception indicators (model fitting
and computing SelfExp, InExp, and PathExp values)
can be overlapped with cube construction
 Exceptions themselves can be stored, indexed, and
retrieved like precomputed aggregates
310
310
Examples: Discovery-Driven Data Cubes
311
311
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
312
312
Data Cube Technology: Summary
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 MultiWay Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP with Shell-Fragments
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cubes
 Ranking Cubes
 Multidimensional Data Analysis in Cube Space
 Discovery-Driven Exploration of Data Cubes
 Multi-feature Cubes

313
313
Ref.(I) Data Cube Computation Methods
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the
computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97
 K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99
 M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently.
VLDB’98
 J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube:
A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge
Discovery, 1:29–54, 1997.
 J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01
 L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube,
VLDB'02
 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional
aggregates. SIGMOD’97
 K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration,
VLDB'03
 D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking,
ICDE'06
314
314
Ref. (II) Advanced Applications with Data Cubes
 D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over
uncertain and imprecise data. VLDB’05
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
 C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for
multidimensional text database analysis. ICDM’08
 D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses.
SSTD’01
 N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient
implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–958,
2000.
 T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09
 T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized
data cubes. SIGMOD’08
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections:
The ranking cube approach. VLDB’06
 J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets.
CIKM’98
 D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text
databases. SDM’09
315
Ref. (III) Knowledge Discovery with Data Cubes
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05
 B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global
aggregates from local regions. VLDB’06
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of
Time-Series Data Streams, VLDB'02
 G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data
Cubes. VLDB’ 01
 R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural
databases. PODS’05
 J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998
 T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data
Mining & Knowledge Discovery, 6:219–258, 2002.
 R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge
Discovery, 15:29–54, 2007.
 K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities.
EDBT'98
 S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes.
EDBT'98
 G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
Surplus Slides
316
317
317
Chapter 5: Data Cube Technology
 Efficient Methods for Data Cube Computation

Preliminary Concepts and General Strategies for Cube Computation
 Multiway Array Aggregation for Full Cube Computation
 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
 H-Cubing: Exploring an H-Tree Structure
 Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree
Structure
 Precomputing Shell Fragments for Fast High-Dimensional OLAP
 Data Cubes for Advanced Applications
 Sampling Cubes: OLAP on Sampling Data
 Ranking Cubes: Efficient Computation of Ranking Queries
 Knowledge Discovery with Data Cubes
 Discovery-Driven Exploration of Data Cubes
 Complex Aggregation at Multiple Granularity: Multi-feature Cubes
 Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
 Summary
318
318
H-Cubing: Using H-Tree Structure
 Bottom-up computation
 Exploring an H-tree
structure
 If the current
computation of an H-
tree cannot pass
min_sup, do not
proceed further
(pruning)
 No simultaneous
aggregation
(Cuboid lattice: all; A, B, C, D; AB, AC, AD, BC, BD, CD;
ABC, ABD, ACD, BCD; ABCD)
319
319
H-tree: A Prefix Hyper-tree
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hhd TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
… … … … … …
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.
Q.I. Q.I.
Quant-
Info
Sum: 1765
Cnt: 2
bins
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Tor …
Van …
Mon …
… …
Header
table
320
320
root
Edu. Hhd. Bus.
Jan. Mar. Jan. Feb.
Tor. Van. Tor. Mon.
Q.I.
Q.I. Q.I.
Quant-
Info
Sum: 1765
Cnt: 2
bins
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Tor …
Van …
Mon …
… …
Attr. Val. Q.I. Side-link
Edu …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Header
Table
HTor
From (*, *, Tor) to (*, Jan, Tor)
Computing Cells Involving “City”
321
321
Computing Cells Involving Month But No City
root
Edu. Hhd. Bus.
Jan. Mar. Jan. Feb.
Tor. Van. Tor. Mont.
Q.I.
Q.I. Q.I.
Attr. Val. Quant-Info Side-link
Edu. Sum:2285 …
Hhd. …
Bus. …
… …
Jan. …
Feb. …
Mar. …
… …
Tor. …
Van. …
Mont. …
… …
1. Roll up quant-info
2. Compute cells
involving month but
no city
Q.I.
Top-k OK mark: if the Q.I. in a child passes the top-k avg
threshold, so do its parents. No binning is needed!
322
322
Computing Cells Involving Only Cust_grp
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.
Q.I. Q.I.
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
Mar …
… …
Tor …
Van …
Mon …
… …
Check header table
directly
Q.I.
323
323
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
324
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
325
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis
Why Is Freq. Pattern Mining Important?
 Freq. pattern: An intrinsic and important property of
datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
 Classification: discriminative, frequent pattern
analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
327
Basic Concepts: Frequent Patterns
 itemset: A set of one or more
items
 k-itemset X = {x1, …, xk}
 (absolute) support, or, support
count of X: Frequency or
occurrence of an itemset X
 (relative) support, s, is the
fraction of transactions that
contains X (i.e., the probability
that a transaction contains X)
 An itemset X is frequent if X’s
support is no less than a
minsup threshold
(Venn diagram: customers buying beer, customers buying
diaper, and customers buying both)
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
328
Basic Concepts: Association Rules
 Find all the rules X  Y with
minimum support and
confidence
 support, s, probability that a
transaction contains X  Y
 confidence, c, conditional
probability that a transaction
having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3
(Venn diagram: customers buying beer, diaper, or both)
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
 Association rules: (many more!)
 Beer  Diaper (60%, 100%)
 Diaper  Beer (60%, 75%)
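For the five transactions above, the support and confidence numbers can be checked with a few lines of Python (a sketch, not part of the slides):

transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
n = len(transactions)

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions.values()) / n

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%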
329
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains
C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
 An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X (proposed by
Bayardo @ SIGMOD’98)
 Closed pattern is a lossless compression of freq.
patterns

330
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 < a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all patterns?
 !!
331
Computational Complexity of Frequent Itemset
Mining
 How many itemsets may potentially be generated in the worst case?
 The number of frequent itemsets to be generated is sensitive to the
minsup threshold
 When minsup is low, there exist potentially an exponential number
of frequent itemsets
 The worst case: M^N, where M: # distinct items, and N: max length of
transactions
 The worst-case complexity vs. the expected probability
 Ex. Suppose Walmart has 10^4 kinds of products
 The chance to pick up one product: 10^-4
 The chance to pick up a particular set of 10 products: ~10^-40
 What is the chance for this particular set of 10 products to be
frequent 10^3 times in 10^9 transactions?
332
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
333
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test
Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth
Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
334
The Downward Closure Property and Scalable
Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
 i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
335
Apriori: A Candidate Generation & Test Approach
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length
k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated
336
The Apriori Algorithm—An Example
Database TDB
1st
scan
C1
L1
L2
C2 C2
2nd
scan
C3 L3
3rd
scan
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2
Itemset
{B, C, E}
Itemset sup
{B, C, E} 2
Supmin = 2
337
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
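A direct, minimal Python transcription of this pseudo-code (candidate generation by self-join plus subset pruning, then support counting; run here on the four-transaction example of the previous slide):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in transactions:
            for c in cands:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    L = count(items)                      # L1: frequent 1-itemsets
    result = dict(L)
    k = 1
    while L:
        prev = set(L)
        # self-join Lk with itself, keeping only (k+1)-itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune: every k-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        L = count(cands)
        result.update(L)
        k += 1
    return result

tdb = [frozenset(t) for t in ({"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"})]
print(apriori(tdb, min_support=2))        # includes frozenset({'B','C','E'}): 2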
338
Implementation of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
339
How to Count Supports of Candidates?
 Why counting supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
and counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained
in a transaction
340
Counting Supports of Candidates Using Hash Tree
(Figure: the subset function hashes items into the branches
{1, 4, 7}, {2, 5, 8}, {3, 6, 9}; leaf nodes store candidate 3-itemsets
such as {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6},
{2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}.
Transaction 1 2 3 5 6 is split recursively as 1 + 2 3 5 6,
1 2 + 3 5 6, 1 3 + 5 6 to locate the candidates it contains.)
341
Candidate Generation: An SQL Implementation
 SQL Implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1 and … and p.itemk-2=q.itemk-2 and p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD’98]
342
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

343
Further Improvement of the Apriori Method
 Major computational challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent
patterns
 Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
(Figure: DB1 + DB2 + … + DBk = DB; if sup1(i) < σ·|DB1|,
sup2(i) < σ·|DB2|, …, supk(i) < σ·|DBk| in every partition, then
sup(i) < σ·|DB|, so i cannot be globally frequent.)
345
DHP: Reduce the Number of Candidates
 A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
 Candidates: a, b, c, d, e
 Hash entries

{ab, ad, ae}

{bd, be, de}

…
 Frequent 1-itemset: a, b, d, e
 ab is not a candidate 2-itemset if the sum of count of {ab, ad,
ae} is below support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD’95
(Hash table of bucket counts: e.g., the bucket holding
{ab, ad, ae} has count 35; other buckets, such as {yz, qs, wt}
and {bd, be, de}, hold counts 88, 102, ….)
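A minimal sketch of the bucket-count filter (the hash function and the number of buckets are arbitrary toy choices, not those of the paper):

from itertools import combinations
from collections import Counter

def dhp_prune_pairs(transactions, min_support, n_buckets=7):
    """DHP idea: while counting 1-itemsets, also hash every pair in each
    transaction into a small bucket-count table; a candidate pair can be
    frequent only if its bucket count reaches min_support."""
    item_counts, bucket_counts = Counter(), Counter()
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    candidates = []
    for pair in combinations(sorted(frequent_items), 2):
        if bucket_counts[hash(pair) % n_buckets] >= min_support:   # bucket filter
            candidates.append(pair)
    return candidates

tdb = [{"a", "b", "d"}, {"a", "d", "e"}, {"b", "d", "e"}, {"a", "b", "c", "d"}]
print(dhp_prune_pairs(tdb, min_support=2))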
346
Sampling for Frequent Patterns
 Select a sample of original database, mine frequent
patterns within sample using Apriori
 Scan database once to verify frequent itemsets found
in sample, only borders of closure of frequent patterns
are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association
rules. In VLDB’96
347
DIC: Reduce Number of Scans
ABCD
ABC ABD ACD BCD
AB AC BC AD BD CD
A B C D
{}
Itemset lattice
 Once both A and D are determined
frequent, the counting of AD begins
 Once all length-2 subsets of BCD are
determined frequent, the counting of
BCD begins
(Figure: Apriori counts 1-itemsets, then 2-itemsets, … in separate
passes over the transactions, while DIC starts counting 2-itemsets
and 3-itemsets partway through the scan, as soon as their subsets
are known to be frequent.)
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket data.
SIGMOD’97
348
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

349
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test

Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local
frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project DB on abc: DB|abc
 “d” is a local frequent item in DB|abc  abcd is a frequent
pattern
350
Construct FP-tree from a Transaction Database
{}
├─ f:4
│   ├─ c:3 ── a:3
│   │          ├─ m:2 ── p:2
│   │          └─ b:1 ── m:1
│   └─ b:1
└─ c:1 ── b:1 ── p:1
Header Table
Item frequency head
f 4
c 4
a 3
b 3
m 3
p 3
min_support = 3
TID Items bought (ordered) frequent
items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
1. Scan DB once, find
frequent 1-itemset
(single item pattern)
2. Sort frequent items in
frequency descending
order, f-list
3. Scan DB again,
construct FP-tree
F-list = f-c-a-b-m-p
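A compact Python sketch of the two-scan construction (the node and header-table layout are simplified; tie order among equally frequent items may differ from the slide's f-list):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fptree(transactions, min_support):
    """Scan 1: find frequent items and the f-list.
    Scan 2: insert each transaction, reordered by the f-list, into the tree."""
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    order = {i: r for r, i in enumerate(flist)}
    root, header = Node(None, None), defaultdict(list)   # header: item -> node links
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

tdb = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "facelpmn"]
root, header, flist = build_fptree([set(t) for t in tdb], min_support=3)
print(flist)     # ['f', 'c', 'a', 'b', 'm', 'p'] (order of ties may vary)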
351
Partition Patterns and Databases
 Frequent patterns can be partitioned into
subsets according to f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a nor b, m, p
 Pattern f
 Completeness and non-redundency
352
Find Patterns Having P From P-conditional Database
 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent
item p
 Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
Conditional pattern bases
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
(Global FP-tree and header table as constructed on the previous
slides.)
353
From Conditional Pattern-bases to Conditional FP-trees
 For each pattern-base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of
the pattern base
m-conditional pattern base:
fca:2, fcab:1
{}
f:3
c:3
a:3
m-conditional FP-tree
All frequent
patterns relate to m
m,
fm, cm, am,
fcm, fam, cam,
fcam


(Global FP-tree and header table as before.)
354
Recursion: Mining Each Conditional FP-tree
{}
f:3
c:3
a:3
m-conditional FP-tree
Cond. pattern base of “am”: (fc:3)
{}
f:3
c:3
am-conditional FP-tree
Cond. pattern base of “cm”: (f:3)
{}
f:3
cm-conditional FP-tree
Cond. pattern base of “cam”: (f:3)
{}
f:3
cam-conditional FP-tree
355
A Special Case: Single Prefix Path in FP-tree
 Suppose a (conditional) FP-tree T has a shared
single prefix-path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one
node
 Concatenation of the mining results of the two
parts
(Figure: an FP-tree whose top is a single prefix path
{} → a1:n1 → a2:n2 → a3:n3, followed by a multipath part rooted at
r1 with branches b1:m1, C1:k1, C2:k2, C3:k3; the tree is decomposed
into the prefix-path part and the multipath part r1, and the mining
results of the two parts are concatenated.)
356
Benefits of the FP-tree Structure
 Completeness
 Preserve complete information for frequent pattern
mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never be larger than the original database (not counting
node-links and the count fields)
357
The Frequent Pattern Growth Mining Method
 Idea: Frequent pattern growth
 Recursively grow frequent patterns by pattern and
database partition
 Method
 For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
 Repeat the process on each newly created
conditional FP-tree
 Until the resulting FP-tree is empty, or it contains
only one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
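A compact recursive sketch of the same divide-and-conquer idea; for brevity it mines conditional pattern bases represented as (itemset, count) lists rather than physical FP-trees, so it is not the pointer-based implementation of the paper:

from collections import Counter

def fpgrowth(transactions, min_support, suffix=frozenset()):
    """For each frequent item (least frequent first), output suffix + item, then
    recursively mine its conditional database, which keeps only the items that
    come earlier in the local f-list (i.e., the conditional pattern base)."""
    counts = Counter()
    for items, cnt in transactions:
        for i in items:
            counts[i] += cnt
    flist = [i for i, c in counts.most_common() if c >= min_support]
    order = {i: r for r, i in enumerate(flist)}
    results = {}
    for item in reversed(flist):                      # least frequent first
        pattern = frozenset(suffix | {item})
        results[pattern] = counts[item]
        cond = [(frozenset(i for i in items if i in order and order[i] < order[item]), cnt)
                for items, cnt in transactions if item in items]
        results.update(fpgrowth(cond, min_support, pattern))
    return results

tdb = [(set("facdgimp"), 1), (set("abcflmo"), 1), (set("bfhjow"), 1),
       (set("bcksp"), 1), (set("facelpmn"), 1)]
found = fpgrowth(tdb, min_support=3)
print(sorted("".join(sorted(p)) for p in found))      # includes 'fcam', 'cp', ...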
358
Scaling FP-growth by Database Projection
 What about if FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected DB
 Parallel projection vs. partition projection techniques
 Parallel projection

Project the DB in parallel for each frequent item

Parallel projection is space costly

All the partitions can be processed in parallel
 Partition projection

Partition the DB based on the ordered frequent items

Passing the unprocessed parts to the subsequent partitions
359
Partition-Based Projection
 Parallel projection needs a lot
of disk space
 Partition projection saves it
Tran. DB
fcamp
fcabm
fb
cbp
fcamp
p-proj DB
fcam
cb
fcam
m-proj DB
fcab
fca
fca
b-proj DB
f
cb
…
a-proj DB
fc
…
c-proj DB
f
…
f-proj DB
…
am-proj DB
fc
fc
fc
cm-proj DB
f
f
f
…
Performance of FPGrowth in Large Datasets
FP-Growth vs. Apriori
360
(Figure: run time (sec.) vs. support threshold (%) on data set
T25I20D10K: D1 FP-growth runtime vs. D1 Apriori runtime.)
FP-Growth vs. Tree-Projection
(Figure: runtime (sec.) vs. support threshold (%) on data set
T25I20D100K: D2 FP-growth vs. D2 TreeProjection.)
361
Advantages of the Pattern Growth Approach
 Divide-and-conquer:
 Decompose both the mining task and DB according to the
frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building sub FP-tree,
no pattern search and matching
 A good open-source implementation and refinement of
FPGrowth
 FPGrowth+ (Grahne and J. Zhu, FIMI'03)
362
Further Improvements of Mining Methods
 AFOPT (Liu, et al. @ KDD’03)
 A “push-right” method for mining condensed frequent pattern
(CFP) tree
 Carpenter (Pan, et al. @ KDD’03)
 Mine data sets with small rows but numerous columns
 Construct a row-enumeration tree for efficient mining
 FPgrowth+ (Grahne and Zhu, FIMI’03)
 Efficiently Using Prefix-Trees in Mining Frequent Itemsets,
Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI'03), Melbourne, FL, Nov. 2003
 TD-Close (Liu, et al, SDM’06)
363
Extension of Pattern Growth Mining Methodology
 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
 Mining sequential patterns
 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
 Mining graph patterns
 gSpan (ICDM’02), CloseGraph (KDD’03)
 Constraint-based mining of frequent patterns
 Convertible constraints (ICDE’01), gPrune (PAKDD’03)
 Computing iceberg data cubes with complex measures
 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
 Pattern-growth-based Clustering
 MaPle (Pei, et al., ICDM’03)
 Pattern-Growth-Based Classification
 Mining frequent and discriminative patterns (Cheng, et al,
ICDE’07)
364
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

365
ECLAT: Mining by Exploring Vertical Data
Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
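A minimal sketch of the vertical representation, tid-list intersection, and diffsets (the four transactions are made up for illustration):

from collections import defaultdict

transactions = {"T1": {"A", "B", "C"}, "T2": {"A", "C"},
                "T3": {"A", "B"}, "T4": {"B", "D"}}

# Vertical format: item -> tid-list
tidlists = defaultdict(set)
for tid, items in transactions.items():
    for i in items:
        tidlists[i].add(tid)

# Support of an itemset = size of the intersection of its members' tid-lists
t_A, t_B = tidlists["A"], tidlists["B"]
t_AB = t_A & t_B
print(len(t_AB))               # support count of {A, B}

# Diffset: store only the tids that disappear when B is added to A
diffset_AB = t_A - t_AB
print(diffset_AB)              # {'T2'}; support({A,B}) = support({A}) - |diffset|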
366
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

Mining Frequent Closed Patterns: CLOSET
 Flist: list of all frequent items in support ascending order
 Flist: d-a-f-e-c
 Divide search space
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed pattern recursively
 Every transaction having d also has cfa  cfad is a
frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", DMKD'00.
TID Items
10 a, c, d, e, f
20 a, b, e
30 c, e, f
40 a, c, d, f
50 c, e, f
Min_sup=2
CLOSET+: Mining Closed Itemsets by Pattern-Growth
 Itemset merging: if Y appears in every occurrence of X, then
Y is merged with X
 Sub-itemset pruning: if Y ⊃ X, and sup(X) = sup(Y), X and all of
X’s descendants in the set enumeration tree can be pruned
 Hybrid tree projection
 Bottom-up physical tree-projection
 Top-down pseudo tree-projection
 Item skipping: if a local frequent item has the same support
in several header tables at different levels, one can prune it
from the header table at higher levels
 Efficient subset checking
MaxMiner: Mining Max-Patterns
 1st
scan: find frequent items
 A, B, C, D, E
 2nd
scan: find support for
 AB, AC, AD, AE, ABCDE
 BC, BD, BE, BCDE
 CD, CE, CDE, DE
 Since BCDE is a max-pattern, no need to check BCD,
BDE, CDE in later scan
 R. Bayardo. Efficiently mining long patterns from
databases. SIGMOD’98
Tid Items
10 A, B, C, D, E
20 B, C, D, E,
30 A, C, D, F
Potential
max-patterns
CHARM: Mining by Exploring Vertical Data
Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving closed patterns based on vertical
intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER (P. Shenoy et al. @SIGMOD’00)
371
Visualization of Association Rules: Plane
Graph
372
Visualization of Association Rules: Rule
Graph
373
Visualization of Association Rules
(SGI/MineSet 3.0)
374
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
375
Interestingness Measure: Correlations (Lift)
 play basketball  eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball  not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
 Measure of dependent/correlated events: lift
lift = P(A ∪ B) / (P(A) × P(B))

| Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
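The two lift values can be reproduced directly from the contingency counts (a small sketch, not part of the slides):

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from contingency counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

n = 5000
print(round(lift(2000, 3000, 3750, n), 2))   # lift(basketball, cereal)     -> 0.89
print(round(lift(1000, 3000, 1250, n), 2))   # lift(basketball, not cereal) -> 1.33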
376
Are lift and χ² Good Measures of Correlation?
 “Buy walnuts  buy
milk [1%, 80%]” is
misleading if 85% of
customers buy milk
 Support and
confidence are not
good to indicate
correlations
 Over 20
interestingness
measures have been
proposed (see Tan,
Kumar, Srivastava
@KDD’02)
 Which are good ones?
377
Null-Invariant Measures
378
Comparison of Interestingness Measures
| Milk | No Milk | Sum (row)
Coffee | m, c | ~m, c | c
No Coffee | m, ~c | ~m, ~c | ~c
Sum (col.) | m | ~m | Σ
 Null-(transaction) invariance is crucial for correlation analysis
 Lift and χ² are not null-invariant
 5 null-invariant measures
Null-transactions
w.r.t. m and c Null-invariant
Subtle: They disagree
Kulczynski
measure (1927)
379
Analysis of DBLP Coauthor Relationships
Advisor-advisee relation: Kulc: high,
coherence: low, cosine: middle
Recent DB conferences, removing balanced associations, low sup, etc.
 Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in
Large Databases: A Re-Examination of Its Measures”, Proc. 2007
Int. Conf. Principles and Practice of Knowledge Discovery in
Databases (PKDD'07), Sept. 2007
Which Null-Invariant Measure Is Better?
 IR (Imbalance Ratio): measure the imbalance of two
itemsets A and B in rule implications
 Kulczynski and Imbalance Ratio (IR) together present a
clear picture for all the three datasets D4 through D6
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral
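A sketch of the two measures; the milk/coffee counts are illustrative assumptions, not data from the slides (Kulc is null-invariant because it uses only the counts of A, B, and A-and-B):

def kulczynski(n_ab, n_a, n_b):
    # Kulc(A, B) = (P(B|A) + P(A|B)) / 2
    return (n_ab / n_a + n_ab / n_b) / 2

def imbalance_ratio(n_ab, n_a, n_b):
    # IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

mc, m, c = 10_000, 100_000, 10_500      # counts of (milk and coffee), milk, coffee
print(kulczynski(mc, m, c), imbalance_ratio(mc, m, c))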
381
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
382
Summary
 Basic concepts: association rules, support-
confidence framework, closed and max-patterns
 Scalable frequent pattern mining methods
 Apriori (Candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
 Which patterns are interesting?
 Pattern evaluation methods
383
Ref: Basic Concepts of Frequent Pattern Mining
 (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. SIGMOD'93
 (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from
databases. SIGMOD'98
 (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules. ICDT'99
 (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns.
ICDE'95
384
Ref: Apriori and Its Improvements
 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
 H. Toivonen. Sampling large databases for association rules. VLDB'96
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
385
Ref: Depth-First, Projection-Based FP Mining
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
 G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
 B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00
 J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
 J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
 J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
386
Ref: Vertical Format and Row Enumeration Methods
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. DAMI:97.
 M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset
Mining, SDM'02.
 C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning
Algorithm for Itemsets with Constraints. KDD’02.
 F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding
Closed Patterns in Long Biological Datasets. KDD'03.
 H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High
Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
387
Ref: Mining Correlations and Interesting Rules
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
 R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
 C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
 E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
 T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371-
397, 2010
388
388
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 7 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
389
390
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
Research on Pattern Mining: A Road Map
391
392
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
393
Mining Multiple-Level Association Rules
 Items often form hierarchies
 Flexible support settings
 Items at the lower level are expected to have lower
support
 Exploration of shared multi-level mining (Agrawal &
Srikant@VLDB’95, Han & Fu@VLDB’95)
uniform
support
Milk
[support = 10%]
2% Milk
[support = 6%]
Skim Milk
[support = 4%]
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Level 1
min_sup = 5%
Level 2
min_sup = 3%
reduced support
394
Multi-level Association: Flexible Support and
Redundancy filtering
 Flexible min-support thresholds: Some items are more valuable
but less frequent
 Use non-uniform, group-based min-support
 E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
 Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
 milk  wheat bread [support = 8%, confidence = 70%]
 2% milk  wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule
 A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
395
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
396
Mining Multi-Dimensional Association
 Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
 Multi-dimensional rules: ≥ 2 dimensions or predicates
 Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
 hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
 Categorical Attributes: finite number of possible values,
no ordering among values—data cube approach
 Quantitative Attributes: Numeric, implicit ordering
among values—discretization, clustering, and gradient
approaches
397
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
398
Mining Quantitative Associations
Techniques can be categorized by how numerical
attributes, such as age or salary, are treated
1. Static discretization based on predefined concept
hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
(quantitative rules, e.g., Agrawal &
Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang &
Miller@SIGMOD97)
 One dimensional clustering then association
4. Deviation: (such as Aumann and Lindell@KDD99)
Sex = female => Wage: mean=$7/hr (overall mean = $9)
399
Static Discretization of Quantitative Attributes
 Discretized prior to mining using concept hierarchy.
 Numeric values are replaced by ranges
 In relational database, finding all frequent k-predicate
sets will require k or k+1 table scans
 Data cube is well suited for mining
 The cells of an n-dimensional
cuboid correspond to the
predicate sets
 Mining from data cubes
can be much faster
(income)
(age)
()
(buys)
(age, income) (age,buys) (income,buys)
(age,income,buys)
400
Quantitative Association Rules Based on Statistical
Inference Theory [Aumann and Lindell@DMKD’03]
 Finding extraordinary and therefore interesting phenomena, e.g.,
(Sex = female) => Wage: mean=$7/hr (overall mean = $9)
 LHS: a subset of the population
 RHS: an extraordinary behavior of this subset
 The rule is accepted only if a statistical test (e.g., Z-test) confirms
the inference with high confidence
 Subrule: highlights the extraordinary behavior of a subset of the
pop. of the super rule
 E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr
 Two forms of rules
 Categorical => quantitative rules, or Quantitative => quantitative rules
 E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr
 Open problem: Efficient methods for LHS containing two or more
quantitative attributes
401
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
402
Negative and Rare Patterns
 Rare patterns: Very low support but interesting
 E.g., buying Rolex watches
 Mining: Setting individual-based or special group-
based support threshold for valuable items
 Negative patterns
 Since it is unlikely that one buys Ford Expedition (an
SUV car) and Toyota Prius (a hybrid car) together,
Ford Expedition and Toyota Prius are likely negatively
correlated patterns
 Negatively correlated patterns that are infrequent tend
to be more interesting than those that are frequent
403
Defining Negatively Correlated Patterns (I)
 Definition 1 (support-based)
 If itemsets X and Y are both frequent but rarely occur together,
i.e.,
sup(X U Y) < sup (X) * sup(Y)
 Then X and Y are negatively correlated
 Problem: A store sold two needle packages A and B; each was sold in 100
transactions, but only one transaction contained both A and B
 When there are in total 200 transactions, we have
s(A ∪ B) = 0.005, s(A) * s(B) = 0.25, so s(A ∪ B) < s(A) * s(B)
 When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) * s(B) = 1/10^3 * 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) * s(B)
 Where is the problem? —Null transactions, i.e., the support-
based definition is not null-invariant!
404
Defining Negatively Correlated Patterns (II)
 Definition 2 (negative itemset-based)
 X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive
items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
 Itemset X is negatively correlated if its observed support is significantly
lower than its expected support under independence
 This definition suffers from a similar null-invariance problem
 Definition 3 (Kulczynski measure-based): If itemsets X and Y are
frequent, but (P(X|Y) + P(Y|X))/2 < є, where є is a negative pattern
threshold, then X and Y are negatively correlated
 Ex. For the same needle-package problem, no matter whether there
are 200 or 10^5 transactions, with є = 0.01 we have
(P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 ≤ є
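A minimal sketch of the Kulczynski-based check in Definition 3; the support counts reuse the needle-package example and the threshold value is illustrative.

# Minimal sketch of Definition 3: flag X and Y as negatively correlated when
# the Kulczynski measure (P(X|Y) + P(Y|X)) / 2 falls below a threshold epsilon.
def kulczynski(sup_x, sup_y, sup_xy):
    """Average of the two conditional supports P(X|Y) and P(Y|X)."""
    return 0.5 * (sup_xy / sup_x + sup_xy / sup_y)

def negatively_correlated(sup_x, sup_y, sup_xy, epsilon):
    return kulczynski(sup_x, sup_y, sup_xy) < epsilon

# Needle-package example: A and B each occur 100 times, together only once.
# The value 0.01 is the same whether the database has 200 or 100,000
# transactions, i.e., the measure is null-invariant.
print(kulczynski(sup_x=100, sup_y=100, sup_xy=1))            # 0.01
print(negatively_correlated(100, 100, 1, epsilon=0.02))      # True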
405
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
406
Constraint-based (Query-Directed) Mining
 Finding all the patterns in a database autonomously? —
unrealistic!
 The patterns could be too many but not focused!
 Data mining should be an interactive process
 User directs what to be mined using a data mining query
language (or a graphical user interface)
 Constraint-based mining
 User flexibility: provides constraints on what to be mined
 Optimization: explores such constraints for efficient mining —
constraint pushing, similar to pushing selections first in DB
query processing
 Note: still find all the answers satisfying constraints, not
finding some answers in “heuristic search”
407
Constraints in Data Mining
 Knowledge type constraint:
 classification, association, etc.
 Data constraint — using SQL-like queries
 find product pairs sold together in stores in Chicago
this year
 Dimension/level constraint
 in relevance to region, price, brand, customer
category
 Rule (or pattern) constraint
 small sales (price < $10) triggers big sales (sum >
$200)
 Interestingness constraint
 strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Meta-Rule Guided Mining
 Meta-rule can be in the rule form with partially instantiated
predicates and constants
P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)
 The resulting rule derived can be
age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)
 In general, it can be in the form of
P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
 Method to find meta-rules
 Find frequent (l+r) predicates (based on min-support
threshold)
 Push constants deeply when possible into the mining process
(see the remaining discussions on constraint-push techniques)
 Use confidence, correlation, and other interestingness measures when possible
408
409
Constraint-Based Frequent Pattern Mining
 Pattern space pruning constraints
 Anti-monotonic: If constraint c is violated, its further mining
can be terminated
 Monotonic: If c is satisfied, no need to check c again
 Succinct: c must be satisfied, so one can start with the data
sets satisfying c
 Convertible: c is neither monotonic nor anti-monotonic, but it can
be converted into one of them if items in the transaction can be
properly ordered
 Data space pruning constraint
 Data succinct: Data space can be pruned at the initial pattern
mining process
 Data anti-monotonic: If a transaction t does not satisfy c, t can
be pruned from its further mining
410
Pattern Space Pruning with Anti-Monotonicity Constraints
 A constraint C is anti-monotone if, whenever a super-
pattern satisfies C, all of its sub-patterns do so
too
 In other words, anti-monotonicity: if an itemset
S violates the constraint, so does any of its
supersets
 Ex. 1. sum(S.price) ≤ v is anti-monotone
 Ex. 2. range(S.profit) ≤ 15 is anti-monotone
 Itemset ab violates C
 So does every superset of ab
 Ex. 3. sum(S.price) ≥ v is not anti-monotone
 Ex. 4. support count is anti-monotone: core
property used in Apriori
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
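A small sketch of how anti-monotone pruning is used during mining (the item prices and budget v below are hypothetical): once sum(S.price) ≤ v is violated, the itemset and its entire superset branch can be skipped.

# Sketch of anti-monotone pruning: once an itemset violates sum(S.price) <= v,
# every superset also violates it, so the whole branch can be abandoned.
def violates_sum_budget(itemset, price, v):
    return sum(price[i] for i in itemset) > v

price = {"a": 40, "b": 0, "c": 10, "d": 25}   # hypothetical prices
v = 50

candidates = [("a",), ("a", "d"), ("a", "b")]
for itemset in candidates:
    if violates_sum_budget(itemset, price, v):
        # Anti-monotonicity: prune the itemset and never extend it.
        print("prune", itemset, "and all of its supersets")
    else:
        print("keep ", itemset)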
411
Pattern Space Pruning with Monotonicity Constraints
 A constraint C is monotone if, once a pattern
satisfies C, we do not need to check C in
subsequent mining
 Alternatively, monotonicity: if an itemset S
satisfies the constraint, so does any of its
supersets
 Ex. 1. sum(S.price) ≥ v is monotone
 Ex. 2. min(S.price) ≤ v is monotone
 Ex. 3. C: range(S.profit) ≥ 15
 Itemset ab satisfies C
 So does every superset of ab
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
412
Data Space Pruning with Data Anti-monotonicity
 A constraint c is data anti-monotone if, when a
pattern p cannot satisfy a transaction t under c,
no superset of p can satisfy t under c either
 The key for data anti-monotonicity is recursive data
reduction
 Ex. 1. sum(S.price) ≥ v is data anti-monotone
 Ex. 2. min(S.price) ≤ v is data anti-monotone
 Ex. 3. C: range(S.profit) ≥ 25 is data anti-
monotone
 Itemset {b, c}’s projected DB:
T10’: {d, f, h}, T20’: {d, f, g, h}, T30’: {d, f, g}
 Since C cannot be satisfied by extending {b, c} within T10’,
T10’ can be pruned
TID Transaction
10 a, b, c, d, f, h
20 b, c, d, f, g, h
30 b, c, d, f, g
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d -15
e -30
f -10
g 20
h -5
413
Pattern Space Pruning with Succinctness
 Succinctness:
 Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based
on A1 , i.e., S contains a subset belonging to A1
 Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
 min(S.price) ≤ v is succinct
 sum(S.price) ≥ v is not succinct
 Optimization: If C is succinct, C is pre-counting
pushable
414
Naïve Algorithm: Apriori + Constraint
Database D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Scan D → C1 (itemset: sup):  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1:  {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates):  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup):  {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2:  {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3:  {2 3 5};  Scan D → {2 3 5}: 2;  L3:  {2 3 5}: 2
Constraint: Sum{S.price} < 5
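Beyond the naïve post-filtering sketched above, an anti-monotone constraint such as Sum{S.price} < 5 can be checked on every candidate during the level-wise search. The sketch below (assuming, purely for illustration, that each item's price equals its id) pushes the constraint into a toy Apriori loop over database D.

# Compact sketch of Apriori with an anti-monotone constraint checked on every
# candidate (item prices are hypothetical; here each item's price equals its id).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUP, PRICE_BUDGET = 2, 5
price = {i: i for i in {1, 2, 3, 4, 5}}

def support(itemset):
    return sum(1 for t in D if itemset <= t)

def satisfies_constraint(itemset):           # sum(S.price) < 5, anti-monotone
    return sum(price[i] for i in itemset) < PRICE_BUDGET

# Level 1
items = sorted({i for t in D for i in t})
L = [frozenset([i]) for i in items
     if support(frozenset([i])) >= MIN_SUP and satisfies_constraint({i})]
level, all_frequent = 2, list(L)
while L:
    # Candidate generation by joining, then support counting and constraint pushing.
    candidates = {a | b for a in L for b in L if len(a | b) == level}
    L = [c for c in candidates
         if support(c) >= MIN_SUP and satisfies_constraint(c)]
    all_frequent.extend(L)
    level += 1
print(sorted(tuple(sorted(s)) for s in all_frequent))   # constrained frequent itemsets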
415
Constrained Apriori : Push a Succinct Constraint
Deep
Database D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Scan D → C1 (itemset: sup):  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1:  {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates):  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup):  {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2:  {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3:  {2 3 5};  Scan D → {2 3 5}: 2;  L3:  {2 3 5}: 2
Constraint: min{S.price} <= 1, not immediately to be used
416
Constrained FP-Growth: Push a Succinct
Constraint Deep
Constraint:
min{S.price } <= 1
D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Remove infrequent length-1 items → 100: 1 3;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5 → FP-Tree
1-Projected DB:  100: 3 4;  300: 2 3 5
No need to project on 2, 3, or 5
417
Constrained FP-Growth: Push a Data
Anti-monotonic Constraint Deep
Constraint:
min{S.price } <= 1
D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Remove from data (data anti-monotonic pruning) → 100: 1 3;  300: 1 3 → FP-Tree
Single branch, we are done
418
Constrained FP-Growth: Push a Data
Anti-monotonic Constraint Deep
Constraint:
range{S.price } > 25
min_sup >= 2
TDB (TID: Transaction):  10: a, b, c, d, f, h;  20: b, c, d, f, g, h;  30: b, c, d, f, g;  40: a, c, e, f, g
Item profits:  a 40, b 0, c -20, d -15, e -30, f -10, g 20, h -5
Recursive data pruning → b-projected DB:  10: a, c, d, f, h;  20: c, d, f, g, h;  30: c, d, f, g
FP-Tree on the pruned data has a single branch:  bcdfg: 2
419
Convertible Constraints: Ordering Data in
Transactions
 Convert tough constraints into anti-
monotone or monotone by properly
ordering items
 Examine C: avg(S.profit) ≥ 25
 Order items in value-descending
order: <a, f, g, d, b, h, c, e>
 If an itemset afb violates C
 So does afbh, afb*
 It becomes anti-monotone!
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
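A minimal sketch of the conversion: items are listed in profit-descending order, so once a prefix violates avg(S.profit) ≥ 25, all of its extensions can be pruned (the profits are the ones from the table above).

# Sketch of a convertible constraint: with items listed in value-descending order,
# avg(S.profit) >= 25 becomes anti-monotone, so a violating prefix prunes all of
# its extensions.
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
R = sorted(profit, key=profit.get, reverse=True)    # <a, f, g, d, b, h, c, e>

def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)

prefix = ["a", "f", "b"]        # avg = 23.3 < 25: violates the constraint
print(R)
print(avg_profit(prefix))
# Because items are appended in descending profit order, any extension of this
# prefix (afbh, afbc, ...) can only lower the average further, so prune it.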
420
Strongly Convertible Constraints
 avg(X) ≥ 25 is convertible anti-monotone
w.r.t. item-value descending order R: <a, f, g,
d, b, h, c, e>
 If an itemset af violates a constraint C, so
does every itemset with af as prefix, such
as afd
 avg(X) ≥ 25 is convertible monotone w.r.t.
item-value ascending order R^-1: <e, c, h, b, d,
g, f, a>
 If an itemset d satisfies a constraint C, so
do itemsets df and dfa, which have d
as a prefix
 Thus, avg(X) ≥ 25 is strongly convertible
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
421
Can Apriori Handle Convertible Constraints?
 A constraint that is convertible but neither
monotone, anti-monotone, nor succinct cannot be
pushed deep into an Apriori mining
algorithm
 Within the level-wise framework, no direct
pruning based on the constraint can be
made
 Itemset df violates constraint C: avg(X) >=
25
 Since adf satisfies C, Apriori needs df to
assemble adf, so df cannot be pruned
 But the constraint can be pushed into the frequent-pattern
growth framework
Item Value
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
422
Pattern Space Pruning w. Convertible Constraints
 C: avg(X) >= 25, min_sup=2
 List items in every transaction in value
descending order R: <a, f, g, d, b, h, c, e>
 C is convertible anti-monotone w.r.t. R
 Scan TDB once
 Remove infrequent items: item h is dropped
 Itemsets a and f are good, …
 Projection-based mining
 Imposing an appropriate order on item
projection
 Many tough constraints can be converted
into (anti)-monotone
TID Transaction
10 a, f, d, b, c
20 f, g, d, b, c
30 a, f, d, c, e
40 f, g, h, c, e
TDB (min_sup=2)
Item Value
a 40
f 30
g 20
d 10
b 0
h -10
c -20
e -30
423
Handling Multiple Constraints
 Different constraints may require different or even
conflicting item-ordering
 If there exists an order R s.t. both C1 and C2 are
convertible w.r.t. R, then there is no conflict between
the two convertible constraints
 If there is a conflict on the order of items
 Try to satisfy one constraint first
 Then use the order for the other constraint to
mine frequent itemsets in the corresponding
projected database
424
What Constraints Are Convertible?
Constraint                                          Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                     Yes                         Yes                    Yes
median(S) ≤ v, ≥ v                                  Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes                         No                     No
……
425
Constraint-Based Mining — A General Picture
Constraint                          Anti-monotone   Monotone      Succinct
v ∈ S                               no              yes           yes
S ⊇ V                               no              yes           yes
S ⊆ V                               yes             no            yes
min(S) ≤ v                          no              yes           yes
min(S) ≥ v                          yes             no            yes
max(S) ≤ v                          yes             no            yes
max(S) ≥ v                          no              yes           yes
count(S) ≤ v                        yes             no            weakly
count(S) ≥ v                        no              yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          yes             no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          no              yes           no
range(S) ≤ v                        yes             no            no
range(S) ≥ v                        no              yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible     convertible   no
support(S) ≥ ξ                      yes             no            no
support(S) ≤ ξ                      no              yes           no
426
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
427
Mining Colossal Frequent Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
 We have many algorithms, but can we mine large (i.e., colossal)
patterns ― say, of size around 50 to 100? Unfortunately, no!
 Why not? ― the curse of the “downward closure” property of frequent patterns
 The “downward closure” property
 Any sub-pattern of a frequent pattern is frequent.
 Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1,
a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There
are about 2^100 such frequent itemsets!
 No matter whether we use breadth-first search (e.g., Apriori) or depth-first
search (e.g., FP-growth), we have to examine a huge number of patterns
 Thus the downward closure property leads to an explosion!
428
Colossal Patterns: A Motivating Example
 Let’s make a set of 40 transactions: T1 = T2 = … = T40 = 1 2 3 4 … 39 40,
then delete the items on the diagonal (remove item i from Ti):
T1 = 2 3 4 … 39 40;  T2 = 1 3 4 … 39 40;  … ;  T40 = 1 2 3 4 … 39
 Closed/maximal patterns may partially alleviate the problem but not
really solve it: we often need to mine scattered large patterns!
 Let the minimum support threshold σ = 20
 There are C(40, 20) frequent patterns of size 20
 Each is closed and maximal
 # patterns = C(n, n/2), so the size of the answer set is
exponential in n
429
Colossal Pattern Set: Small but Interesting
 It is often the case that
only a small number of
patterns are colossal,
i.e., of large size
 Colossal patterns usually
carry greater importance
than patterns of small
size
430
Mining Colossal Patterns: Motivation and
Philosophy
 Motivation: Many real-world tasks need mining colossal patterns
 Micro-array analysis in bioinformatics (when support is low)
 Biological sequence patterns
 Biological/sociological/information graph pattern mining
 No hope for completeness
 If the mining of mid-sized patterns is explosive in size, there is
no hope of finding colossal patterns efficiently by insisting on the
“complete set” mining philosophy
 Jumping out of the swamp of mid-sized results
 What we may develop is a philosophy that jumps out of the
swamp of mid-sized results, which are explosive in size, and
reaches colossal patterns directly
 Striving for mining almost complete sets of colossal patterns
 The key is to develop a mechanism that can quickly reach
colossal patterns and discover most of them
431
Alas, A Show of Colossal Pattern Mining!
 Transactions:
T1 = 2 3 4 … 39 40;  T2 = 1 3 4 … 39 40;  … ;  T40 = 1 2 3 4 … 39
T41 = T42 = … = T60 = 41 42 43 … 79
 Let the min-support threshold σ = 20
 Then there are C(40, 20) closed/maximal
frequent patterns of size 20
 However, there is only one with size
greater than 20 (i.e., colossal):
α = {41, 42, …, 79} of size 39
 The existing fastest mining algorithms
(e.g., FPClose, LCM) fail to complete
running
 Our algorithm outputs this colossal
pattern in seconds
432
Methodology of Pattern-Fusion Strategy
 Pattern-Fusion traverses the tree in a bounded-breadth way
 Always pushes down a frontier of a bounded-size candidate
pool
 Only a fixed number of patterns in the current candidate pool
will be used as the starting nodes to go down in the pattern tree
― thus avoids the exponential search space
 Pattern-Fusion identifies “shortcuts” whenever possible
 Pattern growth is not performed by single-item addition but by
leaps and bounds: agglomeration of multiple patterns in the
pool
 These shortcuts will direct the search down the tree much more
rapidly towards the colossal patterns
433
Observation: Colossal Patterns and Core Patterns
[Figure: a colossal pattern α in transaction database D, with core subpatterns α1, α2, …, αk whose support sets Dα1, Dα2, …, Dαk cluster around Dα]
Subpatterns α1 to αk cluster tightly around the colossal pattern α by
sharing a similar support. We call such subpatterns core patterns of α
434
Robustness of Colossal Patterns
 Core Patterns
Intuitively, for a frequent pattern α, a subpattern β is a τ-core
pattern of α if β shares a similar support set with α, i.e.,
|Dα| / |Dβ| ≥ τ,  0 < τ ≤ 1,
where τ is called the core ratio
 Robustness of Colossal Patterns
A colossal pattern is robust in the sense that it tends to have many
more core patterns than small patterns
435
Example: Core Patterns
 A colossal pattern has far more core patterns than a small-sized
pattern
 A colossal pattern has far more core descendants of a smaller size c
 A random draw from the complete set of patterns of size c would be more
likely to pick a core descendant of a colossal pattern
 A colossal pattern can be generated by merging a set of core
patterns
Transaction (# of Ts)   Core Patterns (τ = 0.5)
(abe) (100)             (abe), (ab), (be), (ae), (e)
(bcf) (100)             (bcf), (bc), (bf)
(acf) (100)             (acf), (ac), (af)
(abcef) (100)           (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
437
Colossal Patterns Correspond to Dense Balls
 Due to their robustness,
colossal patterns correspond
to dense balls
 Ω( 2^d) in population
 A random draw in the pattern
space will hit somewhere in
the ball with high probability
438
Idea of Pattern-Fusion Algorithm
 Generate a complete set of frequent patterns up to a
small size
 Randomly pick a pattern β, and β has a high
probability to be a core-descendant of some colossal
pattern α
 Identify all α’s descendants in this complete set, and
merge all of them ― This would generate a much
larger core-descendant of α
 In the same fashion, we select K patterns. This set of
larger core-descendants will be the candidate pool for
the next iteration
439
Pattern-Fusion: The Algorithm
 Initialization (Initial pool): Use an existing algorithm to
mine all frequent patterns up to a small size, e.g., 3
 Iteration (Iterative Pattern Fusion):
 At each iteration, k seed patterns are randomly
picked from the current pattern pool
 For each seed pattern thus picked, we find all the
patterns within a bounding ball centered at the
seed pattern
 All these patterns found are fused together to
generate a set of super-patterns. All the super-
patterns thus generated form a new pool for the
next iteration
 Termination: when the current pool contains no more
than K patterns at the beginning of an iteration
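A highly simplified sketch of this loop (not the published algorithm: the bounding-ball test is approximated here by a core-ratio check on support sets, and the toy database and parameters are hypothetical).

# Simplified Pattern-Fusion sketch: fuse pool patterns whose support sets are
# tau-similar to a randomly chosen seed, shrinking the pool toward large patterns.
import random
from itertools import combinations

D = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 2, 3, 4, 5}),
     frozenset({1, 2, 3, 4, 6}), frozenset({7, 8})]
MIN_SUP, TAU, K, SEEDS = 2, 0.5, 3, 2

def support_set(pattern):
    return frozenset(i for i, t in enumerate(D) if pattern <= t)

def initial_pool(max_size=2):
    """Mine all frequent patterns up to a small size with brute force."""
    items = {i for t in D for i in t}
    pool = set()
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            p = frozenset(combo)
            if len(support_set(p)) >= MIN_SUP:
                pool.add(p)
    return pool

pool = initial_pool()
while len(pool) > K:
    seeds = random.sample(sorted(pool, key=sorted), min(SEEDS, len(pool)))
    new_pool = set()
    for seed in seeds:
        ds = support_set(seed)
        # Fuse all pool patterns whose support set is tau-similar to the seed's.
        ball = [p for p in pool if len(ds & support_set(p)) >= TAU * len(ds)]
        fused = frozenset().union(*ball)
        new_pool.add(fused if len(support_set(fused)) >= MIN_SUP else seed)
    pool = new_pool
print([sorted(p) for p in pool])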
440
Why Is Pattern-Fusion Efficient?
 A bounded-breadth pattern
tree traversal
 It avoids explosion in
mining mid-sized ones
 Randomness comes to
help to stay on the right
path
 Ability to identify “short-
cuts” and take “leaps”
 fuse small patterns
together in one step to
generate new patterns of
significant sizes
 Efficiency
441
Pattern-Fusion Leads to Good Approximation
 Gearing toward colossal patterns
 The larger the pattern, the greater the chance it
will be generated
 Catching outliers
 The more distinct the pattern, the greater the
chance it will be generated
442
Experimental Setting
 Synthetic data set
 Diag_n: an n × (n-1) table where the i-th row has the integers from 1 to n
except i. Each row is taken as an itemset. min_support is n/2.
 Real data sets
 Replace: A program trace data set collected from the “replace”
program, widely used in software engineering research
 ALL: A popular gene expression data set, clinical data on ALL-
AML leukemia (www.broad.mit.edu/tools/data.html).
 Each item is a column, representing the activity level of a
gene/protein in the sample
 Frequent patterns would reveal important correlations between
gene expression patterns and disease outcomes
443
Experiment Results on Diagn
 LCM run time increases
exponentially with pattern
size n
 Pattern-Fusion finishes
efficiently
 The approximation error of
Pattern-Fusion (with min-sup 20,
in comparison with the complete
set) is rather close to that of
uniform sampling (which randomly
picks K patterns from the
complete answer set)
444
Experimental Results on ALL
 ALL: A popular gene expression data set with 38
transactions, each with 866 columns
 There are 1736 items in total
 The table shows a high frequency threshold of 30
445
Experimental Results on REPLACE
 REPLACE
 A program trace data set, recording 4395
calls and transitions
 The data set contains 4395 transactions with
57 items in total
 With support threshold of 0.03, the largest
patterns are of size 44
 They are all discovered by Pattern-Fusion
with different settings of K and τ, when
started with an initial pool of 20948 patterns
of size <=3
446
Experimental Results on REPLACE
 Approximation error when
compared with the complete
mining result
 Example. Out of the total 98
patterns of size >=42, when
K=100, Pattern-Fusion returns
80 of them
 A good approximation to the
colossal patterns in the sense
that any pattern in the
complete set is on average at
most 0.17 items away from
one of these 80 patterns
447
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
448
Mining Compressed Patterns: δ-clustering
 Why compressed patterns?
 too many, but less
meaningful
 Pattern distance measure
 δ-clustering: For each pattern P,
find all patterns which can be
expressed by P and their
distance to P are within δ (δ-
cover)
 All patterns in the cluster can
be represented by P
 Xin et al., “Mining Compressed Frequent-Pattern Sets”, VLDB’05
ID Item-Sets Support
P1 {38,16,18,12} 205227
P2 {38,16,18,12,17} 205211
P3 {39,38,16,18,12,17} 101758
P4 {39,16,18,12,17} 161563
P5 {39,16,18,12} 161576
 Closed frequent pattern
 Report P1, P2, P3, P4, P5
 Emphasize too much on
support
 no compression
 Max-pattern, P3: info loss
 A desirable output: P2, P3,
P4
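A small sketch of the building blocks of δ-clustering, assuming the Jaccard-style pattern distance of Xin et al.: a representative P δ-covers P' when P' can be expressed by P (P' ⊆ P) and their supporting-transaction sets are within distance δ. The transaction-id sets below are hypothetical stand-ins for the supports in the table above.

# Sketch of delta-clustering's building blocks (assuming the Jaccard-style
# distance of Xin et al.): a representative P delta-covers P' when P' is a
# subset of P and their supporting-transaction sets are close.
def pattern_distance(tids1, tids2):
    """1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| over supporting transaction-id sets."""
    return 1.0 - len(tids1 & tids2) / len(tids1 | tids2)

def delta_covers(rep, rep_tids, p, p_tids, delta):
    return p <= rep and pattern_distance(rep_tids, p_tids) <= delta

# Hypothetical supporting-transaction ids for two of the example patterns.
P2, P2_tids = frozenset({38, 16, 18, 12, 17}), set(range(205_211))
P1, P1_tids = frozenset({38, 16, 18, 12}),     set(range(205_227))
print(delta_covers(P2, P2_tids, P1, P1_tids, delta=0.05))   # True: P2 can represent P1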
449
Redundancy-Aware Top-k Patterns
 Why redundancy-aware top-k patterns?
 Desired patterns: high
significance & low
redundancy
 Propose the MMS
(Maximal Marginal
Significance) for
measuring the
combined significance
of a pattern set
 Xin et al., Extracting
Redundancy-Aware
Top-K Patterns,
KDD’06
450
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
How to Understand and Interpret Patterns?
 Not all frequent patterns are useful, only meaningful ones …
e.g., “diaper beer”, “female sterile (2) tekele”
 Do they all make sense?
 What do they mean?
 How are they useful?
 A Dictionary Analogy — Word: “pattern” (from Merriam-Webster)
 Non-semantic info.: morphological info. and simple statistics
 Semantic information: definitions indicating semantics, examples of usage,
synonyms, related words
 Annotate patterns with semantic information
Semantic Analysis with Context Models
 Task1: Model the context of a frequent pattern
Based on the Context Model…
 Task2: Extract strongest context indicators
 Task3: Extract representative transactions
 Task4: Extract semantically similar patterns
Annotating DBLP Co-authorship & Title Pattern
 Database: paper titles and authors, e.g., “Substructure Similarity Search
in Graph Databases” by X. Yan, P. Yu, J. Han; …
 Frequent patterns: P1: { x_yan, j_han } (frequent itemset); P2: “substructure search”
 Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … >
 Semantic annotations for pattern { x_yan, j_han }: sup = …;
CI: { p_yu }, graph pattern, …; Trans.: gSpan: graph-base…; SSPs: { j_wang }, { j_han, p_yu }, …
 Annotation results for Pattern = { xifeng_yan, jiawei_han }:
Context Indicator (CI): graph; { philip_yu }; mine close; graph pattern; sequential pattern; …
Representative Transactions (Trans): gSpan: graph-base substructure pattern mining;
mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP): { jiawei_han, philip_yu }; { jian_pei, jiawei_han };
{ jiong_yang, philip_yu, wei_wang }; …
455
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
456
Summary
 Roadmap: Many aspects & extensions on pattern
mining
 Mining patterns in multi-level, multi-dimensional
space
 Mining rare and negative patterns
 Constraint-based pattern mining
 Specialized methods for mining high-dimensional
data and colossal patterns
 Mining compressed or approximate patterns
 Pattern exploration and understanding: semantic
annotation of frequent patterns
457
Ref: Mining Multi-Level and Quantitative Rules
 Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association
Rules, KDD'99
 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and
visualization. SIGMOD'96.
 J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. VLDB'95.
 R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
 R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
 R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96.
 K. Wang, Y. He, and J. Han. Mining frequent itemsets using support
constraints. VLDB'00
 K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
optimized rectilinear regions for association rules. KDD'97.
458
Ref: Mining Other Kinds of Rules
 F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98
 Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of
Functional and Approximate Dependencies Using Partitions. ICDE’98.
 H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern
Extraction with Fascicles. VLDB'99
 B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
 R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining
association rules. VLDB'96.
 A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98.
 D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98.
459
Ref: Constraint-Based Pattern Mining
 R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97
 R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. SIGMOD’98
 G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained
correlated sets. ICDE'00
 J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with
Convertible Constraints. ICDE'01
 J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in
Large Databases, CIKM'02
 F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated
Data Reduction in Constrained Pattern Mining, PKDD'03
 F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework
for Graph Pattern Mining”, PAKDD'07
460
Ref: Mining Sequential Patterns
 X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with
gap constraints. ICDM'05
 H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
 J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
 X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
 M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning:01.
Mining Graph and Structured Patterns
 A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. PKDD'00
 M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
 X. Yan and J. Han. gSpan: Graph-based substructure pattern mining.
ICDM'02
 X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns.
KDD'03
 X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent
structure analysis. ACM TODS, 30:960–993, 2005
 X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity
search. ACM Trans. Database Systems, 31:1418–1453, 2006
461
462
Ref: Mining Spatial, Spatiotemporal, Multimedia Data
 H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal
sequential patterns. ICDM'05
 D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns.
SSTD'01
 K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic
Information Databases, SSD’95
 H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework
for discovering co-location patterns in data sets with extended spatial
objects. SDM'04
 J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual
words to visual phrases. CVPR'07
 O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with
Progressive Resolution Refinement. ICDE'00
463
Ref: Mining Frequent Patterns in Time-Series Data
 B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
 J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series
Database, ICDE'99.
 J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08
 B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online
Data Mining for Co-Evolving Time Sequences. ICDE'00.
 W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical
Attributes. ICDE’01.
 J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data.
TKDE’03
 L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
464
Ref: FP for Classification and Clustering
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
 B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule
Mining. KDD’98.
 W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based
on Multiple Class-Association Rules. ICDM'01.
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD’ 02.
 J. Yang and W. Wang. CLUSEQ: efficient and effective sequence clustering.
ICDE’03.
 X. Yin and J. Han. CPAR: Classification based on Predictive Association
Rules. SDM'03.
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern
Analysis for Effective Classification”, ICDE'07
465
Ref: Privacy-Preserving FP Mining
 A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining
of Association Rules. KDD’02.
 A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in
Privacy Preserving Data Mining. PODS’03
 J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in
Vertically Partitioned Data. KDD’02
Mining Compressed Patterns
 D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy-
aware top-k patterns. KDD'06
 D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed
frequent-pattern sets. VLDB'05
 X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset
patterns: A profile-based approach. KDD'05
466
Mining Colossal Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal
frequent patterns by core pattern fusion. ICDE'07
 F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large
Structural Patterns in a Massive Network. VLDB’11
467
468
Ref: FP Mining from Data Streams
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB'02.
 R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for
finding frequent elements in streams and bags. TODS 2003.
 G. Manku and R. Motwani. Approximate Frequency Counts over Data
Streams. VLDB’02.
 A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent
and top-k elements in data streams. ICDT'05
469
Ref: Freq. Pattern Mining Applications
 T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or
How to Build a Data Quality Browser. SIGMOD'02
 M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting
interactive complexity bugs in sensor networks., SenSys'08
 Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related
bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and
Implementation (OSDI'04)
 Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and
detecting violations in large software code. FSE'05
 D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure
detection: A discriminative pattern mining approach. KDD'09
 Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns.
ACM TKDD, 2007.
 K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
470
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 8 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
472
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
473
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
474
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts unknown
or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Prediction Problems: Classification vs.
Numeric Prediction
475
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

The known label of test sample is compared with the classified
result from the model

Accuracy rate is the percentage of test set samples that are
correctly classified by the model

Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation (test) set
476
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
477
Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
478
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
479
Decision Tree Induction: An Example
age?
  <=30   → student?        no → no;   yes → yes
  31..40 → yes
  >40    → credit rating?  excellent → no;   fair → yes
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
 Training data set: Buys_computer
 The data set follows an example of
Quinlan’s ID3 (Playing Tennis)
 Resulting tree:
480
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Brief Review of Entropy
[Figure: entropy of a two-class (m = 2) distribution as a function of the class probability]
481
482
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} p_i log2(p_i)
 Information needed (after using A to split D into v partitions) to
classify D:
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
 Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
483
Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples);  Class N: buys_computer = “no” (5 tuples)
 Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
(5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s
 Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
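A short Python sketch that reproduces the numbers above from the buys_computer data (only the age attribute is encoded here).

# Sketch reproducing the slide's information-gain numbers for the buys_computer data.
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, labels):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute."""
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in by_value.values())
    return info(labels) - info_a

rows = [{"age": a} for a in ["<=30"] * 5 + ["31..40"] * 4 + [">40"] * 5]
labels = ["no", "no", "no", "yes", "yes",      # age <=30: 2 yes, 3 no
          "yes", "yes", "yes", "yes",          # age 31..40: 4 yes, 0 no
          "yes", "yes", "yes", "no", "no"]     # age >40: 3 yes, 2 no
print(round(info(labels), 3))            # 0.94  (Info(D) = 0.940)
print(round(gain(rows, "age", labels), 3))
# 0.247 (the slide reports 0.246 after rounding Info_age to 0.694)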
484
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
485
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.
 gain_ratio(income) = 0.029/1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the
splitting attribute
SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
486
Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is
defined as
gini(D) = 1 - Σ_{j=1..n} p_j^2,
where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
 Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
 The attribute providing the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
487
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 in D1: {low,
medium} and 4 in D2: {high}:
gini_income ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
 Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split
values
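A short sketch reproducing the Gini computations above; the per-class counts for each income partition (7 yes / 3 no for {low, medium}, 2 yes / 2 no for {high}) are derived from the same 14-tuple training data.

# Sketch reproducing the slide's Gini computations for the buys_computer data.
def gini(counts):
    """gini(D) = 1 - sum(p_j^2) from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; each partition is a per-class count list."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                    # 0.459 for the full data set
# income in {low, medium}: 10 tuples (7 yes, 3 no) vs. {high}: 4 tuples (2 yes, 2 no)
print(round(gini_split([[7, 3], [2, 2]]), 3))    # 0.443, the lowest of the splits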
488
Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:

biased towards multivalued attributes
 Gain ratio:

tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions
and purity in both partitions
489
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2
test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2
distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others
490
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early: do not split a node
if this would result in the goodness measure falling below a
threshold

Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees

Use a set of data different from the training data to
decide which is the “best pruned tree”
491
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
492
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
 Why is decision tree induction popular?

relatively faster learning speed (than other classification
methods)

convertible to simple and easy to understand classification
rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

Builds an AVC-list (attribute, value, class label)
493
Scalability Framework for RainForest
 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n
494
Rainforest: Training Set and Its AVC Sets
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 2 3
31..40 4 0
>40 3 2
Credit
rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
AVC-set on income
AVC-set on Age
AVC-set on Student
Training Examples
income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on
credit_rating
495
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.
496
Presentation of Classification Results
497
Visualization of a Decision Tree in SGI/MineSet 3.0
498
Interactive Visual Mining by Perception-
Based Classification (PBC)
499
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
500
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
501
Bayes’ Theorem: Basics
 Total probability theorem:  P(B) = Σ_{i=1..M} P(B | A_i) P(A_i)
 Bayes’ theorem:  P(H | X) = P(X | H) P(H) / P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability): the
probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
 E.g., given that X will buy computer, the prob. that X is 31..40,
medium income
502
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem:
P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be viewed as
posteriori = likelihood × prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
503
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X | Ci) P(Ci)
needs to be maximized
504
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X | Ci) = Π_{k=1..n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)
 This greatly reduces the computation cost: only counts the
class distribution
 If A_k is categorical, P(x_k|Ci) is the # of tuples in Ci having value x_k
for A_k divided by |Ci, D| (# of tuples of Ci in D)
 If A_k is continuous-valued, P(x_k|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) exp( -(x - μ)^2 / (2σ^2) )
and P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci)
505
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
506
Naïve Bayes Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
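A compact sketch that reproduces this computation (categorical attributes only, no smoothing); the conditional probabilities are the ones listed above.

# Sketch of the naive Bayes computation on the slide, reproducing 0.028 vs. 0.007.
def naive_bayes_score(x, prior, cond_prob):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    score = prior
    for attr, value in x.items():
        score *= cond_prob[(attr, value)]
    return score

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
yes = naive_bayes_score(x, prior=9/14, cond_prob={
    ("age", "<=30"): 2/9, ("income", "medium"): 4/9,
    ("student", "yes"): 6/9, ("credit_rating", "fair"): 6/9})
no = naive_bayes_score(x, prior=5/14, cond_prob={
    ("age", "<=30"): 3/5, ("income", "medium"): 2/5,
    ("student", "yes"): 1/5, ("credit_rating", "fair"): 2/5})
print(round(yes, 3), round(no, 3))    # 0.028 0.007 -> predict buys_computer = yes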
507
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. to be
non-zero. Otherwise, the predicted prob. will be zero:
P(X | Ci) = Π_{k=1..n} P(x_k | Ci)
 Ex. Suppose a dataset with 1000 tuples, income = low (0),
income = medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
508
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore loss of
accuracy
 Practically, dependencies exist among variables

E.g., in hospitals, patients have a profile (age, family history, etc.),
symptoms (fever, cough, etc.), and diseases (lung cancer,
diabetes, etc.)

Dependencies among these cannot be modeled by Naïve
Bayes Classifier
 How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
509
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
510
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification
cost per class
 Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
511
age?
  <=30   → student?        no → no;   yes → yes
  31..40 → yes
  >40    → credit rating?  excellent → no;   fair → yes
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Extraction from a Decision Tree
 Rules are easier to understand than large
trees
 One rule is created for each path from the
root to a leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class
prediction
 Rules are mutually exclusive and exhaustive
512
Rule Induction: Sequential Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Comparison with decision-tree induction: decision trees learn a set of
rules simultaneously
513
Sequential Covering Algorithm
while (enough target tuples left)
generate a rule
remove positive target tuples satisfying this rule
[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
514
Rule Generation
 To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: positive and negative examples; the rule is specialized step by step: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5]
515
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy
 Picks the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition

favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )
FOIL_Prune(R) = ( pos − neg ) / ( pos + neg )
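A small Python sketch of the two measures above (function names are ours, not taken from any FOIL/RIPPER implementation):

from math import log2

def foil_gain(pos, neg, pos_prime, neg_prime):
    # Gain from extending a rule: pos/neg are positives/negatives covered
    # before adding the new test, pos'/neg' after. Favors rules that are
    # accurate and still cover many positive tuples.
    return pos_prime * (log2(pos_prime / (pos_prime + neg_prime))
                        - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    # Pruning measure, evaluated on an independent prune set.
    return (pos - neg) / (pos + neg)

# e.g., a candidate test that narrows coverage from (100 pos, 80 neg) to (60 pos, 10 neg)
print(foil_gain(100, 80, 60, 10))   # positive gain -> the extra test helps
print(foil_prune(60, 10))           # 50/70 ≈ 0.714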
516
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
 Use validation test set of class-labeled tuples instead of training
set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
517
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:

   Actual class \ Predicted class |  C1                    |  ¬C1
   C1                             |  True Positives (TP)   |  False Negatives (FN)
   ¬C1                            |  False Positives (FP)  |  True Negatives (TN)

 Given m classes, an entry CM_i,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals

Example of Confusion Matrix:

   Actual class \ Predicted class |  buy_computer = yes  |  buy_computer = no  |  Total
   buy_computer = yes             |  6954                |  46                 |  7000
   buy_computer = no              |  412                 |  2588               |  3000
   Total                          |  7366                |  2634               |  10000
518
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
 Classifier Accuracy, or
recognition rate: percentage of
test set tuples that are correctly
classified
Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
 Class Imbalance Problem:
 One class may be rare, e.g.
fraud, or HIV-positive
 Significant majority of the
negative class and minority of
the positive class
 Sensitivity: True Positive
recognition rate

Sensitivity = TP/P
 Specificity: True Negative
recognition rate

Specificity = TN/N
   Actual \ Predicted |  C   |  ¬C  |  Total
   C                  |  TP  |  FN  |  P
   ¬C                 |  FP  |  TN  |  N
   Total              |  P’  |  N’  |  All
519
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
 Recall: completeness – what % of positive tuples did the
classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall

F measure (F1 or F-score): harmonic mean of precision and recall (see the formulas below)

Fβ: weighted measure of precision and recall

assigns β times as much weight to recall as to precision
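For reference, the standard formulas behind these bullets (P = precision, R = recall; the slide's equation images did not survive extraction):

   precision = TP / (TP + FP)
   recall    = TP / (TP + FN)
   F1 = 2 · P · R / (P + R)
   Fβ = (1 + β²) · P · R / (β² · P + R)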
520
Classifier Evaluation Metrics: Example
521
 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
   Actual class \ Predicted class |  cancer = yes  |  cancer = no  |  Total  |  Recognition (%)
   cancer = yes                   |  90            |  210          |  300    |  30.00 (sensitivity)
   cancer = no                    |  140           |  9560         |  9700   |  98.56 (specificity)
   Total                          |  230           |  9770         |  10000  |  96.40 (accuracy)
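A quick check of the numbers above, assuming scikit-learn is available; we rebuild the 10,000 labels/predictions directly from the confusion-matrix counts:

from sklearn.metrics import precision_score, recall_score, f1_score

# Rebuild labels and predictions from the counts in the table above
y_true = ["yes"] * 300 + ["no"] * 9700
y_pred = (["yes"] * 90 + ["no"] * 210       # actual yes: 90 TP, 210 FN
          + ["yes"] * 140 + ["no"] * 9560)  # actual no: 140 FP, 9560 TN

print(precision_score(y_true, y_pred, pos_label="yes"))  # 0.3913  (90/230)
print(recall_score(y_true, y_pred, pos_label="yes"))     # 0.30    (90/300)
print(f1_score(y_true, y_pred, pos_label="yes"))         # ≈ 0.34  (harmonic mean)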
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
 Holdout method

Given data is randomly partitioned into two independent sets

Training set (e.g., 2/3) for model construction

Test set (e.g., 1/3) for accuracy estimation

Random subsampling: a variation of holdout

Repeat holdout k times, accuracy = avg. of the accuracies
obtained
 Cross-validation (k-fold, where k = 10 is most popular)

Randomly partition the data into k mutually exclusive subsets,
each approximately equal size

At i-th iteration, use Di as test set and others as training set

Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
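A minimal scikit-learn sketch of 10-fold stratified cross-validation (scikit-learn assumed available; the data set and classifier are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10 mutually exclusive folds; stratification keeps each fold's class
# distribution close to that of the full data set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # accuracy on each held-out fold
print(scores.mean(), scores.std())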
522
Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement

i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
 Several bootstrap methods, and a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
523
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
 Suppose we have 2 classifiers, M1 and M2, which one is better?
 Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
 These mean error rates are just estimates of error on the true
population of future data cases
 What if the difference between the 2 error rates is just
attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates
524
Estimating Confidence Intervals:
Null Hypothesis
 Perform 10-fold cross-validation
 Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
 Use t-test (or Student’s t-test)
 Null Hypothesis: M1 & M2 are the same
 If we can reject null hypothesis, then
 we conclude that the difference between M1 & M2 is
statistically significant
 Choose the model with the lower error rate
525
Estimating Confidence Intervals: t-test
 If only 1 test set available: pairwise comparison
 For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
 Average over 10 rounds to get the mean error rates, mean(err(M1)) and mean(err(M2))
 t-test computes the t-statistic with k−1 degrees of freedom:
t = ( mean(err(M1)) − mean(err(M2)) ) / sqrt( var(M1 − M2) / k ),
where var(M1 − M2) = (1/k) Σ_{i=1}^{k} [ err(M1)_i − err(M2)_i − ( mean(err(M1)) − mean(err(M2)) ) ]²
 If two test sets available: use the non-paired t-test
t = ( mean(err(M1)) − mean(err(M2)) ) / sqrt( var(M1)/k1 + var(M2)/k2 ),
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
526
Estimating Confidence Intervals:
Table for t-distribution
 Symmetric
 Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
 Confidence limit, z
= sig/2
527
Estimating Confidence Intervals:
Statistical Significance
 Are M1 & M2 significantly different?
 Compute t. Select significance level (e.g. sig = 5%)
 Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
 t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
 If t > z or t < -z, then t value lies in rejection region:
 Reject null hypothesis that mean error rates of M1 & M2
are same
 Conclude: statistically significant difference between M1
& M2
 Otherwise, conclude that any difference is chance
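A sketch of the paired comparison described above using SciPy (assumed available); the per-fold error rates are made-up numbers for illustration:

from scipy import stats
import numpy as np

# Hypothetical error rates of M1 and M2 on the same 10 cross-validation folds
err_m1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13])
err_m2 = np.array([0.10, 0.14, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.13, 0.11])

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired t-test, k-1 = 9 d.f.
print(t_stat, p_value)

# Reject the null hypothesis (M1 and M2 are the same) at sig = 0.05 if p < 0.05
if p_value < 0.05:
    print("statistically significant difference between M1 and M2")
else:
    print("any difference may be due to chance")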
528
Model Selection: ROC Curves
 ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true
positive rate and the false positive rate
 The area under the ROC curve is a
measure of the accuracy of the model
 Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
 The closer to the diagonal line (i.e., the
closer the area is to 0.5), the less
accurate is the model
 Vertical axis
represents the true
positive rate
 Horizontal axis rep.
the false positive rate
 The plot also shows a
diagonal line
 A model with perfect
accuracy will have an
area of 1.0
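A minimal sketch of computing an ROC curve with scikit-learn (assumed available); the classifier only needs to rank the test tuples, e.g., via predicted probabilities:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # false/true positive rates per threshold
print("area under the ROC curve:", auc(fpr, tpr))  # closer to 1.0 = better; 0.5 = diagonal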
529
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
530
531
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Ensemble Methods: Increasing the Accuracy
 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of
classifiers
 Boosting: weighted vote with a collection of classifiers
 Ensemble: combining a set of heterogeneous classifiers
532
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority vote
 Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the
most votes to X
 Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proven improved accuracy in prediction
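A scikit-learn sketch of bagging as described above (scikit-learn assumed available; the base classifier defaults to a decision tree):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# k = 10 bootstrap samples of size |D|, one tree per sample,
# majority vote among the k models at prediction time
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=10).mean())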
533
Boosting
 Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
 How boosting works?
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned

After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 Boosting algorithm can be extended for numeric prediction
 Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
534
535
Adaboost (Freund and Schapire, 1997)
 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all the weights of tuples are set the same (1/d)
 Generate k classifiers in k rounds. At round i,

Tuples from D are sampled (with replacement) to form a training set Di
of the same size
 Each tuple’s chance of being selected is based on its weight

A classification model Mi is derived from Di

Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased, o.w. it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σ_{j=1}^{d} w_j × err(Xj)
 The weight of classifier Mi’s vote is
log( (1 − error(Mi)) / error(Mi) )
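A brief sketch using scikit-learn's AdaBoost implementation (assumed available); it follows the same reweighting idea, although its internal details differ from the pseudocode above:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 50 weak learners (decision stumps by default); tuple weights are increased
# for misclassified tuples between rounds, and classifiers vote with
# accuracy-based weights
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=10).mean())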
Random Forest (Breiman 2001)
 Random Forest:
 Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
 During classification, each tree votes and the most popular class is
returned
 Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
 Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
536
Classification of Class-Imbalanced Data Sets
 Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
 Traditional methods assume a balanced distribution of classes
and equal error costs: not suitable for class-imbalanced data
 Typical methods for imbalanced data in 2-class classification:
 Oversampling: re-sampling of data from positive class
 Under-sampling: randomly eliminate tuples from negative
class
 Threshold-moving: moves the decision threshold, t, so that
the rare class tuples are easier to classify, and hence, less
chance of costly false negative errors
 Ensemble techniques: Ensemble multiple classifiers
introduced above
 Still difficult for class imbalance problem on multiclass tasks
537
538
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Summary (I)
 Classification is a form of data analysis that extracts models
describing important data classes.
 Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
 Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
 Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
539
Summary (II)
 Significance tests and ROC curves are useful for model selection.
 There have been numerous comparisons of the different
classification methods; the matter remains a research topic
 No single method has been found to be superior over all others
for all data sets
 Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
540
References (1)
 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
 C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
 H. Cheng, X. Yan, J. Han, and P. S. Yu,
Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
 W. Cohen. Fast effective rule induction. ICML'95
 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
541
References (2)
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
 J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
 D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
 W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
542
References (3)
 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
 S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
543
References (4)
 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
 X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
544
CS412 Midterm Exam Statistics
 Opinion Question Answering:
 Like the style: 70.83%, dislike: 29.16%
 Exam is hard: 55.75%, easy: 0.6%, just right: 43.63%
 Time: plenty:3.03%, enough: 36.96%, not: 60%
 Score distribution: # of students (Total: 180)
 >=90: 24
 80-89: 54
 70-79: 46
 60-69: 37
 50-59: 15
 40-49: 2
 <40: 2
 Final grades are based on overall score accumulation and relative class distributions
546
547
Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
548
Predictor Error Measures
 Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
 Loss function: measures the error between yi and the predicted value yi’
 Absolute error: | yi – yi’|
 Squared error: (yi – yi’)2
 Test error (generalization error): the average loss over the test set
 Mean absolute error:      MAE = (1/d) Σ_{i=1}^{d} | yi − yi’ |
 Mean squared error:       MSE = (1/d) Σ_{i=1}^{d} ( yi − yi’ )²
 Relative absolute error:  RAE = Σ_{i=1}^{d} | yi − yi’ | / Σ_{i=1}^{d} | yi − ȳ |
 Relative squared error:   RSE = Σ_{i=1}^{d} ( yi − yi’ )² / Σ_{i=1}^{d} ( yi − ȳ )²
(ȳ is the mean of the actual values yi)
The mean squared error exaggerates the presence of outliers
Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
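A NumPy sketch of the four measures above (the data values are made up; variable names are ours):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # actual values y_i
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 4.0])   # predicted values y_i'
y_mean = y_true.mean()

mae = np.mean(np.abs(y_true - y_pred))                                    # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)                                     # mean squared error
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_mean))   # relative absolute error
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_mean) ** 2)     # relative squared error

print(mae, np.sqrt(mse), rae, rse)   # RMSE reported as the square root of MSE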
549
Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
 Builds an index for each attribute and only class list and the
current attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
 Integrates tree splitting and tree pruning: stop growing the
tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)
 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
 Uses bootstrapping to create several small samples
550
Data Cube-Based Decision-Tree Induction
 Integration of generalization with decision-tree induction
(Kamber et al.’97)
 Classification at primitive concept levels
 E.g., precise temperature, humidity, outlook, etc.
 Low-level concepts, scattered classes, bushy classification-
trees
 Semantic interpretation problems
 Cube-based multi-level classification
 Relevance analysis at multi-levels
 Information-gain analysis with dimension + level
551
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 9 —
Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
552
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
553
Bayesian Belief Networks
 Bayesian belief networks (also known as Bayesian
networks, probabilistic networks): allow class
conditional independencies between subsets of variables
 A (directed acyclic) graphical model of causal
relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
[Example graph: nodes X, Y, Z, P with edges X → Z, Y → Z, Y → P]
 Nodes: random variables
 Links: dependency
 X and Y are the parents of Z, and Y is
the parent of P
 No dependency between Z and P
 Has no loops/cycles
554
Bayesian Belief Network: An Example
[Figure: a Bayesian belief network over six variables — FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC]

CPT: Conditional Probability Table for variable LungCancer, showing the conditional probability for each possible combination of the values of its parents (FH, S):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
   LC       0.8       0.5        0.7        0.1
   ~LC      0.2       0.5        0.3        0.9

Derivation of the probability of a particular combination of values of X = (x1, ..., xn) from the CPTs:

   P(x1, ..., xn) = ∏_{i=1}^{n} P( xi | Parents(Yi) )
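A small Python sketch of the product rule above for part of this network; only the LungCancer CPT values come from the slide — the priors for FamilyHistory and Smoker are hypothetical numbers for illustration:

# P(LC | FH, S) from the CPT above; keys are (FH, S) truth values
cpt_lc = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors (not given on the slide)
p_fh = {True: 0.1, False: 0.9}
p_s  = {True: 0.3, False: 0.7}

def joint(fh, s, lc):
    # P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), since FH and S have no parents
    p_lc_given = cpt_lc[(fh, s)] if lc else 1.0 - cpt_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given

print(joint(True, True, True))    # 0.1 * 0.3 * 0.8 = 0.024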
555
Training Bayesian Networks: Several
Scenarios
 Scenario 1: Given both the network structure and all variables
observable: compute only the CPT entries
 Scenario 2: Network structure known, some variables hidden:
gradient descent (greedy hill-climbing) method, i.e., search for a
solution along the steepest descent of a criterion function
 Weights are initialized to random probability values
 At each iteration, it moves towards what appears to be the best
solution at the moment, w.o. backtracking
 Weights are updated at each iteration & converge to local
optimum
 Scenario 3: Network structure unknown, all variables observable:
search through the model space to reconstruct network topology
 Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose
 D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999.
556
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
557
Classification by Backpropagation
 Backpropagation: A neural network learning
algorithm
 Started by psychologists and neurobiologists to
develop and test computational analogues of neurons
 A neural network: A set of connected input/output
units where each connection has a weight associated
with it
 During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
 Also referred to as connectionist learning due to the connections between units
558
Neural Network as a Classifier
 Weakness
 Long training time
 Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units” in the
network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on an array of real-world data, e.g., hand-written
letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from trained neural networks
559
A Multi-Layer Feed-Forward Neural Network
[Figure: the input vector X feeds the input layer; weighted connections w_ij lead to a hidden layer and then to the output layer, which emits the output vector]

Weight update (learning rate λ):
   w_j^(k+1) = w_j^(k) + λ ( y_i − ŷ_i^(k) ) x_ij
560
How A Multi-Layer Neural Network Works
 The inputs to the network correspond to the attributes
measured for each training tuple
 Inputs are fed simultaneously into the units making up the input
layer
 They are then weighted and fed simultaneously to a hidden
layer
 The number of hidden layers is arbitrary, although usually only
one
 The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's
prediction
 The network is feed-forward: None of the weights cycles back to
an input unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression
561
Defining a Network Topology
 Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
 Normalize the input values for each attribute measured
in the training tuples to [0.0—1.0]
 One input unit per domain value, each initialized to 0
 Output, if for classification and more than two classes,
one output unit per class is used
 Once a network has been trained and its accuracy is
unacceptable, repeat the training process with a
different network topology or a different set of initial
weights
562
Backpropagation
 Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
 For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the
actual target value
 Modifications are made in the “backwards” direction: from the
output layer, through each hidden layer down to the first hidden
layer, hence “backpropagation”
 Steps
 Initialize weights to small random numbers, associated with
biases
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error (by updating weights and biases)

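A compact NumPy sketch of these steps for a single-hidden-layer network with sigmoid units and one training tuple; the sizes, data, and learning rate are ours, and the error terms follow the standard sigmoid backpropagation rules:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])      # one training tuple (3 inputs)
t = np.array([1.0])                # its target output
lr = 0.9                           # learning rate

# Step 1: initialize weights and biases to small random numbers
W1, b1 = rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, 2)   # input -> hidden (2 units)
W2, b2 = rng.uniform(-0.5, 0.5, (2, 1)), rng.uniform(-0.5, 0.5, 1)   # hidden -> output (1 unit)

# Step 2: propagate the inputs forward
h = sigmoid(x @ W1 + b1)           # hidden-layer outputs
o = sigmoid(h @ W2 + b2)           # network prediction

# Step 3: backpropagate the error (output error, then hidden error)
err_o = o * (1 - o) * (t - o)              # output-layer error term
err_h = h * (1 - h) * (W2 @ err_o)         # hidden-layer error term

# Update weights and biases in the "backwards" direction: w_ij += lr * err_j * o_i
W2 += lr * np.outer(h, err_o);  b2 += lr * err_o
W1 += lr * np.outer(x, err_h);  b1 += lr * err_h
print("prediction before this update:", o)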
563
Neuron: A Hidden/Output Layer Unit
 An n-dimensional input vector x is mapped into variable y by means of the
scalar product and a nonlinear function mapping
 The inputs to unit are outputs from the previous layer. They are multiplied by
their corresponding weights to form a weighted sum, which is added to the
bias associated with unit. Then a nonlinear activation function is applied to it.
[Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum Σ; the bias μk is added and a nonlinear activation function f produces the output y]

For example:  y = sign( Σ_{i=0}^{n} wi xi + μk )
564
Efficiency and Interpretability
 Efficiency of backpropagation: Each epoch (one iteration through
the training set) takes O(|D| * w), with |D| tuples and w weights,
but # of epochs can be exponential to n, the number of inputs, in
worst case
 For easier comprehension: Rule extraction by network pruning
 Simplify the network structure by removing weighted links that
have the least effect on the trained network
 Then perform link, unit, or activation value clustering
 The set of input and activation values are studied to derive
rules describing the relationship between the input and hidden
unit layers
 Sensitivity analysis: assess the impact that a given input variable
has on a network output. The knowledge gained from this analysis
can be represented in rules
565
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
566
Classification: A Mathematical Mapping
 Classification: predicts categorical class labels
 E.g., Personal homepage classification
 xi = (x1, x2, x3, …), yi = +1 or –1
 x1 : # of word “homepage”
 x2 : # of word “welcome”
 Mathematically, x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
 We want to derive a function f: X → Y
 Linear Classification
 Binary Classification problem
 Data above the red line belongs to class ‘x’
 Data below red line belongs to class ‘o’
 Examples: SVM, Perceptron, Probabilistic Classifiers
567
Discriminative Classifiers
 Advantages
 Prediction accuracy is generally high

As compared to Bayesian methods – in general
 Robust, works when training examples contain errors
 Fast evaluation of the learned target function

Bayesian networks are normally slow
 Criticism
 Long training time
 Difficult to understand the learned function (weights)

Bayesian networks can be used easily for pattern
discovery
 Not easy to incorporate domain knowledge

Easy in the form of priors on the data or
distributions
568
SVM—Support Vector Machines
 A relatively new classification method for both linear
and nonlinear data
 It uses a nonlinear mapping to transform the original
training data into a higher dimension
 With the new dimension, it searches for the linear
optimal separating hyperplane (i.e., “decision
boundary”)
 With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes can
always be separated by a hyperplane
 SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by
the support vectors)
569
SVM—History and Applications
 Vapnik and colleagues (1992)—groundwork from
Vapnik & Chervonenkis’ statistical learning theory in
1960s
 Features: training can be slow but accuracy is high
owing to their ability to model complex nonlinear
decision boundaries (margin maximization)
 Used for: classification and numeric prediction
 Applications:
 handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
570
SVM—General Philosophy
[Figure: two possible separating hyperplanes, one with a small margin and one with a large margin; the tuples lying on the margins are the support vectors]
571
SVM—Margins and Support Vectors
572
SVM—When Data Is Linearly Separable
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we
want to find the best one (the one that minimizes classification error on
unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
573
SVM—Linearly Separable
 A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints 
Quadratic Programming (QP)  Lagrangian multipliers
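A scikit-learn sketch (assumed available) of training a linear-kernel SVM on toy 2-D data and inspecting the resulting hyperplane and support vectors:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2-D classes (made-up data)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)          # the "essential" training tuples on the margin
print(clf.coef_, clf.intercept_)     # W and b of the separating hyperplane W·X + b = 0
print(clf.predict([[2, 2], [6, 6]]))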
574
Why Is SVM Effective on High Dimensional Data?
 The complexity of trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples
—they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier,
which is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have
good generalization, even when the dimensionality of the data is
high
575
SVM—Linearly Inseparable
 Transform the original input data into a higher
dimensional space
 Search for a linear separating hyperplane in the new
space
[Figure: data in the original 2-D input space (A1, A2) that is not linearly separable]
576
SVM: Different Kernel functions
 Instead of computing the dot product on the transformed data, it is mathematically equivalent to applying a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical kernel functions: polynomial, Gaussian radial basis function (RBF), and sigmoid kernels
 SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
577
Scaling SVM by Hierarchical Micro-Clustering
 SVM is not scalable to the number of data objects in terms of
training time and memory usage
 H. Yu, J. Yang, and J. Han, “
Classifying Large Data Sets Using SVM with Hierarchical Clusters”,
KDD'03)
 CB-SVM (Clustering-Based SVM)
 Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
 Use micro-clustering to effectively reduce the number of
points to be considered
 At deriving support vectors, de-cluster micro-clusters near
“candidate vector” to ensure high classification accuracy
578
CF-Tree: Hierarchical Micro-cluster
 Read the data set once, construct a statistical summary of the
data (i.e., hierarchical clusters) given a limited amount of
memory
 Micro-clustering: Hierarchical indexing structure
 provide finer samples closer to the boundary and coarser
samples farther from the boundary
579
Selective Declustering: Ensure High Accuracy
 CF tree is a suitable base structure for selective declustering
 De-cluster only the cluster Ei such that
 Di – Ri < Ds, where Di is the distance from the boundary to the
center point of Ei and Ri is the radius of Ei
 Decluster only the cluster whose subclusters have possibilities
to be the support cluster of the boundary

“Support cluster”: The cluster whose centroid is a support
vector
580
CB-SVM Algorithm: Outline
 Construct two CF-trees from positive and negative data
sets independently
 Need one scan of the data set
 Train an SVM from the centroids of the root entries
 De-cluster the entries near the boundary into the next
level
 The children entries de-clustered from the parent
entries are accumulated into the training set with
the non-declustered parent entries
 Train an SVM again from the centroids of the entries in
the training set
 Repeat until nothing is accumulated
581
Accuracy and Scalability on Synthetic Dataset
 Experiments on large synthetic data sets shows better
accuracy than random sampling approaches and far
more scalable than the original SVM algorithm
582
SVM vs. Neural Network
 SVM
 Deterministic
algorithm
 Nice generalization
properties
 Hard to learn –
learned in batch mode
using quadratic
programming
techniques
 Using kernels can learn very complex functions
 Neural Network
 Nondeterministic
algorithm
 Generalizes well but
doesn’t have strong
mathematical
foundation
 Can easily be learned in
incremental fashion
 To learn complex
functions—use
multilayer perceptron
583
SVM Related Links
 SVM Website: http://www.kernel-machines.org/
 Representative implementations
 LIBSVM: an efficient implementation of SVM, multi-
class classifications, nu-SVM, one-class SVM,
including also various interfaces with java, python,
etc.
 SVM-light: simpler but performance is not better than LIBSVM; supports only binary classification and only in C
 SVM-torch: another recent implementation, also written in C
584
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
585
Associative Classification
 Associative classification: Major steps
 Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
 Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
 Organize the rules to form a rule-based classifier
 Why effective?
 It explores highly confident associations among multiple
attributes and may overcome some constraints introduced by
decision-tree induction, which considers only one attribute at a
time
 Associative classification has been found to be often more
accurate than some traditional classification methods, such as C4.5
586
Typical Associative Classification Methods
 CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)
 Mine possible association rules in the form of

Cond-set (a set of attribute-value pairs) → class label
 Build classifier: Organize rules according to decreasing
precedence based on confidence and then support
 CMAR (Classification based on Multiple Association Rules: Li, Han,
Pei, ICDM’01)
 Classification: Statistical analysis on multiple rules
 CPAR (Classification based on Predictive Association Rules: Yin & Han,
SDM’03)
 Generation of predictive rules (FOIL-like analysis) but allow
covered rules to retain with reduced weight
 Prediction using best k rules

587
Frequent Pattern-Based Classification
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
 Accuracy issue
 Increase the discriminative power
 Increase the expressive power of the feature space
 Scalability issue
 It is computationally infeasible to generate all
feature combinations and filter them with an
information gain threshold
 Efficient method (DDPMine: FP-tree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, “Direct Discriminative Pattern Mining for Effective Classification”, ICDE'08
588
Frequent Pattern vs. Single Feature
[Fig. 1. Information Gain vs. Pattern Length — panels (a) Austral, (b) Cleve, (c) Sonar]
The discriminative power of some frequent patterns is
higher than that of single features.
589
Empirical Results
[Fig. 2. Information Gain vs. Pattern Frequency (support) — panels (a) Austral, (b) Breast, (c) Sonar; curves show InfoGain and IG_UpperBnd]
590
Feature Selection
 Given a set of frequent patterns, both non-
discriminative and redundant patterns exist, which can
cause overfitting
 We want to single out the discriminative patterns and
remove redundant ones
 The notion of Maximal Marginal Relevance (MMR) is
borrowed
 A document has high marginal relevance if it is both
relevant to the query and contains minimal marginal
similarity to previously selected documents
591
Experimental Results
591
592
Scalability Tests
593
DDPMine: Branch-and-Bound Search
Association between information
gain and frequency
a: constant, a parent node
b: variable, a descendant
sup(child) ≤ sup(parent),  i.e.,  sup(b) ≤ sup(a)
594
DDPMine Efficiency: Runtime
[Figure: runtime comparison of three algorithms — PatClass (the ICDE’07 pattern classification algorithm), Harmony, and DDPMine]
595
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
596
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and
waits until it is given a test tuple
 Eager learning (the above discussed methods):
Given a set of training tuples, constructs a
classification model before receiving new (e.g., test)
data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis
space since it uses many local linear functions to
form an implicit global approximation to the target
function
 Eager: must commit to a single hypothesis that covers the entire instance space
597
Lazy Learner: Instance-Based Methods
 Instance-based learning:
 Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
 Typical approaches
 k-nearest neighbor approach

Instances represented as points in a Euclidean
space.
 Locally weighted regression

Constructs local approximation
 Case-based reasoning

Uses symbolic representations and knowledge-
based inference
598
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of
Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most
common value among the k training examples
nearest to xq
 Voronoi diagram: the decision surface induced by
1-NN for a typical set of training examples
599
Discussion on the k-NN Algorithm
 k-NN for real-valued prediction for a given unknown
tuple
 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weight the contribution of each of the k neighbors
according to their distance to the query xq

Give greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes
 To overcome it, axes stretch or elimination of the
least relevant attributes
(the distance-based weight mentioned above:  w ≡ 1 / d(xq, xi)²)
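A scikit-learn sketch of k-NN with distance-based weighting (assumed API; note that scikit-learn's "distance" option weights by 1/d rather than the 1/d² shown above):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# weights="distance" gives closer neighbors a larger say in the vote
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean")
print(cross_val_score(knn, X, y, cv=10).mean())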
600
Case-Based Reasoning (CBR)
 CBR: Uses a database of problem solutions to solve new problems
 Store symbolic description (tuples or cases)—not points in a
Euclidean space
 Applications: Customer-service (product-related diagnosis), legal
ruling
 Methodology
 Instances represented by rich symbolic descriptions (e.g.,
function graphs)
 Search for similar cases, multiple retrieved cases may be
combined
 Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
 Challenges
 Find a good similarity metric
 Indexing based on syntactic similarity measure, and when failure occurs, backtracking and adapting to additional cases
601
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
602
Genetic Algorithms (GA)
 Genetic Algorithm: based on an analogy to biological evolution
 An initial population is created consisting of randomly generated
rules
 Each rule is represented by a string of bits
 E.g., if A1 and ¬A2 then C2 can be encoded as 100
 If an attribute has k > 2 values, k bits can be used
 Based on the notion of survival of the fittest, a new population is
formed to consist of the fittest rules and their offspring
 The fitness of a rule is represented by its classification accuracy on a
set of training examples
 Offspring are generated by crossover and mutation
 The process continues until a population P evolves, where each rule in P satisfies a prespecified fitness threshold
 Slow but easily parallelizable
603
Rough Set Approach
 Rough sets are used to approximately or “roughly” define
equivalent classes
 A rough set for a given class C is approximated by two sets: a
lower approximation (certain to be in C) and an upper
approximation (cannot be described as not belonging to C)
 Finding the minimal subsets (reducts) of attributes for feature
reduction is NP-hard but a discernibility matrix (which stores
the differences between attribute values for each pair of data
tuples) is used to reduce the computation intensity
604
Fuzzy Set
Approaches
 Fuzzy logic uses truth values between 0.0 and 1.0 to represent
the degree of membership (such as in a fuzzy membership graph)
 Attribute values are converted to fuzzy values. Ex.:
 Income, x, is assigned a fuzzy membership value to each of
the discrete categories {low, medium, high}, e.g. $49K
belongs to “medium income” with fuzzy value 0.15 but
belongs to “high income” with fuzzy value 0.96
 Fuzzy membership values do not have to sum to 1.
 Each applicable rule contributes a vote for membership in the
categories
 Typically, the truth values for each predicted category are
summed, and these sums are combined
605
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
Multiclass Classification
 Classification involving more than two classes (i.e., > 2 Classes)
 Method 1. One-vs.-all (OVA): Learn a classifier one at a time
 Given m classes, train m classifiers: one for each class
 Classifier j: treat tuples in class j as positive & all others as
negative
 To classify a tuple X, the set of classifiers vote as an ensemble
 Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
 Given m classes, construct m(m-1)/2 binary classifiers
 A classifier is trained using tuples of the two classes
 To classify a tuple X, each classifier votes. X is assigned to the
class with maximal vote
 Comparison
 All-vs.-all tends to be superior to one-vs.-all
 Problem: Binary classifiers are sensitive to errors, and errors affect the vote counting
606
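A scikit-learn sketch of the two schemes (assumed API); many classifiers handle multiclass natively, but the meta-estimators make the strategies explicit:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # m = 3 classes

ova = OneVsRestClassifier(LinearSVC(max_iter=10000))   # trains m binary classifiers
ava = OneVsOneClassifier(LinearSVC(max_iter=10000))    # trains m(m-1)/2 binary classifiers

print(cross_val_score(ova, X, y, cv=5).mean())
print(cross_val_score(ava, X, y, cv=5).mean())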
Error-Correcting Codes for Multiclass Classification
 Originally designed to correct errors during
data transmission for communication tasks by
exploring data redundancy
 Example
 A 7-bit codeword associated with classes 1-4
607
   Class   Error-Corr. Codeword
   C1      1 1 1 1 1 1 1
   C2      0 0 0 0 1 1 1
   C3      0 0 1 1 0 0 1
   C4      0 1 0 1 0 1 0
 Given an unknown tuple X, the 7 trained classifiers output: 0001010
 Hamming distance: # of differing bits between two codewords
 H(X, C1) = 5, by checking # of differing bits between [1111111] & [0001010]
 H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 is chosen as the label for X
 Error-correcting codes can correct up to ⌊(h−1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords
 If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct
 When selecting error-correcting codes, there should be good row-
wise and col.-wise separation between the codewords
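A short Python sketch reproducing the decoding step above (the codewords and the classifier output are taken from this example):

codewords = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}
output = "0001010"   # the 7 classifiers' outputs for tuple X

def hamming(a, b):
    # number of bit positions where the two codewords differ
    return sum(x != y for x, y in zip(a, b))

distances = {c: hamming(cw, output) for c, cw in codewords.items()}
print(distances)                           # {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(min(distances, key=distances.get))   # C4: the class with the closest codeword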
Semi-Supervised Classification
 Semi-supervised: Uses labeled and unlabeled data to build a
classifier
 Self-training:
 Build a classifier using the labeled data
 Use it to label the unlabeled data, and those with the most
confident label prediction are added to the set of labeled data
 Repeat the above process
 Adv: easy to understand; disadv: may reinforce errors
 Co-training: Use two or more classifiers to teach each other
 Each learner uses a mutually independent set of features of each
tuple to train a good classifier, say f1
 Then f1 and f2 are used to predict the class label for unlabeled
data X
 Teach each other: The tuple having the most confident
prediction from f1 is added to the set of labeled data for f2, & vice
versa
608
Active Learning
 Class labels are expensive to obtain
 Active learner: query human (oracle) for labels
 Pool-based approach: Uses a pool of unlabeled data
 L: a small subset of D is labeled, U: a pool of unlabeled data in
D
 Use a query function to carefully select one or more tuples
from U and request labels from an oracle (a human annotator)
 The newly labeled samples are added to L, and learn a model
 Goal: Achieve high accuracy using as few labeled data as
possible
 Evaluated using learning curves: Accuracy as a function of the
number of instances queried (# of tuples to be queried should be
small)
 Research issue: How to choose the data tuples to be queried?
 Uncertainty sampling: choose the least certain ones
 Reduce version space, the subset of hypotheses consistent w.
the training data
 Reduce expected entropy over U: find the query expected to give the greatest reduction in entropy
609
Transfer Learning: Conceptual Framework
 Transfer learning: Extract knowledge from one or more source
tasks and apply the knowledge to a target task
 Traditional learning: Build a new classifier for each new task
 Transfer learning: Build new classifier by applying existing
knowledge learned from source tasks
[Figure: in the traditional learning framework, a separate learning system is built for each of the different tasks; in the transfer learning framework, knowledge extracted from the source tasks is fed to the learning system for the target task]
610
Transfer Learning: Methods and Applications
 Applications: Especially useful when data is outdated or distribution
changes, e.g., Web document classification, e-mail spam filtering
 Instance-based transfer learning: Reweight some of the data from
source tasks and use it to learn the target task
 TrAdaBoost (Transfer AdaBoost)
 Assume source and target data each described by the same set of
attributes (features) & class labels, but rather diff. distributions
 Require only labeling a small amount of target data
 Use source data in training: When a source tuple is misclassified,
reduce the weight of such tuples so that they will have less effect
on the subsequent classifier
 Research issues
 Negative transfer: When it performs worse than no transfer at all
 Heterogeneous transfer learning: Transfer knowledge from
different feature space or multiple source domains
 Large-scale transfer learning
611
612
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
613
Summary
 Effective and advanced classification methods
 Bayesian belief network (probabilistic networks)
 Backpropagation (Neural networks)
 Support Vector Machine (SVM)
 Pattern-based classification
 Other classification methods: lazy learners (KNN, case-based
reasoning), genetic algorithms, rough set and fuzzy set
approaches
 Additional Topics on Classification
 Multiclass classification
 Semi-supervised classification
 Active learning
 Transfer learning
614
References
 Please see the references of Chapter 8
Surplus Slides
616
What Is Prediction?
 (Numerical) prediction is similar to classification
 construct a model
 use model to predict continuous or ordered value for a given
input
 Prediction is different from classification
 Classification refers to predict categorical class label
 Prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or
predictor variables and a dependent or response variable
 Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
617
Linear Regression
 Linear regression: involves a response variable y and a single
predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
 Method of least squares: estimates the best-fitting straight line
 Multiple linear regression: involves more than one predictor
variable
 Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
 Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
 Solvable by extension of the least squares method or using SAS, S-Plus
Least squares estimates for y = w0 + w1 x (x̄, ȳ are the means of x and y over D):
   w1 = Σ_{i=1}^{|D|} ( xi − x̄ )( yi − ȳ ) / Σ_{i=1}^{|D|} ( xi − x̄ )²
   w0 = ȳ − w1 x̄
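A NumPy sketch of these estimates on made-up (x, y) data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # roughly y = 1 + 1*x plus noise

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope
w0 = y_bar - w1 * x_bar                                             # intercept
print(w0, w1)            # least squares fit of y = w0 + w1 * x
print(w0 + w1 * 6.0)     # predict y for a new x = 6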
618
 Some nonlinear models can be modeled by a polynomial
function
 A polynomial regression model can be transformed into a linear regression model. For example,
   y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with new variables x2 = x², x3 = x³:
   y = w0 + w1 x + w2 x2 + w3 x3
 Other functions, such as power function, can also be
transformed to linear model
 Some models are intractable nonlinear (e.g., sum of
exponential terms)
 possible to obtain least square estimates through
extensive calculation on more complex formulae
Nonlinear Regression
619
 Generalized linear model:
 Foundation on which linear regression can be applied to
modeling categorical response variables
 Variance of y is a function of the mean value of y, not a constant
 Logistic regression: models the prob. of some event occurring
as a linear function of a set of predictor variables
 Poisson regression: models the data that exhibit a Poisson
distribution
 Log-linear models: (for categorical data)
 Approximate discrete multidimensional prob. distributions
 Also useful for data compression and smoothing
 Regression trees and model trees
 Trees to predict continuous values rather than class labels
Other Regression-Based Models
620
Regression Trees and Model Trees
 Regression tree: proposed in CART system (Breiman et al. 1984)
 CART: Classification And Regression Trees
 Each leaf stores a continuous-valued prediction
 It is the average value of the predicted attribute for the training
tuples that reach the leaf
 Model tree: proposed by Quinlan (1992)
 Each leaf holds a regression model—a multivariate linear
equation for the predicted attribute
 A more general case than regression tree
 Regression and model trees tend to be more accurate than linear
regression when the data are not represented well by a simple
linear model
621
 Predictive modeling: Predict data values or construct
generalized linear models based on the database data
 One can only predict value ranges or category
distributions
 Method outline:
 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
 Determine the major factors which influence the
prediction
 Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Predictive Modeling in Multidimensional Databases
622
Prediction: Numerical Data
623
Prediction: Categorical Data
624
SVM—Introductory Literature
 “Statistical Learning Theory” by Vapnik: extremely hard to
understand, containing many errors too.
 C. J. C. Burges.
A Tutorial on Support Vector Machines for Pattern Recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
 Better than Vapnik’s book, but still too hard as an introduction, and the examples are not intuitive
 The book “An Introduction to Support Vector Machines” by N.
Cristianini and J. Shawe-Taylor
 Also hard as an introduction, but the explanation of Mercer’s theorem is better than in the above literature
 The neural network book by Haykins
 Contains one nice chapter of SVM introduction
625
Notes about SVM—
Introductory Literature
 “Statistical Learning Theory” by Vapnik: difficult to understand,
containing many errors.
 C. J. C. Burges.
A Tutorial on Support Vector Machines for Pattern Recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
 Easier than Vapnik’s book, but still not introductory level; the
examples are not so intuitive
 The book An Introduction to Support Vector Machines by
Cristianini and Shawe-Taylor
 Not introductory level, but the explanation about Mercer’s
Theorem is better than above literatures
 Neural Networks and Learning Machines by Haykin
 Contains a nice chapter on SVM introduction
626
Associative Classification Can Achieve High
Accuracy and Efficiency (Cong et al. SIGMOD05)
627
A Closer Look at CMAR
 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
 Efficiency: Uses an enhanced FP-tree that maintains the distribution
of class labels among tuples satisfying each frequent itemset
 Rule pruning whenever a rule is inserted into the tree
 Given two rules, R1 and R2, if the antecedent of R1 is more general
than that of R2 and conf(R1) ≥ conf(R2), then prune R2
 Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2
test of statistical significance
 Classification based on generated/pruned rules
 If only one rule satisfies tuple X, assign the class label of the rule
 If a rule set S satisfies X, CMAR

divides S into groups according to class labels

uses a weighted χ² measure to find the strongest group of
rules, based on the statistical correlation of rules within a
group

assigns X the class label of the strongest group
628
Perceptron & Winnow
• Notation: x, w denote vectors; x, y, w (non-bold) denote scalars
• Input: {(x1, y1), …}
• Output: a classification function f(x) with
f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = -1
• Decision boundary: f(x) = w·x + b = 0, i.e., w1x1 + w2x2 + b = 0 in two dimensions (x1, x2)
• Perceptron: update w additively
• Winnow: update w multiplicatively
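A minimal Python sketch of the two update rules (the learning rate eta, promotion factor alpha, and threshold theta are illustrative assumptions, not from the original slide):

```python
import numpy as np

def perceptron_update(w, b, x, y, eta=1.0):
    """Additive update: on a mistake, shift w toward y * x."""
    if y * (np.dot(w, x) + b) <= 0:        # misclassified
        w = w + eta * y * x
        b = b + eta * y
    return w, b

def winnow_update(w, x, y, theta, alpha=2.0):
    """Multiplicative update for {0, 1} features: rescale weights of active features."""
    y_hat = 1 if np.dot(w, x) >= theta else -1
    if y_hat != y:                          # misclassified
        w = w * np.power(alpha, y * x)      # promote (y = +1) or demote (y = -1) active features
    return w
```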
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
629
630
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
630
631
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
632
Clustering for Data Understanding and
Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
 Climate: understanding Earth's climate, finding patterns of atmospheric
and ocean behavior
 Economic science: market research
633
Clustering as a Preprocessing Tool (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns
634
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”

The answer is typically highly subjective
635
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
636
Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of
these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
637
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
638
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters; the method tries to
find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
639
640
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
640
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
641
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four
steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when the assignment does
not change
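A minimal NumPy sketch of these four steps (Euclidean distance and a random initial partition are assumed; it ignores the empty-cluster corner case):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means for an (n, d) array X and k clusters."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))             # 1. arbitrary partition into k subsets
    for _ in range(max_iter):
        # 2. compute seed points as the centroids of the current partition
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # 4. stop when assignments do not change
            break
        labels = new_labels
    return labels, centroids
```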
642
An Example of K-Means Clustering
K = 2
[Figure: K-means iterations on the initial data set. Arbitrarily partition the objects into k groups, update the cluster centroids, reassign the objects, update the centroids again, and loop if needed]
643
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.

Compare: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Use the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
644
Variations of the K-Means Method
 Most variants of k-means differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method
645
What Is the Problem of the K-Means Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in the cluster
646
647
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random only if the quality is improved; loop until no change]
The K-Medoid Clustering Method
 K-Medoids Clustering: Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling
648
649
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
649
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) merges them step by step (Step 0 to Step 4) into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) proceeds in the reverse direction (Step 4 to Step 0)]
650
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
651
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster
652
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
653
Distance between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,
Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
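A small SciPy illustration of these linkage criteria (the toy data and the cut into two clusters are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # (n - 1) x 4 merge table
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)
```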
654
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point
of the cluster to its centroid
 Diameter: square root of average mean squared
distance between all pairs of points in the cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N (N-1)}}$
655
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n²),
where n is the total number of objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
656
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record
657
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
Example: the 5 points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
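A short sketch of computing a CF triple and of the additivity that lets BIRCH merge subclusters incrementally; it reproduces the five-point example above (the per-dimension form of SS follows that example):

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """CFs are additive, so two subclusters can be merged in O(d) time."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))   # (5, array([16., 30.]), array([ 54., 190.]))
```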
658
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
 Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: max # of children
 Threshold: max diameter of sub-clusters stored at the leaf nodes
659
The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i with child_i pointers; leaf nodes hold CF entries and are chained by prev/next pointers]
660
The Birch Algorithm
 Cluster Diameter
 For each point in the input
 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly
parents
 Algorithm is O(n)
 Concerns
 Sensitive to the insertion order of data points
 Since the size of leaf nodes is fixed, the resulting clusters may not
be very natural
 Clusters tend to be spherical given the radius and diameter
measures
Cluster diameter: $D = \sqrt{\frac{1}{n (n-1)} \sum_{i} \sum_{j} (x_i - x_j)^2}$
661
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
662
Overall Framework of CHAMELEON
Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p); partition the graph into many small sub-clusters; merge the partitions into the final clusters based on relative interconnectivity (connectivity of c1 and c2 over their internal connectivity) and relative closeness (closeness of c1 and c2 over their internal closeness)
663
664
CHAMELEON (Clustering Complex Objects)
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
 In practice, assume that the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
665
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a
Gaussian distribution:
 The probability that a point xi ∈ X is generated by the
model
 The likelihood that X is generated by the model:
 The task of learning the generative model: find the
parameters μ and σ² such that the likelihood is
maximized
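Written out explicitly (standard 1-D Gaussian forms, consistent with the text above): $P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$, and the likelihood of the whole set is $L(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)$; learning the model means choosing $\mu$ and $\sigma^2$ to maximize this likelihood.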
666
A Probabilistic Hierarchical Clustering Algorithm
 For a set of objects partitioned into m clusters C1, . . . , Cm, the quality
can be measured by Q({C1, ..., Cm}) = Π_i P(Ci),
where P() is the maximum likelihood
 Distance between clusters C1 and C2:
dist(C1, C2) = -log ( P(C1 ∪ C2) / (P(C1) P(C2)) )
 Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find the pair of clusters Ci and Cj such that
Ci, Cj = argmax_{i ≠ j} { log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) };
If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0 then merge Ci and Cj }
667
668
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
668
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such
as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
669
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if

p belongs to NEps(q)
 core point condition:
|NEps (q)| ≥ MinPts
MinPts = 5
Eps = 1 cm
p
q
670
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from
a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected
 A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is
a point o such that both, p and q
are density-reachable from o w.r.t.
Eps and MinPts
671
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
Core
Border
Outlier
Eps = 1cm
MinPts = 5
672
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been
processed
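A scikit-learn illustration of the procedure (the data and the parameter values eps = 0.5, min_samples = 3 are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],     # one dense group
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1],     # another dense group
              [9.0, 0.5]])                            # an isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)            # eps ~ Eps, min_samples ~ MinPts
print(db.labels_)                                     # noise points are labeled -1
```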
673
DBSCAN: Sensitive to Parameters
674
OPTICS: A Cluster-Ordering Method (1999)
 OPTICS: Ordering Points To Identify the Clustering
Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its
density-based clustering structure
 This cluster-ordering contains info equiv to the density-
based clusterings corresponding to a broad range of
parameter settings
 Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization
techniques
675
OPTICS: Some Extension from DBSCAN
 Index-based:

k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
 Complexity: O(N log N)
 Core distance of an object o: the minimum ε such that o is a core object
 Reachability distance of p from o: max(core-distance(o), d(o, p))

Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
676
[Figure: OPTICS reachability plot, showing reachability-distance (including 'undefined' values) against the cluster order of the objects]
677
678
Density-Based Clustering: OPTICS & Its Applications
DENCLUE: Using Statistical Density Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters
Influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
Total influence on x: $f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
Gradient of x in the direction of $x_i$: $\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
679
Denclue: Technical Essence
 Uses grid cells, but only keeps information about grid cells that do
actually contain data points, and manages these cells in a tree-based
access structure
 Influence function: describes the impact of a data point within its
neighborhood
 Overall density of the data space can be calculated as the sum of the
influence functions of all data points
 Clusters can be determined mathematically by identifying density
attractors
 Density attractors are local maxima of the overall density function
 Center-defined clusters: assign to each density attractor the points
density-attracted to it
 Arbitrarily shaped clusters: merge density attractors that are connected
through paths of high density (> threshold)
680
Density Attractor
681
Center-Defined and Arbitrary
682
683
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
683
Grid-Based Clustering Method
 Using multi-resolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)

A multi-resolution clustering approach using
wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)

Both grid-based and subspace clustering
684
STING: A Statistical Information Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution
685
i-th layer
(i-1)st layer
1st layer
The STING Clustering Method
 Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily calculated
from parameters of lower level cell
 count, mean, standard deviation (s), min, max
 type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries
 Start from a pre-selected layer—typically with a small
number of cells
 For each cell in the current level compute the confidence
interval
686
STING Algorithm and Its Analysis
 Remove the irrelevant cells from further consideration
 When finished examining the current layer, proceed to the
next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
 Query-independent, easy to parallelize, incremental
update
 O(K), where K is the number of grid cells at the lowest
level
 Disadvantages:
 All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
687
688
CLIQUE (Clustering In QUEst)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal-length
intervals
 It partitions an m-dimensional data space into non-overlapping
rectangular units
 A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a
subspace
689
CLIQUE: The Major Steps
 Partition the data space and find the number of points that
lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of
interest
 Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster
690
[Figure: CLIQUE example. Dense units are found in the (age, salary (×$10,000)) and (age, vacation (weeks)) subspaces for age 20 to 60, with density threshold 3; the intersection of the dense regions (roughly age 30 to 50) identifies a candidate cluster in the (age, vacation, salary) subspace]
691
Strength and Weakness of CLIQUE
 Strength
 automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 insensitive to the order of records in input and does not
presume some canonical data distribution
 scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
 Weakness
 The accuracy of the clustering result may be degraded in
exchange for the simplicity of the method
692
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
692
Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
 Test spatial randomness by a statistical test: the Hopkins Statistic
 Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in
the data space
 Sample n points, p1, …, pn, uniformly from D. For each pi, find its
nearest neighbor in D: xi = min{dist (pi, v)} where v in D
 Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and
v ≠ qi
 Calculate the Hopkins Statistic:
 If D is uniformly distributed, ∑ xi and ∑ yi will be close to each
other, and H is close to 0.5. If D is highly skewed, H is close to 0
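A commonly used form of the statistic, consistent with the interpretation above (the exact form is an assumption):

$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$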
Determine the Number of Clusters
 Empirical method
 # of clusters ≈√n/2 for a dataset of n points
 Elbow method
 Use the turning point in the curve of sum of within cluster variance
w.r.t the # of clusters
 Cross validation method
 Divide a given data set into m parts
 Use m – 1 parts to obtain a clustering model
 Use the remaining part to test the quality of the clustering

E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test
set and the closest centroids to measure how well the model fits
the test set
 For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
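A scikit-learn sketch of the elbow method described above (the synthetic data and the range of k are assumptions; inertia_ is the sum of squared distances to the closest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))     # replace with the real data set

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                                # within-cluster sum of squared distances
# choose k at the "elbow": the point where adding clusters stops reducing SSE sharply
print(sse)
```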
694
Measuring Clustering Quality
 Two methods: extrinsic vs. intrinsic
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using
certain clustering quality measure
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
 Ex. Silhouette coefficient
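An intrinsic-evaluation sketch using the silhouette coefficient (the synthetic blobs and k = 4 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # values near 1 indicate compact, well-separated clusters
```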
695
Measuring Clustering Quality: Extrinsic Methods
 Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
 Q is good if it satisfies the following 4 essential criteria
 Cluster homogeneity: the purer, the better
 Cluster completeness: should assign objects belonging to
the same category in the ground truth to the same
cluster
 Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
 Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
696
697
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
697
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
698
699
CS512-Spring 2011: An Introduction
 Coverage
 Cluster Analysis: Chapter 11
 Outlier Detection: Chapter 12
 Mining Sequence Data: BK2: Chapter 8
 Mining Graph Data: BK2: Chapter 9
 Social and Information Network Analysis

BK2: Chapter 9

Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010

Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010

Recent research papers
 Mining Data Streams: BK2: Chapter 8
 Requirements
 One research project
 One class presentation (15 minutes)
 Two homeworks (no programming assignment)
 Two midterm exams (no final exam)
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
700
References (2)
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
 A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
701
References (3)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06
702
Slides unused in class
703
704
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random only if the quality is improved; loop until no change]
705
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real objects to represent the clusters
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h

Then assign each non-selected object to the most
similar representative object
 repeat steps 2-3 until there is no change
706
PAM Clustering: Finding the Best Cluster Center
 Case 1: p currently belongs to oj. If oj is replaced by orandom as a
representative object and p is closest to one of the other
representative objects oi, then p is reassigned to oi
707
What Is the Problem with PAM?
 PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n−k)²) for each iteration,
where n is # of data points and k is # of clusters
 Sampling-based method:
CLARA (Clustering LARge Applications)
708
CLARA (Clustering Large Applications)
(1990)
 CLARA (Kaufmann and Rousseeuw in 1990)
 Built in statistical analysis packages, such as SPlus
 It draws multiple samples of the data set, applies PAM
on each sample, and gives the best clustering as the
output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
709
CLARANS (“Randomized” CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 Draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
 If the local optimum is found, it starts with new randomly
selected node in search for a new local optimum
 Advantages: More efficient and scalable than both PAM
and CLARA
 Further improvement: Focusing techniques and spatial
access structures (Ester et al.’95)
710
ROCK: Clustering Categorical Data
 ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99
 Major ideas
 Use links to measure similarity/proximity
 Not distance-based
 Algorithm: sampling-based clustering
 Draw random sample
 Cluster with links
 Label data in disk
 Experiments
 Congressional voting, mushroom data
711
Similarity Measure in ROCK
 Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard co-efficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard co-efficient-based similarity function:
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$
712
Link Measure in ROCK
 Clusters

C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}

C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Neighbors

Two transactions are neighbors if sim(T1,T2) > threshold
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}

T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}

T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
 Link Similarity

Link similarity between two transactions is the # of common neighbors
 link(T1, T2) = 4, since they have 4 common neighbors

{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors

{a, b, d}, {a, b, e}, {a, b, g}
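A short Python sketch that reproduces the neighbor and link counts above (the neighbor threshold 0.4 is an assumption; any threshold strictly between 0.2 and 0.5 gives the same neighbor lists):

```python
def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

transactions = [frozenset(t) for t in (
    {'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
    {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'},
    {'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'})]

theta = 0.4                                   # neighbor threshold (assumed)
neighbors = {t: {u for u in transactions if u != t and jaccard(t, u) > theta}
             for t in transactions}

def link(t1, t2):
    """link(T1, T2) = number of common neighbors."""
    return len(neighbors[t1] & neighbors[t2])

T1, T2, T3 = map(frozenset, ({'a','b','c'}, {'c','d','e'}, {'a','b','f'}))
print(link(T1, T2), link(T1, T3))             # 4 3
```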
Aggregation-Based Similarity Computation
[Figure: two SimTrees ST1 and ST2. Node a is associated with leaf nodes n10, n11, n12 (children of n4) and node b with leaf nodes n13, n14 (children of n5); the similarities 0.9, 1.0, 0.8 and 0.9, 1.0 are shown on the corresponding links, and s(n4, n5) = 0.2]
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).
$sim(a, b) = \frac{\sum_{k=10}^{12} s(n_k, n_4)}{3} \cdot s(n_4, n_5) \cdot \frac{\sum_{l=13}^{14} s(n_5, n_l)}{2} = 0.9 \times 0.2 \times 0.95 = 0.171$
After aggregation, we reduce the quadratic-time computation to
linear-time computation: the computation above takes O(3 + 2) time.
714
Computing Similarity with Aggregation
To compute sim(na,nb):
 Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb
with nj.
 Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
 Calculate weighted average similarity between na and nb w.r.t. all such
pairs.
sim(na, nb) = avg_sim(na,n4) x s(n4, n5) x avg_sim(nb,n5)
= 0.9 x 0.2 x 0.95 = 0.171
sim(na, nb) can be computed
from aggregated similarities
Average similarity and total weight: a: (0.9, 3), b: (0.95, 2); s(n4, n5) = 0.2 (the same configuration as in the previous figure)
716
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Overview of Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Summary
716
Link-Based Clustering: Calculate Similarities
Based On Links
Jeh & Widom, KDD’2002: SimRank
Two objects are similar if they are
linked with the same or similar
objects
 The similarity between two
objects x and y is defined as
the average similarity between
objects linked with x and those
with y:
 Issue: Expensive to compute:
 For a dataset of N objects
and M links, it takes O(N²)
space and O(M²) time to
compute all similarities.
[Figure: a linked structure of Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03-05, vldb03-05, aaai04-05), and Conferences (sigmod, vldb, aaai)]
$sim(a, b) = \frac{C}{|I(a)| \, |I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} sim(I_i(a), I_j(b))$
717
Observation 1: Hierarchical Structures
 Hierarchical structures often exist naturally among objects
(e.g., taxonomy of animals)
All
electronics
grocery apparel
DVD camera
TV
A hierarchical structure of
products in Walmart
Articles
Words
Relationships between articles and
words (Chakrabarti, Papadimitriou,
Modha, Faloutsos, 2004)
718
Observation 2: Distribution of Similarity
 Power law distribution exists in similarities
 56% of similarity entries are in [0.005, 0.015]
 1.4% of similarity entries are larger than 0.1
 Can we design a data structure that stores the significant
similarities and compresses insignificant ones?
[Figure: distribution of SimRank similarities among DBLP authors, plotting the portion of entries against the similarity value]
719
A Novel Data Structure: SimTree
Each leaf node
represents an object
Each non-leaf node
represents a group
of similar lower-level
nodes
Similarities between
siblings are stored
Consumer
electronics
Apparels
Canon A40
digital camera
Sony V3 digital
camera
Digital
Cameras
TVs
720
Similarity Defined by SimTree
 Path-based node similarity
 simp(n7,n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
 Similarity between two nodes is the average similarity
between objects linked with them in other SimTrees
 Adjustment ratio for node x =
(average similarity between x and all other nodes) /
(average similarity between x's parent and all other nodes)
[Figure: a SimTree with non-leaf nodes n1 to n6 and leaf nodes n7, n8, n9, annotated with the similarities stored between sibling nodes (e.g., between n1 and n2) and with the adjustment ratio for node n7]
721
LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Method
 Initialize a SimTree for objects of each type
 Repeat until stable
 For each SimTree, update the similarities between its
nodes using similarities in other SimTrees

Similarity between two nodes x and y is the average
similarity between objects linked with them
 Adjust the structure of each SimTree

Assign each node to the parent node that it is most
similar to
For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
722
Initialization of SimTrees
 Initializing a SimTree
 Repeatedly find groups of tightly related nodes, which
are merged into a higher-level node
 Tightness of a group of nodes
 For a group of nodes {n1, …, nk}, its tightness is
defined as the number of leaf nodes in other SimTrees
that are connected to all of {n1, …, nk}
[Figure: nodes n1 and n2 are linked to leaf nodes 1 to 5 in another SimTree; three of those leaf nodes are connected to both n1 and n2, so the tightness of {n1, n2} is 3]
723
Finding Tight Groups by Freq. Pattern Mining
 Finding tight groups: reduced to frequent pattern mining
 Procedure of initializing a tree
 Start from leaf nodes (level-0)
 At each level l, find non-overlapping groups of similar
nodes with frequent pattern mining
[Figure: the links from leaf nodes 1 to 9 in another SimTree to n1, n2, n3, n4 are reduced to transactions such as {n1}, {n1, n2}, {n2}, {n2, n3, n4}, {n3, n4}, {n4}; frequent patterns yield the groups g1 = {n1, n2} and g2 = {n3, n4}. The tightness of a group of nodes is the support of a frequent pattern]
724
Adjusting SimTree Structures
 After similarity changes, the tree structure also needs to be
changed
 If a node is more similar to its parent’s sibling, then move
it to be a child of that sibling
 Try to move each node to its parent’s sibling that it is most
similar to, under the constraint that each parent node can
have at most c children
[Figure: node n7 is more similar to its parent's sibling (similarity 0.9 vs. 0.8), so it is moved to become a child of that sibling]
725
Complexity
                           Time            Space
Updating similarities      O(M (log N)²)   O(M + N)
Adjusting tree structures  O(N)            O(N)
LinkClus                   O(M (log N)²)   O(M + N)
SimRank                    O(M²)           O(N²)
For two types of objects, N in each, and M linkages between them.
726
Experiment: Email Dataset
 F. Nielsen. Email dataset.
www.imm.dtu.dk/~rem/data/Email-1431.zip
 370 emails on conferences, 272 on jobs,
and 789 spam emails
 Accuracy: measured by manually labeled
data
 Accuracy of clustering: % of pairs of objects
in the same cluster that share common label
Approach Accuracy time (s)
LinkClus 0.8026 1579.6
SimRank 0.7965 39160
ReCom 0.5711 74.6
F-SimRank 0.3688 479.7
CLARANS 0.4768 8.55
 Approaches compared:
 SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
 SimRank with FingerPrints (F-SimRank): Fogaras & Rácz, WWW 2005

pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
 ReCom (Wang et al. SIGIR 2003)

Iteratively clustering objects using cluster labels of linked objects
727
WaveCluster: Clustering by Wavelet Analysis (1998)
 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach which applies wavelet transform
to the feature space; both grid-based and density-based
 Wavelet transform: A signal processing technique that decomposes a
signal into different frequency sub-band
 Data are transformed to preserve relative distance between objects
at different levels of resolution
 Allows natural clusters to become more distinguishable
728
The WaveCluster Algorithm
 How to apply wavelet transform to find clusters
 Summarizes the data by imposing a multidimensional grid
structure onto data space
 These multidimensional spatial data objects are represented in an
n-dimensional feature space
 Apply wavelet transform on feature space to find the dense
regions in the feature space
 Apply wavelet transform multiple times which result in clusters at
different scales from fine to coarse
 Major features:
 Complexity O(N)
 Detect arbitrary shaped clusters at different scales
 Not sensitive to noise, not sensitive to input order
 Only applicable to low dimensional data
729
730
Quantization
& Transformation
 Quantize data into m-D grid structure,
then wavelet transform
a) scale 1: high resolution
b) scale 2: medium resolution
c) scale 3: low resolution
731
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 11 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
731
732
Review: Basic Cluster Analysis Methods (Chap.
10)
 Cluster Analysis: Basic Concepts
 Group data so that object similarity is high within clusters but low
across clusters
 Partitioning Methods
 K-means and k-medoids algorithms and their refinements
 Hierarchical Methods
 Agglomerative and divisive methods, BIRCH, CHAMELEON
 Density-Based Methods
 DBSCAN, OPTICS, and DENCLUE
 Grid-Based Methods
 STING and CLIQUE (subspace clustering)
 Evaluation of Clustering
 Assess clustering tendency, determine # of clusters, and measure
clustering quality
732
K-Means Clustering
K = 2
[Figure: K-means iterations on the initial data set. Arbitrarily partition the objects into k groups, update the cluster centroids, reassign the objects, update the centroids again, and loop if needed]
733
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) merges them step by step (Step 0 to Step 4) into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) proceeds in the reverse direction (Step 4 to Step 0)]
734
Distance between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,
Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
735
BIRCH and the Clustering Feature
(CF) Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i with child_i pointers; leaf nodes hold CF entries and are chained by prev/next pointers]
736
Example: the 5 points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
Overall Framework of CHAMELEON
Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p); partition the graph into many small sub-clusters; merge the partitions into the final clusters based on relative interconnectivity (connectivity of c1 and c2 over their internal connectivity) and relative closeness (closeness of c1 and c2 over their internal closeness)
737
Density-Based Clustering: DBSCAN
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if

p belongs to NEps(q)
 core point condition:
|NEps (q)| ≥ MinPts
MinPts = 5
Eps = 1 cm
p
q
738
739
Density-Based Clustering: OPTICS & Its Applications
DENCLUE: Center-Defined and Arbitrary
740
STING: A Statistical Information Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution
741
i-th layer
(i-1)st layer
1st layer
Evaluation of Clustering Quality
 Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring
the probability that the data is generated by a uniform data
distribution
 Determine the Number of Clusters
 Empirical method: # of clusters ≈√n/2
 Elbow method: Use the turning point in the curve of sum of within
cluster variance w.r.t # of clusters
 Cross validation method
 Measuring Clustering Quality
 Extrinsic: supervised

Compare a clustering against the ground truth using certain
clustering quality measure
 Intrinsic: unsupervised

Evaluate the goodness of a clustering by considering how well
the clusters are separated, and how compact the clusters are
742
743
Outline of Advanced Clustering Analysis
 Probability Model-Based Clustering
 Each object may take a probability to belong to a cluster
 Clustering High-Dimensional Data
 Curse of dimensionality: Difficulty of distance measure in high-D
space
 Clustering Graphs and Network Data
 Similarity measurement and clustering methods for graph and
networks
 Clustering with Constraints
 Cluster analysis under different kinds of constraints, e.g., that raised
from background knowledge or spatial distribution of the objects
744
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
744
Fuzzy Set and Fuzzy Cluster
 Clustering methods discussed so far
 Every data object is assigned to exactly one cluster
 Some applications may need fuzzy or soft cluster assignment
 Ex. An e-game could belong to both entertainment and software
 Methods: fuzzy clusters and probabilistic model-based clusters
 Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)
 Example: Popularity of cameras is defined as a fuzzy mapping
 Then, A(0.05), B(1), C(0.86), D(0.27)
745
Fuzzy (Soft) Clustering
 Example: Let cluster features be
 C1 :“digital camera” and “lens”
 C2: “computer“
 Fuzzy clustering
 k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]
 P1: for each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)
 P2: for each object oi, its memberships sum to 1 (equal participation in the clustering)
 P3: for each cluster Cj, the total membership lies strictly between 0 and n (ensures there is no empty cluster)
 Let c1, …, ck be the centers of the k clusters
 For an object oi, the sum of the squared error (SSE) is defined with a parameter p (see the formulas below)
 For a cluster Ci, the SSE sums over its member weights (see below)
 Measure how well a clustering fits the data: the total SSE over all objects and clusters (see below)
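In standard fuzzy-clustering form (with fuzzifier p > 1; treat the exact expressions as an assumption consistent with the description above): for an object $o_i$, $SSE(o_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2$; for a cluster $C_j$, $SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(o_i, c_j)^2$; for the whole clustering, $SSE(\mathcal{C}) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2$, with constraints $\sum_{j=1}^{k} w_{ij} = 1$ (P2) and $0 < \sum_{i=1}^{n} w_{ij} < n$ (P3).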
746
Probabilistic Model-Based Clustering
 Cluster analysis is to find hidden categories.
 A hidden category (i.e., probabilistic cluster) is a distribution over the
data space, which can be mathematically represented using a
probability density function (or distribution function).
 Ex. 2 categories for digital cameras
sold
 consumer line vs. professional line
 density functions f1, f2 for C1, C2
 obtained by probabilistic clustering
 A mixture model assumes that a set of observed objects is a mixture
of instances from multiple probabilistic clusters, and conceptually
each observed object is generated independently
 Our task: infer a set of k probabilistic clusters that is most likely to
generate D using the above data generation process
747
748
Model-Based Clustering
 A set C of k probabilistic clusters C1, …,Ck with probability density
functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.
 Probability of an object o generated by cluster Cj is
 Probability of o generated by the set of cluster C is
 Since objects are assumed to be generated
independently, for a data set D = {o1, …, on}, we have,
 Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized
 However, maximizing P(D|C) is often intractable since the probability
density function of a cluster can take an arbitrarily complicated form
 To make it computationally feasible (as a compromise), assume the
probability density functions being some parameterized distributions
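Written out in standard mixture-model form (consistent with the notation ωj and fj above): $P(o \mid C_j) = \omega_j f_j(o)$, $P(o \mid \mathcal{C}) = \sum_{j=1}^{k} \omega_j f_j(o)$, and $P(D \mid \mathcal{C}) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j f_j(o_i)$.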
749
Univariate Gaussian Mixture Model
 O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the
k distributions), and Pj(oi| θj) is the probability that oi is generated from
the j-th distribution using parameter θj, we have
 Univariate Gaussian mixture model
 Assume the probability density function of each cluster follows a 1-
d Gaussian distribution. Suppose that there are k clusters.
 The probability density function of each cluster is centered at μj
with standard deviation σj; with θj = (μj, σj), we have
The EM (Expectation Maximization) Algorithm
 The k-means algorithm has two steps at each iteration:
 Expectation Step (E-step): Given the current cluster centers, each
object is assigned to the cluster whose center is closest to the
object: An object is expected to belong to the closest cluster
 Maximization Step (M-step): Given the cluster assignment, for
each cluster, the algorithm adjusts the center so that the sum of
distance from the objects assigned to this cluster and the new
center is minimized
 The (EM) algorithm: A framework to approach maximum likelihood or
maximum a posteriori estimates of parameters in statistical models.
 E-step assigns objects to clusters according to the current fuzzy
clustering or parameters of probabilistic clusters
 M-step finds the new clustering or parameters that minimize the
sum of squared error (SSE) or maximize the expected likelihood
750
Fuzzy Clustering Using the EM Algorithm
 Initially, let c1 = a and c2 = b
 1st E-step: assign o to c1 with weight wt = …
 1st M-step: recalculate the centroids according to the partition matrix,
minimizing the sum of squared error (SSE)
 Iteratively calculate this until the cluster centers converge or the change
is small enough
752
753
Computing Mixture Models with EM
 Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ
= {θ1, …, θk} s.t.,P(O|Θ) is maximized, where θj = (μj, σj) are the mean and
standard deviation of the j-th univariate Gaussian distribution
 We initially assign random values to the parameters θj, then iteratively
conduct the E- and M-steps until convergence or until the change is sufficiently small
 At the E-step, for each object oi, calculate the probability that oi belongs
to each distribution, i.e., P(Θj | oi, Θ) = P(oi | θj) / Σl=1..k P(oi | θl)
 At the M-step, adjust the parameters θj = (μj, σj) so that the expected
likelihood P(O|Θ) is maximized, i.e.,
μj = Σi=1..n oi · P(Θj | oi, Θ) / Σi=1..n P(Θj | oi, Θ)   and
σj = √( Σi=1..n P(Θj | oi, Θ)(oi − μj)² / Σi=1..n P(Θj | oi, Θ) )
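A compact NumPy sketch (mine, not from the book) of the E- and M-steps just described for a univariate Gaussian mixture; equal mixing weights are assumed for simplicity, as on the slide, and the function and variable names are illustrative:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_univariate_gmm(o, k=2, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture with equal mixing weights (a sketch)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(o, size=k, replace=False).astype(float)  # random initial means
    sigma = np.full(k, o.std() + 1e-6)                        # common initial spread
    for _ in range(n_iter):
        # E-step: responsibility of each distribution for each object
        dens = np.array([gaussian_pdf(o, mu[j], sigma[j]) for j in range(k)])  # (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate mu_j and sigma_j to maximize the expected likelihood
        weights = resp.sum(axis=1)
        mu = (resp * o).sum(axis=1) / weights
        sigma = np.sqrt((resp * (o - mu[:, None]) ** 2).sum(axis=1) / weights) + 1e-6
    return mu, sigma

o = np.concatenate([np.random.normal(5, 1, 100), np.random.normal(15, 2, 100)])
print(em_univariate_gmm(o, k=2))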
Advantages and Disadvantages of Mixture Models
 Strength
 Mixture models are more general than partitioning and fuzzy
clustering
 Clusters can be characterized by a small number of parameters
 The results may satisfy the statistical assumptions of the
generative models
 Weakness
 May converge to a local optimum (workaround: run multiple times with
random initialization)
 Computationally expensive if the number of distributions is large,
or the data set contains very few observed data points
 Need large data sets
 Hard to estimate the number of clusters
754
755
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
755
756
Clustering High-Dimensional Data
 Clustering high-dimensional data (How high is high-D in clustering?)
 Many applications: text documents, DNA micro-array data
 Major challenges:

Many irrelevant dimensions may mask clusters

Distance measure becomes meaningless—due to equi-distance

Clusters may exist only in some subspaces
 Methods
 Subspace-clustering: Search for clusters existing in subspaces of
the given high dimensional data space

CLIQUE, ProClus, and bi-clustering approaches
 Dimensionality reduction approaches: Construct a much lower
dimensional space and search for clusters there (may construct
new dimensions by combining some dimensions in the original
data)

Dimensionality reduction methods and spectral clustering
Traditional Distance Measures May Not
Be Effective on High-D Data
 Traditional distance measure could be dominated by noises in many
dimensions
 Ex. Which pairs of customers are more similar?
 By Euclidean distance over all dimensions, Ada and Cathy do not come out
as the closest pair, despite looking more similar on the relevant dimensions
 Clustering should consider not only all dimensions but also which attributes
(features) are relevant
 Feature transformation: effective if most dimensions are relevant
(PCA & SVD useful when features are highly correlated/redundant)
 Feature selection: useful to find a subspace where the data have
nice clusters
757
758
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)
 Data in only one dimension is relatively
packed
 Adding a dimension “stretches” the
points across that dimension, making
them farther apart
 Adding more dimensions makes the
points even farther apart—high-dimensional
data is extremely sparse
 Distance measure becomes
meaningless—due to equi-distance
759
Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)
 Clusters may exist only in some subspaces
 Subspace-clustering: find clusters in all the subspaces
Subspace Clustering Methods
 Subspace search methods: Search various subspaces to
find clusters
 Bottom-up approaches
 Top-down approaches
 Correlation-based clustering methods
 E.g., PCA based approaches
 Bi-clustering methods
 Optimization-based methods
 Enumeration methods
Subspace Clustering Method (I):
Subspace Search Methods
 Search various subspaces to find clusters
 Bottom-up approaches
 Start from low-D subspaces and search higher-D subspaces only
when there may be clusters in such subspaces
 Various pruning techniques to reduce the number of higher-D
subspaces to be searched
 Ex. CLIQUE (Agrawal et al. 1998)
 Top-down approaches
 Start from full space and search smaller subspaces recursively
 Effective only if the locality assumption holds: the subspace of a
cluster can be determined from the local neighborhood
 Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method
761
762
CLIQUE: Subspace Clustering with Apriori Pruning
(Figure: grids over the (age, salary) and (age, vacation) subspaces, with age from 20 to 60 on the horizontal axis, salary in units of $10,000 and vacation in weeks on the vertical axes; the dense units found in these 2-D subspaces are intersected, Apriori-style, to obtain candidate dense units in the 3-D (age, salary, vacation) space; density threshold = 3)
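A toy Python sketch of the bottom-up idea behind CLIQUE (my own illustration, not the CLIQUE algorithm itself): count points per grid cell in 1-D subspaces, keep the dense cells, and only form candidate 2-D cells whose 1-D projections are dense (Apriori-style pruning); the grid width and density threshold are assumed parameters:

from collections import Counter
from itertools import combinations

def dense_units(points, width=10.0, threshold=3):
    """Return dense cells per 1-D subspace and candidate dense 2-D cells."""
    dims = range(len(points[0]))
    # 1-D pass: cell index -> count, per dimension
    one_d = {d: Counter(int(p[d] // width) for p in points) for d in dims}
    dense_1d = {d: {c for c, n in one_d[d].items() if n >= threshold} for d in dims}
    # 2-D pass: only count cells whose two 1-D projections are both dense
    dense_2d = {}
    for d1, d2 in combinations(dims, 2):
        counts = Counter(
            (int(p[d1] // width), int(p[d2] // width))
            for p in points
            if int(p[d1] // width) in dense_1d[d1] and int(p[d2] // width) in dense_1d[d2]
        )
        dense_2d[(d1, d2)] = {c for c, n in counts.items() if n >= threshold}
    return dense_1d, dense_2d

pts = [(25, 3.1), (27, 3.4), (29, 3.3), (45, 6.8), (46, 7.1), (47, 7.0), (60, 1.0)]
print(dense_units(pts, width=10.0, threshold=3))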
Subspace Clustering Method (II):
Correlation-Based Methods
 Subspace search method: similarity based on distance or
density
 Correlation-based method: based on advanced correlation
models
 Ex. PCA-based approach:
 Apply PCA (for Principal Component Analysis) to derive a
set of new, uncorrelated dimensions,
 then mine clusters in the new space or its subspaces
 Other space transformations:
 Hough transform
 Fractal dimensions
763
Subspace Clustering Method (III):
Bi-Clustering Methods
 Bi-clustering: Cluster both objects and attributes
simultaneously (treat objs and attrs in symmetric way)
 Four requirements:
 Only a small set of objects participate in a cluster
 A cluster only involves a small number of attributes
 An object may participate in multiple clusters, or
does not participate in any cluster at all
 An attribute may be involved in multiple clusters, or
is not involved in any cluster at all
764
 Ex 1. Gene expression or microarray data: a gene
sample/condition matrix.
 Each element in the matrix, a real number,
records the expression level of a gene under a
specific condition
 Ex. 2. Clustering customers and products
 Another bi-clustering problem
Types of Bi-clusters
 Let A = {a1, ..., an} be a set of genes, B = {b1, …, bn} a set of conditions
 A bi-cluster: A submatrix where genes and conditions follow some
consistent patterns
 4 types of bi-clusters (ideal cases)
 Bi-clusters with constant values:
 for any i in I and j in J, eij = c
 Bi-clusters with constant values on rows:
 eij = c + αi

Also, it can be constant values on columns
 Bi-clusters with coherent values (aka. pattern-based clusters)
 eij = c + αi + βj
 Bi-clusters with coherent evolutions on rows
 (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0 for any i1, i2 in I and j1, j2 in J
 i.e., only interested in the up- or down-regulated changes across
genes or conditions without constraining the exact values
765
Bi-Clustering Methods
 Real-world data is noisy: Try to find approximate bi-clusters
 Methods: Optimization-based methods vs. enumeration methods
 Optimization-based methods
 Try to find a submatrix at a time that achieves the best significance
as a bi-cluster
 Due to the cost in computation, greedy search is employed to find
local optimal bi-clusters
 Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)
 Enumeration methods
 Use a tolerance threshold to specify the degree of noise allowed in
the bi-clusters to be mined
 Then try to enumerate all submatrices as bi-clusters that satisfy the
requirements
 Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle:
Pei et al., ICDM’2003)
766
767
Bi-Clustering for Micro-Array Data Analysis
 Left figure: Micro-array “raw” data shows 3 genes and their
values in a multi-D space: Difficult to find their patterns
 Right two: Some subsets of dimensions form nice shift and
scaling patterns
 No globally defined similarity/distance measure
 Clusters may not be exclusive
 An object can appear in multiple clusters
Bi-Clustering (I): δ-Bi-Cluster
 For a submatrix I x J, the mean of the i-th row: eiJ = (1/|J|) Σj∈J eij
 The mean of the j-th column: eIj = (1/|I|) Σi∈I eij
 The mean of all elements in the submatrix: eIJ = (1/(|I||J|)) Σi∈I, j∈J eij
 The quality of the submatrix as a bi-cluster can be measured by the mean
squared residue value H(I x J) = (1/(|I||J|)) Σi∈I, j∈J (eij − eiJ − eIj + eIJ)²
 A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold.
When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0,
a user can specify the tolerance of average noise per element against a
perfect bi-cluster
 residue(eij) = eij − eiJ − eIj + eIJ
768
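A short NumPy sketch of the mean squared residue H(I × J) defined above (my own helper, assuming the submatrix is given as a dense array):

import numpy as np

def mean_squared_residue(E):
    """H(I x J) for a submatrix E (rows = genes in I, columns = conditions in J)."""
    row_mean = E.mean(axis=1, keepdims=True)   # e_iJ
    col_mean = E.mean(axis=0, keepdims=True)   # e_Ij
    all_mean = E.mean()                        # e_IJ
    residue = E - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

# A perfect bi-cluster with coherent values (e_ij = c + alpha_i + beta_j) has H = 0
E = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.0, 6.0],
              [0.0, 1.0, 3.0]])
print(mean_squared_residue(E))   # ~0.0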
Bi-Clustering (I): The δ-Cluster Algorithm
 Maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist
another δ-bi-cluster I′ x J′ which contains I x J
 Computing is costly: Use heuristic greedy search to obtain local optimal clusters
 Two-phase computation: deletion phase and addition phase
 Deletion phase: Start from the whole matrix, iteratively remove rows and
columns while the mean squared residue of the matrix is over δ
 At each iteration, for each row/column, compute its mean squared residue, i.e., the average of residue(eij)² over that row or column
 Remove the row or column with the largest mean squared residue
 Addition phase:
 Expand iteratively the δ-bi-cluster I x J obtained in the deletion phase as
long as the δ-bi-cluster requirement is maintained
 Consider all the rows/columns not involved in the current bi-cluster I x J by
calculating their mean squared residues
 A row/column of the smallest mean squared residue is added into the current
δ-bi-cluster
 It finds only one δ-bi-cluster, thus needs to run multiple times: replacing the
elements in the output bi-cluster by random numbers 769
Bi-Clustering (II): δ-pCluster
 Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern
similarity in large data sets. SIGMOD’02]
 A submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1
= ei1j2 − ei2j2. For any 2 x 2 submatrix of I x J, define the p-score = |(ei1j1 − ei1j2) − (ei2j1 − ei2j2)|
 A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2
x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's
tolerance of noise against a perfect bi-cluster
 The p-score controls the noise on every element in a bi-cluster, while the mean
squared residue captures the average noise
 Monotonicity: If I x J is a δ-pCluster, every x × y (x, y ≥ 2) submatrix of I x J is
also a δ-pCluster.
 A δ-pCluster is maximal if no more rows or columns can be added into the cluster
while it remains a δ-pCluster: we only need to compute all maximal δ-pClusters.
770
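A small sketch that checks the δ-pCluster condition on a submatrix by brute force over all 2 × 2 submatrices (fine for illustration; a real implementation would prune the search as MaPle does):

import numpy as np
from itertools import combinations

def p_score(e11, e12, e21, e22):
    return abs((e11 - e12) - (e21 - e22))

def is_delta_pcluster(E, delta):
    """True if every 2x2 submatrix of E has p-score <= delta."""
    rows, cols = E.shape
    for i1, i2 in combinations(range(rows), 2):
        for j1, j2 in combinations(range(cols), 2):
            if p_score(E[i1, j1], E[i1, j2], E[i2, j1], E[i2, j2]) > delta:
                return False
    return True

E = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.1, 6.0]])
print(is_delta_pcluster(E, delta=0.2))   # True: a near-perfect shifting pattern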
MaPle: Efficient Enumeration of δ-pClusters
 Pei et al., MaPle: Efficient enumerating all maximal δ-
pClusters. ICDM'03
 Framework: Same as pattern-growth in frequent pattern
mining (based on the downward closure property)
 For each condition combination J, find the maximal subsets
of genes I such that I x J is a δ-pCluster
 If I x J is not a submatrix of another δ-pCluster,
 then I x J is a maximal δ-pCluster.
 Algorithm is very similar to mining frequent closed itemsets
 Additional advantages of δ-pClusters:
 Due to the averaging in the δ-cluster definition, a δ-cluster may contain
outliers and yet stay within the δ-threshold; the p-score avoids this
 To compute bi-clusters for scaling patterns, taking the logarithm on the
ratio (d_xa / d_ya) / (d_xb / d_yb) leads back to the p-score form
771
Dimensionality-Reduction Methods
 Dimensionality reduction: In some situations, it is
more effective to construct a new space instead
of using some subspaces of the original data
772
 Ex. To cluster the points in the right figure, neither of the original dimensions
X and Y (nor any subspace of them) helps, since all three clusters project onto
overlapping regions of the X and Y axes.
 If we construct a new dimension (the dashed line in the figure), the three
clusters become apparent when the points are projected onto it
 Dimensionality reduction methods
 Feature selection and extraction: But may not focus on clustering
structure finding
 Spectral clustering: Combining feature extraction and clustering (i.e.,
use the spectrum of the similarity matrix of the data to perform
dimensionality reduction for clustering in fewer dimensions)

Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)

The Ng-Jordan-Weiss algorithm (NIPS’01)
Spectral Clustering:
The Ng-Jordan-Weiss (NJW) Algorithm
 Given a set of objects o1, …, on, and the distance between each pair
of objects, dist(oi, oj), find the desired number k of clusters
 Calculate an affinity matrix W, e.g., Wij = exp(−dist(oi, oj)² / (2σ²)) for i ≠ j,
where σ is a scaling parameter that controls how fast the affinity Wij
decreases as dist(oi, oj) increases. In NJW, set Wii = 0
 Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal
matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij.
Then, A is set to A = D^(−1/2) W D^(−1/2)
 A spectral clustering method finds the k leading eigenvectors of A
 A vector v is an eigenvector of matrix A if Av = λv, where λ is the
corresponding eigen-value
 Using the k leading eigenvectors, project the original data into the
new space defined by the k leading eigenvectors, and run a
clustering algorithm, such as k-means, to find k clusters
 Assign the original data points to clusters according to how the
transformed points are assigned in the clusters obtained
773
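A compact NumPy/scikit-learn sketch of the NJW pipeline described above (Gaussian affinity, symmetric normalization, k leading eigenvectors, then k-means); σ and k are assumed user parameters, and the row normalization of the eigenvector matrix follows the usual NJW recipe:

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=1.0, seed=0):
    # Affinity matrix with zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # A = D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    A = D_inv_sqrt @ W @ D_inv_sqrt
    # k leading eigenvectors of the symmetric matrix A
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, -k:]                      # columns for the k largest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize
    # Cluster the projected points; labels map back to the original objects
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(njw_spectral_clustering(X, k=2))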
Spectral Clustering: Illustration and Comments
 Spectral clustering: Effective in tasks like image processing
 Scalability challenge: Computing eigenvectors on a large matrix is costly
 Can be combined with other clustering methods, such as bi-clustering
774
775
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
775
Clustering Graphs and Network Data
 Applications
 Bi-partite graphs, e.g., customers and products,
authors and conferences
 Web search engines, e.g., click through graphs and
Web graphs
 Social networks, friendship/coauthor graphs
 Similarity measures
 Geodesic distances
 Distance based on random walk (SimRank)
 Graph clustering methods
 Minimum cuts: FastModularity (Clauset, Newman &
Moore, 2004)
 Density-based clustering: SCAN (Xu et al., KDD’2007)
776
Similarity Measure (I): Geodesic Distance
 Geodesic distance (A, B): length (i.e., # of edges) of the shortest path
between A and B (if not connected, defined as infinite)
 Eccentricity of v, eccen(v): The largest geodesic distance between v
and any other vertex u ∈ V − {v}
 E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3
 Radius of graph G: The minimum eccentricity of all vertices, i.e., the
distance between the “most central point” and the “farthest border”
 r = min_{v ∈ V} eccen(v)
 E.g., radius(g) = 2
 Diameter of graph G: The maximum eccentricity of all vertices, i.e., the
largest distance between any pair of vertices in G
 d = max_{v ∈ V} eccen(v)
 E.g., diameter(g) = 3
 A peripheral vertex is a vertex that achieves the diameter.
 E.g., Vertices c, d, and e are peripheral vertices
777
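A plain-Python sketch of these definitions using BFS on an unweighted, undirected graph (adjacency given as a dict; the example graph is illustrative, not the one drawn on the slide):

from collections import deque

def eccentricities(adj):
    """Geodesic eccentricity of every vertex in an unweighted, connected graph."""
    ecc = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                      # BFS gives shortest-path lengths from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        ecc[s] = max(dist.values())
    return ecc

adj = {'a': ['b', 'c'], 'b': ['a', 'c', 'd'], 'c': ['a', 'b', 'e'],
       'd': ['b'], 'e': ['c']}
ecc = eccentricities(adj)
print(ecc, "radius =", min(ecc.values()), "diameter =", max(ecc.values()))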
SimRank: Similarity Based on Random
Walk and Structural Context
 SimRank: structural-context similarity, i.e., based on the similarity of its
neighbors
 In a directed graph G = (V, E),
 individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}
 individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}
 Similarity in SimRank: s(u, v) = (C / (|I(u)| |I(v)|)) Σ_{x ∈ I(u)} Σ_{y ∈ I(v)} s(x, y),
with s(u, u) = 1 and damping constant C ∈ (0, 1)
 Initialization: s0(u, v) = 1 if u = v, and 0 otherwise
 Then we can compute si+1 from si based on the definition
 Similarity based on random walk: in a strongly connected component
 Expected distance: d(u, v) = Σ_{t: u⇝v} P[t] · l(t), over all tours t from u to v of length l(t)
 Expected meeting distance: m(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t] · l(t), over tours that bring u and v to a common vertex x
 Expected meeting probability: p(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t] · C^{l(t)}
778
P[t] is the probability of the
tour
Graph Clustering: Sparsest Cut
 G = (V, E). The cut set of a cut is the set
of edges {(u, v) ∈ E | u ∈ S, v ∈ T},
where S and T are the two partitions
 Size of the cut: # of edges in the cut set
 Min-cut (e.g., C1) is not a good partition
 A better measure, sparsity: Φ = (size of the cut set) / min{|S|, |T|}
 A cut is sparsest if its sparsity is not greater than that of any other cut
 Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut
 For k clusters, the modularity of a clustering assesses the quality of the
clustering: Q = Σ_{i=1..k} ( li/|E| − (di/(2|E|))² )
 The modularity of a clustering of a graph is the difference between the
fraction of all edges that fall into individual clusters and the fraction that
would do so if the graph vertices were randomly connected
 The optimal clustering of graphs maximizes the modularity
li: # edges between vertices in the i-th cluster
di: the sum of the degrees of the vertices in the i-th
cluster
779
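A small sketch computing the modularity Q = Σ_i (l_i/|E| − (d_i/(2|E|))²) of a given clustering of an undirected graph; the edge list and the partition are assumed inputs:

def modularity(edges, clusters):
    """edges: list of (u, v) pairs; clusters: list of sets of vertices."""
    m = len(edges)
    label = {v: i for i, c in enumerate(clusters) for v in c}
    l = [0] * len(clusters)   # edges fully inside cluster i
    d = [0] * len(clusters)   # sum of degrees of vertices in cluster i
    for u, v in edges:
        d[label[u]] += 1
        d[label[v]] += 1
        if label[u] == label[v]:
            l[label[u]] += 1
    return sum(l[i] / m - (d[i] / (2 * m)) ** 2 for i in range(len(clusters)))

edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f'), ('d', 'f')]
print(modularity(edges, [{'a', 'b', 'c'}, {'d', 'e', 'f'}]))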
Graph Clustering: Challenges of Finding Good Cuts
 High computational cost
 Many graph cut problems are computationally expensive
 The sparsest cut problem is NP-hard
 Need to tradeoff between efficiency/scalability and quality
 Sophisticated graphs
 May involve weights and/or cycles.
 High dimensionality
 A graph can have many vertices. In a similarity matrix, a vertex is
represented as a vector (a row in the matrix) whose
dimensionality is the number of vertices in the graph
 Sparsity
 A large graph is often sparse, meaning each vertex on average
connects to only a small number of other vertices
 A similarity matrix from a large sparse graph can also be sparse
780
Two Approaches for Graph Clustering
 Two approaches for clustering graph data
 Use generic clustering methods for high-dimensional data
 Designed specifically for clustering graphs
 Using clustering methods for high-dimensional data
 Extract a similarity matrix from a graph using a similarity measure
 A generic clustering method can then be applied on the similarity
matrix to discover clusters
 Ex. Spectral clustering: approximate optimal graph cut solutions
 Methods specific to graphs
 Search the graph to find well-connected components as clusters
 Ex. SCAN (Structural Clustering Algorithm for Networks)

X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A
Structural Clustering Algorithm for Networks”, KDD'07
781
SCAN: Density-Based Clustering of
Networks
 How many clusters?
 What size should they be?
 What is the best partitioning?
 Should some points be
segregated?
782
An Example Network
 Application: Given only information about who associates with whom,
can one identify clusters of individuals with common interests or
special relationships (families, cliques, terrorist cells)?
A Social Network Model
 Cliques, hubs and outliers
 Individuals in a tight social group, or clique, know many of the
same people, regardless of the size of the group
 Individuals who are hubs know many people in different groups
but belong to no single group. Politicians, for example bridge
multiple groups
 Individuals who are outliers reside at the margins of society.
Hermits, for example, know few people and belong to no group
 The Neighborhood of a Vertex
783
 Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set
of people that an individual knows)
Structure Similarity
 The desired features tend to be captured by a measure
we call Structural Similarity
 Structural similarity is large for members of a clique
and small for hubs and outliers
σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
784
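A minimal Python sketch of the structural similarity σ(v, w) above and of the ε/μ core test that SCAN uses (adjacency as a dict of sets; ε and μ are the user parameters that appear on the following slides; treating Γ(v) as the closed neighborhood, i.e., including v itself, is an assumption of this sketch):

import math

def sigma(adj, v, w):
    """Structural similarity; closed neighborhoods include the vertex itself."""
    gv, gw = adj[v] | {v}, adj[w] | {w}
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def is_core(adj, v, eps=0.7, mu=2):
    """v is a core if its eps-neighborhood (within Gamma(v)) has size >= mu."""
    gamma_v = adj[v] | {v}
    eps_neighborhood = [w for w in gamma_v if sigma(adj, v, w) >= eps]
    return len(eps_neighborhood) >= mu

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sigma(adj, 0, 1), is_core(adj, 0))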
Structural Connectivity [1]
 ε-Neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
 Core: CORE_ε,μ(v) ⇔ |Nε(v)| ≥ μ
 Direct structure reachable: DirREACH_ε,μ(v, w) ⇔ CORE_ε,μ(v) ∧ w ∈ Nε(v)
 Structure reachable: REACH_ε,μ(v, w), the transitive closure of direct structure
reachability
 Structure connected: CONNECT_ε,μ(v, w) ⇔ ∃u ∈ V: REACH_ε,μ(u, v) ∧ REACH_ε,μ(u, w)
[1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu (KDD'96), “A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with Noise”
785
Structure-Connected Clusters
 Structure-connected cluster C
 Connectivity: ∀ v, w ∈ C: CONNECT_ε,μ(v, w)
 Maximality: ∀ v, w ∈ V: v ∈ C ∧ REACH_ε,μ(v, w) ⇒ w ∈ C
 Hubs:
 Do not belong to any cluster
 Bridge to many clusters
 Outliers:
 Do not belong to any cluster
 Connect to fewer clusters
(Figure: example network in which a hub vertex bridges two structure-connected clusters and an outlier vertex hangs off a single cluster)
Algorithm (illustration)
(Figure sequence, slides 786–799: SCAN run on a 14-vertex example network with μ = 2 and ε = 0.7; structural similarities such as 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68 are computed edge by edge, core vertices are identified, clusters are grown from the cores, and the remaining vertices are left as hubs or outliers)
Running Time
 Running time = O(|E|)
 For sparse networks = O(|V|)
[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).
800
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
801
802
Why Constraint-Based Cluster Analysis?
 Need user feedback: Users know their applications the best
 Fewer parameters but more user-desired constraints, e.g., an
ATM allocation problem: obstacles & desired clusters
803
Categorization of Constraints
 Constraints on instances: specifies how a pair or a set of instances
should be grouped in the cluster analysis
 Must-link vs. cannot link constraints

must-link(x, y): x and y should be grouped into one cluster
 Constraints can be defined using variables, e.g.,

cannot-link(x, y) if dist(x, y) > d
 Constraints on clusters: specifies a requirement on the clusters
 E.g., specify the min # of objects in a cluster, the max diameter of a
cluster, the shape of a cluster (e.g., a convex), # of clusters (e.g., k)
 Constraints on similarity measurements: specifies a requirement that
the similarity calculation must respect
 E.g., driving on roads, obstacles (e.g., rivers, lakes)
 Issues: Hard vs. soft constraints; conflicting or redundant constraints
804
Constraint-Based Clustering Methods (I):
Handling Hard Constraints
 Handling hard constraints: Strictly respect the constraints in cluster
assignments
 Example: The COP-k-means algorithm
 Generate super-instances for must-link constraints

Compute the transitive closure of the must-link constraints

To represent such a subset, replace all those objects in the
subset by the mean.

The super-instance also carries a weight, which is the number
of objects it represents
 Conduct modified k-means clustering to respect cannot-link
constraints

Modify the center-assignment process in k-means to a nearest
feasible center assignment

An object is assigned to the nearest center so that the
assignment respects all cannot-link constraints
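A sketch (assuming constraints are given as index pairs) of the first COP-k-means step described above: take the transitive closure of the must-link constraints with union–find and replace each group by its weighted mean super-instance; the cannot-link-aware assignment step is only hinted at in the final comment:

import numpy as np

def must_link_super_instances(X, must_links):
    """Collapse must-linked objects into weighted super-instances."""
    parent = list(range(len(X)))

    def find(i):                      # union-find gives the transitive closure
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    groups = {}
    for i in range(len(X)):
        groups.setdefault(find(i), []).append(i)

    # Each super-instance: (mean of the group, weight = group size, member indices)
    return [(X[idx].mean(axis=0), len(idx), idx) for idx in groups.values()]

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
for mean, weight, members in must_link_super_instances(X, must_links=[(0, 1), (2, 3)]):
    print(mean, weight, members)
# The modified k-means would then assign each weighted super-instance to the
# nearest center that does not violate any cannot-link constraint.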
Constraint-Based Clustering Methods (II):
Handling Soft Constraints
 Treated as an optimization problem: When a clustering violates a soft
constraint, a penalty is imposed on the clustering
 Overall objective: Optimizing the clustering quality, and minimizing the
constraint violation penalty
 Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct
k-means clustering while enforcing constraint violation penalties
 Objective function: Sum of distance used in k-means, adjusted by the
constraint violation penalties
 Penalty of a must-link violation

If objects x and y must-be-linked but they are assigned to two
different centers, c1 and c2, dist(c1, c2) is added to the objective
function as the penalty
 Penalty of a cannot-link violation

If objects x and y cannot be linked but they are assigned to a
common center c, then dist(c, c′) is added to the objective function
as the penalty, where c′ is the closest cluster center to c that can
accommodate x or y
805
806
Speeding Up Constrained Clustering
 It is costly to compute some constrained
clustering
 Ex. Clustering with obstacle objects: Tung,
Hou, and Han. Spatial clustering in the
presence of obstacles, ICDE'01
 K-medoids is preferable, since k-means
may place the cluster center (e.g., an ATM)
in the middle of a lake
 Visibility graph and shortest path
 Triangulation and micro-clustering
 Two kinds of join indices (shortest-paths)
worth pre-computation
 VV index: indices for any pair of obstacle
vertices
 MV index: indices for any pair of micro-
cluster and obstacle vertex
807
An Example: Clustering With Obstacle Objects
Taking obstacles into account
Not Taking obstacles into account
808
User-Guided Clustering: A Special Kind of
Constraints
(Figure: multi-relational schema of a CS department database, with relations such as Professor, Student, Course, Open-course, Register, Advise, Work-In, Group, Publish, and Publication connected by join paths; the Student relation is the target of clustering, and the user hint is an attribute reachable from it)
 X. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”,
KDD'05
 User usually has a goal of clustering, e.g., clustering students by research area
 User specifies his clustering goal to CrossClus
809
Comparing with Classification
 User-specified feature (in the form
of attribute) is used as a hint, not
class labels
 The attribute may contain too
many or too few distinct values,
e.g., a user may want to
cluster students into 20
clusters instead of 3
 Additional features need to be
included in cluster analysis
All tuples for clustering
User hint
810
Comparing with Semi-Supervised Clustering
 Semi-supervised clustering: User provides a training set
consisting of “similar” (“must-link”) and “dissimilar”
(“cannot-link”) pairs of objects
 User-guided clustering: User specifies an attribute as a
hint, and more relevant features are found for clustering
(Figure: semi-supervised clustering constrains pairs among all tuples for clustering; user-guided clustering uses a user-specified attribute as the hint)
811
Why Not Semi-Supervised Clustering?
 Much information (in multiple relations) is needed to judge
whether two tuples are similar
 A user may not be able to provide a good training set
 It is much easier for a user to specify an attribute as a hint,
such as a student’s research area
Tom Smith SC1211 TA
Jane Chang BI205 RA
Tuples to be compared
User hint
812
CrossClus: An Overview
 Measure similarity between features by how they group
objects into clusters
 Use a heuristic method to search for pertinent features
 Start from user-specified feature and gradually
expand search range
 Use tuple ID propagation to create feature values
 Features can be easily created during the expansion
of search range, by propagating IDs
 Explore three clustering algorithms: k-means, k-medoids,
and hierarchical clustering
813
Multi-Relational Features
 A multi-relational feature is defined by:
 A join path, e.g., Student → Register → OpenCourse → Course
 An attribute, e.g., Course.area
 (For numerical feature) an aggregation operator, e.g., sum or average
 Categorical feature f = [Student → Register → OpenCourse → Course,
Course.area, null]
Areas of courses of each student:
Tuple  DB  AI  TH
t1     5   5   0
t2     0   3   7
t3     1   5   4
t4     5   0   5
t5     3   3   4

Values of feature f:
Tuple  DB   AI   TH
t1     0.5  0.5  0
t2     0    0.3  0.7
t3     0.1  0.5  0.4
t4     0.5  0    0.5
t5     0.3  0.3  0.4

(Figure: stacked bars visualizing f(t1), …, f(t5) over the areas DB, AI, TH)
814
Representing Features
 Similarity between tuples t1 and t2 w.r.t. categorical feature f
 Cosine similarity between vectors f(t1) and f(t2)
 Most important information of a
feature f is how f groups tuples into
clusters
 f is represented by similarities
between every pair of tuples
indicated by f
 The horizontal axes are the tuple
indices, and the vertical axis is the
similarity
 This can be considered as a vector
of N x N dimensions
Similarity vector Vf
sim_f(t1, t2) = (Σ_{k=1..L} f(t1).p_k · f(t2).p_k) / ( √(Σ_{k=1..L} f(t1).p_k²) · √(Σ_{k=1..L} f(t2).p_k²) )
815
Similarity Between Features
Feature f (course) Feature g (group)
DB AI TH Info sys Cog sci Theory
t1 0.5 0.5 0 1 0 0
t2 0 0.3 0.7 0 0 1
t3 0.1 0.5 0.4 0 0.5 0.5
t4 0.5 0 0.5 0.5 0 0.5
t5 0.3 0.3 0.4 0.5 0.5 0
Values of Feature f and g
Similarity between two features –
cosine similarity of two vectors
sim(f, g) = (Vf · Vg) / (|Vf| · |Vg|)
816
Computing Feature Similarity
(Figure: mapping between the values of feature f (DB, AI, TH) and feature g (Info sys, Cog sci, Theory) over the tuples)
 Similarity between feature values w.r.t. the tuples:
sim(fk, gq) = Σ_{i=1..N} f(ti).p_k · g(ti).p_q   (e.g., between DB and Info sys)
 Key identity: Vf · Vg = Σ_{i=1..N} Σ_{j=1..N} sim_f(ti, tj) · sim_g(ti, tj) = Σ_{k=1..l} Σ_{q=1..m} sim(fk, gq)²
 Tuple similarities are hard to compute directly; feature value similarities are easy to compute
 Compute the similarity between each pair of feature values by one scan on the data
817
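A tiny NumPy sketch of the identity above, under the assumption that the per-tuple feature vectors are normalized so that tuple similarity is their inner product: with feature-value matrices F (N × l) and G (N × m), one pass gives sim(f_k, g_q) for all pairs at once, and V_f · V_g is the sum of their squares (matrix names are mine):

import numpy as np

# Values of feature f (course areas) and g (research groups) from the slides
F = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])
G = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

S = F.T @ G                 # S[k, q] = sim(f_k, g_q), computed in one scan over the tuples
vf_dot_vg = (S ** 2).sum()  # equals V_f . V_g under the stated normalization assumption
print(S)
print(vf_dot_vg)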
Searching for Pertinent Features
 Different features convey different aspects of information
 Features conveying same aspect of information usually
cluster tuples in more similar ways
 Research group areas vs. conferences of publications
 Given user specified feature
 Find pertinent features by computing feature similarity
Research group area
Advisor
Conferences of papers
Research area
GPA
Number of papers
GRE score
Academic Performances
Nationality
Permanent address
Demographic info
818
Heuristic Search for Pertinent Features
Overall procedure
1. Start from the user-
specified feature
2. Search in neighborhood
of existing pertinent
features
3. Expand search range
gradually
(Figure: the same database schema, with the search starting from the user-hint attribute in step 1 and gradually expanding along join paths to neighboring relations in step 2; the Student relation remains the target of clustering)
 Tuple ID propagation is used to create multi-relational features
 IDs of target tuples can be propagated along any join path, from
which we can find tuples joinable with each target tuple
819
Clustering with Multi-Relational Features
 Given a set of L pertinent features f1, …, fL, similarity
between two tuples
 Weight of a feature is determined in feature search by
its similarity with other pertinent features
 Clustering methods
 CLARANS [Ng & Han 94], a scalable clustering
algorithm for non-Euclidean space
 K-means
 Agglomerative hierarchical clustering
sim(t1, t2) = Σ_{i=1..L} sim_{fi}(t1, t2) · fi.weight
820
Experiments: Compare CrossClus with
 Baseline: Only use the user specified feature
 PROCLUS [Aggarwal, et al. 99]: a state-of-the-art
subspace clustering algorithm
 Use a subset of features for each cluster
 We convert relational database to a table by
propositionalization
 User-specified feature is forced to be used in every
cluster
 RDBC [Kirsten and Wrobel’00]
 A representative ILP clustering algorithm
 Use neighbor information of objects for clustering
 User-specified feature is forced to be used
821
Measure of Clustering Accuracy
 Accuracy
 Measured by manually labeled data

We manually assign tuples into clusters according
to their properties (e.g., professors in different
research areas)
 Accuracy of clustering: Percentage of pairs of tuples in
the same cluster that share common label

This measure favors many small clusters

We let each approach generate the same number of
clusters
822
DBLP Dataset
(Figure: clustering accuracy on the DBLP dataset for feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and All three, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC; accuracy ranges from 0 to 1 on the vertical axis)
823
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
823
824
Summary
 Probability Model-Based Clustering
 Fuzzy clustering
 Probability-model-based clustering
 The EM algorithm
 Clustering High-Dimensional Data
 Subspace clustering: bi-clustering methods
 Dimensionality reduction: Spectral clustering
 Clustering Graphs and Network Data
 Graph clustering: min-cut vs. sparsest cut
 High-dimensional clustering methods
 Graph-specific clustering methods, e.g., SCAN
 Clustering with Constraints
 Constraints on instance objects, e.g., Must link vs. Cannot Link
 Constraint-based clustering algorithms
825
References (I)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD’98
 C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected
clustering. SIGMOD’99
 S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning.
J. ACM, 56:5:1–5:37, 2009.
 J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
1981.
 K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is ”nearest neighbor”
meaningful? ICDT’99
 Y. Cheng and G. Church. Biclustering of expression data. ISMB’00
 I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means
algorithm. SDM’05
 I. Davidson, K. L. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering
algorithms. PKDD’06
 C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation.
J. American Stat. Assoc., 97:611–631, 2002.
 F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for
Classification, Data Analysis and Image Recognition. Wiley, 1999.
 G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD’02
 H.-P. Kriegel, P. Kroeger, and A. Zimek. Clustering high dimensional data: A survey on subspace
clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery
from Data (TKDD), 3, 2009.
 U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007
References (II)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John
Wiley & Sons, 1988.
 B. Mirkin. Mathematical classification and clustering. J. of Global Optimization, 12:105–108, 1998.
 S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey.
IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 2004.
 A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS’01
 J. Pei, X. Zhang, M. Cho, H. Wang, and P. S. Yu. Maple: A fast algorithm for maximal pattern-based
clustering. ICDM’03
 M. Radovanović, A. Nanopoulos, and M. Ivanović. Nearest neighbors in high-dimensional data: the
emergence and influence of hubs. ICML’09
 S. E. Schaeffer. Graph clustering. Computer Science Review, 1:27–64, 2007.
 A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE’01
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large
databases. ICDT’01
 A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. In Handbook of Computational
Molecular Biology, Chapman & Hall, 2004.
 K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background
knowledge. ICML’01
 H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets.
SIGMOD’02
 X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks.
KDD’07
 X. Yin, J. Han, and P.S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05
Slides Not to Be Used in Class
827
828
Conceptual Clustering
 Conceptual clustering
 A form of clustering in machine learning
 Produces a classification scheme for a set of unlabeled
objects
 Finds characteristic description for each concept (class)
 COBWEB (Fisher’87)
 A popular and simple method of incremental conceptual
learning
 Creates a hierarchical clustering in the form of a
classification tree
 Each node refers to a concept and contains a
probabilistic description of that concept
829
COBWEB Clustering Method
A classification tree
830
More on Conceptual Clustering
 Limitations of COBWEB
 The assumption that the attributes are independent of each other is
often too strong because correlation may exist
 Not suitable for clustering large database data – skewed tree and
expensive probability distributions
 CLASSIT
 an extension of COBWEB for incremental clustering of continuous
data
 suffers similar problems as COBWEB
 AutoClass (Cheeseman and Stutz, 1996)
 Uses Bayesian statistical analysis to estimate the number of
clusters
 Popular in industry
831
Neural Network Approaches
 Neural network approaches
 Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
 New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
 Typical methods
 SOM (Self-Organizing Feature Map)
 Competitive learning

Involves a hierarchical architecture of several units
(neurons)

Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
832
Self-Organizing Feature Map (SOM)
 SOMs, also called topological ordered maps, or Kohonen Self-
Organizing Feature Map (KSOMs)
 It maps all the points in a high-dimensional source space into a 2 to 3-d
target space, s.t., the distance and proximity relationship (i.e., topology)
are preserved as much as possible
 Similar to k-means: cluster centers tend to lie in a low-dimensional
manifold in the feature space
 Clustering is performed by having several units competing for the
current object
 The unit whose weight vector is closest to the current object wins
 The winner and its neighbors learn by having their weights adjusted
 SOMs are believed to resemble processing that can occur in the brain
 Useful for visualizing high-dimensional data in 2- or 3-D space
833
Web Document Clustering Using SOM
 The result of
SOM clustering
of 12088 Web
articles
 The picture on
the right: drilling
down on the
keyword
“mining”
 Based on
websom.hut.fi
Web page
845
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 12 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
846
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
847
What Are Outliers?
 Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
 Ex.: Unusual credit card purchase, sports: Michael Jordan, Wayne
Gretzky, ...
 Outliers are different from the noise data
 Noise is random error or variance in a measured variable
 Noise should be removed before outlier detection
 Outliers are interesting: they violate the mechanism that generates the
normal data
 Outlier detection vs. novelty detection: early stage, outlier; but later
merged into the model
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis
848
Types of Outliers (I)
 Three kinds: global, contextual and collective outliers
 Global outlier (or point anomaly)
 Object is Og if it significantly deviates from the rest of the data set
 Ex. Intrusion detection in computer networks
 Issue: Find an appropriate measurement of deviation
 Contextual outlier (or conditional outlier)
 Object is Oc if it deviates significantly based on a selected context
 Ex. 80 °F in Urbana: an outlier? (depends on whether it is summer or winter)
 Attributes of data objects should be divided into two groups

Contextual attributes: defines the context, e.g., time & location

Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
 Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
 Issue: How to define or formulate meaningful context?
Global Outlier
849
Types of Outliers (II)
 Collective Outliers
 A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
 Applications: E.g., intrusion detection:

When a number of computers keep sending
denial-of-service packets to each other
Collective Outlier
 Detection of collective outliers

Consider not only behavior of individual objects, but also that of
groups of objects

Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects.
 A data set may have multiple types of outlier
 One object may belong to more than one type of outlier
850
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
 E.g., in clinical data, a small deviation could be an outlier, while in
marketing analysis much larger fluctuations may still be normal
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction
between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection
 Understandability
 Understand why these are outliers: Justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
851
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Outlier Detection I: Supervised Methods
 Two ways to categorize outlier detection methods:
 Based on whether user-labeled examples of outliers can be obtained:

Supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:

Statistical, proximity-based, and clustering-based methods
 Outlier Detection I: Supervised Methods
 Modeling outlier detection as a classification problem

Samples examined by domain experts used for training & testing
 Methods for Learning a classifier for outlier detection effectively:

Model normal objects & report those not matching the model as
outliers, or

Model outliers and treat those not matching the model as normal
 Challenges

Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers

Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
852
Outlier Detection II: Unsupervised Methods
 Assume the normal objects are somewhat “clustered” into multiple
groups, each having some distinct features
 An outlier is expected to be far away from any groups of normal objects
 Weakness: Cannot detect collective outlier effectively
 Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
 Ex. In some intrusion or virus detection, normal activities are diverse
 Unsupervised methods may have a high false positive rate but still
miss many real outliers.
 Supervised methods can be more effective, e.g., at identifying
attacks on key resources
 Many clustering methods can be adapted for unsupervised methods
 Find clusters, then outliers: not belonging to any cluster
 Problem 1: Hard to distinguish noise from outliers
 Problem 2: Costly, since clustering is performed first, yet there are far
fewer outliers than normal objects

Newer methods: tackle outliers directly
853
Outlier Detection III: Semi-Supervised Methods
 Situation: In many applications, the number of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
 Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
 If some labeled normal objects are available
 Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
 Those not fitting the model of normal objects are detected as outliers
 If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
 To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
854
Outlier Detection (1): Statistical Methods
 Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic model)
 The data not following the model are outliers.
855
 Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
 There are rich alternatives to use various statistical models
 E.g., parametric vs. non-parametric
 Example (right figure): First use Gaussian distribution
to model the normal data
 For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
Outlier Detection (2): Proximity-Based Methods
 An object is an outlier if its nearest neighbors are far
away, i.e., the proximity of the object deviates significantly from
the proximity of most of the other objects in the same data set
856
 The effectiveness of proximity-based methods highly relies on the
proximity measure.
 In some applications, proximity or distance measures cannot be
obtained easily.
 Often has difficulty finding a group of outliers that stay close to
each other
 Two major types of proximity-based outlier detection
 Distance-based vs. density-based
 Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
 Objects in region R are substantially different
from other objects in the data set.
 Thus the objects in R are outliers
Outlier Detection (3): Clustering-Based Methods
 Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
857
 Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
 Clustering is expensive: a straightforward adaptation of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster,
thus are outliers
858
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Statistical Approaches
 Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
 Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
 Methods are divided into two categories: parametric vs. non-
parametric
 Parametric method
 Assumes that the normal data is generated by a parametric
distribution with parameter θ
 The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
 The smaller this value, the more likely x is an outlier
 Non-parametric method
 Does not assume an a priori statistical model; instead, it determines
the model from the input data
 Not completely parameter-free, but the number and nature of the
parameters are flexible and not fixed in advance
 Examples: histogram and kernel density estimation
859
Parametric Methods I: Detecting Univariate
Outliers Based on the Normal Distribution
 Univariate data: A data set involving only one attribute or variable
 Often assume that data are generated from a normal distribution, learn
the parameters from the input data, and identify the points with low
probability as outliers
 Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
 Use the maximum likelihood method to estimate μ and σ
860
 Taking derivatives with respect to μ and σ², we derive the maximum likelihood
estimates μ̂ = (1/n) Σ_{i=1..n} xi and σ̂² = (1/n) Σ_{i=1..n} (xi − μ̂)²
 For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.5
 Then (24 − 28.61) / 1.51 = −3.04 < −3, so 24 is an outlier, since under the
normal distribution assumption the region μ ± 3σ contains 99.7% of the data
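A quick NumPy check of this example (same data as on the slide):

import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
mu = temps.mean()            # maximum likelihood estimate of the mean (28.61)
sigma = temps.std()          # ML estimate of the standard deviation (~1.5)
z = (temps - mu) / sigma
print(np.round(z, 2))
# The first entry (for 24.0) is about -3: roughly three standard deviations
# below the mean, so it is flagged as the outlier under the mu +/- 3*sigma rule.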
Parametric Methods I: The Grubb’s Test
 Univariate outlier detection: The Grubb's test (maximum normed
residual test) ─ another statistical method under normal distribution
 For each object x in a data set, compute its z-score z = |x − x̄| / s, where x̄ and s
are the sample mean and standard deviation; x is an outlier if
z ≥ ((N − 1) / √N) · √( t²_{α/(2N), N−2} / (N − 2 + t²_{α/(2N), N−2}) )
where t²_{α/(2N), N−2} is the value taken by a t-distribution at a
significance level of α/(2N), and N is the # of objects in the data
set
861
Parametric Methods II: Detection of
Multivariate Outliers
 Multivariate data: A data set involving two or more attributes or
variables
 Transform the multivariate outlier detection task into a univariate
outlier detection problem
 Method 1. Compute the Mahalanobis distance
 Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō),
where S is the covariance matrix
 Use the Grubb's test on this measure to detect outliers
 Method 2. Use the χ²-statistic: χ² = Σ_{i=1..n} (oi − Ei)² / Ei
 where Ei is the mean of the i-th dimension among all objects, and n is
the dimensionality
 If the χ²-statistic is large, then object o is an outlier
862
Parametric Methods III: Using Mixture of
Parametric Distributions
 Assuming that the data are generated by a single normal distribution
can sometimes be an oversimplification
 Example (right figure): The objects between the two
clusters cannot be captured as outliers since they
are close to the estimated mean
863
 To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
Pr(o | Θ1, Θ2) = fθ1(o) + fθ2(o),
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
 Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
 An object o is an outlier if it does not belong to any cluster
Non-Parametric Methods: Detection Using Histogram
 The model of normal data is learned from the
input data without any a priori structure.
 Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
 Outlier detection using histogram:
864
 Figure shows the histogram of purchase amounts in transactions
 A transaction in the amount of $7,500 is an outlier, since only 0.2%
transactions have an amount higher than $5,000
 Problem: Hard to choose an appropriate bin size for histogram
 Too small bin size → normal objects in empty/rare bins, false positive
 Too big bin size → outliers in some frequent bins, false negative
 Solution: Adopt kernel density estimation to estimate the probability
density distribution of the data. If the estimated density function is high,
the object is likely normal. Otherwise, it is likely an outlier.
865
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Proximity-Based Approaches: Distance-Based vs.
Density-Based Outlier Detection
 Intuition: Objects that are far away from the others are
outliers
 Assumption of proximity-based approach: The proximity of
an outlier deviates significantly from that of most of the
others in the data set
 Two types of proximity-based outlier detection methods
 Distance-based outlier detection: An object o is an
outlier if its neighborhood does not have enough other
points
 Density-based outlier detection: An object o is an outlier
if its density is relatively much lower than that of its
neighbors
866
Distance-Based Outlier Detection
 For each object o, examine the # of other objects in the r-
neighborhood of o, where r is a user-specified distance threshold
 An object o is an outlier if most (taking π as a fraction threshold) of
the objects in D are far away from o, i.e., not in the r-neighborhood of o
 An object o is a DB(r, π) outlier if  |{o′ | dist(o, o′) ≤ r}| / |D| ≤ π
 Equivalently, one can check the distance between o and its k-th
nearest neighbor ok, where k = ⌈π · |D|⌉; o is an outlier if dist(o, ok) > r
 Efficient computation: Nested loop algorithm
 For any object oi, calculate its distance from other objects, and
count the # of other objects in the r-neighborhood.
 If π∙n other objects are within r distance, terminate the inner loop
 Otherwise, oi is a DB(r, π) outlier
 Efficiency: In practice the CPU time is not O(n²) but roughly linear in the data
set size, since for most non-outlier objects the inner loop terminates early
867
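A direct Python sketch of the nested-loop algorithm above (r and π are the user parameters; the early termination mirrors the bullet points):

import numpy as np

def db_outliers(X, r, pi):
    """Return indices of DB(r, pi) outliers via the nested-loop algorithm."""
    n = len(X)
    threshold = pi * n            # stop counting once enough neighbors are found
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= threshold:       # early termination of the inner loop
                    break
        else:
            outliers.append(i)               # inner loop ran to completion: outlier
    return outliers

X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])
print(db_outliers(X, r=2.0, pi=0.1))   # index 50 (the isolated point) appears in the output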
Distance-Based Outlier Detection: A Grid-Based Method
 Why is efficiency still a concern? When the complete set of objects
cannot be held in main memory, there is I/O swapping cost
 The major cost: (1) each object tests against the whole data set, why
not only its close neighbor? (2) check objects one by one, why not
group by group?
 Grid-based method (CELL): Data space is partitioned into a multi-D
grid. Each cell is a hyper cube with diagonal length r/2
868

Pruning using the level-1 & level 2 cell properties:
 For any possible point x in cell C and any
possible point y in a level-1 cell, dist(x,y) ≤ r
 For any possible point x in cell C and any point y
such that dist(x,y) ≥ r, y is in a level-2 cell
 Thus we only need to check the objects that cannot be pruned, and
even for such an object o, only need to compute the distance between
o and the objects in the level-2 cells (since beyond level-2, the
distance from o is more than r)
Density-Based Outlier Detection
 Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data
distribution
 In the figure, o1 and o2 are local outliers relative to C1, o3 is a
global outlier, but o4 is not an outlier. However, a distance-based
method using a global threshold cannot identify o1 and o2 as
outliers (e.g., compared with o4).
869
 Intuition (density-based outlier detection): The density around an outlier
object is significantly different from the density around its neighbors
 Method: Use the relative density of an object against its neighbors as
the indicator of the degree of the object being outliers
 k-distance of an object o, distk(o): distance between o and its k-th NN
 k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
 Nk(o) could be bigger than k since multiple objects may have
identical distance to o
Local Outlier Factor: LOF
 Reachability distance from o′ to o: reachdist_k(o ← o′) = max{dist_k(o), dist(o, o′)},
where k is a user-specified parameter
 Local reachability density of o: lrd_k(o) = |N_k(o)| / Σ_{o′ ∈ N_k(o)} reachdist_k(o′ ← o)
870
 LOF (Local Outlier Factor) of an object o is the average ratio between the local
reachability densities of o’s k-nearest neighbors and that of o:
LOF_k(o) = (1/|N_k(o)|) Σ_{o′ ∈ N_k(o)} lrd_k(o′) / lrd_k(o)
 The lower the local reachability density of o, and the higher the local
reachability density of the kNN of o, the higher LOF
 This captures a local outlier whose local density is relatively low
comparing to the local densities of its kNN
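For a quick experiment, scikit-learn's LocalOutlierFactor implements this measure; a small usage sketch (k corresponds to the n_neighbors parameter, and the example data are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.randn(100, 2),           # a loose cluster
               0.2 * np.random.randn(20, 2) + 5,  # a tight cluster
               [[5.8, 5.8], [2.5, 2.5]]])         # candidate local/global outliers

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                 # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_      # larger score = more outlying
print(labels[-2:], np.round(scores[-2:], 2))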
871
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
 An object is an outlier if (1) it does not belong to any cluster, (2) there is
a large distance between the object and its closest cluster , or (3) it
belongs to a small or sparse cluster
 Case I: Not belong to any cluster
 Identify animals not part of a flock: Using a density-
based clustering method such as DBSCAN
 Case 2: Far from its closest cluster
 Using k-means, partition the data points into clusters
 For each object o, assign an outlier score based on
its distance from its closest center
 If dist(o, co)/avg_dist(co) is large, likely an outlier
 Ex. Intrusion detection: Consider the similarity between
data points and the clusters in a training data set
 Use a training set to find patterns of “normal” data, e.g., frequent
itemsets in each segment, and cluster similar connections into groups
 Compare new data points with the clusters mined—Outliers are
possible attacks 872
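A minimal sketch of the Case-2 score dist(o, co)/avg_dist(co) above, using scikit-learn's KMeans; the library choice, the data, and the number of clusters are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2)),
               [[15.0, 15.0]]])                       # a far-away point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[km.labels_]             # closest center c_o of each object
dist_to_center = np.linalg.norm(X - centers, axis=1)

# avg_dist(c_o): average distance of a cluster's members to its center
avg_dist = np.array([dist_to_center[km.labels_ == c].mean()
                     for c in range(km.n_clusters)])
score = dist_to_center / avg_dist[km.labels_]         # dist(o, c_o) / avg_dist(c_o)

print("most outlying index:", int(np.argmax(score)))  # expected: the far-away point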
 FindCBLOF: Detect outliers in small clusters
 Find clusters, and sort them in decreasing size
 To each data point, assign a cluster-based local
outlier factor (CBLOF):
 If object p belongs to a large cluster, CBLOF =
cluster size × similarity between p and its cluster
 If p belongs to a small one, CBLOF = cluster size
× similarity between p and the closest large cluster
873
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
 Ex. In the figure, o is an outlier since its closest large cluster is C1, but the
similarity between o and C1 is small. For any point in C3, its closest
large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small
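A rough sketch in the spirit of CBLOF as summarized above; the use of k-means, the 90% coverage rule for deciding which clusters are "large", and the similarity 1/(1 + distance to center) are all simplifying assumptions of mine. Low scores suggest outliers:

import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=5, large_fraction=0.9):
    # Cluster-based local outlier factors (low score = more outlying).
    # "Large" clusters are the biggest ones covering ~large_fraction of the data.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(sizes)[::-1]                      # clusters sorted by decreasing size
    covered = np.cumsum(sizes[order]) / len(X)
    large = set(order[:np.searchsorted(covered, large_fraction) + 1])

    scores = np.empty(len(X))
    for i, p in enumerate(X):
        c = km.labels_[i]
        if c in large:
            d = np.linalg.norm(p - km.cluster_centers_[c])
        else:                                            # small cluster: compare with closest large cluster
            d = min(np.linalg.norm(p - km.cluster_centers_[g]) for g in large)
        scores[i] = sizes[c] * 1.0 / (1.0 + d)           # cluster size x similarity
    return scores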
Clustering-Based Method: Strength and Weakness
 Strength
 Detect outliers without requiring any labeled data
 Work for many types of data
 Clusters can be regarded as summaries of the data
 Once the clusters are obtained, one only needs to compare an object
against the clusters to determine whether it is an outlier (fast)
 Weakness
 Effectiveness depends highly on the clustering method used—they
may not be optimized for outlier detection
 High computational cost: Need to first find clusters
 A method to reduce the cost: Fixed-width clustering

A point is assigned to a cluster if the center of the cluster is
within a pre-defined distance threshold from the point

If a point cannot be assigned to any existing cluster, a new
cluster is created and the distance threshold may be learned
from the training data under certain conditions
875
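A minimal sketch of fixed-width clustering as described above; the single-pass assignment, the fixed threshold w, and keeping the first point of a cluster as its center are simplifying assumptions:

import numpy as np

def fixed_width_clustering(X, w):
    # Assign each point to the first cluster whose center lies within w;
    # otherwise open a new cluster centered at the point (single pass).
    centers, counts, labels = [], [], []
    for p in X:
        for c, center in enumerate(centers):
            if np.linalg.norm(p - center) <= w:
                labels.append(c)
                counts[c] += 1
                break
        else:                                   # no existing cluster is close enough
            centers.append(np.asarray(p, dtype=float))
            counts.append(1)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers), np.array(counts)

# Points ending up in very small clusters are natural outlier candidates
labels, centers, counts = fixed_width_clustering(np.random.randn(500, 2), w=1.0)
print("cluster sizes:", counts)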
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Classification-Based Method I: One-Class Model
 Idea: Train a classification model that can
distinguish “normal” data from outliers
 A brute-force approach: Consider a training set
that contains samples labeled as “normal” and
others labeled as “outlier”
 But, the training set is typically heavily
biased: # of “normal” samples likely far
exceeds # of outlier samples
 Cannot detect unseen anomalies
876
 One-class model: A classifier is built to describe only the normal class.
 Learn the decision boundary of the normal class using classification
methods such as SVM
 Any samples that do not belong to the normal class (not within the
decision boundary) are declared as outliers
 Adv: can detect new outliers that may not appear close to any outlier
objects in the training set
 Extension: Normal objects may belong to multiple classes
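A small usage sketch of the one-class idea with scikit-learn's OneClassSVM; the library and the nu/gamma settings are assumptions, not something the slides specify:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(500, 2))       # assumed "normal" training samples

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.2, -0.1],                  # looks normal
                  [6.0, 6.0]])                  # far outside the learned boundary
print(clf.predict(X_new))                       # +1 = normal, -1 = outlier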
Classification-Based Method II: Semi-Supervised Learning
 Semi-supervised learning: Combining classification-
based and clustering-based methods
 Method
 Using a clustering-based approach, find a large
cluster, C, and a small cluster, C1
 Since some objects in C carry the label “normal”,
treat all objects in C as normal
 Use the one-class model of this cluster to identify
normal objects in outlier detection

Since some objects in cluster C1 carry the label
“outlier”, declare all objects in C1 as outliers
 Any object that does not fall into the model for C
(such as object a in the figure) is considered an outlier as well
877
 Comments on classification-based outlier detection methods
 Strength: Outlier detection is fast
 Bottleneck: Quality heavily depends on the availability and quality of
the training set; it is often difficult to obtain representative, high-
quality training data
878
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Mining Contextual Outliers I: Transform into
Conventional Outlier Detection
 If the contexts can be clearly identified, transform the problem into
conventional outlier detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
 Ex. Detect outlier customers in the context of customer groups
 Contextual attributes: age group, postal code
 Behavioral attributes: # of trans/yr, annual total trans. amount
 Steps: (1) locate c’s context, (2) compare c with the other customers in
the same group, and (3) use a conventional outlier detection method
 If the context contains very few customers, generalize contexts
 Ex. Learn a mixture model U on the contextual attributes, and
another mixture model V of the data on the behavior attributes

Learn a mapping p(Vi|Uj): the probability that a data object o
belonging to cluster Uj on the contextual attributes is generated by
cluster Vi on the behavior attributes
 Outlier score:
879
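A tiny sketch of the transform-to-conventional recipe above: group customers by their contextual attributes, then apply a conventional score (here a z-score) within each group. pandas, the z-score, and the column names are hypothetical illustrative choices:

import pandas as pd

# Hypothetical customer table: contextual vs. behavioral attributes
df = pd.DataFrame({
    "age_group":    ["20s", "20s", "20s", "60s", "60s", "60s"],
    "postal":       ["A", "A", "A", "B", "B", "B"],
    "trans_per_yr": [12, 15, 90, 3, 4, 5],
})

# Step 1: the context is the (age_group, postal) group
grp = df.groupby(["age_group", "postal"])["trans_per_yr"]

# Step 2: conventional score inside each context (here: a z-score)
df["ctx_score"] = (df["trans_per_yr"] - grp.transform("mean")) / grp.transform("std")
print(df.sort_values("ctx_score", ascending=False))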
Mining Contextual Outliers II: Modeling Normal
Behavior with Respect to Contexts
 In some applications, one cannot clearly partition the data into contexts
 Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products
browsed earlier should be considered as the context
 Model the “normal” behavior with respect to contexts
 Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute
values
 An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
 Using a prediction model that links the contexts and behavior, these
methods avoid the explicit identification of specific contexts
 Methods: A number of classification and prediction techniques can be
used to build such models, such as regression, Markov models, and
finite state automata
880
Mining Collective Outliers I: On the Set
of “Structured Objects”
 A group of objects is a collective outlier if the objects as a group
deviate significantly from the entire data set
 Need to examine the structure of the data set, i.e., the
relationships between multiple data objects
881
 Each of these structures is inherent to its respective type of data

For temporal data (such as time series and sequences), we explore
the structures formed by time, which occur in segments of the time
series or subsequences

For spatial data, explore local areas

For graph and network data, we explore subgraphs
 Difference from the contextual outlier detection: the structures are
often not explicitly defined, and have to be discovered as part of the
outlier detection process.
 Collective outlier detection methods: two categories

Reduce the problem to conventional outlier detection

Identify structure units, treat each structure unit (e.g.,
subsequence, time series segment, local area, or subgraph) as
a data object, and extract features

Then apply outlier detection to the set of “structured objects”
constructed in this way, using the extracted features
Mining Collective Outliers II: Direct Modeling of
the Expected Behavior of Structure Units
 Models the expected behavior of structure units directly
 Ex. 1. Detect collective outliers in online social network of customers
 Treat each possible subgraph of the network as a structure unit
 Collective outlier: An outlier subgraph in the social network

Small subgraphs that are of very low frequency

Large subgraphs that are surprisingly frequent
 Ex. 2. Detect collective outliers in temporal sequences
 Learn a Markov model from the sequences
 A subsequence can then be declared as a collective outlier if it
significantly deviates from the model
 Collective outlier detection is subtle due to the challenge of exploring
the structures in data
 The exploration typically uses heuristics, and thus may be
application dependent
 The computational cost is often high due to the sophisticated
mining process
882
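A minimal sketch of Ex. 2 above: estimate first-order Markov transition probabilities from training sequences, then score a subsequence by its average transition log-likelihood; the add-alpha smoothing and the scoring rule are my own simplifications:

import numpy as np
from collections import defaultdict

def train_markov(sequences, alpha=1.0):
    # First-order Markov model with add-alpha smoothing.
    counts = defaultdict(lambda: defaultdict(float))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    symbols = sorted(symbols)
    probs = {a: {b: (counts[a][b] + alpha) /
                    (sum(counts[a].values()) + alpha * len(symbols))
                 for b in symbols}
             for a in symbols}
    return probs

def avg_loglik(subseq, probs):
    # Average transition log-likelihood; very low values flag collective outliers.
    ll = [np.log(probs[a][b]) for a, b in zip(subseq, subseq[1:])]
    return float(np.mean(ll))

model = train_markov(["ababababab", "abababab", "aabababab"])
print(avg_loglik("ababab", model))   # typical subsequence: higher (less negative)
print(avg_loglik("bbbbbb", model))   # deviating subsequence: much lower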
883
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Challenges for Outlier Detection in High-
Dimensional Data
 Interpretation of outliers
 Detecting outliers without saying why they are outliers is not very
useful in high-dimensional settings, because many features (or
dimensions) are involved in a high-dimensional data set
 E.g., report the subspaces that manifest the outliers, or provide an
assessment of the “outlier-ness” of the objects
 Data sparsity
 Data in high-D spaces are often sparse
 The distance between objects becomes heavily dominated by
noise as the dimensionality increases
 Data subspaces
 Adaptive to the subspaces signifying the outliers
 Capturing the local behavior of data
 Scalable with respect to dimensionality
 # of subspaces increases exponentially
884
Approach I: Extending Conventional Outlier
Detection
 Method 1: Detect outliers in the full space, e.g., HilOut Algorithm
 Find distance-based outliers, but use the ranks of distance instead of
the absolute distance in outlier detection
 For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)
 The weight of object o: w(o) = Σi=1..k dist(o, nni(o)), i.e., the sum of its
distances to its k nearest neighbors
 All objects are ranked in weight-descending order
 Top-l objects in weight are output as outliers (l: a user-specified parameter)
 Employ space-filling curves for approximation: scalable in both time
and space w.r.t. data size and dimensionality
 Method 2: Dimensionality reduction
 Works only when, in the lower-dimensional space, normal instances
can still be distinguished from outliers
 PCA: Heuristically, the principal components with low variance are
preferred because, on such dimensions, normal objects are likely
close to each other and outliers often deviate from the majority
885
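A sketch of the Method-1 ranking above (weight = sum of distances to the k nearest neighbors, report the top-l); scikit-learn is used for the kNN search, and the space-filling-curve approximation of HilOut is not reproduced here:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weight_outliers(X, k=5, l=3):
    # Rank objects by the sum of distances to their k nearest neighbors
    # and return the indices of the top-l heaviest objects.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 1-NN
    dists, _ = nn.kneighbors(X)
    weights = dists[:, 1:].sum(axis=1)                # drop the zero self-distance
    return np.argsort(weights)[::-1][:l], weights

X = np.vstack([np.random.randn(200, 10), 5 + np.random.randn(2, 10)])
top, w = knn_weight_outliers(X, k=10, l=2)
print("top-l candidates:", top)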
Approach II: Finding Outliers in Subspaces
 Extending conventional outlier detection: Hard for outlier interpretation
 Find outliers in much lower dimensional subspaces: easy to interpret
why and to what extent the object is an outlier
 E.g., find outlier customers in certain subspace: average transaction
amount >> avg. and purchase frequency << avg.
 Ex. A grid-based subspace outlier detection method
 Project data onto various subspaces to find an area whose density is
much lower than average
 Discretize the data into a grid with φ equi-depth regions per dimension
(equi-depth so that each region holds a fraction 1/φ of the data along that
dimension, which makes the expected cell count below meaningful)
 Search for regions that are significantly sparse

Consider a k-d cube: k ranges on k dimensions, with n objects

If objects are independently distributed, the expected number of
objects falling into a k-dimensional region is (1/φ)^k ∙ n = f^k ∙ n, and the
standard deviation is sqrt(f^k ∙ (1 − f^k) ∙ n)

The sparsity coefficient of cube C:
S(C) = (n(C) − f^k ∙ n) / sqrt(f^k ∙ (1 − f^k) ∙ n),
where n(C) is the number of objects falling into C
 If S(C) < 0, C contains fewer objects than expected

The more negative, the sparser C is and the more likely the
objects in C are outliers in the subspace
886
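A small sketch of the sparsity coefficient on a 2-attribute subspace, with equi-depth bins taken from quantiles; the function and variable names are mine:

import numpy as np

def sparsity_coefficients(X2, phi=5):
    # Sparsity coefficient S(C) for each cell of a phi x phi equi-depth grid
    # over a 2-attribute subspace X2; very negative S(C) marks suspicious cells.
    n, k = X2.shape                                   # here k = 2 dimensions
    f = 1.0 / phi
    # Equi-depth bin edges per attribute (quantiles), then cell ids per object
    edges = [np.quantile(X2[:, j], np.linspace(0, 1, phi + 1)[1:-1]) for j in range(k)]
    cell_ids = np.stack([np.searchsorted(edges[j], X2[:, j]) for j in range(k)], axis=1)

    expected = n * f**k
    std = np.sqrt(n * f**k * (1 - f**k))
    scores = {}
    for cid in {tuple(c) for c in cell_ids}:
        count = int(np.sum(np.all(cell_ids == cid, axis=1)))
        scores[cid] = (count - expected) / std        # S(C)
    return scores

X2 = np.random.rand(1000, 2)
S = sparsity_coefficients(X2, phi=5)
print(min(S.items(), key=lambda kv: kv[1]))           # sparsest cell and its S(C)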
Approach III: Modeling High-Dimensional Outliers
 Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
 For each point o, examine the angle ∠xoy for every pair of points x, y
 For a point in the center of the data (e.g., a), the angles formed differ widely
 For an outlier (e.g., c), the variance of the angles is substantially smaller
 Use the variance of angles at a point to determine whether it is an outlier
 Combine angles and distance to model outliers
 Use the distance-weighted angle variance as the outlier score
 Angle-based outlier factor (ABOF):
ABOF(o) = VAR over all pairs x, y of ⟨x − o, y − o⟩ / (‖x − o‖² ∙ ‖y − o‖²)
 An efficient approximation method has been developed
 It can be generalized to handle arbitrary types of data
887
 Develop new models for high-
dimensional outliers directly
 Avoid proximity measures and adopt
new heuristics that do not deteriorate
in high-dimensional data
(Figure: a set of points forming a cluster, except c, which is an outlier)
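A brute-force sketch of the ABOF idea recalled above; it is O(n^3) and meant only for tiny data sets, and the test data are illustrative:

import numpy as np
from itertools import combinations

def abof(X):
    # Angle-based outlier factor: variance, over all pairs (x, y), of the
    # distance-weighted term <x - o, y - o> / (|x - o|^2 * |y - o|^2).
    # Small ABOF suggests an outlier (the angles barely vary).
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        terms = []
        for j, k in combinations([m for m in range(n) if m != i], 2):
            a, b = X[j] - X[i], X[k] - X[i]
            terms.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
        scores[i] = np.var(terms)
    return scores

X = np.vstack([np.random.randn(30, 2), [[8.0, 8.0]]])   # last point is isolated
print(np.argmin(abof(X)))                               # expected: index 30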
888
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Summary
 Types of outliers
 global, contextual & collective outliers
 Outlier detection
 supervised, semi-supervised, or unsupervised
 Statistical (or model-based) approaches
 Proximity-based approaches
 Clustering-based approaches
 Classification approaches
 Mining contextual and collective outliers
 Outlier detection in high dimensional data
889
References (1)
 B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248,
1979.
 M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining
techniques. Intell. Data Anal., 10:521–538, 2006.
 F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.
 D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44,
2006.
 F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.
 C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD’01
 R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
 I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery
Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.
 M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD’00
 D. Barbará, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system.
SAC’03
 Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in
data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.
 S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a
simple pruning rule. KDD’03
 D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using bayesian estimators. SDM’01
 V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009.
 D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection
algorithm. In CEC’02
References (2)
 E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly
detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security
Applications, 2002.
 E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00
 T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.
 V. J. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 22:85–126, 2004.
 D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.
 Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003.
 W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01
 W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship.
PAKDD’06
 E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
 E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237–253,
2000.
 H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08
 M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481–
2497, 2003.
 M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal
Process., 83:2499–2521, 2003.
 C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03
References (3)
 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local
correlation integral. ICDE’03
 A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest
technological trends. Comput. Netw., 51, 2007.
 X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data
Eng., 19, 2007.
 Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06
 N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into
information systems. Quality and Reliability Engineering International, 17:105–112, 2001.
 B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-
evolving time sequences. ICDE’00
Un-Used Slides
893
894
Outlier Discovery:
Statistical Approaches
Assume a model of the underlying distribution that generates the data
set (e.g., a normal distribution)
 Use discordancy tests depending on
 data distribution
 distribution parameter (e.g., mean, variance)
 number of expected outliers
 Drawbacks
 most tests are for a single attribute
 In many cases, data distribution may not be known
895
Outlier Discovery: Distance-Based Approach
 Introduced to counter the main limitations imposed by
statistical methods
 We need multi-dimensional analysis without knowing
data distribution
 Distance-based outlier: A DB(p, D)-outlier is an object O in
a dataset T such that at least a fraction p of the objects in T
lies at a distance greater than D from O
 Algorithms for mining distance-based outliers [Knorr & Ng,
VLDB’98]
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
896
Density-Based Local
Outlier Detection
 M. M. Breunig, H.-P. Kriegel, R. Ng, J.
Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 Distance-based outlier detection is based
on global distance distribution
 It has difficulty identifying outliers
if the data are not uniformly distributed
 Ex. C1 contains 400 loosely distributed
points, C2 has 100 tightly condensed
points, 2 outlier points o1, o2
 Distance-based method cannot identify o2
as an outlier
 Need the concept of local
outlier
 Local outlier factor (LOF)
 Assume outlier is not
crisp
 Each point has a LOF
897
Outlier Discovery: Deviation-Based Approach
 Identifies outliers by examining the main characteristics
of objects in a group
 Objects that “deviate” from this description are
considered outliers
 Sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in
large multidimensional data
898
References (1)
 B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika,
1979.
 Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric and symbolic
outlier mining techniques. Intell. Data Anal., 2006.
 Deepak Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf.
Syst., 2006.
 C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.
 M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Optics-of: Identifying local outliers. PKDD '99
 M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers.
SIGMOD'00.
 V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.
 D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative
selection algorithm. Computational Intelligence, 2002.
 E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised
anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining
for Security Applications, 2002.
 E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00.
 T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
 R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using
kernel feature space. KDD '05
 F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969.
899
References (2)
 V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.
 Douglas M Hawkins. Identification of Outliers. Chapman and Hall, 1980.
 P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on Reference
Interval Estimation. Clin Chem, 2001.
 W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood
relationship. PAKDD'06
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
 M. Markou and S. Singh. Novelty detection: a review—part 1: statistical approaches. Signal
Process., 83(12), 2003.
 M. Markou and S. Singh. Novelty detection: a review—part 2: neural network based approaches.
Signal Process., 83(12), 2003.
 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using
the local correlation integral. ICDE'03.
 A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and
latest technological trends. Comput. Netw., 51(12):3448–3470, 2007.
 W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469–479, 1972.
 X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl.
and Data Eng., 19(5):631–645, 2007.
 Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric
space. KDD '06:
 N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting
intrusions into information systems. Quality and Reliability Engineering International, 2001.
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 13 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
902
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
903
Mining Complex Types of Data
 Mining Sequence Data
 Mining Time Series
 Mining Symbolic Sequences
 Mining Biological Sequences
 Mining Graphs and Networks
 Mining Other Kinds of Data
904
Mining Sequence Data
 Similarity Search in Time Series Data
 Subsequence match, dimensionality reduction, query-based
similarity search, motif-based similarity search
 Regression and Trend Analysis in Time-Series Data
 long term + cyclic + seasonal variation + random movements
 Sequential Pattern Mining in Symbolic Sequences
 GSP, PrefixSpan, constraint-based sequential pattern mining
 Sequence Classification
 Feature-based vs. sequence-distance-based vs. model-based
 Alignment of Biological Sequences
 Pair-wise vs. multi-sequence alignment, substitution matrices, BLAST
 Hidden Markov Model for Biological Sequence Analysis
 Markov chain vs. hidden Markov models, forward vs. Viterbi vs.
Baum-Welch algorithms
905
Mining Graphs and Networks
 Graph Pattern Mining
 Frequent subgraph patterns, closed graph patterns, gSpan vs.
CloseGraph
 Statistical Modeling of Networks
 Small world phenomenon, power law (long-tail) distribution,
densification
 Clustering and Classification of Graphs and Homogeneous Networks
 Clustering: Fast Modularity vs. SCAN
 Classification: model vs. pattern-based mining
 Clustering, Ranking and Classification of Heterogeneous Networks
 RankClus, RankClass, and meta path-based, user-guided methodology
 Role Discovery and Link Prediction in Information Networks
 PathPredict
 Similarity Search and OLAP in Information Networks: PathSim, GraphCube
 Evolution of Social and Information Networks: EvoNetClus
906
Mining Other Kinds of Data
 Mining Spatial Data
 Spatial frequent/co-located patterns, spatial clustering and
classification
 Mining Spatiotemporal and Moving Object Data
 Spatiotemporal data mining, trajectory mining, periodica, swarm, …
 Mining Cyber-Physical System Data
 Applications: healthcare, air-traffic control, flood simulation
 Mining Multimedia Data
 Social media data, geo-tagged spatial clustering, periodicity analysis, …
 Mining Text Data
 Topic modeling, i-topic model, integration with geo- and networked
data
 Mining Web Data
 Web content, web structure, and web usage mining
 Mining Data Streams

907
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
908
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
909
Major Statistical Data Mining Methods
 Regression
 Generalized Linear Model
 Analysis of Variance
 Mixed-Effect Models
 Factor Analysis
 Discriminant Analysis
 Survival Analysis
910
Statistical Data Mining (1)
 There are many well-established statistical techniques for data
analysis, particularly for numeric data
 applied extensively to data from scientific experiments and
data from economics and the social sciences
 Regression
 predict the value of a response
(dependent) variable from one or
more predictor (independent)
variables where the variables are
numeric
 forms of regression: linear,
multiple, weighted, polynomial,
nonparametric, and robust
911
Scientific and Statistical Data Mining (2)
 Generalized linear models
 allow a categorical response variable
(or some transformation of it) to be
related to a set of predictor variables
 similar to the modeling of a numeric
response variable using linear
regression
 include logistic regression and Poisson
regression
 Mixed-effect models

For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
 Typically describe relationships between a response variable
and some covariates in data grouped according to one or more
factors
912
Scientific and Statistical Data Mining (3)
 Regression trees
 Binary trees used for classification
and prediction
 Similar to decision trees: tests are
performed at the internal nodes
 In a regression tree the mean of
the objective attribute is computed
and used as the predicted value
 Analysis of variance
 Analyze experimental data for two
or more populations described by a
numeric response variable and one
or more categorical variables
(factors)
913
Statistical Data Mining (4)
 Factor analysis
 determine which variables are
combined to generate a given
factor
 e.g., for many psychiatric data,
one can indirectly measure other
quantities (such as test scores)
that reflect the factor of interest
 Discriminant analysis
 predict a categorical response
variable, commonly used in social
science
 Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate
among the groups defined by the
response variable www.spss.com/datamine/factor.htm
914
Statistical Data Mining (5)
 Time series: many methods such as autoregression,
ARIMA (Autoregressive integrated moving-average
modeling), long memory time-series modeling
 Quality control: displays group summary charts
 Survival analysis
 Predicts the
probability that a
patient undergoing a
medical treatment
would survive at least
to time t (life span
prediction)
915
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
916
Views on Data Mining Foundations (I)
 Data reduction
 Basis of data mining: Reduce data representation
 Trades accuracy for speed in response
 Data compression
 Basis of data mining: Compress the given data by
encoding in terms of bits, association rules, decision
trees, clusters, etc.
 Probability and statistical theory
 Basis of data mining: Discover joint probability
distributions of random variables
917
 Microeconomic view
 A view of utility: Finding patterns that are interesting only to the
extent that they can be used in the decision-making process
of some enterprise
 Pattern Discovery and Inductive databases
 Basis of data mining: Discover patterns occurring in the
database, such as associations, classification models,
sequential patterns, etc.
 Data mining is the problem of performing inductive logic on
databases
 The task is to query the data and the theory (i.e., patterns) of
the database
 Popular among many researchers in database systems
Views on Data Mining Foundations (II)
918
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
919
Visual Data Mining
 Visualization: Use of computer graphics to create visual
images which aid in the understanding of complex,
often massive representations of data
 Visual Data Mining: discovering implicit but useful
knowledge from large data sets using visualization
techniques
(Figure: Visual data mining lies at the confluence of computer graphics,
high-performance computing, pattern recognition, human-computer
interfaces, and multimedia systems)
920
Visualization
 Purpose of Visualization
 Gain insight into an information space by mapping
data onto graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities,
relationships among data.
 Help find interesting regions and suitable
parameters for further quantitative analysis.
 Provide a visual proof of computer representations
derived
921
Visual Data Mining & Data Visualization
 Integration of visualization and data mining
 data visualization
 data mining result visualization
 data mining process visualization
 interactive visual data mining
 Data visualization
 Data in a database or data warehouse can be
viewed

at different levels of abstraction

as different combinations of attributes or
dimensions
 Data can be presented in various visual forms
922
Data Mining Result Visualization
 Presentation of the results or knowledge obtained
from data mining in visual forms
 Examples
 Scatter plots and boxplots (obtained from
descriptive data mining)
 Decision trees
 Association rules
 Clusters
 Outliers
 Generalized rules
923
Boxplots from Statsoft: Multiple
Variable Combinations
924
Visualization of Data Mining Results in SAS
Enterprise Miner: Scatter Plots
925
Visualization of Association Rules in
SGI/MineSet 3.0
926
Visualization of a Decision Tree in
SGI/MineSet 3.0
927
Visualization of Cluster Grouping in IBM
Intelligent Miner
928
Data Mining Process Visualization
 Presentation of the various processes of data mining
in visual forms so that users can see
 Data extraction process
 Where the data is extracted
 How the data is cleaned, integrated,
preprocessed, and mined
 Method selected for data mining
 Where the results are stored
 How they may be viewed
929
Visualization of Data Mining Processes
by Clementine
Understand
variations with
visualized data
See your solution
discovery
process clearly
930
Interactive Visual Data Mining
 Using visualization tools in the data mining process to
help users make smart data mining decisions
 Example
 Display the data distribution in a set of attributes
using colored sectors or columns (depending on
whether the whole space is represented by either a
circle or a set of columns)
 Use the display to decide which sector should first be
selected for classification and where a good split
point for this sector may be
931
Interactive Visual Mining by
Perception-Based Classification (PBC)
932
Audio Data Mining
 Uses audio signals to indicate the patterns of data or
the features of data mining results
 An interesting alternative to visual mining
 The inverse task, mining audio (such as music)
databases, is to find patterns from audio data
 Visual data mining may disclose interesting patterns
using graphical displays, but requires users to
concentrate on watching patterns
 Instead, transform patterns into sound and music
and listen to pitches, rhythms, tune, and melody in
order to identify anything interesting or unusual
933
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
934
Data Mining Applications
 Data mining: A young discipline with broad and
diverse applications
 There still exists a nontrivial gap between generic
data mining methods and effective and scalable
data mining tools for domain-specific applications
 Some application domains (briefly discussed here)
 Data Mining for Financial data analysis
 Data Mining for Retail and Telecommunication
Industries
 Data Mining in Science and Engineering
 Data Mining for Intrusion Detection and Prevention
 Data Mining and Recommender Systems
935
Data Mining for Financial Data Analysis (I)
 Financial data collected in banks and financial
institutions are often relatively complete, reliable, and
of high quality
 Design and construction of data warehouses for
multidimensional data analysis and data mining
 View the debt and revenue changes by month, by
region, by sector, and by other factors
 Access statistical information such as max, min,
total, average, trend, etc.
 Loan payment prediction/consumer credit policy
analysis
 feature selection and attribute relevance ranking
 Loan payment performance
936
 Classification and clustering of customers for targeted
marketing
 multidimensional segmentation by nearest-
neighbor, classification, decision trees, etc. to
identify customer groups or associate a new
customer to an appropriate customer group
 Detection of money laundering and other financial
crimes
 integration of data from multiple DBs (e.g., bank
transactions, federal/state crime history DBs)
 Tools: data visualization, linkage analysis,
classification, clustering tools, outlier analysis, and
sequential pattern analysis tools (find unusual
access sequences)
Data Mining for Financial Data Analysis (II)
937
Data Mining for Retail & Telcomm. Industries (I)
 Retail industry: huge amounts of data on sales,
customer shopping history, e-commerce, etc.
 Applications of retail data mining
 Identify customer buying behaviors
 Discover customer shopping patterns and trends
 Improve the quality of customer service
 Achieve better customer retention and satisfaction
 Enhance goods consumption ratios
 Design more effective goods transportation and
distribution policies
 Telcomm. and many other industries: Share many
similar goals and expectations of retail data mining
938
Data Mining Practice for Retail Industry
 Design and construction of data warehouses
 Multidimensional analysis of sales, customers, products, time,
and region
 Analysis of the effectiveness of sales campaigns
 Customer retention: Analysis of customer loyalty
 Use customer loyalty card information to register sequences
of purchases of particular customers
 Use sequential pattern mining to investigate changes in
customer consumption or loyalty
 Suggest adjustments on the pricing and variety of goods
 Product recommendation and cross-reference of items
 Fraud analysis and the identification of unusual patterns
 Use of visualization tools in data analysis
939
Data Mining in Science and Engineering
 Data warehouses and data preprocessing
 Resolving inconsistencies or incompatible data collected in
diverse environments and different periods (e.g. eco-system
studies)
 Mining complex data types
 Spatiotemporal, biological, diverse semantics and
relationships
 Graph-based and network-based mining
 Links, relationships, data flow, etc.
 Visualization tools and domain-specific knowledge
 Other issues
 Data mining in social sciences and social studies: text and
social media
 Data mining in computer science: monitoring systems,
940
Data Mining for Intrusion Detection and
Prevention
 Majority of intrusion detection and prevention systems use
 Signature-based detection: use signatures, attack patterns that
are preconfigured and predetermined by domain experts
 Anomaly-based detection: build profiles (models of normal
behavior) and detect behavior that deviates substantially from
the profiles
 How data mining can help
 New data mining algorithms for intrusion detection
 Association, correlation, and discriminative pattern analysis
help select and build discriminative classifiers
 Analysis of stream data: outlier detection, clustering, model
shifting
 Distributed data mining
 Visualization and querying tools
941
Data Mining and Recommender Systems
 Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
 Approaches: Content-based, collaborative, or their hybrid
 Content-based: Recommends items that are similar to items
the user preferred or queried in the past
 Collaborative filtering: Consider a user's social environment,
opinions of other customers who have similar tastes or
preferences
 Data mining and recommender systems
 Users C × items S: extrapolate from the known ratings to the unknown
ratings to predict user-item combinations
 Memory-based method often uses k-nearest neighbor
approach
 Model-based method uses a collection of ratings to learn a
model (e.g., probabilistic models, clustering, Bayesian
networks, etc.)
942
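A toy sketch of the memory-based (user-based k-nearest-neighbor) approach mentioned above: predict an unknown rating as the similarity-weighted average of the ratings given by the most similar users. The rating matrix is hypothetical, and plain cosine similarity over the raw rating vectors (zeros included) is a simplification:

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (hypothetical data)
R = np.array([[5, 4, 0, 1],
              [4, 5, 4, 1],
              [1, 1, 0, 5],
              [1, 2, 1, 4]], dtype=float)

def predict(R, user, item, k=2):
    # Predict R[user, item] from the k most similar users who rated the item.
    mask = R[:, item] > 0                               # users who rated this item
    mask[user] = False
    candidates = np.where(mask)[0]
    sims = np.array([np.dot(R[user], R[u]) /
                     (np.linalg.norm(R[user]) * np.linalg.norm(R[u]) + 1e-9)
                     for u in candidates])
    top = candidates[np.argsort(sims)[::-1][:k]]
    top_sims = np.sort(sims)[::-1][:k]
    return float(np.dot(top_sims, R[top, item]) / (top_sims.sum() + 1e-9))

print(predict(R, user=0, item=2))    # user 0 never rated item 2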
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
943
Ubiquitous and Invisible Data Mining
 Ubiquitous Data Mining
 Data mining is used everywhere, e.g., online shopping
 Ex. Customer relationship management (CRM)
 Invisible Data Mining
 Invisible: Data mining functions are built into daily life
operations
 Ex. Google search: Users may be unaware that they are
examining results returned by data mining
 Invisible data mining is highly desirable
 Invisible mining needs to consider efficiency and scalability,
user interaction, incorporation of background knowledge and
visualization techniques, finding interesting patterns, real-
time, …
 Further work: Integration of data mining into existing
business and scientific technologies to provide domain-
944
Privacy, Security and Social Impacts of Data
Mining
 Many data mining applications do not touch personal data
 E.g., meteorology, astronomy, geography, geology, biology, and
other scientific and engineering data
 Many DM studies are on developing scalable algorithms to find
general or statistically significant patterns, not touching individuals
 The real privacy concern: unconstrained access of individual
records, especially privacy-sensitive information
 Method 1: Removing sensitive IDs associated with the data
 Method 2: Data security-enhancing methods
 Multi-level security model: permit access only to the
authorized level
 Encryption: e.g., blind signatures, biometric encryption, and
anonymous databases (personal information is encrypted and
stored at different locations)
 Method 3: Privacy-preserving data mining methods
945
Privacy-Preserving Data Mining
 Privacy-preserving (privacy-enhanced or privacy-sensitive)
mining:
 Obtaining valid mining results without disclosing the
underlying sensitive data values
 Often needs trade-off between information loss and privacy
 Privacy-preserving data mining methods:
 Randomization (e.g., perturbation): Add noise to the data in
order to mask some attribute values of records
 K-anonymity and l-diversity: Alter individual records so that
they cannot be uniquely identified

k-anonymity: Any given record maps onto at least k other records

l-diversity: enforcing intra-group diversity of sensitive values
 Distributed privacy preservation: Data partitioned and
distributed either horizontally, vertically, or a combination of
both
 Downgrading the effectiveness of data mining: The output of
data mining may violate privacy
946
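A tiny sketch of the randomization (perturbation) idea above: release values with additive noise so that individual records are masked while aggregate statistics stay usable; the noise scale is an arbitrary illustrative choice:

import numpy as np

rng = np.random.RandomState(7)
salaries = rng.normal(60_000, 10_000, size=10_000)     # sensitive attribute

noise = rng.normal(0, 15_000, size=salaries.shape)     # masking noise
released = salaries + noise                            # what the miner sees

# Individual values are heavily distorted, aggregates much less so
print("true mean:", round(salaries.mean()), " released mean:", round(released.mean()))
print("true value[0]:", round(salaries[0]), " released[0]:", round(released[0]))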
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
947
Trends of Data Mining
 Application exploration: Dealing with application-specific
problems
 Scalable and interactive data mining methods
 Integration of data mining with Web search engines, database
systems, data warehouse systems and cloud computing systems
 Mining social and information networks
 Mining spatiotemporal, moving objects and cyber-physical
systems
 Mining multimedia, text and web data
 Mining biological and biomedical data
 Data mining with software engineering and system engineering
 Visual and audio data mining
 Distributed data mining and real-time data stream mining
 Privacy protection and information security in data mining
948
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
949
Summary
 We present a high-level overview of mining complex data types
 Statistical data mining methods, such as regression, generalized
linear models, analysis of variance, etc., are popularly adopted
 Researchers also try to build theoretical foundations for data
mining
 Visual/audio data mining has been popular and effective
 Application-based mining integrates domain-specific knowledge
with data analysis techniques and provides mission-specific
solutions
 Ubiquitous data mining and invisible data mining are penetrating
our data lives
 Privacy and data security are important issues in data mining,
and privacy-preserving data mining has been developed recently
 Our discussion on trends in data mining shows that data mining is
950
References and Further Reading
 The book lists many references for further reading; here we only list a few books
 E. Alpaydin. Introduction to Machine Learning, 2nd
ed., MIT Press, 2011
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed., Wiley-Interscience, 2000
 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected
World. Cambridge University Press, 2010.
 U. Fayyad, G. Grinstein, and A. Wierse (eds.), Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
 J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd
ed. 2011
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd
ed., Springer-Verlag, 2009
 D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press,
2009.
 B. Liu. Web Data Mining, Springer 2006.
 T. M. Mitchell. Machine Learning, McGraw Hill, 1997
 M. Newman. Networks: An Introduction. Oxford University Press, 2010.
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd
ed. 2005
951

DWDM 3rd EDITION TEXT BOOK SLIDES24.pptx

  • 1.
    1 1 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 1 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 2.
    2 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 3.
    3 Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras, YouTube  We are drowning in data, but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
  • 4.
    4 Evolution of Sciences Before 1600, empirical science  1600-1950s, theoretical science  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.  1950s-1990s, computational science  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)  Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.  1990-now, data science  The flood of data from new scientific instruments and simulations  The ability to economically store and manage petabytes of data online  The Internet and computing Grid that makes all these archives universally accessible  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!  Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
  • 5.
    5 Evolution of DatabaseTechnology  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s:  Data mining, data warehousing, multimedia databases, and Web databases  2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems
  • 6.
    6 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 7.
    7 What Is DataMining?  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Data mining: a misnomer?  Alternative names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.  Watch out: Is everything “data mining”?  Simple search and query processing  (Deductive) expert systems
  • 8.
    8 Knowledge Discovery (KDD)Process  This is a view from typical database systems and data warehousing communities  Data mining plays an essential role in the knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 9.
    9 Example: A WebMining Framework  Web mining usually involves  Data cleaning  Data integration from multiple sources  Warehousing the data  Data cube construction  Data selection for data mining  Data mining  Presentation of the mining results  Patterns and knowledge to be used or stored into knowledge-base
  • 10.
    10 Data Mining inBusiness Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 11.
    11 Example: Mining vs.Data Exploration  Business intelligence view  Warehouse, data cube, reporting but not much mining  Business objects vs. data mining tools  Supply chain example: tools  Data presentation  Exploration
  • 12.
    12 KDD Process: ATypical View from ML and Statistics Input Data Data Mining Data Pre- Processing Post- Processing  This is a view from typical machine learning and statistics communities Data integration Normalization Feature selection Dimension reduction Pattern discovery Association & correlation Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
  • 13.
    13 Example: Medical DataMining  Health care & medical data mining – often adopted such a view in statistics and machine learning  Preprocessing of the data (including feature extraction and dimension reduction)  Classification or/and clustering processes  Post-processing for presentation
  • 14.
    14 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 15.
    15 Multi-Dimensional View ofData Mining  Data to be mined  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks  Knowledge to be mined (or: Data mining functions)  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.  Descriptive vs. predictive data mining  Multiple/integrated functions and mining at multiple levels  Techniques utilized  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.  Applications adapted  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
  • 16.
    16 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 17.
    17 Data Mining: OnWhat Kinds of Data?  Database-oriented data sets and applications  Relational database, data warehouse, transactional database  Advanced data sets and advanced applications  Data streams and sensor data  Time-series data, temporal data, sequence data (incl. bio-sequences)  Structure data, graphs, social networks and multi-linked data  Object-relational databases  Heterogeneous databases and legacy databases  Spatial data and spatiotemporal data  Multimedia database  Text databases  The World-Wide Web
  • 18.
    18 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 19.
    19 Data Mining Function:(1) Generalization  Information integration and data warehouse construction  Data cleaning, transformation, integration, and multidimensional data model  Data cube technology  Scalable methods for computing (i.e., materializing) multidimensional aggregates  OLAP (online analytical processing)  Multidimensional concept description: Characterization and discrimination  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
  • 20.
    20 Data Mining Function:(2) Association and Correlation Analysis  Frequent patterns (or frequent itemsets)  What items are frequently purchased together in your Walmart?  Association, correlation vs. causality  A typical association rule  Diaper  Beer [0.5%, 75%] (support, confidence)  Are strongly associated items also strongly correlated?  How to mine such patterns and rules efficiently in large datasets?  How to use such patterns for classification, clustering,
  • 21.
    21 Data Mining Function:(3) Classification  Classification and label prediction  Construct models (functions) based on some training examples  Describe and distinguish classes or concepts for future prediction  E.g., classify countries based on (climate), or classify cars based on (gas mileage)  Predict some unknown class labels  Typical methods  Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern- based classification, logistic regression, …  Typical applications:  Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
  • 22.
    22 Data Mining Function:(4) Cluster Analysis  Unsupervised learning (i.e., Class label is unknown)  Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns  Principle: Maximizing intra-class similarity & minimizing interclass similarity  Many methods and applications
  • 23.
    23 Data Mining Function:(5) Outlier Analysis  Outlier analysis  Outlier: A data object that does not comply with the general behavior of the data  Noise or exception? ― One person’s garbage could be another person’s treasure  Methods: by product of clustering or regression analysis, …  Useful in fraud detection, rare events analysis
  • 24.
    24 Time and Ordering:Sequential Pattern, Trend and Evolution Analysis  Sequence, trend and evolution analysis  Trend, time-series, and deviation analysis: e.g., regression and value prediction  Sequential pattern mining  e.g., first buy digital camera, then buy large SD memory cards  Periodicity analysis  Motifs and biological sequence analysis  Approximate and consecutive motifs  Similarity-based analysis  Mining data streams  Ordered, time-varying, potentially infinite, data streams
  • 25.
    25 Structure and NetworkAnalysis  Graph mining  Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)  Information network analysis  Social networks: actors (objects, nodes) and relationships (edges)  e.g., author networks in CS, terrorist networks  Multiple heterogeneous networks  A person could be multiple information networks: friends, family, classmates, …  Links carry a lot of semantic information: Link mining  Web mining  Web is a big information network: from PageRank to Google  Analysis of Web information networks  Web community discovery, opinion mining, usage mining, …
  • 26.
    26 Evaluation of Knowledge Are all mined knowledge interesting?  One can mine tremendous amount of “patterns” and knowledge  Some may fit only certain dimension space (time, location, …)  Some may not be representative, may be transient, …  Evaluation of mined knowledge → directly mine only interesting knowledge?  Descriptive vs. predictive  Coverage  Typicality vs. novelty  Accuracy  Timeliness  …
  • 27.
    27 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 28.
    28 Data Mining: Confluenceof Multiple Disciplines Data Mining Machine Learning Statistics Applications Algorithm Pattern Recognition High-Performance Computing Visualization Database Technology
  • 29.
29 Why Confluence of Multiple Disciplines?  Tremendous amount of data  Algorithms must be highly scalable to handle terabytes of data  High-dimensionality of data  Micro-arrays may have tens of thousands of dimensions  High complexity of data  Data streams and sensor data  Time-series data, temporal data, sequence data  Structured data, graphs, social networks and multi-linked data  Heterogeneous databases and legacy databases  Spatial, spatiotemporal, multimedia, text and Web data  Software programs, scientific simulations  New and sophisticated applications
  • 30.
    30 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 31.
    31 Applications of DataMining  Web page analysis: from web page classification, clustering to PageRank & HITS algorithms  Collaborative analysis & recommender systems  Basket data analysis to targeted marketing  Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis  Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)  From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
  • 32.
    32 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 33.
    33 Major Issues inData Mining (1)  Mining Methodology  Mining various and new kinds of knowledge  Mining knowledge in multi-dimensional space  Data mining: An interdisciplinary effort  Boosting the power of discovery in a networked environment  Handling noise, uncertainty, and incompleteness of data  Pattern evaluation and pattern- or constraint-guided mining  User Interaction  Interactive mining  Incorporation of background knowledge  Presentation and visualization of data mining results
  • 34.
    34 Major Issues inData Mining (2)  Efficiency and Scalability  Efficiency and scalability of data mining algorithms  Parallel, distributed, stream, and incremental mining methods  Diversity of data types  Handling complex types of data  Mining dynamic, networked, and global data repositories  Data mining and society  Social impacts of data mining  Privacy-preserving data mining  Invisible data mining
  • 35.
    35 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 36.
    36 A Brief Historyof Data Mining Society  1989 IJCAI Workshop on Knowledge Discovery in Databases  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)  1991-1994 Workshops on Knowledge Discovery in Databases  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)  1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)  Journal of Data Mining and Knowledge Discovery (1997)  ACM SIGKDD conferences since 1998 and SIGKDD Explorations  More conferences on data mining  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.  ACM Transactions on KDD starting in 2007
  • 37.
    37 Conferences and Journalson Data Mining  KDD Conferences  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)  SIAM Data Mining Conf. (SDM)  (IEEE) Int. Conf. on Data Mining (ICDM)  European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)  Int. Conf. on Web Search and Data Mining (WSDM)  Other related conferences  DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …  Web and IR conferences: WWW, SIGIR, WSDM  ML conferences: ICML, NIPS  PR conferences: CVPR,  Journals  Data Mining and Knowledge Discovery (DAMI or DMKD)  IEEE Trans. On Knowledge and Data Eng. (TKDE)  KDD Explorations  ACM Trans. on KDD
  • 38.
    38 Where to FindReferences? DBLP, CiteSeer, Google  Data mining and KDD (SIGKDD: CDROM)  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.  Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD  Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.  AI & Machine Learning  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.  Web and IR  Conferences: SIGIR, WWW, CIKM, etc.  Journals: WWW: Internet and Web Information Systems,  Statistics  Conferences: Joint Stat. Meeting, etc.  Journals: Annals of statistics, etc.  Visualization  Conference proceedings: CHI, ACM-SIGGraph, etc.  Journals: IEEE Trans. visualization and computer graphics, etc.
  • 39.
    39 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 40.
40 Summary  Data mining: discovering interesting patterns and knowledge from massive amounts of data  A natural evolution of database technology, in great demand, with wide applications  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation  Mining can be performed on a variety of data  Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.  Data mining technologies and applications  Major issues in data mining
  • 41.
    41 Recommended Reference Books S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002  R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003  U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996  U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011  D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009  B. Liu, Web Data Mining, Springer 2006.  T. M. Mitchell, Machine Learning, McGraw Hill, 1997  G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991  P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005  S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998  I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
  • 42.
42 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber, and Pei. All rights reserved.
  • 43.
    43 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 44.
44 Types of Data Sets  Record  Relational records  Data matrix, e.g., numerical matrix, crosstabs  Document data: text documents: term-frequency vector  Transaction data  Graph and network  World Wide Web  Social or information networks  Molecular structures  Ordered  Video data: sequence of images  Temporal data: time-series  Sequential data: transaction sequences  Genetic sequence data  Spatial, image and multimedia:  Spatial data: maps  Image data  Video data

Example term-frequency vectors (terms: team, coach, play, ball, score, game, win, lost, timeout, season):
  Document 1: 3 0 5 0 2 6 0 2 0 2
  Document 2: 0 0 7 0 2 1 0 0 3 0
  Document 3: 0 1 0 0 1 2 2 0 3 0

Example transaction data:
  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
  • 45.
    45 Important Characteristics ofStructured Data  Dimensionality  Curse of dimensionality  Sparsity  Only presence counts  Resolution  Patterns depend on the scale  Distribution  Centrality and dispersion
  • 46.
    46 Data Objects  Datasets are made up of data objects.  A data object represents an entity.  Examples:  sales database: customers, store items, sales  medical database: patients, treatments  university database: students, professors, courses  Also called samples , examples, instances, data points, objects, tuples.  Data objects are described by attributes.  Database rows -> data objects; columns ->attributes.
  • 47.
    47 Attributes  Attribute (ordimensions, features, variables): a data field, representing a characteristic or feature of a data object.  E.g., customer _ID, name, address  Types:  Nominal  Binary  Numeric: quantitative  Interval-scaled  Ratio-scaled
  • 48.
    48 Attribute Types  Nominal:categories, states, or “names of things”  Hair_color = {auburn, black, blond, brown, grey, red, white}  marital status, occupation, ID numbers, zip codes  Binary  Nominal attribute with only 2 states (0 and 1)  Symmetric binary: both outcomes equally important  e.g., gender  Asymmetric binary: outcomes not equally important.  e.g., medical test (positive vs. negative)  Convention: assign 1 to most important outcome (e.g., HIV positive)  Ordinal  Values have a meaningful order (ranking) but magnitude between successive values is not known.  Size = {small, medium, large}, grades, army rankings
  • 49.
    49 Numeric Attribute Types Quantity (integer or real-valued)  Interval  Measured on a scale of equal-sized units  Values have order  E.g., temperature in C˚or F˚, calendar dates  No true zero-point  Ratio  Inherent zero-point  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚).  e.g., temperature in Kelvin, length, counts, monetary quantities
  • 50.
    50 Discrete vs. ContinuousAttributes  Discrete Attribute  Has only a finite or countably infinite set of values  E.g., zip codes, profession, or the set of words in a collection of documents  Sometimes, represented as integer variables  Note: Binary attributes are a special case of discrete attributes  Continuous Attribute  Has real numbers as attribute values  E.g., temperature, height, or weight  Practically, real values can only be measured and represented using a finite number of digits  Continuous attributes are typically represented as floating-point variables
  • 51.
    51 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 52.
    52 Basic Statistical Descriptionsof Data  Motivation  To better understand the data: central tendency, variation and spread  Data dispersion characteristics  median, max, min, quantiles, outliers, variance, etc.  Numerical dimensions correspond to sorted intervals  Data dispersion: analyzed with multiple granularities of precision  Boxplot or quantile analysis on sorted intervals  Dispersion analysis on computed measures  Folding measures into numerical dimensions  Boxplot or quantile analysis on the transformed cube
  • 53.
53 Measuring the Central Tendency  Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$ (note: n is the sample size and N is the population size)  Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$  Trimmed mean: chopping extreme values  Median:  Middle value if odd number of values, or average of the middle two values otherwise  Estimated by interpolation (for grouped data): $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \cdot width$  Mode  Value that occurs most frequently in the data  Unimodal, bimodal, trimodal  Empirical formula: $mean - mode \approx 3 \times (mean - median)$
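A minimal Python sketch (illustration only, not from the book; the salary figures are made up) showing how these central-tendency measures can be computed:

```python
# Central tendency on a small, made-up sample of salaries (in $1000s).
from statistics import mean, median, multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))       # arithmetic mean: 58
print(median(salaries))     # average of the two middle values: 54
print(multimode(salaries))  # most frequent values (bimodal here): [52, 70]

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
weights = [1] * 11 + [0.5]  # e.g., down-weight the extreme value 110
print(sum(w * x for w, x in zip(weights, salaries)) / sum(weights))

# Trimmed mean: chop off the k smallest and k largest values first
k = 1
print(mean(sorted(salaries)[k:-k]))
```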
  • 54.
54 Symmetric vs. Skewed Data  Median, mean and mode of symmetric, positively skewed, and negatively skewed data  (Figure: symmetric, positively skewed, and negatively skewed distributions)
  • 55.
55 Measuring the Dispersion of Data  Quartiles, outliers and boxplots  Quartiles: Q1 (25th percentile), Q3 (75th percentile)  Inter-quartile range: IQR = Q3 − Q1  Five-number summary: min, Q1, median, Q3, max  Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually  Outlier: usually, a value higher/lower than 1.5 × IQR  Variance and standard deviation (sample: s, population: σ)  Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$  Standard deviation s (or σ) is the square root of variance s² (or σ²)
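A small Python sketch (illustrative, made-up values) of these dispersion measures, using the standard-library statistics module:

```python
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# Quartiles, inter-quartile range, and the five-number summary
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(min(data), q1, q2, q3, max(data), iqr)

# Common outlier rule: values more than 1.5 * IQR beyond the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < low or x > high])

print(statistics.variance(data))   # sample variance s^2 (divides by n - 1)
print(statistics.pvariance(data))  # population variance sigma^2 (divides by N)
print(statistics.stdev(data))      # sample standard deviation s
```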
  • 56.
    56 Boxplot Analysis  Five-numbersummary of a distribution  Minimum, Q1, Median, Q3, Maximum  Boxplot  Data is represented with a box  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR  The median is marked by a line within the box  Whiskers: two lines outside the box extended to Minimum and Maximum  Outliers: points beyond a specified outlier threshold, plotted individually
  • 57.
57 Visualization of Data Dispersion: 3-D Boxplots
  • 58.
    58 Properties of NormalDistribution Curve  The normal (distribution) curve  From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)  From μ–2σ to μ+2σ: contains about 95% of it  From μ–3σ to μ+3σ: contains about 99.7% of it
  • 59.
    59 Graphic Displays ofBasic Statistical Descriptions  Boxplot: graphic display of five-number summary  Histogram: x-axis are values, y-axis repres. frequencies  Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of data are  xi  Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another  Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
  • 60.
60 Histogram Analysis  Histogram: graph display of tabulated frequencies, shown as bars  It shows what proportion of cases fall into each of several categories  Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width  The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent  (Figure: example histogram)
  • 61.
    61 Histograms Often TellMore than Boxplots  The two histograms shown in the left may have the same boxplot representation  The same values for: min, Q1, median, Q3, max  But they have rather different data distributions
  • 62.
62 Quantile Plot  Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)  Plots quantile information  For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi
  • 63.
63 Quantile-Quantile (Q-Q) Plot  Graphs the quantiles of one univariate distribution against the corresponding quantiles of another  View: is there a shift in going from one distribution to another?  Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
  • 64.
    64 Scatter plot  Providesa first look at bivariate data to see clusters of points, outliers, etc  Each pair of values is treated as a pair of coordinates and plotted as points in the plane
  • 65.
65 Positively and Negatively Correlated Data  The left half fragment is positively correlated  The right half is negatively correlated
  • 66.
  • 67.
    67 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 68.
    68 Data Visualization  Whydata visualization?  Gain insight into an information space by mapping data onto graphical primitives  Provide qualitative overview of large data sets  Search for patterns, trends, structure, irregularities, relationships among data  Help find interesting regions and suitable parameters for further quantitative analysis  Provide a visual proof of computer representations derived  Categorization of visualization methods:  Pixel-oriented visualization techniques  Geometric projection visualization techniques  Icon-based visualization techniques  Hierarchical visualization techniques  Visualizing complex data and relations
  • 69.
    69 Pixel-Oriented Visualization Techniques For a data set of m dimensions, create m windows on the screen, one for each dimension  The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows  The colors of the pixels reflect the corresponding values (a) Income (b) Credit Limit (c) transaction volume (d) age
  • 70.
    70 Laying Out Pixelsin Circle Segments  To save space and show the connections among multiple dimensions, space filling is often done in a circle segment (a) Representing a data record in circle segment (b) Laying out pixels in circle segment
  • 71.
    71 Geometric Projection VisualizationTechniques  Visualization of geometric transformations and projections of the data  Methods  Direct visualization  Scatterplot and scatterplot matrices  Landscapes  Projection pursuit technique: Help users find meaningful projections of multidimensional data  Prosection views  Hyperslice  Parallel coordinates
  • 72.
72 Direct Data Visualization  (Figure: ribbons with twists based on vorticity)
  • 73.
73 Scatterplot Matrices  Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k² − k)/2 distinct pairwise scatterplots]  (Used by permission of M. Ward, Worcester Polytechnic Institute)
  • 74.
74 Landscapes  Visualization of the data as a perspective landscape  The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data  (Figure: news articles visualized as a landscape; used by permission of B. Wright, Visible Decisions Inc.)
  • 75.
75 Parallel Coordinates  n equidistant axes (Attr. 1, Attr. 2, Attr. 3, …, Attr. k) which are parallel to one of the screen axes and correspond to the attributes  The axes are scaled to the [minimum, maximum] range of the corresponding attribute  Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
  • 76.
  • 77.
    77 Icon-Based Visualization Techniques Visualization of the data values as features of icons  Typical visualization methods  Chernoff Faces  Stick Figures  General techniques  Shape coding: Use shape to represent certain information encoding  Color icons: Use color icons to encode more information  Tile bars: Use small icons to represent the relevant feature vectors in document retrieval
  • 78.
78 Chernoff Faces  A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.  The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): each assigned one of 10 possible values, generated using Mathematica (S. Dickson)  REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993  Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
  • 79.
79 Stick Figure  A 5-piece stick figure (1 body and 4 limbs with different angle/length)  Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs  Look at texture patterns  (Figure: a census data set showing age, income, gender, education, etc.; used by permission of G. Grinstein, University of Massachusetts at Lowell)
  • 80.
    80 Hierarchical Visualization Techniques Visualization of the data using a hierarchical partitioning into subspaces  Methods  Dimensional Stacking  Worlds-within-Worlds  Tree-Map  Cone Trees  InfoCube
  • 81.
81 Dimensional Stacking  Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other  Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.  Adequate for data with ordinal attributes of low cardinality  But, difficult to display more than nine dimensions  (Figure: four attributes stacked pairwise)
  • 82.
82 Dimensional Stacking  Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes  (Used by permission of M. Ward, Worcester Polytechnic Institute)
  • 83.
83 Worlds-within-Worlds  Assign the function and two most important parameters to the innermost world  Fix all other parameters at constant values and draw other 1-, 2-, or 3-dimensional worlds choosing these as the axes  Software that uses this paradigm  N-Vision: dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)  Auto Visual: static interaction by means of queries
  • 84.
    84 Tree-Map  Screen-filling methodwhich uses a hierarchical partitioning of the screen into regions depending on the attribute values  The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) MSR Netscan Image Ack.:
  • 85.
85 Tree-Map of a File System (Shneiderman)
  • 86.
    86 InfoCube  A 3-Dvisualization technique where hierarchical information is displayed as nested semi- transparent cubes  The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on
  • 87.
    87 Three-D Cone Trees 3D cone tree visualization technique works well for up to a thousand nodes or so  First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node  Cannot avoid overlaps when projected to 2D  G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91  Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next Ack.: http://nadeausoftware.com/articles/visualization
  • 88.
    Visualizing Complex Dataand Relations  Visualizing non-numerical data: text and social networks  Tag cloud: visualizing user-generated tags  The importance of tag is represented by font size/color  Besides text data, there are also methods to visualize relationships, such as visualizing social networks Newsmap: Google News Stories in
  • 89.
    89 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 90.
    90 Similarity and Dissimilarity Similarity  Numerical measure of how alike two data objects are  Value is higher when objects are more alike  Often falls in the range [0,1]  Dissimilarity (e.g., distance)  Numerical measure of how different two data objects are  Lower when objects are more alike  Minimum dissimilarity is often 0  Upper limit varies  Proximity refers to a similarity or dissimilarity
  • 91.
91 Data Matrix and Dissimilarity Matrix  Data matrix  n data points with p dimensions  Two modes  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$  Dissimilarity matrix  n data points, but registers only the distance  A triangular matrix  Single mode  $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
  • 92.
92 Proximity Measure for Nominal Attributes  Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)  Method 1: simple matching  $d(i,j) = \frac{p - m}{p}$, where m is the # of matches and p is the total # of variables  Method 2: use a large number of binary attributes  Creating a new binary attribute for each of the M nominal states
  • 93.
93 Proximity Measure for Binary Attributes  A contingency table for binary data (for objects i and j: q = # of attributes equal to 1 for both, r = # equal to 1 for i but 0 for j, s = # equal to 0 for i but 1 for j, t = # equal to 0 for both)  Distance measure for symmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s + t}$  Distance measure for asymmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s}$  Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$  Note: the Jaccard coefficient is the same as “coherence”
  • 94.
94 Dissimilarity between Binary Variables  Example (table below)  Gender is a symmetric attribute  The remaining attributes are asymmetric binary  Let the values Y and P be 1, and the value N be 0

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
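A short Python sketch (illustration only; the helper name is made up) that reproduces these asymmetric-binary distances, with Y/P mapped to 1, N to 0, and the symmetric attribute Gender left out:

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```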
  • 95.
95 Standardizing Numeric Data  Z-score: $z = \frac{x - \mu}{\sigma}$  x: raw score to be standardized, μ: mean of the population, σ: standard deviation  The distance between the raw score and the population mean in units of the standard deviation  Negative when the raw score is below the mean, “+” when above  An alternative way: calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$  Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$  Using mean absolute deviation is more robust than using standard deviation
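A brief Python sketch (illustrative, made-up values) contrasting z-scores computed with the standard deviation and with the mean absolute deviation:

```python
def z_scores_std(xs):
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def z_scores_mad(xs):
    m = sum(xs) / len(xs)
    s = sum(abs(x - m) for x in xs) / len(xs)   # mean absolute deviation s_f
    return [(x - m) / s for x in xs]

values = [20, 30, 40, 50, 1000]   # one extreme outlier
print(z_scores_std(values))
print(z_scores_mad(values))       # the outlier keeps a larger (more detectable) z-score
```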
  • 96.
96 Example: Data Matrix and Dissimilarity Matrix

  Data matrix:
  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

  Dissimilarity matrix (with Euclidean distance):
        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.1   0
  x4    4.24  1     5.39  0
  • 97.
97 Distance on Numeric Data: Minkowski Distance  Minkowski distance: a popular distance measure  $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)  Properties  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)  d(i, j) = d(j, i) (symmetry)  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)  A distance that satisfies these properties is a metric
  • 98.
98 Special Cases of Minkowski Distance  h = 1: Manhattan (city block, L1 norm) distance  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$  E.g., the Hamming distance: the number of bits that are different between two binary vectors  h = 2: (L2 norm) Euclidean distance  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$  h → ∞: “supremum” (Lmax norm, L∞ norm) distance  $d(i,j) = \max_{f} |x_{if} - x_{jf}|$  This is the maximum difference between any component (attribute) of the vectors
  • 99.
99 Example: Minkowski Distance Dissimilarity Matrices

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

  Manhattan (L1):
        x1  x2  x3  x4
  x1    0
  x2    5   0
  x3    3   6   0
  x4    6   1   7   0

  Euclidean (L2):
        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.1   0
  x4    4.24  1     5.39  0

  Supremum (L∞):
        x1  x2  x3  x4
  x1    0
  x2    3   0
  x3    2   5   0
  x4    3   1   5   0
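A compact Python sketch (illustration only) that recomputes the three dissimilarity matrices above for the four example points:

```python
def minkowski(p, q, h):
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def supremum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for i, a in enumerate(names):
    for b in names[:i]:
        p, q = points[a], points[b]
        print(a, b,
              minkowski(p, q, h=1),            # Manhattan (L1)
              round(minkowski(p, q, h=2), 2),  # Euclidean (L2)
              supremum(p, q))                  # Supremum (L-infinity)
```

For example, the first row printed is "x2 x1 5.0 3.61 3", matching the matrices above.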
  • 100.
100 Ordinal Variables  An ordinal variable can be discrete or continuous  Order is important, e.g., rank  Can be treated like interval-scaled  Replace xif by its rank $r_{if} \in \{1, \ldots, M_f\}$  Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$  Compute the dissimilarity using methods for interval-scaled variables
  • 101.
101 Attributes of Mixed Type  A database may contain all attribute types  Nominal, symmetric binary, asymmetric binary, numeric, ordinal  One may use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$  f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise  f is numeric: use the normalized distance  f is ordinal  Compute ranks rif and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat zif as interval-scaled
  • 102.
    102 Cosine Similarity  Adocument can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.  Other vector objects: gene features in micro-arrays, …  Applications: information retrieval, biologic taxonomy, gene feature mapping, ...  Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1 , d2 ) = (d1  d2 ) /||d1 || ||d2 || , where  indicates vector dot product, ||d||: the length of vector d
  • 103.
103 Example: Cosine Similarity  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d  Ex: find the similarity between documents 1 and 2.  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)  d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25  ||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481  ||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 = 4.12  cos(d1, d2) = 0.94
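A tiny Python sketch (illustration only) reproducing this calculation:

```python
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```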
  • 104.
    104 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 105.
    Summary  Data attributetypes: nominal, binary, ordinal, interval-scaled, ratio- scaled  Many types of data sets, e.g., numerical, text, graph, Web, image.  Gain insight into the data by:  Basic statistical data description: central tendency, dispersion, graphical displays  Data visualization: map data onto graphical primitives  Measure data similarity  Above steps are the beginning of data preprocessing.  Many methods have been developed but still an active area of research. 105
  • 106.
    References  W. Cleveland,Visualizing Data, Hobart Press, 1993  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003  U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.  H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech. Committee on Data Eng., 20(4), Dec. 1997  D. A. Keim. Information visualization and visual data mining, IEEE trans. on Visualization and Computer Graphics, 8(1), 2002  D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999  S. Santini and R. Jain,” Similarity measures”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9), 1999  E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press, 2001  C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies, Information Visualization, 8(1), 2009 106
  • 107.
107 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 108.
    108 108 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 109.
    109 Data Quality: WhyPreprocess the Data?  Measures for data quality: A multidimensional view  Accuracy: correct or wrong, accurate or not  Completeness: not recorded, unavailable, …  Consistency: some modified but some not, dangling, …  Timeliness: timely update?  Believability: how trustable the data are correct?  Interpretability: how easily the data can be understood?
  • 110.
    110 Major Tasks inData Preprocessing  Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration  Integration of multiple databases, data cubes, or files  Data reduction  Dimensionality reduction  Numerosity reduction  Data compression  Data transformation and data discretization  Normalization  Concept hierarchy generation
  • 111.
    111 111 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 112.
112 Data Cleaning  Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., instrument faults, human or computer error, transmission error  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  e.g., Occupation = “ ” (missing data)  Noisy: containing noise, errors, or outliers  e.g., Salary = “−10” (an error)  Inconsistent: containing discrepancies in codes or names, e.g.,  Age = “42”, Birthday = “03/07/2010”  Was rating “1, 2, 3”, now rating “A, B, C”  Discrepancy between duplicate records  Intentional (e.g., disguised missing data)  Jan. 1 as everyone’s birthday?
  • 113.
    113 Incomplete (Missing) Data Data is not always available  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data  Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data
  • 114.
114 How to Handle Missing Data?  Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably  Fill in the missing value manually: tedious + infeasible?  Fill it in automatically with  a global constant: e.g., “unknown”, a new class?!  the attribute mean  the attribute mean for all samples belonging to the same class: smarter  the most probable value: inference-based, such as Bayesian formula or decision tree
  • 115.
    115 Noisy Data  Noise:random error or variance in a measured variable  Incorrect attribute values may be due to  faulty data collection instruments  data entry problems  data transmission problems  technology limitation  inconsistency in naming convention  Other data problems which require data cleaning  duplicate records  incomplete data  inconsistent data
  • 116.
    116 How to HandleNoisy Data?  Binning  first sort data and partition into (equal-frequency) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Regression  smooth by fitting the data into regression functions  Clustering  detect and remove outliers  Combined computer and human inspection  detect suspicious values and check by human (e.g., deal with possible outliers)
  • 117.
117 Data Cleaning as a Process  Data discrepancy detection  Use metadata (e.g., domain, range, dependency, distribution)  Check field overloading  Check uniqueness rule, consecutive rule and null rule  Use commercial tools  Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections  Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)  Data migration and integration  Data migration tools: allow transformations to be specified  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface  Integration of the two processes  Iterative and interactive (e.g., Potter’s Wheel)
  • 118.
    118 118 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 119.
119 Data Integration  Data integration:  Combines data from multiple sources into a coherent store  Schema integration: e.g., A.cust-id ≡ B.cust-#  Integrate metadata from different sources  Entity identification problem:  Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton  Detecting and resolving data value conflicts  For the same real world entity, attribute values from different sources are different  Possible reasons: different representations, different scales, e.g., metric vs. British units
  • 120.
120 Handling Redundancy in Data Integration  Redundant data occur often when integrating multiple databases  Object identification: the same attribute or object may have different names in different databases  Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue  Redundant attributes may be detected by correlation analysis and covariance analysis  Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
  • 121.
121 Correlation Analysis (Nominal Data)  Χ² (chi-square) test: $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$  The larger the Χ² value, the more likely the variables are related  The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count  Correlation does not imply causality  # of hospitals and # of car thefts in a city are correlated  Both are causally linked to a third variable: population
  • 122.
122 Chi-Square Calculation: An Example

                            Play chess  Not play chess  Sum (row)
  Like science fiction      250 (90)    200 (360)       450
  Not like science fiction  50 (210)    1000 (840)      1050
  Sum (col.)                300         1200            1500

  Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
  $\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$
  It shows that like_science_fiction and play_chess are correlated in the group
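A short Python sketch (illustration only, no external libraries) that verifies the χ² value from the observed counts, with expected counts computed as (row total × column total) / grand total:

```python
observed = [[250, 200],    # like science fiction:     plays chess / does not
            [50, 1000]]    # not like science fiction: plays chess / does not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.93
```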
  • 123.
123 Correlation Analysis (Numeric Data)  Correlation coefficient (also called Pearson’s product moment coefficient): $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product  If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher the value, the stronger the correlation.  rA,B = 0: independent; rA,B < 0: negatively correlated
  • 124.
    124 Visually Evaluating Correlation Scatterplots showing the similarity from –1 to 1.
  • 125.
125 Correlation (Viewed as a Linear Relationship)  Correlation measures the linear relationship between objects  To compute correlation, we standardize the data objects A and B, and then take their dot product:  $a'_k = \frac{a_k - mean(A)}{std(A)}$, $b'_k = \frac{b_k - mean(B)}{std(B)}$  $correlation(A, B) = A' \cdot B'$
  • 126.
126 Covariance (Numeric Data)  Covariance is similar to correlation: $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$, with correlation coefficient $r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B  Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values  Negative covariance: if CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value  Independence: if A and B are independent, CovA,B = 0, but the converse is not true:  Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
  • 127.
127 Covariance: An Example  It can be simplified in computation as $Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$  Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)  Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?  E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4  E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6  Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4  Thus, A and B rise together since Cov(A, B) > 0
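A minimal Python sketch (illustration only) verifying this computation:

```python
A = [2, 3, 5, 4, 6]      # stock A prices over the week
B = [5, 8, 10, 11, 14]   # stock B prices over the week
n = len(A)

mean_A = sum(A) / n      # 4.0
mean_B = sum(B) / n      # 9.6

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B
print(round(cov, 2))     # 4.0 -> the two stocks tend to rise together
```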
  • 128.
    128 128 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 129.
    129 Data Reduction Strategies Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results  Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.  Data reduction strategies  Dimensionality reduction, e.g., remove unimportant attributes  Wavelet transforms  Principal Components Analysis (PCA)  Feature subset selection, feature creation  Numerosity reduction (some simply call it: Data Reduction)  Regression and Log-Linear Models  Histograms, clustering, sampling  Data cube aggregation  Data compression
  • 130.
    130 Data Reduction 1:Dimensionality Reduction  Curse of dimensionality  When dimensionality increases, data becomes increasingly sparse  Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful  The possible combinations of subspaces will grow exponentially  Dimensionality reduction  Avoid the curse of dimensionality  Help eliminate irrelevant features and reduce noise  Reduce time and space required in data mining  Allow easier visualization  Dimensionality reduction techniques  Wavelet transforms  Principal Component Analysis  Supervised and nonlinear techniques (e.g., feature selection)
  • 131.
131 Mapping Data to a New Space  Fourier transform  Wavelet transform  (Figures: two sine waves; two sine waves + noise; the frequency-domain representation)
  • 132.
132 What Is Wavelet Transform?  Decomposes a signal into different frequency subbands  Applicable to n-dimensional signals  Data are transformed to preserve relative distance between objects at different levels of resolution  Allow natural clusters to become more distinguishable  Used for image compression
  • 133.
133 Wavelet Transformation  Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis  Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients  Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space  Method:  Length, L, must be an integer power of 2 (padding with 0’s, when necessary)  Each transform has 2 functions: smoothing, difference  Applies to pairs of data, resulting in two sets of data of length L/2  Applies the two functions recursively, until reaching the desired length  (Example wavelet families: Haar-2, Daubechies-4)
  • 134.
134 Wavelet Decomposition  Wavelets: a math tool for space-efficient hierarchical decomposition of functions  S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]  Compression: many small detail coefficients can be replaced by 0’s, and only the significant coefficients are retained
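A small Python sketch (illustration only; an unnormalized averaging/differencing Haar transform) that reproduces the decomposition of S shown above:

```python
def haar_decompose(signal):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    details = []
    s = list(signal)
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs    = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = diffs + details     # coarser-level details end up first
        s = averages
    return s + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```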
  • 135.
135 Haar Wavelet Coefficients  (Figure: hierarchical decomposition structure, a.k.a. “error tree”, with coefficient “supports”, relating the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4] to the coefficients 2.75, −1.25, 0.5, 0, 0, −1, −1, 0)
  • 136.
    136 Why Wavelet Transform? Use hat-shape filters  Emphasize region where points cluster  Suppress weaker information in their boundaries  Effective removal of outliers  Insensitive to noise, insensitive to input order  Multi-resolution  Detect arbitrary shaped clusters at different scales  Efficient  Complexity O(N)  Only applicable to low dimensional data
  • 137.
137 Principal Component Analysis (PCA)  Find a projection that captures the largest amount of variation in the data  The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space  (Figure: data points in the x1-x2 plane and their principal direction e)
  • 138.
138 Principal Component Analysis (Steps)  Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data  Normalize input data: each attribute falls within the same range  Compute k orthonormal (unit) vectors, i.e., principal components  Each input data vector is a linear combination of the k principal component vectors  The principal components are sorted in order of decreasing “significance” or strength  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
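A compact NumPy sketch (illustration only, not the book's code) of these steps, with a hypothetical helper pca():

```python
import numpy as np

def pca(X, k):
    # 1. Normalize: center each attribute (optionally also scale to unit variance)
    X_centered = X - X.mean(axis=0)
    # 2. The eigenvectors of the covariance matrix are the principal components
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort components by decreasing "significance" (variance explained)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. Project the data onto the k strongest components
    return X_centered @ components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)   # (100, 2): reduced from 5 dimensions to 2
```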
  • 139.
    139 Attribute Subset Selection Another way to reduce dimensionality of data  Redundant attributes  Duplicate much or all of the information contained in one or more other attributes  E.g., purchase price of a product and the amount of sales tax paid  Irrelevant attributes  Contain no information that is useful for the data mining task at hand  E.g., students' ID is often irrelevant to the task of predicting students' GPA
  • 140.
140 Heuristic Search in Attribute Selection  There are 2^d possible attribute combinations of d attributes  Typical heuristic attribute selection methods:  Best single attribute under the attribute independence assumption: choose by significance tests  Best step-wise feature selection:  The best single attribute is picked first  Then the next best attribute conditioned on the first, ...  Step-wise attribute elimination:  Repeatedly eliminate the worst attribute  Best combined attribute selection and elimination  Optimal branch and bound:  Use feature elimination and backtracking
  • 141.
    141 Attribute Creation (FeatureGeneration)  Create new attributes (features) that can capture the important information in a data set more effectively than the original ones  Three general methodologies  Attribute extraction  Domain-specific  Mapping data to new space (see: data reduction)  E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)  Attribute construction  Combining features (see: discriminative frequent patterns in Chapter 7) 
  • 142.
    142 Data Reduction 2:Numerosity Reduction  Reduce data volume by choosing alternative, smaller forms of data representation  Parametric methods (e.g., regression)  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)  Ex.: Log-linear models—obtain value at a point in m-D space as the product on appropriate marginal subspaces  Non-parametric methods  Do not assume models  Major families: histograms, clustering, sampling, …
  • 143.
    143 Parametric Data Reduction:Regression and Log-Linear Models  Linear regression  Data modeled to fit a straight line  Often uses the least-square method to fit the line  Multiple regression  Allows a response variable Y to be modeled as a linear function of multidimensional feature vector  Log-linear model  Approximates discrete multidimensional probability distributions
  • 144.
144 Regression Analysis  Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (a.k.a. explanatory variables or predictors)  The parameters are estimated so as to give a “best fit” of the data  Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used  Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships  (Figure: fitted line y = x + 1 with an observed point (X1, Y1) and its predicted value Y1′)
  • 145.
145 Regression Analysis and Log-Linear Models  Linear regression: Y = wX + b  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand  Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …  Multiple regression: Y = b0 + b1X1 + b2X2  Many nonlinear functions can be transformed into the above  Log-linear models:  Approximate discrete multidimensional probability distributions  Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
  • 146.
146 Histogram Analysis  Divide data into buckets and store the average (sum) for each bucket  Partitioning rules:  Equal-width: equal bucket range  Equal-frequency (or equal-depth)  (Figure: example equal-width histogram)
  • 147.
    147 Clustering  Partition dataset into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only  Can be very effective if data is clustered but not if data is “smeared”  Can have hierarchical clustering and be stored in multi-dimensional index tree structures  There are many choices of clustering definitions and clustering algorithms  Cluster analysis will be studied in depth in Chapter 10
  • 148.
148 Sampling  Sampling: obtaining a small sample s to represent the whole data set N  Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data  Key principle: choose a representative subset of the data  Simple random sampling may have very poor performance in the presence of skew  Develop adaptive sampling methods, e.g., stratified sampling  Note: sampling may not reduce database I/Os (page at a time)
  • 149.
    149 Types of Sampling Simple random sampling  There is an equal probability of selecting any particular item  Sampling without replacement  Once an object is selected, it is removed from the population  Sampling with replacement  A selected object is not removed from the population  Stratified sampling:  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)  Used in conjunction with skewed data
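A small Python sketch (illustration only; the strata and sample sizes are made up) of these sampling schemes using only the standard library:

```python
import random
from collections import defaultdict

random.seed(42)
data = [("young", i) for i in range(30)] + [("senior", i) for i in range(6)]

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)
srswor = random.sample(data, k=6)
srswr = random.choices(data, k=6)      # an item may be drawn more than once

# Stratified sampling: draw roughly the same fraction from each stratum,
# so the small "senior" group is still represented
strata = defaultdict(list)
for stratum, value in data:
    strata[stratum].append((stratum, value))
fraction = 1 / 6
stratified = [item
              for group in strata.values()
              for item in random.sample(group, max(1, round(fraction * len(group))))]

print(len(srswor), len(srswr), len(stratified))   # 6 6 6
```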
  • 150.
150 Sampling: With or Without Replacement  (Figure: raw data sampled by SRSWOR, a simple random sample without replacement, and by SRSWR, a simple random sample with replacement)
  • 151.
151 Sampling: Cluster or Stratified Sampling  (Figure: raw data and the corresponding cluster/stratified sample)
  • 152.
    152 Data Cube Aggregation The lowest level of a data cube (base cuboid)  The aggregated data for an individual entity of interest  E.g., a customer in a phone calling data warehouse  Multiple levels of aggregation in data cubes  Further reduce the size of data to deal with  Reference appropriate levels  Use the smallest representation which is enough to solve the task  Queries regarding aggregated information should be answered using data cube, when possible
  • 153.
153 Data Reduction 3: Data Compression  String compression  There are extensive theories and well-tuned algorithms  Typically lossless, but only limited manipulation is possible without expansion  Audio/video compression  Typically lossy compression, with progressive refinement  Sometimes small fragments of signal can be reconstructed without reconstructing the whole  Time sequence is not audio  Typically short and varies slowly with time  Dimensionality and numerosity reduction may also be considered as forms of data compression
  • 154.
154 Data Compression  (Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data)
  • 155.
    155 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 156.
    156 Data Transformation  Afunction that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values  Methods  Smoothing: Remove noise from data  Attribute/feature construction  New attributes constructed from the given ones  Aggregation: Summarization, data cube construction  Normalization: Scaled to fall within a smaller, specified range  min-max normalization  z-score normalization  normalization by decimal scaling  Discretization: Concept hierarchy climbing
  • 157.
157 Normalization  Min-max normalization: to [new_minA, new_maxA]  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$  Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$  Z-score normalization (μ: mean, σ: standard deviation): $v' = \frac{v - \mu_A}{\sigma_A}$  Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$  Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v′|) < 1
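A minimal Python sketch (illustration only) of the three normalization methods, reproducing the income example above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, max_abs):
    j = len(str(int(abs(max_abs))))   # digit count gives the smallest j with max(|v'|) < 1 here
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling(73_600, max_abs=98_000))     # 0.736
```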
  • 158.
    158 Discretization  Three typesof attributes  Nominal—values from an unordered set, e.g., color, profession  Ordinal—values from an ordered set, e.g., military or academic rank  Numeric—real numbers, e.g., integer or real numbers  Discretization: Divide the range of a continuous attribute into intervals  Interval labels can then be used to replace actual data values  Reduce data size by discretization  Supervised vs. unsupervised  Split (top-down) vs. merge (bottom-up)  Discretization can be performed recursively on an attribute  Prepare for further analysis, e.g., classification
  • 159.
159 Data Discretization Methods  Typical methods (all the methods can be applied recursively):  Binning  Top-down split, unsupervised  Histogram analysis  Top-down split, unsupervised  Clustering analysis (unsupervised, top-down split or bottom-up merge)  Decision-tree analysis (supervised, top-down split)  Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
  • 160.
    160 Simple Discretization: Binning Equal-width (distance) partitioning  Divides the range into N intervals of equal size: uniform grid  if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.  The most straightforward, but outliers may dominate presentation  Skewed data is not handled well  Equal-depth (frequency) partitioning  Divides the range into N intervals, each containing approximately same number of samples  Good data scaling  Managing categorical attributes can be tricky
  • 161.
161 Binning Methods for Data Smoothing  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  * Partition into equal-frequency (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34
  * Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29
  * Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
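A short Python sketch (illustration only) that reproduces the equal-frequency binning and the two smoothing variants above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by whichever bin boundary (min or max) is closer
    return [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```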
  • 162.
162 Discretization Without Using Class Labels (Binning vs. Clustering)  (Figure panels: the original data; equal interval width (binning); equal frequency (binning); K-means clustering, which leads to better results)
  • 163.
163 Discretization by Classification & Correlation Analysis  Classification (e.g., decision tree analysis)  Supervised: Given class labels, e.g., cancerous vs. benign  Using entropy to determine split point (discretization point)  Top-down, recursive split  Details to be covered in Chapter 7  Correlation analysis (e.g., Chi-merge: χ²-based discretization)  Supervised: use class information  Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge  Merge performed recursively, until a predefined stopping condition is met
  • 164.
    164 Concept Hierarchy Generation Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse  Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity  Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior)  Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers  Concept hierarchy can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods shown.
  • 165.
    165 Concept Hierarchy Generation forNominal Data  Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts  street < city < state < country  Specification of a hierarchy for a set of values by explicit data grouping  {Urbana, Champaign, Chicago} < Illinois  Specification of only a partial set of attributes  E.g., only street < city, not others  Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values  E.g., for a set of attributes: {street, city, state, country}
  • 166.
166 Automatic Concept Hierarchy Generation  Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set  The attribute with the most distinct values is placed at the lowest level of the hierarchy  Exceptions, e.g., weekday, month, quarter, year  Example: street (674,339 distinct values) < city (3,567) < province_or_state (365) < country (15)
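A minimal sketch of this distinct-value heuristic (the helper name is ours; the counts are the ones above):

def auto_hierarchy(distinct_counts):
    # attributes ordered from fewest distinct values (top level) to most (bottom level)
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"street": 674_339, "city": 3_567, "province_or_state": 365, "country": 15}
print(" < ".join(reversed(auto_hierarchy(counts))))
# street < city < province_or_state < country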
  • 167.
    167 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 168.
    168 Summary  Data quality:accuracy, completeness, consistency, timeliness, believability, interpretability  Data cleaning: e.g. missing/noisy values, outliers  Data integration from multiple sources:  Entity identification problem  Remove redundancies  Detect inconsistencies  Data reduction  Dimensionality reduction  Numerosity reduction  Data compression  Data transformation and data discretization  Normalization  Concept hierarchy generation
  • 169.
    169 References  D. P.Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999  A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003  J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.  H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. VLDB'01  M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07  H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997  H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998  J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003  D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999  V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB’2001  T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001  R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995
  • 170.
    170 170 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 4 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 171.
    171 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 172.
    172 What is aData Warehouse?  Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational database  Support information processing by providing a solid platform of consolidated, historical data for analysis.  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon  Data warehousing:  The process of constructing and using data warehouses
  • 173.
    173 Data Warehouse—Subject-Oriented  Organizedaround major subjects, such as customer, product, sales  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  • 174.
    174 Data Warehouse—Integrated  Constructedby integrating multiple, heterogeneous data sources  relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources  E.g., Hotel price: currency, tax, breakfast covered, etc.  When data is moved to the warehouse, it is converted.
  • 175.
    175 Data Warehouse—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems  Operational database: current value data  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”
  • 176.
    176 Data Warehouse—Nonvolatile  Aphysically separate store of data transformed from the operational environment  Operational update of data does not occur in the data warehouse environment  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing:  initial loading of data and access of data
  • 177.
177 OLTP vs. OLAP (each row: OLTP value | OLAP value)
  users: clerk, IT professional | knowledge worker
  function: day-to-day operations | decision support
  DB design: application-oriented | subject-oriented
  data: current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
  usage: repetitive | ad-hoc
  access: read/write, index/hash on primary key | lots of scans
  unit of work: short, simple transaction | complex query
  # records accessed: tens | millions
  # users: thousands | hundreds
  DB size: 100 MB-GB | 100 GB-TB
  metric: transaction throughput | query throughput, response time
  • 178.
    178 Why a SeparateData Warehouse?  High performance for both systems  DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation  Different functions and different data:  missing data: Decision support requires historical data which operational DBs do not typically maintain  data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled  Note: There are more and more systems which perform OLAP analysis directly on relational databases
  • 179.
179 Data Warehouse: A Multi-Tiered Architecture  (Figure: operational DBs and other data sources feed, via extract/transform/load/refresh and a monitor & integrator, into the data storage tier (data warehouse, data marts, metadata repository), which an OLAP server/engine serves to front-end tools for analysis, query/reports, and data mining)
  • 180.
180 Three Data Warehouse Models  Enterprise warehouse  collects all of the information about subjects spanning the entire organization  Data Mart  a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart  Independent vs. dependent (directly from warehouse) data mart  Virtual warehouse  A set of views over operational databases  Only some of the possible summary views may be materialized
  • 181.
    181 Extraction, Transformation, andLoading (ETL)  Data extraction  get data from multiple, heterogeneous, and external sources  Data cleaning  detect errors in the data and rectify them when possible  Data transformation  convert data from legacy or host format to warehouse format  Load  sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions  Refresh  propagate the updates from the data sources to the warehouse
  • 182.
    182 Metadata Repository  Metadata is the data defining warehouse objects. It stores:  Description of the structure of the data warehouse  schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational meta-data  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data
  • 183.
    183 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 184.
    184 From Tables andSpreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
  • 185.
    185 Cube: A Latticeof Cuboids time,item time,item,location time, item, location, supplier all time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D (base) cuboid
  • 186.
    186 Conceptual Modeling ofData Warehouses  Modeling data warehouses: dimensions & measures  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 187.
187 Example of Star Schema  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, state_or_province, country)
  • 188.
188 Example of Snowflake Schema  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_key) normalized with supplier (supplier_key, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city_key) normalized with city (city_key, city, state_or_province, country)
  • 189.
189 Example of Fact Constellation  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped  Shared dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, province_or_state, country), shipper (shipper_key, shipper_name, location_key, shipper_type)
  • 190.
    190 A Concept Hierarchy: Dimension(location) all Europe North_America Mexico Canada Spain Germany Vancouver M. Wind L. Chan ... ... ... ... ... ... all region office country Toronto Frankfurt city
  • 191.
    191 Data Cube Measures:Three Categories  Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning  E.g., count(), sum(), min(), max()  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function  E.g., avg(), min_N(), standard_deviation()  Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.  E.g., median(), mode(), rank()
  • 192.
    192 View of Warehousesand Hierarchies Specification of hierarchies  Schema hierarchy day < {month < quarter; week} < year  Set_grouping hierarchy {1..10} < inexpensive
  • 193.
193 Multidimensional Data  Sales volume as a function of product, month, and region  Dimensions: Product, Location, Time  Hierarchical summarization paths: Industry > Category > Product; Region > Country > City > Office; Year > Quarter > Month/Week > Day
  • 194.
194 A Sample Data Cube  (Figure: a 3-D data cube of sales with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), together with sum cells; e.g., one aggregate cell holds the total annual sales of TVs in the U.S.A.)
  • 195.
    195 Cuboids Corresponding tothe Cube all product date country product,date product,country date, country product, date, country 0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D (base) cuboid
  • 196.
    196 Typical OLAP Operations Roll up (drill-up): summarize data  by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of roll-up  from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice: project and select  Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes  Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
  • 198.
    198 A Star-Net QueryModel Shipping Method AIR-EXPRESS TRUCK ORDER Customer Orders CONTRACTS Customer Product PRODUCT GROUP PRODUCT LINE PRODUCT ITEM SALES PERSON DISTRICT DIVISION Organization Promotion CITY COUNTRY REGION Location DAILY QTRLY ANNUALY Time Each circle is called a footprint
  • 199.
    199 Browsing a DataCube  Visualization  OLAP capabilities  Interactive manipulation
  • 200.
    200 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 201.
    201 Design of DataWarehouse: A Business Analysis Framework  Four views regarding the design of a data warehouse  Top-down view  allows selection of the relevant information necessary for the data warehouse  Data source view  exposes the information being captured, stored, and managed by operational systems  Data warehouse view  consists of fact tables and dimension tables  Business query view  sees the perspectives of data in the warehouse from the view of end-user
  • 202.
    202 Data Warehouse DesignProcess  Top-down, bottom-up approaches or a combination of both  Top-down: Starts with overall design and planning (mature)  Bottom-up: Starts with experiments and prototypes (rapid)  From software engineering point of view  Waterfall: structured and systematic analysis at each step before proceeding to the next  Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around  Typical data warehouse design process  Choose a business process to model, e.g., orders, invoices, etc.  Choose the grain (atomic level of data) of the business process  Choose the dimensions that will apply to each fact table record  Choose the measure that will populate each fact table record
  • 203.
    203 Data Warehouse Development: ARecommended Approach Define a high-level corporate data model Data Mart Data Mart Distributed Data Marts Multi-Tier Data Warehouse Enterprise Data Warehouse Model refinement Model refinement
  • 204.
    204 Data Warehouse Usage Three kinds of data warehouse applications  Information processing  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs  Analytical processing  multidimensional analysis of data warehouse data  supports basic OLAP operations, slice-dice, drilling, pivoting  Data mining  knowledge discovery from hidden patterns  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
  • 205.
    205 From On-Line AnalyticalProcessing (OLAP) to On Line Analytical Mining (OLAM)  Why online analytical mining?  High quality of data in data warehouses  DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses  ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAP-based exploratory data analysis  Mining with drilling, dicing, pivoting, etc.  On-line selection of data mining functions  Integration and swapping of multiple mining functions, algorithms, and tasks
  • 206.
    206 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 207.
207 Efficient Data Cube Computation  Data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  How many cuboids are there in an n-dimensional cube with L levels? T = Π_{i=1..n} (L_i + 1), where L_i is the number of levels of dimension i  Materialization of data cube  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc.
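A quick numeric check of the formula above (the per-dimension level counts are illustrative, not from the text):

from math import prod

def num_cuboids(levels):
    # T = product over dimensions of (L_i + 1), the +1 accounting for the "all" level
    return prod(L + 1 for L in levels)

print(num_cuboids([4, 2, 3, 1]))   # (4+1)(2+1)(3+1)(1+1) = 120 cuboids
print(num_cuboids([1, 1, 1, 1]))   # no concept hierarchies: 2^4 = 16 cuboids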
  • 208.
208 The “Compute Cube” Operator  Cube definition and computation in DMQL: define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales  Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96): SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year  Need to compute the following group-bys: (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()
  • 209.
209 Indexing OLAP Data: Bitmap Index  Index on a particular column  Each value in the column has a bit vector: bit-op is fast  The length of the bit vector: # of records in the base table  The i-th bit is set if the i-th row of the base table has the value for the indexed column  Not suitable for high cardinality domains  A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domains as well [Wu, et al. TODS’06]
  Base table:              Index on Region:                Index on Type:
  Cust  Region   Type      RecID  Asia  Europe  America    RecID  Retail  Dealer
  C1    Asia     Retail    1      1     0       0          1      1       0
  C2    Europe   Dealer    2      0     1       0          2      0       1
  C3    Asia     Dealer    3      1     0       0          3      0       1
  C4    America  Retail    4      0     0       1          4      1       0
  C5    Europe   Dealer    5      0     1       0          5      0       1
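A minimal sketch of a bitmap index on the base table above, using Python integers as bit vectors (one bit per row; the AND query at the end is our own example):

base = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]

def build_bitmap(rows, col):
    # one bit vector per distinct value; bit i corresponds to row i
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[col], 0)
        index[row[col]] |= 1 << i
    return index

region = build_bitmap(base, 1)   # {'Asia': 0b00101, 'Europe': 0b10010, 'America': 0b01000}
rtype = build_bitmap(base, 2)    # {'Retail': 0b01001, 'Dealer': 0b10110}

# "Region = Asia AND Type = Retail" is a single fast bit operation
hits = region["Asia"] & rtype["Retail"]
print([base[i][0] for i in range(len(base)) if hits >> i & 1])   # ['C1']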
  • 210.
210 Indexing OLAP Data: Join Indices  Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)  Traditional indices map the values to a list of record ids  It materializes the relational join in the JI file and speeds up the relational join  In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table  E.g. fact table: Sales and two dimensions city and product  A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions
  • 211.
211 Efficient Processing of OLAP Queries  Determine which operations should be performed on the available cuboids  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection  Determine which materialized cuboid(s) should be selected for the OLAP operation  Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query?  Explore indexing structures and compressed vs. dense array structures in MOLAP
  • 212.
212 OLAP Server Architectures  Relational OLAP (ROLAP)  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services  Greater scalability  Multidimensional OLAP (MOLAP)  Sparse array-based multidimensional storage engine  Fast indexing to pre-computed summarized data  Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)  Flexibility, e.g., low level: relational, high level: array  Specialized SQL servers (e.g., Redbricks)  Specialized support for SQL queries over star/snowflake schemas
  • 213.
    213 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 214.
    214 Attribute-Oriented Induction  Proposedin 1989 (KDD ‘89 workshop)  Not confined to categorical data nor particular measures  How it is done?  Collect the task-relevant data (initial relation) using a relational database query  Perform generalization by attribute removal or attribute generalization  Apply aggregation by merging identical, generalized tuples and accumulating their respective counts  Interaction with users for knowledge presentation
  • 215.
    215 Attribute-Oriented Induction: AnExample Example: Describe general characteristics of graduate students in the University database  Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” }  Step 2. Perform attribute-oriented induction  Step 3. Present results in generalized relation, cross-tab, or rule forms
  • 216.
216 Class Characterization: An Example
  Initial relation:
  Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
  Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
  Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
  Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
  …
  Generalization plan: Name removed; Gender retained; Major generalized to Sci, Eng, Bus; Birth-Place generalized to Country; Birth_date generalized to Age range; Residence generalized to City; Phone # removed; GPA generalized to Excl, VG, …
  Prime generalized relation:
  Gender  Major    Birth_region  Age_range  Residence  GPA        Count
  M       Science  Canada        20-25      Richmond   Very-good  16
  F       Science  Foreign       25-30      Burnaby    Excellent  22
  …       …        …             …          …          …          …
  Crosstab of count by Gender and Birth_Region:
  Gender  Canada  Foreign  Total
  M       16      14       30
  F       10      22       32
  Total   26      36       62
  • 217.
    217 Basic Principles ofAttribute-Oriented Induction  Data focusing: task-relevant data, including dimensions, and the result is the initial relation  Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes  Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A  Attribute-threshold control: typical 2-8, specified/default
  • 218.
    218 Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation.  PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?  PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.  Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
  • 219.
219 Presentation of Generalized Results  Generalized relation:  Relations where some or all attributes are generalized, with counts or other aggregation values accumulated  Cross tabulation:  Mapping results into cross tabulation form (similar to contingency tables)  Visualization techniques:  Pie charts, bar charts, curves, cubes, and other visual forms  Quantitative characteristic rules:  Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
  • 220.
    220 Mining Class Comparisons Comparison: Comparing two or more classes  Method:  Partition the set of relevant data into the target class and the contrasting class(es)  Generalize both classes to the same high level concepts  Compare tuples with the same high level descriptions  Present for every tuple its description and two measures  support - distribution within single class  comparison - distribution between classes  Highlight the tuples with strong discriminant features  Relevance Analysis:  Find attributes (features) which best distinguish different classes
  • 221.
    221 Concept Description vs.Cube-Based OLAP  Similarity:  Data generalization  Presentation of data summarization at multiple levels of abstraction  Interactive drilling, pivoting, slicing and dicing  Differences:  OLAP has systematic preprocessing, query independent, and can drill down to rather low level  AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions  AOI works on the data which are not in relational forms
  • 222.
    222 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 223.
223 Summary  Data warehousing: A multi-dimensional model of a data warehouse  A data cube consists of dimensions & measures  Star schema, snowflake schema, fact constellations  OLAP operations: drilling, rolling, slicing, dicing and pivoting  Data Warehouse Architecture, Design, and Usage  Multi-tiered architecture  Business analysis design framework  Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)  Implementation: Efficient computation of data cubes  Partial vs. full vs. no materialization  Indexing OLAP data: Bitmap index and join index  OLAP query processing  OLAP servers: ROLAP, MOLAP, HOLAP  Data generalization: Attribute-oriented induction
  • 224.
    224 References (I)  S.Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97  S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997  E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.  J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.  A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.  J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.  V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96  J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
  • 225.
    225 References (II)  C.Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003  W. H. Inmon. Building the Data Warehouse. John Wiley, 1996  R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002  P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8– 11, Sept. 1995.  P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97  Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998  S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94  A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.  D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96  P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.  J. Widom. Research problems in data warehousing. CIKM’95  K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1): 1-38, 2006
  • 227.
227 Compression of Bitmap Indices  Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage—the majority of the bits are 0’s  Two compression schemes:  Byte-aligned Bitmap Code (BBC)  Word-Aligned Hybrid (WAH) code  Time and space required to operate on a compressed bitmap is proportional to the total size of the bitmap  Optimal on attributes of low cardinality as well as those of high cardinality  WAH outperforms BBC by about a factor of two
  • 228.
228 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved.
  • 229.
    229 229 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 230.
230 Data Cube: A Lattice of Cuboids  0-D (apex) cuboid: all  1-D cuboids: time, item, location, supplier  2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)  3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)  4-D (base) cuboid: (time, item, location, supplier)
  • 231.
    231 Data Cube: ALattice of Cuboids  Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) 2. (9/15, milk, Urbana, *) 3. (*, milk, Urbana, *) 4. (*, milk, Urbana, *) 5. (*, milk, Chicago, *) 6. (*, milk, *, *) all time,item time,item,location time, item, location, supplier time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
  • 232.
232 Cube Materialization: Full Cube vs. Iceberg Cube  Full cube vs. iceberg cube compute cube sales iceberg as select month, city, customer group, count(*) from salesInfo cube by month, city, customer group having count(*) >= min support  (the having clause is the iceberg condition)  Computing only the cuboid cells whose measure satisfies the iceberg condition  Only a small portion of cells may be “above the water’’ in a sparse cube  Avoid explosive growth: A cube with 100 dimensions  2 base cells: (a1, a2, …, a100), (b1, b2, …, b100)  How many aggregate cells if “having count >= 1”?  What about “having count >= 2”?
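A minimal brute-force sketch of iceberg-cube computation for a count measure (the tiny salesInfo-style data set below is ours, purely for illustration):

from collections import Counter
from itertools import combinations

sales = [("Jan", "Chicago", "retail"), ("Jan", "Chicago", "retail"),
         ("Jan", "Urbana", "wholesale"), ("Feb", "Chicago", "retail")]
dims = ("month", "city", "customer_group")
min_sup = 2

iceberg = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        # aggregate this group-by: keep the chosen dimensions, star out the rest
        counts = Counter(tuple(t[i] if i in subset else "*" for i in range(len(dims)))
                         for t in sales)
        # keep only the cells whose count satisfies the iceberg condition
        iceberg.update({cell: c for cell, c in counts.items() if c >= min_sup})

for cell, c in sorted(iceberg.items()):
    print(cell, c)
# e.g. ('*', '*', '*') 4, ('*', 'Chicago', 'retail') 3, ('Jan', 'Chicago', 'retail') 2, ...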
  • 233.
233 Iceberg Cube, Closed Cube & Cube Shell  Is iceberg cube good enough?  2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}  How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!  Closed cube:  Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c  Closed cube: a cube consisting of only closed cells  What is the closed cube of the above base cuboid? Hint: only 3 cells  Cube Shell  Precompute only the cuboids involving a small # of dimensions, e.g., 3  More dimension combinations will need to be computed on the fly  For (A1, A2, … A10), how many combinations to compute?
  • 234.
    234 234 Roadmap for EfficientComputation  General cube computation heuristics (Agarwal et al.’96)  Computing full/iceberg cubes: 3 methodologies  Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97)  Top-down:  BUC (Beyer & Ramarkrishnan, SIGMOD’99)  H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)  Integrating Top-Down and Bottom-Up:  Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)  High-dimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04)  Computing alternative kinds of cubes:  Partial cube, closed cube, approximate cube, etc.
  • 235.
    235 235 General Heuristics (Agarwalet al. VLDB’96)  Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples  Aggregates may be computed from previously computed aggregates, rather than from the base fact table  Smallest-child: computing a cuboid from the smallest, previously computed cuboid  Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os  Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads  Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used  Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
  • 236.
    236 236 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 237.
    237 237 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 238.
238 Multi-Way Array Aggregation  Array-based “bottom-up” algorithm  Using multi-dimensional chunks  No direct tuple comparisons  Simultaneous aggregation on multiple dimensions  Intermediate aggregate values are re-used for computing ancestor cuboids  Cannot do Apriori pruning: no iceberg optimization  (Figure: the cuboid lattice All; A, B, C; AB, AC, BC; ABC)
  • 239.
239 Multi-way Array Aggregation for Cube Computation (MOLAP)  Partition arrays into chunks (a small subcube which fits in memory)  Compressed sparse array addressing: (chunk_id, offset)  Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost  What is the best traversing order to do multi-way aggregation?  (Figure: a 3-D array over dimensions A (a0–a3), B (b0–b3), C (c0–c3) partitioned into 64 chunks numbered 1–64)
  • 240.
240 Multi-way Array Aggregation for Cube Computation (3-D to 2-D)  The best order is the one that minimizes the memory requirement and reduces I/Os  (Figure: aggregating the 3-D cuboid ABC down to the 2-D cuboids AB, AC, BC in the lattice All; A, B, C; AB, AC, BC; ABC)
  • 241.
241 Multi-way Array Aggregation for Cube Computation (2-D to 1-D)  (Figure: aggregating the 2-D cuboids AB, AC, BC down to the 1-D cuboids A, B, C and the apex All)
  • 242.
    242 242 Multi-Way Array Aggregationfor Cube Computation (Method Summary)  Method: the planes should be sorted and computed according to their size in ascending order  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane  Limitation of the method: computing well only for a small number of dimensions  If there are a large number of dimensions, “top- down” computation and iceberg cube computation methods can be explored
  • 243.
    243 243 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 244.
244 Bottom-Up Computation (BUC)  BUC (Beyer & Ramakrishnan, SIGMOD’99)  Bottom-up cube computation (Note: top-down in our view!)  Divides dimensions into partitions and facilitates iceberg pruning  If a partition does not satisfy min_sup, its descendants can be pruned  If minsup = 1, compute the full CUBE!  No simultaneous aggregation  (Figure: the cuboid lattice from all down to ABCD, with numbers 1–16 giving BUC’s processing order: all, A, AB, ABC, ABCD, ABD, AC, ACD, AD, B, BC, BCD, BD, C, CD, D)
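A minimal sketch of BUC-style recursive partitioning with iceberg pruning on a count measure (illustrative only, not the paper's optimized implementation; the toy relation is ours):

from collections import defaultdict

def buc(rows, dims, start, cell, min_sup, out):
    if len(rows) < min_sup:                  # prune: no descendant cell can satisfy min_sup
        return
    out.append((dict(cell), len(rows)))      # output the current aggregate cell
    for d in range(start, len(dims)):        # expand one more dimension at a time
        parts = defaultdict(list)
        for r in rows:
            parts[r[d]].append(r)            # partition on dimension d
        for value, part in parts.items():
            cell[dims[d]] = value
            buc(part, dims, d + 1, cell, min_sup, out)
            del cell[dims[d]]

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cells = []
buc(rows, ("A", "B", "C"), 0, {}, 2, cells)
print(cells)   # e.g. ({}, 4), ({'A': 'a1'}, 3), ({'A': 'a1', 'B': 'b1'}, 2), ...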
  • 245.
245 BUC: Partitioning  Usually, the entire data set can’t fit in main memory  Sort distinct values, partition into blocks that fit  Continue processing  Optimizations  Partitioning: external sorting, hashing, counting sort  Ordering dimensions to encourage pruning: cardinality, skew, correlation  Collapsing duplicates  Can’t do holistic aggregates anymore!
  • 246.
    246 246 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 247.
    247 247 Star-Cubing: An IntegratingMethod  D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03  Explore shared dimensions  E.g., dimension A is the shared dimension of ACD and AD  ABD/AB means cuboid ABD has shared dimensions AB  Allows for shared computations  e.g., cuboid AB is computed simultaneously as ABD C/C AC/A C BC/BC ABC/ABC ABD/AB ACD/A BCD AD/A BD/B CD D ABC D/all  Aggregate in a top-down manner but with the bottom- up sub-layer underneath which will allow Apriori pruning  Shared dimensions grow in bottom-up fashion
  • 248.
    248 248 Iceberg Pruning inShared Dimensions  Anti-monotonic property of shared dimensions  If the measure is anti-monotonic, and if the aggregate value on a shared dimension does not satisfy the iceberg condition, then all the cells extended from this shared dimension cannot satisfy the condition either  Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning  Problem: how to prune while still aggregate simultaneously on multiple dimensions?
  • 249.
    249 249 Cell Trees  Usea tree structure similar to H-tree to represent cuboids  Collapses common prefixes to save memory  Keep count at node  Traverse the tree to retrieve a particular tuple
  • 250.
250 Star Attributes and Star Nodes  Intuition: If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish them during the iceberg computation  E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3  Solution: Replace such attributes by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes
  A   B   C   D   Count
  a1  b1  c1  d1  1
  a1  b1  c4  d3  1
  a1  b2  c2  d2  1
  a2  b3  c3  d4  1
  a2  b4  c3  d4  1
  • 251.
251 Example: Star Reduction  Suppose minsup = 2  Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *. And collapse all *’s together  Resulting table has all such attributes replaced with the star-attribute  With regards to the iceberg computation, this new table is a lossless compression of the original table
  After replacing infrequent values with *:
  A   B   C   D   Count
  a1  b1  *   *   1
  a1  b1  *   *   1
  a1  *   *   *   1
  a2  *   c3  d4  1
  a2  *   c3  d4  1
  After collapsing identical tuples:
  A   B   C   D   Count
  a1  b1  *   *   2
  a1  *   *   *   1
  a2  *   c3  d4  2
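A minimal sketch of this star-reduction step on the base table from the previous slide (minsup = 2):

from collections import Counter

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"), ("a1", "b2", "c2", "d2"),
        ("a2", "b3", "c3", "d4"), ("a2", "b4", "c3", "d4")]
min_sup = 2

# one-dimensional aggregation per column
col_counts = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]

# replace infrequent attribute values with '*' (star attributes)
starred = [tuple(v if col_counts[i][v] >= min_sup else "*" for i, v in enumerate(r))
           for r in rows]

# collapse identical generalized tuples, accumulating counts
print(Counter(starred))
# Counter({('a1', 'b1', '*', '*'): 2, ('a2', '*', 'c3', 'd4'): 2, ('a1', '*', '*', '*'): 1})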
  • 252.
252 Star Tree  Given the new compressed table, it is possible to construct the corresponding cell tree—called a star tree  Keep a star table at the side for easy lookup of star attributes  The star tree is a lossless compression of the original cell tree
  A   B   C   D   Count
  a1  b1  *   *   2
  a1  *   *   *   1
  a2  *   c3  d4  2
  • 253.
253 Star-Cubing Algorithm—DFS on Lattice Tree  (Figure: the cuboid lattice from all down to ABCD/all, with each cuboid annotated by its shared dimensions, e.g., AB/AB, AC/AC, ABD/AB, ACD/A, BCD, and the base star tree rooted at root: 5 with children a1: 3 and a2: 2 and their descendant star nodes b*, b1, c*, c3, d*, d4 carrying counts)
  • 254.
254 Multi-Way Aggregation  (Figure: the child star trees ABC/ABC, ABD/AB, ACD/A, and BCD that are aggregated simultaneously while traversing the base ABCD tree)
  • 255.
255 Star-Cubing Algorithm—DFS on Star-Tree  (Figure: the DFS over the base ABCD star tree, during which the descendant trees ABC/ABC, ABD/AB, ACD/A, and BCD are created)
  • 256.
    256 256 Multi-Way Star-Tree Aggregation Start depth-first search at the root of the base star tree  At each new node in the DFS, create corresponding star tree that are descendants of the current tree according to the integrated traversal ordering  E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created  When DFS reaches b*, the ABD/AD tree is created  The counts in the base tree are carried over to the new trees  When DFS reaches a leaf node (e.g., d*), start backtracking  On every backtracking branch, the count in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed  Example  When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed  When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed  When at b*, jump to b1 and repeat similar process ABC /ABC ABD/AB ACD /A BCD ABCD
  • 257.
    257 257 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 258.
258 The Curse of Dimensionality  None of the previous cubing methods can handle high dimensionality!  Example: a database of 600k tuples, where each dimension has a cardinality of 100 and a Zipf skew factor of 2
  • 259.
    259 259 Motivation of High-DOLAP  X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04  Challenge to current cubing methods:  The “curse of dimensionality’’ problem  Iceberg cube and compressed cubes: only delay the inevitable explosion  Full materialization: still significant overhead in accessing results on disk  High-D OLAP is needed in applications  Science and engineering analysis  Bio-data analysis: thousands of genes  Statistical surveys: hundreds of variables
  • 260.
260 Fast High-D OLAP with Minimal Cubing  Observation: OLAP occurs only on a small subset of dimensions at a time  Semi-Online Computational Model 1. Partition the set of dimensions into shell fragments 2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices 3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online
  • 261.
    261 261 Properties of ProposedMethod  Partitions the data vertically  Reduces high-dimensional cube into a set of lower dimensional cubes  Online re-construction of original high-dimensional space  Lossless reduction  Offers tradeoffs between the amount of pre- processing and the speed of online computation
  • 262.
262 Example Computation  Let the cube aggregation function be count  Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E)
  tid  A   B   C   D   E
  1    a1  b1  c1  d1  e1
  2    a1  b2  c1  d2  e1
  3    a1  b2  c1  d1  e2
  4    a2  b1  c1  d1  e2
  5    a2  b1  c1  d1  e3
  • 263.
263 1-D Inverted Indices  Build a traditional inverted index or RID list
  Attribute Value  TID List       List Size
  a1               1, 2, 3        3
  a2               4, 5           2
  b1               1, 4, 5        3
  b2               2, 3           2
  c1               1, 2, 3, 4, 5  5
  d1               1, 3, 4, 5     4
  d2               2              1
  e1               1, 2           2
  e2               3, 4           2
  e3               5              1
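A minimal sketch of building these TID-list indices from the 5-tuple example table:

from collections import defaultdict

table = {1: ("a1", "b1", "c1", "d1", "e1"), 2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"), 4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}

inverted = defaultdict(list)
for tid, values in table.items():
    for v in values:
        inverted[v].append(tid)          # TID list per attribute value

for value, tids in sorted(inverted.items()):
    print(value, tids, len(tids))        # e.g. a1 [1, 2, 3] 3, b1 [1, 4, 5] 3, ...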
  • 264.
264 Shell Fragment Cubes: Ideas  Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense  Compute all cuboids for data cubes ABC and DE while retaining the inverted indices  For example, shell fragment cube ABC contains 7 cuboids:  A, B, C  AB, AC, BC  ABC  This completes the offline computation stage  Example cells of the AB cuboid:
  Cell   Intersection         TID List  List Size
  a1 b1  {1,2,3} ∩ {1,4,5}    {1}       1
  a1 b2  {1,2,3} ∩ {2,3}      {2,3}     2
  a2 b1  {4,5} ∩ {1,4,5}      {4,5}     2
  a2 b2  {4,5} ∩ {2,3}        {}        0
  • 265.
265 Shell Fragment Cubes: Size and Design  Given a database of T tuples, D dimensions, and a shell fragment size F, the fragment cubes’ space requirement is O(T × ⌈D/F⌉ × (2^F − 1))  For F < 5, the growth is sub-linear  Shell fragments do not have to be disjoint  Fragment groupings can be arbitrary to allow for maximum online performance  Known common combinations (e.g., <city, state>) should be grouped together  Shell fragment sizes can be adjusted for an optimal balance between offline and online computation
  • 266.
266 ID_Measure Table  If measures other than count are present, store in an ID_measure table separate from the shell fragments
  tid  count  sum
  1    5      70
  2    3      10
  3    8      20
  4    5      40
  5    2      30
  • 267.
267 The Frag-Shells Algorithm
  1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk)
  2. Scan the base table once and do the following:
  3.   insert <tid, measure> into the ID_measure table
  4.   for each attribute value ai of each dimension Ai
  5.     build inverted index entry <ai, tidlist>
  6. For each fragment partition Pi
  7.   build local fragment cube Si by intersecting tid-lists in a bottom-up fashion
  • 268.
268 Frag-Shells (2)  (Figure: dimensions A, B, C, D, E, F, … are partitioned into shell fragments, e.g., an ABC cube and a DEF cube; the DE cuboid stores cells with tuple-ID lists, e.g., (d1, e1): {1, 3, 8, 9}, (d1, e2): {2, 4, 6, 7}, (d2, e1): {5, 10}, …)
  • 269.
269 Online Query Computation: Query  A query has the general form ⟨a1, a2, …, an⟩: M  Each ai has 3 possible values 1. Instantiated value 2. Aggregate * function 3. Inquire ? function  For example, ⟨3, ?, ?, *, 1⟩: count returns a 2-D data cube
  • 270.
270 Online Query Computation: Method  Given the fragment cubes, process a query as follows 1. Divide the query into fragments, the same as the shell partition 2. Fetch the corresponding TID list for each fragment from the fragment cube 3. Intersect the TID lists from each fragment to construct the instantiated base table 4. Compute the data cube using the base table with any cubing algorithm
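A minimal sketch of this online evaluation on the running example, with fragment cubes represented as dicts from (dimension, value) cells to TID sets; the particular query ⟨a2, ?, *, d1, *⟩ is our own illustration:

frag_ABC = {("A", "a1"): {1, 2, 3}, ("A", "a2"): {4, 5},
            ("B", "b1"): {1, 4, 5}, ("B", "b2"): {2, 3}}
frag_DE = {("D", "d1"): {1, 3, 4, 5}, ("D", "d2"): {2}}

# Steps 1-3: fetch the TID lists for the instantiated values and intersect them
tids = frag_ABC[("A", "a2")] & frag_DE[("D", "d1")]          # {4, 5}

# Step 4: compute the (here 1-D) cube over the inquired dimension B
counts = {v: len(tid_list & tids)
          for (dim, v), tid_list in frag_ABC.items() if dim == "B"}
print(counts)   # {'b1': 2, 'b2': 0}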
  • 271.
271 Online Query Computation: Sketch  (Figure: fragment cubes over dimensions A–N are probed, their TID lists are intersected into an instantiated base table, and the online cube is computed from it)
  • 272.
272 Experiment: Size vs. Dimensionality (50 and 100 cardinality)  (50-C): 10^6 tuples, 0 skew, cardinality 50, fragment size 3  (100-C): 10^6 tuples, skew 2, cardinality 100, fragment size 2
  • 273.
    273 273 Experiments on RealWorld Data  UCI Forest CoverType data set  54 dimensions, 581K tuples  Shell fragments of size 2 took 33 seconds and 325MB to compute  3-D subquery with 1 instantiate D: 85ms~1.4 sec.  Longitudinal Study of Vocational Rehab. Data  24 dimensions, 8818 tuples  Shell fragments of size 3 took 0.9 seconds and 60MB to compute  5-D query with 0 instantiated D: 227ms~2.6 sec.
  • 274.
    274 274 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  Ranking Cube  Multidimensional Data Analysis in Cube Space  Summary
  • 275.
    275 275 Processing Advanced Queriesby Exploring Data Cube Technology  Sampling Cube  X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08  Ranking Cube  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  Other advanced cubes for processing data and queries  Stream cube, spatial cube, multimedia cube, text cube, RFID cube, etc. — to be studied in volume 2
  • 276.
    276 276 Statistical Surveys andOLAP  Statistical survey: A popular tool to collect information about a population based on a sample  Ex.: TV ratings, US Census, election polls  A common tool in politics, health, market research, science, and many more  An efficient way of collecting information (Data collection is expensive)  Many statistical tools available, to determine validity  Confidence intervals  Hypothesis tests  OLAP (multidimensional analysis) on survey data  highly desirable but can it be done well?
  • 277.
277 Surveys: Sample vs. Whole Population  (Figure: an Age × Education grid with education levels High-school, College, Graduate and ages 18, 19, 20, …)  Data is only a sample of the population
  • 278.
278 Problems for Drilling in Multidim. Space  (Figure: the same Age × Education grid)  Data is only a sample of the population, but samples could be small when drilling to certain multidimensional subspaces
  • 279.
    279 279 OLAP on Survey(i.e., Sampling) Data Age/Education High-school College Graduate 18 19 20 …  Semantics of query is unchanged  Input data has changed
  • 280.
    280 280 Challenges for OLAPon Sampling Data  Computing confidence intervals in OLAP context  No data?  Not exactly. No data in subspaces in cube  Sparse data  Causes include sampling bias and query selection bias  Curse of dimensionality  Survey data can be high dimensional  Over 600 dimensions in real world example  Impossible to fully materialize
  • 281.
    281 281 Example 1: ConfidenceInterval Age/Education High-school College Graduate 18 19 20 … What is the average income of 19-year-old high-school students? Return not only query result but also confidence interval
  • 282.
282 Confidence Interval  Confidence interval at a given confidence level: x̄ ± t_c · σ̂_x̄  x is a sample of the data set; x̄ is the mean of the sample  t_c is the critical t-value, calculated by a look-up  σ̂_x̄ = s/√l is the estimated standard error of the mean  Example: $50,000 ± $3,000 with 95% confidence  Treat points in a cube cell as samples  Compute the confidence interval as for a traditional sample set  Return the answer in the form of a confidence interval  Indicates quality of query answer
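A minimal sketch of computing this cell measure (SciPy assumed for the t-value look-up; the income sample is hypothetical):

from statistics import mean, stdev
from math import sqrt
from scipy.stats import t      # critical t-value look-up

def confidence_interval(cell_values, confidence=0.95):
    l = len(cell_values)
    x_bar = mean(cell_values)
    se = stdev(cell_values) / sqrt(l)                 # estimated standard error of the mean
    t_c = t.ppf(1 - (1 - confidence) / 2, df=l - 1)   # two-sided critical value
    return x_bar, t_c * se

incomes = [47_000, 52_000, 55_000, 44_000, 51_000]    # hypothetical samples in one cube cell
m, half_width = confidence_interval(incomes)
print(f"${m:,.0f} +/- ${half_width:,.0f} with 95% confidence")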
  • 283.
283 Efficient Computation of Confidence Interval Measures  Efficient computation in all cells in the data cube  Both mean and confidence interval are algebraic  Why is the confidence interval measure algebraic? The interval is x̄ ± t_c · σ̂_x̄, and σ̂_x̄ = s/√l is algebraic since both s (standard deviation) and l (count) are algebraic  Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
  • 284.
    284 284 Example 2: QueryExpansion Age/Education High-school College Graduate 18 19 20 … What is the average income of 19-year-old college students?
  • 285.
    285 285 Boosting Confidence byQuery Expansion  From the example: The queried cell “19-year-old college students” contains only 2 samples  Confidence interval is large (i.e., low confidence). why?  Small sample size  High standard deviation with samples  Small sample sizes can occur at relatively low dimensional selections  Collect more data?― expensive!  Use data in other cells? Maybe, but have to be careful
  • 286.
    286 286 Intra-Cuboid Expansion: Choice1 Age/Education High-school College Graduate 18 19 20 … Expand query to include 18 and 20 year olds?
  • 287.
    287 287 Intra-Cuboid Expansion: Choice2 Age/Education High-school College Graduate 18 19 20 … Expand query to include high-school and graduate students?
  • 289.
289 Intra-Cuboid Expansion  Combine other cells’ data into one’s own to “boost” confidence  Only if they share semantic and cube similarity  Use only if necessary  A bigger sample size will decrease the confidence interval  Cell segment similarity  Some dimensions are clear: Age  Some are fuzzy: Occupation  May need domain knowledge  Cell value similarity  How to determine if two cells’ samples come from the same population?  Two-sample t-test (confidence-based)
  • 290.
    290 290 Inter-Cuboid Expansion  Ifa query dimension is  Not correlated with cube value  But is causing small sample size by drilling down too much  Remove dimension (i.e., generalize to *) and move to a more general cuboid  Can use two-sample t-test to determine similarity between two cells across cuboids  Can also use a different method to be shown later
  • 291.
291 Query Expansion Experiments  Real world sample data: 600 dimensions and 750,000 tuples  0.05% of the data used to simulate the “sample” (allows error checking)
  • 292.
    292 292 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  Ranking Cube  Multidimensional Data Analysis in Cube Space  Summary
  • 293.
    293 Ranking Cubes –Efficient Computation of Ranking queries  Data cube helps not only OLAP but also ranked search  (top-k) ranking query: only returns the best k results according to a user-specified preference, consisting of (1) a selection condition and (2) a ranking function  Ex.: Search for apartments with expected price 1000 and expected square feet 800  Select top 1 from Apartment  where City = “LA” and Num_Bedroom = 2  order by [price – 1000]^2 + [sq feet - 800]^2 asc  Efficiency question: Can we only search what we need?  Build a ranking cube on both selection dimensions and ranking dimensions
  • 294.
294 Ranking Cube: Partition Data on Both Selection and Ranking Dimensions  (Figure: one single data partition over the ranking dimensions serves as the template (the partition for all data); the data partition is then sliced by selection conditions, e.g., a sliced partition for city = “LA” and a sliced partition for BR = 2)
  • 295.
295 Materialize Ranking-Cube
  tid  City  BR  Price  Sq feet  Block ID
  t1   SEA   1   500    600      5
  t2   CLE   2   700    800      5
  t3   SEA   1   800    900      2
  t4   CLE   3   1000   1000     6
  t5   LA    1   1100   200      15
  t6   LA    2   1200   500      11
  t7   LA    2   1200   560      11
  t8   CLE   3   1350   1120     4
  Step 1: Partition data on the ranking dimensions (price, sq feet) into blocks numbered 1–16
  Step 2: Group data by the selection dimensions (City, BR, City & BR), e.g., CLE, LA, SEA and BR = 1, 2, 3, 4
  Step 3: Compute measures for each group, e.g., for the cell (LA): block-level measure {11, 15}; data-level measure {11: t6, t7; 15: t5}
  • 296.
296 Search with Ranking-Cube: Simultaneously Push Selection and Ranking  Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc  Without the ranking cube, the search starts from the whole data; with the ranking cube, it starts from the query point (price = 1000, sq feet = 800) using the measure for LA: {11, 15}, {11: t6, t7; 15: t5}  Given the bin boundaries, locate the block with the top score  Bin boundaries for price: [500, 600, 800, 1100, 1350]; bin boundaries for sq feet: [200, 400, 600, 800, 1120]
  • 297.
297 Processing Ranking Query: Execution Trace  Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc  f = [price – 1000]^2 + [sq feet – 800]^2  With the ranking cube, the search starts from the query point  Measure for LA: {11, 15}, {11: t6, t7; 15: t5}  Bin boundaries for price: [500, 600, 800, 1100, 1350]; bin boundaries for sq feet: [200, 400, 600, 800, 1120]  Execution trace: 1. Retrieve the high-level measure for LA: {11, 15} 2. Estimate lower-bound scores for blocks 11 and 15: f(block 11) = 40,000, f(block 15) = 160,000 3. Retrieve block 11 4. Retrieve the low-level measure for block 11 5. f(t6) = 130,000, f(t7) = 97,600; output t7, done!
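A minimal sketch of the block-level pruning idea: bound f from below over each candidate block using its bin boundaries, then retrieve the most promising block first. The block-to-bin ranges below are our assumption (the slide does not spell out its block numbering), so the bounds differ slightly from the slide's figures:

def interval_dist(q, lo, hi):
    # squared distance from query value q to the interval [lo, hi]
    return 0 if lo <= q <= hi else min((q - lo) ** 2, (q - hi) ** 2)

def block_lower_bound(block, q_price=1000, q_sqft=800):
    # lower bound of f over the block: sum of per-dimension squared distances
    (p_lo, p_hi), (s_lo, s_hi) = block
    return interval_dist(q_price, p_lo, p_hi) + interval_dist(q_sqft, s_lo, s_hi)

# assumed bin ranges for blocks 11 and 15 from the LA measure
block_11 = ((1100, 1350), (400, 600))
block_15 = ((1100, 1350), (200, 400))
print(block_lower_bound(block_11), block_lower_bound(block_15))   # 50000 170000 -> try block 11 first

def f(price, sqft):
    return (price - 1000) ** 2 + (sqft - 800) ** 2

print(f(1200, 500), f(1200, 560))   # tuples t6, t7 in block 11 -> 130000, 97600; t7 wins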
  • 298.
298 Ranking Cube: Methodology and Extension  Ranking cube methodology  Push selection and ranking simultaneously  It works for many sophisticated ranking functions  How to support high-dimensional data?  Materialize only those atomic cuboids that contain single selection dimensions  Uses an idea similar to high-dimensional OLAP  Achieves low space overhead and high performance in answering ranking queries with a high number of selection dimensions
  • 299.
    299 299 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 300.
    300 300 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 301.
301 Data Mining in Cube Space  Data cube greatly increases the analysis bandwidth  Four ways to combine OLAP-style analysis and data mining  Using cube space to define the data space for mining  Using OLAP queries to generate features and targets for mining, e.g., multi-feature cube  Using data-mining models as building blocks in a multi-step mining process, e.g., prediction cube  Using data-cube computation techniques to speed up repeated model construction  Cube-space data mining may require building a model for each candidate data space  Sharing computation across model construction for different candidates may lead to efficient mining
  • 302.
302 Prediction Cubes  Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in an OLAP manner  Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction
  • 303.
303 How to Determine the Prediction Power of an Attribute?  Ex. A customer table D:  Two dimensions Z: Time (Month, Year) and Location (State, Country)  Two features X: Gender and Salary  One class-label attribute Y: Valued Customer  Q: “Are there times and locations in which the value of a customer depended greatly on the customer's gender (i.e., Gender: predictiveness attribute V)?”  Idea:  Compute the difference between the model built using X to predict Y and the model built using X – V to predict Y  If the difference is large, V must play an important role in predicting Y
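A sketch of this idea on synthetic data for a single cell, using a decision tree from scikit-learn as one possible model choice (the slide does not prescribe a model); all names and data here are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
gender = rng.integers(0, 2, n)              # the predictiveness attribute V
salary = rng.normal(50_000, 10_000, n)      # the other feature
# In this synthetic cell the label depends strongly on gender
valued = (gender == 1) & (salary > 45_000)

X_full = np.column_stack([gender, salary])      # X
X_without_v = salary.reshape(-1, 1)             # X - V

acc_full = cross_val_score(DecisionTreeClassifier(max_depth=3), X_full, valued, cv=5).mean()
acc_wo_v = cross_val_score(DecisionTreeClassifier(max_depth=3), X_without_v, valued, cv=5).mean()
print(f"accuracy with V: {acc_full:.3f}, without V: {acc_wo_v:.3f}")
print(f"prediction power of V ~ {acc_full - acc_wo_v:.3f}")   # a large gap means V matters in this cell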
  • 304.
    304 Efficient Computation ofPrediction Cubes  Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity  Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation
  • 305.
    305 305 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 306.
306 Complex Aggregation at Multiple Granularities: Multi-Feature Cubes  Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities  Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum-price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price)  Continuing the example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
  • 307.
    307 307 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 308.
308 Discovery-Driven Exploration of Data Cubes  Hypothesis-driven  Exploration by user; huge search space  Discovery-driven (Sarawagi, et al.'98)  Effective navigation of large OLAP data cubes  Pre-compute measures indicating exceptions to guide the user in data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell
  • 309.
309 Kinds of Exceptions and Their Computation  Parameters  SelfExp: surprise of a cell relative to other cells at the same level of aggregation  InExp: surprise beneath the cell  PathExp: surprise beneath the cell for each drill-down path  Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction  Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
  • 311.
    311 311 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 312.
    312 312 Data Cube Technology:Summary  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  MultiWay Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP with Shell-Fragments  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cubes  Ranking Cubes  Multidimensional Data Analysis in Cube Space  Discovery-Driven Exploration of Data Cubes  Multi-feature Cubes 
  • 313.
    313 313 Ref.(I) Data CubeComputation Methods  S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97  K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99  M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB’98  J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997.  J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01  L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube, VLDB'02  X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04  Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97  K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97  D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03  D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking, ICDE'06
  • 314.
    314 314 Ref. (II) AdvancedApplications with Data Cubes  D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. VLDB’05  X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08  C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for multidimensional text database analysis. ICDM’08  D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses. SSTD’01  N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–958, 2000.  T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09  T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized data cubes. SIGMOD’08  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. CIKM’98  D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text databases. SDM’09
  • 315.
    315 Ref. (III) KnowledgeDiscovery with Data Cubes  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97  B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05  B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global aggregates from local regions. VLDB’06  Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02  G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’ 01  R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural databases. PODS’05  J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998  T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Mining & Knowledge Discovery, 6:219–258, 2002.  R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge Discovery, 15:29–54, 2007.  K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98  S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98  G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
  • 316.
  • 317.
    317 317 Chapter 5: DataCube Technology  Efficient Methods for Data Cube Computation  Preliminary Concepts and General Strategies for Cube Computation  Multiway Array Aggregation for Full Cube Computation  BUC: Computing Iceberg Cubes from the Apex Cuboid Downward  H-Cubing: Exploring an H-Tree Structure  Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure  Precomputing Shell Fragments for Fast High-Dimensional OLAP  Data Cubes for Advanced Applications  Sampling Cubes: OLAP on Sampling Data  Ranking Cubes: Efficient Computation of Ranking Queries  Knowledge Discovery with Data Cubes  Discovery-Driven Exploration of Data Cubes  Complex Aggregation at Multiple Granularity: Multi-feature Cubes  Prediction Cubes: Data Mining in Multi-Dimensional Cube Space  Summary
  • 318.
318 H-Cubing: Using an H-Tree Structure  Bottom-up computation  Exploring an H-tree structure  If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)  No simultaneous aggregation  (Figure: lattice of cuboids from the apex (all) down to ABCD)
  • 319.
319 H-tree: A Prefix Hyper-tree
Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Jan | Tor | Hhd | TV | 800 | 1200
Jan | Tor | Edu | Camera | 1160 | 1280
Feb | Mon | Bus | Laptop | 1500 | 2500
Mar | Van | Edu | HD | 540 | 520
… | … | … | … | … | …
(Figure: H-tree rooted at root, with branches edu/hhd/bus → Jan/Feb/Mar → Tor/Van/Mon, quant-info nodes such as Sum: 1765, Cnt: 2, bins, and a header table of attribute values with quant-info and side-links, e.g., Edu Sum: 2285)
  • 320.
    320 320 root Edu. Hhd. Bus. Jan.Mar. Jan. Feb. Tor. Van. Tor. Mon. Q.I. Q.I. Q.I. Quant- Info Sum: 1765 Cnt: 2 bins Attr. Val. Quant-Info Side-link Edu Sum:2285 … Hhd … Bus … … … Jan … Feb … … … Tor … Van … Mon … … … Attr. Val. Q.I. Side-link Edu … Hhd … Bus … … … Jan … Feb … … … Header Table HTor From (*, *, Tor) to (*, Jan, Tor) Computing Cells Involving “City”
  • 321.
    321 321 Computing Cells InvolvingMonth But No City root Edu. Hhd. Bus. Jan. Mar. Jan. Feb. Tor. Van. Tor. Mont. Q.I. Q.I. Q.I. Attr. Val. Quant-Info Side-link Edu. Sum:2285 … Hhd. … Bus. … … … Jan. … Feb. … Mar. … … … Tor. … Van. … Mont. … … … 1. Roll up quant-info 2. Compute cells involving month but no city Q.I. Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so does its parents. No binning is needed!
  • 322.
    322 322 Computing Cells InvolvingOnly Cust_grp root edu hhd bus Jan Mar Jan Feb Tor Van Tor Mon Q.I. Q.I. Q.I. Attr. Val. Quant-Info Side-link Edu Sum:2285 … Hhd … Bus … … … Jan … Feb … Mar … … … Tor … Van … Mon … … … Check header table directly Q.I.
  • 323.
    323 323 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 324.
324 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 325.
325 What Is Frequent Pattern Analysis?  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set  First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining  Motivation: Finding inherent regularities in data  What products were often purchased together?— Beer and diapers?!  What are the subsequent purchases after buying a PC?  What kinds of DNA are sensitive to this new drug?  Can we automatically classify web documents?  Applications  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
  • 326.
    326 Why Is Freq.Pattern Mining Important?  Freq. pattern: An intrinsic and important property of datasets  Foundation for many essential data mining tasks  Association, correlation, and causality analysis  Sequential, structural (e.g., sub-graph) patterns  Pattern analysis in spatiotemporal, multimedia, time- series, and stream data  Classification: discriminative, frequent pattern analysis  Cluster analysis: frequent pattern-based clustering  Data warehousing: iceberg cube and cube-gradient  Semantic data compression: fascicles  Broad applications
  • 327.
327 Basic Concepts: Frequent Patterns  Itemset: a set of one or more items  k-itemset X = {x1, …, xk}  (absolute) support, or support count, of X: frequency or number of occurrences of itemset X  (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)  An itemset X is frequent if X's support is no less than a minsup threshold
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
(Figure: Venn diagram of customers buying beer, diapers, or both)
  • 328.
328 Basic Concepts: Association Rules  Find all the rules X ⇒ Y with minimum support and confidence  support, s: probability that a transaction contains X ∪ Y  confidence, c: conditional probability that a transaction having X also contains Y  Let minsup = 50%, minconf = 50%  Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
 Association rules (many more!)  Beer ⇒ Diaper (60%, 100%)  Diaper ⇒ Beer (60%, 75%)
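The numbers on this slide can be checked with a few lines of Python over the five transactions (a hedged illustration, not part of the original slide):

transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset):
    # relative support: fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction with lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))            # 0.6  -> frequent at minsup 50%
print(confidence({"Beer"}, {"Diaper"}))       # 1.0  -> Beer => Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}))       # 0.75 -> Diaper => Beer (60%, 75%)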
  • 329.
329 Closed Patterns and Max-Patterns  A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 – 1 ≈ 1.27×10^30 sub-patterns!  Solution: Mine closed patterns and max-patterns instead  An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)  An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)  Closed patterns are a lossless compression of frequent patterns
  • 330.
330 Closed Patterns and Max-Patterns  Exercise. DB = {<a1, …, a100>, <a1, …, a50>}  Min_sup = 1  What is the set of closed itemsets?  <a1, …, a100>: 1  <a1, …, a50>: 2  What is the set of max-patterns?  <a1, …, a100>: 1  What is the set of all frequent patterns?  Far too many to list!
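The 100-item exercise cannot be enumerated directly, but a scaled-down version (a1…a4 and a1…a2 instead of a1…a100 and a1…a50) makes the definitions concrete; this brute-force sketch is illustrative only:

from itertools import combinations

db = [frozenset(f"a{i}" for i in range(1, 5)),    # <a1, ..., a4>
      frozenset(f"a{i}" for i in range(1, 3))]    # <a1, a2>
min_sup = 1
items = sorted(set().union(*db))

def sup(x):
    return sum(x <= t for t in db)

frequent = {frozenset(c): sup(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(frozenset(c)) >= min_sup}

# closed: no proper superset with the same support; maximal: no frequent proper superset
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))                 # 15 frequent itemsets (2^4 - 1)
print(sorted(map(sorted, closed)))   # [['a1','a2'], ['a1','a2','a3','a4']]
print(sorted(map(sorted, maximal)))  # [['a1','a2','a3','a4']]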
  • 331.
331 Computational Complexity of Frequent Itemset Mining  How many itemsets may potentially be generated in the worst case?  The number of frequent itemsets to be generated is sensitive to the minsup threshold  When minsup is low, there exist potentially an exponential number of frequent itemsets  The worst case: M^N, where M is the number of distinct items and N is the max transaction length  Worst-case complexity vs. expected probability  Ex. Suppose Walmart has 10^4 kinds of products  The chance of picking up one particular product: 10^-4  The chance of picking up a particular set of 10 products: ~10^-40  What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?
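A back-of-the-envelope check of these numbers (under the slide's simplistic independence assumption):

p_item = 10 ** -4            # chance a given product is picked
p_set = p_item ** 10         # chance a particular 10-product set is picked: 10^-40
n_transactions = 10 ** 9
expected_count = n_transactions * p_set
print(expected_count)        # 1e-31: expected number of occurrences
# By Markov's inequality, P(count >= 1000) <= expected_count / 1000 = 1e-34,
# so this particular 10-item set being frequent 10^3 times is essentially impossible.
print(expected_count / 1000)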
  • 332.
332 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 333.
333 Scalable Frequent Itemset Mining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format
  • 334.
    334 The Downward ClosureProperty and Scalable Mining Methods  The downward closure property of frequent patterns  Any subset of a frequent itemset must be frequent  If {beer, diaper, nuts} is frequent, so is {beer, diaper}  i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}  Scalable mining methods: Three major approaches  Apriori (Agrawal & Srikant@VLDB’94)  Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
  • 335.
    335 Apriori: A CandidateGeneration & Test Approach  Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)  Method:  Initially, scan DB once to get frequent 1-itemset  Generate length (k+1) candidate itemsets from length k frequent itemsets  Test the candidates against DB  Terminate when no frequent or candidate set can be generated
  • 336.
336 The Apriori Algorithm—An Example (min_sup = 2)
Database TDB: Tid | Items — 10 | A, C, D; 20 | B, C, E; 30 | A, B, C, E; 40 | B, E
1st scan, C1 → L1: {A}:2, {B}:3, {C}:3, {E}:3 ({D}:1 pruned)
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}; 3rd scan, L3: {B,C,E}:2
  • 337.
337 The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
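A compact Python rendering of this pseudo-code, checked against the four-transaction example on the previous slide (min_sup = 2); the candidate join here uses set unions rather than the prefix-based self-join, which is an implementation shortcut:

from itertools import combinations

def apriori(db, min_sup):
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items if sum(i in t for t in db) >= min_sup}]
    result = set(L[0])
    k = 1
    while L[-1]:
        # candidate generation: join Lk with itself, then Apriori-prune
        cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in L[-1] for s in combinations(c, k))}
        # support counting against the database
        next_L = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        L.append(next_L)
        result |= next_L
        k += 1
    return result

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for itemset in sorted(apriori(db, 2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
# {A},{B},{C},{E},{A,C},{B,C},{B,E},{C,E},{B,C,E} -- matching the example slide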
  • 338.
338 Implementation of Apriori  How to generate candidates?  Step 1: self-join Lk  Step 2: pruning  Example of candidate generation  L3 = {abc, abd, acd, ace, bcd}  Self-joining: L3*L3  abcd from abc and abd  acde from acd and ace  Pruning:  acde is removed because ade is not in L3  C4 = {abcd}
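The L3 → C4 example can be reproduced with the classic prefix-based self-join plus pruning; a small self-contained sketch:

from itertools import combinations

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
k = 3

def self_join(Lk, k):
    # classic Apriori join: merge two k-itemsets that share their first k-1 items
    out = set()
    as_tuples = [tuple(sorted(s)) for s in Lk]
    for p in as_tuples:
        for q in as_tuples:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                out.add(frozenset(p) | {q[k - 1]})
    return out

joined = self_join(L3, k)
C4 = {c for c in joined if all(frozenset(s) in L3 for s in combinations(c, k))}

print(sorted("".join(sorted(c)) for c in joined))  # ['abcd', 'acde']
print(sorted("".join(sorted(c)) for c in C4))      # ['abcd']  (acde pruned: ade not in L3)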
  • 339.
339 How to Count Supports of Candidates?  Why is counting supports of candidates a problem?  The total number of candidates can be very large  One transaction may contain many candidates  Method:  Candidate itemsets are stored in a hash tree  A leaf node of the hash tree contains a list of itemsets and counts  An interior node contains a hash table  Subset function: finds all the candidates contained in a transaction
  • 340.
340 Counting Supports of Candidates Using a Hash Tree  Subset function hashes on items: {1,4,7}, {2,5,8}, {3,6,9}  (Figure: hash tree of 3-itemset candidates; transaction 1 2 3 5 6 is decomposed as 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6 to locate the matching leaves)
  • 341.
341 Candidate Generation: An SQL Implementation  Suppose the items in Lk-1 are listed in an order  Step 1: self-join Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98]
  • 342.
    342 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 343.
    343 Further Improvement ofthe Apriori Method  Major computational challenges  Multiple scans of transaction database  Huge number of candidates  Tedious workload of support counting for candidates  Improving Apriori: general ideas  Reduce passes of transaction database scans  Shrink number of candidates  Facilitate support counting of candidates
  • 344.
Partition: Scan the Database Only Twice  Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB  Scan 1: partition the database and find local frequent patterns  Scan 2: consolidate global frequent patterns  A. Savasere, E. Omiecinski and S. Navathe, VLDB'95  (Figure: DB = DB1 + DB2 + … + DBk; if supj(i) < σ·|DBj| for every partition DBj, then sup(i) < σ·|DB|)
  • 345.
345 DHP: Reduce the Number of Candidates  A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent  Candidates: a, b, c, d, e  Hash entries  {ab, ad, ae}  {bd, be, de}  …  Frequent 1-itemsets: a, b, d, e  ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold  J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95  (Figure: hash table mapping bucket itemsets to counts, e.g., 35, 88, 102, …)
  • 346.
    346 Sampling for FrequentPatterns  Select a sample of original database, mine frequent patterns within sample using Apriori  Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked  Example: check abcd instead of ab, ac, …, etc.  Scan database again to find missed frequent patterns  H. Toivonen. Sampling large databases for association rules. In VLDB’96
  • 347.
347 DIC: Reduce the Number of Scans  Once both A and D are determined frequent, the counting of AD begins  Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins  S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97  (Figure: itemset lattice from {} up to ABCD, plus a transaction timeline comparing when Apriori vs. DIC start counting 1-, 2-, and 3-itemsets)
  • 348.
    348 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 349.
    349 Pattern-Growth Approach: MiningFrequent Patterns Without Candidate Generation  Bottlenecks of the Apriori approach  Breadth-first (i.e., level-wise) search  Candidate generation and test  Often generates a huge number of candidates  The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)  Depth-first search  Avoid explicit candidate generation  Major philosophy: Grow long patterns from short ones using local frequent items only  “abc” is a frequent pattern  Get all transactions having “abc”, i.e., project DB on abc: DB|abc  “d” is a local frequent item in DB|abc  abcd is a frequent pattern
  • 350.
350 Construct an FP-tree from a Transaction Database (min_support = 3)
TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree
(Figure: FP-tree rooted at {} with paths f:4–c:3–a:3–m:2–p:2, a:3–b:1–m:1, f:4–b:1, and c:1–b:1–p:1, plus a header table f 4, c 4, a 3, b 3, m 3, p 3 with node links)
  • 351.
351 Partition Patterns and Databases  Frequent patterns can be partitioned into subsets according to the F-list  F-list = f-c-a-b-m-p  Patterns containing p  Patterns having m but no p  …  Patterns having c but none of a, b, m, p  Pattern f  Completeness and non-redundancy
  • 352.
352 Find Patterns Having p From p's Conditional Database  Start at the frequent-item header table of the FP-tree  Traverse the FP-tree by following the links of each frequent item p  Accumulate all the transformed prefix paths of item p to form p's conditional pattern base
Conditional pattern bases: item | cond. pattern base — c | f:3; a | fc:3; b | fca:1, f:1, c:1; m | fca:2, fcab:1; p | fcam:2, cb:1
  • 353.
353 From Conditional Pattern Bases to Conditional FP-trees  For each pattern base  Accumulate the count for each item in the base  Construct the FP-tree for the frequent items of the pattern base  m-conditional pattern base: fca:2, fcab:1  m-conditional FP-tree: {} – f:3 – c:3 – a:3  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
  • 354.
    354 Recursion: Mining EachConditional FP-tree {} f:3 c:3 a:3 m-conditional FP-tree Cond. pattern base of “am”: (fc:3) {} f:3 c:3 am-conditional FP-tree Cond. pattern base of “cm”: (f:3) {} f:3 cm-conditional FP-tree Cond. pattern base of “cam”: (f:3) {} f:3 cam-conditional FP-tree
  • 355.
355 A Special Case: Single Prefix Path in an FP-tree  Suppose a (conditional) FP-tree T has a shared single prefix path P  Mining can be decomposed into two parts  Reduction of the single prefix path into one node  Concatenation of the mining results of the two parts  (Figure: a tree with prefix path a1:n1–a2:n2–a3:n3 and branching part b1:m1, C1:k1, C2:k2, C3:k3, decomposed into the prefix r1 plus the branching subtree)
  • 356.
    356 Benefits of theFP-tree Structure  Completeness  Preserve complete information for frequent pattern mining  Never break a long pattern of any transaction  Compactness  Reduce irrelevant info—infrequent items are gone  Items in frequency descending order: the more frequently occurring, the more likely to be shared  Never be larger than the original database (not count node-links and the count field)
  • 357.
357 The Frequent Pattern Growth Mining Method  Idea: frequent pattern growth  Recursively grow frequent patterns by pattern and database partition  Method  For each frequent item, construct its conditional pattern base and then its conditional FP-tree  Repeat the process on each newly created conditional FP-tree  Until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
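A minimal pattern-growth sketch for the running example (min_support = 3). For brevity it recurses on projected (conditional) databases kept as plain lists rather than on a compressed FP-tree, so it illustrates the divide-and-conquer idea, not the FP-tree data structure itself:

from collections import Counter

def pattern_growth(db, min_sup, prefix=frozenset()):
    counts = Counter(item for t in db for item in set(t))
    frequent = {}
    for item, cnt in sorted(counts.items()):
        if cnt < min_sup:
            continue
        pattern = prefix | {item}
        frequent[pattern] = cnt
        # conditional (projected) database for `pattern`: transactions containing
        # `item`, restricted to locally frequent items later in a fixed item order
        projected = [[i for i in t if i > item and counts[i] >= min_sup]
                     for t in db if item in t]
        frequent.update(pattern_growth(projected, min_sup, pattern))
    return frequent

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
for pat, cnt in sorted(pattern_growth(db, 3).items(),
                       key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(pat)), cnt)
# e.g. fcam is reported with count 3, matching the conditional-pattern-base slides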
  • 358.
    358 Scaling FP-growth byDatabase Projection  What about if FP-tree cannot fit in memory?  DB projection  First partition a database into a set of projected DBs  Then construct and mine FP-tree for each projected DB  Parallel projection vs. partition projection techniques  Parallel projection  Project the DB in parallel for each frequent item  Parallel projection is space costly  All the partitions can be processed in parallel  Partition projection  Partition the DB based on the ordered frequent items  Passing the unprocessed parts to the subsequent partitions
  • 359.
359 Partition-Based Projection  Parallel projection needs a lot of disk space  Partition projection saves it  (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} split into projected DBs, e.g., p-proj DB = {fcam, cb, fcam}, m-proj DB = {fcab, fca, fca}, b-proj DB = {f, cb, …}, and further into am-, cm-proj DBs, etc.)
  • 360.
Performance of FP-Growth in Large Datasets  (Figures: runtime vs. support threshold — D1: FP-growth vs. Apriori on data set T25I20D10K; D2: FP-growth vs. TreeProjection on data set T25I20D100K)
  • 361.
    361 Advantages of thePattern Growth Approach  Divide-and-conquer:  Decompose both the mining task and DB according to the frequent patterns obtained so far  Lead to focused search of smaller databases  Other factors  No candidate generation, no candidate test  Compressed database: FP-tree structure  No repeated scan of entire database  Basic ops: counting local freq items and building sub FP-tree, no pattern search and matching  A good open-source implementation and refinement of FPGrowth  FPGrowth+ (Grahne and J. Zhu, FIMI'03)
  • 362.
    362 Further Improvements ofMining Methods  AFOPT (Liu, et al. @ KDD’03)  A “push-right” method for mining condensed frequent pattern (CFP) tree  Carpenter (Pan, et al. @ KDD’03)  Mine data sets with small rows but numerous columns  Construct a row-enumeration tree for efficient mining  FPgrowth+ (Grahne and Zhu, FIMI’03)  Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003  TD-Close (Liu, et al, SDM’06)
  • 363.
    363 Extension of PatternGrowth Mining Methodology  Mining closed frequent itemsets and max-patterns  CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)  Mining sequential patterns  PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)  Mining graph patterns  gSpan (ICDM’02), CloseGraph (KDD’03)  Constraint-based mining of frequent patterns  Convertible constraints (ICDE’01), gPrune (PAKDD’03)  Computing iceberg data cubes with complex measures  H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)  Pattern-growth-based Clustering  MaPle (Pei, et al., ICDM’03)  Pattern-Growth-Based Classification  Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
  • 364.
    364 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 365.
365 ECLAT: Mining by Exploring the Vertical Data Format  Vertical format: t(AB) = {T11, T25, …}  tid-list: list of transaction ids containing an itemset  Deriving frequent patterns based on vertical intersections  t(X) = t(Y): X and Y always happen together  t(X) ⊆ t(Y): a transaction having X always has Y  Using diffsets to accelerate mining  Only keep track of differences of tids  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}  Diffset(XY, X) = {T2}  Eclat (Zaki et al. @KDD'97)  Mining closed patterns using the vertical format: CHARM (Zaki & Hsiao @SDM'02)
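A small sketch of the vertical representation on the earlier four-transaction example (tid-lists, intersection-based support, and the slide's diffset example); names are illustrative:

db = {10: {'a', 'c', 'd'}, 20: {'b', 'c', 'e'}, 30: {'a', 'b', 'c', 'e'}, 40: {'b', 'e'}}

# build tid-lists (vertical format)
tidlist = {}
for tid, items in db.items():
    for i in items:
        tidlist.setdefault(i, set()).add(tid)

# support of an itemset = size of the intersection of its members' tid-lists
def t(itemset):
    tids = set(db)
    for i in itemset:
        tids &= tidlist[i]
    return tids

min_sup = 2
print([i for i, tids in tidlist.items() if len(tids) >= min_sup])  # frequent items
print(sorted(t({'b', 'e'})))     # [20, 30, 40] -> support 3
print(sorted(t({'c', 'e'})))     # [20, 30]     -> support 2

# diffset: store only what is lost when extending X to XY
t_X, t_XY = {1, 2, 3}, {1, 3}    # the slide's t(X) = {T1,T2,T3}, t(XY) = {T1,T3}
print(t_X - t_XY)                # {2} == Diffset(XY, X)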
  • 366.
    366 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 367.
    Mining Frequent ClosedPatterns: CLOSET  Flist: list of all frequent items in support ascending order  Flist: d-a-f-e-c  Divide search space  Patterns having d  Patterns having d but no a, etc.  Find frequent closed pattern recursively  Every transaction having d also has cfa  cfad is a frequent closed pattern  J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00. TID Items 10 a, c, d, e, f 20 a, b, e 30 c, e, f 40 a, c, d, f 50 c, e, f Min_sup=2
  • 368.
CLOSET+: Mining Closed Itemsets by Pattern-Growth  Itemset merging: if Y appears in every occurrence of X, then Y is merged with X  Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set enumeration tree can be pruned  Hybrid tree projection  Bottom-up physical tree-projection  Top-down pseudo tree-projection  Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels  Efficient subset checking
  • 369.
MaxMiner: Mining Max-Patterns  1st scan: find frequent items  A, B, C, D, E  2nd scan: find support for the potential max-patterns  AB, AC, AD, AE, ABCDE  BC, BD, BE, BCDE  CD, CE, CDE, DE  Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan  R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98  Tid | Items: 10 | A, B, C, D, E; 20 | B, C, D, E; 30 | A, C, D, F
  • 370.
CHARM: Mining by Exploring the Vertical Data Format  Vertical format: t(AB) = {T11, T25, …}  tid-list: list of transaction ids containing an itemset  Deriving closed patterns based on vertical intersections  t(X) = t(Y): X and Y always happen together  t(X) ⊆ t(Y): a transaction having X always has Y  Using diffsets to accelerate mining  Only keep track of differences of tids  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}  Diffset(XY, X) = {T2}  Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al.)
  • 373.
    373 Visualization of AssociationRules (SGI/MineSet 3.0)
  • 374.
374 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 375.
375 Interestingness Measure: Correlations (Lift)  play basketball ⇒ eat cereal [40%, 66.7%] is misleading  The overall % of students eating cereal is 75% > 66.7%  play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence  Measure of dependent/correlated events: lift  lift = P(A ∪ B) / (P(A) P(B))  lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
| Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000
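The two lift values can be verified directly from the contingency table:

n = 5000
basketball, cereal, both = 3000, 3750, 2000
not_cereal, basketball_not_cereal = 1250, 1000

def lift(p_ab, p_a, p_b):
    # lift > 1: positively correlated, < 1: negatively correlated, = 1: independent
    return p_ab / (p_a * p_b)

print(round(lift(both / n, basketball / n, cereal / n), 2))                       # 0.89
print(round(lift(basketball_not_cereal / n, basketball / n, not_cereal / n), 2))  # 1.33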
  • 376.
376 Are lift and χ2 Good Measures of Correlation?  “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk  Support and confidence are not good indicators of correlation  Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)  Which are the good ones?
  • 378.
378 Comparison of Interestingness Measures  Null-(transaction) invariance is crucial for correlation analysis  Lift and χ2 are not null-invariant  5 null-invariant measures, including the Kulczynski measure (1927)  Null-transactions w.r.t. m and c  Subtle: the measures disagree
| Milk | No Milk | Sum (row)
Coffee | m, c | ~m, c | c
No Coffee | m, ~c | ~m, ~c | ~c
Sum (col.) | m | ~m |
  • 379.
379 Analysis of DBLP Coauthor Relationships  Advisor–advisee relation: Kulc high, coherence low, cosine middle  Recent DB conferences, removing balanced associations, low support, etc.  Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007
  • 380.
Which Null-Invariant Measure Is Better?  IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications  Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6  D4 is balanced & neutral  D5 is imbalanced & neutral  D6 is very imbalanced & neutral
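A sketch of the two measures on raw counts, assuming the usual definitions Kulc(A, B) = (P(A|B) + P(B|A))/2 and IR(A, B) = |s(A) − s(B)| / (s(A) + s(B) − s(A ∪ B)); the example counts are illustrative, not the D4–D6 datasets themselves:

def kulc(n_a, n_b, n_ab):
    # (P(A|B) + P(B|A)) / 2 computed from raw occurrence counts
    return (n_ab / n_a + n_ab / n_b) / 2

def imbalance_ratio(n_a, n_b, n_ab):
    # 0 = perfectly balanced, close to 1 = very skewed
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# balanced and neutral: A and B each occur 1000 times, 500 times together
print(kulc(1000, 1000, 500), imbalance_ratio(1000, 1000, 500))   # 0.5, 0.0
# very imbalanced yet still neutral by Kulczynski: A rare, B very common
print(round(kulc(1010, 101000, 1000), 2),
      round(imbalance_ratio(1010, 101000, 1000), 2))             # ~0.5, ~0.99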
  • 381.
381 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 382.
382 Summary  Basic concepts: association rules, the support–confidence framework, closed and max-patterns  Scalable frequent pattern mining methods  Apriori (candidate generation & test)  Projection-based (FPgrowth, CLOSET+, ...)  Vertical format approach (ECLAT, CHARM, ...)  Which patterns are interesting?  Pattern evaluation methods
  • 383.
    383 Ref: Basic Conceptsof Frequent Pattern Mining  (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93  (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98  (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99  (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
  • 384.
    384 Ref: Apriori andIts Improvements  R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94  H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94  A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95  J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95  H. Toivonen. Sampling large databases for association rules. VLDB'96  S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97  S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98
  • 385.
    385 Ref: Depth-First, Projection-BasedFP Mining  R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002.  G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. FIMI'03  B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL, Nov. 2003  J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD’ 00  J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02  J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02  J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03
  • 386.
    386 Ref: Vertical Formatand Row Enumeration Methods  M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.  M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02.  C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.  F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.  H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
  • 387.
    387 Ref: Mining Correlationsand Interesting Rules  S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.  M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.  R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic, 2001.  C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.  P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.  E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.  T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371- 397, 2010
  • 388.
    388 388 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 7 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved.
  • 389.
  • 390.
    390 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 392.
    392 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 393.
393 Mining Multiple-Level Association Rules  Items often form hierarchies  Flexible support settings  Items at lower levels are expected to have lower support  Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)  Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%  Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%  Ex.: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
  • 394.
    394 Multi-level Association: FlexibleSupport and Redundancy filtering  Flexible min-support thresholds: Some items are more valuable but less frequent  Use non-uniform, group-based min-support  E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …  Redundancy Filtering: Some rules may be redundant due to “ancestor” relationships between items  milk  wheat bread [support = 8%, confidence = 70%]  2% milk  wheat bread [support = 2%, confidence = 72%] The first rule is an ancestor of the second rule  A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
  • 395.
    395 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 396.
396 Mining Multi-Dimensional Associations  Single-dimensional rules: buys(X, “milk”) ⇒ buys(X, “bread”)  Multi-dimensional rules: ≥ 2 dimensions or predicates  Inter-dimension assoc. rules (no repeated predicates): age(X, “19-25”) ^ occupation(X, “student”) ⇒ buys(X, “coke”)  Hybrid-dimension assoc. rules (repeated predicates): age(X, “19-25”) ^ buys(X, “popcorn”) ⇒ buys(X, “coke”)  Categorical attributes: finite number of possible values, no ordering among values—data cube approach  Quantitative attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches
  • 397.
    397 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 398.
    398 Mining Quantitative Associations Techniquescan be categorized by how numerical attributes, such as age or salary are treated 1. Static discretization based on predefined concept hierarchies (data cube methods) 2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96) 3. Clustering: Distance-based association (e.g., Yang & Miller@SIGMOD97)  One dimensional clustering then association 4. Deviation: (such as Aumann and Lindell@KDD99) Sex = female => Wage: mean=$7/hr (overall mean = $9)
  • 399.
    399 Static Discretization ofQuantitative Attributes  Discretized prior to mining using concept hierarchy.  Numeric values are replaced by ranges  In relational database, finding all frequent k-predicate sets will require k or k+1 table scans  Data cube is well suited for mining  The cells of an n-dimensional cuboid correspond to the predicate sets  Mining from data cubes can be much faster (income) (age) () (buys) (age, income) (age,buys) (income,buys) (age,income,buys)
  • 400.
    400 Quantitative Association RulesBased on Statistical Inference Theory [Aumann and Lindell@DMKD’03]  Finding extraordinary and therefore interesting phenomena, e.g., (Sex = female) => Wage: mean=$7/hr (overall mean = $9)  LHS: a subset of the population  RHS: an extraordinary behavior of this subset  The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence  Subrule: highlights the extraordinary behavior of a subset of the pop. of the super rule  E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr  Two forms of rules  Categorical => quantitative rules, or Quantitative => quantitative rules  E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr  Open problem: Efficient methods for LHS containing two or more quantitative attributes
  • 401.
    401 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 402.
    402 Negative and RarePatterns  Rare patterns: Very low support but interesting  E.g., buying Rolex watches  Mining: Setting individual-based or special group- based support threshold for valuable items  Negative patterns  Since it is unlikely that one buys Ford Expedition (an SUV car) and Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns  Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent
  • 403.
403 Defining Negatively Correlated Patterns (I)  Definition 1 (support-based)  If itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) < sup(X) × sup(Y)  Then X and Y are negatively correlated  Problem: A store sold needle packages A and B 100 times each, but only one transaction contained both A and B  When there are in total 200 transactions, we have s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)  When there are 10^5 transactions, we have s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3, so s(A ∪ B) > s(A) × s(B)  Where is the problem? — Null transactions, i.e., the support-based definition is not null-invariant!
  • 404.
404 Defining Negatively Correlated Patterns (II)  Definition 2 (negative-itemset-based)  X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ  Itemset X is negatively correlated if …  This definition suffers from a similar null-invariance problem  Definition 3 (Kulczynski-measure-based)  If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < є, where є is a negative-pattern threshold, then X and Y are negatively correlated  Ex. For the same needle-package problem, whether there are 200 or 10^5 transactions, if є = 0.01 we have (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 ≤ є
  • 405.
    405 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 406.
    406 Constraint-based (Query-Directed) Mining Finding all the patterns in a database autonomously? — unrealistic!  The patterns could be too many but not focused!  Data mining should be an interactive process  User directs what to be mined using a data mining query language (or a graphical user interface)  Constraint-based mining  User flexibility: provides constraints on what to be mined  Optimization: explores such constraints for efficient mining — constraint-based mining: constraint-pushing, similar to push selection first in DB query processing  Note: still find all the answers satisfying constraints, not finding some answers in “heuristic search”
  • 407.
407 Constraints in Data Mining  Knowledge type constraint:  classification, association, etc.  Data constraint — using SQL-like queries  find product pairs sold together in stores in Chicago this year  Dimension/level constraint  in relevance to region, price, brand, customer category  Rule (or pattern) constraint  small sales (price < $10) triggers big sales (sum > $200)  Interestingness constraint  strong rules: min_support ≥ 3%, min_confidence ≥ 60%
  • 408.
Meta-Rule Guided Mining  A meta-rule can be in rule form with partially instantiated predicates and constants: P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)  The resulting derived rule can be: age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)  In general, it can be in the form P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr  Method to find meta-rules  Find frequent (l+r) predicates (based on a min-support threshold)  Push constants deeply into the mining process when possible (see the remaining discussion of constraint-pushing techniques)  Use confidence, correlation, and other measures when …
  • 409.
409 Constraint-Based Frequent Pattern Mining  Pattern space pruning constraints  Anti-monotonic: if constraint c is violated, further mining can be terminated  Monotonic: if c is satisfied, no need to check c again  Succinct: c must be satisfied, so one can start with the data sets satisfying c  Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one if items in the transaction can be properly ordered  Data space pruning constraints  Data succinct: the data space can be pruned at the initial pattern mining process  Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining
  • 410.
410 Pattern Space Pruning with Anti-Monotonicity Constraints  A constraint C is anti-monotone if, whenever an itemset S violates C, so does any of its supersets (equivalently: if a pattern satisfies C, so do all of its sub-patterns)  Ex. 1. sum(S.price) ≤ v is anti-monotone  Ex. 2. range(S.profit) ≤ 15 is anti-monotone  Itemset ab violates C  So does every superset of ab  Ex. 3. sum(S.price) ≥ v is not anti-monotone  Ex. 4. support count is anti-monotone: the core property used in Apriori
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f; 20 | b, c, d, f, g, h; 30 | a, c, d, e, f; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d 10, e −30, f 30, g 20, h −10
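Ex. 2 can be checked directly on the profit table above; once {a, b} violates range(S.profit) ≤ 15, every superset can be skipped (an illustrative sketch):

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}

def satisfies_range_le(itemset, v=15):
    # anti-monotone constraint: range of the item profits must stay <= v
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= v

print(satisfies_range_le({'a', 'b'}))        # False: range = 40 - 0 = 40 > 15
print(satisfies_range_le({'a', 'b', 'c'}))   # False: a superset can only widen the range
print(satisfies_range_le({'b', 'd'}))        # True:  range = 10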
  • 411.
411 Pattern Space Pruning with Monotonicity Constraints  A constraint C is monotone if, once a pattern satisfies C, we do not need to check C in subsequent mining  Alternatively: if an itemset S satisfies the constraint, so does any of its supersets  Ex. 1. sum(S.price) ≥ v is monotone  Ex. 2. min(S.price) ≤ v is monotone  Ex. 3. C: range(S.profit) ≥ 15  Itemset ab satisfies C  So does every superset of ab  (Same TDB (min_sup = 2) and profit table as on the previous slide)
  • 412.
412 Data Space Pruning with Data Anti-Monotonicity  A constraint c is data anti-monotone if, whenever a pattern p cannot satisfy a transaction t under c, no superset of p can satisfy t under c either  The key to data anti-monotonicity is recursive data reduction  Ex. 1. sum(S.price) ≥ v is data anti-monotone  Ex. 2. min(S.price) ≤ v is data anti-monotone  Ex. 3. C: range(S.profit) ≥ 25 is data anti-monotone  Itemset {b, c}'s projected DB:  T10': {d, f, h}, T20': {d, f, g, h}, T30': {d, f, g}  Since C cannot be satisfied within T10', T10' can be pruned
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f, h; 20 | b, c, d, f, g, h; 30 | b, c, d, f, g; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d −15, e −30, f −10, g 20, h −5
  • 413.
413 Pattern Space Pruning with Succinctness  Succinctness:  Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1  Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database  min(S.price) ≤ v is succinct  sum(S.price) ≥ v is not succinct  Optimization: if C is succinct, C is pre-counting pushable
  • 414.
    414 Naïve Algorithm: Apriori+ Constraint TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 Database D itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 Scan D C1 L1 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 L2 C2 C2 Scan D C3 L3 itemset {2 3 5} Scan D itemset sup {2 3 5} 2 Constraint: Sum{S.price} < 5
  • 415.
    415 Constrained Apriori :Push a Succinct Constraint Deep TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 Database D itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 Scan D C1 L1 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 L2 C2 C2 Scan D C3 L3 itemset {2 3 5} Scan D itemset sup {2 3 5} 2 Constraint: min{S.price } <= 1 not immediately to be used
  • 416.
    416 Constrained FP-Growth: Pusha Succinct Constraint Deep Constraint: min{S.price } <= 1 TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 TID Items 100 1 3 200 2 3 5 300 1 2 3 5 400 2 5 Remove infrequent length 1 FP-Tree TID Items 100 3 4 300 2 3 5 1-Projected DB No Need to project on 2, 3, or 5
  • 417.
    417 Constrained FP-Growth: Pusha Data Anti-monotonic Constraint Deep Constraint: min{S.price } <= 1 TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 TID Items 100 1 3 300 1 3 FP-Tree Single branch, we are done Remove from data
  • 418.
    418 Constrained FP-Growth: Pusha Data Anti-monotonic Constraint Deep Constraint: range{S.price } > 25 min_sup >= 2 FP-Tree TID Transaction 10 a, c, d, f, h 20 c, d, f, g, h 30 c, d, f, g B-Projected DB B FP-Tree TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g Item Profit a 40 b 0 c -20 d -15 e -30 f -10 g 20 h -5 Recursive Data Pruning Single branch: bcdfg: 2
  • 419.
419 Convertible Constraints: Ordering Data in Transactions  Convert tough constraints into anti-monotone or monotone ones by properly ordering items  Examine C: avg(S.profit) ≥ 25  Order items in value-descending order  <a, f, g, d, b, h, c, e>  If an itemset afb violates C  So do afbh, afb*  It becomes anti-monotone!
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f; 20 | b, c, d, f, g, h; 30 | a, c, d, e, f; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d 10, e −30, f 30, g 20, h −10
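The example can be checked numerically: under the value-descending order R, once {a, f, b} violates avg(S.profit) ≥ 25, extending it with items later in R can only lower the average (an illustrative sketch):

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}
R = ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']          # value-descending order

def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)

afb = ['a', 'f', 'b']
print(round(avg_profit(afb), 1))                      # 23.3 < 25 -> afb violates C
# every item after 'b' in R has profit <= profit['b'], so extensions only lower the average
later = R[R.index('b') + 1:]
print(all(avg_profit(afb + [x]) < 25 for x in later))  # True -> anti-monotone under R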
  • 420.
420 Strongly Convertible Constraints
 avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
 If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
 avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R^-1: <e, c, h, b, d, g, f, a>
 If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
 Thus, avg(X) ≥ 25 is strongly convertible
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
  • 421.
421 Can Apriori Handle Convertible Constraints?
 A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
 Within the level-wise framework, no direct pruning based on the constraint can be made
 Itemset df violates constraint C: avg(X) ≥ 25
 Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
 But the constraint can be pushed into frequent-pattern growth mining
Item values: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
  • 422.
422 Pattern Space Pruning with Convertible Constraints
 C: avg(X) ≥ 25, min_sup = 2
 List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
 C is convertible anti-monotone w.r.t. R
 Scan TDB once and remove infrequent items: item h is dropped; itemsets a and f are good, …
 Projection-based mining: impose an appropriate order on item projection
 Many tough constraints can be converted into (anti-)monotone constraints (see the sketch below)
TDB (min_sup = 2), TID: transaction: 10: a, f, d, b, c; 20: f, g, d, b, c; 30: a, f, d, c, e; 40: f, g, h, c, e
Item values: a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30
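A small sketch of why the value-descending order R makes avg(X) ≥ 25 behave anti-monotonically: along R, every later item has a value no larger than the items already chosen, so the running average can only drop; once a prefix violates the constraint, the whole branch can be pruned. The helper violates_avg is illustrative.

```python
# Sketch: avg(X) >= 25 becomes anti-monotone once items are explored in
# value-descending order R (values from the slide's table).
VALUE = {'a': 40, 'f': 30, 'g': 20, 'd': 10, 'b': 0, 'h': -10, 'c': -20, 'e': -30}
R = sorted(VALUE, key=VALUE.get, reverse=True)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

def violates_avg(prefix, threshold=25):
    return sum(VALUE[i] for i in prefix) / len(prefix) < threshold

# Growing patterns only as prefixes of R: every item appended later has a
# value no larger than those already in the prefix, so the average can only
# drop. Once a prefix violates the constraint, all its extensions do too.
prefix = []
for item in R:
    prefix.append(item)
    if violates_avg(prefix):
        print('prune at prefix', prefix)   # prune at ['a', 'f', 'g', 'd', 'b']
        break
```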
  • 423.
423 Handling Multiple Constraints
 Different constraints may require different, or even conflicting, item orderings
 If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
 If the item orders conflict:
 Try to satisfy one constraint first
 Then use the order of the other constraint to mine frequent itemsets in the corresponding projected database
  • 424.
424 What Constraints Are Convertible?
Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | Yes | Yes | Yes
median(S) ≤ v, ≥ v | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
……
  • 425.
425 Constraint-Based Mining: A General Picture
Constraint | Anti-monotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no
  • 426.
    426 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 427.
427 Mining Colossal Frequent Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, "Mining Colossal Frequent Patterns by Core Pattern Fusion", ICDE'07
 We have many algorithms, but can we mine large (i.e., colossal) patterns, say of size around 50 to 100? Unfortunately, not!
 Why not? The curse of the "downward closure" property of frequent patterns
 The "downward closure" property: any sub-pattern of a frequent pattern is frequent
 Example: if (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
 Whether we use breadth-first search (e.g., Apriori) or depth-first search (e.g., FPgrowth), we have to examine this many patterns
 Thus the downward closure property leads to explosion!
  • 428.
428 Colossal Patterns: A Motivating Example
 Let's make a set of 40 transactions, T1 = T2 = … = T40 = {1, 2, 3, 4, …, 39, 40}, then delete the items on the diagonal (remove item i from Ti), giving T1 = {2, 3, …, 40}, T2 = {1, 3, 4, …, 40}, …, T40 = {1, 2, …, 39}
 Let the minimum support threshold be σ = 20
 There are C(40, 20) frequent patterns of size 20, and each is closed and maximal
 In general, # patterns = C(n, n/2) ≈ 2^n / sqrt(πn/2), so the size of the answer set is exponential in n
 Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
  • 429.
    429 Colossal Pattern Set:Small but Interesting  It is often the case that only a small number of patterns are colossal, i.e., of large size  Colossal patterns are usually attached with greater importance than those of small pattern sizes
  • 430.
430 Mining Colossal Patterns: Motivation and Philosophy
 Motivation: many real-world tasks need mining colossal patterns
 Micro-array analysis in bioinformatics (when support is low)
 Biological sequence patterns
 Biological/sociological/information graph pattern mining
 No hope for completeness
 If the mining of mid-sized patterns is explosive in size, there is no hope of finding colossal patterns efficiently by insisting on the "complete set" mining philosophy
 Jumping out of the swamp of mid-sized results
 What we need is a philosophy that jumps out of the swamp of mid-sized results, which are explosive in size, and reaches colossal patterns directly
 Striving for mining almost complete colossal patterns
 The key is to develop a mechanism that may quickly reach colossal patterns and discover most of them
  • 431.
431 Alas, A Show of Colossal Pattern Mining!
 Transactions: T1 = {2, 3, 4, …, 39, 40}, T2 = {1, 3, 4, …, 39, 40}, …, T40 = {1, 2, 3, 4, …, 39} (the diagonal-deleted set from the earlier example), plus T41 = T42 = … = T60 = {41, 42, 43, …, 79}
 Let the min-support threshold be σ = 20
 Then there are C(40, 20) closed/maximal frequent patterns of size 20
 However, there is only one with size greater than 20 (i.e., colossal): α = {41, 42, …, 79}, of size 39
 The existing fastest mining algorithms (e.g., FPClose, LCM) fail to complete running
 Our algorithm outputs this colossal pattern in seconds
  • 432.
432 Methodology of the Pattern-Fusion Strategy
 Pattern-Fusion traverses the tree in a bounded-breadth way
 It always pushes down a frontier of a bounded-size candidate pool
 Only a fixed number of patterns in the current candidate pool are used as starting nodes to go down the pattern tree, thus avoiding the exponential search space
 Pattern-Fusion identifies "shortcuts" whenever possible
 Pattern growth is not performed by single-item addition but by leaps and bounds: agglomeration of multiple patterns in the pool
 These shortcuts direct the search down the tree much more rapidly towards the colossal patterns
  • 433.
433 Observation: Colossal Patterns and Core Patterns
(Figure: transaction database D with a colossal pattern α and its projected databases Dα, Dα1, …, Dαk for subpatterns α1, …, αk)
 Subpatterns α1 to αk cluster tightly around the colossal pattern α by sharing a similar support. We call such subpatterns core patterns of α
  • 434.
434 Robustness of Colossal Patterns
 Core patterns: intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e., |Dα| / |Dβ| ≥ τ, where 0 < τ ≤ 1 is called the core ratio
 Robustness of colossal patterns: a colossal pattern is robust in the sense that it tends to have many more core patterns than a small pattern does
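A hedged sketch of the τ-core-pattern test |Dα| / |Dβ| ≥ τ on a toy database; support_set and is_core_pattern are illustrative names, and the tiny database merely stands in for the slide's counts.

```python
# Sketch: tau-core-pattern test, |D_alpha| / |D_beta| >= tau (0 < tau <= 1).
def support_set(pattern, db):
    return {tid for tid, t in db.items() if set(pattern) <= set(t)}

def is_core_pattern(beta, alpha, db, tau=0.5):
    """beta is a tau-core pattern of alpha if it is a subpattern of alpha and
    keeps a similar support set (assumes beta occurs at least once)."""
    d_alpha, d_beta = support_set(alpha, db), support_set(beta, db)
    return set(beta) <= set(alpha) and len(d_alpha) / len(d_beta) >= tau

db = {1: 'abe', 2: 'abe', 3: 'bcf', 4: 'acf', 5: 'abcef'}
print(is_core_pattern('ab', 'abe', db))    # True:  |D_abe| = 3, |D_ab| = 3
print(is_core_pattern('c', 'abcef', db))   # False: |D_abcef| = 1, |D_c| = 3
```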
  • 435.
435 Example: Core Patterns
 A colossal pattern has far more core patterns than a small-sized pattern
 A colossal pattern has far more core descendants of a smaller size c
 A random draw from the complete set of patterns of size c is therefore more likely to pick a core descendant of a colossal pattern
 A colossal pattern can be generated by merging a set of its core patterns
Transaction (# of Ts) | Core Patterns (τ = 0.5)
(abe) (100) | (abe), (ab), (be), (ae), (e)
(bcf) (100) | (bcf), (bc), (bf)
(acf) (100) | (acf), (ac), (af)
(abcef) (100) | (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
  • 436.
    437 Colossal Patterns Correspondto Dense Balls  Due to their robustness, colossal patterns correspond to dense balls  Ω( 2^d) in population  A random draw in the pattern space will hit somewhere in the ball with high probability
  • 437.
    438 Idea of Pattern-FusionAlgorithm  Generate a complete set of frequent patterns up to a small size  Randomly pick a pattern β, and β has a high probability to be a core-descendant of some colossal pattern α  Identify all α’s descendants in this complete set, and merge all of them ― This would generate a much larger core-descendant of α  In the same fashion, we select K patterns. This set of larger core-descendants will be the candidate pool for the next iteration
  • 438.
    439 Pattern-Fusion: The Algorithm Initialization (Initial pool): Use an existing algorithm to mine all frequent patterns up to a small size, e.g., 3  Iteration (Iterative Pattern Fusion):  At each iteration, k seed patterns are randomly picked from the current pattern pool  For each seed pattern thus picked, we find all the patterns within a bounding ball centered at the seed pattern  All these patterns found are fused together to generate a set of super-patterns. All the super- patterns thus generated form a new pool for the next iteration  Termination: when the current pool contains no more than K patterns at the beginning of an iteration
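A schematic sketch of the Pattern-Fusion loop under simplifying assumptions: the "bounding ball" test below is a crude support-set overlap check, not the paper's exact definition, and all function names are illustrative.

```python
import random

def support_set(pattern, db):
    return frozenset(t for t, items in db.items() if pattern <= items)

def pattern_fusion(db, initial_pool, k=3, K=10, min_sup=2, rounds=5):
    """Sketch of the Pattern-Fusion iteration: pick k seed patterns at random,
    fuse each seed with the pool patterns whose support sets overlap it enough
    (a crude stand-in for the bounding ball), and keep the fused super-pattern
    if it is still frequent."""
    pool = list(initial_pool)
    for _ in range(rounds):
        if len(pool) <= K:               # termination: pool is small enough
            break
        seeds = random.sample(pool, min(k, len(pool)))
        new_pool = []
        for seed in seeds:
            ds = support_set(seed, db)
            ball = [p for p in pool if len(ds & support_set(p, db)) >= min_sup]
            fused = frozenset().union(*ball) if ball else seed
            if len(support_set(fused, db)) >= min_sup:
                new_pool.append(fused)   # a much larger core-descendant
            else:
                new_pool.append(seed)
        pool = new_pool
    return pool

db = {1: frozenset('abcd'), 2: frozenset('abce'), 3: frozenset('abcf'), 4: frozenset('xyz')}
pool = [frozenset(p) for p in ('a', 'b', 'c', 'ab', 'bc', 'ac')]
print(pattern_fusion(db, pool, k=2, K=2))   # two copies of the fused pattern {a, b, c}
```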
  • 439.
    440 Why Is Pattern-FusionEfficient?  A bounded-breadth pattern tree traversal  It avoids explosion in mining mid-sized ones  Randomness comes to help to stay on the right path  Ability to identify “short- cuts” and take “leaps”  fuse small patterns together in one step to generate new patterns of significant sizes  Efficiency
  • 440.
    441 Pattern-Fusion Leads toGood Approximation  Gearing toward colossal patterns  The larger the pattern, the greater the chance it will be generated  Catching outliers  The more distinct the pattern, the greater the chance it will be generated
  • 441.
442 Experimental Setting
 Synthetic data set
 Diag_n: an n x (n-1) table where the ith row has the integers from 1 to n except i; each row is taken as an itemset; min_support is n/2
 Real data sets
 Replace: a program trace data set collected from the "replace" program, widely used in software engineering research
 ALL: a popular gene expression data set, clinical data on ALL-AML leukemia (www.broad.mit.edu/tools/data.html); each item is a column, representing the activity level of a gene/protein in the same sample
 Frequent patterns would reveal important correlations between gene expression patterns and disease outcomes
  • 442.
443 Experiment Results on Diag_n
 LCM run time increases exponentially with pattern size n
 Pattern-Fusion finishes efficiently
 The approximation error of Pattern-Fusion (with min_sup 20) compared with the complete set is rather close to that of uniform sampling (which randomly picks K patterns from the complete answer set)
  • 443.
    444 Experimental Results onALL  ALL: A popular gene expression data set with 38 transactions, each with 866 columns  There are 1736 items in total  The table shows a high frequency threshold of 30
  • 444.
    445 Experimental Results onREPLACE  REPLACE  A program trace data set, recording 4395 calls and transitions  The data set contains 4395 transactions with 57 items in total  With support threshold of 0.03, the largest patterns are of size 44  They are all discovered by Pattern-Fusion with different settings of K and τ, when started with an initial pool of 20948 patterns of size <=3
  • 445.
    446 Experimental Results onREPLACE  Approximation error when compared with the complete mining result  Example. Out of the total 98 patterns of size >=42, when K=100, Pattern-Fusion returns 80 of them  A good approximation to the colossal patterns in the sense that any pattern in the complete set is on average at most 0.17 items away from one of these 80 patterns
  • 446.
    447 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 447.
448 Mining Compressed Patterns: δ-clustering
 Why compressed patterns? There are too many patterns, and many are not meaningful
 Pattern distance measure on supporting-transaction sets
 δ-clustering: for each pattern P, find all patterns that can be expressed by P and whose distance to P is within δ (δ-cover)
 All patterns in the cluster can be represented by P
 Xin et al., "Mining Compressed Frequent-Pattern Sets", VLDB'05
ID | Item-Sets | Support
P1 | {38, 16, 18, 12} | 205227
P2 | {38, 16, 18, 12, 17} | 205211
P3 | {39, 38, 16, 18, 12, 17} | 101758
P4 | {39, 16, 18, 12, 17} | 161563
P5 | {39, 16, 18, 12} | 161576
 Closed frequent patterns: report P1, P2, P3, P4, P5; emphasizes support too much, no compression
 Max-pattern: report only P3; information loss
 A desirable output: P2, P3, P4
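A minimal sketch of the pattern-distance / δ-cover test used by δ-clustering, Dist(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| on supporting-transaction sets. Since the table above gives only support counts, the integer ranges used as transaction-ID sets below are stand-ins.

```python
def pattern_distance(t1, t2):
    """Dist(P1, P2) = 1 - |T(P1) & T(P2)| / |T(P1) | T(P2)|: a Jaccard-style
    distance on supporting-transaction sets."""
    t1, t2 = set(t1), set(t2)
    return 1.0 - len(t1 & t2) / len(t1 | t2)

def delta_covers(rep_items, rep_tids, p_items, p_tids, delta=0.1):
    """Representative P delta-covers P' if P' is a sub-itemset of P (so P can
    'express' P') and their transaction sets are within delta of each other."""
    return set(p_items) <= set(rep_items) and \
           pattern_distance(rep_tids, p_tids) <= delta

# Toy illustration: P2 can represent P1 because their support sets nearly coincide.
T_P1 = set(range(0, 205227))   # stand-in transaction IDs supporting P1
T_P2 = set(range(0, 205211))   # stand-in transaction IDs supporting P2 (a subset of T_P1)
print(delta_covers({38, 16, 18, 12, 17}, T_P2, {38, 16, 18, 12}, T_P1, delta=0.01))  # True
```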
  • 448.
449 Redundancy-Aware Top-k Patterns
 Why redundancy-aware top-k patterns?
 Desired patterns: high significance & low redundancy
 Propose MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set
 Xin et al., Extracting Redundancy-Aware Top-K Patterns, KDD'06
  • 449.
    450 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 450.
How to Understand and Interpret Patterns?
 Not all frequent patterns are useful, only meaningful ones …
 Do they all make sense? What do they mean? How are they useful?
 Example patterns: (diaper, beer) vs. (female, sterile (2), tekele)
 Annotate patterns with semantic information, e.g., morphological info and simple statistics
  • 451.
A Dictionary Analogy: the word "pattern" from Merriam-Webster
 Non-semantic info.
 Definitions indicating semantics
 Examples of usage
 Synonyms
 Related words
  • 452.
Semantic Analysis with Context Models
 Task 1: Model the context of a frequent pattern
 Based on the context model …
 Task 2: Extract the strongest context indicators
 Task 3: Extract representative transactions
 Task 4: Extract semantically similar patterns
  • 453.
Annotating DBLP Co-authorship & Title Patterns
(Figure: database of paper titles and authors, e.g., "Substructure Similarity Search in Graph Databases" by X. Yan, P. Yu, J. Han)
 Frequent patterns: P1 = {x_yan, j_han} (frequent itemset), P2 = "substructure search" (sequential pattern)
 Context units: <{p_yu, j_han}, {d_xin}, …, "graph pattern", …, "substructure similarity", …>
 Semantic annotations for pattern = {xifeng_yan, jiawei_han}:
 Context indicators (CI): graph; {philip_yu}; mine close; graph pattern; sequential pattern; …
 Representative transactions (Trans): "gSpan: graph-based substructure pattern mining"; "mining close relational graph connect constraint"; …
 Semantically similar patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
  • 454.
    455 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 455.
456 Summary
 Roadmap: many aspects of and extensions to pattern mining
 Mining patterns in multi-level, multi-dimensional space
 Mining rare and negative patterns
 Constraint-based pattern mining
 Specialized methods for mining high-dimensional data and colossal patterns
 Mining compressed or approximate patterns
 Pattern exploration and understanding: semantic annotation of frequent patterns
  • 456.
    457 Ref: Mining Multi-Leveland Quantitative Rules  Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules, KDD'99  T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.  J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.  R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.  R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.  R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.  K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. VLDB'00  K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
  • 457.
    458 Ref: Mining OtherKinds of Rules  F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98  Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE’98.  H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99  B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.  R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.  A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.  D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
  • 458.
    459 Ref: Constraint-Based PatternMining  R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97  R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD’98  G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00  J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01  J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in Large Databases, CIKM'02  F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated Data Reduction in Constrained Pattern Mining, PKDD'03  F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07
  • 459.
    460 Ref: Mining SequentialPatterns  X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with gap constraints. ICDM'05  H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.  J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.  R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.  X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.  M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
  • 460.
    Mining Graph andStructured Patterns  A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. PKDD'00  M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.  X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM'02  X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03  X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent structure analysis. ACM TODS, 30:960–993, 2005  X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity search. ACM Trans. Database Systems, 31:1418–1453, 2006 461
  • 461.
    462 Ref: Mining Spatial,Spatiotemporal, Multimedia Data  H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal sequential patterns. ICDM'05  D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01  K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic Information Databases, SSD’95  H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework for discovering co-location patterns in data sets with extended spatial objects. SDM'04  J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual words to visual phrases. CVPR'07  O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00
  • 462.
    463 Ref: Mining FrequentPatterns in Time-Series Data  B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.  J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, ICDE'99.  J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08  B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.  W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE’01.  J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE’03  L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
  • 463.
    464 Ref: FP forClassification and Clustering  G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.  B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. KDD’98.  W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.  H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets. SIGMOD’ 02.  J. Yang and W. Wang. CLUSEQ: efficient and effective sequence clustering. ICDE’03.  X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.  H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
  • 464.
    465 Ref: Privacy-Preserving FPMining  A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining of Association Rules. KDD’02.  A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS’03  J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD’02
  • 465.
    Mining Compressed Patterns D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy- aware top-k patterns. KDD'06  D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. VLDB'05  X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: A profile-based approach. KDD'05 466
  • 466.
    Mining Colossal Patterns F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal frequent patterns by core pattern fusion. ICDE'07  F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large Structural Patterns in a Massive Network. VLDB’11 467
  • 467.
    468 Ref: FP Miningfrom Data Streams  Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.  R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for finding frequent elements in streams and bags. TODS 2003.  G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB’02.  A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. ICDT'05
  • 468.
    469 Ref: Freq. PatternMining Applications  T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02  M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting interactive complexity bugs in sensor networks., SenSys'08  Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and Implementation (OSDI'04)  Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. FSE'05  D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure detection: A discriminative pattern mining approach. KDD'09  Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns. ACM TKDD, 2007.  K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
  • 469.
    470 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 8 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 471.
    472 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 472.
473 Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  • 473.
474 Prediction Problems: Classification vs. Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
 Numeric prediction
 models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
 Credit/loan approval
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
  • 474.
    475 Classification—A Two-Step Process Model construction: describing a set of predetermined classes  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute  The set of tuples used for model construction is training set  The model is represented as classification rules, decision trees, or mathematical formulae  Model usage: for classifying future or unknown objects  Estimate accuracy of the model  The known label of test sample is compared with the classified result from the model  Accuracy rate is the percentage of test set samples that are correctly classified by the model  Test set is independent of training set (otherwise overfitting)  If the accuracy is acceptable, use the model to classify new data  Note: If the test set is used to select models, it is called validation (test) set
  • 475.
476 Process (1): Model Construction
Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
  • 476.
477 Process (2): Using the Model in Prediction
Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen data: (Jeff, Professor, 4) → Tenured?
  • 477.
    478 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 478.
479 Decision Tree Induction: An Example
 Training data set: buys_computer (the data set follows Quinlan's ID3 "Playing Tennis" example)
 Resulting tree: the root tests age?
 age <= 30 → test student? (no → no, yes → yes)
 age 31..40 → yes
 age > 40 → test credit rating? (excellent → no, fair → yes)
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
  • 479.
    480 Algorithm for DecisionTree Induction  Basic algorithm (a greedy algorithm)  Tree is constructed in a top-down recursive divide-and-conquer manner  At start, all the training examples are at the root  Attributes are categorical (if continuous-valued, they are discretized in advance)  Examples are partitioned recursively based on selected attributes  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)  Conditions for stopping partitioning  All samples for a given node belong to the same class  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf  There are no samples left
  • 480.
481 Brief Review of Entropy
(Figure: entropy of a two-class distribution, m = 2)
  • 481.
482 Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:
   Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
 Information needed (after using A to split D into v partitions) to classify D:
   Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
 Information gained by branching on attribute A:
   Gain(A) = Info(D) - Info_A(D)
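The three formulas translate directly into code; the small check below reproduces Info(D) ≈ 0.94 and Gain(age) ≈ 0.246 for the buys_computer data (0.247 before the slide's intermediate rounding).

```python
from math import log2

def info(counts):
    """Entropy Info(D) = -sum p_i * log2(p_i), from class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_after_split(partitions):
    """Info_A(D) = sum |D_j|/|D| * Info(D_j)."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(class_counts, partitions):
    return info(class_counts) - info_after_split(partitions)

# buys_computer data: 9 yes / 5 no overall; split on age -> (<=30, 31..40, >40)
print(round(info([9, 5]), 3))                              # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))    # 0.247 (slide: 0.246)
```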
  • 482.
483 Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
 Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 Split on age:
age | p_i | n_i | I(p_i, n_i)
<=30 | 2 | 3 | 0.971
31…40 | 4 | 0 | 0
>40 | 3 | 2 | 0.971
 Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694, where (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's
 Hence Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
 (Training data: the buys_computer table from the earlier slide)
  • 483.
484 Computing Information Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point
 (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
 The point with the minimum expected information requirement for A is selected as the split point for A
 Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
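A sketch of the midpoint-based split-point search, assuming a single numeric attribute and string class labels; best_split_point is an illustrative helper, not a library routine.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values; keep the
    one with the minimum expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best = (float('inf'), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= mid]
        right = [l for v, l in pairs if v > mid]
        info_a = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (info_a, mid))
    return best   # (expected information, split point)

# Toy ages with a clean class boundary: splits at 32.5 with zero remaining information
print(best_split_point([25, 28, 30, 35, 40, 45], ['no', 'no', 'no', 'yes', 'yes', 'yes']))
```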
  • 484.
485 Gain Ratio for Attribute Selection (C4.5)
 The information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
   SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute
  • 485.
486 Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is defined as
   gini(D) = 1 - Σ_{j=1}^{n} p_j^2
   where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
   gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
 Reduction in impurity:
   Δgini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
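A direct transcription of the Gini formulas; the check below reproduces gini(D) = 0.459 and the {low, medium} split value 0.443 from the next slide's class counts.

```python
def gini(counts):
    """gini(D) = 1 - sum p_j^2, from class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n1, n2 = sum(part1), sum(part2)
    total = n1 + n2
    return n1 / total * gini(part1) + n2 / total * gini(part2)

# buys_computer: 9 yes / 5 no; income split D1 = {low, medium} (7 yes / 3 no) vs D2 = {high} (2 yes / 2 no)
print(round(gini([9, 5]), 3))                # 0.459
print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443
```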
  • 486.
487 Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
   gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
   gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443
 Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
  • 487.
    488 Comparing Attribute SelectionMeasures  The three measures, in general, return good results but  Information gain:  biased towards multivalued attributes  Gain ratio:  tends to prefer unbalanced splits in which one partition is much smaller than the others  Gini index:  biased to multivalued attributes  has difficulty when # of classes is large  tends to favor tests that result in equal-sized partitions and purity in both partitions
  • 488.
489 Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
 C-SEP: performs better than information gain and the gini index in certain cases
 G-statistic: has a close approximation to the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
 The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear combination of attributes
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others
  • 489.
490 Overfitting and Tree Pruning
 Overfitting: an induced tree may overfit the training data
 Too many branches, some of which may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the "best pruned tree"
  • 490.
    491 Enhancements to BasicDecision Tree Induction  Allow for continuous-valued attributes  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals  Handle missing attribute values  Assign the most common value of the attribute  Assign probability to each of the possible values  Attribute construction  Create new attributes based on existing ones that are sparsely represented  This reduces fragmentation, repetition, and replication
  • 491.
    492 Classification in LargeDatabases  Classification—a classical problem extensively studied by statisticians and machine learning researchers  Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed  Why is decision tree induction popular?  relatively faster learning speed (than other classification methods)  convertible to simple and easy to understand classification rules  can use SQL queries for accessing databases  comparable classification accuracy with other methods  RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)  Builds an AVC-list (attribute, value, class label)
  • 492.
    493 Scalability Framework forRainForest  Separates the scalability aspects from the criteria that determine the quality of the tree  Builds an AVC-list: AVC (Attribute, Value, Class_label)  AVC-set (of an attribute X )  Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated  AVC-group (of a node n )  Set of AVC-sets of all predictor attributes at the node n
  • 493.
494 RainForest: Training Set and Its AVC Sets
Training examples: the buys_computer table (age, income, student, credit_rating → buys_computer) from the earlier slide
AVC-set on Age: <=30: 2 yes / 3 no; 31..40: 4 yes / 0 no; >40: 3 yes / 2 no
AVC-set on income: high: 2 yes / 2 no; medium: 4 yes / 2 no; low: 3 yes / 1 no
AVC-set on student: yes: 6 yes / 1 no; no: 3 yes / 4 no
AVC-set on credit_rating: fair: 6 yes / 2 no; excellent: 3 yes / 3 no
  • 494.
    495 BOAT (Bootstrapped Optimistic Algorithmfor Tree Construction)  Use a statistical technique called bootstrapping to create several smaller samples (subsets), each fits in memory  Each subset is used to create a tree, resulting in several trees  These trees are examined and used to construct a new tree T’  It turns out that T’ is very close to the tree that would be generated using the whole data set together  Adv: requires only two scans of DB, an incremental alg.
  • 495.
496 Presentation of Classification Results
  • 496.
497 Visualization of a Decision Tree in SGI/MineSet 3.0
  • 497.
498 Interactive Visual Mining by Perception-Based Classification (PBC)
  • 498.
    499 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 499.
    500 Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities  Foundation: Based on Bayes’ Theorem.  Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers  Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data  Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
  • 500.
501 Bayes' Theorem: Basics
 Total probability theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
 Bayes' theorem: P(H | X) = P(X | H) P(H) / P(X)
 Let X be a data sample ("evidence"): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability): the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
  • 501.
502 Prediction Based on Bayes' Theorem
 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
   P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be viewed as posteriori = likelihood × prior / evidence
 Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
 Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
  • 502.
503 Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
 This can be derived from Bayes' theorem:
   P(C_i | X) = P(X | C_i) P(C_i) / P(X)
 Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized
  • 503.
504 Naïve Bayes Classifier
 A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
   P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)
 This greatly reduces the computation cost: only count the class distribution
 If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_i,D| (# of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
   g(x, μ, σ) = (1 / (sqrt(2π) σ)) exp(-(x - μ)^2 / (2σ^2)), and P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})
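A compact naive Bayes sketch for categorical attributes only (the Gaussian case is omitted), trained on the buys_computer table and reproducing the 0.028 vs. 0.007 scores of the worked example two slides ahead; the optional laplace argument sketches the Laplacian correction discussed later.

```python
from collections import Counter, defaultdict

# buys_computer training data: (age, income, student, credit_rating) -> class
DATA = [
    ('<=30','high','no','fair','no'),    ('<=30','high','no','excellent','no'),
    ('31..40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'),    ('>40','low','yes','excellent','no'),
    ('31..40','low','yes','excellent','yes'),  ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'),   ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31..40','medium','no','excellent','yes'),
    ('31..40','high','yes','fair','yes'),      ('>40','medium','no','excellent','no'),
]

def train(data):
    prior = Counter(row[-1] for row in data)       # class counts
    cond = defaultdict(Counter)                    # (attribute index, class) -> value counts
    for row in data:
        for i, v in enumerate(row[:-1]):
            cond[(i, row[-1])][v] += 1
    return prior, cond

def score(x, cls, prior, cond, laplace=0):
    """P(X|C_i) * P(C_i) under conditional independence, with optional Laplace add-1."""
    total = sum(prior.values())
    p = prior[cls] / total
    for i, v in enumerate(x):
        num = cond[(i, cls)][v] + laplace
        den = prior[cls] + laplace * len({row[i] for row in DATA})
        p *= num / den
    return p

prior, cond = train(DATA)
x = ('<=30', 'medium', 'yes', 'fair')
for cls in prior:                                  # yes: ~0.028, no: ~0.007 (as on the slide)
    print(cls, round(score(x, cls, prior, cond), 3))
```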
  • 504.
505 Naïve Bayes Classifier: Training Dataset
 Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
 Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 Training data: the buys_computer table (age, income, student, credit_rating → buys_computer) from the earlier slide
  • 505.
506 Naïve Bayes Classifier: An Example
 P(C_i): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
 Compute P(X|C_i) for each class:
 P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222; P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
 P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444; P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
 P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667; P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
 P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667; P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 P(X|C_i): P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
 P(X|C_i) × P(C_i): P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.028; P(X|buys_computer = "no") × P(buys_computer = "no") = 0.007
 Therefore, X belongs to class buys_computer = "yes"
  • 506.
507 Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) will be zero
 Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator): add 1 to each case
 Prob(income = low) = 1/1003; Prob(income = medium) = 991/1003; Prob(income = high) = 11/1003
 The "corrected" probability estimates are close to their "uncorrected" counterparts
  • 507.
    508 Naïve Bayes Classifier:Comments  Advantages  Easy to implement  Good results obtained in most of the cases  Disadvantages  Assumption: class conditional independence, therefore loss of accuracy  Practically, dependencies exist among variables  E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.  Dependencies among these cannot be modeled by Naïve Bayes Classifier  How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
  • 508.
    509 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 509.
510 Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
 R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy (see the sketch below)
 n_covers = # of tuples covered by R; n_correct = # of tuples correctly classified by R
 coverage(R) = n_covers / |D|   /* D: training data set */
 accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification cost per class
 Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
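A small helper for the two rule-assessment measures, with the rule represented as a dict of attribute tests; rule_metrics and the toy records are illustrative.

```python
def rule_metrics(rule_conditions, rule_class, data):
    """coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers.
    `rule_conditions` maps attribute name -> required value."""
    covers = [row for row in data
              if all(row[a] == v for a, v in rule_conditions.items())]
    if not covers:
        return 0.0, None
    correct = sum(1 for row in covers if row['class'] == rule_class)
    return len(covers) / len(data), correct / len(covers)

data = [
    {'age': 'youth', 'student': 'yes', 'class': 'yes'},
    {'age': 'youth', 'student': 'no',  'class': 'no'},
    {'age': 'senior', 'student': 'yes', 'class': 'yes'},
]
# R: IF age = youth AND student = yes THEN buys_computer = yes
print(rule_metrics({'age': 'youth', 'student': 'yes'}, 'yes', data))  # coverage 1/3, accuracy 1.0
```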
  • 510.
511 Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (root age? with branches <=30, 31..40, >40; subtests student? and credit rating?):
 IF age = young AND student = no THEN buys_computer = no
 IF age = young AND student = yes THEN buys_computer = yes
 IF age = mid-age THEN buys_computer = yes
 IF age = old AND credit_rating = excellent THEN buys_computer = no
 IF age = old AND credit_rating = fair THEN buys_computer = yes
  • 511.
    512 Rule Induction: SequentialCovering Method  Sequential covering algorithm: Extracts rules directly from training data  Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER  Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes  Steps:  Rules are learned one at a time  Each time a rule is learned, the tuples covered by the rules are removed  Repeat the process on the remaining tuples until termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold  Comp. w. decision-tree induction: learning a set of rules simultaneously
  • 512.
513 Sequential Covering Algorithm
 while (enough target tuples left)
   generate a rule
   remove positive target tuples satisfying this rule
(Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3)
  • 513.
514 Rule Generation
 To generate a rule:
 while (true)
   find the best predicate p
   if foil-gain(p) > threshold then add p to the current rule
   else break
(Figure: positive vs. negative examples, with the rule growing as A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5)
  • 514.
515 How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Add new attributes by adopting a greedy depth-first strategy
 Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses the information gain of extending the condition; it favors rules that have high accuracy and cover many positive tuples:
   FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))
 Rule pruning based on an independent set of test tuples:
   FOIL_Prune(R) = (pos - neg) / (pos + neg)
   where pos/neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R
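FOIL_Gain and FOIL_Prune as written above, wrapped as functions; the pos/neg figures in the demo call are made-up counts for illustration.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned
    version of R scores higher."""
    return (pos - neg) / (pos + neg)

# Current rule covers 100 pos / 100 neg; adding a predicate keeps 60 pos / 10 neg
print(round(foil_gain(100, 100, 60, 10), 2))   # 46.66
print(round(foil_prune(60, 10), 3))            # 0.714
```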
  • 515.
    516 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 516.
    Model Evaluation andSelection  Evaluation metrics: How can we measure accuracy? Other metrics to consider?  Use validation test set of class-labeled tuples instead of training set when assessing accuracy  Methods for estimating a classifier’s accuracy:  Holdout method, random subsampling  Cross-validation  Bootstrap  Comparing classifiers:  Confidence intervals  Cost-benefit analysis and ROC Curves 517
  • 517.
518 Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix (actual class vs. predicted class):
Actual \ Predicted | C1 | ¬C1
C1 | True Positives (TP) | False Negatives (FN)
¬C1 | False Positives (FP) | True Negatives (TN)
Example of a confusion matrix:
Actual \ Predicted | buy_computer = yes | buy_computer = no | Total
buy_computer = yes | 6954 | 46 | 7000
buy_computer = no | 412 | 2588 | 3000
Total | 7366 | 2634 | 10000
 Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
  • 518.
519 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
 Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified: Accuracy = (TP + TN) / All
 Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate: Sensitivity = TP / P
 Specificity: true negative recognition rate: Specificity = TN / N
Actual \ Predicted | C | ¬C | Total
C | TP | FN | P
¬C | FP | TN | N
Total | P' | N' | All
  • 519.
520 Classifier Evaluation Metrics: Precision and Recall, and F-measures
 Precision: exactness, i.e., what % of tuples that the classifier labeled as positive are actually positive: Precision = TP / (TP + FP)
 Recall: completeness, i.e., what % of positive tuples did the classifier label as positive: Recall = TP / (TP + FN)
 A perfect score is 1.0
 Inverse relationship between precision and recall
 F measure (F1 or F-score): the harmonic mean of precision and recall: F = 2 × precision × recall / (precision + recall)
 F_β: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision: F_β = (1 + β^2) × precision × recall / (β^2 × precision + recall)
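The precision, recall, and F-measure formulas as code; the demo reuses the TP/FP/FN counts of the cancer example on the next slide.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """F1 is the harmonic mean; F_beta weights recall beta times as much as precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# cancer example from the following slide: TP = 90, FP = 140, FN = 210
p, r = precision(90, 140), recall(90, 210)
print(round(p, 4), round(r, 4), round(f_beta(p, r), 4))   # 0.3913 0.3 0.3396
```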
  • 520.
521 Classifier Evaluation Metrics: Example
Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 | 210 | 300 | 30.00 (sensitivity)
cancer = no | 140 | 9560 | 9700 | 98.56 (specificity)
Total | 230 | 9770 | 10000 | 96.40 (accuracy)
 Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%
  • 521.
    Evaluating Classifier Accuracy: Holdout& Cross-Validation Methods  Holdout method  Given data is randomly partitioned into two independent sets  Training set (e.g., 2/3) for model construction  Test set (e.g., 1/3) for accuracy estimation  Random sampling: a variation of holdout  Repeat holdout k times, accuracy = avg. of the accuracies obtained  Cross-validation (k-fold, where k = 10 is most popular)  Randomly partition the data into k mutually exclusive subsets, each approximately equal size  At i-th iteration, use Di as test set and others as training set  Leave-one-out: k folds where k = # of tuples, for small sized data  *Stratified cross-validation*: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data 522
  • 522.
523 Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 There are several bootstrap methods; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model combines the two accuracies over the k rounds:
   Acc(M) = (1/k) Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
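A sketch of one bootstrap round: sample d tuples with replacement and let the never-drawn tuples form the test set; with a large d, roughly 63.2% of the distinct tuples end up in the training sample, matching the slide. The accuracy-combination step is only indicated in a comment, since it depends on the model being evaluated.

```python
import random

def bootstrap_split(data, rng=random):
    """One bootstrap round: sample |data| tuples with replacement for training;
    tuples never drawn form the test set (about 36.8% of them on average)."""
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]
    test_idx = set(range(d)) - set(train_idx)
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

random.seed(0)
data = list(range(10000))
train, test = bootstrap_split(data)
print(len(set(train)) / len(data))   # close to 0.632
print(len(test) / len(data))         # close to 0.368

# .632 bootstrap accuracy over k rounds (acc_test_i / acc_train_i from your model):
# Acc(M) = (1/k) * sum_i (0.632 * acc_test_i + 0.368 * acc_train_i)
```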
  • 523.
524 Estimating Confidence Intervals: Classifier Models M1 vs. M2
 Suppose we have two classifiers, M1 and M2; which one is better?
 Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates of the error on the true population of future data cases
 What if the difference between the two error rates is just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates
  • 524.
    Estimating Confidence Intervals: NullHypothesis  Perform 10-fold cross-validation  Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)  Use t-test (or Student’s t-test)  Null Hypothesis: M1 & M2 are the same  If we can reject null hypothesis, then  we conclude that the difference between M1 & M2 is statistically significant  Chose model with lower error rate 525
  • 525.
526 Estimating Confidence Intervals: t-test
 If only one test set is available: pairwise comparison
 For the ith round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
 Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
 The t-test computes the t-statistic with k - 1 degrees of freedom:
   t = (err(M1) - err(M2)) / sqrt(var(M1 - M2) / k)
   where var(M1 - M2) = (1/k) Σ_{i=1}^{k} [err(M1)_i - err(M2)_i - (err(M1) - err(M2))]^2
 If two test sets are available: use the non-paired t-test with
   var(M1 - M2) = var(M1)/k1 + var(M2)/k2
   where k1 and k2 are the # of cross-validation samples used for M1 and M2, respectively
  • 526.
    Estimating Confidence Intervals: Tablefor t-distribution  Symmetric  Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of population  Confidence limit, z = sig/2 527
  • 527.
    Estimating Confidence Intervals: StatisticalSignificance  Are M1 & M2 significantly different?  Compute t. Select significance level (e.g. sig = 5%)  Consult table for t-distribution: Find t value corresponding to k-1 degrees of freedom (here, 9)  t-distribution is symmetric: typically upper % points of distribution shown → look up value for confidence limit z=sig/2 (here, 0.025)  If t > z or t < -z, then t value lies in rejection region:  Reject null hypothesis that mean error rates of M1 & M2 are same  Conclude: statistically significant difference between M1 & M2  Otherwise, conclude that any difference is chance 528
  • 528.
    Model Selection: ROCCurves  ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models  Originated from signal detection theory  Shows the trade-off between the true positive rate and the false positive rate  The area under the ROC curve is a measure of the accuracy of the model  Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list  The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model  Vertical axis represents the true positive rate  Horizontal axis rep. the false positive rate  The plot also shows a diagonal line  A model with perfect accuracy will have an area of 1.0 529
  • 529.
    Issues Affecting ModelSelection  Accuracy  classifier accuracy: predicting class label  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 530
  • 530.
    531 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 531.
    Ensemble Methods: Increasingthe Accuracy  Ensemble methods  Use a combination of models to increase accuracy  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*  Popular ensemble methods  Bagging: averaging the prediction over a collection of classifiers  Boosting: weighted vote with a collection of classifiers  Ensemble: combining a set of heterogeneous classifiers 532
  • 532.
    Bagging: Bootstrap Aggregation  Analogy: Diagnosis based on multiple doctors’ majority vote  Training  Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)  A classifier model Mi is learned for each training set Di  Classification: classify an unknown sample X  Each classifier Mi returns its class prediction  The bagged classifier M* counts the votes and assigns the class with the most votes to X  Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple  Accuracy  Often significantly better than a single classifier derived from D  For noisy data: not considerably worse, more robust  Proven improved accuracy in prediction 533
  • 533.
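As a sketch, bagging can be tried with scikit-learn's BaggingClassifier (the estimator keyword assumes scikit-learn 1.2+; older releases call it base_estimator); the breast-cancer data set is only an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# k bootstrap samples of D, one decision tree per sample, majority vote at prediction time
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25, random_state=0)
single = DecisionTreeClassifier(random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=10).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=10).mean())
```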
    Boosting  Analogy: Consultseveral doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy  How boosting works?  Weights are assigned to each training tuple  A series of k classifiers is iteratively learned  After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy  Boosting algorithm can be extended for numeric prediction  Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data 534
  • 534.
    535 Adaboost (Freund and Schapire, 1997)  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all the weights of tuples are set the same (1/d)  Generate k classifiers in k rounds. At round i,  Tuples from D are sampled (with replacement) to form a training set Di of the same size  Each tuple’s chance of being selected is based on its weight  A classification model Mi is derived from Di  Its error rate is calculated using Di as a test set  If a tuple is misclassified, its weight is increased, o.w. it is decreased  Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the sum of the weights of the misclassified tuples: error(Mi) = Σ_j w_j × err(Xj)  The weight of classifier Mi’s vote is log[(1 – error(Mi)) / error(Mi)]
  • 535.
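A minimal sketch of discrete AdaBoost in Python. It fits each round with tuple weights instead of resampling and uses the conventional ½·log((1 – err)/err) vote weight, so it differs in small details from the slide's formulation; decision stumps from scikit-learn are an arbitrary choice of base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    """Discrete AdaBoost sketch for NumPy labels y in {-1, +1}: reweight tuples so
    that each new stump focuses on the tuples misclassified so far."""
    d = len(y)
    w = np.full(d, 1.0 / d)                       # start with uniform tuple weights
    stumps, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weighted fit instead of resampling
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))             # weighted error rate (weights sum to 1)
        if err >= 0.5:                            # stop if no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # classifier vote weight
        w *= np.exp(-alpha * y * pred)            # increase weights of misclassified tuples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```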
    Random Forest (Breiman2001)  Random Forest:  Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split  During classification, each tree votes and the most popular class is returned  Two Methods to construct Random Forest:  Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size  Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)  Comparable in accuracy to Adaboost, but more robust to errors and outliers  Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting 536
  • 536.
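A random forest in the Forest-RI spirit is a one-liner with scikit-learn; max_features="sqrt" corresponds to picking a random subset of attributes as split candidates at each node (the data set and parameter values are illustrative only).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample and considers a random subset of
# sqrt(#attributes) candidate attributes at every split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("10-fold accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```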
    Classification of Class-ImbalancedData Sets  Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.  Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data  Typical methods for imbalance data in 2-class classification:  Oversampling: re-sampling of data from positive class  Under-sampling: randomly eliminate tuples from negative class  Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors  Ensemble techniques: Ensemble multiple classifiers introduced above  Still difficult for class imbalance problem on multiclass tasks 537
  • 537.
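A naive oversampling sketch (assuming X, y are NumPy arrays and the positive class is the minority); real projects would more likely reach for class weights or a library such as imbalanced-learn.

```python
import numpy as np

def random_oversample(X, y, positive_label=1, seed=0):
    """Naive oversampling: resample the rare positive class with replacement
    until both classes have the same number of tuples."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == positive_label)[0]
    neg = np.where(y != positive_label)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)  # duplicates of positives
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```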
    538 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 538.
    Summary (I)  Classificationis a form of data analysis that extracts models describing important data classes.  Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, rule-based classification, and many other classification methods.  Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure.  Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models. 539
  • 539.
    Summary (II)  Significancetests and ROC curves are useful for model selection.  There have been numerous comparisons of the different classification methods; the matter remains a research topic  No single method has been found to be superior over all others for all data sets  Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade- offs, further complicating the quest for an overall superior method 540
  • 540.
    References (1)  C.Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997  C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995  L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998  P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95  H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07  H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for Effective Classification, ICDE'08  W. Cohen. Fast effective rule induction. ICML'95  G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05 541
  • 541.
    References (2)  A.J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.  G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.  R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001  U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.  Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.  J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. VLDB’98.  J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.  D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.  W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01. 542
  • 542.
    References (3)  T.-S.Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.  J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.  M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96.  T. M. Mitchell. Machine Learning. McGraw Hill, 1997.  S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi- Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998  J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.  J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.  J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.  J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96. 543
  • 543.
    References (4)  R.Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. VLDB’98.  J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. VLDB’96.  J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.  P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.  S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991.  S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.  I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.  X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03  H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03. 544
    CS412 Midterm Exam Statistics  Opinion Question Answering:  Like the style: 70.83%, dislike: 29.16%  Exam is hard: 55.75%, easy: 0.6%, just right: 43.63%  Time: plenty: 3.03%, enough: 36.96%, not enough: 60%  Score distribution: # of students (Total: 180)  >=90: 24  80-89: 54  70-79: 46  60-69: 37  50-59: 15  40-49: 2  <40: 2  Final grading is based on overall score accumulation and relative class distributions 546
  • 546.
    547 Issues: Evaluating ClassificationMethods  Accuracy  classifier accuracy: predicting class label  predictor accuracy: guessing value of predicted attributes  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
  • 547.
    548 Predictor Error Measures  Measure predictor accuracy: measure how far off the predicted value is from the actual known value  Loss function: measures the error betw. yi and the predicted value yi’  Absolute error: |yi – yi’|  Squared error: (yi – yi’)²  Test error (generalization error): the average loss over the test set  Mean absolute error: Σ_{i=1..d} |yi – yi’| / d  Mean squared error: Σ_{i=1..d} (yi – yi’)² / d  Relative absolute error: Σ_{i=1..d} |yi – yi’| / Σ_{i=1..d} |yi – ȳ|  Relative squared error: Σ_{i=1..d} (yi – yi’)² / Σ_{i=1..d} (yi – ȳ)²  The mean squared error exaggerates the presence of outliers  Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
  • 548.
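The measures above translate directly into NumPy (a small sketch; the sample values are made up).

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """Compute the predictor error measures listed above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mean_y = y_true.mean()
    mae = np.mean(np.abs(y_true - y_pred))                                   # mean absolute error
    mse = np.mean((y_true - y_pred) ** 2)                                    # mean squared error
    rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - mean_y))  # relative absolute error
    rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - mean_y) ** 2)    # relative squared error
    return {"MAE": mae, "RMSE": np.sqrt(mse), "RAE": rae, "RSE": rse}

print(prediction_errors([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```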
    549 Scalable Decision TreeInduction Methods  SLIQ (EDBT’96 — Mehta et al.)  Builds an index for each attribute and only class list and the current attribute list reside in memory  SPRINT (VLDB’96 — J. Shafer et al.)  Constructs an attribute list data structure  PUBLIC (VLDB’98 — Rastogi & Shim)  Integrates tree splitting and tree pruning: stop growing the tree earlier  RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)  Builds an AVC-list (attribute, value, class label)  BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)  Uses bootstrapping to create several small samples
  • 549.
    550 Data Cube-Based Decision-TreeInduction  Integration of generalization with decision-tree induction (Kamber et al.’97)  Classification at primitive concept levels  E.g., precise temperature, humidity, outlook, etc.  Low-level concepts, scattered classes, bushy classification- trees  Semantic interpretation problems  Cube-based multi-level classification  Relevance analysis at multi-levels  Information-gain analysis with dimension + level
  • 550.
    551 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 9 — Classification: Advanced Methods Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 551.
    552 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 552.
    553 Bayesian Belief Networks Bayesian belief networks (also known as Bayesian networks, probabilistic networks): allow class conditional independencies between subsets of variables  A (directed acyclic) graphical model of causal relationships  Represents dependency among the variables  Gives a specification of joint probability distribution X Y Z P  Nodes: random variables  Links: dependency  X and Y are the parents of Z, and Y is the parent of P  No dependency between Z and P  Has no loops/cycles
  • 553.
    554 Bayesian Belief Network: An Example  Network (figure): FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC  CPT: Conditional Probability Table for variable LungCancer — shows the conditional probability for each possible combination of the values of its parents:
             (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC       0.8       0.5        0.7        0.1
      ~LC      0.2       0.5        0.3        0.9
 Derivation of the probability of a particular combination of values of X, from the CPT: P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Yi))
  • 554.
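A tiny numeric illustration of the chain-rule derivation. Only the LungCancer CPT comes from the slide; the marginals for FamilyHistory and Smoker are made-up numbers, and independence of FH and S is assumed for simplicity.

```python
# CPTs written as plain dictionaries; only the LungCancer table comes from the
# slide, the marginals for FH and S are hypothetical.
p_fh = {True: 0.10, False: 0.90}                  # hypothetical P(FamilyHistory)
p_s = {True: 0.30, False: 0.70}                   # hypothetical P(Smoker)
p_lc_given = {                                    # P(LungCancer | FH, S) from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), i.e. the product of each
    variable's probability given its parents."""
    p_lc = p_lc_given[(fh, s)] if lc else 1 - p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc

print(joint(fh=True, s=True, lc=True))   # 0.10 * 0.30 * 0.8 = 0.024
```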
    555 Training Bayesian Networks:Several Scenarios  Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries  Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function  Weights are initialized to random probability values  At each iteration, it moves towards what appears to be the best solution at the moment, w.o. backtracking  Weights are updated at each iteration & converge to local optimum  Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct network topology  Scenario 4: Unknown structure, all hidden variables: No good algorithms known for this purpose  D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999.
  • 555.
    556 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 556.
    557 Classification by Backpropagation Backpropagation: A neural network learning algorithm  Started by psychologists and neurobiologists to develop and test computational analogues of neurons  A neural network: A set of connected input/output units where each connection has a weight associated with it  During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples  Also referred to as connectionist learning due to the
  • 557.
    558 Neural Network asa Classifier  Weakness  Long training time  Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”  Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network  Strength  High tolerance to noisy data  Ability to classify untrained patterns  Well-suited for continuous-valued inputs and outputs  Successful on an array of real-world data, e.g., hand-written letters  Algorithms are inherently parallel  Techniques have recently been developed for the extraction of
  • 558.
    559 A Multi-Layer Feed-Forward Neural Network  (Figure: an input vector X feeds the input layer, which is connected through weights wij to a hidden layer and then to the output layer, producing the output vector)  Weight update rule: wj^(k+1) = wj^(k) + λ (yi – ŷi^(k)) xij
  • 559.
    560 How A Multi-LayerNeural Network Works  The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are then weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one  The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction  The network is feed-forward: None of the weights cycles back to an input unit or to an output unit of a previous layer  From a statistical point of view, networks perform nonlinear
  • 560.
    561 Defining a NetworkTopology  Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer  Normalize the input values for each attribute measured in the training tuples to [0.0—1.0]  One input unit per domain value, each initialized to 0  Output, if for classification and more than two classes, one output unit per class is used  Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
  • 561.
    562 Backpropagation  Iteratively processa set of training tuples & compare the network's prediction with the actual known target value  For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value  Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”  Steps  Initialize weights to small random numbers, associated with biases  Propagate the inputs forward (by applying activation function)  Backpropagate the error (by updating weights and biases) 
  • 562.
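A compact NumPy sketch of the three steps on a toy XOR problem (one hidden layer, sigmoid units, squared-error loss); the layer sizes, learning rate, and epoch count are arbitrary, and convergence may need a different seed or more epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (XOR), inputs already normalized to [0, 1]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random initial weights and zero biases: 2 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Propagate the inputs forward
    h = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = sigmoid(h @ W2 + b2)          # network prediction

    # Backpropagate the error (squared-error loss; sigmoid derivative = out*(1-out))
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the backwards direction
    W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;    b1 -= lr * delta_h.sum(axis=0)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # should approach [0, 1, 1, 0]
```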
    563 Neuron: A Hidden/Output Layer Unit  An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping  The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function f is applied to it.  (Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum Σ, followed by the activation function f, producing output y; μk denotes the bias)  Example: y = sign(Σ_{i=0..n} wi xi – μk)
  • 563.
    564 Efficiency and Interpretability Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but # of epochs can be exponential to n, the number of inputs, in worst case  For easier comprehension: Rule extraction by network pruning  Simplify the network structure by removing weighted links that have the least effect on the trained network  Then perform link, unit, or activation value clustering  The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers  Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules
  • 564.
    565 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 565.
    566 Classification: A Mathematical Mapping  Classification: predicts categorical class labels  E.g., personal homepage classification  xi = (x1, x2, x3, …), yi = +1 or –1  x1: # of word “homepage”  x2: # of word “welcome”  Mathematically, x ∈ X = ℝ^n, y ∈ Y = {+1, –1}, and we want to derive a function f: X → Y  Linear Classification  Binary classification problem  Data above the separating line belongs to class ‘x’; data below it belongs to class ‘o’  Examples: SVM, Perceptron, Probabilistic Classifiers
  • 566.
    567 Discriminative Classifiers  Advantages Prediction accuracy is generally high  As compared to Bayesian methods – in general  Robust, works when training examples contain errors  Fast evaluation of the learned target function  Bayesian networks are normally slow  Criticism  Long training time  Difficult to understand the learned function (weights)  Bayesian networks can be used easily for pattern discovery  Not easy to incorporate domain knowledge  Easy in the form of priors on the data or distributions
  • 567.
    568 SVM—Support Vector Machines A relatively new classification method for both linear and nonlinear data  It uses a nonlinear mapping to transform the original training data into a higher dimension  With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)  With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane  SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
  • 568.
    569 SVM—History and Applications Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s  Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)  Used for: classification and numeric prediction  Applications:  handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
  • 569.
    571 SVM—Margins and Support Vectors (figure)
  • 571.
    572 SVM—When Data IsLinearly Separable m Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the one that minimizes classification error on unseen data) SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
  • 572.
    573 SVM—Linearly Separable  A separating hyperplane can be written as W ● X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)  For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0  The hyperplanes defining the sides of the margin: H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1  Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors  This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers
  • 573.
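A minimal scikit-learn sketch on toy 2-D data: a linear-kernel SVC with a large C approximates the hard-margin case, and support_vectors_ exposes the tuples that define the margin.

```python
import numpy as np
from sklearn import svm

# Two linearly separable 2-D classes (toy data)
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [6, 5], [7, 6], [6, 6], [7, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = svm.SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("support vectors:\n", clf.support_vectors_)  # the tuples that define the margin
```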
    574 Why Is SVMEffective on High Dimensional Data?  The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data  The support vectors are the essential or critical training examples —they lie closest to the decision boundary (MMH)  If all other training examples are removed and the training is repeated, the same separating hyperplane would be found  The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality  Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
  • 574.
    575 SVM—Linearly Inseparable  Transformthe original input data into a higher dimensional space  Search for a linear separating hyperplane in the new space A1 A2
  • 575.
    576 SVM: Different Kernelfunctions  Instead of computing the dot product on the transformed data, it is math. equivalent to applying a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) Φ(Xj)  Typical Kernel Functions  SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional
  • 576.
    577 Scaling SVM byHierarchical Micro-Clustering  SVM is not scalable to the number of data objects in terms of training time and memory usage  H. Yu, J. Yang, and J. Han, “ Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03)  CB-SVM (Clustering-Based SVM)  Given limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and the training speed  Use micro-clustering to effectively reduce the number of points to be considered  At deriving support vectors, de-cluster micro-clusters near “candidate vector” to ensure high classification accuracy
  • 577.
    578 CF-Tree: Hierarchical Micro-cluster Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory  Micro-clustering: Hierarchical indexing structure  provide finer samples closer to the boundary and coarser samples farther from the boundary
  • 578.
    579 Selective Declustering: EnsureHigh Accuracy  CF tree is a suitable base structure for selective declustering  De-cluster only the cluster Ei such that  Di – Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei  Decluster only the cluster whose subclusters have possibilities to be the support cluster of the boundary  “Support cluster”: The cluster whose centroid is a support vector
  • 579.
    580 CB-SVM Algorithm: Outline Construct two CF-trees from positive and negative data sets independently  Need one scan of the data set  Train an SVM from the centroids of the root entries  De-cluster the entries near the boundary into the next level  The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries  Train an SVM again from the centroids of the entries in the training set  Repeat until nothing is accumulated
  • 580.
    581 Accuracy and Scalabilityon Synthetic Dataset  Experiments on large synthetic data sets shows better accuracy than random sampling approaches and far more scalable than the original SVM algorithm
  • 581.
    582 SVM vs. NeuralNetwork  SVM  Deterministic algorithm  Nice generalization properties  Hard to learn – learned in batch mode using quadratic programming techniques  Using kernels can  Neural Network  Nondeterministic algorithm  Generalizes well but doesn’t have strong mathematical foundation  Can easily be learned in incremental fashion  To learn complex functions—use multilayer perceptron
  • 582.
    583 SVM Related Links SVM Website: http://www.kernel-machines.org/  Representative implementations  LIBSVM: an efficient implementation of SVM, multi- class classifications, nu-SVM, one-class SVM, including also various interfaces with java, python, etc.  SVM-light: simpler but performance is not better than LIBSVM, support only binary classification and only in C  SVM-torch: another recent implementation also
  • 583.
    584 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 584.
    585 Associative Classification  Associativeclassification: Major steps  Mine data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels  Association rules are generated in the form of P1 ^ p2 … ^ pl  “Aclass = C” (conf, sup)  Organize the rules to form a rule-based classifier  Why effective?  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time  Associative classification has been found to be often more accurate than some traditional classification methods, such as
  • 585.
    586 Typical Associative ClassificationMethods  CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)  Mine possible association rules in the form of  Cond-set (a set of attribute-value pairs)  class label  Build classifier: Organize rules according to decreasing precedence based on confidence and then support  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)  Classification: Statistical analysis on multiple rules  CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)  Generation of predictive rules (FOIL-like analysis) but allow covered rules to retain with reduced weight  Prediction using best k rules 
  • 586.
    587 Frequent Pattern-Based Classification H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “ Discriminative Frequent Pattern Analysis for Effective Cl assification ”, ICDE'07  Accuracy issue  Increase the discriminative power  Increase the expressive power of the feature space  Scalability issue  It is computationally infeasible to generate all feature combinations and filter them with an information gain threshold  Efficient method (DDPMine: FPtree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, " Direct Discriminative Pattern Mining for Effective Cla
  • 587.
    588 Frequent Pattern vs.Single Feature (a) Austral (c) Sonar (b) Cleve Fig. 1. Information Gain vs. Pattern Length The discriminative power of some frequent patterns is higher than that of single features.
  • 588.
    589 Empirical Results  Fig. 2. Information Gain vs. Pattern Frequency — panels (a) Austral, (b) Breast, (c) Sonar; the curves plot InfoGain and its upper bound (IG_UpperBnd) against pattern support
  • 589.
    590 Feature Selection  Givena set of frequent patterns, both non- discriminative and redundant patterns exist, which can cause overfitting  We want to single out the discriminative patterns and remove redundant ones  The notion of Maximal Marginal Relevance (MMR) is borrowed  A document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents
  • 590.
    593 DDPMine: Branch-and-Bound Search  Association between information gain and frequency  a: constant, a parent node; b: variable, a descendant  sup(child) ≤ sup(parent), i.e., sup(b) ≤ sup(a)
  • 593.
    595 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 595.
    596 Lazy vs. EagerLearning  Lazy vs. eager learning  Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple  Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify  Lazy: less time in training but more time in predicting  Accuracy  Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function  Eager: must commit to a single hypothesis that
  • 596.
    597 Lazy Learner: Instance-BasedMethods  Instance-based learning:  Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified  Typical approaches  k-nearest neighbor approach  Instances represented as points in a Euclidean space.  Locally weighted regression  Constructs local approximation  Case-based reasoning  Uses symbolic representations and knowledge- based inference
  • 597.
    598 The k-Nearest Neighbor Algorithm  All instances correspond to points in the n-D space  The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)  The target function could be discrete- or real-valued  For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to xq  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
  • 598.
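A self-contained k-NN sketch (Euclidean distance, majority vote); the toy training tuples are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training tuples
    (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3))   # -> "B"
```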
    599 Discussion on the k-NN Algorithm  k-NN for real-valued prediction for a given unknown tuple  Returns the mean values of the k nearest neighbors  Distance-weighted nearest neighbor algorithm  Weight the contribution of each of the k neighbors according to their distance to the query xq, giving greater weight to closer neighbors: w ≡ 1 / d(xq, xi)²  Robust to noisy data by averaging k-nearest neighbors  Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes  To overcome it, stretch axes or eliminate the least relevant attributes
  • 599.
    600 Case-Based Reasoning (CBR) CBR: Uses a database of problem solutions to solve new problems  Store symbolic description (tuples or cases)—not points in a Euclidean space  Applications: Customer-service (product-related diagnosis), legal ruling  Methodology  Instances represented by rich symbolic descriptions (e.g., function graphs)  Search for similar cases, multiple retrieved cases may be combined  Tight coupling between case retrieval, knowledge-based reasoning, and problem solving  Challenges  Find a good similarity metric  Indexing based on syntactic similarity measure, and when
  • 600.
    601 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 601.
    602 Genetic Algorithms (GA) Genetic Algorithm: based on an analogy to biological evolution  An initial population is created consisting of randomly generated rules  Each rule is represented by a string of bits  E.g., if A1 and ¬A2 then C2 can be encoded as 100  If an attribute has k > 2 values, k bits can be used  Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring  The fitness of a rule is represented by its classification accuracy on a set of training examples  Offspring are generated by crossover and mutation  The process continues until a population P evolves when each rule in P satisfies a prespecified threshold  Slow but easily parallelizable
  • 602.
    603 Rough Set Approach Rough sets are used to approximately or “roughly” define equivalent classes  A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)  Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity
  • 603.
    604 Fuzzy Set Approaches  Fuzzylogic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)  Attribute values are converted to fuzzy values. Ex.:  Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96  Fuzzy membership values do not have to sum to 1.  Each applicable rule contributes a vote for membership in the categories  Typically, the truth values for each predicted category are summed, and these sums are combined
  • 604.
    605 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 605.
    Multiclass Classification  Classificationinvolving more than two classes (i.e., > 2 Classes)  Method 1. One-vs.-all (OVA): Learn a classifier one at a time  Given m classes, train m classifiers: one for each class  Classifier j: treat tuples in class j as positive & all others as negative  To classify a tuple X, the set of classifiers vote as an ensemble  Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes  Given m classes, construct m(m-1)/2 binary classifiers  A classifier is trained using tuples of the two classes  To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote  Comparison  All-vs.-all tends to be superior to one-vs.-all  Problem: Binary classifier is sensitive to errors, and errors affect 606
  • 606.
    Error-Correcting Codes for Multiclass Classification  Originally designed to correct errors during data transmission for communication tasks by exploiting data redundancy  Example: a 7-bit codeword associated with classes 1–4 607
      Class    Error-Corr. Codeword
      C1       1 1 1 1 1 1 1
      C2       0 0 0 0 1 1 1
      C3       0 0 1 1 0 0 1
      C4       0 1 0 1 0 1 0
 Given an unknown tuple X, the 7 trained classifiers output: 0001010  Hamming distance: # of differing bits between two codewords  H(X, C1) = 5, by checking the # of differing bits between [1111111] & [0001010]  H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 is chosen as the label for X  Error-correcting codes can correct up to ⌊(h – 1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords  If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct  When selecting error-correcting codes, there should be good row-wise and column-wise separation between the codewords
  • 607.
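A short sketch of the decoding step, using the codeword table above; the classifier outputs are the example vector 0001010 from the slide.

```python
import numpy as np

# 7-bit error-correcting codewords for classes C1-C4 (from the table above)
codewords = {
    "C1": [1, 1, 1, 1, 1, 1, 1],
    "C2": [0, 0, 0, 0, 1, 1, 1],
    "C3": [0, 0, 1, 1, 0, 0, 1],
    "C4": [0, 1, 0, 1, 0, 1, 0],
}

def decode(bits):
    """Assign the class whose codeword has the smallest Hamming distance
    to the 7 binary-classifier outputs."""
    bits = np.array(bits)
    dists = {c: int(np.sum(bits != np.array(cw))) for c, cw in codewords.items()}
    return min(dists, key=dists.get), dists

label, dists = decode([0, 0, 0, 1, 0, 1, 0])   # outputs of the 7 trained classifiers
print(label, dists)                            # C4, distances {C1: 5, C2: 3, C3: 3, C4: 1}
```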
    Semi-Supervised Classification  Semi-supervised:Uses labeled and unlabeled data to build a classifier  Self-training:  Build a classifier using the labeled data  Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data  Repeat the above process  Adv: easy to understand; disadv: may reinforce errors  Co-training: Use two or more classifiers to teach each other  Each learner uses a mutually independent set of features of each tuple to train a good classifier, say f1  Then f1 and f2 are used to predict the class label for unlabeled data X  Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa 608
  • 608.
    Active Learning  Classlabels are expensive to obtain  Active learner: query human (oracle) for labels  Pool-based approach: Uses a pool of unlabeled data  L: a small subset of D is labeled, U: a pool of unlabeled data in D  Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)  The newly labeled samples are added to L, and learn a model  Goal: Achieve high accuracy using as few labeled data as possible  Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)  Research issue: How to choose the data tuples to be queried?  Uncertainty sampling: choose the least certain ones  Reduce version space, the subset of hypotheses consistent w. the training data  Reduce expected entropy over U: Find the greatest reduction in 609
  • 609.
    Transfer Learning: ConceptualFramework  Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task  Traditional learning: Build a new classifier for each new task  Transfer learning: Build new classifier by applying existing knowledge learned from source tasks Learning System Learning System Learning System Different Tasks 610 Traditional Learning Framework Transfer Learning Framework Knowledge Learning System Source Tasks Target Task
  • 610.
    Transfer Learning: Methodsand Applications  Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering  Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task  TrAdaBoost (Transfer AdaBoost)  Assume source and target data each described by the same set of attributes (features) & class labels, but rather diff. distributions  Require only labeling a small amount of target data  Use source data in training: When a source tuple is misclassified, reduce the weight of such tupels so that they will have less effect on the subsequent classifier  Research issues  Negative transfer: When it performs worse than no transfer at all  Heterogeneous transfer learning: Transfer knowledge from different feature space or multiple source domains  Large-scale transfer learning 611
  • 611.
    612 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 612.
    613 Summary  Effective andadvanced classification methods  Bayesian belief network (probabilistic networks)  Backpropagation (Neural networks)  Support Vector Machine (SVM)  Pattern-based classification  Other classification methods: lazy learners (KNN, case-based reasoning), genetic algorithms, rough set and fuzzy set approaches  Additional Topics on Classification  Multiclass classification  Semi-supervised classification  Active learning  Transfer learning
  • 613.
    614 References  Please seethe references of Chapter 8
  • 614.
    616 What Is Prediction? (Numerical) prediction is similar to classification  construct a model  use model to predict continuous or ordered value for a given input  Prediction is different from classification  Classification refers to predict categorical class label  Prediction models continuous-valued functions  Major method for prediction: regression  model the relationship between one or more independent or predictor variables and a dependent or response variable  Regression analysis  Linear and multiple regression  Non-linear regression  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • 616.
    617 Linear Regression  Linear regression: involves a response variable y and a single predictor variable x: y = w0 + w1 x, where w0 (y-intercept) and w1 (slope) are regression coefficients  Method of least squares: estimates the best-fitting straight line: w1 = Σ_{i=1..|D|} (xi – x̄)(yi – ȳ) / Σ_{i=1..|D|} (xi – x̄)², w0 = ȳ – w1 x̄  Multiple linear regression: involves more than one predictor variable  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)  Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2  Solvable by extension of the least squares method or using statistical software such as SAS or S-Plus
  • 617.
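The closed-form estimates translate directly into NumPy; the experience/salary numbers below are illustrative toy data.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least-squares estimates for y = w0 + w1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Toy data: x = years of experience, y = salary (in $1000s)
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = least_squares_line(x, y)
print(f"y = {w0:.1f} + {w1:.1f} x")
```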
    618 Nonlinear Regression  Some nonlinear models can be modeled by a polynomial function  A polynomial regression model can be transformed into a linear regression model. For example, y = w0 + w1 x + w2 x² + w3 x³ is convertible to linear form with the new variables x2 = x², x3 = x³: y = w0 + w1 x + w2 x2 + w3 x3  Other functions, such as the power function, can also be transformed to a linear model  Some models are intractably nonlinear (e.g., sum of exponential terms)  It is possible to obtain least squares estimates through extensive calculation on more complex formulae
  • 618.
    619  Generalized linearmodel:  Foundation on which linear regression can be applied to modeling categorical response variables  Variance of y is a function of the mean value of y, not a constant  Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables  Poisson regression: models the data that exhibit a Poisson distribution  Log-linear models: (for categorical data)  Approximate discrete multidimensional prob. distributions  Also useful for data compression and smoothing  Regression trees and model trees  Trees to predict continuous values rather than class labels Other Regression-Based Models
  • 619.
    620 Regression Trees andModel Trees  Regression tree: proposed in CART system (Breiman et al. 1984)  CART: Classification And Regression Trees  Each leaf stores a continuous-valued prediction  It is the average value of the predicted attribute for the training tuples that reach the leaf  Model tree: proposed by Quinlan (1992)  Each leaf holds a regression model—a multivariate linear equation for the predicted attribute  A more general case than regression tree  Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
  • 620.
    621  Predictive modeling:Predict data values or construct generalized linear models based on the database data  One can only predict value ranges or category distributions  Method outline:  Minimal generalization  Attribute relevance analysis  Generalized linear model construction  Prediction  Determine the major factors which influence the prediction  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc. Predictive Modeling in Multidimensional Databases
  • 621.
    624 SVM—Introductory Literature  “StatisticalLearning Theory” by Vapnik: extremely hard to understand, containing many errors too.  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.  Better than the Vapnik’s book, but still written too hard for introduction, and the examples are so not-intuitive  The book “An Introduction to Support Vector Machines” by N. Cristianini and J. Shawe-Taylor  Also written hard for introduction, but the explanation about the mercer’s theorem is better than above literatures  The neural network book by Haykins  Contains one nice chapter of SVM introduction
  • 624.
    625 Notes about SVM— IntroductoryLiterature  “Statistical Learning Theory” by Vapnik: difficult to understand, containing many errors.  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.  Easier than Vapnik’s book, but still not introductory level; the examples are not so intuitive  The book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor  Not introductory level, but the explanation about Mercer’s Theorem is better than above literatures  Neural Networks and Learning Machines by Haykin  Contains a nice chapter on SVM introduction
  • 625.
    626 Associative Classification CanAchieve High Accuracy and Efficiency (Cong et al. SIGMOD05)
  • 626.
    627 A Closer Look at CMAR  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)  Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset  Rule pruning whenever a rule is inserted into the tree  Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then prune R2  Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance  Classification based on generated/pruned rules  If only one rule satisfies tuple X, assign the class label of the rule  If a rule set S satisfies X, CMAR  divides S into groups according to class labels  uses a weighted χ² measure to find the strongest group of rules, based on the statistical correlation of rules within a group  assigns X the class label of the strongest group
  • 627.
    628 Perceptron & Winnow •Vector: x, w • Scalar: x, y, w Input: {(x1, y1), …} Output: classification function f(x) f(xi) > 0 for yi = +1 f(xi) < 0 for yi = -1 f(x) => wx + b = 0 or w1x1+w2x2+b = 0 x1 x2 • Perceptron: update W additively • Winnow: update W multiplicatively
  • 628.
    Data Mining: Concepts andTechniques (3rd ed.) — Chapter 10 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 629
  • 629.
    630 Chapter 10. ClusterAnalysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 630
  • 630.
    631 What is ClusterAnalysis?  Cluster: A collection of data objects  similar (or related) to one another within the same group  dissimilar (or unrelated) to the objects in other groups  Cluster analysis (or clustering, data segmentation, …)  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms
  • 631.
    632 Clustering for DataUnderstanding and Applications  Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species  Information retrieval: document clustering  Land use: Identification of areas of similar land use in an earth observation database  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  City-planning: Identifying groups of houses according to their house type, value, and geographical location  Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults  Climate: understanding earth climate, find patterns of atmospheric and ocean  Economic Science: market resarch
  • 632.
    633 Clustering as aPreprocessing Tool (Utility)  Summarization:  Preprocessing for regression, PCA, classification, and association analysis  Compression:  Image processing: vector quantization  Finding K-nearest Neighbors  Localizing search to one or a small number of clusters  Outlier detection  Outliers are often viewed as those “far away” from any cluster
  • 633.
    Quality: What IsGood Clustering?  A good clustering method will produce high quality clusters  high intra-class similarity: cohesive within clusters  low inter-class similarity: distinctive between clusters  The quality of a clustering method depends on  the similarity measure used by the method  its implementation, and  Its ability to discover some or all of the hidden patterns 634
  • 634.
    Measure the Qualityof Clustering  Dissimilarity/Similarity metric  Similarity is expressed in terms of a distance function, typically metric: d(i, j)  The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables  Weights should be associated with different variables based on applications and data semantics  Quality of clustering:  There is usually a separate “quality” function that measures the “goodness” of a cluster.  It is hard to define “similar enough” or “good enough”  The answer is typically highly subjective 635
  • 635.
    Considerations for ClusterAnalysis  Partitioning criteria  Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)  Separation of clusters  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)  Similarity measure  Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)  Clustering space  Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 636
  • 636.
    Requirements and Challenges Scalability  Clustering all the data instead of only on samples  Ability to deal with different types of attributes  Numerical, binary, categorical, ordinal, linked, and mixture of these  Constraint-based clustering  User may give inputs on constraints  Use domain knowledge to determine input parameters  Interpretability and usability  Others  Discovery of clusters with arbitrary shape  Ability to deal with noisy data  Incremental clustering and insensitivity to input order  High dimensionality 637
  • 637.
    Major Clustering Approaches (I)  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors  Typical methods: k-means, k-medoids, CLARANS  Hierarchical approach:  Create a hierarchical decomposition of the set of data (or objects) using some criterion  Typical methods: DIANA, AGNES, BIRCH, CHAMELEON  Density-based approach:  Based on connectivity and density functions  Typical methods: DBSCAN, OPTICS, DenClue  Grid-based approach:  Based on a multiple-level granularity structure  Typical methods: STING, WaveCluster, CLIQUE 638
  • 638.
    Major Clustering Approaches(II)  Model-based:  A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other  Typical methods: EM, SOM, COBWEB  Frequent pattern-based:  Based on the analysis of frequent patterns  Typical methods: p-Cluster  User-guided or constraint-based:  Clustering by considering user-specified or application-specific constraints  Typical methods: COD (obstacles), constrained clustering  Link-based clustering:  Objects are often linked together in various ways  Massive links can be used to cluster objects: SimRank, LinkClus 639
  • 639.
    640 Chapter 10. ClusterAnalysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 640
  • 640.
    Partitioning Algorithms: Basic Concept  Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized: E = Σ_{i=1..k} Σ_{p∈Ci} (p – ci)², where ci is the centroid or medoid of cluster Ci  Given k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-medoids algorithms  k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 641
  • 641.
The K-Means Clustering Method  Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2; stop when the assignment does not change 642
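To make the four steps concrete, here is a minimal k-means sketch in Python (NumPy as an assumed dependency); the toy data, k, and seed are hypothetical, and this is an illustrative re-implementation rather than the book's code.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as seeds
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments (and centroids) no longer change
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy usage
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids, labels = kmeans(X, k=2)
```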
An Example of K-Means Clustering (K = 2)  Starting from the initial data set, arbitrarily partition objects into k groups, update the cluster centroids, reassign objects, and loop if needed  In outline:  Partition objects into k nonempty subsets  Repeat  Compute the centroid (i.e., mean point) of each partition  Assign each object to the cluster of its nearest centroid  Until no change 643
Comments on the K-Means Method  Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))  Comment: Often terminates at a local optimum.  Weakness  Applicable only to objects in a continuous n-dimensional space  Use the k-modes method for categorical data  In comparison, k-medoids can be applied to a wide range of data  Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)  Sensitive to noisy data and outliers  Not suitable to discover clusters with non-convex shapes 644
    Variations of theK-Means Method  Most of the variants of the k-means which differ in  Selection of the initial k means  Dissimilarity calculations  Strategies to calculate cluster means  Handling categorical data: k-modes  Replacing means of clusters with modes  Using new dissimilarity measures to deal with categorical objects  Using a frequency-based method to update modes of clusters  A mixture of categorical and numerical data: k-prototype method 645
What Is the Problem of the K-Means Method?  The k-means algorithm is sensitive to outliers!  An object with an extremely large value may substantially distort the distribution of the data  K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster 646
647 PAM: A Typical K-Medoids Algorithm (K = 2)  Arbitrarily choose k objects as the initial medoids  Assign each remaining object to the nearest medoid (total cost = 20)  Randomly select a non-medoid object, Orandom, and compute the total cost of swapping (e.g., total cost = 26)  If swapping a medoid O with Orandom improves the quality, perform the swap  Loop until no change
The K-Medoids Clustering Method  K-medoids clustering: Find representative objects (medoids) in clusters  PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)  Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering  PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)  Efficiency improvement on PAM  CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples  CLARANS (Ng & Han, 1994): Randomized re-sampling 648
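The swap-based search that PAM performs can be sketched as follows (Python/NumPy assumed); the toy data and the exhaustive swap loop are for illustration only and ignore PAM's incremental cost-update optimizations.

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    """Naive PAM sketch: try every (medoid, non-medoid) swap until no swap lowers the cost."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid
        return dist[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        for m_idx, h in product(range(k), range(n)):
            if h in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_idx] = h            # swap a medoid with non-medoid h
            cost = total_cost(candidate)
            if cost < best:                 # keep the swap only if it improves quality
                best, medoids, improved = cost, candidate, True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [25.0, 25.0]])
medoids, labels = pam(X, k=2)
```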
649 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 649
Hierarchical Clustering  Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition  Agglomerative (AGNES): merge objects a, b, c, d, e step by step, e.g., {a, b}, {d, e}, {c, d, e}, then {a, b, c, d, e}  Divisive (DIANA): split in the inverse order, starting from the whole set 650
AGNES (Agglomerative Nesting)  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical packages, e.g., Splus  Use the single-link method and the dissimilarity matrix  Merge nodes that have the least dissimilarity  Go on in a non-descending fashion  Eventually all nodes belong to the same cluster 651
Dendrogram: Shows How Clusters Are Merged  Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram  A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster 652
DIANA (Divisive Analysis)  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical analysis packages, e.g., Splus  Inverse order of AGNES  Eventually each node forms a cluster on its own 653
    Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located object in the cluster X X 654
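The linkage choices above correspond to standard options in common libraries; a small sketch, assuming SciPy is available, that builds and cuts an agglomerative dendrogram under single, complete, and average linkage on hypothetical 2-D data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups
X = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9], [8, 8], [8.2, 8.1], [7.9, 8.3]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree (dendrogram encoding)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```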
Centroid, Radius and Diameter of a Cluster (for numerical data sets)  Centroid: the “middle” of a cluster, Cm = Σi=1..N tip / N  Radius: square root of the average distance from any point of the cluster to its centroid, Rm = sqrt( Σi=1..N (tip − cm)² / N )  Diameter: square root of the average mean squared distance between all pairs of points in the cluster, Dm = sqrt( Σi=1..N Σj=1..N (tip − tjq)² / (N(N−1)) ) 655
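A short sketch of the three measures above (Python/NumPy assumed), applied here to the five sample points that appear in the BIRCH example that follows:

```python
import numpy as np

def centroid_radius_diameter(points):
    """Compute the centroid Cm, radius Rm, and diameter Dm of a numerical cluster."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    centroid = X.mean(axis=0)                                   # Cm
    radius = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())  # sqrt(avg squared distance to Cm)
    # Average squared distance over all pairs of distinct points
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=2)
    diameter = np.sqrt(sq.sum() / (n * (n - 1)))                # Dm
    return centroid, radius, diameter

centroid, radius, diameter = centroid_radius_diameter([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
```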
Extensions to Hierarchical Clustering  Major weaknesses of agglomerative clustering methods  Can never undo what was done previously  Do not scale well: time complexity of at least O(n²), where n is the number of total objects  Integration of hierarchical & distance-based clustering  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters  CHAMELEON (1999): hierarchical clustering using dynamic modeling 656
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)  Zhang, Ramakrishnan & Livny, SIGMOD’96  Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering  Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree  Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans  Weakness: handles only numeric data, and is sensitive to the order of the data records 657
Clustering Feature Vector in BIRCH  Clustering Feature (CF): CF = (N, LS, SS)  N: number of data points  LS: linear sum of the N points, Σi=1..N Xi  SS: square sum of the N points, Σi=1..N Xi²  Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 658
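The additivity of clustering features is what lets BIRCH summarize sub-clusters incrementally; a minimal sketch (Python/NumPy assumed, with per-dimension LS and SS as in the example above):

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS); CFs are additive, which is what BIRCH exploits."""
    def __init__(self, n=0, ls=None, ss=None, dim=2):
        self.n = n
        self.ls = np.zeros(dim) if ls is None else np.asarray(ls, dtype=float)
        self.ss = np.zeros(dim) if ss is None else np.asarray(ss, dtype=float)

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x        # linear sum
        self.ss += x ** 2   # square sum (per dimension)

    def merge(self, other):
        """Merging two sub-clusters is just component-wise addition of their CFs."""
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss, dim=len(self.ls))

    def centroid(self):
        return self.ls / self.n

cf = CF(dim=2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add_point(p)
# Reproduces the slide's example: cf.n == 5, cf.ls == [16, 30], cf.ss == [54, 190]
```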
    CF-Tree in BIRCH Clustering feature:  Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the subcluster from the statistical point of view  Registers crucial measurements for computing cluster and utilizes storage efficiently A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering  A nonleaf node in a tree has descendants or “children”  The nonleaf nodes store sums of the CFs of their children  A CF tree has two parameters  Branching factor: max # of children  Threshold: max diameter of sub-clusters stored at the leaf nodes 659
The CF Tree Structure  [Figure: a CF tree with branching factor B = 7 and maximum leaf entries L = 6; the root and non-leaf nodes store CF entries (CF1, CF2, …) with pointers to their children, while leaf nodes store CF entries and are chained together by prev/next pointers] 660
The BIRCH Algorithm  Cluster diameter: D = sqrt( Σi Σj (xi − xj)² / (n(n−1)) )  For each point in the input  Find the closest leaf entry  Add the point to the leaf entry and update the CF  If the entry diameter > max_diameter, then split the leaf, and possibly its parents  Algorithm is O(n)  Concerns  Sensitive to the insertion order of data points  Since the size of leaf nodes is fixed, clusters may not be so natural  Clusters tend to be spherical given the radius and diameter measures 661
    CHAMELEON: Hierarchical ClusteringUsing Dynamic Modeling (1999)  CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999  Measures the similarity based on a dynamic model  Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters  Graph-based, and a two-phase algorithm 1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters 2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters 662
Overall Framework of CHAMELEON  Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p)  Partition the graph into many relatively small sub-clusters  Merge partitions into the final clusters based on:  Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity  Relative closeness: closeness of c1 and c2 over their internal closeness 663
    Probabilistic Hierarchical Clustering Algorithmic hierarchical clustering  Nontrivial to choose a good distance measure  Hard to handle missing attribute values  Optimization goal not clear: heuristic, local search  Probabilistic hierarchical clustering  Use probabilistic models to measure distances between clusters  Generative model: Regard the set of data objects to be clustered as a sample of the underlying data generation mechanism to be analyzed  Easy to understand, same efficiency as algorithmic agglomerative clustering method, can handle partially observed data  In practice, assume the generative models adopt common distributions functions, e.g., Gaussian distribution or Bernoulli distribution, governed by parameters 665
Generative Model  Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are generated by a Gaussian distribution N(μ, σ²)  The probability that a point xi ∈ X is generated by the model: P(xi | μ, σ²) = (1 / (√(2π) σ)) exp(−(xi − μ)² / (2σ²))  The likelihood that X is generated by the model: L(N(μ, σ²) : X) = Πi=1..n P(xi | μ, σ²)  The task of learning the generative model: find the parameters μ and σ² such that this likelihood is maximized 666
A Probabilistic Hierarchical Clustering Algorithm  For a set of objects partitioned into m clusters C1, …, Cm, the quality can be measured by Q({C1, …, Cm}) = Πi=1..m P(Ci), where P() is the maximum likelihood  Distance between clusters C1 and C2: dist(C1, C2) = −log (P(C1 ∪ C2) / (P(C1)P(C2)))  Algorithm: progressively merge points and clusters  Input: D = {o1, ..., on}: a data set containing n objects  Output: a hierarchy of clusters  Method:  Create a cluster for each object: Ci = {oi}, 1 ≤ i ≤ n  For i = 1 to n { find the pair of clusters Ci and Cj such that Ci, Cj = argmax i≠j {log (P(Ci ∪ Cj) / (P(Ci)P(Cj)))}; if log (P(Ci ∪ Cj) / (P(Ci)P(Cj))) > 0 then merge Ci and Cj } 667
668 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 668
    Density-Based Clustering Methods Clustering based on density (local cluster criterion), such as density-connected points  Major features:  Discover clusters of arbitrary shape  Handle noise  One scan  Need density parameters as termination condition  Several interesting studies:  DBSCAN: Ester, et al. (KDD’96)  OPTICS: Ankerst, et al (SIGMOD’99).  DENCLUE: Hinneburg & D. Keim (KDD’98)  CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid- based) 669
    Density-Based Clustering: BasicConcepts  Two parameters:  Eps: Maximum radius of the neighbourhood  MinPts: Minimum number of points in an Eps- neighbourhood of that point  NEps(p): {q belongs to D | dist(p,q) ≤ Eps}  Directly density-reachable: A point p is directly density- reachable from a point q w.r.t. Eps, MinPts if  p belongs to NEps(q)  core point condition: |NEps (q)| ≥ MinPts MinPts = 5 Eps = 1 cm p q 670
    Density-Reachable and Density-Connected Density-reachable:  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi  Density-connected  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts p q p1 p q o 671
    DBSCAN: Density-Based SpatialClustering of Applications with Noise  Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points  Discovers clusters of arbitrary shape in spatial databases with noise Core Border Outlier Eps = 1cm MinPts = 5 672
DBSCAN: The Algorithm  Arbitrarily select a point p  Retrieve all points density-reachable from p w.r.t. Eps and MinPts  If p is a core point, a cluster is formed  If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database  Continue the process until all of the points have been processed 673
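A compact, unoptimized DBSCAN sketch in Python/NumPy (it recomputes all pairwise distances rather than using a spatial index; the toy data and parameter values are hypothetical):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns labels: -1 = noise, 0..k-1 = cluster ids."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -1)          # -1 means noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                 # p is not a core point; leave it as noise for now
        labels[p] = cluster_id       # grow a new cluster from core point p
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id          # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # q is a core point: expand further
        cluster_id += 1
    return labels

X = [[1, 1], [1.1, 1.2], [0.9, 1.1], [5, 5], [5.1, 5.2], [5.2, 4.9], [9, 0]]
labels = dbscan(X, eps=0.5, min_pts=2)
```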
DBSCAN: Sensitive to Parameters 674
OPTICS: A Cluster-Ordering Method (1999)  OPTICS: Ordering Points To Identify the Clustering Structure  Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)  Produces a special order of the database w.r.t. its density-based clustering structure  This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings  Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure  Can be represented graphically or using visualization techniques 675
OPTICS: Some Extensions from DBSCAN  Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1−p) = 5  Complexity: O(N log N)  Core distance of an object o: the minimum eps such that o is a core point  Reachability distance of p from o: max(core-distance(o), d(o, p))  Example (MinPts = 5, eps = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm 676
DENCLUE: Using Statistical Density Functions  DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)  Uses statistical density functions:  Influence of y on x: fGaussian(x, y) = e^(−d(x,y)²/(2σ²))  Total influence on x: fDGaussian(x) = Σi=1..N e^(−d(x,xi)²/(2σ²))  Gradient of x in the direction of xi: ∇fDGaussian(x, xi) = Σi=1..N (xi − x) e^(−d(x,xi)²/(2σ²))  Major features  Solid mathematical foundation  Good for data sets with large amounts of noise  Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets  Significantly faster than existing algorithms (e.g., DBSCAN)  But needs a large number of parameters 679
DENCLUE: Technical Essence  Uses grid cells, but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure  Influence function: describes the impact of a data point within its neighborhood  The overall density of the data space can be calculated as the sum of the influence functions of all data points  Clusters can be determined mathematically by identifying density attractors  Density attractors are local maxima of the overall density function  Center-defined clusters: assign to each density attractor the points density-attracted to it  Arbitrarily shaped clusters: merge density attractors that are connected through paths of high density (> threshold) 680
683 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 683
    Grid-Based Clustering Method Using multi-resolution grid data structure  Several interesting methods  STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)  WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)  A multi-resolution clustering approach using wavelet method  CLIQUE: Agrawal, et al. (SIGMOD’98)  Both grid-based and subspace clustering 684
STING: A Statistical Information Grid Approach  Wang, Yang and Muntz (VLDB’97)  The spatial area is divided into rectangular cells  There are several levels of cells corresponding to different levels of resolution (1st layer, …, (i−1)-st layer, i-th layer) 685
    The STING ClusteringMethod  Each cell at a high level is partitioned into a number of smaller cells in the next lower level  Statistical info of each cell is calculated and stored beforehand and is used to answer queries  Parameters of higher level cells can be easily calculated from parameters of lower level cell  count, mean, s, min, max  type of distribution—normal, uniform, etc.  Use a top-down approach to answer spatial data queries  Start from a pre-selected layer—typically with a small number of cells  For each cell in the current level compute the confidence interval 686
STING Algorithm and Its Analysis  Remove the irrelevant cells from further consideration  When finished examining the current layer, proceed to the next lower level  Repeat this process until the bottom layer is reached  Advantages:  Query-independent, easy to parallelize, incremental update  O(K), where K is the number of grid cells at the lowest level  Disadvantages:  All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected 687
688 CLIQUE (CLustering In QUEst)  Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)  Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space  CLIQUE can be considered as both density-based and grid-based  It partitions each dimension into the same number of equal-length intervals  It partitions an m-dimensional data space into non-overlapping rectangular units  A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter  A cluster is a maximal set of connected dense units within a subspace
    689 CLIQUE: The MajorSteps  Partition the data space and find the number of points that lie inside each cell of the partition.  Identify the subspaces that contain clusters using the Apriori principle  Identify clusters  Determine dense units in all subspaces of interests  Determine connected dense units in all subspaces of interests.  Generate minimal description for the clusters  Determine maximal regions that cover a cluster of connected dense units for each cluster  Determination of minimal cover for each cluster
690 [Figure: CLIQUE example — dense units found in the (age, salary) and (age, vacation) subspaces for a density threshold τ = 3, and their intersection in the 3-D (age, vacation, salary) space]
    691 Strength and Weaknessof CLIQUE  Strength  automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces  insensitive to the order of records in input and does not presume some canonical data distribution  scales linearly with the size of input and has good scalability as the number of dimensions in the data increases  Weakness  The accuracy of the clustering result may be degraded at the expense of simplicity of the method
692 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 692
Assessing Clustering Tendency  Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution  Test spatial randomness by a statistical test: the Hopkins statistic  Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space  Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v in D  Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{dist(qi, v)} where v in D and v ≠ qi  Calculate the Hopkins statistic H from the two sums of nearest-neighbor distances  If D is uniformly distributed, Σxi and Σyi will be close to each other and H is close to 0.5. If D is highly skewed, H is close to 0 693
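The slide does not show the formula itself; the sketch below assumes the common form H = Σyi / (Σxi + Σyi), with the uniform sample drawn from the bounding box of D, which matches the stated behavior (H ≈ 0.5 for uniform data, H → 0 for highly skewed data). Python/NumPy assumed; sample sizes and data are illustrative.

```python
import numpy as np

def hopkins_statistic(D, n_samples=50, seed=0):
    """Hopkins-statistic sketch, assuming H = sum(y) / (sum(x) + sum(y))."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = min(n_samples, len(D) - 1)

    # x_i: nearest-neighbor distances of points sampled uniformly from the data space of D
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # y_i: nearest-neighbor distances of points sampled from D itself (excluding the point)
    idx = rng.choice(len(D), size=n, replace=False)
    y = np.array([
        np.partition(np.linalg.norm(D - D[i], axis=1), 1)[1]  # skip the zero distance to itself
        for i in idx
    ])
    return y.sum() / (x.sum() + y.sum())

D = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 8])
H = hopkins_statistic(D)  # well below 0.5 for clustered (skewed) data
```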
Determine the Number of Clusters  Empirical method  # of clusters ≈ √(n/2) for a dataset of n points  Elbow method  Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters  Cross-validation method  Divide a given data set into m parts  Use m − 1 parts to obtain a clustering model  Use the remaining part to test the quality of the clustering  E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set  For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find the # of clusters that fits the data best 694
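A quick elbow-method sketch (assuming scikit-learn's KMeans; the synthetic three-cluster data is hypothetical): the within-cluster sum of squares drops sharply until k reaches the true number of clusters and then flattens, and that turning point suggests k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Elbow method: inspect (or plot) the within-cluster sum of squares as k grows
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ = sum of squared distances to closest centroid
```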
    Measuring Clustering Quality Two methods: extrinsic vs. intrinsic  Extrinsic: supervised, i.e., the ground truth is available  Compare a clustering against the ground truth using certain clustering quality measure  Ex. BCubed precision and recall metrics  Intrinsic: unsupervised, i.e., the ground truth is unavailable  Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are  Ex. Silhouette coefficient 695
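For the intrinsic case, a minimal silhouette-coefficient example (scikit-learn assumed; the synthetic two-cluster data is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette coefficient in [-1, 1]: values near 1 indicate compact, well-separated clusters
print(silhouette_score(X, labels))
```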
    Measuring Clustering Quality:Extrinsic Methods  Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.  Q is good if it satisfies the following 4 essential criteria  Cluster homogeneity: the purer, the better  Cluster completeness: should assign objects belong to the same category in the ground truth to the same cluster  Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other” category)  Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces 696
697 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 697
    Summary  Cluster analysisgroups objects based on their similarity and has wide applications  Measure of similarity can be computed for various types of data  Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods  K-means and K-medoids algorithms are popular partitioning-based clustering algorithms  Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms  DBSCAN, OPTICS, and DENCLU are interesting density-based algorithms  STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm  Quality of clustering results can be evaluated in various ways 698
  • 698.
    699 CS512-Spring 2011: AnIntroduction  Coverage  Cluster Analysis: Chapter 11  Outlier Detection: Chapter 12  Mining Sequence Data: BK2: Chapter 8  Mining Graphs Data: BK2: Chapter 9  Social and Information Network Analysis  BK2: Chapter 9  Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010  Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets: Reasoning About a Highly Connected World”, Cambridge U., 2010  Recent research papers  Mining Data Streams: BK2: Chapter 8  Requirements  One research project  One class presentation (15 minutes)  Two homeworks (no programming assignment)  Two midterm exams (no final exam)
References (1)  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98  M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973  M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99  F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96  M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95  D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987  D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98  V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99 700
References (2)  D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98  S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98  S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999  A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98  A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988  G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32(8): 68-75, 1999  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98 701
References (3)  G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988  R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94  L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004  E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition  G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98  A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01  A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01  H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02  W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97  T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96  X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. VLDB'06 702
704 A Typical K-Medoids Algorithm (PAM) (K = 2)  Arbitrarily choose k objects as the initial medoids  Assign each remaining object to the nearest medoid (total cost = 20)  Randomly select a non-medoid object, Orandom, and compute the total cost of swapping (e.g., total cost = 26)  If swapping a medoid O with Orandom improves the quality, perform the swap  Loop until no change
    705 PAM (Partitioning AroundMedoids) (1987)  PAM (Kaufman and Rousseeuw, 1987), built in Splus  Use real object to represent the cluster  Select k representative objects arbitrarily  For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih  For each pair of i and h,  If TCih < 0, i is replaced by h  Then assign each non-selected object to the most similar representative object  repeat steps 2-3 until there is no change
    706 PAM Clustering: Findingthe Best Cluster Center  Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi
    707 What Is theProblem with PAM?  Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean  Pam works efficiently for small data sets but does not scale well for large data sets.  O(k(n-k)2 ) for each iteration where n is # of data,k is # of clusters  Sampling-based method CLARA(Clustering LARge Applications)
    708 CLARA (Clustering LargeApplications) (1990)  CLARA (Kaufmann and Rousseeuw in 1990)  Built in statistical analysis packages, such as SPlus  It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output  Strength: deals with larger data sets than PAM  Weakness:  Efficiency depends on the sample size  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
    709 CLARANS (“Randomized” CLARA)(1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)  Draws sample of neighbors dynamically  The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids  If the local optimum is found, it starts with new randomly selected node in search for a new local optimum  Advantages: More efficient and scalable than both PAM and CLARA  Further improvement: Focusing techniques and spatial access structures (Ester et al.’95)
    710 ROCK: Clustering CategoricalData  ROCK: RObust Clustering using linKs  S. Guha, R. Rastogi & K. Shim, ICDE’99  Major ideas  Use links to measure similarity/proximity  Not distance-based  Algorithm: sampling-based clustering  Draw random sample  Cluster with links  Label data in disk  Experiments  Congressional voting, mushroom data
711 Similarity Measure in ROCK  Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient  Example: Two groups (clusters) of transactions  C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}  C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}  The Jaccard coefficient may lead to a wrong clustering result  C1: 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})  C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})  Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|  Ex. Let T1 = {a, b, c}, T2 = {c, d, e}: Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
    712 Link Measure inROCK  Clusters  C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}  C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}  Neighbors  Two transactions are neighbors if sim(T1,T2) > threshold  Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}  T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}  T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}  T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}  Link Similarity  Link similarity between two transactions is the # of common neighbors  link(T1, T2) = 4, since they have 4 common neighbors  {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}  link(T1, T3) = 3, since they have 3 common neighbors  {a, b, d}, {a, b, e}, {a, b, g}
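A small sketch of the neighbor and link computations above (plain Python); the neighbor threshold is an assumed value chosen so that Jaccard = 0.5 pairs count as neighbors while 0.2 pairs do not, which reproduces link(T1, T2) = 4 and link(T1, T3) = 3 from the example.

```python
# Transactions from the two example groups
C1 = [frozenset(s) for s in ({'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'},
                             {'a','c','e'}, {'a','d','e'}, {'b','c','d'}, {'b','c','e'},
                             {'b','d','e'}, {'c','d','e'})]
C2 = [frozenset(s) for s in ({'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'})]
transactions = C1 + C2

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

theta = 0.4  # assumed neighbor threshold: sim > theta
neighbors = {t: {u for u in transactions if u != t and jaccard(t, u) > theta}
             for t in transactions}

def link(t1, t2):
    """ROCK's link measure: the number of common neighbors of two transactions."""
    return len(neighbors[t1] & neighbors[t2])

T1, T2, T3 = frozenset('abc'), frozenset('cde'), frozenset('abf')
print(link(T1, T2), link(T1, T3))  # 4 3
```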
714 Aggregation-Based Similarity Computation  In two SimTrees ST1 and ST2, for each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl)  Aggregating over the two sibling groups: sim(a, b) = (Σk=10..12 s(nk, n4) / 3) · s(n4, n5) · (Σl=13..14 s(n5, nl) / 2) = 0.171, which takes O(3+2) time  After aggregation, we reduce quadratic-time computation to linear-time computation
Computing Similarity with Aggregation  To compute sim(na, nb):  Find all pairs of sibling nodes ni and nj such that na is linked with ni and nb with nj  Calculate the similarity (and weight) between na and nb w.r.t. ni and nj  Calculate the weighted average similarity between na and nb w.r.t. all such pairs  Using the (average similarity, total weight) pairs from the figure, a: (0.9, 3) and b: (0.95, 2): sim(na, nb) = avg_sim(na, n4) × s(n4, n5) × avg_sim(nb, n5) = 0.9 × 0.2 × 0.95 = 0.171  sim(na, nb) can be computed from aggregated similarities 715
716 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Overview of Clustering Methods  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Summary 716
Link-Based Clustering: Calculate Similarities Based on Links  Jeh & Widom, KDD’2002: SimRank  Two objects are similar if they are linked with the same or similar objects  The similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y: sim(a, b) = (C / (|I(a)| |I(b)|)) Σi=1..|I(a)| Σj=1..|I(b)| sim(Ii(a), Ij(b))  Example: authors (Tom, Mike, Cathy, John, Mary) linked to proceedings (sigmod03–05, vldb03–05, aaai04–05), which are linked to conferences (sigmod, vldb, aaai)  Issue: expensive to compute  For a dataset of N objects and M links, it takes O(N²) space and O(M²) time to compute all similarities 717
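A naive SimRank iteration sketch (plain Python, undirected links, decay constant C = 0.8 assumed; the tiny author–venue edge list is hypothetical), which also makes the quadratic all-pairs cost visible:

```python
from collections import defaultdict

def simrank(edges, C=0.8, iters=5):
    """Naive SimRank on an undirected graph given as (u, v) pairs.
    sim(a, b) = C / (|I(a)||I(b)|) * sum over neighbor pairs of sim(i, j)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    nodes = list(nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif nbrs[a] and nbrs[b]:
                    s = sum(sim[(i, j)] for i in nbrs[a] for j in nbrs[b])
                    new[(a, b)] = C * s / (len(nbrs[a]) * len(nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Hypothetical author-venue links in the spirit of the slide's example
edges = [("Tom", "sigmod03"), ("Mike", "sigmod03"), ("Cathy", "vldb03"), ("Tom", "vldb03")]
sim = simrank(edges)
print(sim[("Tom", "Mike")], sim[("Mike", "Cathy")])
```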
Observation 1: Hierarchical Structures  Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals)  Example: a hierarchical structure of products in Walmart (all → electronics / grocery / apparel → DVD, camera, TV, …)  Example: relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004) 718
Observation 2: Distribution of Similarity  A power-law distribution exists in similarities (distribution of SimRank similarities among DBLP authors)  56% of similarity entries are in [0.005, 0.015]  1.4% of similarity entries are larger than 0.1  Can we design a data structure that stores the significant similarities and compresses insignificant ones? 719
    A Novel DataStructure: SimTree Each leaf node represents an object Each non-leaf node represents a group of similar lower-level nodes Similarities between siblings are stored Consumer electronics Apparels Canon A40 digital camera Sony V3 digital camera Digital Cameras TVs 720
Similarity Defined by SimTree  Path-based node similarity: simp(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)  Similarity between two nodes is the average similarity between objects linked with them in other SimTrees  Similarities between sibling nodes (e.g., s(n1, n2)) are stored in the tree  Adjustment ratio for a node x = (average similarity between x and all other nodes) / (average similarity between x’s parent and all other nodes) 721
    LinkClus: Efficient Clusteringvia Heterogeneous Semantic Links Method  Initialize a SimTree for objects of each type  Repeat until stable  For each SimTree, update the similarities between its nodes using similarities in other SimTrees  Similarity between two nodes x and y is the average similarity between objects linked with them  Adjust the structure of each SimTree  Assign each node to the parent node that it is most similar to For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB'06 722
    Initialization of SimTrees Initializing a SimTree  Repeatedly find groups of tightly related nodes, which are merged into a higher-level node  Tightness of a group of nodes  For a group of nodes {n1, …, nk}, its tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of {n1, …, nk} n1 1 2 3 4 5 n2 The tightness of {n1, n2} is 3 Nodes Leaf nodes in another SimTree 723
    Finding Tight Groupsby Freq. Pattern Mining  Finding tight groups Frequent pattern mining  Procedure of initializing a tree  Start from leaf nodes (level-0)  At each level l, find non-overlapping groups of similar nodes with frequent pattern mining Reduced to g1 g2 {n1} {n1, n2} {n2} {n1, n2} {n1, n2} {n2, n3, n4} {n4} {n3, n4} {n3, n4} Transactions n1 1 2 3 4 5 6 7 8 9 n2 n3 n4 The tightness of a group of nodes is the support of a frequent pattern 724
    Adjusting SimTree Structures After similarity changes, the tree structure also needs to be changed  If a node is more similar to its parent’s sibling, then move it to be a child of that sibling  Try to move each node to its parent’s sibling that it is most similar to, under the constraint that each parent node can have at most c children n1 n2 n4 n5 n6 n3 n7 n9 n8 0.8 0.9 n7 725
Complexity (for two types of objects, N of each, and M linkages between them)  Updating similarities: O(M (log N)²) time, O(M + N) space  Adjusting tree structures: O(N) time, O(N) space  LinkClus overall: O(M (log N)²) time, O(M + N) space  SimRank: O(M²) time, O(N²) space 726
Experiment: Email Dataset  F. Nielsen. Email dataset. www.imm.dtu.dk/~rem/data/Email-1431.zip  370 emails on conferences, 272 on jobs, and 789 spam emails  Accuracy: measured by manually labeled data (% of pairs of objects in the same cluster that share a common label)  Results (accuracy, time in seconds): LinkClus 0.8026, 1579.6; SimRank 0.7965, 39160; ReCom 0.5711, 74.6; F-SimRank 0.3688, 479.7; CLARANS 0.4768, 8.55  Approaches compared:  SimRank (Jeh & Widom, KDD 2002): computing pair-wise similarities  SimRank with fingerprints (F-SimRank): Fogaras & Rácz, WWW 2005; pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity  ReCom (Wang et al., SIGIR 2003): iteratively clustering objects using the cluster labels of linked objects 727
    WaveCluster: Clustering byWavelet Analysis (1998)  Sheikholeslami, Chatterjee, and Zhang (VLDB’98)  A multi-resolution clustering approach which applies wavelet transform to the feature space; both grid-based and density-based  Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-band  Data are transformed to preserve relative distance between objects at different levels of resolution  Allows natural clusters to become more distinguishable 728
    The WaveCluster Algorithm How to apply wavelet transform to find clusters  Summarizes the data by imposing a multidimensional grid structure onto data space  These multidimensional spatial data objects are represented in a n-dimensional feature space  Apply wavelet transform on feature space to find the dense regions in the feature space  Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse  Major features:  Complexity O(N)  Detect arbitrary shaped clusters at different scales  Not sensitive to noise, not sensitive to input order  Only applicable to low dimensional data 729
730 Quantization & Transformation  Quantize data into an m-D grid structure, then apply the wavelet transform  a) scale 1: high resolution  b) scale 2: medium resolution  c) scale 3: low resolution
    731 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 11 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 731
732 Review: Basic Cluster Analysis Methods (Chap. 10)  Cluster Analysis: Basic Concepts  Group data so that object similarity is high within clusters but low across clusters  Partitioning Methods  K-means and k-medoids algorithms and their refinements  Hierarchical Methods  Agglomerative and divisive methods, BIRCH, CHAMELEON  Density-Based Methods  DBSCAN, OPTICS, and DenClue  Grid-Based Methods  STING and CLIQUE (subspace clustering)  Evaluation of Clustering  Assess clustering tendency, determine # of clusters, and measure clustering quality 732
    K-Means Clustering K=2 Arbitrarily partition objects into k groups Update thecluster centroids Update the cluster centroids Reassign objects Loop if needed 733 The initial data set  Partition objects into k nonempty subsets  Repeat  Compute centroid (i.e., mean point) for each partition  Assign each object to the cluster of its nearest centroid  Until no change
    Hierarchical Clustering  Usedistance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Step 0 Step 1 Step 2 Step 3 Step 4 b d c e a a b d e c d e a b c d e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative (AGNES) divisive (DIANA) 734
    Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located object in the cluster X X 735
BIRCH and the Clustering Feature (CF) Tree Structure  [Figure: a CF tree with branching factor B = 7 and maximum leaf entries L = 6; the root and non-leaf nodes hold CF entries with child pointers, and leaf nodes hold CF entries chained by prev/next pointers]  Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 736
    Overall Framework ofCHAMELEON Construct (K-NN) Sparse Graph Partition the Graph Merge Partition Final Clusters Data Set K-NN Graph P and q are connected if q is among the top k closest neighbors of p Relative interconnectivity: connectivity of c1 and c2 over internal connectivity Relative closeness: closeness of c1 and c2 over internal closeness 737
    Density-Based Clustering: DBSCAN Two parameters:  Eps: Maximum radius of the neighbourhood  MinPts: Minimum number of points in an Eps- neighbourhood of that point  NEps(p): {q belongs to D | dist(p,q) ≤ Eps}  Directly density-reachable: A point p is directly density- reachable from a point q w.r.t. Eps, MinPts if  p belongs to NEps(q)  core point condition: |NEps (q)| ≥ MinPts MinPts = 5 Eps = 1 cm p q 738
STING: A Statistical Information Grid Approach  Wang, Yang and Muntz (VLDB’97)  The spatial area is divided into rectangular cells  There are several levels of cells corresponding to different levels of resolution (1st layer, …, (i−1)-st layer, i-th layer) 741
    Evaluation of ClusteringQuality  Assessing Clustering Tendency  Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution  Determine the Number of Clusters  Empirical method: # of clusters ≈√n/2  Elbow method: Use the turning point in the curve of sum of within cluster variance w.r.t # of clusters  Cross validation method  Measuring Clustering Quality  Extrinsic: supervised  Compare a clustering against the ground truth using certain clustering quality measure  Intrinsic: unsupervised  Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are 742
    743 Outline of AdvancedClustering Analysis  Probability Model-Based Clustering  Each object may take a probability to belong to a cluster  Clustering High-Dimensional Data  Curse of dimensionality: Difficulty of distance measure in high-D space  Clustering Graphs and Network Data  Similarity measurement and clustering methods for graph and networks  Clustering with Constraints  Cluster analysis under different kinds of constraints, e.g., that raised from background knowledge or spatial distribution of the objects
744 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 744
    Fuzzy Set andFuzzy Cluster  Clustering methods discussed so far  Every data object is assigned to exactly one cluster  Some applications may need for fuzzy or soft cluster assignment  Ex. An e-game could belong to both entertainment and software  Methods: fuzzy clusters and probabilistic model-based clusters  Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)  Example: Popularity of cameras is defined as a fuzzy mapping  Then, A(0.05), B(1), C(0.86), D(0.27) 745
Fuzzy (Soft) Clustering  Example: let the cluster features be  C1: “digital camera” and “lens”  C2: “computer”  Fuzzy clustering  k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]  P1: for each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)  P2: for each object oi, Σj=1..k wij = 1 (equal participation in the clustering)  P3: for each cluster Cj, 0 < Σi=1..n wij < n (ensures there is no empty cluster)  Let c1, …, ck be the centers of the k clusters  For an object oi, the sum of squared error (SSE), with parameter p: SSE(oi) = Σj=1..k wij^p dist(oi, cj)²  For a cluster Cj: SSE(Cj) = Σi=1..n wij^p dist(oi, cj)²  Measure how well a clustering fits the data: SSE(C) = Σi=1..n Σj=1..k wij^p dist(oi, cj)² 746
Probabilistic Model-Based Clustering  Cluster analysis is to find hidden categories  A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function)  Ex. 2 categories for digital cameras sold  consumer line vs. professional line  density functions f1, f2 for C1, C2  obtained by probabilistic clustering  A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently  Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process 747
    748 Model-Based Clustering  Aset C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.  Probability of an object o generated by cluster Cj is  Probability of o generated by the set of cluster C is  Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have,  Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized  However, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated form  To make it computationally feasible (as a compromise), assume the probability density functions being some parameterized distributions
    749 Univariate Gaussian MixtureModel  O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have  Univariate Gaussian mixture model  Assume the probability density function of each cluster follows a 1- d Gaussian distribution. Suppose that there are k clusters.  The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have
    The EM (ExpectationMaximization) Algorithm  The k-means algorithm has two steps at each iteration:  Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: An object is expected to belong to the closest cluster  Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of distance from the objects assigned to this cluster and the new center is minimized  The (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.  E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters  M-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood 750
    Fuzzy Clustering Usingthe EM Algorithm  Initially, let c1 = a and c2 = b  1st E-step: assign o to c1,w. wt =   1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)  Iteratively calculate this until the cluster centers converge or the change is small enough
    752 Univariate Gaussian MixtureModel  O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have  Univariate Gaussian mixture model  Assume the probability density function of each cluster follows a 1- d Gaussian distribution. Suppose that there are k clusters.  The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have
753 Computing Mixture Models with EM  Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t. P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution  We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until convergence or until the change is sufficiently small  At the E-step, for each object oi, calculate the probability that oi belongs to each distribution  At the M-step, adjust the parameters θj = (μj, σj) so that the expected likelihood P(O|Θ) is maximized
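A minimal EM sketch for a univariate Gaussian mixture (Python/NumPy assumed; equal mixing weights are assumed for brevity, and the synthetic data is hypothetical):

```python
import numpy as np

def em_gmm_1d(o, k=2, iters=100, seed=0):
    """EM sketch for a 1-D Gaussian mixture with equal cluster weights.
    Returns the means and standard deviations of the k components."""
    rng = np.random.default_rng(seed)
    o = np.asarray(o, dtype=float)
    mu = rng.choice(o, size=k, replace=False)   # random initial means
    sigma = np.full(k, o.std() + 1e-6)
    for _ in range(iters):
        # E-step: responsibility of each component j for each object oi
        dens = np.array([
            np.exp(-(o - mu[j]) ** 2 / (2 * sigma[j] ** 2)) / (np.sqrt(2 * np.pi) * sigma[j])
            for j in range(k)
        ])                                      # shape (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate each component's parameters from the weighted objects
        for j in range(k):
            w = resp[j]
            mu[j] = (w * o).sum() / w.sum()
            sigma[j] = np.sqrt((w * (o - mu[j]) ** 2).sum() / w.sum()) + 1e-6
    return mu, sigma

data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(8, 1.5, 200)])
mu, sigma = em_gmm_1d(data, k=2)
```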
Advantages and Disadvantages of Mixture Models  Strength  Mixture models are more general than partitioning and fuzzy clustering  Clusters can be characterized by a small number of parameters  The results may satisfy the statistical assumptions of the generative models  Weakness  Converges to a local optimum (overcome: run multiple times with random initialization)  Computationally expensive if the number of distributions is large, or the data set contains very few observed data points  Needs large data sets  Hard to estimate the number of clusters 754
755 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 755
    756 Clustering High-Dimensional Data Clustering high-dimensional data (How high is high-D in clustering?)  Many applications: text documents, DNA micro-array data  Major challenges:  Many irrelevant dimensions may mask clusters  Distance measure becomes meaningless—due to equi-distance  Clusters may exist only in some subspaces  Methods  Subspace-clustering: Search for clusters existing in subspaces of the given high dimensional data space  CLIQUE, ProClus, and bi-clustering approaches  Dimensionality reduction approaches: Construct a much lower dimensional space and search for clusters there (may construct new dimensions by combining some dimensions in the original data)  Dimensionality reduction methods and spectral clustering
    Traditional Distance MeasuresMay Not Be Effective on High-D Data  Traditional distance measure could be dominated by noises in many dimensions  Ex. Which pairs of customers are more similar?  By Euclidean distance, we get,  despite Ada and Cathy look more similar  Clustering should not only consider dimensions but also attributes (features)  Feature transformation: effective if most dimensions are relevant (PCA & SVD useful when features are highly correlated/redundant)  Feature selection: useful to find a subspace where the data have nice clusters 757
    758 The Curse ofDimensionality (graphs adapted from Parsons et al. KDD Explorations 2004)  Data in only one dimension is relatively packed  Adding a dimension “stretch” the points across that dimension, making them further apart  Adding more dimensions will make the points further apart—high dimensional data is extremely sparse  Distance measure becomes meaningless—due to equi-distance
    759 Why Subspace Clustering? (adaptedfrom Parsons et al. SIGKDD Explorations 2004)  Clusters may exist only in some subspaces  Subspace-clustering: find clusters in all the subspaces
    Subspace Clustering Methods Subspace search methods: Search various subspaces to find clusters  Bottom-up approaches  Top-down approaches  Correlation-based clustering methods  E.g., PCA based approaches  Bi-clustering methods  Optimization-based methods  Enumeration methods
    Subspace Clustering Method(I): Subspace Search Methods  Search various subspaces to find clusters  Bottom-up approaches  Start from low-D subspaces and search higher-D subspaces only when there may be clusters in such subspaces  Various pruning techniques to reduce the number of higher-D subspaces to be searched  Ex. CLIQUE (Agrawal et al. 1998)  Top-down approaches  Start from full space and search smaller subspaces recursively  Effective only if the locality assumption holds: restricts that the subspace of a cluster can be determined by the local neighborhood  Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method 761
762 CLIQUE: Subspace Clustering with Apriori Pruning  [Figure: dense units found in the (age, salary) and (age, vacation) subspaces for a density threshold τ = 3, and their intersection in the 3-D (age, vacation, salary) space]
    Subspace Clustering Method(II): Correlation-Based Methods  Subspace search method: similarity based on distance or density  Correlation-based method: based on advanced correlation models  Ex. PCA-based approach:  Apply PCA (for Principal Component Analysis) to derive a set of new, uncorrelated dimensions,  then mine clusters in the new space or its subspaces  Other space transformations:  Hough transform  Fractal dimensions 763
    Subspace Clustering Method(III): Bi-Clustering Methods  Bi-clustering: Cluster both objects and attributes simultaneously (treat objs and attrs in symmetric way)  Four requirements:  Only a small set of objects participate in a cluster  A cluster only involves a small number of attributes  An object may participate in multiple clusters, or does not participate in any cluster at all  An attribute may be involved in multiple clusters, or is not involved in any cluster at all 764  Ex 1. Gene expression or microarray data: a gene sample/condition matrix.  Each element in the matrix, a real number, records the expression level of a gene under a specific condition  Ex. 2. Clustering customers and products  Another bi-clustering problem
Types of Bi-clusters  Let A = {a1, ..., an} be a set of genes and B = {b1, …, bm} a set of conditions  A bi-cluster: a submatrix where genes and conditions follow consistent patterns  4 types of bi-clusters (ideal cases)  Bi-clusters with constant values:  for any i in I and j in J, eij = c  Bi-clusters with constant values on rows:  eij = c + αi  Similarly, there can be constant values on columns  Bi-clusters with coherent values (a.k.a. pattern-based clusters):  eij = c + αi + βj  Bi-clusters with coherent evolutions on rows:  (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0 for any i1, i2 in I and j1, j2 in J  i.e., we are only interested in the up- or down-regulated changes across genes or conditions, without constraining the exact values 765
    Bi-Clustering Methods  Real-worlddata is noisy: Try to find approximate bi-clusters  Methods: Optimization-based methods vs. enumeration methods  Optimization-based methods  Try to find a submatrix at a time that achieves the best significance as a bi-cluster  Due to the cost in computation, greedy search is employed to find local optimal bi-clusters  Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)  Enumeration methods  Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined  Then try to enumerate all submatrices as bi-clusters that satisfy the requirements  Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al., ICDM’2003) 766
    767 Bi-Clustering for Micro-ArrayData Analysis  Left figure: Micro-array “raw” data shows 3 genes and their values in a multi-D space: Difficult to find their patterns  Right two: Some subsets of dimensions form nice shift and scaling patterns  No globally defined similarity/distance measure  Clusters may not be exclusive  An object can appear in multiple clusters
Bi-Clustering (I): δ-Bi-Cluster  For a submatrix I x J, the mean of the i-th row: eiJ = (1/|J|) Σ_{j∈J} eij  The mean of the j-th column: eIj = (1/|I|) Σ_{i∈I} eij  The mean of all elements in the submatrix: eIJ = (1/(|I||J|)) Σ_{i∈I, j∈J} eij  The quality of the submatrix as a bi-cluster can be measured by the mean squared residue value H(I x J) = (1/(|I||J|)) Σ_{i∈I, j∈J} residue(eij)², where residue(eij) = eij − eiJ − eIj + eIJ  A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold. When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0, a user can specify the tolerance of average noise per element against a perfect bi-cluster (see the sketch below) 768
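A minimal NumPy sketch of the mean squared residue computation above; the matrix E and the index lists I and J are illustrative.

import numpy as np

def mean_squared_residue(E, I, J):
    """Mean squared residue H(I x J) of the submatrix of E restricted to rows I and columns J."""
    sub = E[np.ix_(I, J)]                        # the submatrix I x J
    row_mean = sub.mean(axis=1, keepdims=True)   # e_iJ for each row i
    col_mean = sub.mean(axis=0, keepdims=True)   # e_Ij for each column j
    all_mean = sub.mean()                        # e_IJ
    residue = sub - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

# A submatrix is a delta-bi-cluster if H(I x J) <= delta
E = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [4.0, 5.0, 6.0]])
print(mean_squared_residue(E, [0, 1, 2], [0, 1, 2]))   # 0.0: perfectly coherent values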
Bi-Clustering (I): The δ-Cluster Algorithm  A maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist another δ-bi-cluster I′ x J′ that contains I x J  Computing it exactly is costly: use heuristic greedy search to obtain locally optimal clusters  Two-phase computation: a deletion phase and an addition phase  Deletion phase: start from the whole matrix and iteratively remove rows and columns while the mean squared residue of the matrix is over δ  At each iteration, for each row/column, compute its mean squared residue, d(i) = (1/|J|) Σ_{j∈J} residue(eij)² for a row and d(j) = (1/|I|) Σ_{i∈I} residue(eij)² for a column  Remove the row or column with the largest mean squared residue  Addition phase:  Iteratively expand the δ-bi-cluster I x J obtained in the deletion phase as long as the δ-bi-cluster requirement is maintained  Consider all the rows/columns not involved in the current bi-cluster I x J by calculating their mean squared residues  The row/column with the smallest mean squared residue is added into the current δ-bi-cluster  The algorithm finds only one δ-bi-cluster per run, so it is run multiple times, replacing the elements of each output bi-cluster with random numbers before the next run 769
Bi-Clustering (II): δ-pCluster  Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern similarity in large data sets. SIGMOD'02]  A submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2 − ei2j2 for every pair of rows and columns. For any 2 x 2 submatrix of I x J, define the p-score = |(ei1j1 − ei1j2) − (ei2j1 − ei2j2)|  A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of noise against a perfect bi-cluster (see the sketch below)  The p-score controls the noise on every element in a bi-cluster, while the mean squared residue captures the average noise  Monotonicity: if I x J is a δ-pCluster, every x x y (x, y ≥ 2) submatrix of I x J is also a δ-pCluster  A δ-pCluster is maximal if no more rows or columns can be added while retaining the δ-pCluster property: we only need to compute all maximal δ-pClusters 770
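A small brute-force sketch of the p-score test above; the function names and the example matrix are illustrative (a real miner such as MaPle enumerates clusters far more cleverly).

import numpy as np
from itertools import combinations

def p_score(block_2x2):
    """p-score of a 2 x 2 submatrix [[e11, e12], [e21, e22]]."""
    (e11, e12), (e21, e22) = block_2x2
    return abs((e11 - e12) - (e21 - e22))

def is_delta_pcluster(E, I, J, delta):
    """True if every 2 x 2 submatrix of E[I, J] has p-score <= delta (brute force)."""
    for i1, i2 in combinations(I, 2):
        for j1, j2 in combinations(J, 2):
            if p_score([[E[i1, j1], E[i1, j2]], [E[i2, j1], E[i2, j2]]]) > delta:
                return False
    return True

E = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [4.0, 5.0, 6.1]])
print(is_delta_pcluster(E, [0, 1, 2], [0, 1, 2], delta=0.2))   # True: noise per 2x2 block is at most 0.1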
MaPle: Efficient Enumeration of δ-pClusters  Pei et al., MaPle: Efficiently enumerating all maximal δ-pClusters, ICDM'03  Framework: same as pattern growth in frequent pattern mining (based on the downward closure property)  For each condition combination J, find the maximal subsets of genes I such that I x J is a δ-pCluster  If I x J is not a submatrix of another δ-pCluster, then I x J is a maximal δ-pCluster  The algorithm is very similar to mining frequent closed itemsets  Additional advantages of δ-pClusters:  Due to the averaging in the δ-cluster model, a δ-cluster may contain outliers yet still stay within the δ-threshold  Computing bi-clusters for scaling patterns (dxa / dya = dxb / dyb): taking the logarithm on both sides leads to the p-score form 771
    Dimensionality-Reduction Methods  Dimensionalityreduction: In some situations, it is more effective to construct a new space instead of using some subspaces of the original data 772  Ex. To cluster the points in the right figure, any subspace of the original one, X and Y, cannot help, since all the three clusters will be projected into the overlapping areas in X and Y axes.  Construct a new dimension as the dashed one, the three clusters become apparent when the points projected into the new dimension  Dimensionality reduction methods  Feature selection and extraction: But may not focus on clustering structure finding  Spectral clustering: Combining feature extraction and clustering (i.e., use the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions)  Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)  The Ng-Jordan-Weiss algorithm (NIPS’01)
Spectral Clustering: The Ng-Jordan-Weiss (NJW) Algorithm  Given a set of objects o1, …, on and the distance between each pair of objects, dist(oi, oj), find the desired number k of clusters  Calculate an affinity matrix W, where Wij = exp(−dist(oi, oj)²/σ²) for i ≠ j, and σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, set Wii = 0  Derive a matrix A = f(W). NJW defines D to be the diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij. Then, A is set to A = D^(−1/2) W D^(−1/2)  A spectral clustering method finds the k leading eigenvectors of A  A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigenvalue  Using the k leading eigenvectors, project the original data into the new space they define, and run a clustering algorithm, such as k-means, to find k clusters (see the sketch below)  Assign the original data points to clusters according to how the transformed points are assigned 773
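A compact sketch of these steps with NumPy and scikit-learn's KMeans; the scaling parameter sigma, the choice of n_init, and the toy data are assumptions to be tuned.

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=1.0):
    # Affinity matrix: W_ij = exp(-dist(o_i, o_j)^2 / sigma^2), with W_ii = 0
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # A = D^(-1/2) W D^(-1/2)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # k leading eigenvectors of A (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(A)
    U = vecs[:, -k:]
    # Row-normalize and cluster the projected points with k-means
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Example: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print(njw_spectral_clustering(X, k=2))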
    Spectral Clustering: Illustrationand Comments  Spectral clustering: Effective in tasks like image processing  Scalability challenge: Computing eigenvectors on a large matrix is costly  Can be combined with other clustering methods, such as bi-clustering 774
775 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 775
    Clustering Graphs andNetwork Data  Applications  Bi-partite graphs, e.g., customers and products, authors and conferences  Web search engines, e.g., click through graphs and Web graphs  Social networks, friendship/coauthor graphs  Similarity measures  Geodesic distances  Distance based on random walk (SimRank)  Graph clustering methods  Minimum cuts: FastModularity (Clauset, Newman & Moore, 2004)  Density-based clustering: SCAN (Xu et al., KDD’2007) 776
Similarity Measure (I): Geodesic Distance  Geodesic distance(A, B): length (i.e., # of edges) of the shortest path between A and B (defined as infinite if they are not connected)  Eccentricity of v, eccen(v): the largest geodesic distance between v and any other vertex u ∈ V − {v}  E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3  Radius of graph G: the minimum eccentricity of all vertices, i.e., the distance between the "most central point" and the "farthest border"  r = min_{v∈V} eccen(v)  E.g., radius(G) = 2  Diameter of graph G: the maximum eccentricity of all vertices, i.e., the largest distance between any pair of vertices in G  d = max_{v∈V} eccen(v)  E.g., diameter(G) = 3  A peripheral vertex is a vertex that achieves the diameter  E.g., vertices c, d, and e are peripheral vertices (see the sketch below) 777
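A plain-Python BFS sketch that computes these quantities for an adjacency-list graph; the example edge list is illustrative, not the slide's figure.

from collections import deque

def geodesic_distances(adj, source):
    """BFS shortest-path lengths (# of edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, v):
    d = geodesic_distances(adj, v)
    # infinite if the graph is not connected from v
    return max(d.values()) if len(d) == len(adj) else float("inf")

adj = {               # an illustrative undirected graph
    "a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b", "e"],
    "d": ["b"], "e": ["c"],
}
ecc = {v: eccentricity(adj, v) for v in adj}
radius = min(ecc.values())
diameter = max(ecc.values())
peripheral = [v for v, e in ecc.items() if e == diameter]
print(ecc, radius, diameter, peripheral)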
SimRank: Similarity Based on Random Walk and Structural Context  SimRank: structural-context similarity, i.e., two vertices are similar if their neighbors are similar  In a directed graph G = (V, E),  individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}  individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}  Similarity in SimRank: s(u, v) = (C / (|I(u)| |I(v)|)) Σ_{x∈I(u)} Σ_{y∈I(v)} s(x, y), where C ∈ (0, 1) is a decay constant, s(u, u) = 1, and s(u, v) = 0 if I(u) or I(v) is empty  Initialization: s0(u, v) = 1 if u = v, and 0 otherwise  Then we can compute si+1 from si based on the definition (see the sketch below)  Similarity based on random walk, in a strongly connected component, where P[t] is the probability of a tour t and l(t) its length:  Expected distance: d(u, v) = Σ P[t]·l(t) over all tours t from u to v  Expected meeting distance: m(u, v) = Σ P[t]·l(t) over all tours t that bring u and v to a common vertex  Expected meeting probability: p(u, v) = Σ P[t]·C^l(t) over the same tours 778
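A minimal sketch of the iterative SimRank computation on a small directed graph; the decay constant C, the iteration count, and the example edges are illustrative assumptions.

def simrank(nodes, edges, C=0.8, iters=10):
    """Iterative SimRank: s(u,v) = C/(|I(u)||I(v)|) * sum of s over pairs of in-neighbors."""
    in_nbrs = {v: [u for (u, w) in edges if w == v] for v in nodes}
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}  # s_0
    for _ in range(iters):
        new_s = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_s[(u, v)] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(s[(x, y)] for x in in_nbrs[u] for y in in_nbrs[v])
                    new_s[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    new_s[(u, v)] = 0.0
        s = new_s
    return s

nodes = ["univ", "profA", "profB", "studentA", "studentB"]
edges = [("univ", "profA"), ("univ", "profB"),
         ("profA", "studentA"), ("profB", "studentB"),
         ("studentA", "univ"), ("studentB", "univ")]
s = simrank(nodes, edges)
print(round(s[("profA", "profB")], 3), round(s[("studentA", "studentB")], 3))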
Graph Clustering: Sparsest Cut  G = (V, E). For a cut C = (S, T) partitioning V into S and T, the cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}  Size of the cut: # of edges in the cut set  Min-cut (e.g., C1) is not a good partition  A better measure, sparsity: Φ(C) = (cut size) / min(|S|, |T|)  A cut is sparsest if its sparsity is not greater than that of any other cut  Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut  For k clusters, the modularity of a clustering assesses its quality: Q = Σ_{i=1..k} (li/|E| − (di/(2|E|))²), where li is the # of edges between vertices in the i-th cluster and di is the sum of the degrees of the vertices in the i-th cluster  The modularity of a clustering of a graph is the difference between the fraction of all edges that fall within individual clusters and the fraction that would do so if the graph vertices were randomly connected  The optimal clustering of a graph maximizes the modularity (see the sketch below) 779
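A small sketch that evaluates a given partition with these two measures; the edge list and the two-way partition are illustrative.

def sparsity(edges, S, T):
    """Cut sparsity: (# of crossing edges) / min(|S|, |T|)."""
    cut = sum(1 for (u, v) in edges
              if (u in S and v in T) or (u in T and v in S))
    return cut / min(len(S), len(T))

def modularity(edges, clusters):
    """Q = sum_i ( l_i/|E| - (d_i / (2|E|))^2 ) for a list of vertex sets."""
    m = len(edges)
    Q = 0.0
    for c in clusters:
        l_i = sum(1 for (u, v) in edges if u in c and v in c)       # edges inside cluster i
        d_i = sum(1 for (u, v) in edges for x in (u, v) if x in c)  # degree sum of cluster i
        Q += l_i / m - (d_i / (2 * m)) ** 2
    return Q

edges = [("a", "b"), ("b", "c"), ("a", "c"),      # a triangle ...
         ("c", "d"),                              # ... weakly linked to ...
         ("d", "e"), ("e", "f"), ("d", "f")]      # ... another triangle
S, T = {"a", "b", "c"}, {"d", "e", "f"}
print(sparsity(edges, S, T))            # 1/3: a sparse cut
print(modularity(edges, [S, T]))        # fairly high: the two triangles form good clusters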
    Graph Clustering: Challengesof Finding Good Cuts  High computational cost  Many graph cut problems are computationally expensive  The sparsest cut problem is NP-hard  Need to tradeoff between efficiency/scalability and quality  Sophisticated graphs  May involve weights and/or cycles.  High dimensionality  A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph  Sparsity  A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices  A similarity matrix from a large sparse graph can also be sparse 780
    Two Approaches forGraph Clustering  Two approaches for clustering graph data  Use generic clustering methods for high-dimensional data  Designed specifically for clustering graphs  Using clustering methods for high-dimensional data  Extract a similarity matrix from a graph using a similarity measure  A generic clustering method can then be applied on the similarity matrix to discover clusters  Ex. Spectral clustering: approximate optimal graph cut solutions  Methods specific to graphs  Search the graph to find well-connected components as clusters  Ex. SCAN (Structural Clustering Algorithm for Networks)  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07 781
    SCAN: Density-Based Clusteringof Networks  How many clusters?  What size should they be?  What is the best partitioning?  Should some points be segregated? 782 An Example Network  Application: Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
A Social Network Model  Cliques, hubs, and outliers  Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group  Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups  Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group  The Neighborhood of a Vertex  Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set of people that an individual knows) 783
Structure Similarity  The desired features tend to be captured by a measure we call structural similarity: σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)|·|Γ(w)|)  Structural similarity is large for members of a clique and small for hubs and outliers 784
Structural Connectivity [1]  ε-Neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}  Core: COREε,μ(v) ⇔ |Nε(v)| ≥ μ  Direct structure reachable: DirREACHε,μ(v, w) ⇔ COREε,μ(v) ∧ w ∈ Nε(v)  Structure reachable: REACHε,μ(v, w), the transitive closure of direct structure reachability  Structure connected: CONNECTε,μ(v, w) ⇔ ∃u ∈ V: REACHε,μ(u, v) ∧ REACHε,μ(u, w)  [1] M. Ester, H.-P. Kriegel, J. Sander, & X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" 785
Structure-Connected Clusters  A structure-connected cluster C satisfies  Connectivity: ∀v, w ∈ C: CONNECTε,μ(v, w)  Maximality: ∀v, w ∈ V: v ∈ C ∧ REACHε,μ(v, w) ⇒ w ∈ C  Hubs:  Do not belong to any cluster  Bridge many clusters  Outliers:  Do not belong to any cluster  Connect to fewer clusters  (Figure: an example network with a hub and an outlier marked; a similarity sketch follows) 786
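A short sketch of the structural similarity and the ε-neighborhood / core test that SCAN builds on; the toy graph, ε, and μ are illustrative assumptions (following the convention that Γ(v) includes v itself).

import math

def gamma(adj, v):
    """Immediate neighborhood of v, including v itself."""
    return set(adj[v]) | {v}

def sigma(adj, v, w):
    """Structural similarity: |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| * |Γ(w)|)."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def eps_neighborhood(adj, v, eps):
    return {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps}

def is_core(adj, v, eps, mu):
    return len(eps_neighborhood(adj, v, eps)) >= mu

adj = {  # a small clique {a, b, c, d} plus a loosely attached vertex e
    "a": ["b", "c", "d"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["a", "b", "c", "e"], "e": ["d"],
}
print(round(sigma(adj, "a", "b"), 2))        # high: a and b share most neighbors
print(is_core(adj, "a", eps=0.7, mu=2))      # True inside the clique
print(is_core(adj, "e", eps=0.7, mu=2))      # False: e sits on the margin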
    Running Time  Runningtime = O(|E|)  For sparse networks = O(|V|) [2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004). 800
Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 801
802 Why Constraint-Based Cluster Analysis?  Need user feedback: users know their applications best  Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters
    803 Categorization of Constraints Constraints on instances: specifies how a pair or a set of instances should be grouped in the cluster analysis  Must-link vs. cannot link constraints  must-link(x, y): x and y should be grouped into one cluster  Constraints can be defined using variables, e.g.,  cannot-link(x, y) if dist(x, y) > d  Constraints on clusters: specifies a requirement on the clusters  E.g., specify the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., a convex), # of clusters (e.g., k)  Constraints on similarity measurements: specifies a requirement that the similarity calculation must respect  E.g., driving on roads, obstacles (e.g., rivers, lakes)  Issues: Hard vs. soft constraints; conflicting or redundant constraints
804 Constraint-Based Clustering Methods (I): Handling Hard Constraints  Handling hard constraints: strictly respect the constraints in cluster assignments  Example: the COP-k-means algorithm  Generate super-instances for must-link constraints  Compute the transitive closure of the must-link constraints  To represent such a subset, replace all the objects in the subset by their mean  The super-instance also carries a weight, which is the number of objects it represents  Conduct modified k-means clustering to respect the cannot-link constraints  Modify the center-assignment process in k-means to a nearest feasible center assignment  An object is assigned to the nearest center such that the assignment respects all cannot-link constraints (see the sketch below)
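A sketch of the nearest-feasible-center assignment step described above; the cannot-link representation, tie handling, and the toy data are simplifying assumptions, not the full COP-k-means algorithm.

import numpy as np

def assign_nearest_feasible(points, centers, cannot_link):
    """Assign each point to its nearest center without violating cannot-link pairs.

    cannot_link: list of (i, j) index pairs that must not share a cluster.
    Returns None for a point if no feasible center exists (COP-k-means then fails).
    """
    n, assign = len(points), [None] * len(points)
    for i in range(n):
        # centers tried from nearest to farthest
        order = np.argsort(((centers - points[i]) ** 2).sum(axis=1))
        for c in order:
            conflict = any(assign[j] == c
                           for (a, b) in cannot_link
                           for j in ((b,) if a == i else (a,) if b == i else ()))
            if not conflict:
                assign[i] = int(c)
                break
    return assign

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
# points 0 and 1 may not be clustered together
print(assign_nearest_feasible(points, centers, cannot_link=[(0, 1)]))   # [0, 1, 1]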
    Constraint-Based Clustering Methods(II): Handling Soft Constraints  Treated as an optimization problem: When a clustering violates a soft constraint, a penalty is imposed on the clustering  Overall objective: Optimizing the clustering quality, and minimizing the constraint violation penalty  Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct k-means clustering while enforcing constraint violation penalties  Objective function: Sum of distance used in k-means, adjusted by the constraint violation penalties  Penalty of a must-link violation  If objects x and y must-be-linked but they are assigned to two different centers, c1 and c2, dist(c1, c2) is added to the objective function as the penalty  Penalty of a cannot-link violation  If objects x and y cannot-be-linked but they are assigned to a common center c, dist(c, c′), between c and c′ is added to the objective function as the penalty, where c′ is the closest cluster to c that can accommodate x or y 805
806 Speeding Up Constrained Clustering  Some constrained clusterings are costly to compute  Ex. clustering with obstacle objects: Tung, Hou, and Han, Spatial clustering in the presence of obstacles, ICDE'01  K-medoids is preferable, since k-means may locate an ATM center in the middle of a lake  Visibility graph and shortest path  Triangulation and micro-clustering  Two kinds of join indices (shortest paths) are worth pre-computing  VV index: indices for any pair of obstacle vertices  MV index: indices for any pair of micro-cluster and obstacle vertex
    807 An Example: ClusteringWith Obstacle Objects Taking obstacles into account Not Taking obstacles into account
808 User-Guided Clustering: A Special Kind of Constraints  (Figure: a multi-relational schema—Professor, Course, Open-course, Student, Register, Advise, Group, Work-In, Publication, Publish—with the target of clustering and the user-hint attribute marked)  X. Yin, J. Han, P. S. Yu, "Cross-Relational Clustering with User's Guidance", KDD'05  The user usually has a clustering goal, e.g., clustering students by research area  The user specifies this clustering goal to CrossClus
    809 Comparing with Classification User-specified feature (in the form of attribute) is used as a hint, not class labels  The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3  Additional features need to be included in cluster analysis All tuples for clustering User hint
    810 Comparing with Semi-SupervisedClustering  Semi-supervised clustering: User provides a training set consisting of “similar” (“must-link) and “dissimilar” (“cannot link”) pairs of objects  User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering All tuples for clustering Semi-supervised clustering All tuples for clustering User-guided clustering x
811 Why Not Semi-Supervised Clustering?  Much information (in multiple relations) is needed to judge whether two tuples are similar  A user may not be able to provide a good training set  It is much easier for a user to specify an attribute as a hint, such as a student's research area  (Figure: two tuples to be compared, e.g., "Tom Smith, SC1211, TA" and "Jane Chang, BI205, RA", with the user-hint attribute marked)
    812 CrossClus: An Overview Measure similarity between features by how they group objects into clusters  Use a heuristic method to search for pertinent features  Start from user-specified feature and gradually expand search range  Use tuple ID propagation to create feature values  Features can be easily created during the expansion of search range, by propagating IDs  Explore three clustering algorithms: k-means, k-medoids, and hierarchical clustering
813 Multi-Relational Features  A multi-relational feature is defined by:  A join path, e.g., Student → Register → OpenCourse → Course  An attribute, e.g., Course.area  (For a numerical feature) an aggregation operator, e.g., sum or average  Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]  Areas of the courses of each student (counts over DB, AI, TH): t1: (5, 5, 0); t2: (0, 3, 7); t3: (1, 5, 4); t4: (5, 0, 5); t5: (3, 3, 4)  Values of feature f (normalized, over DB, AI, TH): f(t1) = (0.5, 0.5, 0); f(t2) = (0, 0.3, 0.7); f(t3) = (0.1, 0.5, 0.4); f(t4) = (0.5, 0, 0.5); f(t5) = (0.3, 0.3, 0.4)
814 Representing Features  Similarity between tuples t1 and t2 w.r.t. a categorical feature f: the cosine similarity between vectors f(t1) and f(t2), sim_f(t1, t2) = Σ_{k=1..L} f(t1).pk·f(t2).pk / (√(Σ_{k=1..L} f(t1).pk²) · √(Σ_{k=1..L} f(t2).pk²))  The most important information of a feature f is how f groups tuples into clusters  f is represented by the similarities between every pair of tuples indicated by f  (In the figure, the horizontal axes are the tuple indices and the vertical axis is the similarity)  This can be considered as a similarity vector Vf of N x N dimensions
815 Similarity Between Features  Values of feature f (course: DB, AI, TH) and feature g (group: Info sys, Cog sci, Theory):  t1: f = (0.5, 0.5, 0), g = (1, 0, 0); t2: f = (0, 0.3, 0.7), g = (0, 0, 1); t3: f = (0.1, 0.5, 0.4), g = (0, 0.5, 0.5); t4: f = (0.5, 0, 0.5), g = (0.5, 0, 0.5); t5: f = (0.3, 0.3, 0.4), g = (0.5, 0.5, 0)  Similarity between two features: the cosine similarity of the two similarity vectors, sim(f, g) = Vf · Vg / (|Vf| |Vg|)
816 Computing Feature Similarity  Similarity between feature values w.r.t. the tuples: sim(fk, gq) = Σ_{i=1..N} f(ti).pk·g(ti).pq  Then Vf · Vg = Σ_{i=1..N} Σ_{j=1..N} sim_f(ti, tj)·sim_g(ti, tj) = Σ_{k=1..l} Σ_{q=1..m} sim(fk, gq)²  Tuple similarities are hard to compute directly; feature value similarities are easy to compute  Compute the similarity between each pair of feature values by one scan over the data (see the sketch below)
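A small NumPy sketch of this identity on the example feature values; here the tuple vectors are explicitly L2-normalized first (an assumption made so that the direct and the one-scan computations match exactly).

import numpy as np

# f(ti) over course areas (DB, AI, TH) and g(ti) over groups (Info sys, Cog sci, Theory)
F = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]])
G = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])

def tuple_similarity_matrix(F):
    """sim_f(ti, tj): cosine similarity between f(ti) and f(tj) for every tuple pair."""
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    return U @ U.T

# Direct (O(N^2)) computation of Vf . Vg from tuple similarities ...
Vf, Vg = tuple_similarity_matrix(F), tuple_similarity_matrix(G)
direct = float((Vf * Vg).sum())

# ... versus the one-scan form: sum over feature-value pairs of sim(fk, gq)^2,
# computed here on the L2-normalized tuple vectors
Fhat = F / np.linalg.norm(F, axis=1, keepdims=True)
Ghat = G / np.linalg.norm(G, axis=1, keepdims=True)
fast = float(((Fhat.T @ Ghat) ** 2).sum())

print(round(direct, 6), round(fast, 6))   # the two values agree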
    817 Searching for PertinentFeatures  Different features convey different aspects of information  Features conveying same aspect of information usually cluster tuples in more similar ways  Research group areas vs. conferences of publications  Given user specified feature  Find pertinent features by computing feature similarity Research group area Advisor Conferences of papers Research area GPA Number of papers GRE score Academic Performances Nationality Permanent address Demographic info
818 Heuristic Search for Pertinent Features  Overall procedure: 1. Start from the user-specified feature 2. Search in the neighborhood of existing pertinent features 3. Expand the search range gradually  (Figure: the multi-relational schema from before, with the target of clustering, the user hint, and the expanding search range (steps 1 and 2) marked)  Tuple ID propagation is used to create multi-relational features  IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple
819 Clustering with Multi-Relational Features  Given a set of L pertinent features f1, …, fL, the similarity between two tuples is sim(t1, t2) = Σ_{i=1..L} sim_{fi}(t1, t2)·fi.weight  The weight of a feature is determined during feature search by its similarity with the other pertinent features  Clustering methods  CLARANS [Ng & Han 94], a scalable clustering algorithm for non-Euclidean space  K-means  Agglomerative hierarchical clustering
    820 Experiments: Compare CrossCluswith  Baseline: Only use the user specified feature  PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm  Use a subset of features for each cluster  We convert relational database to a table by propositionalization  User-specified feature is forced to be used in every cluster  RDBC [Kirsten and Wrobel’00]  A representative ILP clustering algorithm  Use neighbor information of objects for clustering  User-specified feature is forced to be used
    821 Measure of ClusteringAccuracy  Accuracy  Measured by manually labeled data  We manually assign tuples into clusters according to their properties (e.g., professors in different research areas)  Accuracy of clustering: Percentage of pairs of tuples in the same cluster that share common label  This measure favors many small clusters  We let each approach generate the same number of clusters
822 DBLP Dataset  (Figure: clustering accuracy on DBLP for the feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and all three, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglomerative, Baseline, PROCLUS, and RDBC; the accuracy axis runs from 0 to 1)
823 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 823
824 Summary  Probability Model-Based Clustering  Fuzzy clustering  Probability-model-based clustering  The EM algorithm  Clustering High-Dimensional Data  Subspace clustering: bi-clustering methods  Dimensionality reduction: spectral clustering  Clustering Graphs and Network Data  Graph clustering: min-cut vs. sparsest cut  High-dimensional clustering methods  Graph-specific clustering methods, e.g., SCAN  Clustering with Constraints  Constraints on instance objects, e.g., must-link vs. cannot-link  Constraint-based clustering algorithms
825 References (I)  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98  C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected clustering. SIGMOD'99  S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56:5:1–5:37, 2009  J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981  K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? ICDT'99  Y. Cheng and G. Church. Biclustering of expression data. ISMB'00  I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. SDM'05  I. Davidson, K. L. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. PKDD'06  C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. J. American Stat. Assoc., 97:611–631, 2002  F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, 1999  G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. KDD'02  H.-P. Kriegel, P. Kroeger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery from Data (TKDD), 3, 2009  U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007
References (II)  G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988  B. Mirkin. Mathematical classification and clustering. J. of Global Optimization, 12:105–108, 1998  S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 2004  A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS'01  J. Pei, X. Zhang, M. Cho, H. Wang, and P. S. Yu. MaPle: A fast algorithm for maximal pattern-based clustering. ICDM'03  M. Radovanović, A. Nanopoulos, and M. Ivanović. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. ICML'09  S. E. Schaeffer. Graph clustering. Computer Science Review, 1:27–64, 2007  A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01  A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01  A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, Chapman & Hall, 2004  K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. ICML'01  H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks. KDD'07  X. Yin, J. Han, and P. S. Yu. Cross-relational clustering with user's guidance. KDD'05
    Slides Not toBe Used in Class 827
    828 Conceptual Clustering  Conceptualclustering  A form of clustering in machine learning  Produces a classification scheme for a set of unlabeled objects  Finds characteristic description for each concept (class)  COBWEB (Fisher’87)  A popular a simple method of incremental conceptual learning  Creates a hierarchical clustering in the form of a classification tree  Each node refers to a concept and contains a probabilistic description of that concept
    829 COBWEB Clustering Method Aclassification tree
    830 More on ConceptualClustering  Limitations of COBWEB  The assumption that the attributes are independent of each other is often too strong because correlation may exist  Not suitable for clustering large database data – skewed tree and expensive probability distributions  CLASSIT  an extension of COBWEB for incremental clustering of continuous data  suffers similar problems as COBWEB  AutoClass (Cheeseman and Stutz, 1996)  Uses Bayesian statistical analysis to estimate the number of clusters  Popular in industry
831 Neural Network Approaches  Neural network approaches  Represent each cluster as an exemplar, acting as a "prototype" of the cluster  New objects are assigned to the cluster whose exemplar is the most similar, according to some distance measure  Typical methods  SOM (Self-Organizing feature Map)  Competitive learning  Involves a hierarchical architecture of several units (neurons)  Neurons compete in a "winner-takes-all" fashion for the object currently being presented
    832 Self-Organizing Feature Map(SOM)  SOMs, also called topological ordered maps, or Kohonen Self- Organizing Feature Map (KSOMs)  It maps all the points in a high-dimensional source space into a 2 to 3-d target space, s.t., the distance and proximity relationship (i.e., topology) are preserved as much as possible  Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space  Clustering is performed by having several units competing for the current object  The unit whose weight vector is closest to the current object wins  The winner and its neighbors learn by having their weights adjusted  SOMs are believed to resemble processing that can occur in the brain  Useful for visualizing high-dimensional data in 2- or 3-D space
    833 Web Document ClusteringUsing SOM  The result of SOM clustering of 12088 Web articles  The picture on the right: drilling down on the keyword “mining”  Based on websom.hut.fi Web page
    845 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
846 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
847 What Are Outliers?  Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism  Ex.: unusual credit card purchases; in sports: Michael Jordan, Wayne Gretzky, ...  Outliers are different from noise data  Noise is random error or variance in a measured variable  Noise should be removed before outlier detection  Outliers are interesting: they violate the mechanism that generates the normal data  Outlier detection vs. novelty detection: a novel pattern may first appear as an outlier but is later merged into the model  Applications:  Credit card fraud detection  Telecom fraud detection  Customer segmentation  Medical analysis
    848 Types of Outliers(I)  Three kinds: global, contextual and collective outliers  Global outlier (or point anomaly)  Object is Og if it significantly deviates from the rest of the data set  Ex. Intrusion detection in computer networks  Issue: Find an appropriate measurement of deviation  Contextual outlier (or conditional outlier)  Object is Oc if it deviates significantly based on a selected context  Ex. 80o F in Urbana: outlier? (depending on summer or winter?)  Attributes of data objects should be divided into two groups  Contextual attributes: defines the context, e.g., time & location  Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature  Can be viewed as a generalization of local outliers—whose density significantly deviates from its local area  Issue: How to define or formulate meaningful context? Global Outlier
    849 Types of Outliers(II)  Collective Outliers  A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers  Applications: E.g., intrusion detection:  When a number of computers keep sending denial-of-service packages to each other Collective Outlier  Detection of collective outliers  Consider not only behavior of individual objects, but also that of groups of objects  Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects.  A data set may have multiple types of outlier  One object may belong to more than one type of outlier
    850 Challenges of OutlierDetection  Modeling normal objects and outliers properly  Hard to enumerate all possible normal behaviors in an application  The border between normal and outlier objects is often a gray area  Application-specific outlier detection  Choice of distance measure among objects and the model of relationship among objects are often application-dependent  E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger fluctuations  Handling noise in outlier detection  Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection  Understandability  Understand why these are outliers: Justification of the detection  Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism
851 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Outlier Detection I:Supervised Methods  Two ways to categorize outlier detection methods:  Based on whether user-labeled examples of outliers can be obtained:  Supervised, semi-supervised vs. unsupervised methods  Based on assumptions about normal data and outliers:  Statistical, proximity-based, and clustering-based methods  Outlier Detection I: Supervised Methods  Modeling outlier detection as a classification problem  Samples examined by domain experts used for training & testing  Methods for Learning a classifier for outlier detection effectively:  Model normal objects & report those not matching the model as outliers, or  Model outliers and treat those not matching the model as normal  Challenges  Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial outliers  Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers) 852
    Outlier Detection II:Unsupervised Methods  Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each having some distinct features  An outlier is expected to be far away from any groups of normal objects  Weakness: Cannot detect collective outlier effectively  Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area  Ex. In some intrusion or virus detection, normal activities are diverse  Unsupervised methods may have a high false positive rate but still miss many real outliers.  Supervised methods can be more effective, e.g., identify attacking some key resources  Many clustering methods can be adapted for unsupervised methods  Find clusters, then outliers: not belonging to any cluster  Problem 1: Hard to distinguish noise from outliers  Problem 2: Costly since first clustering: but far less outliers than normal objects  Newer methods: tackle outliers directly 853
Outlier Detection III: Semi-Supervised Methods  Situation: in many applications the amount of labeled data is small: labels could be on outliers only, normal objects only, or both  Semi-supervised outlier detection: regarded as an application of semi-supervised learning  If some labeled normal objects are available  Use the labeled examples and the proximate unlabeled objects to train a model for normal objects  Those not fitting the model of normal objects are detected as outliers  If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well  To improve the quality of outlier detection, one can get help from models for normal objects learned by unsupervised methods 854
    Outlier Detection (1):Statistical Methods  Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model)  The data not following the model are outliers. 855  Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data  There are rich alternatives to use various statistical models  E.g., parametric vs. non-parametric  Example (right figure): First use Gaussian distribution to model the normal data  For each object y in region R, estimate gD(y), the probability of y fits the Gaussian distribution  If gD(y) is very low, y is unlikely generated by the Gaussian model, thus an outlier
    Outlier Detection (2):Proximity-Based Methods  An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object is significantly deviates from the proximity of most of the other objects in the same data set 856  The effectiveness of proximity-based methods highly relies on the proximity measure.  In some applications, proximity or distance measures cannot be obtained easily.  Often have a difficulty in finding a group of outliers which stay close to each other  Two major types of proximity-based outlier detection  Distance-based vs. density-based  Example (right figure): Model the proximity of an object using its 3 nearest neighbors  Objects in region R are substantially different from other objects in the data set.  Thus the objects in R are outliers
    Outlier Detection (3):Clustering-Based Methods  Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters 857  Since there are many clustering methods, there are many clustering-based outlier detection methods as well  Clustering is expensive: straightforward adaption of a clustering method for outlier detection can be costly and does not scale up well for large data sets  Example (right figure): two clusters  All points not in R form a large cluster  The two points in R form a tiny cluster, thus are outliers
858 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Statistical Approaches  Statisticalapproaches assume that the objects in a data set are generated by a stochastic process (a generative model)  Idea: learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers  Methods are divided into two categories: parametric vs. non- parametric  Parametric method  Assumes that the normal data is generated by a parametric distribution with parameter θ  The probability density function of the parametric distribution f(x, θ) gives the probability that object x is generated by the distribution  The smaller this value, the more likely x is an outlier  Non-parametric method  Not assume an a-priori statistical model and determine the model from the input data  Not completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advance  Examples: histogram and kernel density estimation 859
Parametric Methods I: Detecting Univariate Outliers Based on the Normal Distribution  Univariate data: a data set involving only one attribute or variable  Often assume that the data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers  Ex. avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}  Use the maximum likelihood method to estimate μ and σ 860  Taking derivatives of the log-likelihood with respect to μ and σ², we derive the maximum likelihood estimates μ̂ = (1/n) Σ xi and σ̂² = (1/n) Σ (xi − μ̂)²  For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51  Then (24 − 28.61)/1.51 = −3.04 < −3, so 24 is an outlier, since under a normal distribution the region μ ± 3σ contains 99.7% of the data (see the sketch below)
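A direct sketch of this example in plain Python. Note that the maximum likelihood σ̂ computed from the raw numbers comes out slightly above the slide's rounded 1.51, which puts the z-score of 24.0 just under −3; it is still by far the most deviating value.

import math

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

n = len(temps)
mu = sum(temps) / n                                        # maximum likelihood estimate of mu
sigma = math.sqrt(sum((x - mu) ** 2 for x in temps) / n)   # maximum likelihood estimate of sigma

z = {x: (x - mu) / sigma for x in temps}
print(round(mu, 2), round(sigma, 2))
print({x: round(v, 2) for x, v in z.items()})
# 24.0 sits roughly 3 standard deviations below the mean, while every other value
# is within 0.6 sigma, so 24.0 is singled out as the outlier.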
Parametric Methods I: Grubbs' Test  Univariate outlier detection: Grubbs' test (the maximum normed residual test)—another statistical method under the normal distribution assumption  For each object x in a data set, compute its z-score, z = |x − x̄| / s, where x̄ and s are the sample mean and standard deviation  x is an outlier if z ≥ ((N − 1)/√N)·√(t²α/(2N),N−2 / (N − 2 + t²α/(2N),N−2)), where tα/(2N),N−2 is the value taken by a t-distribution with N − 2 degrees of freedom at a significance level of α/(2N), and N is the # of objects in the data set 861
Parametric Methods II: Detection of Multivariate Outliers  Multivariate data: a data set involving two or more attributes or variables  Transform the multivariate outlier detection task into a univariate outlier detection problem  Method 1. Compute the Mahalanobis distance  Let ō be the mean vector of the multivariate data set. The Mahalanobis distance of an object o from ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō), where S is the covariance matrix  Use Grubbs' test on this measure to detect outliers (see the sketch below)  Method 2. Use the χ²-statistic: χ² = Σ_{i=1..n} (oi − Ei)²/Ei, where Ei is the mean of the i-th dimension among all objects and n is the dimensionality  If the χ²-statistic is large, the object o is an outlier 862
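A NumPy sketch of Method 1; the generated data and the planted outlier are illustrative, and in practice Grubbs' test would then be applied to the resulting distances.

import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance of every row of X from the mean vector."""
    mean = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (50, 2))
X[0] = [6.0, -6.0]                      # plant one obvious multivariate outlier
d2 = mahalanobis_distances(X)
print(int(np.argmax(d2)))               # index 0: the planted point has the largest distance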
    Parametric Methods III:Using Mixture of Parametric Distributions  Assuming data generated by a normal distribution could be sometimes overly simplified  Example (right figure): The objects between the two clusters cannot be captured as outliers since they are close to the estimated mean 863  To overcome this problem, assume the normal data is generated by two normal distributions. For any object o in the data set, the probability that o is generated by the mixture of the two distributions is given by where fθ1 and fθ2 are the probability density functions of θ1 and θ2  Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data  An object o is an outlier if it does not belong to any cluster
    Non-Parametric Methods: DetectionUsing Histogram  The model of normal data is learned from the input data without any a priori structure.  Often makes fewer assumptions about the data, and thus can be applicable in more scenarios  Outlier detection using histogram: 864  Figure shows the histogram of purchase amounts in transactions  A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount higher than $5,000  Problem: Hard to choose an appropriate bin size for histogram  Too small bin size → normal objects in empty/rare bins, false positive  Too big bin size → outliers in some frequent bins, false negative  Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.
865 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Proximity-Based Approaches: Distance-Basedvs. Density-Based Outlier Detection  Intuition: Objects that are far away from the others are outliers  Assumption of proximity-based approach: The proximity of an outlier deviates significantly from that of most of the others in the data set  Two types of proximity-based outlier detection methods  Distance-based outlier detection: An object o is an outlier if its neighborhood does not have enough other points  Density-based outlier detection: An object o is an outlier if its density is relatively much lower than that of its neighbors 866
    Distance-Based Outlier Detection For each object o, examine the # of other objects in the r- neighborhood of o, where r is a user-specified distance threshold  An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o  An object o is a DB(r, π) outlier if  Equivalently, one can check the distance between o and its k-th nearest neighbor ok, where . o is an outlier if dist(o, ok) > r  Efficient computation: Nested loop algorithm  For any object oi, calculate its distance from other objects, and count the # of other objects in the r-neighborhood.  If π∙n other objects are within r distance, terminate the inner loop  Otherwise, oi is a DB(r, π) outlier  Efficiency: Actually CPU time is not O(n2 ) but linear to the data set size since for most non-outlier objects, the inner loop terminates early 867
    Distance-Based Outlier Detection:A Grid-Based Method  Why efficiency is still a concern? When the complete set of objects cannot be held into main memory, cost I/O swapping  The major cost: (1) each object tests against the whole data set, why not only its close neighbor? (2) check objects one by one, why not group by group?  Grid-based method (CELL): Data space is partitioned into a multi-D grid. Each cell is a hyper cube with diagonal length r/2 868  Pruning using the level-1 & level 2 cell properties:  For any possible point x in cell C and any possible point y in a level-1 cell, dist(x,y) ≤ r  For any possible point x in cell C and any point y such that dist(x,y) ≥ r, y is in a level-2 cell  Thus we only need to check the objects that cannot be pruned, and even for such an object o, only need to compute the distance between o and the objects in the level-2 cells (since beyond level-2, the distance from o is more than r)
Density-Based Outlier Detection  Local outliers: outliers relative to their local neighborhoods, rather than to the global data distribution  In the figure, o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an outlier. However, a proximity-based method using the global distribution cannot identify o1 and o2 as outliers (e.g., when compared with o4). 869  Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors  Method: use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier  k-distance of an object o, distk(o): the distance between o and its k-th NN  k-distance neighborhood of o: Nk(o) = {o′ | o′ in D, dist(o, o′) ≤ distk(o)}  |Nk(o)| could be bigger than k since multiple objects may be at the same distance from o
Local Outlier Factor: LOF  Reachability distance from o′ to o: reachdist_k(o ← o′) = max{distk(o), dist(o, o′)}, where k is a user-specified parameter  Local reachability density of o: lrd_k(o) = |Nk(o)| / Σ_{o′∈Nk(o)} reachdist_k(o′ ← o) 870  LOF (local outlier factor) of an object o is the average of the ratios of the local reachability densities of o's k-nearest neighbors to that of o: LOF_k(o) = (1/|Nk(o)|) Σ_{o′∈Nk(o)} lrd_k(o′) / lrd_k(o)  The lower the local reachability density of o, and the higher the local reachability densities of the kNN of o, the higher the LOF  This captures a local outlier whose local density is relatively low compared to the local densities of its kNN (see the sketch below)
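LOF is available off the shelf; a short sketch using scikit-learn's LocalOutlierFactor (the parameter values and the generated data are illustrative).

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.2, (100, 2))         # a dense cluster
sparse = rng.normal(5.0, 1.5, (30, 2))         # a sparser cluster
locals_ = np.array([[1.2, 1.2], [1.5, -1.3]])  # near the dense cluster but outside it
X = np.vstack([dense, sparse, locals_])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_       # higher score = more outlying
print(int((labels == -1).sum()), "points flagged")
print(np.argsort(scores)[-3:])               # indices of the three highest-LOF points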
871 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Clustering-Based Outlier Detection(1 & 2): Not belong to any cluster, or far from the closest one  An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster , or (3) it belongs to a small or sparse cluster  Case I: Not belong to any cluster  Identify animals not part of a flock: Using a density- based clustering method such as DBSCAN  Case 2: Far from its closest cluster  Using k-means, partition data points of into clusters  For each object o, assign an outlier score based on its distance from its closest center  If dist(o, co)/avg_dist(co) is large, likely an outlier  Ex. Intrusion detection: Consider the similarity between data points and the clusters in a training data set  Use a training set to find patterns of “normal” data, e.g., frequent itemsets in each segment, and cluster similar connections into groups  Compare new data points with the clusters mined—Outliers are possible attacks 872
     FindCBLOF: Detectoutliers in small clusters  Find clusters, and sort them in decreasing size  To each data point, assign a cluster-based local outlier factor (CBLOF):  If obj p belongs to a large cluster, CBLOF = cluster_size X similarity between p and cluster  If p belongs to a small one, CBLOF = cluster size X similarity betw. p and the closest large cluster 873 Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters  Ex. In the figure, o is outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity from C2 is low, plus |C3| = 3 is small
    Clustering-Based Method: Strengthand Weakness  Strength  Detect outliers without requiring any labeled data  Work for many types of data  Clusters can be regarded as summaries of the data  Once the cluster are obtained, need only compare any object against the clusters to determine whether it is an outlier (fast)  Weakness  Effectiveness depends highly on the clustering method used—they may not be optimized for outlier detection  High computational cost: Need to first find clusters  A method to reduce the cost: Fixed-width clustering  A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold from the point  If a point cannot be assigned to any existing cluster, a new cluster is created and the distance threshold may be learned from the training data under certain conditions
875 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Classification-Based Method I:One-Class Model  Idea: Train a classification model that can distinguish “normal” data from outliers  A brute-force approach: Consider a training set that contains samples labeled as “normal” and others labeled as “outlier”  But, the training set is typically heavily biased: # of “normal” samples likely far exceeds # of outlier samples  Cannot detect unseen anomaly 876  One-class model: A classifier is built to describe only the normal class.  Learn the decision boundary of the normal class using classification methods such as SVM  Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers  Adv: can detect new outliers that may not appear close to any outlier objects in the training set  Extension: Normal objects may belong to multiple classes
    Classification-Based Method II:Semi-Supervised Learning  Semi-supervised learning: Combining classification- based and clustering-based methods  Method  Using a clustering-based approach, find a large cluster, C, and a small cluster, C1  Since some objects in C carry the label “normal”, treat all objects in C as normal  Use the one-class model of this cluster to identify normal objects in outlier detection  Since some objects in cluster C1 carry the label “outlier”, declare all objects in C1 as outliers  Any object that does not fall into the model for C (such as a) is considered an outlier as well 877  Comments on classification-based outlier detection methods  Strength: Outlier detection is fast  Bottleneck: Quality heavily depends on the availability and quality of the training set, but often difficult to obtain representative and high- quality training data
878 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Mining Contextual OutliersI: Transform into Conventional Outlier Detection  If the contexts can be clearly identified, transform it to conventional outlier detection 1. Identify the context of the object using the contextual attributes 2. Calculate the outlier score for the object in the context using a conventional outlier detection method  Ex. Detect outlier customers in the context of customer groups  Contextual attributes: age group, postal code  Behavioral attributes: # of trans/yr, annual total trans. amount  Steps: (1) locate c’s context, (2) compare c with the other customers in the same group, and (3) use a conventional outlier detection method  If the context contains very few customers, generalize contexts  Ex. Learn a mixture model U on the contextual attributes, and another mixture model V of the data on the behavior attributes  Learn a mapping p(Vi|Uj): the probability that a data object o belonging to cluster Uj on the contextual attributes is generated by cluster Vi on the behavior attributes  Outlier score: 879
    Mining Contextual OutliersII: Modeling Normal Behavior with Respect to Contexts  In some applications, one cannot clearly partition the data into contexts  Ex. if a customer suddenly purchased a product that is unrelated to those she recently browsed, it is unclear how many products browsed earlier should be considered as the context  Model the “normal” behavior with respect to contexts  Using a training data set, train a model that predicts the expected behavior attribute values with respect to the contextual attribute values  An object is a contextual outlier if its behavior attribute values significantly deviate from the values predicted by the model  Using a prediction model that links the contexts and behavior, these methods avoid the explicit identification of specific contexts  Methods: A number of classification and prediction techniques can be used to build such models, such as regression, Markov Models, and Finite State Automaton 880
  • 868.
Mining Collective Outliers I: On the Set of “Structured Objects”  A group of objects forms a collective outlier if the objects as a group deviate significantly from the entire data set  Need to examine the structure of the data set, i.e., the relationships between multiple data objects 881  Each of these structures is inherent to its respective type of data  For temporal data (such as time series and sequences), we explore the structures formed by time, which occur in segments of the time series or subsequences  For spatial data, we explore local areas  For graph and network data, we explore subgraphs  Difference from contextual outlier detection: the structures are often not explicitly defined and have to be discovered as part of the outlier detection process  Collective outlier detection methods: two categories  Reduce the problem to conventional outlier detection  Identify structure units, treat each structure unit (e.g., subsequence, time series segment, local area, or subgraph) as a data object, and extract features  Then run outlier detection on the set of “structured objects” constructed this way, using the extracted features
  • 869.
Mining Collective Outliers II: Direct Modeling of the Expected Behavior of Structure Units  Model the expected behavior of structure units directly  Ex. 1. Detect collective outliers in an online social network of customers  Treat each possible subgraph of the network as a structure unit  Collective outlier: an outlier subgraph in the social network  Small subgraphs that are of very low frequency  Large subgraphs that are surprisingly frequent  Ex. 2. Detect collective outliers in temporal sequences  Learn a Markov model from the sequences  A subsequence can then be declared a collective outlier if it significantly deviates from the model (see the sketch below)  Collective outlier detection is subtle due to the challenge of exploring the structures in data  The exploration typically uses heuristics and thus may be application dependent  The computational cost is often high due to the sophisticated mining process 882
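A minimal sketch of the Markov-model idea from Ex. 2, assuming a first-order model with Laplace smoothing; the alphabet, training sequences, and the use of average log-likelihood as the score are illustrative choices, not from the text.

```python
# Sketch: learn a first-order Markov model from "normal" sequences and score a
# subsequence by its average transition log-likelihood; very low scores suggest
# a collective outlier.  Alphabet, data, and scoring choice are assumptions.
import numpy as np

def train_markov(sequences, alphabet, alpha=1.0):
    idx = {s: i for i, s in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(alphabet)), alpha)   # Laplace smoothing
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True), idx    # row-normalized transition matrix

def avg_log_likelihood(seq, trans, idx):
    logps = [np.log(trans[idx[a], idx[b]]) for a, b in zip(seq, seq[1:])]
    return float(np.mean(logps))

alphabet = ["a", "b", "c"]
normal_seqs = [list("abcabcabcabc"), list("abcabcabc")]
trans, idx = train_markov(normal_seqs, alphabet)

print(avg_log_likelihood(list("abcabc"), trans, idx))   # high (typical) likelihood
print(avg_log_likelihood(list("cccccc"), trans, idx))   # much lower: candidate collective outlier
```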
  • 870.
883 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
  • 871.
Challenges for Outlier Detection in High-Dimensional Data  Interpretation of outliers  Detecting outliers without saying why they are outliers is not very useful in high-dimensional data, because many features (or dimensions) are involved  E.g., identify the subspaces that manifest the outliers, or provide an assessment of the “outlier-ness” of the objects  Data sparsity  Data in high-dimensional spaces are often sparse  The distance between objects becomes heavily dominated by noise as the dimensionality increases  Data subspaces  Methods should be adaptive to the subspaces signifying the outliers  Capture the local behavior of data  Scalability with respect to dimensionality  The # of subspaces increases exponentially 884
  • 872.
Approach I: Extending Conventional Outlier Detection  Method 1: Detect outliers in the full space, e.g., the HilOut algorithm  Find distance-based outliers, but use the ranks of distances instead of the absolute distances in outlier detection  For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)  The weight of object o is the sum of its distances to these k nearest neighbors (see the sketch below)  All objects are ranked in weight-descending order  The top-l objects in weight are output as outliers (l: user-specified parameter)  Employ space-filling curves for approximation: scalable in both time and space w.r.t. data size and dimensionality  Method 2: Dimensionality reduction  Works only when, in the lower-dimensional space, normal instances can still be distinguished from outliers  PCA: Heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to each other and outliers often deviate from the majority 885
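A minimal sketch of the weighting and ranking step, assuming the weight of o is the sum of its distances to its k nearest neighbors; it uses an exact k-NN search rather than HilOut's space-filling-curve approximation, and the data, k, and l values are illustrative.

```python
# Sketch of the weighting step: w(o) = sum of distances from o to its k nearest
# neighbors; the top-l objects by weight are reported as outliers.  Exact k-NN is
# used here instead of HilOut's space-filling-curve approximation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(200, 10)),   # bulk of the data
               rng.normal(8, 1, size=(3, 10))])    # a few far-away points

k, l = 5, 3
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
dists, _ = nbrs.kneighbors(X)
weights = dists[:, 1:].sum(axis=1)                 # drop the zero self-distance

top_l = np.argsort(weights)[::-1][:l]              # rank in weight-descending order
print(top_l)                                       # indices 200-202 should dominate
```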
  • 873.
Approach II: Finding Outliers in Subspaces  Extending conventional outlier detection: hard for outlier interpretation  Finding outliers in much lower dimensional subspaces: easy to interpret why and to what extent the object is an outlier  E.g., find outlier customers in a certain subspace: average transaction amount >> avg. and purchase frequency << avg.  Ex. A grid-based subspace outlier detection method  Project data onto various subspaces to find an area whose density is much lower than average  Discretize the data into a grid with φ equi-depth regions per dimension (why equi-depth?)  Search for regions that are significantly sparse  Consider a k-d cube: k ranges on k dimensions, with n objects  If objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/φ)^k · n = f^k · n, where f = 1/φ; the standard deviation and the sparsity coefficient S(C) of cube C are given below  If S(C) < 0, C contains fewer objects than expected  The more negative S(C), the sparser C is and the more likely the objects in C are outliers in the subspace 886
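A hedged reconstruction of the two missing formulas, following the grid-based method of Aggarwal and Yu (SIGMOD'01) cited in the references: under the independence assumption, the count n(C) of objects in a k-dimensional cube C is approximately binomial with success probability f^k, where f = 1/φ.

```latex
% Hedged reconstruction (assumption: Aggarwal & Yu's sparsity coefficient).
\[
  E[n(C)] = f^{k}\, n, \qquad
  \sigma = \sqrt{f^{k}\,(1 - f^{k})\, n}, \qquad
  S(C) = \frac{n(C) - f^{k}\, n}{\sqrt{f^{k}\,(1 - f^{k})\, n}}
\]
```

Here n(C) is the observed number of objects falling into cube C.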
  • 874.
Approach III: Modeling High-Dimensional Outliers  Develop new models for high-dimensional outliers directly  Avoid proximity measures and adopt new heuristics that do not deteriorate in high-dimensional data  Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]  For each point o, examine the angle ∠xoy for every pair of points x, y  For a point in the center of the data (e.g., a), the angles formed differ widely  For an outlier (e.g., c), the variance of the angles is substantially smaller  Use the variance of angles at a point to determine whether it is an outlier  Combine angles and distance to model outliers  Use the distance-weighted angle variance as the outlier score  Angle-based outlier factor (ABOF): see the formula below  An efficient approximate computation method has been developed  The approach can be generalized to handle arbitrary types of data 887  [Figure: a set of points forms a cluster except c (an outlier)]
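A hedged reconstruction of the ABOF formula from [KSZ08]: the variance, over all pairs of other points, of the angle at o, with each pair down-weighted by the squared distances so that distant pairs contribute less; low ABOF values indicate likely outliers.

```latex
% Hedged reconstruction of the angle-based outlier factor of [KSZ08].
\[
  \mathrm{ABOF}(o) \;=\; \operatorname{VAR}_{x,\, y \in D}
  \left(
    \frac{\langle \overrightarrow{ox},\, \overrightarrow{oy} \rangle}
         {\lVert \overrightarrow{ox} \rVert^{2}\,\lVert \overrightarrow{oy} \rVert^{2}}
  \right)
\]
```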
  • 875.
888 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
  • 876.
    Summary  Types ofoutliers  global, contextual & collective outliers  Outlier detection  supervised, semi-supervised, or unsupervised  Statistical (or model-based) approaches  Proximity-base approaches  Clustering-base approaches  Classification approaches  Mining contextual and collective outliers  Outlier detection in high dimensional data 889
  • 877.
    References (I)  B.Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248, 1979.  M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.  F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.  D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44, 2006.  F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.  C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD’01  R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.  I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD’00  D. Barbar´a, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system. SAC’03  Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.  S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. KDD’03  D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using bayesian estimators. SDM’01  V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009.  D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. In CEC’02
  • 878.
    References (2)  E.Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.  E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00  T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.  V. J. Hodge and J. Austin. A survey of outlier detection methdologies. Artif. Intell. Rev., 22:85–126, 2004.  D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.  Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003.  W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01  W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD’06  E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98  E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237–253, 2000.  H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08  M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481– 2497, 2003.  M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal Process., 83:2499–2521, 2003.  C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03
  • 879.
    References (3)  S.Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. ICDE’03  A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51, 2007.  X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19, 2007.  Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06  N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 17:105–112, 2001.  B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co- evolving time sequences. ICDE’00
  • 880.
  • 881.
894 Outlier Discovery: Statistical Approaches  Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)  Use discordancy tests, which depend on  the data distribution  the distribution parameters (e.g., mean, variance)  the number of expected outliers  (see the sketch below)  Drawbacks  Most tests are for a single attribute  In many cases, the data distribution may not be known
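A minimal sketch of a discordancy check under an assumed normal model; the 3-sigma cutoff is an illustrative simplification of a proper discordancy test such as Grubbs' test, and the data are synthetic.

```python
# Sketch: flag values whose normed residual |x - mean| / std exceeds a cutoff,
# assuming a normal model.  A proper Grubbs test would use a t-based critical
# value instead of the fixed cutoff of 3 used here.
import numpy as np

rng = np.random.RandomState(7)
data = np.append(rng.normal(10.0, 0.5, size=200), 15.7)   # inject one gross error

z = np.abs(data - data.mean()) / data.std(ddof=1)
print(np.where(z > 3)[0])                                  # index 200 should be flagged as discordant
```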
  • 882.
895 Outlier Discovery: Distance-Based Approach  Introduced to counter the main limitations imposed by statistical methods  We need multi-dimensional analysis without knowing the data distribution  Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O (see the sketch below)  Algorithms for mining distance-based outliers [Knorr & Ng, VLDB’98]  Index-based algorithm  Nested-loop algorithm  Cell-based algorithm
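A minimal sketch of the DB(p, D) test written in nested-loop style; the data set and the values of p and D are illustrative assumptions, and real implementations add pruning rather than scanning all pairs.

```python
# Sketch of the nested-loop test for DB(p, D)-outliers: O is an outlier if at
# least a fraction p of the objects lie farther than D from O.
import numpy as np

def db_outliers(X, p, D):
    n = len(X)
    flags = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)   # distances from object i to all objects
        frac_far = np.sum(dists > D) / (n - 1)     # exclude the object itself
        flags.append(frac_far >= p)
    return np.where(flags)[0]

rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[10.0, 10.0]]])
print(db_outliers(X, p=0.95, D=4.0))               # the appended far point should be reported
```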
  • 883.
896 Density-Based Local Outlier Detection  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000  Distance-based outlier detection is based on the global distance distribution  It encounters difficulties in identifying outliers if the data are not uniformly distributed  Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2  A distance-based method cannot identify o2 as an outlier  Need the concept of a local outlier  Local outlier factor (LOF)  Does not treat “outlier” as a crisp (yes/no) property  Each point is assigned a LOF score (see the sketch below)
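A minimal sketch of the local-outlier idea using scikit-learn's LocalOutlierFactor; the two synthetic clusters only loosely mirror the C1/C2 example, and the cluster sizes and n_neighbors are illustrative assumptions.

```python
# Sketch of density-based local outlier detection with LocalOutlierFactor.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
C1 = rng.normal(0, 3.0, size=(400, 2))      # loose cluster
C2 = rng.normal(20, 0.3, size=(100, 2))     # tight cluster
o = np.array([[20.0, 23.0]])                # local outlier near C2 (a purely global method may miss it)
X = np.vstack([C1, C2, o])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks outliers
print(np.where(labels == -1)[0])            # the last index (500) should appear
print(-lof.negative_outlier_factor_[-1])    # LOF score of o: noticeably greater than 1
```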
  • 884.
897 Outlier Discovery: Deviation-Based Approach  Identifies outliers by examining the main characteristics of objects in a group  Objects that “deviate” from this description are considered outliers  Sequential exception technique  simulates the way in which humans distinguish unusual objects from among a series of supposedly similar objects  OLAP data cube technique  uses data cubes to identify regions of anomalies in large multidimensional data
  • 885.
    898 References (1)  B.Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 1979.  Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 2006.  Deepak Agarwal. Detecting anomalies in cross-classied streams: a bayesian approach. Knowl. Inf. Syst., 2006.  C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.  M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Optics-of: Identifying local outliers. PKDD '99  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.  V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.  D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. Computational Intelligence, 2002.  E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.  E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00.  T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.  R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. KDD '05  F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969.
  • 886.
    899 References (2)  V.Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.  Douglas M Hawkins. Identification of Outliers. Chapman and Hall, 1980.  P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on Reference Interval Estimation. Clin Chem, 2001.  W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD'06  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98  M. Markou and S. Singh.. Novelty detection: a review| part 1: statistical approaches. Signal Process., 83(12), 2003.  M. Markou and S. Singh. Novelty detection: a review| part 2: neural network based approaches. Signal Process., 83(12), 2003.  S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. ICDE'03.  A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51(12):3448{3470, 2007.  W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469{479, 1972.  X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19(5):631{645, 2007.  Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD '06:  N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 2001.
  • 887.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 13 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 889.
902 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 890.
903 Mining Complex Types of Data  Mining Sequence Data  Mining Time Series  Mining Symbolic Sequences  Mining Biological Sequences  Mining Graphs and Networks  Mining Other Kinds of Data
  • 891.
904 Mining Sequence Data  Similarity Search in Time Series Data  Subsequence matching, dimensionality reduction, query-based similarity search, motif-based similarity search  Regression and Trend Analysis in Time-Series Data  long-term + cyclic + seasonal variations + random movements  Sequential Pattern Mining in Symbolic Sequences  GSP, PrefixSpan, constraint-based sequential pattern mining  Sequence Classification  Feature-based vs. sequence-distance-based vs. model-based  Alignment of Biological Sequences  Pairwise vs. multi-sequence alignment, substitution matrices, BLAST  Hidden Markov Models for Biological Sequence Analysis  Markov chains vs. hidden Markov models; forward, Viterbi, and Baum-Welch algorithms
  • 892.
905 Mining Graphs and Networks  Graph Pattern Mining  Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph  Statistical Modeling of Networks  Small-world phenomenon, power-law (long-tail) distribution, densification  Clustering and Classification of Graphs and Homogeneous Networks  Clustering: Fast Modularity vs. SCAN  Classification: model-based vs. pattern-based mining  Clustering, Ranking and Classification of Heterogeneous Networks  RankClus, RankClass, and meta-path-based, user-guided methodology  Role Discovery and Link Prediction in Information Networks  PathPredict  Similarity Search and OLAP in Information Networks: PathSim, GraphCube  Evolution of Social and Information Networks: EvoNetClus
  • 893.
906 Mining Other Kinds of Data  Mining Spatial Data  Spatial frequent/co-located patterns, spatial clustering and classification  Mining Spatiotemporal and Moving-Object Data  Spatiotemporal data mining, trajectory mining, Periodica, Swarm, …  Mining Cyber-Physical System Data  Applications: healthcare, air-traffic control, flood simulation  Mining Multimedia Data  Social media data, geo-tagged spatial clustering, periodicity analysis, …  Mining Text Data  Topic modeling, i-topic model, integration with geo- and networked data  Mining Web Data  Web content, web structure, and web usage mining  Mining Data Streams
  • 894.
907 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 895.
908 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 896.
909 Major Statistical Data Mining Methods  Regression  Generalized Linear Models  Analysis of Variance  Mixed-Effect Models  Factor Analysis  Discriminant Analysis  Survival Analysis
  • 897.
910 Statistical Data Mining (1)  There are many well-established statistical techniques for data analysis, particularly for numeric data  applied extensively to data from scientific experiments and data from economics and the social sciences  Regression  predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric (see the sketch below)  forms of regression: linear, multiple, weighted, polynomial, nonparametric, and robust
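A minimal sketch of plain linear regression, the first technique listed above; the synthetic data and coefficients are illustrative assumptions.

```python
# Sketch: ordinary linear regression of a numeric response on numeric predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))                            # two numeric predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                             # roughly [3.0, -1.5] and 4.0
print(model.predict([[5.0, 2.0]]))                               # prediction for a new observation
```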
  • 898.
911 Scientific and Statistical Data Mining (2)  Generalized linear models  allow a categorical response variable (or some transformation of it) to be related to a set of predictor variables  similar to the modeling of a numeric response variable using linear regression  include logistic regression and Poisson regression  Mixed-effect models  for analyzing grouped data, i.e., data that can be classified according to one or more grouping variables  typically describe relationships between a response variable and some covariates in data grouped according to one or more factors
  • 899.
912 Scientific and Statistical Data Mining (3)  Regression trees  Binary trees used for classification and prediction  Similar to decision trees: tests are performed at the internal nodes  In a regression tree, the mean of the objective attribute is computed and used as the predicted value  Analysis of variance  Analyzes experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors)
  • 900.
913 Statistical Data Mining (4)  Factor analysis  determines which variables are combined to generate a given factor  e.g., for many psychiatric data, one can only measure quantities (such as test scores) that indirectly reflect the factor of interest  Discriminant analysis  predicts a categorical response variable; commonly used in the social sciences  attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable  www.spss.com/datamine/factor.htm
  • 901.
914 Statistical Data Mining (5)  Time series analysis: many methods, such as autoregression, ARIMA (autoregressive integrated moving-average) modeling, and long-memory time-series modeling  Quality control: displays group summary charts  Survival analysis  predicts the probability that a patient undergoing a medical treatment will survive at least to time t (life span prediction)
  • 902.
915 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 903.
916 Views on Data Mining Foundations (I)  Data reduction  Basis of data mining: reduce the data representation  Trades accuracy for speed in response  Data compression  Basis of data mining: compress the given data by encoding in terms of bits, association rules, decision trees, clusters, etc.  Probability and statistical theory  Basis of data mining: discover joint probability distributions of random variables
  • 904.
917 Views on Data Mining Foundations (II)  Microeconomic view  A view of utility: finding patterns that are interesting only to the extent that they can be used in the decision-making process of some enterprise  Pattern discovery and inductive databases  Basis of data mining: discover patterns occurring in the database, such as associations, classification models, sequential patterns, etc.  Data mining is the problem of performing inductive logic on databases  The task is to query the data and the theory (i.e., the patterns) of the database  Popular among many researchers in database systems
  • 905.
918 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 906.
    919 Visual Data Mining Visualization: Use of computer graphics to create visual images which aid in the understanding of complex, often massive representations of data  Visual Data Mining: discovering implicit but useful knowledge from large data sets using visualization techniques Compute r Graphics High Performance Computing Pattern Recognitio n Human Compute r Interface s Multimedia Systems Visual Data Mining
  • 907.
    920 Visualization  Purpose ofVisualization  Gain insight into an information space by mapping data onto graphical primitives  Provide qualitative overview of large data sets  Search for patterns, trends, structure, irregularities, relationships among data.  Help find interesting regions and suitable parameters for further quantitative analysis.  Provide a visual proof of computer representations derived
  • 908.
921 Visual Data Mining & Data Visualization  Integration of visualization and data mining  data visualization  data mining result visualization  data mining process visualization  interactive visual data mining  Data visualization  Data in a database or data warehouse can be viewed  at different levels of abstraction  as different combinations of attributes or dimensions  Data can be presented in various visual forms
  • 909.
922 Data Mining Result Visualization  Presentation of the results or knowledge obtained from data mining in visual forms  Examples  Scatter plots and boxplots (obtained from descriptive data mining)  Decision trees  Association rules  Clusters  Outliers  Generalized rules
  • 910.
923 Boxplots from StatSoft: Multiple Variable Combinations
  • 911.
924 Visualization of Data Mining Results in SAS Enterprise Miner: Scatter Plots
  • 912.
925 Visualization of Association Rules in SGI/MineSet 3.0
  • 913.
926 Visualization of a Decision Tree in SGI/MineSet 3.0
  • 914.
927 Visualization of Cluster Groupings in IBM Intelligent Miner
  • 915.
928 Data Mining Process Visualization  Presentation of the various processes of data mining in visual forms so that users can see  the data extraction process  where the data is extracted  how the data is cleaned, integrated, preprocessed, and mined  the method selected for data mining  where the results are stored  how they may be viewed
  • 916.
929 Visualization of Data Mining Processes in Clementine  Understand variations with visualized data  See your solution discovery process clearly
  • 917.
930 Interactive Visual Data Mining  Using visualization tools in the data mining process to help users make smart data mining decisions  Example  Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by a circle or a set of columns)  Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
  • 918.
931 Interactive Visual Mining by Perception-Based Classification (PBC)
  • 919.
    932 Audio Data Mining Uses audio signals to indicate the patterns of data or the features of data mining results  An interesting alternative to visual mining  An inverse task of mining audio (such as music) databases which is to find patterns from audio data  Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns  Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
  • 920.
933 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 921.
934 Data Mining Applications  Data mining: a young discipline with broad and diverse applications  There still exists a nontrivial gap between generic data mining methods and effective, scalable data mining tools for domain-specific applications  Some application domains (briefly discussed here)  Data Mining for Financial Data Analysis  Data Mining for Retail and Telecommunication Industries  Data Mining in Science and Engineering  Data Mining for Intrusion Detection and Prevention  Data Mining and Recommender Systems
  • 922.
935 Data Mining for Financial Data Analysis (I)  Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality  Design and construction of data warehouses for multidimensional data analysis and data mining  View the debt and revenue changes by month, by region, by sector, and by other factors  Access statistical information such as max, min, total, average, trend, etc.  Loan payment prediction / consumer credit policy analysis  feature selection and attribute relevance ranking  loan payment performance
  • 923.
936 Data Mining for Financial Data Analysis (II)  Classification and clustering of customers for targeted marketing  multidimensional segmentation by nearest-neighbor methods, classification, decision trees, etc. to identify customer groups or to associate a new customer with an appropriate customer group  Detection of money laundering and other financial crimes  integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)  tools: data visualization, linkage analysis, classification, clustering, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
  • 924.
937 Data Mining for Retail & Telecomm. Industries (I)  Retail industry: huge amounts of data on sales, customer shopping history, e-commerce, etc.  Applications of retail data mining  Identify customer buying behaviors  Discover customer shopping patterns and trends  Improve the quality of customer service  Achieve better customer retention and satisfaction  Enhance goods consumption ratios  Design more effective goods transportation and distribution policies  The telecommunication and many other industries share many similar goals and expectations of retail data mining
  • 925.
938 Data Mining Practice for the Retail Industry  Design and construction of data warehouses  Multidimensional analysis of sales, customers, products, time, and region  Analysis of the effectiveness of sales campaigns  Customer retention: analysis of customer loyalty  Use customer loyalty card information to register sequences of purchases of particular customers  Use sequential pattern mining to investigate changes in customer consumption or loyalty  Suggest adjustments on the pricing and variety of goods  Product recommendation and cross-referencing of items  Fraud analysis and the identification of unusual patterns  Use of visualization tools in data analysis
  • 926.
939 Data Mining in Science and Engineering  Data warehouses and data preprocessing  Resolving inconsistencies or incompatible data collected in diverse environments and different periods (e.g., ecosystem studies)  Mining complex data types  Spatiotemporal, biological, diverse semantics and relationships  Graph-based and network-based mining  Links, relationships, data flow, etc.  Visualization tools and domain-specific knowledge  Other issues  Data mining in the social sciences and social studies: text and social media  Data mining in computer science: monitoring systems, …
  • 927.
940 Data Mining for Intrusion Detection and Prevention  The majority of intrusion detection and prevention systems use  Signature-based detection: use signatures, i.e., attack patterns that are preconfigured and predetermined by domain experts  Anomaly-based detection: build profiles (models of normal behavior) and detect behavior that substantially deviates from the profiles  How data mining can help  New data mining algorithms for intrusion detection  Association, correlation, and discriminative pattern analyses help select and build discriminative classifiers  Analysis of stream data: outlier detection, clustering, model shifting  Distributed data mining  Visualization and querying tools
  • 928.
941 Data Mining and Recommender Systems  Recommender systems: personalization, making product recommendations that are likely to be of interest to a user  Approaches: content-based, collaborative, or their hybrid  Content-based: recommends items that are similar to items the user preferred or queried in the past  Collaborative filtering: considers a user's social environment, i.e., the opinions of other customers who have similar tastes or preferences  Data mining and recommender systems  Given users C × items S, use the known ratings to predict unknown user-item ratings  Memory-based methods often use a k-nearest-neighbor approach (see the sketch below)  Model-based methods use a collection of ratings to learn a model (e.g., probabilistic models, clustering, Bayesian networks, etc.)
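A minimal sketch of the memory-based (k-nearest-neighbor) approach: predict a missing rating as a similarity-weighted average of the ratings given by the most similar users. The tiny ratings matrix (0 = unknown), the use of cosine similarity, and k are illustrative assumptions.

```python
# Sketch of memory-based collaborative filtering with user-user cosine similarity.
import numpy as np

R = np.array([[5, 4, 0, 1],      # rows: users, columns: items, 0 = unknown rating
              [4, 5, 4, 1],
              [1, 1, 0, 5],
              [1, 2, 1, 4],
              [5, 5, 5, 2]], dtype=float)

def predict(R, user, item, k=2):
    mask = R[:, item] > 0                                  # users who rated this item
    mask[user] = False
    candidates = np.where(mask)[0]
    # cosine similarity between the target user and each candidate user
    sims = np.array([np.dot(R[user], R[v]) /
                     (np.linalg.norm(R[user]) * np.linalg.norm(R[v]) + 1e-9)
                     for v in candidates])
    order = np.argsort(sims)[::-1][:k]                     # k most similar users
    top, top_sims = candidates[order], sims[order]
    return float(np.dot(top_sims, R[top, item]) / (top_sims.sum() + 1e-9))

print(predict(R, user=0, item=2))   # user 0's unknown rating for item 2, predicted from similar users
```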
  • 929.
942 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 930.
943 Ubiquitous and Invisible Data Mining  Ubiquitous data mining  Data mining is used everywhere, e.g., online shopping  Ex. customer relationship management (CRM)  Invisible data mining  Invisible: data mining functions are built into daily-life operations  Ex. Google search: users may be unaware that they are examining results returned by data mining  Invisible data mining is highly desirable  Invisible mining needs to consider efficiency and scalability, user interaction, incorporation of background knowledge and visualization techniques, finding interesting patterns, real-time operation, …  Further work: integration of data mining into existing business and scientific technologies to provide domain-specific data mining solutions
  • 931.
944 Privacy, Security and Social Impacts of Data Mining  Many data mining applications do not touch personal data  E.g., meteorology, astronomy, geography, geology, biology, and other scientific and engineering data  Many DM studies are on developing scalable algorithms to find general or statistically significant patterns, not touching individuals  The real privacy concern: unconstrained access to individual records, especially privacy-sensitive information  Method 1: Removing sensitive IDs associated with the data  Method 2: Data security-enhancing methods  Multi-level security model: permits access only to the authorized level  Encryption: e.g., blind signatures, biometric encryption, and anonymous databases (personal information is encrypted and stored at different locations)  Method 3: Privacy-preserving data mining methods
  • 932.
945 Privacy-Preserving Data Mining  Privacy-preserving (privacy-enhanced or privacy-sensitive) mining: obtaining valid mining results without disclosing the underlying sensitive data values  Often needs a trade-off between information loss and privacy  Privacy-preserving data mining methods:  Randomization (e.g., perturbation): add noise to the data in order to mask some attribute values of records (see the sketch below)  k-anonymity and l-diversity: alter individual records so that they cannot be uniquely identified  k-anonymity: any given record is indistinguishable from at least k − 1 other records  l-diversity: enforces intra-group diversity of sensitive values  Distributed privacy preservation: data partitioned and distributed either horizontally, vertically, or both  Downgrading the effectiveness of data mining: when the output of data mining may violate privacy
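A minimal sketch of the randomization (perturbation) idea: release values with additive noise so that individual records are masked while aggregate statistics are roughly preserved. The salary data and the noise scale are illustrative assumptions.

```python
# Sketch: additive-noise perturbation of a sensitive attribute.
import numpy as np

rng = np.random.RandomState(42)
salaries = rng.normal(60_000, 15_000, size=10_000)   # sensitive attribute (synthetic)

noise = rng.normal(0, 10_000, size=salaries.shape)   # masking noise; its distribution can be public
released = salaries + noise

print(abs(released.mean() - salaries.mean()))        # the aggregate mean is nearly preserved
print(abs(released[0] - salaries[0]))                # an individual value can shift substantially
```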
  • 933.
946 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 934.
947 Trends of Data Mining  Application exploration: dealing with application-specific problems  Scalable and interactive data mining methods  Integration of data mining with web search engines, database systems, data warehouse systems, and cloud computing systems  Mining social and information networks  Mining spatiotemporal, moving-object, and cyber-physical system data  Mining multimedia, text, and web data  Mining biological and biomedical data  Data mining with software engineering and system engineering  Visual and audio data mining  Distributed data mining and real-time data stream mining  Privacy protection and information security in data mining
  • 935.
948 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 936.
    949 Summary  We presenta high-level overview of mining complex data types  Statistical data mining methods, such as regression, generalized linear models, analysis of variance, etc., are popularly adopted  Researchers also try to build theoretical foundations for data mining  Visual/audio data mining has been popular and effective  Application-based mining integrates domain-specific knowledge with data analysis techniques and provide mission-specific solutions  Ubiquitous data mining and invisible data mining are penetrating our data lives  Privacy and data security are importance issues in data mining, and privacy-preserving data mining has been developed recently  Our discussion on trends in data mining shows that data mining is
  • 937.
950 References and Further Reading  The book lists many references for further reading; here we list only a few books  E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011  S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002  R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed., Wiley-Interscience, 2000  D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010  U. Fayyad, G. Grinstein, and A. Wierse (eds.). Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009  D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009  B. Liu. Web Data Mining, Springer, 2006  T. M. Mitchell. Machine Learning, McGraw Hill, 1997  M. Newman. Networks: An Introduction. Oxford University Press, 2010  P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, Addison-Wesley, 2005  I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005
  • 938.

Editor's Notes

  • #5 Two slides should be added after this one 1. Evolution of machine learning 2. Evolution of statistics methods
  • #19 I BELIEVE WE MAY NEED TO DO IT IN MORE IN-DEPTH INTRODUCTION, USING SOME EXAMPLES. So it will take one slide for one function, i.e., one chapter we want to cover. Do we need to cover chapter 2: preprocessing and 3. Statistical methods?
  • #25 This chapter will not be in the new version, will it? BUT SHOULD WESTILL INTRODCE THEM SO THAT THEY WILL GET AN OVERALL PICTURE?
  • #29 Add a definition/description of “traditional data analysis”.
  • #63 Note: We need to label the dark plotted points as Q1, Median, Q3 – that would help in understanding this graph. Tell audience: There is a shift in distribution of branch 1 WRT branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
  • #72 http://books.elsevier.com/companions/1558606890/pictures/Chapter_01/fig1-6b.gif
  • #227 K. Wu, E. Otoo, and A. Shoshani, Bitmap Index Compression Optimality, VLDB’04 Need to digest and rewrite it using an example! See full abstract on slide towards end of file.
  • #232 2*2^{100}-1, 1
  • #389 Sacre Coeur in Montmartre
  • #482 I : the expected information needed to classify a given sample E (entropy) : expected information based on the partitioning into subsets by A
  • #559 MK: Note – different notation than used in book. Will have to standardize notation.
  • #593 Explore the bound in mining
  • #594 One fig
  • #628 MK: Do we want to keep this slide? It is not in the text and may confuse the students.
  • #723 We use this simple definition of tightness for efficiency concerns.
  • #815 But, how to compute the similarity efficiently? Computing inner product of two NxN matrices is too expensive.
  • #816 Very expensive to compute directly We convert it into another form
  • #912 Mixed-effects models provide a powerful and flexible tool for the analysis of balanced and unbalanced grouped data. These data arise in several areas of investigation and are characterized by the presence of correlation between observations within the same group. Some examples are repeated measures data, longitudinal studies, and nested designs. Classical modeling techniques which assume independence of the observations are not appropriate for grouped data.
  • #929 How the interactive Clementine knowledge discovery process works See your solution discovery process clearly The interactive stream approach to data mining is the key to Clementine's power. Using icons that represent steps in the data mining process, you mine your data by building a stream - a visual map of the process your data flows through. Start by simply dragging a source icon from the object palette onto the Clementine desktop to access your data flow. Then, explore your data visually with graphs. Apply several types of algorithms to build your model by simply placing the appropriate icons onto the desktop to form a stream. Discover knowledge interactively Data mining with Clementine is a "discovery-driven" process. Work toward a solution by applying your business expertise to select the next step in your stream, based on the discoveries made in the previous step. You can continually adapt or extend initial streams as you work through the solution to your business problem. Easily build and test models All of Clementine's advanced techniques work together to quickly give you the best answer to your business problems. You can build and test numerous models to immediately see which model produces the best result. Or you can even combine models by using the results of one model as input into another model. These "meta-models" consider the initial model's decisions and can improve results substantially. Understand variations in your business with visualized data Powerful data visualization techniques help you understand key relationships in your data and guide the way to the best results. Spot characteristics and patterns at a glance with Clementine's interactive graphs. Then "query by mouse" to explore these patterns by selecting subsets of data or deriving new variables on the fly from discoveries made within the graph. How Clementine scales to the size of the challenge The Clementine approach to scaling is unique in the way it aims to scale the complete data mining process to the size of large, challenging datasets. Clementine executes common operations used throughout the data mining process in the database through SQL queries. This process leverages the power of the database for faster processing, enabling you to get better results with large datasets.
  • #944 Buying patterns, targeted marketing